Submitted by t3_xuogm3 in MachineLearning

Hi all,

I've been reading up on neural networks, primarily for image processing applications. Given the current capabilities of neural networks, it seems a little simplistic that, in the end, we are learning a bunch of linear functions (hyperplanes). Why not represent neurons with more complex, higher-order functions?

Thanks, MLNoober

------------------------------------------------------------------------------------------

Thank you for the replies.

I understand that neural networks can represent complex non-linear functions.

To clarify further:

My question is that a single neuron still computes F(X) = WX + b, which is an affine (linear) function.

Why not use a higher-order function, F(X) = W_n X^n + W_(n-1) X^(n-1) + ... + W_1 X + b?

I can imagine the increase in computation needed to implement this, but neural networks were also considered too time-consuming until we started using GPUs for parallel computation.

So if we ignore the implementation details of accomplishing this for large networks, are there any inherent advantages to using higher-order neurons?
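For concreteness, here is a minimal numpy sketch of what I mean, comparing the usual affine neuron with a hypothetical second-order (quadratic) one; the names and the specific quadratic form are only illustrative assumptions, not taken from any particular paper:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                    # one input vector with 4 features

# Ordinary neuron: affine map followed by a nonlinearity.
w = rng.normal(size=4)
b = 0.1
linear_out = np.maximum(0.0, w @ x + b)   # ReLU(Wx + b)

# Hypothetical quadratic neuron: adds a learnable pairwise-interaction term x^T Q x.
Q = rng.normal(size=(4, 4)) * 0.1         # note the extra O(d^2) parameters per neuron
quadratic_out = np.maximum(0.0, x @ Q @ x + w @ x + b)

print(linear_out, quadratic_out)
```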

--------------------------------------------------------------------------------------------

Update:

I did some searching and found a few relatively new papers that use quadratic neurons. Some have even successfully incorporated them into convolutional layers and show a significant improvement in performance. However, they report needing a significantly larger number of parameters (which may be why I could not find anything of order higher than 2). So I wonder:

  1. How would a combination of quadratic and linear neurons in each layer perform? (A rough sketch of such a layer is below.)
  2. Is there a different set of activation functions that is better suited to quadratic neurons?
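Here is the rough sketch for question 1: a hypothetical PyTorch layer that mixes the two neuron types. All names, shapes, and the quadratic form x^T Q x + w^T x + b are my own assumptions, not from the papers below.

```python
import torch
import torch.nn as nn

class MixedLayer(nn.Module):
    """Hypothetical layer: some output units are ordinary affine neurons,
    the rest are quadratic neurons computing x^T Q x + w^T x + b."""
    def __init__(self, d_in, n_linear, n_quadratic):
        super().__init__()
        self.linear = nn.Linear(d_in, n_linear)
        self.w = nn.Linear(d_in, n_quadratic)   # affine part of the quadratic units
        self.Q = nn.Parameter(torch.randn(n_quadratic, d_in, d_in) * 0.01)

    def forward(self, x):                       # x: (batch, d_in)
        lin = self.linear(x)
        quad = torch.einsum('bi,kij,bj->bk', x, self.Q, x) + self.w(x)
        return torch.relu(torch.cat([lin, quad], dim=-1))

layer = MixedLayer(d_in=8, n_linear=12, n_quadratic=4)
print(layer(torch.randn(5, 8)).shape)           # torch.Size([5, 16])
```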

[1] Fenglei Fan, Wenxiang Cong and Ge Wang, "A new type of neurons for machine learning", International Journal for Numerical Methods in Biomedical Engineering, vol. 34, no. 2, p. e2920, 2018.

[2] Fenglei Fan, Hongming Shan, Lars Gjesteby and Ge Wang, "Quadratic neural networks for CT metal artifact reduction", in Developments in X-Ray Tomography XII, International Society for Optics and Photonics, vol. 11113, p. 111130, 2019.

[3] Yaparla Ganesh, Rhishi Pratap Singh and Garimella Rama Murthy, "Pattern classification using quadratic neuron: An experimental study", 2017 8th International Conference on Computing, Communication and Networking Technologies (ICCCNT), pp. 1-6, 2017.

[4] P. Mantini and S. K. Shah, "CQNN: Convolutional Quadratic Neural Networks," 2020 25th International Conference on Pattern Recognition (ICPR), 2021, pp. 9819-9826, doi: 10.1109/ICPR48806.2021.9413207.


33

Comments


t1_iqwnq5x wrote

In deep learning, neurons are not represented as a linear function. The output of a neuron is implemented by taking a linear combination of the inputs and then feeding that into a non-linear function, e.g. ReLU. The non-linearity is critical, because without it, you can't approximate non-linear functions well, even with deep networks.
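In code, a single neuron's forward pass is roughly this (a minimal numpy sketch):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

x = np.array([0.5, -1.2, 3.0])     # inputs
w = np.array([0.8, 0.1, -0.4])     # learned weights
b = 0.2                            # learned bias

pre_activation = w @ x + b         # the linear (affine) part
output = relu(pre_activation)      # the non-linearity that makes deep nets expressive
print(output)
```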

64

t1_iqwq1aw wrote

Also, a linear transform of a linear transform is just a linear transform. You need those activation functions in between your layers, otherwise stacking multiple layers is pointless.
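A quick numpy check of that point: two linear layers with no activation in between collapse to a single linear layer (biases fold in the same way).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10)
W1, W2 = rng.normal(size=(6, 10)), rng.normal(size=(4, 6))

two_layers = W2 @ (W1 @ x)        # "deep" network without activations
one_layer = (W2 @ W1) @ x         # a single equivalent linear layer
print(np.allclose(two_layers, one_layer))   # True
```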

42

t1_iqwjiua wrote

  1. Layered linear nodes can model non-linear behaviours.
  2. Computational complexity: it's more efficient to use the aforementioned layers of linear nodes than it is to use non-linear functions directly.

24

t1_iqx6bhc wrote

The activation functions are key. A linear combination of linear combinations is just another linear combination, so 10 layers would equate to a single layer, which is only capable of so much. The activation functions break that linearity, though, and are the key ingredient there.

13

t1_iqxdhw3 wrote

Without activation functions, an MLP would just be y = sum(m•x) + b.

7

t1_iqybakm wrote

Lol thank you! Totally dropped the ball by leaving that crucial element out.

3

t1_iqxcyg9 wrote

From the edited question I can see that you understand it is possible but not strictly necessary, and that you are wondering whether avoiding it comes down to real drawbacks or just inertia.

One thing to consider is that modern deep learning relies on very efficient parallel hardware such as GPUs, which is built to carry out simple instructions in a massively parallel manner. A widely known metaphor is that a CPU is a few highly educated people, while a GPU is thousands of ten-year-olds. If that is the best hardware we have, we may as well use what it does best: simple instructions such as matrix multiplication.

If polynomial neurons or similar constructs offered added benefits, such as extending theoretical results, they might have caught on. However, we already have the universal approximation theorem and a rich body of results that make such an effort less exciting.

So yes, I think if you can somehow design a next-generation GPU that evaluates cubic polynomials extremely fast, ML will by all means learn to adapt, using cubics as building blocks.

19

t1_iqx9kbo wrote

This is a bit like why computers happen to use binary and not ternary. Everything has been tried before.

There's a long and rich history of artificial neural networks, but everybody seems to gravitate toward fewer moving parts in their already uncertain and difficult work.

I guess eventually the paved road of today's MLPs with GPUs became so convenient to use that very few have the time or means to try something radically different without good reason.

This is a fun read by the way: https://stats.stackexchange.com/questions/468606/why-arent-neural-networks-used-with-rbf-activation-functions-or-other-non-mono

14

t1_iqwoucu wrote

Assuming you mean how the weights/biases are summed up: we don't. In neural networks the weighted sum is passed through a (usually) nonlinear activation function. Because of the nonlinearity we get the universal approximation theorem, which says a one-hidden-layer neural network is sufficient to model any continuous function, with some caveats.

Basically, the nonlinearity in the activations is assumed to be sufficient for all applications, so a more complex structure is not necessary.

However, when it comes to discontinuous functions things get more interesting: a single-layer perceptron (no hidden layer) cannot, for example, model the XOR function. But with a more complex or nonlinear construction this becomes possible.
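To make the XOR point concrete: the four XOR points are not linearly separable, so no single affine threshold unit can fit them, but one hidden ReLU layer handles it. Here is a tiny numpy sketch with hand-picked weights, just for illustration:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

def xor_net(x):
    # Hidden layer: two ReLU units.
    h1 = np.maximum(0.0, x[0] + x[1])          # counts active inputs
    h2 = np.maximum(0.0, x[0] + x[1] - 1.0)    # fires only when both inputs are on
    # Output layer: subtract twice the "both on" signal.
    return h1 - 2.0 * h2

print([xor_net(x) for x in X])   # [0.0, 1.0, 1.0, 0.0]
```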

3

OP t1_iqwyu29 wrote

Thank you for the replies.

I understand that neural networks can represent complex non-linear functions.

To clarify further:

My question is that a single neuron still computes F(X) = WX + b, which is an affine (linear) function.

Why not use a higher-order function, F(X) = W_n X^n + W_(n-1) X^(n-1) + ... + W_1 X + b?

I can imagine the increase in computation needed to implement this, but neural networks were also considered too time-consuming until we started using GPUs for parallel computation.

So if we ignore the implementation details of accomplishing this for large networks, are there any inherent advantages to using higher-order neurons?

3

t1_iqx4ze2 wrote

If I'm understanding correctly, you're proposing each link (dendrite) could have a polynomial transfer function as a way to introduce additional nonlinearity. Is that correct?

First of all, there's the significantly increased computational cost (no free lunch). Second, what is it buying you? Neural nets as they're currently formulated can already approximate any function to arbitrary precision. Your method would do that in a different way, but it would be much less efficient while not adding any additional expressive power. Making the activation function non-monotonic seems like a bad idea for obvious reasons (at least for typical neural nets), and making it more complex than a sigmoid seems pointless. The success of ReLU units relative to sigmoids shows that reducing the complexity of the activation function has benefits without significant drawbacks.

It's not a bad question, but I think there's a clear answer.

14

t1_iqxg1vm wrote

It's not a clear answer. Our neurons actually have multiplicative effects, not only additive; the paper that talks about it is, I think, the Active Dendrites one (something about catastrophic forgetting). The real reason we don't use polynomials is the combinatorial scaling of a d-variable polynomial. However, an MLP cannot approximate y = x^2 to arbitrary accuracy on (-inf, inf), no matter how large your network; I can think of a proof of this for sigmoid, tanh, and ReLU activations. A polynomial kernel (x^0, x^1, ..., x^n) could fit y = x^2 perfectly, however, and an MLP that allowed you to multiply two inputs at each neuron could also learn the function perfectly. I'd be interested in papers that use multiple activation functions and allow input interaction, enforcing Occam's razor through weight regularization or something. I'm sure nets like that would generalize better.
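For example, fitting y = x^2 with an explicit polynomial feature basis recovers it exactly by ordinary least squares (a small numpy sketch; the grid range is arbitrary):

```python
import numpy as np

x = np.linspace(-5, 5, 50)
y = x ** 2

# Design matrix with features [1, x, x^2]: a degree-2 polynomial basis of the input.
A = np.stack([np.ones_like(x), x, x ** 2], axis=1)
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(coef, 6))   # ~[0, 0, 1]: the basis represents y = x^2 exactly, on any range
```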

6

t1_iqxklya wrote

What's the benefit of neural nets being able to approximate analytic functions perfectly on (-inf, inf)? Standard neural nets can approximate to arbitrary accuracy on a bounded range, and training data will always be bounded. If you want to deal with unbounded ranges, there are various ways of doing symbolic regression that are designed for that.

7

t1_iqxuph2 wrote

Generalization out of distribution might be the biggest thing holding back ML right now. It's worth thinking about whether the priors we encode in NNs now are to blame. A large MLP is required just to approximate a single biological neuron. Maybe the unit-level additive nonlinearity we are using now is too simple. I'm sure there is a sweet spot between complex interactions/few neurons and simple interactions/many neurons.

6

t1_iqzr880 wrote

Taylor series are famously bad at generalizing and making predictions on out-of-distribution data. But you are absolutely free to add feature engineering on your inputs. It is very common to take the log of a numeric input, and you always standardize your inputs in some way, either bounding them between 0 and 1 or giving the data mean 0 and std 1. In the same way you could totally look at x*y effects. If you don't have a reason why two particular values should be multiplied with each other, you could try all combinations, feed them to a decision forest or logistic regression, and see if any come out as being very important.
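One way to try that systematically, sketched with scikit-learn (assuming a recent version and tabular numeric inputs; the data here is synthetic):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=500)   # hidden x0*x1 interaction

# Add all pairwise x_i * x_j products as explicit features.
interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_aug = interactions.fit_transform(X)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_aug, y)
for name, importance in zip(interactions.get_feature_names_out(), forest.feature_importances_):
    print(name, round(importance, 3))   # the x0*x1 column should dominate
```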

1

t1_iqx2g66 wrote

>So if we ignore the implementation details to accomplish this for large networks, are there any inherent advantages to using higher-order neurons?

I don't know what those might be, but there is an inherent advantage in stacking layers of act(WX + b), where act is some non-linear function. Instead of guessing what higher-order function you should use for each neuron, you can learn the higher-order function by stacking many simpler non-linear functions. That way the solution is general and works across many different datasets and modalities.

3

t1_ir0n6hb wrote

You are missing the activation function, which is part of the neuron. It's sometimes presented as a separate layer, but that's just a way to represent nested functions. So it isn't:

F(X) = WX + b

It is:

F(X) = A(WX + b), where A is a nonlinear function.

You could make A a polynomial function and it would be equivalent to your suggestion. However polynomials have poor convergence properties and are expensive to compute. Early neural nets used sigmoid activations for non-linearity, now various versions of ReLU are most popular. It turns out that basically any non-linear function gives the model enough freedom to approximate any non-linear relationship, because so many neurons are then recombined. In the case of ReLU, it's like using the Epcot Ball to approximate a sphere.

1

t1_iqzixrs wrote

A common theme in these topics is that people observe the status-quo design decisions, such as linear layers connected by ReLU, and then try to rationalize them after the fact with relatively hand-wavy mathematical justification, citing things such as the universal approximation theorem that are not particularly relevant.

The reality is that this field is heavily driven by empirical results, and I would be highly skeptical of anyone saying that "xyz is the clear best way to do it".

3

t1_ir0a4p5 wrote

Here is my answer: first, let's ask why not an exponential or logarithmic function instead of a quadratic or higher-order polynomial? Or maybe a sinusoid? The thing is, depending on the problem we might need one of those nonlinearities, or some other kind entirely, and we don't know which in advance. The idea with neural networks is that they combine many simple neurons that can learn any of these nonlinearities during training. If a quadratic relationship is a relevant feature for your problem, the network will learn it, meaning that some of the neurons will end up simulating that quadratic relationship. This is a much more flexible approach than hard-coding the nonlinearity from the beginning.

2

t1_iqwsnee wrote

First, it is not linear. ReLU (rectified linear unit) networks are piecewise-linear models. By combining ReLUs you can approximate any nonlinear function with a finite number of discontinuities as closely as you want; think of it as a form of tessellation, where adding more faces matches the function better. Note that it is not mandatory to use piecewise-linear functions; you may use other nonlinearities such as tanh. But with such bounded nonlinearities the learning process becomes subject to vanishing gradients, which makes training very difficult. Because ReLUs are unbounded, they avoid this problem.
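A small numpy illustration of that tessellation idea: fit a target function with a sum of ReLU pieces and watch the error shrink as you add more of them. The knot placement and least-squares fit are my own choices, just to show the trend.

```python
import numpy as np

def relu_fit(n_pieces, x, y):
    """Least-squares fit of y ~ c0 + c1*x + sum_k a_k * relu(x - knot_k) on a grid."""
    knots = np.linspace(x.min(), x.max(), n_pieces, endpoint=False)[1:]
    basis = [np.ones_like(x), x] + [np.maximum(0.0, x - k) for k in knots]
    A = np.stack(basis, axis=1)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

x = np.linspace(-3, 3, 400)
y = np.sin(2 * x)                      # the nonlinear target
for n in (4, 8, 32):
    err = np.max(np.abs(relu_fit(n, x, y) - y))
    print(n, round(err, 4))            # max error drops as the number of pieces grows
```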

1

t1_iqyl8rl wrote

Is this going to make the function-approximation strength of NNs any better than what it already is?

Probably not. It is already quite good, and I don't think the lack of polynomial building blocks is the bottleneck.

I have seen very complicated high-dimensional manifolds in different settings (physics, finance, etc.) being learnt by simple but sometimes huge MLPs. Unless there is a strong inductive bias by which a polynomial layer would help with a particular ML problem, there isn't any strong reason to use one. Indeed, overfitting isn't function approximation gone bad; rather the opposite, as anyone who has trained an NN will know.

1

t1_iqz48i8 wrote

Non-linear doesn’t mean polynomial. There are many non-linear functions polynomials cannot approximate well, such as softmax() or even just sin(). Ultimately, it’s enough to approximate complex functions with simple functions without having to worry about picking the correct complex function.

1

t1_iqzb8y2 wrote

Spiking Neural Networks are a thing too, if you want to explore a different approach.

1

t1_iqze3a1 wrote

As many others have mentioned, the decision boundaries from piecewise-linear models are actually quite smooth in the end, given a sufficient number of layers.

But to get to the core of your question: why would you prefer many stupid neurons over a few smart ones? I believe there is a relatively simple explanation for why the former is better. Having more complex neurons would mean that the computational complexity goes up while the number of parameters stays the same, i.e. with the same compute you can train bigger models (in number of parameters) if the neurons are simple. A high number of parameters is important for optimization, as extra dimensions can help in getting out of local minima. Not sure if this has been fully explained, but it is in part the reason why pruning works so well: we wouldn't need that many parameters to represent a good fit, but a good fit is much easier to find in high dimensions, from where we can prune down to simpler models (only ~5% of the parameters with almost the same performance).

1

t1_iqzeu3o wrote

There are a few papers researching this (the effect of high dimensions on SGD), but I can't seem to find any right now. Maybe someone can help me out :)

1

t1_iqzvo04 wrote

Nobody else has mentioned resnets yet. They have something like higher-order structure, with f(x) = σ(W1 σ(W0 x + b0) + b1) + x (an identity skip connection). Highway networks take it a step further, with f(x) = σ(W0 x + b0) σ(W1 x + b1) + x σ(W2 x + b2). However, both are done to resolve gradient issues.
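For reference, a minimal PyTorch sketch of a residual block in that spirit (the layer sizes and the exact placement of the activations are arbitrary choices here):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """f(x) = sigma(W1 * sigma(W0 x + b0) + b1) + x  (identity skip connection)."""
    def __init__(self, dim):
        super().__init__()
        self.fc0 = nn.Linear(dim, dim)
        self.fc1 = nn.Linear(dim, dim)

    def forward(self, x):
        h = torch.relu(self.fc0(x))
        return torch.relu(self.fc1(h)) + x   # the skip connection keeps gradients flowing

block = ResidualBlock(16)
print(block(torch.randn(2, 16)).shape)       # torch.Size([2, 16])
```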

1

t1_ir0hr1m wrote

The outputs of neurons are passed through a nonlinearity which is essential to any (complex) learning process. If we didn't do this, then the NN would be a composition of linear functions which is itself a linear function (pretty boring).

As for why we choose to operate on inputs with an affine transformation before putting them through a nonlinearity, I see two reasons. The first is that linear transformations are well understood and succinct to use theoretically. The second is that computers (in particular GPUs) are very good with matrix multiplication, so we do a lot of "heavy lifting" with them and then just pass the result through a nonlinearity so we don't get a boring learning process.

Just my 2 cents, happy for input/feedback!

1

t1_ir0kep9 wrote

A system built from linear functions is not necessarily linear overall. Nonlinear systems are very difficult, and sometimes impossible, to solve exactly.

1

t1_iqx446n wrote

F(x) = s(WX+b) isn’t all of deep learning.

You may have heard of transformers, which are closer to s(X W X^T) (but actually more involved than that). They’re an extremely popular model right now.
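For instance, the attention scores in a transformer are bilinear in the inputs, roughly softmax(X W_q (X W_k)^T / sqrt(d)); here is a tiny numpy sketch (single head, no masking, random weights just for shape-checking):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                                    # sequence length, model width
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)    # quadratic in X: every token interacts with every other
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
attention_out = weights @ (X @ Wv)
print(attention_out.shape)                     # (5, 8)
```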

0