Lugi

Lugi t1_itiq8ir wrote

>Otherwise, what does model complexity even mean?

People generally refer to bigger models (more parameters) as more complex.

Come to think of it, redundancy in networks with more parameters can act as a regularizer, by giving similar branches an effectively higher learning rate and making them less prone to overfitting. Let me give you an example of what I have in mind: a simple network with just one parameter, y = wx. You can pass some data through it, compute the loss, backpropagate to get the gradient, and update the weight with it.

But see what happens if we reparametrize w as w1 + w2: the gradient for each of w1 and w2 is the same as for the single parameter, but after the weight update we end up moving twice as far, which is equivalent to the original one-parameter case with a 2x larger learning rate.
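A quick sketch of that argument (the values for x, t, lr and the squared-error loss are just made up for illustration):

```python
# One step of gradient descent on L = (w*x - t)^2 with a single parameter w,
# versus the same step after reparametrizing w as w1 + w2.
x, t, lr = 2.0, 2.0, 0.1   # made-up data point, target and learning rate

# single parameter
w = 0.5
grad_w = 2 * (w * x - t) * x            # dL/dw
step_single = -lr * grad_w

# reparametrized: w = w1 + w2, initialized so that w1 + w2 == w
w1, w2 = 0.25, 0.25
grad = 2 * ((w1 + w2) * x - t) * x      # dL/dw1 == dL/dw2 == dL/dw
step_sum = 2 * (-lr * grad)             # w1 and w2 each take the same step

print(step_single)  # 0.4
print(step_sum)     # 0.8 -- the sum w1 + w2 moves twice as far
```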

Another thing that could be linked to this phenomenon: on the one hand, the parameter space of a 1-hidden-layer neural network grows exponentially with the number of neurons, while on the other hand the number of equivalent minima (from permuting the hidden units) grows factorially, so at some number of neurons the factorial takes over and your optimization problem becomes much simpler, because you are always close to a desired minimum. But I don't know shit about high-dimensional math, so don't quote me on that.
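The factorial count I mean is just the hidden-unit permutation symmetry: reorder the hidden neurons and you get a different point in weight space that computes the exact same function. A small NumPy sketch with random toy weights (nothing from a real model):

```python
import numpy as np

# Permuting the hidden units of a 1-hidden-layer net (rows of W1/b1 and the
# matching columns of W2) gives a different point in parameter space but the
# exact same function -- so each minimum has n_hidden! equivalent copies.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 5, 2           # toy sizes, just for illustration

W1, b1 = rng.normal(size=(n_hidden, n_in)), rng.normal(size=n_hidden)
W2, b2 = rng.normal(size=(n_out, n_hidden)), rng.normal(size=n_out)

def forward(x, W1, b1, W2, b2):
    h = np.tanh(W1 @ x + b1)              # hidden activations
    return W2 @ h + b2                    # network output

x = rng.normal(size=n_in)
perm = rng.permutation(n_hidden)          # reorder the hidden units

same = np.allclose(
    forward(x, W1, b1, W2, b2),
    forward(x, W1[perm], b1[perm], W2[:, perm], b2),
)
print(same)  # True: different weights, identical function
```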

1

Lugi OP t1_iqoa9pp wrote

>The alpha used in the paper is the inverse of the frequency of the class. So class1 is scaled by 4 (i.e. 1 / 0.25) and class2 is scaled by 1.33 (1/0.75).

They say it CAN be set like that, but they explicitly set it to 0.25. This is why I am confused: they put that statement in and then did something completely opposite.
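For reference, this is how alpha enters the alpha-balanced focal loss as I read the paper, binary case, with their reported defaults alpha = 0.25 and gamma = 2 (the helper below is just my own sketch, not code from the paper):

```python
import numpy as np

# Sketch of the alpha-balanced focal loss, binary case:
#   FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t)
# where p_t is the predicted probability of the true class, and alpha_t is
# alpha for the positive class and (1 - alpha) for the negative class.
def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """p: predicted probability of the positive class, y: labels in {0, 1}."""
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + eps)

p = np.array([0.9, 0.1, 0.6])   # made-up predictions
y = np.array([1, 0, 1])         # made-up labels
print(focal_loss(p, y))
```

With alpha = 0.25 the positive class gets weight 0.25 and the negative class 0.75, which is not inverse-frequency weighting at all; that mismatch is exactly what confuses me.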

3