Submitted by hardmaru t3_ys36do in MachineLearning
Comments
maybelator t1_ivxgacq wrote
> Is the derivative of ReLU at 0.0 equal to NaN, 0 or 1?
The derivative of ReLU is not defined at 0, but its subderivative is: it is the set [0, 1].
You can pick any value in this set, and you end up with (stochastic) subgradient descent, which converges (to a critical point) for small enough learning rates.
For ReLU, the discontinuity has mass 0 and is not "attractive", i.e. there is no reason for the iterates to land exactly at 0, so it can safely be ignored. This is not the case for the L1 norm, for example, whose subdifferential at 0 is [-1, 1]. It presents a "kink" at 0, as the subdifferential contains a neighborhood of 0, and hence is attractive: your iterates will get stuck there. In these cases, it is recommended to use proximal algorithms, typically forward-backward schemes.
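To make the contrast concrete, here is a tiny NumPy sketch (my own toy illustration, with made-up constants) of a forward-backward (ISTA) iteration on a 1-D L1-regularized problem; the proximal operator of the L1 norm is soft-thresholding, which lands exactly on 0 instead of hovering near it:

```python
import numpy as np

def soft_threshold(x, lam):
    # Proximal operator of lam * |x|: shrinks x toward 0 and maps anything
    # within lam of 0 to exactly 0 (something a plain subgradient step never does).
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ista_step(w, a, lam, lr):
    # Forward-backward step for f(w) = 0.5 * (w - a)**2 + lam * |w|:
    # gradient step on the smooth part, prox step on the L1 part.
    return soft_threshold(w - lr * (w - a), lr * lam)

w = 0.3
for _ in range(100):
    w = ista_step(w, a=0.1, lam=0.5, lr=0.1)
print(w)  # 0.0 exactly: the kink of |w| at 0 is "attractive" and the prox hits it
```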
Phoneaccount25732 t1_ivydmgs wrote
I want more comments like this.
9182763498761234 t1_ivy1mud wrote
Cool, thanks for sharing :-)
robbsc t1_ivypqg0 wrote
Thanks for taking the time to type this out
samloveshummus t1_iw1o1jg wrote
This has to be one of the most useful comments I've read in nearly ten years on Reddit! You must be a gifted teacher.
[deleted] t1_iw2kgbt wrote
[deleted]
zimonitrome t1_iwbmzoq wrote
Huber loss let's go.
maybelator t1_iwbpkjo wrote
Not if you want true sparsity!
zimonitrome t1_iwbst8p wrote
Can you elaborate?
maybelator t1_iwbxutj wrote
The Huber loss encourages the regularized variable to be close to 0. However, this loss is also smooth: the amplitude of its gradient decreases as the variable nears its stationary point. As a consequence, many coordinates end up close to 0 but not exactly 0. Achieving true sparsity then requires thresholding, which adds a lot of other complications.
In contrast, the amplitude of the gradient of the L1 norm (the absolute value in dimension 1) remains the same no matter how close the variable gets to 0. The functional has a kink (the subgradient contains a neighborhood of 0). As a consequence, if you use a well-suited optimization algorithm, the variable will exhibit true sparsity, i.e. a lot of exact zeros.
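For contrast with the soft-thresholding sketch above, a quick toy run (again my own sketch, arbitrary constants) of plain gradient descent on the same 1-D problem but with a Huber penalty instead of L1; the smooth penalty settles at a small but nonzero value:

```python
import numpy as np

def huber_grad(w, delta=1.0):
    # Gradient of the Huber penalty: quadratic (hence smooth) near 0,
    # so its pull toward 0 fades as w shrinks.
    return np.where(np.abs(w) <= delta, w, delta * np.sign(w))

w, a, lam, lr = 0.3, 0.1, 0.5, 0.1
for _ in range(2000):
    w = w - lr * ((w - a) + lam * huber_grad(w))  # GD on 0.5*(w - a)**2 + lam*huber(w)
print(w)  # ~0.067: close to 0, but never exactly 0 without extra thresholding
```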
zimonitrome t1_iwc14i5 wrote
Wow thanks for the explanation, it does make sense.
I had a preconception that optimizers dealing with piecewise-linear penalties (like the L1 norm) still produce values that are merely close to 0.
I can see someone disregarding tiny values when exploiting said sparsity (pruning, quantization), but I didn't think the values would be exactly 0.
ThisIsMyStonerAcount t1_ivy34sr wrote
Knowing about subgradients (see other answers) is nice and all, but in the real world what matters is what your framework does. Last time I checked, both PyTorch and JAX say that the derivative of max(x, 0) is 0 when x = 0.
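A quick way to check what your own install does (this is what I get on the versions I have at hand):

```python
import torch
import jax

x = torch.tensor(0.0, requires_grad=True)
torch.relu(x).backward()
print(x.grad)                      # tensor(0.)  -> PyTorch picks 0 at x = 0

print(jax.grad(jax.nn.relu)(0.0))  # 0.0         -> JAX picks 0 as well
```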
samloveshummus t1_iw1ofup wrote
Good point. But it's not the end of the world; those frameworks are open source, after all!
Bot-69912020 t1_ivxbxml wrote
I don't know about each specific implementation, but via the definition of subgradients you can get 'derivatives' of convex but non-differentiable functions (which ReLU is).
More formally: a subgradient of a convex function f at a point x is any x' such that f(y) ≥ f(x) + ⟨x', y − x⟩ for all y. The set of all possible subgradients at x is called the subdifferential of f at x.
For more details, see here.
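A quick numerical sanity check of that inequality for ReLU at x = 0 (my own sketch): any g in [0, 1] satisfies it, which is exactly the [0, 1] subdifferential mentioned in the other answers.

```python
import numpy as np

relu = lambda y: np.maximum(y, 0.0)
ys = np.linspace(-5.0, 5.0, 1001)

for g in (0.0, 0.3, 1.0):  # any g in [0, 1] should pass
    # Subgradient inequality at x = 0: relu(y) >= relu(0) + g * (y - 0) for all y
    assert np.all(relu(ys) >= relu(0.0) + g * ys)
print("0.0, 0.3 and 1.0 are all valid subgradients of ReLU at 0")
```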
[deleted] t1_ivxslaq wrote
[deleted]
elcric_krej t1_ivy6jf4 wrote
This is awesome in that it potentially removes a lot of random variance from the process of training; I think the rest of the benefits are comparatively small and safely ignorable.
I would love for it to be picked up as a standard; it seems like the kind of thing that might get rid of a lot of the worst seed hacking out there.
But I'm an idiot, so I'm curious what well-informed people think about it.
master3243 t1_iw1h2h7 wrote
> potentially removes a lot of random variance from the process of training
You don't need the results of this paper for that.
One of my teams had a pipeline where every single script would initialize the seed of all random number generators (NumPy, PyTorch, Python's random) to 42.
This essentially removed non-machine-precision stochasticity between different training iterations with the same inputs.
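Roughly the kind of helper we used (a minimal sketch; the CUDA seeding only matters if you train on GPU):

```python
import random

import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    # Pin every RNG the pipeline touches so repeated runs with the same
    # inputs differ only by machine-precision / nondeterministic-kernel noise.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

seed_everything(42)
```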
bluevase1029 t1_iw1khv8 wrote
I believe it's still difficult to be absolutely certain you have the same initialisation across multiple machines, versions of PyTorch, etc. I could be wrong though.
master3243 t1_iw1mpgb wrote
Definitely if each person has a completely different setup.
But that's why we containerize our setups and use a shared environment setup.
elcric_krej t1_iw7hss0 wrote
I guess so, but that doesn't scale to more than one team (we did something similar), and arguably you want to test across multiple seeds anyway, in case some init + model combination is just a very odd minimum.
This seems to yield higher uniformity without constraining us on the RNG.
But see /u/DrXaos for why that's not quite the case.
DrXaos t1_iw7o3ef wrote
In my typical use, I've found that changing random init seeds (and also the random seeds for shuffling examples during training, don't forget that one) in many cases induces a larger variance in performance than many algorithmic or hyperparameter changes. Most prominently with imbalanced classification, which is often the reality of the valuable problem.
I guess it’s better to be lucky than smart.
Avoiding looking at the results across random inits can make you think you're smarter than you are, and you will tell yourselves false stories.
DrXaos t1_iw03k6k wrote
I’m not entirely convinced it eliminates every random choice. There is usually a permutation symmetry on tabular inputs, and among hidden nodes.
If I'm reading it correctly, then for a single scalar output of a regressor or classifier coming from the hiddens or the inputs directly (logreg), it would set the coefficient of the first node to 1 and all the others to 0, i.e. a truncated identity.
But what's so special about that first element? Nothing. The same applies to the Hadamard matrices: it's making one choice from an arbitrary ordering.
In my opinion, there could/should still be a random permutation of columns on the interior weights, and I might init the final linear layer of the classifier to equal but nonzero values like 1/sqrt(Nh), with random signs if the hidden activations are nonnegative (like relu or sigmoid) instead of symmetric (like tanh).
Maybe also random +1/-1 signs times a random permutation times the identity?
For that matter, any orthogonal rotation also preserves dynamical isometry, so a random orthogonal before the truncated identity should also work as an init, and we're back to an already existing suggested init method.
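To be concrete, something in this spirit (a rough sketch of the signed-permutation idea above; my own illustration, not what the paper prescribes):

```python
import torch

def signed_permuted_identity_(weight: torch.Tensor) -> torch.Tensor:
    # Random +/-1 signs times a random permutation times a truncated identity.
    # Any signed permutation is orthogonal, so the isometry argument still holds,
    # but no single "first" element is privileged anymore.
    out_dim, in_dim = weight.shape
    k = min(out_dim, in_dim)
    cols = torch.randperm(in_dim)[:k]                 # which input column each row copies
    signs = (torch.randint(0, 2, (k,)) * 2 - 1).float()
    with torch.no_grad():
        weight.zero_()
        weight[torch.arange(k), cols] = signs
    return weight

signed_permuted_identity_(torch.nn.Linear(64, 64).weight)
```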
Training for enhanced sparsity is interesting, though.
samloveshummus t1_iw1oyhf wrote
>I would love if it were picked up as a standard, it seems like the kind of thing that might get rid of a lot of the worst seed hacking out there.
I don't want to be facetious, but what's wrong with "seed hacking"? Maybe that's a fundamental part of making a good model.
If we took someone other than Albert Einstein, and gave them the same education, the same career, the same influences and stresses, would that other person be equally likely to figure out how to explain the photoelectric effect, Brownian motion, blackbody radiation, general relativity and E=mc^(2)? Or was there something special about Einstein's genes, meaning we needed those initial conditions and that training schedule for it to work?
machinelearner77 t1_iw21k83 wrote
I guess the problem with "seed hacking" is just that it reduces trust in the proposed method. People want to build on methods that aren't brittle, and if the presented model performance depends too much on the random seed, that lowers trust in the method and makes people less likely to want to build on it.
samloveshummus t1_iwhwh8y wrote
Sure, but maybe it's inescapable.
When we recruit for a job, we first select a candidate from CVs and interviews, and only once we've chosen a candidate do we begin training them.
Do you think it makes sense to strive for a recruitment process that will get perfect results from any candidate, so we can stop wasting time on interviews and just hire whoever? Or is it inevitable that we have to select among candidates before we begin the training? Why should it be different for computers?
canbooo t1_ivx9yjn wrote
Very interesting stuff, just skimmed through and will definitely read more in depth but how does this break symmetry?
jimmiebtlr t1_ivy59yw wrote
Haven't read it yet, but wouldn't symmetry only exist for two nodes if their input and output weights have the same 1s and 0s?
canbooo t1_ivydtlt wrote
You are right, and what I ask may be practically irrelevant; I really should RTFP. However, think about the edge case of one layer with one input and one output. Each node with an input weight of 1 sees the same gradient, and likewise for the nodes with weight 0. Increasing the number of inputs makes it combinatorially improbable to get the same configuration, but increasing the number of nodes in a layer makes it likelier. So it could be relevant for low dimensions or for models with a narrow bottleneck. I am sure the authors already thought about this problem and either discarded it as quite unlikely in their tested settings, or have a solution/analysis somewhere in the paper, hence my question.
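A toy check of that concern (my own sketch): one input, two identically-initialized hidden ReLU units, and one output with equal outgoing weights. Both hidden units receive identical gradients, so gradient descent never breaks the tie between them.

```python
import torch

w_in = torch.ones(2, 1, requires_grad=True)   # both hidden units start identical
w_out = torch.ones(1, 2, requires_grad=True)  # and feed the output identically
x, y = torch.tensor([[2.0]]), torch.tensor([[1.0]])

loss = ((w_out @ torch.relu(w_in @ x)) - y).pow(2).mean()
loss.backward()
print(w_in.grad)  # both rows are equal -> the symmetry is never broken
```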
vjb_reddit_scrap t1_ivymo0p wrote
IIRC Hinton et al. had a paper about initializing RNNs with the identity, and it solved many of the problems that LSTMs solve.
DrXaos t1_iw04agd wrote
That’s a different scenario and clearly dynamically justified.
Any recursive neural network is like a nonlinear dynamical system. Learning happens best on the boundary of dissipation vs chaos (exploding or vanishing gradients).
The additive incorporation of new information in LSTM/GRU greatly ameliorates the usual problem of RNNs with random transition matrices, where perturbations evolve multiplicatively. An RNN initialized to a zero Lyapunov exponent through the identity is helpful.
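For reference, the init mentioned above (the Le, Jaitly & Hinton "IRNN") is roughly this in PyTorch (a hedged sketch, not their exact code): a plain ReLU RNN whose recurrent matrix starts as the identity and whose biases start at zero, so the hidden state is initially just copied forward and gradients neither explode nor vanish at init.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=32, hidden_size=128, nonlinearity='relu')
with torch.no_grad():
    for name, p in rnn.named_parameters():
        if name.startswith('weight_hh'):   # recurrent matrix -> identity
            p.copy_(torch.eye(p.shape[0]))
        elif name.startswith('bias'):      # biases -> zero
            p.zero_()
```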
AnimaAnandkumar t1_iwe93vq wrote
Thank you for posting our paper. These slides sum up our work and how it removes degeneracy arising from identity initialization https://twitter.com/AnimaAnandkumar/status/1590963759954423810?s=20&t=8V3J8VOrbn1w-rZY_Lplqg
martinkunev t1_ivxtknz wrote
The abstract looks very promising. I'm wondering why there is just 1 citation in 4 months. Is there a caveat?
new_name_who_dis_ t1_ivy2et6 wrote
Getting lots of citations a few months after your paper comes out only happens with papers written by famous researchers. Normal people need to work to get others to notice their research (which is why they are sharing it here now).
And usually a paper starts getting citations after it has already been presented at a conference, where you can promote it most easily.
terranop t1_ivyafa1 wrote
While what you are saying here is true, it doesn't really apply in this case because Anima Anandkumar is a famous researcher.
new_name_who_dis_ t1_ivybdq7 wrote
Oh, I didn't know them. Still, if it's only been out a few months, for it to be cited it would have needed to be noticed by someone who was writing their next research paper, and that paper would already have to be published.
Unless preprints on arXiv count. But even then, it takes weeks if not months to do research and write a paper, so that leaves a very small window for possible citations at this point.
samloveshummus t1_iw1qaft wrote
As well as what the other commenters are saying, sometimes deeper stuff takes longer to have an impact. If you look through the history of science (and human endeavor more generally), there are many famous examples of people whose work revolutionized our modern world, but who weren't recognized in their lifetime - society needed time to catch up.
Now I think we can do a lot better than that. We're a global civilization that communicates at lightspeed. However, we are still also big hairless apes with CPUs made of electric jelly, so we take a while to process things. The more unexpected, the more processing we need.
lynnharry t1_iw9rfek wrote
Multiple reviewers pointed out that the empirical study is limited to a modified ResNet and two datasets.
mikeful t1_ivxqhgi wrote
Neat. You could try initializing them to 0.1 or 0.9, as it's unlikely that the weights will stay at exactly zero or one after training anyway.
VinnyVeritas t1_iw0w76i wrote
Seems useless; why not simply fix the seed of the random generator for reproducibility?
master3243 t1_iw1hggt wrote
The problem is not random variance between trained models.
Check out the abstract, it answers why this work is useful.
VinnyVeritas t1_iw1s40n wrote
Like what? Training ultra-deep neural networks without batchnorm? But in their experiments the accuracy gets worse with deeper networks; what's the point of going deeper to get worse results?
master3243 t1_iw1x35r wrote
> They theoretically show that, different from naive identity mapping, their initialization methods can avoid training degeneracy when the network dimension increases. In addition, they empirically show that they can achieve better performance than random initializations on image classification tasks, such as CIFAR-10 and ImageNet. They also show some nice properties of the model trained by their initialization methods, such as low-rank and sparse solutions.
VinnyVeritas t1_iw9ajwe wrote
The performance is not better: the results are the same within the margin of error for standard (not super-deep) networks. Here I copied from their table:

| Dataset | ZerO Init | Kaiming Init |
|---|---|---|
| CIFAR-10 | 5.13 ± 0.08 | 5.15 ± 0.13 |
| ImageNet | 23.43 ± 0.04 | 23.46 ± 0.07 |
PredictorX1 t1_iw2dy5h wrote
How does this compare to Murray Smith's weight initialization (1993)?
starfries t1_iw90r3p wrote
What is that? I can't find a copy online.
finitearth t1_ivylqpb wrote
Guess who's back
jrkirby t1_ivx9xjl wrote
What happens when all the weights into a ReLU neuron are 0? The ReLU function's derivative is discontinuous at zero. I figure that in most practical situations this doesn't matter, because the odds of many floating point numbers adding up to exactly 0.0 are negligible. But this paper raises the question of what that would do. Is the derivative of ReLU at 0.0 equal to NaN, 0 or 1?