alterframe

alterframe t1_jcn87ue wrote

Do you have any explanation for why, in Figure 9, the training loss decreases more slowly with early dropout? The previous sections are all about how reducing the variance of the mini-batch gradients allows us to travel a longer distance in parameter space (Figure 1 from the post). It seems that this is not reflected in the value of the loss.

Any idea why? It catches up very quickly after the dropout is turned off, but I'm still curious about this behavior.
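
For reference, this is roughly the schedule I mean by early dropout (a minimal PyTorch sketch; the cutoff iteration is just illustrative, not the paper's exact setting):

```python
import torch.nn as nn

# Toy model with a dropout layer.
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Dropout(p=0.1),
    nn.Linear(256, 10),
)

EARLY_DROPOUT_ITERS = 2000  # illustrative cutoff, not the paper's exact value

def disable_dropout(model: nn.Module) -> None:
    """Set the drop probability of every Dropout module to zero."""
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.p = 0.0

for it in range(10_000):
    # Keep dropout active only for the first iterations, then turn it off
    # for the rest of training ("early dropout").
    if it == EARLY_DROPOUT_ITERS:
        disable_dropout(model)
    # ... forward pass, loss, backward pass, optimizer step ...
```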

1

alterframe t1_jb9i70h wrote

Interesting. With many probabilistic approaches, where we have some intermediate variable in a graph like X -> Z -> Y, we need to introduce sampling on Z to prevent mode collapse. Then we also decay the entropy of this sampler with a temperature schedule.
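
Roughly what I have in mind (an illustrative sketch, not from any specific paper, using Gumbel-softmax for the sampling on Z and a simple exponential temperature decay):

```python
import torch
import torch.nn.functional as F

# Illustrative X -> Z -> Y model with a discrete latent Z (8 states).
encoder = torch.nn.Linear(32, 8)   # produces logits over Z
decoder = torch.nn.Linear(8, 4)    # maps a (soft) one-hot Z to Y

tau, tau_min, tau_decay = 1.0, 0.1, 0.999

for step in range(5000):
    x = torch.randn(16, 32)
    logits = encoder(x)
    # Sample Z with Gumbel-softmax: a high temperature means high-entropy sampling,
    # which keeps Z from collapsing onto a single mode early in training.
    z = F.gumbel_softmax(logits, tau=tau, hard=False)
    y = decoder(z)
    # ... compute the loss on y, backprop, optimizer step ...
    # Anneal the temperature so sampling becomes nearly deterministic later on.
    tau = max(tau_min, tau * tau_decay)
```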

This is quite similar to the early dropout idea, because there we also have a sampling process that is effectively active only at the beginning of training. However, in those other scenarios, we usually attribute it to something like exploration vs. exploitation.

If we had an agent that almost immediately assigned very high probability to bad initial actions, it might never be able to find a proper solution. On a loss landscape, in the worst case, we can likewise end up in a local minimum very early on, so we use a higher learning rate at the beginning to make that less likely.

Maybe, in general, random sampling could be safer than using a higher learning rate? A high learning rate can still fail for some models. If, by analogy, we do it just to boost early exploration, then maybe randomness could be a good alternative. That would kind of counter all the claims based on the analysis of convex functions...

2

alterframe t1_jb6ye5w wrote

Anyone noticed this with weight decay too?

For example here: GIST

It's like a larger weight decay provides regularization, which leads to slower training, as we would expect, but setting a small weight decay makes training even faster than using no decay at all. I wonder if it may be related.
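
The comparison I mean is roughly this (the exact values are in the gist; these are placeholders, using AdamW's decoupled weight decay):

```python
import torch

def make_optimizer(model, weight_decay):
    # AdamW uses decoupled weight decay, so the setting is easy to compare across runs.
    return torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=weight_decay)

# Three otherwise identical runs:
#   0.0   -> no decay (baseline)
#   1e-4  -> small decay: the run that trained faster than the baseline
#   1e-1  -> large decay: strong regularization, slower training as expected
for wd in (0.0, 1e-4, 1e-1):
    model = torch.nn.Linear(128, 10)
    optimizer = make_optimizer(model, wd)
    # ... identical training loop for each run, logging the training loss ...
```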

1

alterframe t1_jb5oel8 wrote

RL is one of those concepts where it's easy to fool ourselves that we get it, but in reality we don't. We have this fuzzy notion of what RL is and what it is good for, so in our imagination it is going to be a perfect match for our problem. In reality, our problem may look like those RL-friendly tasks on the surface, but it may be missing several important properties or challenges that would really make RL a reasonable choice.

That doesn't mean it is not useful at all. Quite the opposite. People are wrongly discouraged from RL based on experience with projects where it didn't actually make sense, and they draw false conclusions about its practicality.

1

alterframe t1_ja2f7xu wrote

Part of the answer is probably that DL is not a single algorithm or a class of algorithms, but rather a framework or a paradigm for building such algorithms.

Sure, you can take a SOTA model for ImageNet and apply it to similar image classification problems by tuning some hyperparameters and maybe replacing certain layers. However, if you want to apply it to a completely different task, you need to build a different neural network.
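
For instance, adapting an ImageNet model to a similar classification problem is often just a matter of swapping the head (a minimal sketch with torchvision; the class count is a placeholder):

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 37  # placeholder for the new problem's label count

# Start from an ImageNet-pretrained backbone (recent torchvision API)...
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
# ...and replace only the classification head for the new label set.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Optionally freeze the backbone and tune only the new head at first.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc.")
```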

1

alterframe t1_iz0ags7 wrote

I like how flexible they are about different compilation approaches. In TF2 the problem was that you always needed to wrap everything in tf.function to get the performance improvements. Debugging was a nightmare, since for more complicated pipelines it could take several minutes just to compile the graph.
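
For anyone who hasn't run into it: the speedup in TF2 only kicks in once the step is wrapped in tf.function, roughly like this, and that wrapper is exactly what made debugging painful:

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.keras.optimizers.Adam(1e-3)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function  # traced and compiled into a graph on the first call
def train_step(x, y):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss = loss_fn(y, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# Without the decorator the same code runs eagerly: easy to debug, but slower;
# with it, tracing a complicated pipeline is where the long compile times come from.
```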

2

alterframe t1_ixhvkgo wrote

It all boils down to how you behave when something goes wrong. The weights of your layer don't converge? Try some more or less random hyperparameter changes and maybe they finally will. Sometimes that's the only thing you can come up with. Frameworks are just fine for that.

Maybe you have some extra intuition about the problem and want to try something more sophisticated to probe it better? You'd be fine with a framework as long as you deeply understand how it works, because the change you're going to make may be outside of its typical usage. Otherwise, you'd just get frustrated when something doesn't work as expected.

I get the sentiment against using high-level frameworks. At the beginning they all look like toys for newbies competing with each other on the shortest MNIST example. However, as more and more people use them, they get more and more refined. I think that at this point Lightning may be worth giving a try. I myself would have been strongly against it a few years ago, and I was quite annoyed by its rise in popularity, but ultimately it has turned into something of a standard now.

1

alterframe t1_ixhmyr3 wrote

That's why it's so difficult to invest in something like Lightning. If you find a good torch repository for your project, you should go with it. You are not going to move everything to Lightning just because you are more comfortable with it.

On the other hand, Lightning actually does a decent job of being modular, so it's mostly fine. TorchMetrics is a great example of how it should be done.
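
For example, TorchMetrics can be used in plain PyTorch with no Lightning dependency at all (assuming a reasonably recent torchmetrics version):

```python
import torch
from torchmetrics.classification import MulticlassAccuracy

# A standalone metric object; no Trainer or LightningModule involved.
metric = MulticlassAccuracy(num_classes=10)

for _ in range(5):
    preds = torch.randn(32, 10)            # model outputs (logits)
    target = torch.randint(0, 10, (32,))   # ground-truth labels
    metric.update(preds, target)           # accumulate across batches

print(metric.compute())  # aggregated accuracy over everything seen so far
metric.reset()
```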

5