Submitted by tysam_and_co t3_10op6va in MachineLearning
tysam_and_co OP t1_j6gzgf9 wrote
Reply to comment by McCheng_ in [R] Train CIFAR10 in under 10 seconds on an A100 (new world record!) by tysam_and_co
I've created a writeup for each release in https://github.com/tysam-code/hlb-CIFAR10/releases
Maybe I could do more at some point (like the original "bag of tricks" blog post) once the smoke all clears, but I spent ~8-10 hours straight manually tuning hyperparameters last night, so I am smoked! Though usually I am pretty beat on release days anyway from trying to keep up with everything in the dopamine rush. :D
oh__boy t1_j6h7zc2 wrote
Do you have a dedicated development set, separate from your test set, for tuning these hyperparameters? Or am I missing the point, and this isn't meant to be a general improvement but rather to see just how fast you can train on this single dataset?
tysam_and_co OP t1_j6h8h8n wrote
I'm sure there's the classic leakage of the val set into the network design, since val set performance guides the super tight tweaks. Thankfully some of the compression from simplifying things in the network seems to act as a defense against that, but if I were hard-pressed, doing a final round with k-fold validation (rough sketch below) would probably be really good for the final tuning runs.
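For concreteness, here's a rough sketch of what that k-fold pass could look like (this is not code from the repo; `train_and_eval` is a hypothetical helper that trains a fresh network on the given split and returns held-out accuracy):

```python
# Minimal k-fold hyperparameter scoring sketch (hypothetical helpers,
# not from hlb-CIFAR10).
import numpy as np
from sklearn.model_selection import KFold

def kfold_score(images, labels, hyperparams, k=5, seed=0):
    """Average held-out accuracy of `hyperparams` across k folds."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in kf.split(images):
        # train_and_eval (assumed) trains a fresh net on the train fold
        # and returns accuracy on the held-out fold.
        acc = train_and_eval(images[train_idx], labels[train_idx],
                             images[val_idx], labels[val_idx],
                             **hyperparams)
        scores.append(acc)
    return float(np.mean(scores))
```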
There might be a CIFAR10 test set (see, goes to show how little I know about the CIFAR10 split, lol), and there has been a lot of work put into studying some aspects (and flaws, like mislabeling, etc) of the structure of CIFAR10 in general.
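If you want to double-check the split yourself, the standard torchvision loader makes it easy to see what's there (purely illustrative, not part of the speedrun code):

```python
# CIFAR-10 ships with a fixed train/test split; torchvision exposes both.
import torchvision
import torchvision.transforms as T

transform = T.ToTensor()
train_set = torchvision.datasets.CIFAR10(root='./data', train=True,
                                          download=True, transform=transform)
test_set  = torchvision.datasets.CIFAR10(root='./data', train=False,
                                          download=True, transform=transform)
print(len(train_set), len(test_set))  # sizes of the standard splits
```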
Mainly, this task is a proxy for "how fast can we ingest the information encoded in dataset A into a compressed form B so that it will perform adequately well on a task for a dataset we call C". It starts getting down to the engineering barebones of information flow at that point, plus a lot of other really involved stuff that I don't want to dig into while our training times are 4-5 seconds or so.
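A toy sketch of that framing (very much not the actual hlb-CIFAR10 training loop, which is far more aggressively optimized; the model, data loaders, and hyperparameters here are placeholders): time how long it takes to compress dataset A into model B until B clears a target accuracy on held-out set C.

```python
# Toy "compress A into B, evaluate on C" timer (placeholder model/loaders).
import time
import torch

def timed_train(model, train_loader, test_loader, target_acc=0.94,
                max_epochs=20, lr=0.2, device='cuda'):
    model = model.to(device)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()
    start = time.time()
    acc = 0.0
    for _ in range(max_epochs):
        model.train()
        for x, y in train_loader:                      # dataset A
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss_fn(model(x), y).backward()            # update compressed form B
            opt.step()
        model.eval()                                   # evaluate B on dataset C
        correct = total = 0
        with torch.no_grad():
            for x, y in test_loader:
                x, y = x.to(device), y.to(device)
                correct += (model(x).argmax(1) == y).sum().item()
                total += y.numel()
        acc = correct / total
        if acc >= target_acc:
            break
    return time.time() - start, acc
```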
The concepts distilled from this, if there are any usable ones, would apply to nearly any dataset and any problem (training GPT on an absolutely massive one-shot dataset, for example). The math is the same anywhere you go.
I don't know if that answers your question or not, my apologies if I didn't understand properly.
oh__boy t1_j6m3sah wrote
Interesting, thanks for the detailed answer. This is cool work; I also love working on projects that squeeze out every last ounce of performance possible on a problem. I am somewhat skeptical of how much this applies to other architectures / datasets / problems, since you seem to have worked on only one network and one dataset. I hope you do find general concepts, show that they apply to more than just that network and dataset, and prove me wrong, though. Good luck with everything!