Submitted by tysam_and_co t3_10op6va in MachineLearning
tysam_and_co OP t1_j6g0mvc wrote
Hello everyone,
We're continuing our journey to training CIFAR10 to 94% in under 2 seconds, carrying on the lovely work that David Page began when he took that one single-GPU dawnbench entry from over 10 minutes to 24 seconds. Things are getting much, much tighter now as there is not as much left to trim, but we do have a "comfortable" road ahead still, provided enough sweat, blood, and tears are put in to make certain methods work under the (frankly ridiculous) torrent of information being squeezed into this network. Remember, we're breaking 90% having only seen each training set image 5 times during training. 5. times! Then 94% at 10 times. To me, that is hard to believe.
I am happy to answer any questions. Please be sure to read the v0.3.0 patch notes if you would like a more verbose summary of the changes that we've made to bring this network from ~12.34-12.38 seconds in the last patch to ~9.91-9.96 seconds in the current one. The baseline of this implementation started at around 18.1 seconds total, so incredibly we have almost halved our starting training time, and that is only within a few months of the project's start back in October/November of last year.
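For anyone who wants to check the arithmetic behind those timings, here is a minimal sketch in plain Python. It uses only the numbers quoted in this post (18.1 s baseline, ~12.34-12.38 s last patch, ~9.91-9.96 s now, 2 s goal); the function and constant names are made up for illustration and are not taken from the project's codebase.

```python
# Back-of-the-envelope check of the timings quoted in the post.
# All names here are hypothetical; only the numbers come from the post.

def speedup(old_seconds: float, new_seconds: float) -> float:
    """How many times faster the new run is than the old one."""
    return old_seconds / new_seconds


def time_reduction(old_seconds: float, new_seconds: float) -> float:
    """Fraction of wall-clock time removed, between 0.0 and 1.0."""
    return 1.0 - new_seconds / old_seconds


BASELINE_S = 18.1              # starting point of this implementation
PREVIOUS_S = (12.34, 12.38)    # range reported for the last patch
CURRENT_S = (9.91, 9.96)       # range reported for v0.3.0
TARGET_S = 2.0                 # stated goal: 94% in under 2 seconds

if __name__ == "__main__":
    for prev, cur in zip(PREVIOUS_S, CURRENT_S):
        print(f"current run: {cur:.2f} s")
        print(f"  vs baseline:   {speedup(BASELINE_S, cur):.2f}x faster, "
              f"{time_reduction(BASELINE_S, cur):.1%} less time")
        print(f"  vs last patch: {speedup(prev, cur):.2f}x faster, "
              f"{time_reduction(prev, cur):.1%} less time")
        print(f"  still needed to hit {TARGET_S:.0f} s: "
              f"{speedup(cur, TARGET_S):.2f}x")
```

Against the baseline this comes out to a bit over 1.8x, which is where "almost halved" comes from, with roughly another 5x still to go to reach the 2 second goal.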
Please do ask or say anything if it's on your mind. This project hasn't gotten a lot of attention and I'd love to talk to some like-minded people about it. This is pretty darn cool stuff!
Many thanks,
Tysam&co
unhealthySQ t1_j6g2h65 wrote
So just to be sure I read things correctly, this project is about optimizing training speed for Transformer neural networks?
tysam_and_co OP t1_j6g3e49 wrote
Hello! Thanks so much for your comment, I really appreciate it. This is a convnet-based architecture, so it's carrying on the torch of some of the old DawnBench entries.
Transformers have the best top-end of all of the neural networks, and convolutional networks tend to have an edge in the smaller/tiny regime, IIRC. One could maximize training speed for a transformer architecture, but the cost of just 1-2 layers could be several times the cost of an entire forward pass through this very tiny convnet. I even tried to just add a really tiny 16x16 attention multiply at the end of the network and it totally tanked the training speed.
That said, I'd really like to pick up the work of https://arxiv.org/abs/2212.14034 and continue from there. The concept of getting an algorithm to really compress that info can start opening up the horizon to some of the hard laws that underlie neural network training in the limit. For example, somewhere along the way, we apparently ended up with really strong consistency with scaling laws on the convnet for this project. I'm not sure why.
But in any case -- language models are hopefully next (if I get the time and have the interest/don't burn myself out on this project in the meantime!). I'll probably be focused on picking up some part-time research work in the field between here and then first, as that's my first priority right now (aside from a few community code contributions. This codebase is my living resume after all, and I think a good one at that! :D)
Hope that helped answer your question, and if not, please let me know and I'll give you my best shot! :D
unhealthySQ t1_j6g4zsz wrote
Thank you for the answer!
Your work is highly impressive and I wish you continued success in your efforts. I could see the work you do here having very appealing applications down the line.
tysam_and_co OP t1_j6g5fb8 wrote
Thank you very much, I appreciate your kind words. Good luck to you in all of your future endeavors as well! :D :) <3 <3 :))))