Comments


tysam_and_co OP t1_j6g0mvc wrote

Hello everyone,

We're continuing our journey toward training CIFAR10 to 94% in under 2 seconds, carrying on the lovely work that David Page began when he took that single-GPU DAWNBench entry from over 10 minutes down to 24 seconds. Things are getting much, much tighter now as there is not as much left to trim, but we do still have a "comfortable" road ahead, provided enough sweat, blood, and tears are put in to make certain methods work under the (frankly ridiculous) torrent of information being squeezed into this network. Remember, we're breaking 90% having seen each training set image only 5 times during training. Five. Times! Then 94% at 10 times. To me, that is hard to believe.

I am happy to answer any questions; please be sure to read the v0.3.0 patch notes if you would like a more verbose summary of the changes we've made to bring this network from ~12.34-12.38 seconds in the last patch down to ~9.91-9.96 seconds in the current one. The baseline of this implementation started at around ~18.1 seconds total, so incredibly we have almost halved our starting training time, and that within only a few months of the project's start back in October/November of last year.

Please do ask or say anything that's on your mind; this project hasn't gotten a lot of attention and I'd love to talk to some like-minded people about it. This is pretty darn cool stuff!

Many thanks,

Tysam&co

36

unhealthySQ t1_j6g2h65 wrote

So just to be sure I read things correctly, this project is about optimizing training speed for Transformer neural networks?

5

tysam_and_co OP t1_j6g3e49 wrote

Hello! Thanks so much for your comment, I really appreciate it. This is a convnet-based architecture, so it's carrying the torch of some of the old DAWNBench entries.

Transformers have the best top-end of all of the neural networks, and convolutional networks tend to have an edge in the smaller/tiny regime, IIRC. One could maximize training speed for a transformer architecture, but the cost of just 1-2 layers could be several times the cost of an entire forward pass through this very tiny convnet. I even tried to just add a really tiny 16x16 attention multiply at the end of the network and it totally tanked the training speed.

That said, I'd really like to pick up the work of https://arxiv.org/abs/2212.14034 and continue from there; the concept of getting an algorithm to really compress that information can start opening up the horizon to some of the hard laws that underlie neural network training in the limit. For example, somewhere along the way, this project's convnet apparently ended up tracking scaling laws really consistently. I'm not sure why.

But in any case -- language models are hopefully next (if I get the time and the interest/don't burn myself out on this project in the meantime!). I'll probably focus first on picking up some part-time research work in the field between now and then, as that's my top priority right now (aside from a few community code contributions -- this codebase is my living resume after all, and I think a good one at that! :D).

Hope that helped answer your question, and if not, please let me know and I'll give you my best shot! :D

25

unhealthySQ t1_j6g4zsz wrote

Thank you for the answer!
Your work is highly impressive and I wish you continued success in your efforts, as I could see the work you're doing here having very appealing applications down the line.

8

tysam_and_co OP t1_j6g5fb8 wrote

Thank you very much, I appreciate your kind words. Good luck to you in all of your future endeavors as well! :D :) <3 <3 :))))

7

JamesBaxter_Horse t1_j6h8tib wrote

If I understand correctly, you're tuning hyperparameters with the intent of minimising training time. What do you see as the purpose of this? Presumably, all you're achieving is successfully minimising the inductive space and optimising the learning parameters so as to converge as quickly as possible, but these results are completely specific to CIFAR and would not be reproducible on a different dataset.

19

LeanderKu t1_j6hd6hp wrote

Well, that is also an assumption. It would be interesting to see which lessons translate and which don't. I wouldn't dismiss it so quickly. Also, it's a fun game to play and interesting in its own right!

6

tysam_and_co OP t1_j6hfdaj wrote

And, safe to say, there's stuff I'm not sharing here (yet?) that I've found as a result of that. Some hyperparameters are more network-specific, some are dataset-specific. And some behave in ways just weird enough that you might get an entirely new adaptive method out of them... ;))))

I hadn't thought about it in the exact words you put it in for a long while, but I think you're very much right! It is quite a fun game, and very interesting to play in its own right. There's very much this bizarre, high-dimensional "messy zen" to it all. ;D

Thanks again for your comment, it warmed my heart and made me smile seeing it. Have a good evening/night/etc! :D :))) <3 <3 :D

3

tysam_and_co OP t1_j6hx45k wrote

Thanks for sharing. I think you might be missing some of the bigger picture here! Most of the changes and performance improvements did indeed come from changing the architecture, memory format, execution order, network width, etc. in the right places. These come from about five prior years of experience where my primary task was architecting networks like this. I actually transferred a number of personal lessons learned into this network to get a lot of the benefits that we have here. So I'm not quite sure why they would suddenly fail to scale to other problems! ;P That said, there might be some tweaks needed to line up with the inductive biases of different datasets (in this case, say, 1-2 more downscaling blocks for ImageNet, or something like that).

I also wouldn't focus on the hyperparameter twiddling that much -- though it is important and definitely can be a trap. At the frontier of a world record, every option is on the table, and hyperparameters promise results but are exponentially more expensive to work with. Outside of that frontier, though, the 'good enough' parameter space should be pretty flat, so it's likely not too bad a starting place.

I'm a bit curious about how this would not be reproducible on another dataset (especially if we're narrowing our inductive space -- this should increase generalization, not reduce it!). Similar to Transformers, the simpler and more scalable this architecture is, the better. One of my go-tos for people newer to the field is to encourage them to keep things as simple as possible. It pays off!

In this case, for example, before release, I just added 70 epochs and doubled the base width, and went from 94.08% to 95.77%. That's a good sign! It should at least have good basic performance on other datasets, and if something has to be changed, it's probably just a few hyperparameters, and not all of them, if that makes sense.

2

McCheng_ t1_j6gwm0w wrote

Can you summarise the tricks that you have used to make it fast?

11

tysam_and_co OP t1_j6gzgf9 wrote

I've created a writeup for each release in https://github.com/tysam-code/hlb-CIFAR10/releases

Maybe I could do more at some point (like the original "bag of tricks" blogpost) once the smoke all clears, but I spent ~8-10 hours straight manually tuning hyperparameters last night, so I am smoked! Though usually I am pretty beat on release days due to trying to keep up with everything in the dopamine rush. :D

17

oh__boy t1_j6h7zc2 wrote

Do you have a dedicated development set apart from your test set to tune these hyperparameters? Or am I missing the point, and this is not meant to be a general improvement but rather to see just how fast you can train on this single dataset?

2

tysam_and_co OP t1_j6h8h8n wrote

I'm sure there's the classical leakage of the val set into the network design via val set performance for super tight tweaks. Thankfully some of the compression from simplifying things in the network seems to be a defense against that, but if I were hard-pressed, doing a final round with k-fold validation would probably be really good for final tuning runs.
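For concreteness, a rough sketch of what that kind of final k-fold pass could look like (a hypothetical helper, not something in the repo):

```python
import torch

# Hypothetical sketch: rotate which slice of the training set plays "validation"
# so that final tuning choices aren't fit to a single split.
def kfold_indices(n_samples, k=5, seed=0):
    gen = torch.Generator().manual_seed(seed)
    perm = torch.randperm(n_samples, generator=gen)
    folds = torch.chunk(perm, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = torch.cat([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# usage: run one training per fold and average the validation accuracies
# for train_idx, val_idx in kfold_indices(50_000, k=5):
#     ...
```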

There might be a CIFAR10 test set (see, goes to show how little I know about the CIFAR10 split, lol), and there has been a lot of work put into studying some aspects (and flaws, like mislabeling, etc) of the structure of CIFAR10 in general.

Mainly -- this task is primarily a proxy for "how fast can we ingest the information encoded in dataset A into a compressed form B so that it will perform adequately well on a task for a dataset we call C". It starts getting down to the engineering barebones of information flow at that point, and a lot of other really involved stuff that I don't want to break into while our training times are 4-5 seconds or so.

The concepts, if any usable ones, distilled from this, would apply to nearly any dataset and any problem -- say, training GPT on an absolutely massive oneshot dataset, for example. The math is the same anywhere you go.

I don't know if that answers your question or not, my apologies if I didn't understand properly.

5

oh__boy t1_j6m3sah wrote

Interesting, thanks for the detailed answer. This is cool work, I also love to work on projects which squeeze out every last ounce of performance possible to solve a problem. I am somewhat skeptical of how much this applies to other architectures / datasets / problems, since you seem to only have worked on one network and one dataset. I hope you try to find general concepts and show that they apply to more than just that network and dataset and prove me wrong though. Good luck with everything!

2

jobeta t1_j6gj0er wrote

Fun!

5

tysam_and_co OP t1_j6gjzzf wrote

Many thanks! I've found that the 'speed hunger' for me is truly insatiable -- we're at almost half the training time we started with, and I find myself just as hungry to make it faster and faster. The Sisyphean hill is real, though I suppose it is more easily justified with a goal in mind! 😄😁

9

jobeta t1_j6gpgqj wrote

What’s the trick you’re most proud of?

5

tysam_and_co OP t1_j6h0z6b wrote

Thanks for asking, great question! I'd say it's really hard to pick at this point -- mostly it's just a hardcore case of "do the basics and do them really, _really_ well" as best as I can, with a few smaller tricks along the way. There may be some much more exotic things later on, but experience has taught me to try to delay that for as long as is humanly possible! Plus, the bonus is that things get to be simpler. Arguably, some aspects of this code are actually simpler than the baseline, believe it or not!

That said, if I had to pick a trick, I think it would be 'upgrading' the whitening convolution to be 2x2 from 3x3 or so. I think that saved like maybe just over or around a full second and a half alone or so, when combined with the 'padding=0' change at the start. Most of the in-practice things here are pretty simple, but what's happening here is that we're projecting from the input image to a whitened feature space, the 3x3 convs are going to result in a 3*3*3 = 27 depth input feature space without any downstriding, this can be horribly slow as the spatially large layers always are the slowest compute-wise -- deeper layers without much spatial width or height are by comparison very snappy (correct me if I'm wrong, I think this has to do with the SIMD architecture of GPUs -- in any case, spatial stuff with 2d convolutions at least tends to be hilariously ineffecient).

Not padding cuts off a potentially expensive kernel call (I don't know if it's fused or not...), and reduces the spatial size IIRC from 32x32->30x30. That's a deceptively large (roughly ~12%) savings in pixel count, but not everything is lost, as that lovely 2x2 convolution is still going to touch everything (I could theorize about the efficiency of processing the edges of pictures, but I could also be horribly wrong, so I'm going to keep my mouth shut here). So, summing it up: we move from a 3*3*3=27 dimensional input feature space to a new 2*2*3=12 dimensional one, remove ~12% of the pixels without necessarily deleting that information outright, and most importantly we only have to pay 2*2/3*3 = 4/9 ≈ 44% of the input kernel cost.

That said, if I had to pick a trick, I think it would be 'upgrading' the whitening convolution from 3x3 to 2x2. I think that alone saved maybe just over or around a full second and a half, when combined with the 'padding=0' change at the start. Most of the in-practice things here are pretty simple, but what's happening is that we're projecting from the input image to a whitened feature space. The 3x3 convs result in a 3*3*3 = 27 depth input feature space without any downstriding, and this can be horribly slow, as the spatially large layers are always the slowest compute-wise -- deeper layers without much spatial width or height are by comparison very snappy (correct me if I'm wrong, I think this has to do with the SIMD architecture of GPUs -- in any case, spatial stuff with 2d convolutions at least tends to be hilariously inefficient).

And that's why I'm proud of that little trick. It's very unassuming, since it's just:

Conv(input_depth, output_depth, kernel_size=3, padding='same') -> Conv(input_depth, output_depth, kernel_size=2, padding=0)
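In PyTorch terms, a minimal sketch of that swap looks like the following (channel counts here are made up for illustration, not the exact hlb-CIFAR10 code):

```python
import torch
import torch.nn as nn

# Illustrative sketch of the whitening-conv change (channel counts are placeholders).
whiten_3x3 = nn.Conv2d(3, 32, kernel_size=3, padding='same', bias=False)  # 3*3*3 = 27 input values per output position
whiten_2x2 = nn.Conv2d(3, 32, kernel_size=2, padding=0, bias=False)       # 2*2*3 = 12 input values per output position

x = torch.randn(8, 3, 32, 32)
print(whiten_3x3(x).shape)  # same spatial size, heavier kernel
print(whiten_2x2(x).shape)  # slightly smaller spatial output, ~4/9 of the per-position kernel cost
```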

Now of course there's a bit of a hit to accuracy, but the name of the game here is leverage, and that's what the squeeze-and-excite layers are for. They're very fast but add a huge amount of accuracy, though (and I unfortunately don't think I've noted this anywhere else) for some reason they are very sensitive to the compression dimension -- 16 here in this case.
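For reference, a generic, textbook-style squeeze-and-excite block looks roughly like the sketch below (the version in the actual code may differ in its details; the compression dimension is the bottleneck width mentioned above):

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Generic squeeze-and-excite block (textbook form; the hlb-CIFAR10 version may differ)."""
    def __init__(self, channels, compression_dim=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)      # global spatial average ("squeeze")
        self.excite = nn.Sequential(                # channel-wise gating ("excite")
            nn.Linear(channels, compression_dim),
            nn.ReLU(inplace=True),
            nn.Linear(compression_dim, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        scale = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * scale                            # reweight channels: cheap, but a big accuracy lever
```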

Though to be frank, I started with squeeze-and-excite and got my accuracy increase, then pulled this off the shelf to 'cash in' the speed increase. I have been sitting on this one since before even the last release; I've found it's good not to be like (noise warning) https://www.youtube.com/watch?v=ndVhgq1yHdA on projects like these. Taking the time to go good and slow pays off!

I hope that helps answer your question -- I know this was a really long answer; paradoxically, I get far more verbose the more tired I am, and poor next-day-me has to deal with it, lol.

Again, to get below 2 seconds, we're going to have to get progressively more fancy and "flashy", but for now, the plan is to build a really, really freaking solid core of a network, then get into the more 'exotic' stuff. And even then, hopefully the more mundane exotic stuff while we're at it.

Feel free to let me know if you have any other questions (or follow-ups, or if this wasn't what you were looking for, etc.)! :D

5

DisWastingMyTime t1_j6hh4f2 wrote

Is there anywhere I could see a summary of the decisions taken/changes made?

I saw you linked to the original paper that started this, and I'll look into it, but I hope there's a more readable way to go over your experiments and insights than browsing the code.

Very interesting though, thanks for sharing!

4

batrobin t1_j6h1o7u wrote

I am surprised to see that most of the work you have done is on hyperparameter tuning and model tricks. Have you tried any HPC/MLHPC techniques, profiling, or code optimizations? Are they on a future roadmap, not the goal of this project, or is there just not much to improve in that direction?

3

tysam_and_co OP t1_j6h2qlh wrote

That's a good question, and I'm somewhat curious what you mean by HPC/MLHPC techniques or code optimizations. Do you mean something like distributed computing? (That is an interesting rabbit hole to get into on its own....)

Regardless -- yep! I'm a professional in this industry, and there's a lot of detail underlying a ton of seemingly-simple changes (which is potentially even more frustrating when simple, understandable changes shave off large chunks and swathes of what was previously the world record). So basically everything I'm doing is informed by, well, years of doing basically this exact same thing over and over again. Something that I've found is that the younger/newer ML engineers (myself included when I was at that point) are often really attracted to the "new shiny", when in reality, good HPC on a smaller scale is like writing a really nice, quiet, tight haiku. Less is more, but a single 'syllable' equivalent can make or break the whole thing.

Lots of people use models inefficiently. This model is still somewhat inefficient in its own ways, though I think it is by far more efficient than nearly all of the ones it's currently competing with. When I design a model, I'm thinking about keeping GPU occupancy high, utilizing tensor cores as much as possible, mathematically fusing operations to reduce overhead, managing memory layout to make sure the right paths get activated in the GPU (tensor cores, etc.), and seeing if there are good approximations or alternatives that are much cheaper mathematically (or alternate paths with specialized kernels that I can boutique-design the network around).
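As a generic illustration of a few of those knobs in PyTorch (not lifted from the repo itself; the tiny model and shapes below are placeholders):

```python
import torch
import torch.nn as nn

# Generic PyTorch knobs for the kinds of backend concerns mentioned above
# (illustrative only -- the tiny model here is a placeholder, not hlb-CIFAR10 code).
torch.backends.cudnn.benchmark = True                      # let cuDNN autotune conv kernels
torch.backends.cuda.matmul.allow_tf32 = True               # allow TF32 matmuls on tensor cores

model = nn.Sequential(nn.Conv2d(3, 64, kernel_size=3), nn.GELU()).cuda()
model = model.to(memory_format=torch.channels_last)        # NHWC layout hits tensor-core conv paths

x = torch.randn(512, 3, 32, 32, device='cuda').to(memory_format=torch.channels_last)
with torch.autocast('cuda', dtype=torch.float16):          # mixed precision keeps the tensor cores fed
    y = model(x)
```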

I'll have to cut myself short, but I'll leave you with a single example: a technical breakdown behind what was, in practice, a very 'simple' change in the backend. I also made another comment in this reddit thread (https://www.reddit.com/r/MachineLearning/comments/10op6va/comment/j6h0z6b/?utm_source=share&utm_medium=web2x&context=3) with a technical breakdown behind one other very 'simple' change. Don't get pulled away by the shiny, fancy techniques that are slow/etc.; sometimes the simplest is the best!

Here's the breakdown: https://github.com/tysam-code/hlb-CIFAR10/issues/2#issuecomment-1379711156

Let me know if this answered your question at all or if you have any follow-ups, much love, cheers, and thanks! <3 :D :D :D :D :))))

2

batrobin t1_j6h8d9d wrote

Thank you. You have answered what I had in mind. I was thinking about techniques like changing memory access patterns, changing memory layout, custom CUDA kernels, fusing operations, reducing overheads, etc., some of which are mentioned in this paper: https://arxiv.org/abs/2007.00072. I also see that you have done some profiling in your issue; it should be interesting to read through.

I was previously working on some large-scale transformer code optimization; seems like this repo would be good to learn from. Thanks a lot.

3

tysam_and_co OP t1_j6h8nhh wrote

Excellent, and thank you very much for sharing that paper, I shall have to take a look at it! :D

I might need to do some operator fusion manually at some point in the future, though I'm hoping the torch.compile() command does it well (but I am somewhat scared because compiling territory can be more rigid and error-prone).
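For what it's worth, the basic usage is a one-liner; the placeholder model and the optional mode below are just common examples, not necessarily what this project will end up using:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 64, kernel_size=3), nn.GELU())  # placeholder model
compiled = torch.compile(model)                                    # default Inductor backend fuses many pointwise ops
# compiled = torch.compile(model, mode='max-autotune')             # heavier autotuning, longer warmup
```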

1

shellyturnwarm t1_j6hw7yq wrote

In your dataloaders, why do you set persistent_workers to False. And why do you choose 2 for num_workers?

Also, what does self.se stand for in ConvGroup and what is it doing there?

Finally what is whitening, and what are you trying to achieve with it?

2

tysam_and_co OP t1_j6hxgzk wrote

Hi hi hiya there! Great questions, thanks so much for asking them! :D

For the dataloaders, that dataloading only happens once -- after that, it's just saved on disk as a tensor array in fp16. It's wayyyyy faster for experimentation this way. We only need to load the data once, then we move it to GPU, then we just dynamically slice it on the GPU each time! :D
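A minimal sketch of that pattern (the file name and details below are illustrative, not the repo's actual loading code):

```python
import torch
import torchvision
import torchvision.transforms as T

# "Load once, keep everything on the GPU" pattern: cache the dataset as fp16 tensors,
# then index batches directly on-device instead of running a dataloader every epoch.
cache = 'cifar10_train_fp16.pt'  # hypothetical cache file
try:
    images, labels = torch.load(cache)
except FileNotFoundError:
    ds = torchvision.datasets.CIFAR10('./data', train=True, download=True, transform=T.ToTensor())
    images = torch.stack([x for x, _ in ds]).half()   # whole training set as one fp16 tensor
    labels = torch.tensor(ds.targets)
    torch.save((images, labels), cache)

images, labels = images.cuda(), labels.cuda()          # resident on the GPU for the whole run

batch_size = 512
perm = torch.randperm(len(images), device='cuda')
for i in range(0, len(images), batch_size):
    idx = perm[i:i + batch_size]
    x, y = images[idx], labels[idx]                    # dynamic slicing on-GPU, no host round-trips
    # ...forward/backward here...
```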

As for self.se, that used to be a flag for the squeeze_and_excite layers. I think it's redundant now as it's just a default thing -- this is a one-person show and I'm moving a lot of parts fast, so there are oftentimes little extraneous bits and pieces hanging around. I'll try to clean that up on the next pass; very many thanks for pointing that out and asking!

I'm happy to answer any other questions that you might have! :D

1

fnbr t1_j6j9f11 wrote

Have you looked at some of the architectures that get rid of BatchNorm (e.g. NFNets)? In my experience, BatchNorm tends to be quite slow, so I wonder if there's some speed to be gained there.

2

arhetorical t1_j6nhean wrote

Hiya, great work again! Maybe I'm outing myself a little here, but the code doesn't work on Windows machines, apparently because the processes are spawned instead of forked. I'm not sure it's an easy fix and maybe not worth the time (it works fine on WSL) but just thought I'd mention in case you weren't aware!

On the ML side, should this scale up pretty straightforwardly to CIFAR100 or are there things to be aware of?

2

tysam_and_co OP t1_j6o25u4 wrote

Oh, I see! Yeah, I probably will want to leave process spawning/forking stuff to the side as that can require some bug-resistant refactoring, IIRC. However! I believe that would only require some change around the dataloaders and maybe some stuff at the beginning of the file. I am unfortunately terribly rusty on this, but you might be able to get away with changing num_dataloaders=2 -> num_dataloaders=0 in your file, and I believe that would run 'much' more slowly the first time, then the same after, without any forking issues?
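For a generic PyTorch DataLoader (the parameter name in the actual file may differ), the workaround would be something like:

```python
from torch.utils.data import DataLoader
import torchvision
import torchvision.transforms as T

# Generic illustration of the suggested workaround: num_workers=0 keeps loading
# in the main process, so Windows' spawn-based multiprocessing never kicks in.
ds = torchvision.datasets.CIFAR10('./data', train=True, download=True, transform=T.ToTensor())
loader = DataLoader(ds, batch_size=512, shuffle=True,
                    num_workers=0,             # no worker processes -> no fork/spawn issues
                    persistent_workers=False)  # only meaningful when num_workers > 0
```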

As for CIFAR100, I made the absolute minimum number of changes/additions to it, which was 3 characters: I added a 0 to each of the two main dataloaders, and then one 0 to the num_classes parameter. On the first run with this, I'm averaging about 75.49% validation accuracy, which roughly matches the 2015 SOTA for CIFAR100. The 2015 SOTA for CIFAR10 was roughly 94%, so I believe we are in very good hands here! This bodes quite well, I think, but I am unsure. This also was the first blind run, with no other tuning or anything (well, I had to do it again on the right notebook base, as I accidentally pulled an older version that was about ~.8% below this one -- and over 10 seconds!). Interestingly to me, we're still running at right about ~9.91-9.94 seconds or so; I would have thought the extra 90 classes would have added some appreciable overhead! Crazy! :D That opens a lot of cool avenues (ImageNet?!?!) that I've been sorta hardcore ignoring as a result. Goes to show, I guess, that there's basically no replacement for really good testing! :D :)))) I wouldn't be surprised if one could get more performance with more tuning -- though it would be surprising if we were simply at a local maximum already! Either way, I find it somewhat validating.
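In generic torchvision terms (the notebook's own loading code differs), the swap amounts to something like:

```python
import torchvision
import torchvision.transforms as T

# Generic equivalent of the three-character change described above.
train_set = torchvision.datasets.CIFAR100('./data', train=True,  download=True, transform=T.ToTensor())
eval_set  = torchvision.datasets.CIFAR100('./data', train=False, download=True, transform=T.ToTensor())
num_classes = 100  # was 10 for CIFAR10; the width of the final linear layer follows this
```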

Thank you for being the first person to comment on and support my work. You really made my day back then, and as of yesterday the project was being tweeted about by Karpathy. I appreciate both of you at about the same level for your support and kindness -- much love! <3 :)))) <3 :D

2

tysam_and_co OP t1_j6o72ma wrote

Okay, I ran some other experiments and I'm starting to get giddy (you're the first 'ta know! :D). It appears that for most hyperparameters, twiddling them on CIFAR100 gives just a flat response, or a slight downward trend (!!!). I haven't messed with them all yet, but that bodes very, very well (!!!!).

Also, doing the classic range boost of changing depth 64->128 and num_epochs 10->80 results in a boost to about 80% in roughly 3 minutes of training, which is about where CIFAR100 was in early 2016. It's harder to compare for CIFAR10, as I think that one was slightly more popular and there was a monstrous jump, then a long flat area during that period; but if you do some linear/extremely coarse piecewise interpolation of CIFAR10 accuracy on PapersWithCode from its average starting point to the current day, and do the same roughly for CIFAR100, adding this extra capacity+training time moves them both from ~2015 SOTA numbers to ~early 2016 SOTA numbers. Wow!! That's incredible! This is starting to make me really giddy, good grief.

I'm curious if cutout or anything else will help, we'll see! There's definitely a much bigger train<->eval % gap here, but adding more regularization may not help as much as it would seem up front.

2