tysam_and_co

tysam_and_co t1_je8wno6 wrote

This will be one of the key things that I think will keep people up and going in their respective subfields. Perhaps not ironically at all, DougDoug, a popular YT creator, has a great video on how to be a content creator that I find to be pretty exceptional and well-targeted to the current state of ML research. It includes some unique strategies for picking a fusion of niches that only the person doing the research can compete well in (while still contributing somewhat to the field), if one assumes that the research or software created is content that people might want.

It's helped me as a researcher in producing research, I've recommended it to a number of people, and he just won a decently-sized award recognizing the way he goes about it. It may not seem related at all, but it has been one of the best guidebooks for me so far, and it has not failed me quite yet.

1

tysam_and_co t1_jdkqv3e wrote

I would presume that it's a bolt-on external method that utilizes a pretrained model with its own inputs as a dynamically-generated information sieve of sorts. Of course, the inductive prior is encoded in the Reflexion algorithm itself so we are bringing some new information to the table here (not that GPT4+ couldn't somehow do this itself someday, either).
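If I had to sketch the shape of it (and this is just my own hypothetical pseudo-implementation, not the paper's code -- `llm` and `evaluate` here are stand-ins for a frozen pretrained model call and a task-specific scorer), it would be something like:

    # Rough Reflexion-style outer loop: the frozen model critiques its own failed
    # attempts and carries those critiques forward as extra context. No weights change.
    def reflexion_loop(llm, evaluate, task, max_trials=4):
        reflections = []                           # verbal "memory" across trials
        for _ in range(max_trials):
            prompt = task + "\n\nLessons from earlier attempts:\n" + "\n".join(reflections)
            attempt = llm(prompt)                  # same pretrained model every time
            success, feedback = evaluate(attempt)  # e.g. unit tests, exact-match check
            if success:
                return attempt
            # ask the model to reflect on its own failure, then retry with that note
            reflections.append(llm("Attempt:\n" + attempt + "\nFeedback:\n" + feedback
                                   + "\nWhat should be done differently next time?"))
        return attempt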

2

tysam_and_co t1_jb3i6eq wrote

Right, right, right, though I don't see how dropout introduces bias into the network. Sure, we're subsampling the network, but the information integrated from each minibatch should be less on the whole due to the gradient noise, right? So the bias should be lower and the uncertainty higher, and then more steps means more integration time, and on we go from there toward that elusive less-biased estimator.

I guess the sticking point is _how_ they're saying that dropout induces bias. I feel like fitting quickly in a non-regularized setting has more bias by default, because I believe the 0-centered noise should end up diluting the loss signal. I think. Right? I find this all very strange.

6

tysam_and_co t1_jb2kgo2 wrote

Hold on a minute. On reading through the paper again, this section stood out to me:

>Bias-variance tradeoff. This analysis at early training can be viewed through the lens of the bias-variance tradeoff. For no-dropout models, an SGD mini-batch provides an unbiased estimate of the whole-dataset gradient because the expectation of the mini-batch gradient is equal to the whole-dataset gradient. However, with dropout, the estimate becomes more or less biased, as the mini-batch gradients are generated by different sub-networks, whose expected gradient may not match the full network's gradient. Nevertheless, the gradient variance is significantly reduced, leading to a reduction in gradient error. Intuitively, this reduction in variance and error helps prevent the model from overfitting to specific batches, especially during the early stages of training when the model is undergoing significant changes.

Isn't this backwards? It's because of dropout that we should receive _less_ information from each iteration update, which means that we should be _increasing_ the variance of the model with respect to the data, not decreasing it. We've seen in the past that dropout greatly increases the norm of the gradients over training -- more variance. And we can't possibly add more bias to our training data with random I.I.D. noise, right? Shouldn't this effectively slow down the optimization of the network during the critical period, allowing it to integrate over _more_ data, so now it is a better estimator of the underlying dataset?

I'm very confused right now.
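If I wanted to poke at this empirically, the kind of toy check I'd run (purely a hypothetical sketch of mine, not anything from the paper) would be comparing the mean mini-batch gradient under dropout against the full-data, no-dropout gradient:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    X, y = torch.randn(2048, 32), torch.randint(0, 10, (2048,))
    model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Dropout(0.5), nn.Linear(128, 10))
    loss_fn = nn.CrossEntropyLoss()

    def grad_vector(bx, by, train_mode):
        model.train(train_mode)            # train_mode=True -> dropout masks are active
        model.zero_grad()
        loss_fn(model(bx), by).backward()
        return torch.cat([p.grad.flatten() for p in model.parameters()])

    full_grad = grad_vector(X, y, train_mode=False)   # "whole-dataset" reference gradient
    mb_grads = torch.stack([grad_vector(X[i:i+64], y[i:i+64], train_mode=True)
                            for i in range(0, len(X), 64)])
    print("bias, ||mean(g_mb) - g_full||:", (mb_grads.mean(0) - full_grad).norm().item())
    print("variance, mean over coords:   ", mb_grads.var(0).mean().item())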

17

tysam_and_co t1_jabuusq wrote

Hey, do what works for you, kid. (I say that respectfully). If you're passionate about writing a neural network in Fortran...

...then do it. If you follow one of the practical solutions here and it makes you miserable, then don't do that. There are easier ways than Fortran, but if you accept that it's not necessarily the most efficient or easiest way to go about things (like, at all), and this is a passion project for fun, then it's something you can learn with. Whatever fills your sails and gets you _learning things_.

I think you want to ask yourself why Fortran specifically (it's a low-level language, you like it in particular, the older supercomputing history, etc.?).

Once you have written down all of the _why_ for what you want to do, make sure that you've chosen the right building blocks (the _what_) for it.

I hope this helps you on your journey. Much love and care from this fair enby over here! <3 <3 <3 :DDDD

1

tysam_and_co t1_j8x4sjv wrote

I have been torn about Huggingface. They provide some wonderful services to the community, but unfortunately the API design is very unintuitive and hard to work with, and the documentation is often outdated. Much of the design also tries to accommodate too many standards at once, I think, and switching between them (or doing other similar things) requires in-place operations or setting flags that permanently become part of an object, rather than a chain of calls I can update with normal control flow.

On top of that, far too many external dependencies get installed alongside any hf stuff, and the library is very slow to load and to work with. I avoid it like the plague unless I'm required to use it, because it usually takes the most debugging time. For example, I once spent well over half the time implementing a new method just trying to debug huggingface, before shutting down the server because I had already spent an hour, hour and a half tracing through the source code trying to fix it. And when I did get it working, it was incredibly slow.

Now, that said, they also provide free models, and free access to datasets, like Imagenet. Do I wish it was an extremely light, fast, and simple wrapper? Yes. That would be great. But they do provide what they provide, and they put in a lot of effort to try to make it accessible to everyone. That's something that should not be ignored because of any potential personal beefs with the library.

All in all, it's a double-edged sword, and I wish there were a bit more simplicity, focus, self-containment, understandability, and speed in the hf codebase at large. But at the same time, I sincerely appreciate the model and dataset services they offer to the community, regardless of the hoops one might have to jump through to get to them. If one stays within the HF ecosystem, certain things are indeed pretty easy.

I hope that if anyone from HF is reading this, it doesn't feel like a total dunk or anything like that. It's only that I'm very torn because it's a mixed bag: I can see that a lot of care really did go into this codebase, and I also think it could be tightened up a ton for the future. There are positives about HF despite my beefs with the code (HF Spaces included in this particular calculus).

4

tysam_and_co t1_j8cf8al wrote

I...I...this is the first time I've heard this. Machine learning is often used as the hype-shelter word for "AI", because it triggers very few people (in the hype sense -- or at least it used to).

I'm not quite sure what to say, this is very confusing to me.

11

tysam_and_co t1_j8cf1o9 wrote

That is a really good point.

Though, minor contention: it seems like most of the comments in the post are pretty well-informed. The main point of difference I see is whether batchnorm goes before or after the activation, and oddly enough, years later, before the activation seems to have won out due to the efficiency gains from operator fusing.

I'm surprised they were so on the mark even 6 years ago about being skeptical of this internal covariate shift business. I guess keeping the statistics centered and such is helpful, but as we've seen since then, batchnorm seems to do so much more than just that (and is a frustratingly utilitarian, if limiting, tool in my experience, unfortunately).
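For concreteness, the two orderings in question look something like this (my own minimal illustration, not code from that thread); the conv -> BN -> activation form is the one that fuses cheaply, since the BN sits directly on the conv output:

    import torch.nn as nn

    bn_before_act = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1, bias=False),
                                  nn.BatchNorm2d(64),   # BN folds into the conv at inference
                                  nn.ReLU())

    bn_after_act = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1, bias=False),
                                 nn.ReLU(),
                                 nn.BatchNorm2d(64))    # the ordering the old thread debated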

6

tysam_and_co OP t1_j6o72ma wrote

Okay, I ran some other experiments and I'm starting to get giddy (you're the first 'ta know! :D). It appears that for most hyperparameters, twiddling on CIFAR100 gives just a flat response, or a slight downward trend (!!!). I haven't messed with them all yet, but that bodes very, very well (!!!!).

Also, doing the classic capacity/training-time boost of changing depth 64->128 and num_epochs 10->80 results in about 80% accuracy in roughly 3 minutes of training, which is about where CIFAR100 SOTA was in early 2016. It's harder to compare for CIFAR10, as I think it was slightly more popular and had a monstrous jump followed by a long flat stretch during that period. But if you do some extremely coarse piecewise-linear interpolation of CIFAR10 accuracy from its average starting point to the current day on PapersWithCode, and roughly the same for CIFAR100, adding this extra capacity and training time moves them both from ~2015 SOTA numbers to ~early-2016 SOTA numbers. Wow!! That's incredible! This is starting to make me really giddy, good grief.

I'm curious if cutout or anything else will help, we'll see! There's definitely a much bigger train<->eval % gap here, but adding more regularization may not help as much as it would seem up front.

2

tysam_and_co OP t1_j6o25u4 wrote

Oh, I see! Yeah, I'll probably want to leave the process spawning/forking stuff to the side, as that can require some careful, bug-resistant refactoring, IIRC. However! I believe that would only require some changes around the dataloaders and maybe some stuff at the beginning of the file. I am unfortunately terribly rusty on this, but you might be able to get away with changing num_dataloaders=2 -> num_dataloaders=0 in your file, and I believe that would run 'much' more slowly the first time, then the same after that, without any forking issues?

As for CIFAR100, I made the absolute minimum number of changes to support it, which was 3 characters: I added a 0 to each of the two main dataloaders, and one 0 to the num_classes parameter. On the first run with this, I'm averaging about 75.49% validation accuracy, which roughly matches the 2015 SOTA for CIFAR100. The 2015 SOTA for CIFAR10 was roughly 94%, so I believe we are in very good hands here! This bodes quite well, I think, but I am unsure. This was also effectively a blind run (well, I had to do it twice, since I accidentally pulled an older notebook version first that came in about ~.8% below this one -- and in over 10 seconds!). Interestingly, we're still running at right about ~9.91-9.94 seconds or so; I would have thought the extra 90 classes would add some appreciable overhead! Crazy! :D That opens up a lot of cool avenues (Imagenet?!?!) that I've been sort of hardcore ignoring as a result. Goes to show, I guess, that there's basically no replacement for really good testing! :D :)))) No other tuning or anything went into this. I wouldn't be surprised if one could get more performance with more tuning -- though it would be surprising if we were simply at a local maximum already! Either way, I find it somewhat validating.
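For reference, the spirit of that 3-character change, paraphrased here with torchvision names rather than the exact lines from the notebook, is roughly:

    import torchvision

    # CIFAR10 -> CIFAR100 in the two dataset constructors, and 10 -> 100 classes.
    train_set = torchvision.datasets.CIFAR100(root="./data", train=True,  download=True)
    eval_set  = torchvision.datasets.CIFAR100(root="./data", train=False, download=True)
    num_classes = 100   # was 10; feeds the final layer's output width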

Thank you for being the first person to comment on and support my work. You really made my day back then, and as of yesterday the project was being tweeted by Karpathy. I am about equally appreciative of both of you for your support and kindness -- much love! <3 :)))) <3 :D

2

tysam_and_co OP t1_j6hxgzk wrote

Hi hi hiya there! Great questions, thanks so much for asking them! :D

For the dataloaders, that dataloading only happens once -- after that, it's just saved on disk as a tensor array in fp16. It's wayyyyy faster for experimentation this way. We only need to load the data once, then we move it to GPU, then we just dynamically slice it on the GPU each time! :D
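If it helps, the gist of that setup looks roughly like the sketch below (a paraphrase of the idea, not the exact code in the repo):

    import torch

    def preprocess_once(images_uint8, labels):
        # one-time cost: normalize to fp16 and park the whole dataset on the GPU
        data = images_uint8.half().div_(255.)
        return data.cuda(non_blocking=True), labels.cuda(non_blocking=True)

    def get_batch(data_gpu, labels_gpu, batch_size=512):
        # the "dataloader" is just an on-GPU random slice: no workers, no host<->device copies
        idx = torch.randperm(data_gpu.shape[0], device=data_gpu.device)[:batch_size]
        return data_gpu[idx], labels_gpu[idx]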

As for self.se, that used to be a flag for the squeeze_and_excite layers. I think it's redundant now since it's just a default thing -- this is a one-person show and I'm moving a lot of parts fast, so there are oftentimes little extraneous bits and pieces hanging around. I'll try to clean that up on the next pass; very many thanks for pointing that out and asking!

I'm happy to answer any other questions that you might have! :D

1

tysam_and_co OP t1_j6hx45k wrote

Thanks for sharing. I think you might be missing some of the bigger picture here! Most of the changes and performance improvements did indeed come from changing the architecture, memory format, execution order, network width, etc., in the right places. These come from about five years of prior experience in which my primary task was architecting networks like this. I actually transferred a number of personal lessons learned into this network to get a lot of the benefits we have here. So I'm not quite sure why they would suddenly not scale to other problems! ;P That said, there might be some tweaks needed to line up with the inductive biases of different datasets (in this case, say, 1-2 more downscaling blocks for Imagenet or something like that).

I also wouldn't focus on the hyperparameter twiddling that much -- though it is important, and it definitely can be a trap. At the bleeding edge of a world record, every option is on the table, and hyperparameters promise results but are exponentially more expensive to work with. Outside of that regime, though, the 'good enough' parameter space should be pretty flat, so this is likely not too bad of a starting place.

I'm a bit curious about how this would not be reproducible on another dataset (especially if we're narrowing our inductive space -- this should increase generalization, not reduce it!). Similar to Transformers, the simpler and more scalable this architecture is, the better. One of my go-tos for people newer to the field is to encourage them to keep things as simple as possible. It pays off!

In this case, for example, before release, I just added 70 epochs and doubled the base width, and went from 94.08% to 95.77%. That's a good sign! It should at least have good basic performance on other datasets, and if something has to be changed, it's probably just a few hyperparameters, and not all of them, if that makes sense.

2

tysam_and_co OP t1_j6hfdaj wrote

And, safe to say, there's stuff I'm not sharing here (yet?) that I've found as a result of that. Some hyperparameters are more network-specific, some are dataset specific. And some behave in ways just weirdly enough that you might get an entirely new adaptive method out of it... ;))))

I hadn't thought about it in the exact words you put it in for a long while, but I think you're very much right! It is quite a fun game, and very interesting to play in its own right. There's very much this bizarre, high-dimensional "messy zen" to it all. ;D

Thanks again for your comment, it warmed my heart and made me smile seeing it. Have a good evening/night/etc! :D :))) <3 <3 :D

3

tysam_and_co OP t1_j6h8nhh wrote

Excellent, and thank you very much for sharing that paper, I shall have to take a look at it! :D

I might need to do some operator fusion manually at some point in the future, though I'm hoping the torch.compile() command does it well (but I am somewhat scared because compiling territory can be more rigid and error-prone).
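For now, the plan is really just the one-liner (PyTorch 2.x; the tiny placeholder model here is only for illustration), with manual fusion only if that falls short:

    import torch
    import torch.nn as nn

    net = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.GELU()).cuda()   # placeholder model
    net = torch.compile(net)                             # let the compiler handle operator fusion
    out = net(torch.randn(8, 3, 32, 32, device="cuda"))  # first call triggers compilation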

1

tysam_and_co OP t1_j6h8h8n wrote

I'm sure there's the classical leakage of the val set into the network design, via using val-set performance to guide super-tight tweaks. Thankfully, the compression that comes from simplifying the network seems to be something of a defense against that, but if I were hard-pressed, doing a final round with k-fold validation would probably be really good for the final tuning runs.
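Something like this is what I have in mind for that final pass (a hypothetical sketch; train_fn and eval_fn are stand-ins for the actual training and scoring code):

    import numpy as np
    from sklearn.model_selection import KFold

    def kfold_score(train_fn, eval_fn, data, labels, k=5):
        # rotate the held-out split so the tuning signal isn't keyed to one fixed val set
        scores = []
        for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(data):
            model = train_fn(data[train_idx], labels[train_idx])
            scores.append(eval_fn(model, data[val_idx], labels[val_idx]))
        return float(np.mean(scores))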

There might be a CIFAR10 test set (see, goes to show how little I know about the CIFAR10 split, lol), and there has been a lot of work put into studying some aspects (and flaws, like mislabeling, etc) of the structure of CIFAR10 in general.

Mainly -- this task is primarily a proxy for "how fast can we ingest the information encoded in dataset A into a compressed form B so that it will perform adequately well on a task for a dataset we call C". It starts getting down to the engineering barebones of information flow at that point, and a lot of other really involved stuff that I don't want to break into while our training times are 4-5 seconds or so.

Whatever usable concepts get distilled from this would apply to nearly any dataset and any problem -- say, training GPT on an absolutely massive one-shot dataset, for example. The math is the same anywhere you go.

I don't know if that answers your question or not, my apologies if I didn't understand properly.

5

tysam_and_co OP t1_j6h2qlh wrote

That's a good question, and I'm somewhat curious what you mean by HPC/MLHPC techniques or code optimizations. Do you mean something like distributed computing? (which that is an interesting rabbit-hole alone to get into....)

Regardless -- yep! I'm a professional in this industry, and there's a lot of detail underlying a ton of seemingly-simple changes (which is potentially even more frustrating when simple, understandable changes shave off large chunks and swathes of what was previously the world record). So basically everything I'm doing is informed by, well, years of doing basically this exact same thing over and over again. Something I've found is that younger/newer ML engineers (myself included when I was at that point) are often really attracted to the "new shiny", when in reality, good HPC on a smaller scale is like writing a really nice, quiet, tight haiku. Less is more, but a single 'syllable' equivalent can make or break the whole thing.

Lots of people use models inefficiently. This model is still somewhat inefficient in its own ways, though I think it is by far more efficient than nearly all of the ones it's currently competing with. When I design a model, I'm thinking about keeping GPU occupancy high, utilizing tensor cores as much as possible, mathematically fusing operations to reduce overhead, managing memory layout to make sure the right paths get activated in the GPU (like the tensor cores, etc.), and seeing if there are good approximations or alternatives to some operations that are much cheaper mathematically (or alternate paths with specialized kernels that I can boutique-design the network around).
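To make a couple of those levers concrete (illustrative defaults here, not the exact settings in hlb-CIFAR10): keeping activations in a half-precision dtype and a channels-last memory layout is most of what it takes to get the conv kernels onto the tensor-core paths.

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1, bias=False),
                          nn.BatchNorm2d(64), nn.GELU()).cuda()
    model = model.to(memory_format=torch.channels_last)    # NHWC layout that cuDNN's fast kernels want

    x = torch.randn(512, 3, 32, 32, device="cuda").to(memory_format=torch.channels_last)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        out = model(x)                                      # fp16 convs/matmuls -> tensor cores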

I'll have to cut myself short early, but I'll leave you with a single example: a technical breakdown of what was, in practice, a very 'simple' change in the backend. I also made another comment in this reddit thread (https://www.reddit.com/r/MachineLearning/comments/10op6va/comment/j6h0z6b/?utm_source=share&utm_medium=web2x&context=3) with the technical breakdown behind one other very 'simple' change. Don't get pulled away by the shiny, fancy techniques that are slow, etc. -- sometimes the simplest is the best!

Here's the breakdown: https://github.com/tysam-code/hlb-CIFAR10/issues/2#issuecomment-1379711156

Let me know if this answered your question at all or if you have any follow-ups, much love, cheers, and thanks! <3 :D :D :D :D :))))

2

tysam_and_co OP t1_j6h0z6b wrote

Thanks for asking, great question! I'd say it's really hard to pick at this point -- mostly it's just a hardcore case of "do the basics and do them really, _really_ well" as best as I can, with a few smaller tricks along the way. There may be some much more exotic things later on, but experience has taught me to try to delay that for as long as is humanly possible! Plus, the bonus is that things get to be simpler. Arguably, some aspects of this code are actually simpler than the baseline, believe it or not!

That said, if I had to pick a trick, I think it would be 'upgrading' the whitening convolution from 3x3 to 2x2. I think that saved maybe just over a full second and a half alone, when combined with the 'padding=0' change at the start. Most of the in-practice changes here are pretty simple, but what's happening is that we're projecting from the input image into a whitened feature space. The 3x3 convs result in a 3*3*3 = 27-dimensional input patch space without any downstriding, and this can be horribly slow: the spatially large layers are always the slowest compute-wise, while deeper layers without much spatial width or height are by comparison very snappy (correct me if I'm wrong, I think this has to do with the SIMD architecture of GPUs -- in any case, spatial work in 2d convolutions at least tends to be hilariously inefficient).

Not padding also cuts off a potentially expensive kernel call (I don't know if it's fused or not...), and reduces the spatial size, IIRC, from 32x32 -> 30x30. That is a deceptively large (roughly ~12%) savings in pixel count, but not everything is lost, as that lovely 2x2 convolution is still going to touch everything (I could theorize about the efficiency of processing the edges of pictures, but I could also be horribly wrong, so I'm going to keep my mouth shut here). Summing it up: we move from a 3*3*3 = 27-dimensional input patch space to a 2*2*3 = 12-dimensional one, remove ~12% of the pixels without necessarily deleting that information, and, most importantly, we only have to run with 2*2/(3*3) = 4/9 ≈ 44% of the input kernel cost.

And that's why I'm proud of that little trick. It's very unassuming, since it's just:

Conv(input_depth, output_depth, kernel_size=3, padding='same') -> Conv(input_depth, output_depth, kernel_size=2, padding=0)
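Or, spelled out as actual layers (a sketch of the idea with placeholder depths, not lines lifted straight from the repo):

    import torch.nn as nn

    # before: padded 3x3 whitening projection (each patch is 3*3*3 = 27 input values)
    whiten_before = nn.Conv2d(3, 32, kernel_size=3, padding='same')

    # after: unpadded 2x2 projection (2*2*3 = 12 values per patch, ~4/9 of the kernel cost)
    whiten_after = nn.Conv2d(3, 32, kernel_size=2, padding=0)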

Now of course there's a bit of a hit to accuracy, but the name of the game here is leverage, and that's what the squeeze-and-excite layers are for. They're very fast but add a huge amount of accuracy, though (and I unfortunately don't think I've noted this anywhere else) for some reason they are very sensitive to the compression dimension -- 16 here in this case.
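For anyone unfamiliar, a generic squeeze-and-excite block looks roughly like the below (a textbook-style sketch, not the repo's exact module) -- compress_dim=16 is the compression dimension the accuracy seems so sensitive to:

    import torch.nn as nn

    class SqueezeExcite(nn.Module):
        def __init__(self, channels, compress_dim=16):
            super().__init__()
            self.gate = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),               # squeeze: global average over H x W
                nn.Conv2d(channels, compress_dim, 1),  # bottleneck down to compress_dim
                nn.ReLU(),
                nn.Conv2d(compress_dim, channels, 1),  # back up to the channel count
                nn.Sigmoid())                          # per-channel gates in (0, 1)

        def forward(self, x):
            return x * self.gate(x)                    # excite: rescale each channel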

Though to be frank, I started with squeeze-and-excite and got my accuracy increase, then pulled this off the shelf to 'cash in' the speed increase. I have been sitting on this one since before even the last release; I've found it's good not to be like (noise warning) https://www.youtube.com/watch?v=ndVhgq1yHdA on projects like these. Taking the time to be good and slow is good!

I hope that helps answer your question. I know this was a really long answer -- paradoxically, I get far more verbose the more tired I am, and poor next-day-me has to deal with it, lol.

Again, to get below 2 seconds, we're going to have to get progressively more fancy and "flashy", but for now the plan is to build a really, really freaking solid core of a network, then get into the more 'exotic' stuff. And even then, hopefully the more mundane exotic stuff first, while we're at it.

Hope that helps answer your question, feel free to let me know if you have any others (or follow-ups, or if this wasn't what you were looking for, etc)! :D

5

tysam_and_co OP t1_j6gzgf9 wrote

I've created a writeup for each release in https://github.com/tysam-code/hlb-CIFAR10/releases

Maybe I could do more at some point (like the original "bag of tricks" blogpost) once the smoke all clears, but I spent ~8-10 hours straight manually tuning hyperparameters last night, so I am smoked! Though usually I am pretty beat on release days anyway from trying to keep up with everything in the dopamine rush. :D

17

tysam_and_co OP t1_j6gjzzf wrote

Many thanks! I've found that the 'speed hunger' for me is truly insatiable -- we're down to almost half the training time we started with, and I find myself just as hungry to make it faster and faster. The Sisyphean hill is real, though I suppose it is more easily justified with a goal in mind! 😄😁

9

tysam_and_co OP t1_j6g3e49 wrote

Hello! Thanks so much for the comment, I really appreciate it. This is a convnet-based architecture, so it's carrying on the torch of some of the old DawnBench entries.

Transformers have the best top-end of all of the neural networks, and convolutional networks tend to have an edge in the smaller/tiny regime, IIRC. One could maximize training speed for a transformer architecture, but the cost of just 1-2 layers could be several times the cost of an entire forward pass through this very tiny convnet. I even tried to just add a really tiny 16x16 attention multiply at the end of the network and it totally tanked the training speed.

However, that said, I'd really like to pick up the work of https://arxiv.org/abs/2212.14034 and continue from there; the concept of getting an algorithm to really compress that information can start opening up the horizon to some of the hard laws that underlie neural network training in the limit. For example, somewhere along the way, this project's convnet apparently started showing really strong consistency with scaling laws. I'm not sure why.

But in any case -- language models are hopefully next (if I get the time and have the interest/don't burn myself out on this project in the meantime!). I'll probably be focused first on picking up some part-time research work in the field between here and there, as that's my top priority right now (aside from a few community code contributions -- this codebase is my living resume, after all, and I think a good one at that! :D).

Hope that helped answer your question, and if not, please let me know and I'll give you my best shot! :D

25