tysam_and_co OP t1_j6h0z6b wrote

Thanks for asking, great question! I'd say it's really hard to pick at this point -- mostly it's just a hardcore case of "do the basics and do them really, _really_ well" as best as I can, with a few smaller tricks along the way. There may be some much more exotic things later on, but experience has taught me to try to delay that for as long as is humanly possible! Plus, the bonus is that things get to be simpler. Arguably, some aspects of this code are actually simpler than the baseline, believe it or not!

That said, if I had to pick a trick, I think it would be 'upgrading' the whitening convolution from 3x3 to 2x2. I think that alone saved roughly a second and a half when combined with the 'padding=0' change at the start. Most of the in-practice things here are pretty simple, but here's what's happening: we're projecting from the input image to a whitened feature space, and a 3x3 conv sees a 3*3*3 = 27-dimensional input patch at every position, with no downstriding. That can be horribly slow, since the spatially large layers are always the slowest compute-wise. Deeper layers without much spatial width or height are very snappy by comparison (correct me if I'm wrong, I think this has to do with the SIMD architecture of GPUs; in any case, spatial stuff with 2D convolutions at least tends to be hilariously inefficient).

Not padding cuts off a potentially expensive kernel call (I don't know if it's fused or not...) and shrinks the output feature map: a 2x2 kernel with no padding takes 32x32 down to 31x31 (an unpadded 3x3 would give 30x30). That's roughly a 6% savings in spatial pixel count, but not everything is lost, as that lovely 2x2 convolution still touches every input pixel (I could theorize about the efficiency of processing the edges of pictures, but I could also be horribly wrong, so I'm going to keep my mouth shut here). So, summing it up: we move from a 3*3*3 = 27-dimensional input feature space to a 2*2*3 = 12-dimensional one, remove about 6% of the output pixels without necessarily deleting that information directly, and most importantly we only pay (2*2)/(3*3) = 4/9 ≈ 44% of the input kernel cost.
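The arithmetic in this paragraph can be checked with a few lines of plain Python. This is only a sketch using the standard convolution output-size formula; the helper names `conv_out_size` and `patch_dim` are made up for illustration and are not from the actual code. Note that under this formula the 2x2, padding=0 case comes out at 31x31, while 30x30 is what an unpadded 3x3 kernel gives.

```python
def conv_out_size(size, kernel, padding=0, stride=1):
    """Output length along one spatial dimension for a standard convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def patch_dim(kernel, channels=3):
    """How many input values one kernel position sees: kernel * kernel * channels."""
    return kernel * kernel * channels


image = 32  # CIFAR-10 images are 32x32 RGB

same_3x3 = conv_out_size(image, kernel=3, padding=1)   # 3x3 with 'same' padding
valid_3x3 = conv_out_size(image, kernel=3, padding=0)  # 3x3, no padding
valid_2x2 = conv_out_size(image, kernel=2, padding=0)  # 2x2, no padding

# Fraction of output positions removed relative to the padded 3x3 baseline.
pixels_saved_2x2 = 1 - valid_2x2 ** 2 / same_3x3 ** 2
pixels_saved_3x3 = 1 - valid_3x3 ** 2 / same_3x3 ** 2

# Per-position cost of the 2x2 kernel relative to the 3x3 kernel.
kernel_cost_ratio = patch_dim(2) / patch_dim(3)

print(f"3x3 'same' output: {same_3x3}x{same_3x3}, patch dim {patch_dim(3)}")
print(f"2x2 unpadded output: {valid_2x2}x{valid_2x2}, patch dim {patch_dim(2)}")
print(f"3x3 unpadded output: {valid_3x3}x{valid_3x3}")
print(f"pixels saved (2x2, no pad): {pixels_saved_2x2:.1%}")
print(f"pixels saved (3x3, no pad): {pixels_saved_3x3:.1%}")
print(f"kernel cost ratio: {kernel_cost_ratio:.1%}")
```

The per-position kernel cost and the number of output positions multiply, so the two savings compound.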

And that's why I'm proud of that little trick. It's very unassuming, since it's just:

Conv(input_depth, output_depth, kernel_size=3, padding='same') -> Conv(input_depth, output_depth, kernel_size=2, padding=0)
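To make the swap concrete, here is a dependency-free sketch of a naive stride-1 convolution, run once with a 3x3 kernel and 'same'-style zero padding and once with a 2x2 kernel and no padding. It is an illustration only, not the project's code: `conv2d`, `random_weights`, and `shape` are hypothetical helpers, the weights are random rather than whitening filters, and the 4 output channels are an arbitrary choice. In PyTorch the equivalent layer would presumably be an `nn.Conv2d` with the same `kernel_size` and `padding` arguments.

```python
import random


def conv2d(image, weights, padding=0):
    """Naive stride-1 2D convolution (cross-correlation, as in deep learning libraries).

    image:   nested lists shaped [channels][height][width]
    weights: nested lists shaped [out_channels][channels][k][k]
    """
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    k = len(weights[0][0])
    # Zero-pad every channel on all four sides.
    padded = [
        [[0.0] * (width + 2 * padding) for _ in range(padding)]
        + [[0.0] * padding + list(row) + [0.0] * padding for row in channel]
        + [[0.0] * (width + 2 * padding) for _ in range(padding)]
        for channel in image
    ]
    out_h = height + 2 * padding - k + 1
    out_w = width + 2 * padding - k + 1
    return [
        [
            [
                sum(
                    w[c][i][j] * padded[c][y + i][x + j]
                    for c in range(channels)
                    for i in range(k)
                    for j in range(k)
                )
                for x in range(out_w)
            ]
            for y in range(out_h)
        ]
        for w in weights
    ]


def random_weights(out_channels, in_channels, k, rng):
    """Random stand-in weights; a real whitening layer would compute these from data."""
    return [
        [[[rng.uniform(-1, 1) for _ in range(k)] for _ in range(k)]
         for _ in range(in_channels)]
        for _ in range(out_channels)
    ]


def shape(tensor):
    return len(tensor), len(tensor[0]), len(tensor[0][0])


rng = random.Random(0)
image = [[[rng.random() for _ in range(32)] for _ in range(32)] for _ in range(3)]

before = conv2d(image, random_weights(4, 3, 3, rng), padding=1)  # 3x3, 'same'
after = conv2d(image, random_weights(4, 3, 2, rng), padding=0)   # 2x2, no padding

print("3x3, padded:  ", shape(before))
print("2x2, unpadded:", shape(after))
```

Each output value in the second call sums 12 products instead of 27, which is where the 4/9 per-position cost comes from.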

Now of course there's a bit of a hit to accuracy, but the name of the game here is leverage, and that's what the squeeze-and-excite layers are for. They're very fast but add a huge amount of accuracy, though (and I unfortunately don't think I've noted this anywhere else) for some reason they are very sensitive to the compression dimension -- 16 here in this case.
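For readers who haven't met squeeze-and-excite before, the sketch below shows the textbook form of the layer in plain Python: pool each channel to one number, push that vector through a small bottleneck, and use the result to rescale the channels. It rests on several assumptions that are not confirmed by the post: that the "compression dimension" of 16 means the bottleneck width, that there are 64 channels on an 8x8 feature map, and that biases can be dropped. The function name and random weights are hypothetical, and the real implementation may differ.

```python
import math
import random


def squeeze_excite(features, w_down, w_up):
    """Textbook squeeze-and-excite on nested lists.

    features: [channels][height][width]
    w_down:   [bottleneck][channels]   (compress channels -> bottleneck)
    w_up:     [channels][bottleneck]   (expand bottleneck -> channels)
    """
    # Squeeze: global average pool, one number per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]
    # Excite: bottleneck with ReLU, then a sigmoid gate per channel.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w_down]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_up
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[v * g for v in row] for row in ch] for ch, g in zip(features, gates)]
    return scaled, gates


rng = random.Random(0)
channels, bottleneck, size = 64, 16, 8  # assumed sizes; 16 is the bottleneck width

features = [[[rng.random() for _ in range(size)] for _ in range(size)]
            for _ in range(channels)]
w_down = [[rng.uniform(-0.1, 0.1) for _ in range(channels)] for _ in range(bottleneck)]
w_up = [[rng.uniform(-0.1, 0.1) for _ in range(bottleneck)] for _ in range(channels)]

scaled, gates = squeeze_excite(features, w_down, w_up)
print("gates computed:", len(gates))
print("weights in the two small matrices:", 2 * channels * bottleneck)
```

The spatial work is just an average and a multiply per value, and the two matrices are tiny, which is consistent with the layer being cheap relative to a convolution.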

Though to be frank, I started with squeeze-and-excite and got my accuracy increase, then pulled this off the shelf to 'cash in' the speed increase. I have been sitting on this one since before even the last release. I've found it's good to not be like (noise warning) https://www.youtube.com/watch?v=ndVhgq1yHdA on projects like these. Taking time to be good and slow is good!

I know this was a really long answer. Paradoxically, I get far more verbose the more tired I get, and poor next-day-me has to deal with it, lol.

Again, to get below 2 seconds, we're going to have to get progressively more fancy and "flashy", but for now the plan is to build a really, really freaking solid core of a network, then get into the more 'exotic' stuff. And even then, hopefully the more mundane exotic stuff while we're at it.

Hope that helps answer your question, feel free to let me know if you have any others (or follow-ups, or if this wasn't what you were looking for, etc)! :D
