tysam_and_co t1_j8cf1o9 wrote on February 13, 2023 at 7:00 AM

Reply to comment by ArnoF7 in [D] Quality of posts in this sub going down by MurlocXYZ

That is a really good point.

One minor point of contention, though: most of the comments in that post seem pretty well-informed. The main disagreement I see is whether batchnorm goes before or after the activation. Oddly enough, years later, putting it before the activation seems to be the better choice because of the efficiency gains offered by fusing.
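To make the fusing point concrete, here is a minimal sketch, assuming "fusing" means folding inference-mode batchnorm into the affine layer right before it (which only works when nothing nonlinear sits between the two, i.e. batchnorm comes before the activation). The function names and toy values are hypothetical, and a plain linear layer stands in for a convolution.

```python
import math

def linear(W, b, x):
    """Plain affine layer: y_i = sum_j W[i][j] * x[j] + b[i]."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def batchnorm_eval(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode batchnorm using fixed running statistics."""
    return [g * (yi - m) / math.sqrt(v + eps) + bt
            for yi, g, bt, m, v in zip(y, gamma, beta, mean, var)]

def fold_batchnorm(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold inference-mode batchnorm into the preceding affine layer."""
    # Per-output-channel scale applied by batchnorm.
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    # Scale each output row of W, then shift the bias accordingly.
    W_folded = [[s * w for w in row] for row, s in zip(W, scale)]
    b_folded = [s * (bi - m) + bt for s, bi, m, bt in zip(scale, b, mean, beta)]
    return W_folded, b_folded

# Hypothetical toy values, chosen only for illustration.
W = [[0.5, -1.0], [2.0, 0.25]]
b = [0.1, -0.3]
gamma, beta = [1.5, 0.8], [0.2, -0.1]
mean, var = [0.4, 1.0], [2.0, 0.5]
x = [1.0, 3.0]

unfused = batchnorm_eval(linear(W, b, x), gamma, beta, mean, var)
W_f, b_f = fold_batchnorm(W, b, gamma, beta, mean, var)
fused = linear(W_f, b_f, x)  # one layer instead of two, same output
```

With the activation in between (layer, activation, batchnorm), this algebra no longer goes through, so the batchnorm has to stay a separate op at inference time.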

I'm surprised they were so on the mark, even 6 years ago, in being skeptical of this internal covariate shift business. I guess keeping the statistics centered and such is helpful, but as we've seen since then, batchnorm seems to do so much more than just that (and, in my experience, it is unfortunately a frustratingly utilitarian, if limiting, tool).

starfries t1_j8fp30e wrote on February 13, 2023 at 11:31 PM

What's the current understanding of why/when batch norm works? I haven't kept up with the literature but I had the impression there was no real consensus.