Submitted by MurlocXYZ t3_110swn2 in MachineLearning
ArnoF7 t1_j8azbzj wrote
Discussion in this subreddit is always a bit hit or miss. After all, Reddit as a community has almost no gatekeeping. While that can be a good thing, it of course has its downsides.
If you look at this post about batch norm, you'll see that some people brought up interesting insights, while a good chunk of commenters clearly never read the paper carefully. And that post is from 5 years ago.
tysam_and_co t1_j8cf1o9 wrote
That is a really good point.
Though, one minor contention: it seems like most of the comments in that post are pretty well-informed. The main disagreement I see is whether batchnorm goes before or after the activation, and oddly enough, years later, putting it before the activation seems to have won out, largely because of the efficiency gains from fusing the convolution and the batchnorm.
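(For concreteness, a minimal PyTorch sketch of the two orderings under discussion; the channel counts are just placeholders:)

```python
import torch.nn as nn

# Batchnorm before the activation: at inference time the conv and batchnorm
# can be folded ("fused") into a single conv, which is where the speedup comes from.
bn_before_act = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)

# Batchnorm after the activation, the ordering some of the older comments argued for.
bn_after_act = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.BatchNorm2d(64),
)
```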
I'm surprised they were so on the mark even 6 years ago about being skeptical of this internal covariate shift business. I guess keeping the statistics centered and such is helpful, but as we've seen since then, batchnorm seems to do so much more than just that (and in my experience it's a frustratingly utilitarian, if limiting, tool, unfortunately).
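(By "keeping the statistics centered" I just mean the training-mode forward pass, roughly like this sketch; it ignores the running statistics batchnorm tracks for inference:)

```python
import torch

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (N, C, H, W). Normalize each channel over the batch and spatial dims,
    # then apply the learned per-channel scale (gamma) and shift (beta).
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)
```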
starfries t1_j8fp30e wrote
What's the current understanding of why/when batch norm works? I haven't kept up with the literature but I had the impression there was no real consensus.