tysam_and_co t1_j8cf1o9 wrote
Reply to comment by ArnoF7 in [D] Quality of posts in this sub going down by MurlocXYZ
That is a really good point.
Though, one minor contention: most of the comments in that post seem pretty well-informed. The main disagreement I see is whether batchnorm goes before or after the activation, and oddly enough, years later, before the activation seems to have won out, largely because of the efficiency gains from fusing the normalization into the neighboring layer.
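For concreteness, a minimal PyTorch sketch of the two orderings being debated (layer sizes are just illustrative):

```python
import torch.nn as nn

# "Before the activation": Conv -> BN -> ReLU. With BN directly
# after the conv, the conv's bias is redundant (BN re-centers
# anyway), and at inference the BN affine transform can be
# folded ("fused") into the conv weights, making BN nearly free.
bn_before_act = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)

# "After the activation": Conv -> ReLU -> BN, the other ordering
# from the thread; the nonlinearity in between blocks that fusion.
bn_after_act = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),
    nn.BatchNorm2d(64),
)
```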
I'm surprised they were so on the mark even six years ago in being skeptical of this internal covariate shift business. I guess keeping the statistics centered and so on is helpful, but as we've seen since then, batchnorm seems to do much more than just that (and in my experience it's a frustratingly utilitarian, if limiting, tool, unfortunately).
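For anyone who hasn't looked under the hood, "keeping the statistics centered" is just the normalization step; a rough sketch of the train-time computation (illustrative, not the actual library internals):

```python
import torch

def batchnorm_sketch(x, gamma, beta, eps=1e-5):
    # x: (N, C, H, W); gamma, beta: learned, shaped (1, C, 1, 1).
    # Normalize each channel over the batch and spatial dims
    # to zero mean / unit variance...
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    # ...then apply the learned per-channel affine transform;
    # this affine part is what gets folded into a neighboring
    # conv when layers are fused at inference time.
    return gamma * x_hat + beta

# e.g.: batchnorm_sketch(torch.randn(8, 64, 32, 32),
#                        torch.ones(1, 64, 1, 1),
#                        torch.zeros(1, 64, 1, 1))
```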
starfries t1_j8fp30e wrote
What's the current understanding of why/when batch norm works? I haven't kept up with the literature, but I had the impression there was no real consensus.