
tysam_and_co t1_j8cf1o9 wrote

That is a really good point.

Though, a minor point of contention: most of the comments in that post seem pretty well-informed. The main difference I see is whether batchnorm goes before or after the activation. Oddly enough, years later, placing it before the activation seems to be the better choice, because of the efficiency gains offered by fusing.
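For anyone wondering what "fusing" looks like in practice, here is a rough sketch, assuming inference-time batchnorm (fixed running statistics) sitting directly after a linear layer. In that case the normalization is just a per-output scale and shift, so it can be folded into the layer's weights and bias. The function names and the toy numbers are made up for illustration and are not taken from any library.

```python
import math

def linear(x, weight, bias):
    # Plain dense layer: one output per weight row.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def batchnorm_inference(y, gamma, beta, mean, var, eps=1e-5):
    # Inference-mode batchnorm using fixed running statistics.
    return [g * (yi - m) / math.sqrt(v + eps) + bt
            for yi, g, bt, m, v in zip(y, gamma, beta, mean, var)]

def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    # Fold the per-output scale and shift into the layer parameters.
    scales = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    folded_weight = [[s * w for w in row] for row, s in zip(weight, scales)]
    folded_bias = [s * (b - m) + bt
                   for s, b, m, bt in zip(scales, bias, mean, beta)]
    return folded_weight, folded_bias

def relu(y):
    return [max(0.0, v) for v in y]

# Toy values, chosen arbitrarily for the illustration.
weight = [[0.5, -1.0], [2.0, 0.25]]
bias = [0.1, -0.3]
gamma, beta = [1.5, 0.8], [0.2, -0.1]
mean, var = [0.05, 0.4], [0.9, 1.6]
x = [1.0, 2.0]

# Unfused: linear -> batchnorm -> activation (three steps).
unfused = relu(batchnorm_inference(linear(x, weight, bias),
                                   gamma, beta, mean, var))

# Fused: one linear layer with folded parameters -> activation.
folded_weight, folded_bias = fold_batchnorm(weight, bias,
                                            gamma, beta, mean, var)
fused = relu(linear(x, folded_weight, folded_bias))

print(unfused, fused)
```

The fold into the preceding layer only works because nothing nonlinear sits between the layer and the batchnorm. With the activation in between, the scale and shift can no longer be absorbed into that layer's weights.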

I'm surprised they were so on the mark, even 6 years ago, in being skeptical of this internal covariate shift business. I guess keeping the statistics centered and such is helpful, but as we've seen since then, batchnorm seems to do much more than just that (and, unfortunately, in my experience it is a frustratingly utilitarian, if limiting, tool).


starfries t1_j8fp30e wrote

What's the current understanding of why/when batch norm works? I haven't kept up with the literature but I had the impression there was no real consensus.
