canbooo t1_ivydtlt wrote

You are right, and what I'm asking may be practically irrelevant; I really should RTFP. Still, consider the edge case of a single layer with 1 input and 1 output. Every node whose input weight is 1 sees exactly the same gradient (and likewise for the nodes whose weight is 0), so gradient descent can never differentiate them and they remain identical copies of each other. Increasing the number of inputs makes identical configurations combinatorially improbable, but increasing the number of nodes in a layer makes them likelier. So this could matter in low dimensions or in models with a narrow bottleneck. I'm sure the authors already thought about this and either discarded it as quite unlikely in their tested settings, or have a solution/analysis somewhere in the paper, hence my question.
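
For concreteness, here is a minimal PyTorch sketch of that symmetry (my own toy example, not from the paper; the width, the data, and the tanh nonlinearity are arbitrary choices). Nodes initialized with identical input and output weights receive identical gradients, so no number of gradient steps will ever pull them apart:

```python
import torch

torch.manual_seed(0)

n_hidden = 4                        # illustrative width, not from the paper
x = torch.randn(8, 1)               # toy batch: 1 input feature
y = torch.randn(8, 1)               # toy regression targets

# Every hidden node gets the same input weight (1) and output weight (1),
# mirroring the symmetric edge case described above.
w1 = torch.ones(n_hidden, 1, requires_grad=True)
w2 = torch.ones(1, n_hidden, requires_grad=True)

h = torch.tanh(x @ w1.T)            # all n_hidden activations are identical
loss = ((h @ w2.T - y) ** 2).mean()
loss.backward()

# All rows of w1.grad are equal, so a gradient step keeps the nodes
# identical: the symmetry never breaks and the layer behaves as width 1.
print(w1.grad.squeeze())
```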
