Submitted by hardmaru t3_ys36do in MachineLearning
VinnyVeritas t1_iw0w76i wrote
Seems useless. Why not simply fix the seed of the random number generator for reproducibility?
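For readers unfamiliar with the suggestion, here is a minimal sketch of what fixing the seed looks like, using only Python's standard library. The helper `sample_weights` is hypothetical and merely stands in for a random weight initializer; it is not taken from the paper or from any framework.

```python
import random

def sample_weights(seed, n=5):
    """Draw n Gaussian 'weights' from a generator seeded with `seed`.

    Hypothetical stand-in for a random initializer, for illustration only.
    """
    rng = random.Random(seed)  # private generator, so global state is untouched
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

run_a = sample_weights(seed=0)
run_b = sample_weights(seed=0)  # same seed: identical draws
run_c = sample_weights(seed=1)  # different seed: different draws

print("same seed reproduces the weights:", run_a == run_b)
print("different seed changes the weights:", run_a != run_c)
```

A fixed seed makes a single run repeatable, but the weights are still drawn from a random distribution.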
master3243 t1_iw1hggt wrote
The problem is not random variance between trained models.
Check out the abstract, it answers why this work is useful.
VinnyVeritas t1_iw1s40n wrote
Like what? Training ultra-deep neural networks without batchnorm? But in their experiments the accuracy gets worse with deeper networks. What's the point of going deeper to get worse results?
master3243 t1_iw1x35r wrote
> They theoretically show that, different from naive identity mapping, their initialization methods can avoid training degeneracy when the network dimension increases. In addition, they empirically show that they can achieve better performance than random initializations on image classification tasks, such as CIFAR-10 and ImageNet. They also show some nice properties of the model trained by their initialization methods, such as low-rank and sparse solutions.
VinnyVeritas t1_iw9ajwe wrote
The performance is not better: the results are the same within the margin of error for standard (not super-deep) networks. Here is what I copied from their table:
CIFAR-10
ZerO Init     5.13 ± 0.08
Kaiming Init  5.15 ± 0.13
ImageNet
ZerO Init     23.43 ± 0.04
Kaiming Init  23.46 ± 0.07
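The "within the margin of error" claim can be checked with a quick calculation. The sketch below is only a rough check, assuming the ± values are independent spreads that can be combined in quadrature; that assumption, and the names `results` and `within_margin`, are illustrative and not from the paper.

```python
import math

# (mean, ± spread) pairs copied from the table quoted above
results = {
    "CIFAR-10": {"ZerO": (5.13, 0.08), "Kaiming": (5.15, 0.13)},
    "ImageNet": {"ZerO": (23.43, 0.04), "Kaiming": (23.46, 0.07)},
}

def within_margin(a, b):
    """Compare two (mean, spread) pairs.

    Returns (gap, combined, ok), where `combined` is the two spreads added
    in quadrature (an assumption: independent spreads) and `ok` is True
    when the gap between the means does not exceed it.
    """
    (mean_a, err_a), (mean_b, err_b) = a, b
    gap = abs(mean_a - mean_b)
    combined = math.hypot(err_a, err_b)
    return gap, combined, gap <= combined

for dataset, rows in results.items():
    gap, combined, ok = within_margin(rows["ZerO"], rows["Kaiming"])
    print(f"{dataset}: gap={gap:.2f}, combined margin={combined:.2f}, within margin={ok}")
```

Under that assumption, the gap between the two initializations is smaller than the combined margin on both datasets.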