Submitted by Blutorangensaft t3_11wmpoj in MachineLearning

I am working with ResNets consisting of feedforward networks. Additionally, I am using Kaiming (He) weight initialisation and ReLU as the activation function. Extending the network to more than 10 layers leads to vanishing gradients. I cannot use batch normalization because it would violate the assumptions of a gradient penalty. What should I do? Should I form residual connections over longer steps? Should I implement artificial derivatives? What's the common remedy here?
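
For reference, a minimal sketch of the setup described above (PyTorch assumed; the class and variable names are illustrative, not from the post): a feedforward residual block with He/Kaiming initialisation, ReLU, and no normalization layer, stacked past ten blocks.

```python
# Minimal sketch of the described setup (PyTorch assumed; names are illustrative):
# a feedforward residual block with He/Kaiming initialisation, ReLU, and no
# normalization layer.
import torch
import torch.nn as nn

class PlainResBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        for fc in (self.fc1, self.fc2):
            # He initialisation, matched to the ReLU nonlinearity.
            nn.init.kaiming_normal_(fc.weight, nonlinearity="relu")
            nn.init.zeros_(fc.bias)

    def forward(self, x):
        # Identity skip connection around two linear layers.
        return x + self.fc2(torch.relu(self.fc1(x)))

# A stack deeper than ~10 blocks, as in the question.
net = nn.Sequential(*[PlainResBlock(128) for _ in range(12)])
```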

3

Comments


IntelArtiGen t1_jcyqmdu wrote

Do you use another kind of normalization? You can try InstanceNorm / LayerNorm if you can't use BatchNorm.
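
A hedged sketch of what that swap might look like (PyTorch assumed; the class name and layout are mine, not from the thread). LayerNorm drops straight into a feedforward residual block, since it normalizes each sample on its own, whereas InstanceNorm1d would expect an extra channel/length dimension.

```python
# Sketch of a residual block using LayerNorm instead of BatchNorm
# (PyTorch assumed; the class name and layout are illustrative).
import torch
import torch.nn as nn

class LayerNormResBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim, dim),
            nn.LayerNorm(dim),   # per-sample normalization, no cross-batch statistics
            nn.ReLU(),
            nn.Linear(dim, dim),
        )
        for m in self.block:
            if isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
                nn.init.zeros_(m.bias)

    def forward(self, x):
        return x + self.block(x)
```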

5

YouAgainShmidhoobuh t1_jd2n2v5 wrote

ResNets do not tackle the vanishing gradient problem. The authors specifically mention that vanishing gradients had already been largely addressed, in particular by BatchNorm. So removing BatchNorm from the equation will most likely lead to vanishing gradients.

I am assuming you are doing a WGAN-GP approach, since that would explain why BatchNorm violates the gradient penalty's assumptions. In this case, use LayerNorm as indicated here: https://github.com/LynnHo/DCGAN-LSGAN-WGAN-GP-DRAGAN-Tensorflow-2/issues/3
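
To make the constraint concrete, here is the standard WGAN-GP gradient penalty as a minimal sketch (PyTorch assumed, 2-D (batch, features) inputs; the function name is mine). The penalty is defined per interpolated sample, so the critic must not mix statistics across the batch: BatchNorm does exactly that, while LayerNorm normalizes each sample independently and leaves the penalty well-defined.

```python
# Standard WGAN-GP gradient penalty, minimal sketch (PyTorch assumed).
import torch

def gradient_penalty(critic, real, fake):
    # One random mixing coefficient per sample in the batch.
    eps = torch.rand(real.size(0), 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads, = torch.autograd.grad(
        outputs=scores, inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,
    )
    # Penalize deviation of each sample's gradient norm from 1.
    return ((grads.view(grads.size(0), -1).norm(2, dim=1) - 1) ** 2).mean()
```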

3

Blutorangensaft OP t1_jd7jaor wrote

Thank you for your comment. I have not worked with ResNets before, and the paper I used as a basis erroneously stated that they chose this architecture because of vanishing gradients. Wikipedia seems to have the same error.

Indeed, I am working with WGAN-GP. Unfortunately, switching to LayerNorm, while enabling me to scale the depth, completely changes the training dynamics. With both G and C trained at the same learning rate and a 1:1 update schedule, the critic seems to win, a behaviour I have never seen before in GANs. I suppose I will have to retune the learning rates.
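
A possible starting point for that retuning, as a hypothetical sketch (PyTorch assumed; the modules stand in for the existing generator and critic, and the learning rates are placeholders rather than recommendations): giving the two networks independent optimizers makes it easy to lower the critic's learning rate, or change the update ratio, without touching the generator.

```python
# Hypothetical retuning sketch (PyTorch assumed). The modules below are
# stand-ins for the existing generator G and critic C; the learning rates
# are placeholders, not recommended values.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))  # stand-in generator
C = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))   # stand-in critic

# Independent optimizers so the two learning rates can be tuned separately.
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.9))
opt_C = torch.optim.Adam(C.parameters(), lr=5e-5, betas=(0.5, 0.9))
```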

1