
suflaj t1_j0drngf wrote

It should, but how much is tough to say; it depends on the rest of the model and where the bottleneck is. If, say, you're doing this in the first layers, the whole model basically has to be retrained from scratch, and performance similar to the previous one is not guaranteed.

2

rubbledubbletrubble OP t1_j0dsv9t wrote

I am doing this at the last layer. That is why it doesn’t make sense to me. I’d assume that with 950 units I should get similar results.

1

suflaj t1_j0dt970 wrote

Not really. 950 is smaller than 1000, so not only are you destroying information, you are also potentially landing in a really bad local minimum.

When you add that intermediate layer, what you are essentially doing is randomly hashing your previous representation. If that random hash kills the relations your model has learned between the data, then of course it will not perform.

Now, because Xavier and Kaiming-He initializations aren't exactly designed to behave like a universal random hash, they might not kill all of your relations, but they are still random enough to have the potential to, depending on the task and data. You might get lucky, but on average you will almost never get lucky.
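
To make that concrete, here is a minimal sketch of what inserting such a bottleneck looks like, assuming a PyTorch model; the shapes, the 10-class head, and the choice of Kaiming init are placeholders, not something from your setup:

```python
import torch.nn as nn

# Hypothetical stand-in for a pretrained network whose last hidden layer
# outputs 1000 features (shapes here are made up).
backbone = nn.Sequential(nn.Linear(512, 1000), nn.ReLU())

# The newly inserted bottleneck: 1000 -> 950. Its weights start out random
# (Kaiming init shown), so at first it acts roughly like a random projection
# of whatever representation the backbone has already learned.
bottleneck = nn.Linear(1000, 950)
nn.init.kaiming_normal_(bottleneck.weight, nonlinearity="relu")
nn.init.zeros_(bottleneck.bias)

# New output head on top of the 950-dimensional bottleneck (10 classes assumed).
head = nn.Linear(950, 10)

model = nn.Sequential(backbone, bottleneck, nn.ReLU(), head)
```

The point is just that `bottleneck` starts as random noise sitting between a learned representation and the output, which is exactly the "random hash" problem above.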

If I were in your place, I would train with linear warmup to a fairly large learning rate, say 10x higher than the previous maximum. This will make very bad weights shoot out of their bad minima once the LR reaches its peak, and hopefully you'll get better results once they settle down as the LR decays. Just make sure you clip your gradients so your weights don't go to NaN, because this strategy is the equivalent of driving your car into a wall in the hope that the crash turns it into a Ferrari.
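
A rough sketch of that recipe in PyTorch; the stand-in model, the dummy data, the peak LR, the warmup length, and the clip norm are all made-up numbers you'd have to tune:

```python
import torch
from torch import nn

# Stand-in model and data; in practice this would be your modified network and loader.
model = nn.Sequential(
    nn.Linear(512, 1000), nn.ReLU(),
    nn.Linear(1000, 950), nn.ReLU(),
    nn.Linear(950, 10),
)
loader = [(torch.randn(32, 512), torch.randint(0, 10, (32,))) for _ in range(200)]

# Hypothetical numbers: if the old peak LR was 1e-3, warm up to roughly 10x that.
peak_lr = 1e-2
warmup_steps = 100
clip_norm = 1.0

optimizer = torch.optim.SGD(model.parameters(), lr=peak_lr, momentum=0.9)

# Linear warmup from ~0 up to peak_lr over warmup_steps, then hold.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)

for x, y in loader:
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    # Clip gradients so the aggressive LR can't push the weights to NaN.
    nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    optimizer.step()
    scheduler.step()
```

You'd normally follow the warmup with whatever decay schedule you used originally (cosine, step, etc.); the warmup and the clipping are the parts that matter here.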

As for how long you should train it... the best approach would be to first add the layer without any nonlinear function and see how many epochs you need to reach the original performance. Since there is no nonlinear function, the new network is just as expressive as the original. Once you have that number of epochs, add about 25% to it and train the version with the nonlinear transformation after the bottleneck for that long.
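
In code, the two variants would differ only in whether an activation follows the bottleneck; a sketch with placeholder shapes and a 10-class output:

```python
import torch.nn as nn

# Variant A: bottleneck with no nonlinearity, a pure composition of linear maps.
# Train this one only to measure how many epochs it takes to recover the
# original accuracy.
calibration_head = nn.Sequential(
    nn.Linear(1000, 950),
    nn.Linear(950, 10),
)

# Variant B: the head you actually want, with a nonlinearity after the bottleneck.
# Per the suggestion above, train it ~25% longer than Variant A needed.
final_head = nn.Sequential(
    nn.Linear(1000, 950),
    nn.ReLU(),
    nn.Linear(950, 10),
)
```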

5