Submitted by rubbledubbletrubble t3_zmxbb5 in deeplearning

I was recently messing around with architectures.

I tried this:

- Conv layer (2000-dim output)
- Dense layer (500 dim)
- Dense layer (1000 dim)

Doing this completely broke my results. The model no longer trained on the dataset.

Can this be explained or do I have a bug in my code?

I assume this is because adding a smaller middle layer reduces the amount of information that gets passed through.

Edit: I tried middle layer widths ranging from 100 to 950 and all the models give the same result: about 0.5% accuracy.

Edit 2: The model trains well if you remove the activation from the middle layer. Not sure why.
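
For reference, this is roughly the head I'm describing (a tf.keras sketch; the 2000-dim input is a stand-in for the conv layer's output, only the layer sizes are from my setup):

```python
import tensorflow as tf

# Rough sketch of the head described above. Only the layer sizes come from
# the post; the 2000-dim features stand in for the conv layer's output.
features = tf.keras.Input(shape=(2000,))                        # conv output
x = tf.keras.layers.Dense(500, activation="relu")(features)     # middle layer
# Per Edit 2: with activation=None on the line above, training works fine
outputs = tf.keras.layers.Dense(1000, activation="softmax")(x)
head = tf.keras.Model(features, outputs)
```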

3

Comments


suflaj t1_j0dojq2 wrote

You introduced a bottleneck. Either you needed to train it longer, or your bottleneck destroyed part of the information needed for better performance.

5

rubbledubbletrubble OP t1_j0drg5e wrote

Yes, but shouldn’t the model still train and learn something?

I currently have an accuracy of 0.5% with the middle layer ranging from 100 to 950.

1

suflaj t1_j0drngf wrote

It should, but how much is tough to say; it depends on the rest of the model and where this bottleneck is. If, say, you're doing this in the first layers, the whole model basically has to be retrained from scratch, and performance similar to the previous one is not guaranteed.

2

rubbledubbletrubble OP t1_j0dsv9t wrote

I am doing this at the last layer. That is why it doesn’t make sense to me. I’d assume with 950 I should get similar results.

1

suflaj t1_j0dt970 wrote

Not really; 950 is still smaller than 1000, so not only are you destroying information, you are also potentially getting into a really bad local minimum.

When you add that intermediate layer, what you are essentially doing is random hashing your previous distribution. If your random hash kills the relations between data points that your model learned, then of course it will not perform.

Now, because Xavier and Kaiming (He) initializations aren't designed to act like a universal random hash, they might not kill all of your relations, but they are still random enough to have that potential, depending on the task and data. You might get lucky, but on average you almost never will.

If I were in your place, I would train with linear warmup to a fairly large learning rate, say 10x higher than your previous maximum. This will make very bad weights shoot out of their bad minima once the LR reaches its peak, and hopefully you'll get better results once they settle down as the LR falls. Just make sure you clip your gradients so your weights don't go to NaN, because this is the equivalent of driving your car into a wall in hopes of the crash turning it into a Ferrari.
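
Something like this, roughly (a tf.keras sketch; the peak LR, warmup length, and decay schedule are placeholders you'd tune for your setup):

```python
import tensorflow as tf

# Placeholder numbers: warm up linearly for 5 epochs to a peak LR ~10x the
# previous maximum, then decay, with gradient clipping to avoid NaN weights.
steps_per_epoch = 500                      # depends on your dataset
warmup_steps = 5 * steps_per_epoch
peak_lr = 1e-2                             # ~10x an assumed previous max of 1e-3

class WarmupThenDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, peak_lr, warmup_steps, decay_steps):
        super().__init__()
        self.peak_lr = peak_lr
        self.warmup_steps = warmup_steps
        self.decay = tf.keras.optimizers.schedules.CosineDecay(peak_lr, decay_steps)

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        # Linear ramp up to peak_lr, then cosine decay afterwards
        return tf.cond(step < self.warmup_steps,
                       lambda: self.peak_lr * step / self.warmup_steps,
                       lambda: self.decay(step - self.warmup_steps))

schedule = WarmupThenDecay(peak_lr, warmup_steps, decay_steps=20 * steps_per_epoch)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule, clipnorm=1.0)
```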

As for how long you should train it... The best approach would be to add the layer without any nonlinear function and see how long you need to train to reach the original performance. Since there is no non-linearity, the new network is as expressive as the original. Once you have that number of epochs, add about 25% to it and train the version with the non-linear transformation after the bottleneck for that long.
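
In rough pseudo-Keras terms (build_head, train_ds and val_ds here are hypothetical; the point is just to time the linear version first):

```python
import tensorflow as tf

# Hypothetical helpers: build_head(activation) returns a compiled model with the
# 500-unit bottleneck using the given activation; train_ds / val_ds are your data.
linear_model = build_head(activation=None)             # as expressive as before
hist = linear_model.fit(
    train_ds, validation_data=val_ds, epochs=100,
    callbacks=[tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=3)])
n_epochs = len(hist.history["loss"])                   # epochs to recover accuracy

nonlinear_model = build_head(activation="relu")        # the actual bottleneck
nonlinear_model.fit(train_ds, epochs=int(n_epochs * 1.25))   # ~25% longer
```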

5

BrotherAmazing t1_j0fai9p wrote

In this case, I don’t think anyone can tell you wtf is going on without a copy of your code and dataset. There are just so many unknowns, but is this 1000 dim dense layer the last layer before a softmax?

Are you training the other layers first and then adding this new layer, with a fresh weight initialization, in between the trained layers? Or are you adding it as part of a new architecture, re-initializing the weights everywhere, and training from scratch again?

5

rubbledubbletrubble OP t1_j0iib4p wrote

The 1000-unit layer is the softmax layer. I am using a pretrained model and only training the classification layers. My logic is to reduce the output dimension of the feature extractor in order to reduce the total number of parameters.

For example: if MobileNet outputs 1280 features and I attach a 1000-unit dense layer, that is about 1.28 million parameters. But if I add a 500-unit layer in the middle, the network gets smaller.
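
Roughly what I mean (a tf.keras sketch; I'm using MobileNetV2 here because it gives the 1280-dim pooled features, the rest is assumed):

```python
import tensorflow as tf

# Frozen pretrained feature extractor with a 1280-dim pooled output.
base = tf.keras.applications.MobileNetV2(include_top=False,
                                         weights="imagenet", pooling="avg")
base.trainable = False

# Head A: 1280 -> 1000 softmax
#   params = 1280*1000 + 1000 = 1,281,000
# Head B: 1280 -> 500 -> 1000 softmax
#   params = (1280*500 + 500) + (500*1000 + 1000) = 640,500 + 501,000 = 1,141,500
inputs = tf.keras.Input(shape=(224, 224, 3))
x = base(inputs, training=False)
x = tf.keras.layers.Dense(500)(x)                       # middle layer
outputs = tf.keras.layers.Dense(1000, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
model.summary()   # the head accounts for ~1.14M trainable parameters
```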

I know the question is a bit vague. I was just curious.

1