
ObjectManagerManager t1_iz5xous wrote

(Confession: I haven't read the paper yet.) I have a couple of questions:

  1. If each layer has its own objective function, couldn't you train the layers one at a time, front to back? E.g., train the first layer to convergence, then train the second layer, and so on (see the first sketch below this list). I doubt this would be faster than training end-to-end, but a) when everything is trained jointly, the early layers keep changing the representations fed to the later layers anyway, so sequential training probably wouldn't be much slower, and b) it would use significantly less memory (e.g., if you pre-compute the inputs to a layer just before you begin training it, you could in principle train an arbitrarily deep model with a fixed amount of memory).
  2. What's the motivation behind "goodness"? Suppose we're talking about classification. Why doesn't each layer just minimize cross-entropy? I guess that would require each layer to have its own flatten + linear projection layers, but then you wouldn't have to concatenate the label with the input data, so inference cost would be (mostly) independent of the number of classes. Thinking of a typical CNN, a layer could be organized as follows (see the second sketch below this list):
    1. Batch norm
    2. Activation (e.g., ReLU)
    3. Convolution (the output of which is fed into the next layer)
    4. Pooling
    5. Flatten
    6. Linear projection
    7. Cross entropy loss
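
To make point 1 concrete, here's a minimal sketch of what I have in mind, assuming a PyTorch-style setup. `layers`, `losses` (each layer's local objective, e.g. some "goodness" score), and `data_loader` are placeholders of mine, not anything from the paper:

```python
import torch

def train_layers_sequentially(layers, losses, data_loader, epochs=1, lr=1e-3):
    """Greedy layer-wise training: train layer k against its own local
    objective on the outputs of the already-trained (frozen) layers 0..k-1,
    then move on to layer k+1."""
    for k, (layer, local_loss) in enumerate(zip(layers, losses)):
        opt = torch.optim.Adam(layer.parameters(), lr=lr)
        layer.train()
        for _ in range(epochs):
            for x, y in data_loader:
                with torch.no_grad():            # earlier layers are frozen,
                    for prev in layers[:k]:      # so no activations are kept
                        x = prev(x)              # for them during backprop
                loss = local_loss(layer(x), y)   # layer k's own objective
                opt.zero_grad()
                loss.backward()                  # gradients touch only layer k
                opt.step()
        layer.eval()
    return layers
```

Re-running the frozen prefix every batch wastes compute; the memory-friendly variant in (b) would instead pre-compute and cache the outputs of `layers[:k]` over the whole dataset once before training layer k, so only one layer's activations ever live in memory.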
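And here's a rough sketch of the kind of block I'm describing in point 2 (again PyTorch-flavoured, with made-up channel/size arguments; grouping the pooling with the local head is my reading of the list above). The conv output feeds the next block, while the pool → flatten → linear head is trained with cross-entropy on this block alone:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocallySupervisedBlock(nn.Module):
    """Steps 1-3 form the main path; steps 4-7 form the block's local classifier."""
    def __init__(self, in_ch, out_ch, num_classes, pooled_hw=4):
        super().__init__()
        self.norm = nn.BatchNorm2d(in_ch)                                   # 1. batch norm
        self.act = nn.ReLU()                                                # 2. activation
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)      # 3. convolution
        self.pool = nn.AdaptiveAvgPool2d(pooled_hw)                         # 4. pooling
        self.head = nn.Linear(out_ch * pooled_hw * pooled_hw, num_classes)  # 5-6. flatten + linear

    def forward(self, x):
        h = self.conv(self.act(self.norm(x)))               # fed into the next block
        logits = self.head(torch.flatten(self.pool(h), 1))
        return h, logits

    def local_loss(self, x, y):
        _, logits = self(x)
        return F.cross_entropy(logits, y)                   # 7. cross-entropy loss
```

At inference time you'd just run the main path once (and read logits off whichever head you trust most), so nothing scales with the number of classes beyond the linear heads themselves.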

Can anyone (who has read the paper) answer these questions?
