
ObjectManagerManager t1_iz5xous wrote

(Confession: I haven't read the paper yet.) I have a couple of questions:

  1. If each layer has its own objective function, couldn't you train the layers one at a time, front to back? E.g., train the first layer to convergence, then train the second layer, and so on (see the first sketch below this list). I doubt this would be faster than training end-to-end, but a) when everything is trained jointly, the early layers keep changing the representations fed to the later layers anyway, so sequential training probably wouldn't be much slower, and b) it would use significantly less memory (e.g., if you pre-compute the inputs to a layer just before you begin training it, you could in principle train an arbitrarily deep model with a fixed amount of memory).
  2. What's the motivation behind "goodness"? Suppose we're talking about classification. Why doesn't each layer just minimize cross-entropy? I guess that would require each layer to have its own flatten + linear projection layers, but then you wouldn't have to concatenate the label with the input data, so inference cost would be (mostly) independent of the number of classes. Thinking of a typical CNN, a layer could be organized as follows (see the second sketch below this list):
    1. Batch norm
    2. Activation (e.g., ReLU)
    3. Convolution (the output of which is fed into the next layer)
    4. Pooling
    5. Flatten
    6. Linear projection
    7. Cross entropy loss
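
To make point 1 concrete, here's a minimal sketch of what I have in mind, assuming a PyTorch-style setup. `layers`, `losses` (each layer's local objective, e.g. some "goodness" score), and `data_loader` are placeholders of mine, not anything from the paper:

```python
import torch

def train_layers_sequentially(layers, losses, data_loader, epochs=1, lr=1e-3):
    """Greedy layer-wise training: train layer k against its own local
    objective on the outputs of the already-trained (frozen) layers 0..k-1,
    then move on to layer k+1."""
    for k, (layer, local_loss) in enumerate(zip(layers, losses)):
        opt = torch.optim.Adam(layer.parameters(), lr=lr)
        layer.train()
        for _ in range(epochs):
            for x, y in data_loader:
                with torch.no_grad():            # earlier layers are frozen,
                    for prev in layers[:k]:      # so no activations are kept
                        x = prev(x)              # for them during backprop
                loss = local_loss(layer(x), y)   # layer k's own objective
                opt.zero_grad()
                loss.backward()                  # gradients touch only layer k
                opt.step()
        layer.eval()
    return layers
```

Re-running the frozen prefix every batch wastes compute; the memory-friendly variant in (b) would instead pre-compute and cache the outputs of `layers[:k]` over the whole dataset once before training layer k, so only one layer's activations ever live in memory.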
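And here's a rough sketch of the kind of block I'm describing in point 2 (again PyTorch-flavoured, with made-up channel/size arguments; grouping the pooling with the local head is my reading of the list above). The conv output feeds the next block, while the pool → flatten → linear head is trained with cross-entropy on this block alone:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocallySupervisedBlock(nn.Module):
    """Steps 1-3 form the main path; steps 4-7 form the block's local classifier."""
    def __init__(self, in_ch, out_ch, num_classes, pooled_hw=4):
        super().__init__()
        self.norm = nn.BatchNorm2d(in_ch)                                   # 1. batch norm
        self.act = nn.ReLU()                                                # 2. activation
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)      # 3. convolution
        self.pool = nn.AdaptiveAvgPool2d(pooled_hw)                         # 4. pooling
        self.head = nn.Linear(out_ch * pooled_hw * pooled_hw, num_classes)  # 5-6. flatten + linear

    def forward(self, x):
        h = self.conv(self.act(self.norm(x)))               # fed into the next block
        logits = self.head(torch.flatten(self.pool(h), 1))
        return h, logits

    def local_loss(self, x, y):
        _, logits = self(x)
        return F.cross_entropy(logits, y)                   # 7. cross-entropy loss
```

At inference time you'd just run the main path once (and read logits off whichever head you trust most), so nothing scales with the number of classes beyond the linear heads themselves.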

Can anyone (who has read the paper) answer these questions?
