Submitted by mrx-ai t3_zjud5l in MachineLearning
gwern t1_izxv8bw wrote
Reply to comment by farmingvillein in [D] G. Hinton proposes FF – an alternative to Backprop by mrx-ai
Yeah, it obviously doesn't have a gradient, but what I don't quite get is how the blackbox component trains without a gradient being computed by anything. Is it a finite-difference equivalent? Does it reduce down to basically REINFORCE? What is it, and is it really low-variance enough to care about, or is it merely a curiosity?
midasp t1_izybqp8 wrote
You are right. Intuitively, it's just rewarding correct inputs and penalizing wrong ones, which is largely similar to how many RL policies learn. FF seems like it will be able to discriminate, but it won't be able to encode and embed features the way backprop does; it would not identify common features. If you tried to train a typical backprop-based U-Net architecture this way, my instinct says it likely would not work, since the discriminating information is not distributed across the entire network.
Akrenion t1_j011yxv wrote
U-Net is specifically designed for backprop; its skip connections are there to help BP. We might need to rethink architectures for other approaches as well.
ChuckSeven t1_j016l2h wrote
That's actually a fair point. Call it the optimisation lottery: architectures are biased because they are designed around the algorithms that can be scaled and have been shown to "work".
farmingvillein t1_izy22t6 wrote
Got it.
I'm going to guess that the author meant that you could stick a black box in the middle and all of the neurons could still be trained (but not the black box itself).
mgostIH t1_j02vbuy wrote
All the layers are trained independently at the same time. You can use gradients, but you don't need backprop, because each layer has an explicit local objective: maximize ||W * x||^2 for good samples and minimize it for bad samples (each layer gets a normalized version of the previous layer's output).
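In code that's roughly the following (my own sketch, not the paper's exact recipe: the layer sizes, threshold, and learning rate are placeholders, and I'm treating the sum of squared ReLU activities with a logistic loss around a threshold as the per-layer "goodness"):

```python
import torch

# One FF-style layer trained purely from its own local objective.
# Goodness = sum of squared activities; push it up for positive (real)
# samples and down for negative (contrastive) samples.

torch.manual_seed(0)
d_in, d_out, lr, theta = 784, 500, 0.03, 2.0          # placeholder sizes/threshold
W = (0.01 * torch.randn(d_out, d_in)).requires_grad_(True)

def layer_forward(x, W):
    # Normalize the incoming activity so only its direction, not its length
    # (i.e. the previous layer's goodness), is passed on to this layer.
    x = x / (x.norm(dim=-1, keepdim=True) + 1e-8)
    return torch.relu(x @ W.t())

def local_step(x_pos, x_neg, W):
    g_pos = layer_forward(x_pos, W).pow(2).sum(dim=-1)  # goodness on positives
    g_neg = layer_forward(x_neg, W).pow(2).sum(dim=-1)  # goodness on negatives
    # Logistic loss: g_pos should exceed theta, g_neg should fall below it.
    loss = torch.nn.functional.softplus(theta - g_pos).mean() + \
           torch.nn.functional.softplus(g_neg - theta).mean()
    loss.backward()                    # gradient never leaves this layer
    with torch.no_grad():
        W -= lr * W.grad
        W.grad.zero_()
    return loss.item()

# Dummy batches just to show the call; real positives/negatives come from data.
x_pos, x_neg = torch.randn(32, d_in), torch.randn(32, d_in)
for _ in range(5):
    local_step(x_pos, x_neg, W)
```

Stacking layers just repeats this, feeding each layer the detached, normalized output of the one before it, so no gradient ever crosses a layer boundary.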
The issue I find with this (besides generating good contrastive examples) is that I don't understand how it would lead a big network to discover interesting structure: circuits require multiple layers to do something interesting, but here each layer greedily optimizes its own evaluation. In some sense we are hoping that the output of the earlier layers will orient things in a way that doesn't make it too hard for the later layers, which only have linear dynamics to work with.