Submitted by abhitopia t3_ytbky9 in MachineLearning
maizeq t1_iwcv5uh wrote
Reply to comment by liukidar in [Project] Erlang based framework to replace backprop using predictive coding by abhitopia
Thanks, and yes I agree, this might be useful to others.
As an aside, I have no qualms with standard generative PC (such as the paper you linked, and any other papers they have released in that vein; indeed, I'm a fan!). However, the discussion in this thread is about equating BP with PC, and in that regard, arguing "PC approximates backpropagation" when you really mean "this other heavily modified algorithm that was inspired by PC approximates backprop" is misleading. It is akin to saying an apple looks like an orange if you throw away the apple and buy another orange.
It feels particularly egregious when it turns out this modified algorithm is computationally equivalent to backpropagation, and as such the various neuroscientific justifications one might appeal to may no longer hold (e.g. that generative modelling is more sample efficient, or that cortical hierarchies in the brain are characterised by top-down non-linear effects).
> In relation to the accuracy, I'm not sure about what was reported by Kinghorn, but already in Whittington 2017 you can see that they get a 98% accuracy on MNIST with standard PC. So the performance of PC on those tasks is not to be doubted.
Yes, this is the 97% value I referred to in my comment; if you look at the Whittington 2017 paper, you will see this refers to an inverted architecture, in this case a small ANN trained with standard PC without the FPA assumption.
Again, it's important to distinguish between the BP=PC literature, which this thread is related to, and other PC literature. I have no doubt plenty of interesting papers and insights exist in the latter!
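For anyone following along, here is a minimal NumPy sketch of what "standard" PC (no FPA) looks like on a toy two-layer supervised network; the sizes, learning rates and variable names are purely illustrative and not taken from any of the linked papers:

```python
# Minimal sketch of "standard" predictive coding (no FPA) on a toy two-layer net.
# All sizes, names and hyperparameters are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
f = np.tanh
df = lambda a: 1.0 - np.tanh(a) ** 2

W1 = rng.normal(scale=0.1, size=(16, 8))   # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(4, 16))   # hidden -> output weights
x0 = rng.normal(size=8)                    # input node (clamped)
y = rng.normal(size=4)                     # label, clamps the output node

# Free energy (equal precisions):
#   F = 0.5 * ||x1 - W1 f(x0)||^2 + 0.5 * ||y - W2 f(x1)||^2
x1 = W1 @ f(x0)                            # initialise the hidden node at its prediction
for _ in range(200):                       # inference phase: relax x1 on F
    e1 = x1 - W1 @ f(x0)                   # prediction error at the hidden layer
    e2 = y - W2 @ f(x1)                    # prediction error at the output layer
    x1 = x1 - 0.1 * (e1 - df(x1) * (W2.T @ e2))   # x1 <- x1 - lr * dF/dx1

e1 = x1 - W1 @ f(x0)                       # errors at (approximate) equilibrium
e2 = y - W2 @ f(x1)

# Learning phase: purely local, Hebbian-style weight updates from those errors
lr_w = 0.01
W2 = W2 + lr_w * np.outer(e2, f(x1))
W1 = W1 + lr_w * np.outer(e1, f(x0))
```

Nothing in this version is frozen or scheduled to reproduce BP; the equilibrium errors only approximate BP's deltas in particular limits (e.g. small errors or extreme precision ratios), which is exactly the distinction at issue in this thread.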
Ambitious_Smile_981 t1_iwdrmam wrote
I don't see the problem with distinguishing between inverted and non-inverted architectures, as they are both generative models. The difference lies in what you are generating: in one case you generate the label, with the image given as prior information; in the other, you generate the image, with the label given as prior information.
Both have their advantages and disadvantages, but I don't see why the 'inverted' one is not interesting.
As for the BP = PC literature, I think that showing that exact BP can be obtained simply by introducing a temporal scheduling for the weight updates of PC is interesting. I agree that this variation of PC loses all the advantages that PC has over BP, but it is still important to know that it is possible to derive exact backprop from a variational free energy.
BerenMillidge t1_iy814ur wrote
Hi, author of some of the papers linked here. Broadly, maizeq is right to distinguish between FPA-PC and ‘standard PC’ (the ‘inverted’ vs ‘generative’ direction of the PC net is a separate, orthogonal issue). The equivalence between PC and BP only holds exactly in the case with the FPA (or some equivalent set of assumptions; for instance, in the original Whittington paper they use the precision ratio tending to 0). Of course, all of these limits are in some sense extreme and eliminate some (but not all) of the major advantages of PC (in some sense this was inevitable, since if an algorithm exactly equals BP then it must have very roughly the same advantages and disadvantages). The way to view these works, at least as I have come to view them, is as an idealised exploration of a specific limit of PC. In recent work (https://arxiv.org/pdf/2206.02629), we expand on this limit idea and show that all current EBM approximations to BP, such as PC, Equilibrium-prop and Contrastive Hebbian learning, can be expressed as a single ‘infinitesimal inference limit’.
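To make the FPA point concrete, here is a small hedged sketch in the same toy two-layer setting as the one above (illustrative only, not the exact algorithm of any single paper): once the predictions inside the error terms are frozen at their feedforward values, the equilibrium errors are exactly the backprop deltas, so the local weight updates coincide with BP's gradients.

```python
# Sketch of the fixed-prediction assumption (FPA) in the same toy two-layer setting.
# Illustrative only: with predictions frozen at their feedforward values, the
# equilibrium errors reduce to exactly the backprop deltas.
import numpy as np

rng = np.random.default_rng(0)
f = np.tanh
df = lambda a: 1.0 - np.tanh(a) ** 2

W1 = rng.normal(scale=0.1, size=(16, 8))
W2 = rng.normal(scale=0.1, size=(4, 16))
x0 = rng.normal(size=8)
y = rng.normal(size=4)

# Feedforward pass; under the FPA the predictions stay frozen at these values.
h = W1 @ f(x0)              # frozen prediction for the hidden node
out = W2 @ f(h)             # frozen prediction for the output node
e2 = y - out                # output error (held fixed during inference)

# Inference phase with frozen predictions: only the hidden node x1 moves.
x1 = h.copy()
for _ in range(200):
    x1 = x1 - 0.1 * ((x1 - h) - df(h) * (W2.T @ e2))
e1 = x1 - h                 # equilibrium hidden error under the FPA

# Backprop deltas for the loss L = 0.5 * ||y - out||^2
delta2 = y - out
delta1 = df(h) * (W2.T @ delta2)
print(np.allclose(e1, delta1), np.allclose(e2, delta2))   # -> True True
# Hence the local updates dW2 ~ outer(e2, f(h)) and dW1 ~ outer(e1, f(x0))
# are exactly the BP gradient steps for this loss.
```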
Overall I disagree that the work in this vein is particularly misleading, although this is a subjective assessment. It is upfront about the assumptions you need to make to obtain equivalence to backprop, as well as how this departs from standard PC.
Of course, from a neuroscientific perspective, this limit is perhaps not the most realistic, and so we are also exploring the ML performance of more ‘standard’ PC versions which are more biologically plausible and which don’t approximate backprop, as well as specifically understanding the particular advantages and disadvantages of these algorithms. For instance, in a recent paper (https://www.biorxiv.org/content/biorxiv/early/2022/05/18/2022.05.17.492325.full.pdf), we propose a new understanding of standard PC as ‘prospective configuration’ and demonstrate how this version of PC can outperform backprop in a number of its properties. We also have a more theoretical analysis of standard PC (https://arxiv.org/pdf/2207.12316) where we show that, although it differs from backprop, it can also converge to minima of a supervised loss function, and has close links to target propagation and hence Gauss-Newton optimization. Our groups have also explored other potential advantages of PC over BP, including its ability to learn arbitrary recurrent computation graphs (https://arxiv.org/pdf/2201.13180), the fact that you can significantly speed it up with incremental variants, and that you can get PC to perform a mix of iterative and amortised inference (https://arxiv.org/pdf/2204.02169).
In terms of the hardware, I have also looked into this a little, and my feeling is that while PC has better parallelism properties than BP, it is unlikely to outperform BP on a GPU due to the need to iteratively perform the inference phase, whereas BP just needs a single sequential forward and backward pass. GPUs are now very highly optimised for exactly the style of computation needed by BP for large-scale ANNs. PC does possess a much higher degree of parallelism and locality than BP, and on a sufficiently distributed architecture it may eventually prove better, especially once we start building proper ‘neuromorphic’ processor-in-memory architectures. However, this seems likely to be many years away. I haven’t read much about Erlang, so I’m not sure if it possesses the necessary degree of parallelism. One possibility is that Erlang with PC might let you move to a different point on the Pareto frontier, using lots of CPUs to reach learning performance comparable with doing BP on a single GPU. I haven’t run any Fermi-style estimates of whether this is feasible or not. We have some calculations about this in a forthcoming paper, but these are based on a highly abstract computational model of ‘parallel matrix multiplications’, and I haven’t figured out what the equivalent calculations for realistic hardware would look like.