Submitted by abhitopia t3_ytbky9 in MachineLearning
maizeq t1_iw5mh0v wrote
I will save you a significant amount of wasted time and tell you now that predictive coding (as it has been described, more or less, for 20 years in the neuroscience literature) is not equivalent to backpropagation in the way that Millidge, Tschantz, Song and co have been suggesting for the last two years.
It is extremely disheartening to see them continue to make this claim when they are clearly using a heavily modified version of predictive coding (called FPA PC, or fixed prediction assumption PC), which is so distinct from PC that it is a significant stretch to lend it the same name.
For one, predictive coding under the FPA no longer corresponds to MAP estimation on a probabilistic model (gradient descent on the log joint probability), so it loses its interpretation as a variational Bayes algorithm (something that afaik has not been explicitly mentioned by them thus far).
Secondly, if you spend any appreciable time on predictive coding you will realise that the computational complexity of FPA PC is guaranteed to be at best equal to backpropagation (and in most cases significantly worse).
Thirdly, FPA-PC requires "inverted" PC models in order to form this connection with backpropagation. These are models where high-dimensional observations (such as images) parameterise latent states, so they are no longer generative models in the traditional sense.
FPA PC can really be understood as just a dynamic implementation of backprop (with very little actual connection to predictive coding). This implementation of backpropagation is in many ways practically inefficient and meaningless. Let me use an analogy to make this clearer: let's say you want to assign the variable a to f(x). You could either do a = f(x), or you could set up a to update based on da/dt = f(x) - a, the fixed (convergence) point of which is a = f(x). But if you think about it, if you already have the value of f(x), this is just a roundabout method of assigning a.
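The analogy can be sketched in a few lines of Python. This is only an illustration: `math.sin` stands in for an arbitrary f, the helper name `relax_to` is hypothetical, and the sign is chosen as da/dt = f(x) - a so that the fixed point a = f(x) is stable under simple Euler steps.

```python
import math

def relax_to(target, a0=0.0, dt=0.1, steps=200):
    """Euler-integrate da/dt = target - a, starting from a0."""
    a = a0
    for _ in range(steps):
        a += dt * (target - a)   # small step towards the fixed point a = target
    return a

x = 2.0
direct = math.sin(x)               # a = f(x): one evaluation, done
relaxed = relax_to(math.sin(x))    # same value, reached after 200 small steps
print(direct, relaxed)
```

Note that the relaxation needs f(x) as its input on every step, so it cannot be cheaper than evaluating f(x) once and assigning it.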
In the case of backpropagation "a" corresponds to backpropagated errors, and the dynamical update equation corresponds to the recursive equations which define backpropagation. I.e. we are assigning "a" to the value of dL/dz, for a loss L. (it's a little more than this, but I'm drunk so I'll leave that to you to discern). If you look at the equations more closely you find that it basically can not be any more efficient than backpropagation because the error information still has to propagate backwards, albeit indirectly. I would check out this paper by Robert Rosenbaum which I think is quite fantastic if you want more nitty-gritty details, and which deflates a lot of the connections espoused between the two works, particularly from a practical perspective.
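A toy sketch of that point, under loud assumptions: a scalar tanh chain, a squared-error loss, and error units that relax with predictions and local derivatives frozen at their feed-forward values (an FPA-style simplification). All function names are hypothetical and none of this is code from the papers discussed.

```python
import math

def forward(x, weights):
    """Feed-forward pass of a scalar chain: a[l+1] = tanh(w[l] * a[l])."""
    acts = [x]
    for w in weights:
        acts.append(math.tanh(w * acts[-1]))
    return acts

def backprop_errors(acts, weights, y):
    """Exact dL/da[l] for L = 0.5 * (a[-1] - y)**2 via the usual recursion."""
    delta = [0.0] * len(acts)
    delta[-1] = acts[-1] - y
    for l in range(len(weights) - 1, -1, -1):
        local = weights[l] * (1.0 - acts[l + 1] ** 2)   # d a[l+1] / d a[l]
        delta[l] = delta[l + 1] * local
    return delta

def relaxed_errors(acts, weights, y, steps=200, dt=0.5):
    """Relax error units: d(eps[l])/dt = -eps[l] + eps[l+1] * local derivative."""
    eps = [0.0] * len(acts)
    eps[-1] = acts[-1] - y                              # output error is clamped
    for _ in range(steps):
        new = eps[:]                                    # synchronous update
        for l in range(len(weights)):
            local = weights[l] * (1.0 - acts[l + 1] ** 2)
            new[l] = eps[l] + dt * (-eps[l] + eps[l + 1] * local)
        eps = new
    return eps

weights, x, y = [0.8, -1.2, 0.5], 0.7, 0.3
acts = forward(x, weights)
print(backprop_errors(acts, weights, y))
print(relaxed_errors(acts, weights, y))
```

After enough steps the relaxed errors agree with the backprop recursion, but after a single step the error at the input end of the chain is still zero: the information has to travel down one layer at a time, which is the sense in which the dynamics cannot beat backprop.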
I don't mean to be dismissive of the work of Millidge and co! Indeed, I think the original 2017 paper by Whittington and Bogacz was extremely interesting and a true nugget of insight (in terms of how PC with certain variance relationships between layers can approximate backprop etc. - something which makes complete sense when you think about it), but the flurry of subsequent work that has capitalised on this subtle relationship has been (in my honest opinion) very misleading.
Also, I would not take any of what I've said as a dismissal of predictive coding in general. PC for generative modeling (in the brain) is extremely interesting, and may be promising still.
abhitopia OP t1_iw6l77r wrote
Thanks for the response.
I am yet to read the work of Millidge, Tschantz and Song in detail. I agree that this is not PC in the sense that came out of the neuroscience literature. I have only thoroughly read the Bogacz 2017 paper, and next on my list is "Can the Brain Do Backpropagation? - Exact Implementation of Backpropagation in Predictive Coding Networks" (also from Bogacz).
>If you look at the equations more closely you find that it basically can not be any more efficient than backpropagation
The interesting bit for me is not the exact correspondence with PC (as described in neuroscience) but rather the property that makes it suitable for asynchronous parallelisation, namely local synaptic plasticity, which I believe still holds. The problem with backprop is not that it is inefficient; in fact it is highly efficient. I just cannot see how backprop systems can be scaled, or do online and continual learning.
>In the case of backpropagation "a" corresponds to backpropagated errors, and the dynamical update equation corresponds to the recursive equations which define backpropagation. I.e. we are assigning "a" to the value of dL/dz, for a loss L. (it's a little more than this, but I'm drunk so I'll leave that to you to discern). If you look at the equations more closely you find that it basically can not be any more efficient than backpropagation because the error information still has to propagate backwards, albeit indirectly.
Can't we make a first-order approximation, like we do in any gradient descent algorithm? Again, emphasising that the issue is not only speed of learning.
I will certainly check out the paper by Robert Rosenbaum, and thanks for sharing that. I will comment more once I have read this paper.
liukidar t1_iwc3rkb wrote
> The interesting bit for me is not the exact correspondence with PC (as described in neuroscience) but rather the property that makes it suitable for asynchronous parallelisation, namely local synaptic plasticity, which I believe still holds
Indeed this still holds with all the definitions of PC out there (I guess that's why very different implementations such as FPA are still called PC). In theory, therefore, it is possible to parallelise all the computations across different layers.
However, it seems that deep learning frameworks such as PyTorch and JAX are not able to do this kind of parallelisation on a single GPU (I would be very, very glad if someone who knows more about this would like to have a chat on the topic; maybe I'm lucky and some JAX/PyTorch/CUDA developers stumble upon this comment :P)
miguelstar98 t1_iwoeafv wrote
🖒Noted. I'll take a look at it when I get some free time. Although someone should probably make a discord for this....
liukidar t1_iwc2tbm wrote
Hello. Since it may be relevant for the conversation, I'd like to specify that the work by Song doesn't use FPA (except here, where they mathematically prove the identity between FPA PC and BP), and all the experimental results in his other papers are obtained via "normal" PC, where the prediction is updated at every iteration using gradient descent on the log joint probability (so, as far as my understanding of the theory is correct, it corresponds to MAP on a probabilistic model). I'm not 100% sure which papers by Millidge do and don't, but I'm quite confident that the majority don't (like here, where the predictions seem to be updated at every iteration; however, in the paper cited by abhitopia, apparently, they use FPA). Unfortunately, I'm not familiar with the work by Tschantz, so I cannot comment on that.
maizeq t1_iwcimaq wrote
Thanks for the reply, there was some nuance left out of my comment since it was getting long enough, but if you take a closer look you'll find they all more or less adopt similar assumptions to make the two equivalent, and all suffer from the same points.
To be more specific:
The Millidge paper, which most of the BP = PC literature is based on, uses the FPA, and is not a descent on the log joint. (It also uses inverted models, as I mentioned.)
This paper by Song which was published in NeurIPS doesn't use the FPA-PC "directly", but achieves effectively the same thing by requiring the weight update to occur at a precise inference step, and requires that the modes are initialised to a feed-forward pass, and also requires the inference learning rate to be exactly 1. (All required for equivalence)
Does this sound familiar? That's right, this is literally computationally equivalent to backprop! (a forward pass and a sequential coordinated backward pass). This is intuitively obvious if you read the paper but you can see the Rosenbaum paper to see it play out experimentally also.
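A toy sketch of why those three conditions reproduce backprop, under stated assumptions: a scalar tanh chain, a squared-error loss, errors defined as value minus prediction, and the output node clamped to the target. The function names are hypothetical and this is not the code or the exact notation of the Song paper; it only mimics the schedule (feed-forward initialisation, inference step size 1, reading the error of the layer k steps below the output at inference step k).

```python
import math

def forward(x, weights):
    """Feed-forward pass of a scalar chain: a[l+1] = tanh(w[l] * a[l])."""
    acts = [x]
    for w in weights:
        acts.append(math.tanh(w * acts[-1]))
    return acts

def backprop_deltas(acts, weights, y):
    """dL/da[l] for L = 0.5 * (a[-1] - y)**2."""
    delta = [0.0] * len(acts)
    delta[-1] = acts[-1] - y
    for l in range(len(weights) - 1, -1, -1):
        delta[l] = delta[l + 1] * weights[l] * (1.0 - acts[l + 1] ** 2)
    return delta

def pc_errors_at_scheduled_steps(x, weights, y):
    """Standard PC inference (predictions recomputed every step), step size 1."""
    n = len(weights)                 # nodes are indexed 0..n
    v = forward(x, weights)          # value nodes start at the feed-forward pass
    v[n] = y                         # clamp the output node to the target
    captured = {}
    for t in range(n):
        mu = [None] + [math.tanh(weights[l] * v[l]) for l in range(n)]
        eps = [0.0] + [v[l] - mu[l] for l in range(1, n + 1)]
        captured[n - t] = eps[n - t]          # read layer n - t at step t
        new_v = v[:]
        for l in range(1, n):                 # hidden nodes only
            jac = weights[l] * (1.0 - mu[l + 1] ** 2)
            new_v[l] = v[l] - 1.0 * (eps[l] - eps[l + 1] * jac)
        v = new_v
    return captured

weights, x, y = [0.8, -1.2, 0.5], 0.7, 0.3
print(backprop_deltas(forward(x, weights), weights, y)[1:])
print(pc_errors_at_scheduled_steps(x, weights, y))
```

With this convention each captured error is the negative of the corresponding backprop delta (up to floating-point rounding), and each layer's error only becomes available one step after the layer above it, so the schedule is a forward pass followed by a coordinated layer-by-layer backward sweep.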
The Salvatori paper you linked uses the algorithm from the aforementioned Song paper, and so the same points apply. Note how they do not empirically evaluate "IL", which, in their terminology, corresponds to the actual PC algorithm.
Finally the Kinghorn paper you linked refers to standard uninverted (generative) PC, and isn't part of the BP=PC literature. (Note how label accuracy for MNIST is 80%, whereas in the inverted PC=BP models it can reach 97%).
From my practical experience in implementing a PC library, the subpar performance of supervised generative PC for classification remains a difficulty. What's more, when using standard PC (in both inverted and uninverted settings), you have to be far more careful (vs. FPA) on account of the dynamics during inference being more complex, since standard PC takes into account the current top-down beliefs at every time-step, something that is not done by the FPA.
As such you can easily experience divergence, or a failure to converge. This is likely why I haven't seen a single example of standard PC evaluated on a deep/complex inverted model. All the instances you see of "PC" evaluated on RNNs, CNNs, deep MLPs etc. are FPA-PC (or the alternatives I mentioned above).
liukidar t1_iwcnuo2 wrote
Hello. Thank you for your reply. I will go into the details as well since I think we're creating a good review of PC that may help all different kinds of people that are interested.
I think we should divide the literature into two sets: FPA PC and PC. All the papers we cited (Salvatori, Song, Millidge) do indeed belong to the FPA PC set. The aim of those papers was basically to give theoretical proof that PC is able to replicate BP in the brain (despite using a lot of assumptions on how this can be done).
However, note that the goal of the papers you have cited is to provide an equivalence or approximation between PC and BP, and not to use PC with FPA as a general-purpose algorithm. In fact, the same authors have since released several papers that do NOT use FPA, and are applied to different machine learning tasks. I believe that the original idea of creating a general library to run these experiments is more focused on applications, and not on reimplementing the experiments that show equivalence and approximations of PC. Something interesting to replicate, still from the same authors, is the following: https://arxiv.org/pdf/2201.13180.pdf. And I am not aware of any library that has implemented something similar in an efficient way.
In relation to the accuracy, I'm not sure about what is reported by Kinghorn, but already in Whittington 2017, you can see that they get 98% accuracy on MNIST with standard PC. So the performance of PC on those tasks is not to be doubted.
I agree there's a lack of evaluations on deeper and more complex architectures. However, here you can see an example of what the algorithm you called IL can do: https://arxiv.org/abs/2211.03481 .
maizeq t1_iwcv5uh wrote
Thanks, and yes I agree, this might be useful to others.
As an aside, I have no qualms with standard generative PC (such as the paper you linked, and any other papers they have released in that vein; indeed I'm a fan!). However, the discussion in this thread is about the equating of BP with PC, and in this regard, arguing "PC approximates backpropagation" when you really mean "this other heavily modified algorithm that was inspired by PC approximates backprop" is misleading. It is akin to saying an apple looks like an orange, if you throw away the apple and buy another orange.
It feels particularly egregious, when it turns out this modified algorithm is computationally equivalent to backpropagation, and as such the various neuroscientific justifications one applies may no longer hold (e.g. generative modelling is more sample efficient, or cortical hierarchies in the brain are characterised by top-down non-linear effects).
>In relation to the accuracy, I'm not sure about what is reported by Kinghorn, but already in Whittington 2017, you can see that they get 98% accuracy on MNIST with standard PC. So the performance of PC on those tasks is not to be doubted.
Yes, this is the 97% value I referred to in my comment. If you look at the Whittington 2017 paper you will see this refers to an inverted architecture, in this case a small ANN trained with standard PC without the FPA.
Again, it's important to distinguish between the BP=PC literature, which this thread is related to, and other PC literature. I have no doubt plenty of interesting papers and insights exist in the latter!
Ambitious_Smile_981 t1_iwdrmam wrote
I don't see the problem of differentiating inverted and non-inverted architectures, as they are both generative models. The difference lies in what you are generating. In one case, you generate the label, and give as prior information the image, in the other, you generate the image giving the label as prior information.
Both have their advantages and disadvantages, but I don't see why the 'inverted' one is not interesting.
As for the BP = PC literature, I think it is interesting to show that, by simply introducing a temporal schedule for the weight updates of PC, we are able to obtain exact BP. I agree that this variation of PC loses all the advantages that PC has over BP, but it is still important to know that it is possible to derive exact backprop from a variational free energy.
BerenMillidge t1_iy814ur wrote
Hi, author of some of the papers linked here. Broadly, Maizeq is right to distinguish between FPA-PC and ‘standard PC’ (the ‘inverted’ vs ‘generative’ direction of the PC net is a different, orthogonal distinction). The equivalence between PC and BP only holds exactly in the case with the FPA (or some equivalent set of assumptions; for instance, in the original Whittington paper they use the precision ratio tending to 0). Of course all of these limits are in some sense extreme and eliminate some (but not all) of the major advantages of PC (in some sense this was inevitable, since if they exactly equal BP then they must very roughly have the same advantages and disadvantages as it). The way to view these works, at least as I have come to view them, is as an idealised exploration of a specific limit of PC. In recent work (https://arxiv.org/pdf/2206.02629), we expand on this limit idea and show that all current EBM approximations to BP, such as PC, equilibrium propagation and contrastive Hebbian learning, can be expressed as a single ‘infinitesimal inference limit’.
Overall I disagree that the work in this vein is particularly misleading, although this is a subjective assessment. It is upfront about the assumptions you need to make to obtain equivalence to backprop, as well as how this departs from standard PC.
Of course, from a neuroscientific perspective, this limit is perhaps not the most realistic, and so we are also exploring the ML performance of more ‘standard’ PC versions which are more biologically plausible and which don’t approximate backprop, as well as specifically understanding the special advantages and disadvantages of these algorithms. For instance, in a recent paper (https://www.biorxiv.org/content/biorxiv/early/2022/05/18/2022.05.17.492325.full.pdf), we propose a new understanding of standard PC as ‘prospective configuration’ and demonstrate how this version of PC can outperform backprop in a number of its properties. We also have a more theoretical analysis of standard PC (https://arxiv.org/pdf/2207.12316) where we show that although it differs from backprop, it can also converge to minima of a supervised loss function, and has close links to target propagation and hence Gauss-Newton optimization. Our groups have also explored other potential advantages of PC over BP, including its ability to learn arbitrary recurrent computation graphs (https://arxiv.org/pdf/2201.13180), the fact that you can significantly speed it up with incremental variants, and that you can get PC to perform a mix of iterative and amortised inference (https://arxiv.org/pdf/2204.02169).
In terms of the hardware, I have also looked into this a little, and my feeling is that while PC has better parallelism properties than BP, it is unlikely to outperform BP on a GPU due to the need to iteratively perform the inference phase, while BP just has a sequential forward and backward pass. GPUs are now getting very highly optimised for the exact style of computations needed in BP for large-scale ANNs. PC does possess a much higher degree of parallelism and locality than BP, and on a sufficiently distributed architecture may eventually prove better, especially once we start building proper ‘neuromorphic’ processor-in-memory architectures. However, this seems likely to be many years away. I haven’t read much about Erlang so I’m not sure if it possesses the necessary degree of parallelism. One possibility is that Erlang with PC might allow you to move to a different point on the Pareto frontier, having lots of CPUs and developing learning algorithms comparable in performance with doing BP on a single GPU. I haven’t run any Fermi-style estimates of whether this is feasible or not. We have some calculations about this in a forthcoming paper, but this is on a highly abstract computation model of ‘parallel matrix multiplications’ and I haven’t figured out what the actual equivalent calculations for realistic hardware would look like.
abhitopia OP t1_ixdnki4 wrote
u/maizeq - I have finished reading the Rosenbaum paper. It is certainly a very accessible and useful paper for understanding the details and nuances of the various PC implementations. So thank you for sharing that.
The objective of the author seems to be to compare various versions of the algorithm and highlight the subtle differences, and the paper does a great job at it. It does not, however, exploit local synaptic plasticity in its implementation (and uses loops), which is exactly where I think the limitation of PyTorch, JAX, and TensorFlow lies.
For instance, one could imagine each node and each weight in a PC (non-FPA) MLP network as a standalone process, communicating with other node and weight processes only via message passing, so that everything runs completely asynchronously. Furthermore, we can limit the amount of computation by thresholding the value of error nodes (so that weight updates in the connected weight processes happen only when the error exceeds the threshold), in a sense enforcing sparsity.
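A minimal sketch of the local, thresholded weight update described above, assuming a single scalar weight with a tanh prediction and a squared prediction-error energy. The function name, the learning rate and the threshold value are hypothetical, and the message passing between processes is not modelled here; the point is only that the update uses nothing beyond the two nodes the weight connects.

```python
import math

def local_weight_step(w, pre, post, lr=0.1, threshold=1e-3):
    """One update of a single weight using only locally available quantities.

    pre  : value of the presynaptic node
    post : value of the postsynaptic node
    Returns (new_w, updated), where updated is False if the step was skipped.
    """
    mu = math.tanh(w * pre)                 # prediction of the postsynaptic node
    eps = post - mu                         # local prediction error
    if abs(eps) < threshold:                # quiet error node: skip the update
        return w, False
    grad = -eps * (1.0 - mu ** 2) * pre     # d(0.5 * eps**2) / dw
    return w - lr * grad, True

# An error above the threshold triggers an update; a zero error does not.
print(local_weight_step(0.5, 1.0, 0.9))
print(local_weight_step(0.5, 1.0, math.tanh(0.5)))
```

Because the rule never reads a global loss or another layer's state, each weight could in principle apply it whenever fresh `pre` and `post` values arrive, which is what makes an asynchronous, process-per-weight design conceivable.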
Maybe I am wrong, but I do not (yet) see why in this simple MLP it should not be possible to add new nodes (in a hot fashion); for example, if the activity in any node increases beyond a certain threshold, then scale up automatically, preserving 2% activity per layer.
Contrast this with GPU-based backward passes: a lot of wasteful computation can be prevented. At the very least, the backward pass doesn't need to wait for the forward pass in the EM-like learning algorithm that PC is.
P.S. - My motivation isn't PC==BP, but rather can PC replace BP and is it worth it.