suflaj t1_jc73bnx wrote

I have skimmed over it before writing this. They have what working? Synthetic toy examples? Great, Graves et al. had even more practically relevant problems solved 6 years ago. The thing is, it never translated into solving real world problems, and the paper and follow up work didn't really manage to demonstrate how it could actually be used.

So, until this paper results in some metrics on known datasets, model frameworks and weights, I'm afraid there's nothing really to talk about. Memory augmented networks are nasty in the sense that they require transfer learning or reinforcement learning to even work. It's hard to devise a scheme where you can punish bad memorization or recall, because it's hard to link the outcome of some recall + processing to the process that caused such recall.

Part of the reason for bad associative memorization and recall is the data itself. So naturally, it follows that you should just be able to optimize the memorized data, no? Well, it sounds trivial, but it ends up either non-differentiable (because of an exact choice, rather than a fuzzy one), or hard to train (vanishing or sparse gradients). And then you have just created a set of neural networks, rather than a single monolithic one. That might be an advantage, but it is nowhere near as exciting as this paper would lead you to believe. And it would not be novel at all: hooking up a pretrained ResNet to a classifier is semantically the same thing, if you consider the ResNet a memory bank, and that is a 7-year-old technique at this point.
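To make the "exact choice vs. fuzzy choice" point concrete, here is a minimal sketch in plain Python. It is not taken from the paper or from Graves et al.; the keys, values, query and the helper names (`hard_read`, `soft_read`, `finite_diff`) are all made up for illustration. It compares a hard argmax lookup with a softmax-weighted read and estimates the gradient of each with respect to the query by finite differences.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hard_read(query, keys, values):
    # Exact choice: return the value of the single best-matching slot.
    scores = [dot(query, k) for k in keys]
    best = max(range(len(keys)), key=lambda i: scores[i])
    return values[best]

def soft_read(query, keys, values):
    # Fuzzy choice: softmax-weighted average over all slots.
    weights = softmax([dot(query, k) for k in keys])
    return sum(w * v for w, v in zip(weights, values))

def finite_diff(read, query, keys, values, eps=1e-4):
    # Central finite difference of the read output w.r.t. each query component.
    grads = []
    for i in range(len(query)):
        up, down = list(query), list(query)
        up[i] += eps
        down[i] -= eps
        grads.append((read(up, keys, values) - read(down, keys, values)) / (2 * eps))
    return grads

# Hypothetical toy memory: three slots with 2D keys and scalar values.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [10.0, 20.0, 30.0]
query = [0.9, 0.3]

hard_grad = finite_diff(hard_read, query, keys, values)
soft_grad = finite_diff(soft_read, query, keys, values)
print("hard read gradient:", hard_grad)
print("soft read gradient:", soft_grad)
```

In this toy setup the hard read gives a zero gradient, because a small change in the query does not change which slot wins, so nothing upstream gets a learning signal. The soft read does give a gradient, which is why the fuzzy version is the one people try to train, with the gradient problems mentioned above.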

Memorizing things with external memory is not exactly a compression task, which DNNs and gradient descent solve, so it makes sense that it's hard in a traditional DL setting.

3

spiritus_dei OP t1_jc7ccww wrote

>I have skimmed over it before writing this. They have what working? Synthetic toy examples? Great, Graves et al. had even more practically relevant problems solved 6 years ago. The thing is, it never translated into solving real world problems, and the paper and follow up work didn't really manage to demonstrate how it could actually be used.
>
>So, until this paper results in some metrics on known datasets, model frameworks and weights, I'm afraid there's nothing really to talk about. Memory augmented networks are nasty in the sense that they require transfer learning or reinforcement learning to even work. Memorizing things with external memory is not exactly a compression task, which DNNs and gradient descent solve.

The same could have been said of Deep Learning until the ImageNet breakthrough. The improvement process is evolutionary, and this may be a step in that process.

You make a valid point. While the paper demonstrates the computational universality of memory-augmented language models, it does not provide concrete metrics on known datasets or model frameworks. Additionally, as you mentioned, memory-augmented networks can be challenging to train and require transfer learning or reinforcement learning to work effectively.

Regarding the concern about transfer learning, it is true that transferring knowledge from one task to another can be challenging. However, recent research has shown that transfer learning can be highly effective for certain tasks, such as natural language processing and computer vision. For example, the BERT model has achieved state-of-the-art performance on many natural language processing benchmarks using transfer learning. Similarly, transfer learning has been used to improve object recognition in computer vision tasks.

As for reinforcement learning, it has been successfully applied in many real-world scenarios, including robotics, game playing, and autonomous driving. For example, AlphaGo, the computer program that defeated a world champion in the game of Go, was developed using reinforcement learning.

This is one path and other methods could be incorporated such as capsule networks, which aim to address the limitations of traditional convolutional neural networks by explicitly modeling the spatial relationships between features. For example, capsule networks could be used in tandem with memory augmented networks by using capsule networks to encode information about entities and their relationships, and using the memory augmented networks to store and retrieve this information as needed for downstream tasks. This approach can be especially useful for tasks that involve complex reasoning, such as question answering and knowledge graph completion.

Another approach is to use memory augmented networks to store and update embeddings of entities and their relationships over time, and use capsule networks to decode and interpret these embeddings to make predictions. This approach can be especially useful for tasks that involve sequential data, such as language modeling and time-series forecasting.

0

suflaj t1_jc7jibo wrote

> The same could have been said of Deep Learning until the ImageNet breakthrough. The improvement process is evolutionary, and this may be a step in that process.

This is not comparable at all. ImageNet is a database for a competition - it is not a model, architecture or technique. When it was "beaten", it was beaten not by a certain philosophy or ideas, it was beaten by a proven implementation of a mathematically sound idea.

This is neither evaluated on a concrete dataset nor explored in any mathematical depth. It is a preprint of an idea that someone fiddled with using an LLM.

> As for reinforcement learning, it has been successfully applied in many real-world scenarios, including robotics, game playing, and autonomous driving.

My point is that so has the 6-year-old DNC. The thing is, however, that neither of those is your generic reinforcement learning: they're very specifically tuned for the exact problem they are dealing with. If you actually look at what is available for DRL, you will see that framework support is very poor (probably the best we have is Gym), and beyond that the biggest issue is how to even get the environment set up to enable learning. The issue is in making the actual task you're learning easy enough for the agent to even start learning. The task of knowing how to memorize or recall is incredibly hard, and we humans don't even understand memory well enough to construct problem formulations for those two.
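As a rough illustration of why the environment setup is the hard part, here is a toy delayed-recall task in plain Python. Everything in it is hypothetical: `DelayedRecallEnv` only loosely imitates the reset/step convention and is not the actual Gym API, and the two policies are hand-written stand-ins, not learned agents.

```python
import random

class DelayedRecallEnv:
    """Hypothetical toy task: see a symbol once, wait, then recall it."""

    def __init__(self, n_symbols=4, delay=5, seed=0):
        self.n_symbols = n_symbols
        self.delay = delay
        self.rng = random.Random(seed)
        self.target = None
        self.t = 0

    def reset(self):
        self.target = self.rng.randrange(self.n_symbols)
        self.t = 0
        return self.target  # the symbol is visible only in the first observation

    def step(self, action):
        self.t += 1
        done = self.t > self.delay
        # The only reward arrives at the very end, long after the memorization step.
        reward = 1.0 if done and action == self.target else 0.0
        return -1, reward, done  # -1 is a blank observation

def run_episode(env, policy):
    obs, state, rewards, done = env.reset(), None, [], False
    while not done:
        action, state = policy(obs, state)
        obs, reward, done = env.step(action)
        rewards.append(reward)
    return rewards

def remembering_policy(obs, state):
    # Stores the first observation and keeps answering with it.
    if state is None:
        state = obs
    return state, state

def forgetful_policy(obs, state):
    # Acts only on the current observation, which is blank after the first step.
    return obs, state

remembered = run_episode(DelayedRecallEnv(), remembering_policy)
forgotten = run_episode(DelayedRecallEnv(), forgetful_policy)
print(remembered, forgotten)
```

Both policies see zero reward on every step except the last one, so the reward sequence alone says nothing about whether the failure was in storing the symbol, keeping it, or reading it back. That is the problem formulation issue in miniature.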

Whatever technique you come up with, if you can't reproduce it for other problems or models, you will just end up with a specific model. I mean - look at what you are saying. You're mentioning AlphaGo. Why are you mentioning a specific model/architecture for a specific task? Why not a family of models/architectures? Maybe AlphaZero, AlphaGo, MuZero sound similar, but they're all very, very different. And there is no real generalization of them, even though they all represent reinforcement learning.

> This is one path and other methods could be incorporated such as capsule networks, which aim to address the limitations of traditional convolutional neural networks by explicitly modeling the spatial relationships between features.

And those have long been shown to be a scam, basically. Well, maybe not fundamentally a scam, but definitely dead. Do you know what essentially killed them? Transformers. And do you know why Transformers are responsible for almost killing the rest of DL architectures? Because they showed actual results. The paper that is the topic of this thread fails to separate the contribution of this method from that of the massive transformer they're using alongside it. If you are trying to show the benefits of a memory augmented system, why not simply use a CNN or LSTM as controller? Are the authors implying that this memory system they're proposing needs a massive transformer to even use it? Everything about it is just so unfinished and rough.

> Another approach is to use memory augmented networks to store and update embeddings of entities and their relationships over time, and use capsule networks to decode and interpret these embeddings to make predictions. This approach can be especially useful for tasks that involve sequential data, such as language modeling and time-series forecasting.

Are you aware that exactly this has been done by Graves et al., where the external memory is essentially a list of embeddings that is 1D-convolved over? The problem, like I mentioned, is that this kind of process is barely differentiable. Even if you do fuzzy search (Graves et al. use sort of an attention based on access frequency alongside the similarity one), your gradients are so sparse your network basically doesn't learn anything. Furthermore, the output of your model is tied to this external memory. If you do not optimize the memory, then you are limiting the performance of your model severely. If you do, then what you're doing is nothing novel: you have just arbitrarily decided that part of your monolithic network is memory, even though it's all just one thing.
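One way to see where the sparsity comes from, as a sketch only: content-based addressing takes a softmax over a sharpened cosine similarity between a query and each slot key. If the read is r = sum_i w_i * v_i with separately stored values v_i, then the gradient of r with respect to v_i is just w_i, so a sharp addressing distribution means almost every slot gets almost no gradient. The keys, the sharpness values and the helper names below are assumptions for illustration, not the exact DNC implementation.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def content_weights(query, keys, beta):
    # Softmax over beta-sharpened cosine similarities.
    scores = [beta * cosine(query, k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def active_slots(weights, threshold=0.01):
    # Slots whose weight (and hence d r / d v_i) is above the threshold.
    return sum(1 for w in weights if w > threshold)

# Hypothetical memory: six 2D keys at different angles from the query direction.
angles = [0, 30, 60, 90, 120, 180]
keys = [[math.cos(math.radians(a)), math.sin(math.radians(a))] for a in angles]
query = [1.0, 0.0]

diffuse = content_weights(query, keys, beta=1.0)
sharp = content_weights(query, keys, beta=50.0)
print("beta=1 :", [round(w, 3) for w in diffuse], "active:", active_slots(diffuse))
print("beta=50:", [round(w, 3) for w in sharp], "active:", active_slots(sharp))
```

With the low sharpness every slot gets a share of the gradient but the read is a blur of everything in memory. With the high sharpness the read is precise but only one slot out of six gets a meaningful update. You have to pick between a useful read and a useful gradient, and real memories have far more than six slots.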

2