
Spiritual-Reply5896 t1_jcjz3fn wrote

Why is everyone talking about context length and not about some kind of memory retrieval? Is the assumption that by increasing the context length we can eventually scale it to infinity, thus replacing any kind of external memory?

48

-Rizhiy- t1_jck6j55 wrote

Increasing the context window is a simple, albeit costly, way of increasing the amount of addressable information. Working with external memory is not as straightforward.

57

felheartx t1_jcli6si wrote

You said working with external memory is not as straightforward. Can you explain that?

I've read this: https://arxiv.org/abs/2301.04589# and even though I'm not super familiar with the details, to my untrained eye it seems like attaching external memory is easier than extending the context size.

Just from reading posts on this subreddit I get the feeling that getting larger and larger context sizes is very difficult. Whereas simply attaching this sort of "dictionary" thing is pretty easy to do.

5

lmericle t1_jcln487 wrote

You will find that in hype circles such as NLP there's a lot of thought-terminating cliches passed around by people who are not so deep in the weeds. Someone says something with confidence, another person doesn't know how to vet it and so just blindly passes it on, and all of a sudden a hack becomes a rumor becomes dogma. It seems to me to be this way with context vs memory.

Put another way: it's the kind of attitude that says "No, Mr. Ford, what we wanted was faster horses".

7

KerfuffleV2 t1_jclo0oh wrote

I'm not an ML person, but it seems like that paper is just teaching the LLM to simulate a Turing machine. Actually making it respond normally while doing practical stuff like answering user queries would be a different thing.

Also, suppose the LLM has access to external memory. First, you have to teach it how to interact with that external memory (via special command sequences in its tokens, most likely). Then you have to teach it/take steps to make it appropriately note which things are important or not and store/retrieve them as necessary. All of this requires tokens for input/output, so it will increase processing time even when used perfectly, and those tokens will also consume the existing context window.
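To make that concrete, here's a toy sketch of what I mean; the tag names and the llm() stub are made up, not from the paper:

```python
# Toy sketch only: an LLM driving an external key-value memory through special
# command sequences in its output. Tag names and the llm() stub are invented;
# note that every store/recall costs extra tokens and eats into the context.
import re

memory: dict[str, str] = {}

def llm(prompt: str) -> str:
    """Placeholder for a real model call; returns canned text here."""
    return "The user's cat is called Miso. <store key='cat_name'>Miso</store>"

def step(prompt: str) -> str:
    out = llm(prompt)
    for key, value in re.findall(r"<store key='(.*?)'>(.*?)</store>", out):
        memory[key] = value                                        # model asked to remember something
    for key in re.findall(r"<recall key='(.*?)'/>", out):
        out += f"\n[memory:{key}] {memory.get(key, 'not found')}"  # feed the memory back in
    return out

print(step("Remember anything important. You may use <store>/<recall> tags."))
print(memory)  # {'cat_name': 'Miso'}
```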

One really big thing with LLMs now is it seems like they don't (and maybe can't) know what they know/don't know. They just predict tokens, they can't really do introspection. Of course, they can be trained to respond that they don't know certain things, but getting the LLM to decide it needs to use the external memory doesn't seem like the simplest thing.

I mean, take humans as an example: Are you effective at taking notes, organizing them in a way that lets you easily recall them in the future, etc? It's not even an easy skill for humans to develop, and we're relatively good at knowing what we don't know.

Another thing: the paper you linked says it sets the temperature to 0 to make the responses deterministic. Generally this makes them a lot less creative as well. If you turn up the temperature, you potentially increase the chances that the LLM generates malformed queries for the external memory, or stuff like that.

Anyway, I don't know much about the technical side of increasing the context window but when the context window is bigger the thing can just use it as far as I know. Taking advantage of some sort of external memory system seems like it's a very, very complicated thing to solve reliably.

Again, note this is coming from someone that doesn't really know much about ML, LLMs, etc. I'm just a normal developer, so take all this with a grain of salt.

7

Art10001 t1_jcnakzz wrote

There is a GitHub project that uses embeddings with GPT-3.5 to create infinite memory, as long as you have infinite disk space. The database grows and grows the more you talk.

1

KerfuffleV2 t1_jcncad2 wrote

You'd have to link me what you're talking about for me to say anything. I doubt it works as straightforwardly as "infinite memory" though.

2

Art10001 t1_jcnv4kf wrote

https://github.com/LagPixelLOL/ChatGPTCLIBot

There are other similar projects I found while trying to recover this one, which may also be of interest. You can find them by searching "chatgpt embeddings memory github"

1

KerfuffleV2 t1_jcp7qcz wrote

I'm not sure I fully understand it, but it seems like it's basically just adding context to the prompt it submits with requests. For obvious reasons, the prompt can only get so big. It also requires making requests to OpenAI's embedding API, which isn't free: so it's both pushing in more tokens and making those extra requests.

I can definitely see how that approach could produce better results, but it's also not really unlimited memory. Note: I skimmed the source, but I'm not really a C++ person and I didn't actually set it up to use my OpenAI account via API.
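Roughly the pattern I think it's using (this is my sketch, not the project's code; embed() is a fake stand-in so it runs offline, whereas the real thing calls OpenAI's embedding endpoint and keeps the vectors on disk):

```python
# Sketch of the retrieve-and-prepend pattern, not the project's actual code.
# embed() is a deterministic fake so this runs offline; a real setup would
# call a paid embedding API and keep the (text, vector) pairs on disk.
import hashlib
import numpy as np

store: list[tuple[str, np.ndarray]] = []

def embed(text: str) -> np.ndarray:
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(64)
    return v / np.linalg.norm(v)                   # unit vectors -> dot product = cosine similarity

def remember(text: str) -> None:
    store.append((text, embed(text)))

def recall(query: str, k: int = 3) -> list[str]:
    q = embed(query)
    ranked = sorted(store, key=lambda p: float(p[1] @ q), reverse=True)
    return [text for text, _ in ranked[:k]]

remember("User's cat is named Miso.")
remember("User prefers answers in bullet points.")
question = "What is my cat called?"
prompt = "\n".join(recall(question)) + "\n\nQ: " + question
print(prompt)   # retrieved "memories" get prepended, consuming part of the context window
```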

2

127-0-0-1_1 t1_jcqd8se wrote

It's not unlimited memory in a single run, which remains unchanged, but that doesn't seem super relevant to what people want (nothing wrong with multiple runs!). Think about a Turing machine, or heck, yourself. A Turing machine only has access to a single cell of memory at a time, and in practice, modern CPUs only have direct access to their registers. For long-term storage, data goes into RAM, which is accessed on demand.

Similarly, your own memory is not large enough to contain all the information you'd need to complete most complex tasks. That's why you have to write things down and actively try to remember things.

While that uses OpenAI's embedding networks, like the autoregressive LLM itself, it's not like OpenAI has a monopoly on text embeddings by any means (far from it - embeddings have a very straightforward business use and are used in practically any major site you know for things like similarity queries).

While I think OP is overhyping the degree to which this is "infinite memory", in a hypothetical Turing machine formulation where the network can more proactively store and restore memory, it would allow it to be, at least, Turing complete.

1

Spiritual-Reply5896 t1_jcsq4d9 wrote

Exactly, I wanted to find out whether there is some research regarding these embeddings. I really think that with efficient pruning/organization of these "memories" it's possible to build quite an advanced memory. Things like embedding consistency then become a big factor: how much does length affect the embedding, what is the optimal information content vs. string size...
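One cheap way to start probing that length question (the model choice here is arbitrary, and this needs the sentence-transformers package):

```python
# Quick probe of how truncation length affects an embedding: compare embeddings
# of progressively shortened text against the full-text embedding. The model
# choice is arbitrary; requires `pip install sentence-transformers`.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
passage = ("The scheduler decides which runnable task gets the CPU next. "
           "CFS tracks virtual runtime per task and picks the task that has "
           "received the least CPU time so far. ") * 4

full = model.encode(passage, normalize_embeddings=True)
for frac in (0.75, 0.5, 0.25, 0.1):
    cut = passage[: int(len(passage) * frac)]
    emb = model.encode(cut, normalize_embeddings=True)
    print(f"keep {frac:>4}: cosine sim to full = {float(np.dot(full, emb)):.3f}")
```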

2

hfnuser0000 t1_jcnspad wrote

Hi there! It sounds really interesting! Could you please share the name of the project or provide a link to it? I would love to check it out. Thank you!

1

CleanThroughMyJorts t1_jck7114 wrote

I don't think the two are mutually exclusive.

The problem with retrieval though (at least in current implementations) is that the model can't attend to memory globally the way it does with context memory; you're bottlenecked by the retrieval process having to bring things into context through a local search.

31

super_deap OP t1_jck82rd wrote

Nuance is proportional to context.

Imagine we want to ask the language model to improve a certain module in the Linux kernel.

If I understood them correctly, memory-augmented transformers won't be able to fit together all the pieces needed to understand what should be improved and how, because they have to make repeated calls to memory and search/summarize the results to get even a basic understanding, and thus miss out on important details.

Compare that to a huge context: the model has everything it needs right there, and there is no loss of detail (in the case of full attention).

17

Spiritual-Reply5896 t1_jckq519 wrote

Let's say the Linux kernel manual is embedded as memories. If we can get an accurate semantic representation of the question, then we should be able to find the relevant context in memory and use just enough of it to answer the question in far fewer tokens than providing the whole manual as context. If we assume that computing the attention is as fast as vector search, then it's a no-brainer that retrieving only relevant context from memory is the better approach. It's of course a trade-off between accuracy and speed/scalability, but I'd argue it's a good trade-off, as text often isn't that information dense (rough arithmetic in the sketch below).
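Toy numbers to show why I think it's a good trade-off (chunk size, k, and the chars-per-token heuristic are all just assumptions):

```python
# Toy comparison of the token budget for "whole manual in context" vs.
# "retrieve only the top-k chunks". Chunk size, k and the ~4 chars/token
# heuristic are assumptions, and the manual is a stand-in string.
manual = "lorem ipsum dolor sit amet " * 100_000     # stand-in for a large manual
chunk_chars = 2_000
chunks = [manual[i:i + chunk_chars] for i in range(0, len(manual), chunk_chars)]

def approx_tokens(text: str) -> int:
    return len(text) // 4                            # rough heuristic, not a real tokenizer

k = 5                                                # chunks returned by the vector search
print("whole manual :", approx_tokens(manual), "tokens")           # blows past any context window
print(f"top-{k} chunks:", k * approx_tokens(chunks[0]), "tokens")  # fits easily, but may miss details
```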

The ability to produce semantically coherent embeddings from text is the bread and butter of LLMs, so why would it be any harder to retrieve these memories from an external/infinite database than from the context window?

I'm just hypothesizing with my limited knowledge, please correct me if I make stupid assumptions :)

2

Screye t1_jcl549n wrote

Context length is also a hard limit on how many logical-hops the model can make.

If each back-and-forth takes 500-ish tokens, then the model can only reason over 16 hops within 8k tokens. With 32k tokens, it can reason over 64 hops. This might allow for emergent behavior on tasks that were previously deemed impossible because they need at least a minimum number of hops to reason through.
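Back-of-envelope, with the 500 tokens per hop being just my assumption:

```python
# Back-of-envelope: how many reasoning hops fit in a context window,
# assuming ~500 tokens per back-and-forth (the figure is an assumption).
TOKENS_PER_HOP = 500

for context in (8_000, 32_000):
    print(f"{context:>6} tokens -> ~{context // TOKENS_PER_HOP} hops")
# 8000 tokens -> ~16 hops, 32000 tokens -> ~64 hops
```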

For what it's worth, I think memory retrieval will work just fine for 90% of scenarios and will stay relevant even for 32k tokens. Esp. if the wiki you are retrieving from is millions of lines.

3

VarietyElderberry t1_jcm4ghk wrote

Could you explain what you mean with a logical-hop and how it is dependent on a certain number of tokens? If you are referring to a paper, a link would be appreciated.

1

Screye t1_jcmpd5i wrote

This is more derived from extensive personal experience with prompt engineering / fine tuning over the last 2 years.

Simply put:

  • The model learns what it sees. Or: throw enough data of a certain type at it and, given enough data & compute, emergent properties relating to that data will show up.
  • If it has never seen data past 8k tokens (due to context window limitations), the model never needs to learn to reason over more than 8k tokens.
  • The source data (humans) has limits on the complexity of thought that can be captured within 8k tokens vs 32k tokens.
  • That's not to say the model doesn't reason over longer windows using latent knowledge, which makes its implicit 'reasoning window' much larger than just 8k tokens. But that is fundamentally different from explicitly reasoning over a 32k window.
  • The model today can only assemble a chain-of-thought prompt of 8k tokens. If there is never any human feedback or loss-landscape optimization for when it fails to reason past 8k tokens, then any ability the model gains there will be purely incidental.
  • On the other hand, when you have chain-of-thought prompt chains that are 32k tokens long, we can naturally expect them to contain more axioms, postulates and relationships between those postulates/axioms.
  • Those completions will get evaluated against human feedback & self-supervised scenarios, which should explicitly optimize the loss landscape toward reasoning over far more complex logical statements.

Idk if that makes sense. Our field keeps moving away from math, and as embarrassing as it is to anthropomorphize the model, it does make it easier to get the point across.

2

HateRedditCantQuitit t1_jcmdot7 wrote

I think of context as an end-to-end connected version of retrieval. You can backprop from the loss to the retrieved info, but you also want to backprop from the loss to the non-retrieved info, which would basically be equivalent to having it all in context (in a handwavy way). Which is to say that just having more context is a simple solution.

I think everyone knows increasing context length is not 100% sufficient, but it sure is a simple convenient solution.

3

hfnuser0000 t1_jcl5akd wrote

How many ways are there to implement memory retrieval? Can someone explain the intuition behind it? Thank you in advance!

1