
super_deap OP t1_jck82rd wrote

Nuance is proportional to context.

Imagine we want to ask the language model to improve a certain module in the Linux kernel.

If I understood them correctly, memory-augmented transformers won't be able to fit all the pieces together to understand what needs to be improved and how. They have to make repeated calls to memory and then search/summarize the results just to build a basic understanding, and in doing so they miss important details.

Compare that to a huge context window: the model has everything it needs right there in context, and there is no loss of detail (in the case of full attention).
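
Roughly the contrast I have in mind, as a toy sketch (plain Python; `memory`, `llm` and all method names here are hypothetical stand-ins, not any real library):

```python
# Toy contrast between the two approaches; `memory` and `llm` are stand-in objects.

def memory_augmented_answer(question, memory, llm, max_hops=5):
    """Repeated retrieve -> summarize loop: each hop only sees a slice of the
    codebase, and each summarization step throws details away."""
    context = ""
    for _ in range(max_hops):
        chunks = memory.search(question + " " + context, top_k=3)    # partial view
        context = llm.summarize(context + "\n" + "\n".join(chunks))  # lossy step
    return llm.answer(question, context)


def full_context_answer(question, whole_module_source, llm):
    """Everything sits in the prompt at once; full attention can relate any
    two tokens directly, so nothing gets summarized away."""
    return llm.answer(question, whole_module_source)
```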

17

Spiritual-Reply5896 t1_jckq519 wrote

Let's say the Linux kernel manual is embedded as memories. If we can get an accurate semantic representation of the question, then we should be able to find the relevant context in memory and use just enough of it to answer the question in far fewer tokens than providing the whole Linux manual as context. If we assume that computing attention is about as fast as vector search, then it's a no-brainer that retrieving only the relevant context from memory is the better approach than using the whole manual. It's of course a trade-off between accuracy and speed/scalability, but I'd argue it's a good trade-off, as text often isn't that information-dense.
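
A minimal sketch of what I mean (the `embed` function here is a crude stand-in; a real setup would use a proper sentence-embedding model and a vector index):

```python
import numpy as np

def embed(texts, dim=256):
    # Stand-in embedding (hashed bag-of-words); a real system would use an
    # LLM-based sentence encoder instead.
    vecs = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            vecs[i, hash(tok) % dim] += 1.0
    return vecs

def build_prompt(question, manual_chunks, chunk_vecs, top_k=5):
    # Embed the question and score every stored chunk by cosine similarity.
    q = embed([question])[0]
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    best = np.argsort(-sims)[:top_k]
    # Only the top-k relevant chunks go into the context window,
    # instead of the whole manual.
    context = "\n\n".join(manual_chunks[i] for i in best)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Usage: chunk and embed the manual once, offline; each question then only
# costs a vector search plus a small prompt.
manual_chunks = ["<chunk 1 of the kernel docs>", "<chunk 2>", "<chunk 3>"]
chunk_vecs = embed(manual_chunks)
prompt = build_prompt("How does the slab allocator reclaim memory?", manual_chunks, chunk_vecs)
```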

The ability to produce semantically coherent embeddings from text is the bread and butter of LLMs, so why would it be any harder to retrieve these memories from an external / effectively infinite database than from the context window?

I'm just hypothesizing with my limited knowledge, so please correct me if I'm making stupid assumptions :)

2