hapliniste OP t1_j50pe93 wrote on January 19, 2023 at 4:20 PM

Also, I think this could improve the actual "logic" of the model by letting the small LM focus on that task while the search part serves as the knowledge base.

Another benefit could be the ability to cite its sources.

It really seems like a no-brainer to me.

currentscurrents t1_j525hto wrote on January 19, 2023 at 9:33 PM

Retrieval language models do have some downsides. Keeping a copy of the training data around is suboptimal for a few reasons:

Training data is huge. RETRO's retrieval database is 1.75 trillion tokens. This isn't a very efficient way of storing knowledge, since a lot of the text is irrelevant or redundant.
Training data is still a mix of knowledge and language. You haven't achieved separation of the two types of information, so it doesn't help you perform logic on ideas and concepts.
Most training data is copyrighted. It's currently legal to train a model on copyrighted data, but distributing a copy of the training data with the model puts you on much less firm ground.

Ideally I think you want to condense the knowledge from the training data down into a structured representation, perhaps a knowledge graph. Knowledge graphs are easy to perform logic on and can be human-editable. There's also already an entire sub-field studying them.
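
To make the "easy to perform logic on" point concrete, here is a minimal sketch of a knowledge graph stored as (subject, relation, object) triples, with one simple inference rule: properties are inherited along `is_a` links. The facts, relation names, and function names are all made up for illustration and don't come from any particular knowledge graph library.

```python
from collections import defaultdict

# Hypothetical facts as (subject, relation, object) triples.
# A human could edit this set directly, which is part of the appeal.
TRIPLES = {
    ("penguin", "is_a", "bird"),
    ("bird", "is_a", "animal"),
    ("bird", "has", "feathers"),
}

def build_index(triples):
    """Map (subject, relation) to the set of objects for fast lookup."""
    index = defaultdict(set)
    for subject, relation, obj in triples:
        index[(subject, relation)].add(obj)
    return index

def ancestors(index, entity):
    """Return every class reachable from entity by following is_a links."""
    seen = set()
    stack = [entity]
    while stack:
        node = stack.pop()
        for parent in index.get((node, "is_a"), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def has_property(index, entity, relation, value):
    """Check a fact on the entity itself or on any class it inherits from."""
    for cls in {entity} | ancestors(index, entity):
        if value in index.get((cls, relation), ()):
            return True
    return False

index = build_index(TRIPLES)
print(has_property(index, "penguin", "has", "feathers"))
```

The graph never states that penguins have feathers; the answer follows from two stored facts and the inheritance rule, which is the kind of logic that is hard to run directly over raw training text.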

BadassGhost t1_j55rxme wrote on January 20, 2023 at 4:25 PM

I think the biggest reason to use retrieval is to solve the two biggest problems:

Hallucination
Long-term memory

Make the retrieval database MUCH smaller than RETRO's, and constrain it to respectable sources (textbooks, nonfiction books, scientific papers, and Wikipedia). You could either leave out textbooks and books, or you could make deals with publishers. Then add to the dataset (or have a second dataset) everything it sees in a certain context in production. For example, add all user chat history to the dataset for ChatGPT.

You could use cross-attention as in RETRO (maybe with some RLHF like ChatGPT), or just software-engineer some prompt manipulation based on embedding similarities.
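
The prompt-manipulation route could look roughly like the sketch below: embed the query and each passage, rank passages by cosine similarity, and paste the top ones into the prompt with numbers so the model can cite them. Everything here is an assumption for illustration. In particular, `embed` is a toy bag-of-words stand-in for a real learned embedding model, and the prompt wording and function names are made up.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy stand-in for a learned embedding model: bag-of-words counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, passages, k=2):
    # Rank passages by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

def build_prompt(query, passages, k=2):
    # Number the retrieved passages so the answer can cite its sources.
    sources = retrieve(query, passages, k)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(sources))
    return (
        "Answer using only the sources below and cite them by number.\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )

# Hypothetical retrieval dataset for an "Accounting ChatGPT".
passages = [
    "Depreciation spreads the cost of an asset over its useful life.",
    "Penguins are flightless birds.",
    "Accrual accounting records revenue when it is earned.",
]
print(build_prompt("How does depreciation of an asset work?", passages, k=1))
```

Long-term memory would fit the same pattern: append chat history to `passages` (or to a second list) as it comes in, and it becomes retrievable on later turns.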

You could imagine ChatGPT variants that have specialized knowledge that you can pay for. Maybe an Accounting ChatGPT has accounting textbooks and documents in its retrieval dataset, and accounting companies pay a premium for it.