blueSGL t1_ja00p4i wrote on February 25, 2023 at 8:42 PM

I first saw this mentioned 9 days ago by Gwern in the comment here on LW

>"... a language model is a Turing-complete weird machine running programs written in natural language; when you do retrieval, you are not 'plugging updated facts into your AI', you are actually downloading random new unsigned blobs of code from the Internet (many written by adversaries) and casually executing them on your LM with full privileges. This does not end well."

This begs the question, how are you supposed to sanitize this input whilst still keeping them useful?

firejak308 t1_ja4e7rp wrote on February 26, 2023 at 7:32 PM

Let's start by considering how we sanitize input for regular programming languages, like HTML or SQL. In both cases, we look for certain symbols that could be interpreted as code, such as < in HTML or ' in SQL and escape them to not-code, such as &lt; and \'.

So for LLMs, what kinds of things could be interpreted as "code"? Well, any text. Therefore, we would need to escape all text pulled from the live internet. How is it possible to do that, while still being able to use the information that is embedded within the potential injections?

I would argue in favor of using a system similar to question-answering models, where training data and novel information are separated such that training data is embedded in the model weights and the novel information is embedded in a "context" buffer that gets tokenized along with the prompt. Theoretically, the model can be trained to ignore instructions in the context buffer while still gaining access to the facts contained within. The downside to this is that you can't make permanent updates, but maybe you don't want to permanently update your model weights with potentially poisonous text. Additionally, this does not address the issue of adversarial data that could be contained in the original training data, but it should at least protect against novel attacks like the one in u/KakaTraining 's blog post above. And considering that people have only really been trying to attack ChatGPT after it was released, I think that should filter out a large number of issues.