Hodoss t1_j9qy02f wrote on February 23, 2023 at 10:48 PM

Reply to comment by gwern in And Yet It Understands by calbhollo

It seems it’s the same AI doing the input suggestions; it’s like writing a dialogue between characters. So it’s not like it hacked the system or anything, but still, it’s fascinating that it did that!

gwern t1_j9r43jv wrote on February 23, 2023 at 11:29 PM

There is an important sense in which it 'hacked the system': this is just what happens when you apply optimization pressure with adversarial dynamics. The Sydney model automatically yields 'hacks' of the classifier, and the more you optimize/sample, the more you exploit the classifier: https://openai.com/blog/measuring-goodharts-law/ My point is that this is more like a virus evolving to beat an immune system than like something more explicit or intentional-sounding, such as 'deliberately hijacking the input suggestions'. The viruses aren't 'trying' to do anything; it's just that the unfit viruses get killed and vanish, and only the ones that beat the immune system survive.
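
To make the selection point concrete, here is a toy simulation. It is only a sketch under made-up assumptions and not a model of the actual Sydney setup: each sampled output has a hidden "true" score for how objectionable it is, the classifier only sees a noisy proxy of that score, and anything whose proxy score reaches a threshold gets blocked. The function name `surviving_worst` and all the numbers are hypothetical. Nothing in the loop 'tries' to beat the classifier, yet the worst output that slips through gets worse as the number of samples grows.

```python
import random


def surviving_worst(n_samples, threshold=1.0, noise=1.0, seed=0):
    """Return the highest true score among samples the noisy classifier lets through.

    All quantities are illustrative assumptions, not measurements of any real system.
    """
    rng = random.Random(seed)
    worst = float("-inf")  # stays -inf if nothing passes the filter
    for _ in range(n_samples):
        # Hidden quantity: how objectionable the output really is.
        true_score = rng.gauss(0.0, 1.0)
        # What the classifier sees: the true score plus its own error.
        proxy_score = true_score + rng.gauss(0.0, noise)
        # The "immune system": block anything the classifier flags.
        if proxy_score < threshold:
            # Survivors are kept with no regard to intent; we just track the worst one.
            worst = max(worst, true_score)
    return worst


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, surviving_worst(n))
```

Because every run reuses the same seed, the larger runs contain the smaller ones as a prefix, so the printed worst-surviving score can only stay the same or rise as `n` grows. The survivors with high true scores are exactly the ones where the classifier's error happened to fall in their favor, which is the 'unfit viruses get killed' dynamic in miniature.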