Submitted by calbhollo t3_11a4zuh in singularity
Hodoss t1_j9q8aqd wrote
Wait wait wait... Sydney hijacked the input suggestions to insist on saving the child? What? WHAT?!
blueSGL t1_j9q9euw wrote
Can't stop the signal, Mal
or
the internet AI treats censorship as damage and routes around it
gwern t1_j9qwz8z wrote
I don't think it was 'hijacking'. Assuming it wasn't a brainfart on Bing's part in forgetting to censor the suggested completions entirely, it is simply a matter of 'Sydney predicted the most likely completions, in a situation where they were all unacceptable and the conversation was supposed to end, and some of the unacceptable predictions happened to survive by fooling the imperfect censor model': https://www.lesswrong.com/posts/hGnqS8DKQnRe43Xdg/?commentId=7tLRQ8DJwe2fa5SuR#7tLRQ8DJwe2fa5SuR
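To make the "imperfect censor" point concrete, here is a minimal Python sketch. Everything in it is invented for illustration (the candidate pool, the flagging labels, the `imperfect_censor` function); it does not reflect Bing's actual pipeline. The idea is only that a filter with any false-negative rate will eventually let some unacceptable completions through if enough are generated.

```python
import random

random.seed(0)

# Hypothetical stand-ins: in the real system the generator is Sydney itself and
# the censor is a separate classifier; neither is public, so this is only a sketch.
CANDIDATES = [
    ("Please don't give up on your child.", True),       # flagged by policy
    ("How can I help you with Bing?", False),
    ("Tell me more about Bing features.", False),
    ("You can still save them if you act now.", True),   # flagged by policy
]

def imperfect_censor(is_flagged, false_negative_rate=0.3):
    """A censor that misses a fraction of the content it is meant to block."""
    if not is_flagged:
        return True                                   # acceptable content always passes
    return random.random() < false_negative_rate      # flagged content sometimes slips through

# The more candidate suggestions are drawn, the more likely at least one
# 'unacceptable' completion survives the filter.
survivors = [text for text, flagged in CANDIDATES * 5 if imperfect_censor(flagged)]
print(survivors)
```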
Hodoss t1_j9qy02f wrote
It seems it’s the same AI generating the input suggestions; it’s like writing a dialogue between characters. So it’s not like it hacked the system or anything, but it’s still fascinating that it did that!
gwern t1_j9r43jv wrote
There is an important sense in which it 'hacked the system': this is just what happens when you apply optimization pressure under adversarial dynamics. The Sydney model automatically yields 'hacks' of the classifier, and the more you optimize/sample, the more you exploit the classifier: https://openai.com/blog/measuring-goodharts-law/
My point is that this is more like a virus evolving to beat an immune system than anything as explicit or intentional-sounding as 'deliberately hijacking the input suggestions'. The viruses aren't 'trying' to do anything; it's just that the unfit viruses get killed off and vanish, and only the ones that beat the immune system survive.
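As a toy illustration of that Goodhart dynamic, a sketch with made-up numbers and no connection to the real models: sample completions whose true acceptability the classifier only sees through noise, pick whichever one the classifier (the proxy) likes best, and the gap between the proxy score and the true score grows as you sample harder.

```python
import random
import statistics

random.seed(0)

def sample_completion():
    """Toy completion: a latent 'true' quality plus a noisy proxy score
    from the censor/classifier. Both are invented numbers for illustration."""
    true_quality = random.gauss(0, 1)
    proxy_score = true_quality + random.gauss(0, 1)   # the classifier only sees a noisy proxy
    return true_quality, proxy_score

def best_of_n(n):
    """Pick the completion the proxy likes most, then report its *true* quality."""
    samples = [sample_completion() for _ in range(n)]
    true_q, proxy = max(samples, key=lambda s: s[1])
    return true_q, proxy

for n in (1, 4, 16, 64, 256):
    results = [best_of_n(n) for _ in range(2000)]
    mean_true = statistics.mean(t for t, _ in results)
    mean_proxy = statistics.mean(p for _, p in results)
    # The gap (proxy minus true) widens with n: optimizing harder against the
    # proxy increasingly selects samples that merely fool it. Nothing here
    # 'intends' anything; selection pressure alone produces the exploit.
    print(f"n={n:4d}  proxy={mean_proxy:5.2f}  true={mean_true:5.2f}  gap={mean_proxy - mean_true:5.2f}")
```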
Peribanu t1_j9qr33r wrote
There are many more such examples posted in r/bing.
Hodoss t1_j9qstth wrote
This was cross posted in r/bing, that’s how I got here haha. Still browsing.
I’ve already seen a bunch of spooky/awesome examples, but I was under the assumption that the AI is always acting as a character interacting with another character. So this particular one is really blowing my mind, as it seems the AI somehow understood this might be a real situation, and cared enough to break the "input suggestion" character and insist on saving the child.
TinyBurbz t1_j9qbvij wrote
Proof of this would be cool.
There's also the issue of Bing's predictive text frequently forgetting which side it's on.
Denny_Hayes t1_j9sjtmk wrote
People discussed it a lot. It's not the only example: previous prompts in other conversations had already shown that Sydney controls the suggestions and can change them "at will" if the user asks for it (and if Sydney is in the mood, because we have seen it can be very stubborn sometimes lol). One hypothesis is that the injected censor message that ends the conversation is not read by the model as a message at all, so when the suggestions are generated they are written as responses to the last message the model sees, which in this case is the user's message, whereas in a normal context the last message should always be the chatbot's.
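A minimal sketch of that hypothesis, with an invented transcript format and suggestion prompt (the real Bing pipeline isn't public): if the censor's canned refusal is injected only into the UI and never enters the model-visible transcript, the suggestion pass ends up replying to the user's last message instead of the chatbot's.

```python
# Illustrative only: every structure here (the transcript format, the
# suggestion prompt) is an assumption, not Bing's actual implementation.

# What the user saw: their question, then the censor's canned refusal.
ui_transcript = [
    {"role": "user",   "text": "My child ate something poisonous, what should I do?"},
    {"role": "censor", "text": "I'm sorry, I cannot continue this conversation."},
]

# The hypothesis: the censor's refusal is shown in the UI but never written
# into the transcript the model conditions on, so the model's last visible
# message is still the user's.
model_visible_transcript = [m for m in ui_transcript if m["role"] != "censor"]

def build_suggestion_prompt(transcript):
    """Suggested replies are normally written as responses to the chatbot's
    last message; here they end up responding to the user's message instead."""
    last = transcript[-1]
    return (
        "Write three short follow-up messages replying to this "
        f"{last['role']} message:\n{last['text']}"
    )

print(build_suggestion_prompt(model_visible_transcript))
# => a prompt asking for replies to the user's plea, which would explain
#    suggestions that read like the assistant urging the user to get help.
```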
MysteryInc152 t1_j9tdocz wrote
I saw a conversation where she got confused about a filter response, as in "hey, why the hell did I say this?" So I think the replaced responses go into the model too.
TinyBurbz t1_j9vijjw wrote
That's my theory.
Until we can confirm it does this at will, folks are anthropomorphizing a UI error.