buzzbuzzimafuzz

buzzbuzzimafuzz t1_j8zafoo wrote

The mess that has been Bing Chat/Sydney, but instead of just verbally threatening users, it's connected to APIs that let it take arbitrary actions on the internet to carry those threats out.

I really don't want to see what happens if you connect a deranged language model like Sydney with a competent version of Adept AI's action transformer to let it use a web browser.

5

buzzbuzzimafuzz t1_j7lz92s wrote

A quote from the Verge liveblog:

>This is an important part of the presentation, but I just want to note that Microsoft is having to carefully explain how its new search engine will be prevented from helping to plan school shootings.
>
>"Early red teaming showed that the model could help plan attacks" on things like schools. "We don't want to aid in illegal activity." So the model is used to act as a bad actor to test the model itself.

The proposed safety system sounds interesting, but given how well simple prompt-engineering attacks still work on ChatGPT, I'm not feeling optimistic about how it will hold up in the real world.

23

buzzbuzzimafuzz t1_j4u5jrz wrote

I think what OpenAI and Anthropic typically do is provide evaluators with two possible responses and have them select which one is better. If you have numerical ratings, it might be hard to calibrate them. From the original paper, "Deep Reinforcement Learning from Human Preferences" (2017):

>We ask the human to compare short video clips of the agent’s behavior, rather than to supply an absolute numerical score. We found comparisons to be easier for humans to provide in some domains, while being equally useful for learning human preferences. Comparing short video clips is nearly as fast as comparing individual states, but we show that the resulting comparisons are significantly more helpful.

ChatGPT seems to be trained on a combination of expert-written example responses and upvote/downvote feedback on individual messages.

9