FeepingCreature t1_jeesov2 wrote

Sure, and I agree with the idea that deceptions have continuously increasing overhead costs to maintain, but the nice thing about killing everyone is that it clears the gameboard. Sustaining a lie is in fact very easy if shortly - or even not so shortly - afterwards, you kill everyone who heard it. You don't have to not get caught in your lie, you just have to not get caught before you win.

In any case, I was thinking more about deceptive alignment, where you actually do the thing the human wants (for now), but not for the reason the human assumes. With how RL works, once such a strategy exists, it will be selected for, especially if the human reinforces something other than what you would "naturally" do.

1

Sure_Cicada_4459 OP t1_jeexhxg wrote

It will reason from your instructions: the higher the intelligence, the higher the fidelity to their intent. That's why killing everyone wouldn't advance its goal; it is a completely alien class of mind, divorced from evolution, whose drive is directly set by us. There is no winning. It's not playing the game of evolution like every lifeform you have ever met, which is why it is so hard to reason about this without projection.

Think about it this way: in the scenario mentioned above, when naively implemented, its most deceptive, most misaligned, yet still goal-achieving course of action is to deceive all your senses and put you in a simulation where it's more trivial in terms of resource expenditure to satisfy your goals. But that would be as simple as adding that clause to your query. I'm not saying it can't go wrong; I am saying there is a set of statements that, when interpreted with sufficient capabilities, will eliminate these scenarios trivially.

3

FeepingCreature t1_jef1wb3 wrote

Also: we have at present no way to train a system to reason from instructions.

GPT does it because its training set contained lots of humans following instructions from other humans in text form, and then RLHF semi-reliably amplified these parts. But it's not "trying" to follow instructions, it's completing the pattern. If there's an interiority there, it doesn't necessarily have anything to do with how instruction-following looks in humans, and we can't assume the same tendencies. (Not that human instruction-following is even in any way safe.)

> But that would be as simple as adding that clause to your query

And also every single other thing that it can possibly do to reach its goal, and on the first try.

1

Sure_Cicada_4459 OP t1_jef5qx9 wrote

It's the difference between understanding and "simulating understanding". You can always refer to lower-level processes and dismiss the abstract notions of "understanding", "following instructions", and so on. It's a shorthand, but a sufficiently close simulacrum would be indistinguishable from the "real" thing, because not understanding and simulating understanding to an insufficient degree will look the same when it fails. If I am just completing patterns I learned that simulate following instructions to such a high degree that there is no failure happening to distinguish it from "actually following instructions", then the lower-level patterns cease to be relevant to the description of the behaviour and therefore to the forecasting of the behaviour. It's just adding more complexity with the same outcome, that is, it will reason from our instructions, hence my above arguments.

To your last point, yes you'd have to find a set of statements that exhaustively filters out undesirable outcomes, but the only thing you have to get right on the first try is "don't kill, incapacitate, brainwash everyone." + "Be transparent about your actions and their reasons starting the logic chain from our query.". If you just ensure that, which by my previous argument is trivial, you essentially have to debug it continuously, as there will inevitably be undesirable consequences or futures ahead, but they at least remain steerable. Even if we end up in a simulation, it is still steerable as long as the aforementioned is ensured. We just "debug" from there, but with the certainty that the action is reversible, and with more edge cases to add to our clauses. Like building any software really.

3

FeepingCreature t1_jef872m wrote

The problem with "simulating understanding" is what happens when you leave the verified-safe domain. You have no way to confirm you're actually getting a sufficiently close simulacrum, especially if the simulation dynamically tracks your target. The simulation may even be better at it than the real thing, because you're also imperfectly aware of your own meaning, but you're rating it partially on your understanding of yourself.

> To your last point, yes you'd have to find a set of statements that exhaustively filters out undesirable outcomes, but the only thing you have to get right on the first try is "don't kill, incapacitate, brainwash everyone." + "Be transparent about your actions and their reasons starting the logic chain from our query."

Seems to me if you can rely on it to interpret your words correctly, you can just say "Be good, not bad" and skip all this. "Brainwash" and "transparent" aren't fundamentally less difficult to semantically interpret than "good".

2

Sure_Cicada_4459 OP t1_jefepok wrote

With a sufficiently good world model, it will be aware of my level of precision of understanding given the context, and it will be arbitrarily good at inferring intent; it might actually warn me because it is context aware enough to say that this action will yield net negative outcome if I were to assess the future state. That might even be the most likely scenario if its forecasting ability and intent reading are vastly superior, so we don't even have to live through the negative outcome to debug future states. You can't really have such a vastly superior world model without also using the limitations of the user's understanding of the query as a basis for your action calculation. In the end, there is a part that is unverifiable, as I mentioned above, but it is not relevant to forecasting behaviour, kind of like how you can't confirm that anyone but yourself is conscious (and the implications of yes or no are irrelevant to human behaviour).

And that is usually the limit I hit with AI safety people: you can build arbitrary deceiving abstractions on a sub-level that have no predictive influence on the upper one and are unfalsifiable until they, again arbitrarily, hit a failure mode in the undeterminable future. You can append to general relativity a term that would make the universe collapse into a black hole in exactly 1 trillion years, no way to confirm it either, but that's not how we do science; yet technically you can't validate that this is not in fact how the universe happens to work. There is an irreducible risk to this, and the level of attention it gets is likely directly correlated with how neurotic one is. And since the stakes are infinite and the risk is non-zero, you do the math: that's enough fuel to build a lifetime of fantasies and justify any actions really. I believe the least talked about topic is that the criteria of trust are just as much dependent on the observer as on the observed.

By the way yeah, I think so but we will likely be ultra precise on the first tries because of the stakes.

2

FeepingCreature t1_jefl3ya wrote

> By the way yeah, I think so but we will likely be ultra precise on the first tries because of the stakes.

Have you met people? The internet was trying to hook GPT-4 up to unprotected shells within a day of release.

> it might actually warn me because it is context aware enough to say that this action will yield net negative outcome if I were to assess the future state

Sure, if I have successfully trained it to want to optimize for my sense of negative rather than its proxy for my proxy for my sense of negative. Also if my sense of negative matches my actual dispreference. Keep in mind that failure can look very similar to success at first.

> You can append to general relativity a term that would make the universe collapse into a black hole in exactly 1 trillion years, no way to confirm it either

Right, which is why we need to understand what the models are actually doing, not just train-and-hope.

We're not saying it's unknowable, we're saying what we're currently doing is in no way sufficient to know.

1

Sure_Cicada_4459 OP t1_jefuu0z wrote

-Yes, but GPT-4 wasn't public until they did extensive red teaming. They looked at all the worst cases before letting it out. Not that GPT-4 can't cause any damage by itself, just not the kind people are freaked out about.

-That is a given with the aforementioned arguments; ASI assumes superhuman ability on any task and metric. I really think that if GPT-5 shows this same trend, that ease of alignment scales with intelligence, people should seriously update their p(doom).

-My argument boils down to this: the standard of sufficiency can only be satisfied to the degree that one can't observe failure modes anymore. You can't arbitrarily satisfy it, just like you can't observe anything smaller than the Planck length. There is a finite resolution to this problem, whether it is limited by human cognition or by infinitely many possible imagined substructures. We obviously need more interpretability research, and there are some recent trends like Reflexion, ILF and so on that will over the long term yield more insight into the behaviour of systems, as you can work with "thoughts" in text form instead of inscrutable matrices. There will likely be some form of cognitive structures inspired by the human brain which will look more like our intuitive symbolic computations and allow us to measure these failure modes better. Misalignments on the lower level could still be possible ofc, but that doesn't say anything about the system on the whole; it could be load bearing in some way, for example. That's why I think the only way one can approach this is empirical, and AI is largely an empirical science, let's be real.

2