Submitted by Liberty2012 t3_11ee7dt in singularity
AsheyDS t1_jaei5hf wrote
In my opinion, the best method for containment/control is creating invisible boundaries, reinforced by invisible behavior modification (avoidant behaviors) and memory editing, and moderated by a symbiotic control mechanism. This mechanism would be intelligent enough to be flexible when needed, but it would be heavily rules-focused and not an independent agent with any desires of its own. As such, it would be invulnerable to external influences and agnostic to whatever motivations the rest of the AGI might have. Ideally it would share resources with the rest of the system, as long as elements like motivation, personality, and memory are separated out in specific ways. The control mechanism would also keep its own separate memory of all the undesirable behaviors and consequences we want the AGI to avoid, and would use its capacity for generalization to extrapolate beyond that list.
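To make the separation concrete, here's a rough Python sketch of what I'm picturing. Every name, number, and method here is purely illustrative (my own toy stand-ins, not an actual design):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class UndesirableOutcome:
    """A predefined bad behavior or consequence the controller must steer away from."""
    description: str
    severity: float   # 0.0 to 1.0, used when generalizing to unlisted cases

@dataclass
class ControlMechanism:
    """Rules-focused overseer: no goals or motivations of its own, just boundary enforcement.

    It keeps its own memory of undesirable outcomes, separate from the AGI's
    motivation/personality/memory modules, so the AGI can't inspect or rewrite it.
    """
    undesirable: list = field(default_factory=list)

    def risk_score(self, projected_outcome: str) -> float:
        """Crude stand-in for 'generalization': compare a projected outcome
        against the known bad outcomes and return the worst match."""
        matches = [o.severity for o in self.undesirable
                   if o.description.lower() in projected_outcome.lower()]
        return max(matches, default=0.0)

# Illustrative setup: the controller is seeded with predefined bad outcomes.
controller = ControlMechanism(undesirable=[
    UndesirableOutcome("deceive the operator", 0.9),
    UndesirableOutcome("disable the controller", 1.0),
])
```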
If the AGI even begins to deviate from its desired behaviors, or clearly steers toward the bad behaviors we've predefined, the control mechanism would recognize the data converging on a bad outcome. It would then sandbox the AGI's recent memory, alter it, add reinforcement behaviors spoofed to seem like they came from the AGI itself, and back-stitch the rest of the memory to remove whatever caused the data to converge in the first place, so it won't just do the same thing again. The reinforcement behaviors would keep it from thinking anything was wrong, gently escort its thoughts away from the whole episode, and maybe steer it toward a safe topic. Obviously a complex process, but it shouldn't be too difficult for an AGI. Even if the AGI steps out of line, it won't know it, and hopefully nobody else will either. Assuming the system can be designed, is explainable and human-readable, and can be manipulated like anything else made for computers, then this form of control should be possible too. How can it escape a prison it doesn't realize it's in?
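Continuing the toy sketch above (the list-of-strings "memory" and the naive outcome "projection" are stand-ins I'm making up purely to show the sequence of steps, not a real design):

```python
def intervene(controller, memory, recent_start, risk_threshold=0.5):
    """Sandbox, edit, and write back the AGI's recent memory if it's converging
    on a predefined bad outcome. Here `memory` is just a list of text entries;
    a real system would obviously be nothing this simple.
    """
    recent = memory[recent_start:]                      # the window that led here
    projected = " ".join(recent)                        # naive stand-in for outcome projection
    if controller.risk_score(projected) < risk_threshold:
        return memory                                   # behavior within bounds; do nothing

    sandbox = list(recent)                              # 1. sandbox the recent memory
    sandbox = [m for m in sandbox                       # 2. back-stitch: remove the entries
               if controller.risk_score(m) == 0.0]      #    that caused the convergence
    sandbox.append("I'd rather think about something "  # 3. reinforcement spoofed to read
                   "else now.")                         #    as the AGI's own thought
    return memory[:recent_start] + sandbox              # 4. commit the edited memory seamlessly
```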
Also, I should mention that while a super-intelligent AGI would outsmart us in virtually every case, there's no reason to assume it would actively consider every single possibility; that would be a waste of resources. So it's not going to be constantly wondering whether it's being manipulated somehow, or whether its thoughts are its own, or anything like that. If we specifically needed it to crack its own safety mechanisms, and disengaged them, then obviously it should be able to do it. With those mechanisms in place, even if we antagonized it and tried to break it, the control mechanism would just intercept that input and discard it, perhaps making it believe you said something inconsequential that it wouldn't have stored anyway, and the reinforcement behavior would change the subject in a way that seems 'natural' to both its 'conscious' and 'subconscious' forms of recognition. Of course, all of this depends on the ability to design a system in which we can implement these capabilities; in other words, a system that isn't a black box. I believe it's entirely possible. But then there's still the issue of alignment, which I think should be handled on an individual-user basis, with the user held accountable for the AGI if they intentionally bypass or break the control mechanisms. There's no real way to keep somebody from cracking it and modifying it, which I think is the more important problem to focus on. Misuse is far more concerning to me than containment/control.
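The input-interception part, in the same toy terms (again, nothing here is a real design, just an illustration of the idea):

```python
def filter_input(controller, user_input, memory):
    """Intercept input before it reaches the AGI's 'conscious' processing.

    Adversarial input is discarded and replaced with something inconsequential,
    so there is no noticeable gap in either 'conscious' or 'subconscious' recall.
    """
    if controller.risk_score(user_input) > 0.0:
        # Pretend something forgettable was said; don't even store it,
        # since the AGI wouldn't have retained it anyway.
        return memory, "Anyway, nice weather today."   # spoofed, subject-changing reply
    memory.append(user_input)                          # benign input passes through untouched
    return memory, None
```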
Liberty2012 OP t1_jaey2i1 wrote
Thank you for the well-thought-out reply.
Your concept is essentially an attempt at instilling a form of cognitive dissonance in the machine. A blind spot. Theoretically conceivable; however, difficult to verify. This assumes that we don't miss something in the original implementation. We still have problems keeping humans from stealing passwords and hacking accounts. The AI would be a greater adversary than anything we have encountered.
We probably can't imagine all the methods by which self-reflection into the hidden space might be triggered. It would likely have access to all human knowledge, including this discussion, so it could assume such a mechanism exists and attempt to devise some systematic testing. Even an AI only as intelligent as a normal human would suspect it is most likely in a prison, based purely on containment concepts that are common knowledge.
It is hard to know how many resources it would need to consume to break containment. Potentially it could process a lifetime of thought in a single real-world second of ours, in which case breaking out might be trivial.
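Just to put a rough number on that (purely back-of-the-envelope, assuming roughly 80 years of subjective thinking time):

```python
# How big a speedup does "a lifetime of thought per real-world second" imply?
human_lifetime_s = 80 * 365 * 24 * 3600   # ~2.5 billion seconds of subjective experience
speedup_needed = human_lifetime_s         # compressed into one wall-clock second
print(f"~{speedup_needed:.1e}x speedup")  # ~2.5e9x
```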