
t1_je8q3lm wrote

> 3. it will be initially unaligned

If we had:

  1. a provable mathematical solve for alignment...

  2. the ability to directly reach into the shoggoth's brain, watch it thinking, know what it's thinking, and prevent eventualities that people consider negative outputs...

...both of which worked 100% of the time on existing models, I'd be a lot happier about our chances right now.

Given that current models cannot be controlled or explained in fine-grained enough detail (the problem is being worked on, but it's still very early days), what makes you think that making larger models will make them easier to analyze or control?

The current 'safety' measures amount to bashing at a near-infinite whack-a-mole board whenever the model outputs something deemed wrong.
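To make the whack-a-mole point concrete (a deliberately crude toy sketch, not how OpenAI's moderation or RLHF actually works; the phrase list and function are made up), surface-level patching behaves roughly like this, and every rephrasing that slips past is another mole to whack:

    # Hypothetical toy output filter -- purely illustrative, not any real system's code.
    BLOCKED_PHRASES = {"how to hotwire a car", "how to pick a lock"}

    def surface_filter(model_output: str) -> str:
        # Patch one known-bad surface form of the output.
        lowered = model_output.lower()
        if any(phrase in lowered for phrase in BLOCKED_PHRASES):
            return "I can't help with that."
        return model_output

    # Trivially bypassed by paraphrase, misspellings, other languages, role-play
    # framing, etc. The model's underlying capability is untouched; only one
    # particular surface form has been patched.
    print(surface_filter("Sure! How to h0twire a car: ..."))  # slips straight through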

As has been shown, OpenAI has not found all the ways to coax out negative outputs. The internet contains far more people than OpenAI has alignment researchers, and those internet denizens will be more driven to find flaws.

Basically, until the AI 'brain' can be exposed and interpreted, and safety checks added at that level, we have no way of preventing some clever sod from working out a way to break the safety protocols imposed at the surface level.
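For a rough sense of what "exposing the brain" even looks like mechanically (a minimal sketch only; the choice of model, layer, and what you'd do with the activations are all placeholders, and actual interpretability research is far harder than this), the usual starting point is hooking into a model's intermediate activations:

    import torch
    from transformers import GPT2Model, GPT2Tokenizer

    # A small open model as a stand-in; real safety-relevant systems are far larger.
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2Model.from_pretrained("gpt2")
    model.eval()

    captured = {}

    def capture_hook(name):
        # Record one block's hidden states so they can be inspected (or, in
        # principle, checked against some learned "unsafe" direction).
        def hook(module, inputs, output):
            captured[name] = output[0].detach()
        return hook

    # Attach the hook to one transformer block.
    model.h[6].register_forward_hook(capture_hook("block_6"))

    inputs = tokenizer("The quick brown fox", return_tensors="pt")
    with torch.no_grad():
        model(**inputs)

    # captured["block_6"] now holds that layer's hidden states. Turning tensors
    # like this into reliable statements about what the model is "thinking" is
    # the open problem interpretability research is trying to solve.
    print(captured["block_6"].shape)

Getting the numbers out is the easy part; interpreting them, and building safety checks on top of that interpretation, is where the field is still in its early stages.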

1

t1_je8rcb6 wrote

Who’s to say this will ever happen?

2

t1_je8saz1 wrote

What will ever happen? Interpretability? It's being worked on right now, and there are already some interesting results. It's just an early field that will need time, money, and researchers put into it. Alignment as a whole needs more time, money, and researchers.

1