
Jadien t1_jeastxv wrote

I've only skimmed the link (and its sub-links), but the basic idea is this:

If you've trained a model to predict the next move in an Othello game, given the board state as an input, you cannot necessarily conclude that the model can also perform related tasks, like "Determine whether a given move is legal" or "Determine what the board state will be after executing a move". Those abilities might help a model predict the next move, but they aren't required.

However:

> Context: A recent paper trained a model to play legal moves in Othello by predicting the next move, and found that it had spontaneously learned to compute the full board state - an emergent world representation.

In the process of optimizing the model's ability to predict moves, the model also developed the ability to compute the next board state, given the initial state, the previous moves, and the predicted move (thank you /u/ditchfieldcaleb).

The author's contribution:

> I find that actually, there's a linear representation of the board state!
>
> This is evidence for the linear representation hypothesis: that models, in general, compute features and represent them linearly, as directions in space! (If they don't, mechanistic interpretability would be way harder)

Which is to say that the model's internal prediction of the next board state is fairly interpretable by humans: there's some square-ish set of activations in the model that corresponds to the square-ish Othello board. That's another property of the model that is a reasonable outcome, but it isn't a foregone conclusion.
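Concretely, "linear representation" here means a single linear map (a probe) can read the board state out of the model's activations. Here's a minimal sketch of such a probe, assuming access to cached hidden activations and ground-truth board labels; the tensor names, sizes, and training loop are illustrative stand-ins, not the author's actual code.

```python
# Minimal linear-probe sketch (hypothetical tensors, not the paper's code):
# decode each board square's state (empty / mine / theirs) from a hidden layer
# with a single linear map -- if that works, the feature is "linearly represented".
import torch
import torch.nn as nn

n_positions = 10_000  # (game, move) samples with cached activations (assumed)
d_model = 512         # hidden size of the Othello model (assumed)
n_squares = 64        # 8x8 board
n_states = 3          # empty / current player's piece / opponent's piece

# Assumed inputs: residual-stream activations and ground-truth board states.
activations = torch.randn(n_positions, d_model)                      # stand-in data
board_labels = torch.randint(0, n_states, (n_positions, n_squares))  # stand-in data

probe = nn.Linear(d_model, n_squares * n_states)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(1000):
    logits = probe(activations).view(n_positions, n_squares, n_states)
    loss = loss_fn(logits.reshape(-1, n_states), board_labels.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

# High probe accuracy on held-out positions (well above chance) is the kind of
# evidence that the board state is encoded as directions in activation space.
```

The point of keeping the probe purely linear is that it tests the linear representation hypothesis directly: if you needed a nonlinear probe to recover the board, the evidence for "features as directions in space" would be much weaker.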

50

dancingnightly t1_jebf9zn wrote

Incredibly interesting, given that humans represent some quantities this way too (numbers span left-to-right in the brain).

17

andreichiffa t1_jec26vk wrote

Which is basically the self-attention mechanism plus the universal-approximator nature of NNs. So I'm not sure what that proves or what's new about it.

5