t1_j8tud6r wrote

I don’t think this is true. I believe you’re describing a Markov chain, and this is far more sophisticated and capable than that. You’re right, though, that it cannot distinguish between true and false.
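
To make the comparison concrete, here is a minimal sketch of a word-level Markov chain text generator. The corpus, function names, and seed are made up for illustration. The point is only that each next word depends on nothing but the current word, which is the limitation being contrasted with the model under discussion.

```python
import random
from collections import defaultdict


def build_bigram_chain(text):
    """Map each word to the list of words that followed it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain, start, length, seed=0):
    """Walk the chain: the next word is chosen using only the current word."""
    rng = random.Random(seed)
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # dead end: this word was never followed by anything
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)


# Hypothetical toy corpus, chosen only for the example.
corpus = "the cat sat on the mat and the cat ate the fish"
chain = build_bigram_chain(corpus)
print(generate(chain, "the", 8))
```

A chain like this has no memory beyond one word, so it cannot keep track of a question asked several sentences earlier.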

10

t1_j8v4i1n wrote

It is in fact not true.

3

t1_j8vamcu wrote

2

t1_j8vnlyo wrote

I think you must be getting confused because of the "reward predictor". The reward predictor is a separate model which is used in training to reduce the amount of human effort needed to train the main model. Think of it as an amplifier for human feedback. Prediction is not what the model being trained does.
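
As a rough illustration of the idea, here is a toy reward predictor fitted to pairwise human preferences. Everything in it is an assumption made for the sketch: the two hand-made features, the linear scoring function, and the logistic (Bradley-Terry style) pairwise loss. A real reward predictor would be a neural network over text, not this.

```python
import math


def score(weights, features):
    """Predicted reward: a plain weighted sum of the features."""
    return sum(w * x for w, x in zip(weights, features))


def train_reward_predictor(comparisons, n_features, lr=0.5, epochs=200):
    """Fit weights so that human-preferred items score higher.

    comparisons: list of (preferred_features, rejected_features) pairs,
    each pair standing for one human judgement.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            margin = score(weights, preferred) - score(weights, rejected)
            # Modelled probability that the human picks `preferred`.
            p = 1.0 / (1.0 + math.exp(-margin))
            # Gradient ascent on log(p) with respect to the weights.
            for i in range(n_features):
                weights[i] += lr * (1.0 - p) * (preferred[i] - rejected[i])
    return weights


# Hypothetical features per response: [helpfulness cue, rudeness cue].
human_comparisons = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.3, 0.9]),
    ([0.9, 0.2], [0.2, 0.2]),
]
weights = train_reward_predictor(human_comparisons, n_features=2)

# The fitted predictor can now score responses no human has rated.
new_polite = [0.9, 0.1]
new_rude = [0.1, 0.9]
print(score(weights, new_polite) > score(weights, new_rude))
```

Once fitted on a small number of human judgements, the predictor can rank many new responses without asking a human again, which is the amplifier role described above.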

1

t1_j8xd81i wrote

Yes, I see the meanings as different, because I was thinking the context of the question would bias the result.

1