
currentscurrents t1_ja5n5xi wrote

Those are all internal rewards, which your brain creates because it knows (according to the world model) that these events lead to real rewards. It can only do this because it has learned to predict the future.
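A rough analogy from RL, not a claim about how brains actually do it: a learned value function V(s) plays the role of that internal reward. Below is a minimal TD(0) sketch on a hypothetical 5-state chain where only the last step pays a real reward of 1. Once the values are learned, intermediate states are scored by the future reward they predict. For example, reaching state 2 pays nothing directly but is valued at about 0.9, because the model has learned it leads to the payoff. The chain, discount, and learning rate are all made-up illustration choices.

```python
# Minimal TD(0) sketch (illustrative assumption, not a brain model):
# a 5-state chain where only the final transition pays a real reward.
# The learned V(s) predicts discounted future reward, so intermediate
# states acquire value before any real reward is received there.

N_STATES = 5   # states 0..4; entering state 4 ends the episode
GAMMA = 0.9    # discount factor
ALPHA = 0.5    # learning rate


def run_episode(values: list[float]) -> None:
    """Walk deterministically from state 0 to the terminal state, applying TD(0)."""
    state = 0
    while state < N_STATES - 1:
        next_state = state + 1
        reward = 1.0 if next_state == N_STATES - 1 else 0.0
        # The terminal state has no future value.
        next_value = 0.0 if next_state == N_STATES - 1 else values[next_state]
        td_error = reward + GAMMA * next_value - values[state]
        values[state] += ALPHA * td_error
        state = next_state


def learn_values(episodes: int = 200) -> list[float]:
    """Run many episodes and return the learned state values."""
    values = [0.0] * N_STATES
    for _ in range(episodes):
        run_episode(values)
    return values


if __name__ == "__main__":
    for s, val in enumerate(learn_values()):
        print(f"V({s}) = {val:.3f}")
```

The values converge toward 0.729, 0.81, 0.9, 1.0 for states 0 to 3: each state is worth the discounted real reward it predicts.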

>PPO can handle this quite well.

"Quite well" is still trying random actions millions of times. World modeling allows you to learn from two orders of magnitude less data.
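A toy sketch of why a model saves data, with loud caveats: this uses plain TD(0) as a stand-in for model-free learning (it is not PPO), a hypothetical 10-state chain, and a deterministic world model. It does not reproduce any specific "two orders of magnitude" figure. The model-free learner uses each real transition once, as it arrives, so with learning rate 1 the reward creeps back one state per episode. The model-based learner stores transitions in a learned model and then replays them in planning sweeps without collecting new data, so a single real episode is enough to value the start state.

```python
# Toy comparison (illustrative assumptions): model-free TD(0) vs. planning
# over a learned deterministic model, on the same 10-state chain.

N_STATES = 10  # states 0..9; entering state 9 ends the episode with reward 1
GAMMA = 0.9


def is_terminal(s: int) -> bool:
    return s == N_STATES - 1


def trajectory() -> list[tuple[int, float, int]]:
    """One real episode that always steps right: (state, reward, next_state)."""
    steps = []
    for s in range(N_STATES - 1):
        nxt = s + 1
        reward = 1.0 if is_terminal(nxt) else 0.0
        steps.append((s, reward, nxt))
    return steps


def episodes_model_free() -> int:
    """TD(0), learning rate 1: each real transition is used once, in arrival order."""
    values = [0.0] * N_STATES
    episodes = 0
    while values[0] == 0.0:
        episodes += 1
        for s, r, nxt in trajectory():
            values[s] = r + (0.0 if is_terminal(nxt) else GAMMA * values[nxt])
    return episodes


def episodes_model_based() -> int:
    """Record transitions in a model, then plan with repeated Bellman sweeps."""
    model: dict[int, tuple[float, int]] = {}  # state -> (reward, next_state)
    values = [0.0] * N_STATES
    episodes = 0
    while values[0] == 0.0:
        episodes += 1
        for s, r, nxt in trajectory():
            model[s] = (r, nxt)
        # Planning: reuse remembered transitions; no new environment data.
        for _ in range(N_STATES):
            for s, (r, nxt) in model.items():
                values[s] = r + (0.0 if is_terminal(nxt) else GAMMA * values[nxt])
    return episodes


if __name__ == "__main__":
    print("model-free episodes: ", episodes_model_free())
    print("model-based episodes:", episodes_model_based())
```

Here the gap is N_STATES - 1 episodes versus one; in longer or stochastic tasks the ratio depends heavily on the environment and on how good the learned model is.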
