AmalgamDragon t1_j9zyyib wrote

> Rewards are sparse in the real world

This doesn't seem true. The only reason we aren't constantly getting negative rewards (e.g. pain, discomfort) is that we learn to generally avoid them.

currentscurrents t1_ja5isuz wrote

Imagine you need to cook some food. None of the steps of cooking give you any reward, you only get the reward at the end.

Pure RL will quickly teach you not to touch the burner, but it really struggles with tasks that involve planning or delayed rewards. Self-supervised learning helps with this by building a world model that you can use to predict future rewards.
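A minimal sketch of what "only getting the reward at the end" means for a learner. The five-step episode and the discount factor `gamma=0.9` are illustrative assumptions. With a single terminal reward, the only signal an early step receives is the discounted return flowing back from the final step. That return shrinks the further back you go, and the agent has to sample many full episodes to work out which early actions actually mattered.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t = sum_k gamma^k * r_{t+k} for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step accumulates the discounted future reward.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical cooking episode: five steps, reward only when the meal is done.
cooking_rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
print([round(g, 4) for g in discounted_returns(cooking_rewards)])
# → [0.6561, 0.729, 0.81, 0.9, 1.0]
```

A world model helps here because it can predict that return for states the agent hasn't fully played out yet, instead of waiting for the meal to finish.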

AmalgamDragon t1_ja5lz5b wrote

This really comes down to how 'reward' is defined. I think we likely disagree on that definition, with yours being a lot narrower than mine. For example, during cooking there is usually a point before the meal is done where it 'smells good', which is a reward. There may also be dopamine release when completing some of the steps (I don't know whether that's the case), but simply observing that a step is complete is rewarding for lots of folks.
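In RL terms, intermediate signals like "it smells good" or "this step is done" are a form of reward shaping. A minimal sketch using potential-based shaping, which adds `gamma * phi(s') - phi(s)` to each environment reward and is known to leave the optimal policy unchanged. The potential `phi` (fraction of cooking steps completed, with the terminal state set to 0) is a hypothetical choice for illustration, not something from the thread.

```python
def shaped_rewards(rewards, potentials, gamma=0.9):
    """Add F = gamma * phi(s') - phi(s) to each environment reward.

    potentials has one more entry than rewards: phi(s_0), ..., phi(s_T).
    """
    assert len(potentials) == len(rewards) + 1
    return [
        r + gamma * potentials[t + 1] - potentials[t]
        for t, r in enumerate(rewards)
    ]

# Hypothetical potential: fraction of cooking steps completed so far.
env_rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
phi = [0.0, 0.2, 0.4, 0.6, 0.8, 0.0]  # terminal state given potential 0
print([round(r, 2) for r in shaped_rewards(env_rewards, phi)])
# → [0.18, 0.16, 0.14, 0.12, 0.2]
```

Every step now gets a small positive signal for making progress, while the sparse "meal is done" reward still determines what the best policy is.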

> Pure RL will quickly teach you not to touch the burner, but it really struggles with tasks that involve planning or delayed rewards.

Depends on which algorithm you're using, but PPO can handle this quite well.
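Much of how PPO copes with delayed rewards comes from its advantage estimator; PPO implementations commonly use Generalized Advantage Estimation (GAE). A minimal sketch over one rollout, not any particular library's implementation. The `gamma` and `lam` defaults are typical assumed values, and `values` is assumed to hold critic estimates plus one bootstrap value at the end.

```python
def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    values has len(rewards) + 1 entries; the last is the bootstrap value V(s_T).
    dones[t] is True if the episode ended after step t.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; no bootstrapping past the end of an episode.
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        # Exponentially weighted sum of future TD errors.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages

# Sparse reward at the end; with an untrained (all-zero) critic and lam=1.0,
# the advantages reduce to plain discounted returns.
adv = gae_advantages([0.0, 0.0, 0.0, 1.0], [0.0] * 5,
                     [False, False, False, True], gamma=0.9, lam=1.0)
print([round(a, 4) for a in adv])
# → [0.729, 0.81, 0.9, 1.0]
```

Once the critic learns to predict the end-of-episode reward, the TD errors at earlier steps carry that information directly, which is what lets delayed rewards reach early actions without waiting for whole-episode returns.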
