Submitted by verbigratia t3_zsvsic in MachineLearning

Every lunar lander tutorial or example I've found so far uses deep RL. Is classical Q learning such an obviously bad idea that no-one bothers with it? I've had some success recently applying Q learning to lunar lander (converting the continuous observations into discrete values) and am surprised there aren't more tutorials about this approach. Am I missing something?

12

Comments


fnbr t1_j1bc8i3 wrote

The main problem with tabular Q-learning (I'm assuming that by classical, you mean tabular) is that for most environments that are interesting, the state space is massive, so we can't actually store all states in memory.

In particular for lunar lander, you have a continuous observation space, so you need to apply some sort of discretization; at that point, you might as well just use tile coding or some sort of other function approximator.
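For readers unfamiliar with tile coding, here is a minimal sketch of the idea, not a reference implementation. Several grids ("tilings") are laid over the same scaled observation, each shifted slightly, and the Q-value is the sum of one weight per tiling. The function names, the offset scheme, and the defaults (`n_tilings=4`, `bins=8`) are assumptions chosen for illustration. A full grid over all eight Lunar Lander dimensions grows quickly, so the sketch is shown on a two-dimensional input.

```python
import numpy as np

def tile_indices(obs, low, high, n_tilings=4, bins=8):
    """Return one active tile index per tiling for a continuous observation."""
    obs, low, high = (np.asarray(v, dtype=float) for v in (obs, low, high))
    scaled = np.clip((obs - low) / (high - low), 0.0, 1.0)  # map to [0, 1]
    dims = (bins + 1,) * len(scaled)  # one extra cell per axis absorbs the shift
    tiles_per_tiling = (bins + 1) ** len(scaled)
    active = []
    for t in range(n_tilings):
        offset = t / (n_tilings * bins)  # shift each tiling by a fraction of a tile
        coords = np.floor((scaled + offset) * bins).astype(int)
        coords = np.clip(coords, 0, bins)
        flat = int(np.ravel_multi_index(tuple(coords), dims))
        active.append(t * tiles_per_tiling + flat)  # unique index range per tiling
    return active

def q_value(weights, active, action):
    """Linear Q-value: sum of the weights of the active tiles."""
    return weights[action, active].sum()

def q_update(weights, active, action, target, alpha=0.1):
    """Move Q(s, a) toward target, splitting the step across the tilings."""
    td_error = target - q_value(weights, active, action)
    weights[action, active] += (alpha / len(active)) * td_error
    return td_error
```

Nearby observations share most of their active tiles, so an update to one state generalises to its neighbours, which a plain lookup table does not do.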

10

blimpyway t1_j1aqaza wrote

I don't know, what were your results?

5

abhisheknaik96 t1_j1blgcx wrote

I am sure the classic (linear) Q-learning algorithm can solve Lunar Lander when the state space is discretized using tile coding.

Are you saying you solved it using tabular Q-learning, that is, by learning about each discretized state independently? I would be curious to know what kind of discretization you used and how many training steps were required.

3

leocus4 t1_j1cigff wrote

In a paper, I used a decision tree with Q-learning to solve LunarLander. While it's not exactly what you asked, you can see a DT as a way to discretize the Q table, so basically that decision tree corresponds to a Q table with 5 discretized states.

If you're interested, I can expand on this explanation. Just let me know!

2

deepestdescent t1_j1b720n wrote

Out of interest, how many discrete states do you end up with? Surely this blows out as you explore the state space?

1

verbigratia OP t1_j1cxswc wrote

Thanks all.

I've only just started experimenting, but my approach so far has been to discretise the observation, with Q table shapes ranging from [8, 8, 8, 8, 8, 8, 2, 2, action_space.n] to [20, 20, 20, 20, 20, 20, 2, 2, action_space.n]. I train for 5k-10k episodes with a learning rate of 0.1 and a discount of 0.99.
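A rough sketch of that setup, for anyone who wants to try it. The table shape, learning rate and discount come from the description above. The clipping ranges in `LOW` and `HIGH` are hypothetical values picked for illustration, and `N_ACTIONS = 4` stands in for `action_space.n` in the discrete Lunar Lander environment.

```python
import numpy as np

N_ACTIONS = 4                      # stands in for action_space.n
BINS = [8, 8, 8, 8, 8, 8, 2, 2]    # six continuous components, two leg-contact flags
ALPHA, GAMMA = 0.1, 0.99           # learning rate and discount from the post

# Hypothetical clipping ranges for x, y, vx, vy, angle, angular velocity.
LOW = np.array([-1.0, -0.5, -2.0, -2.0, -1.5, -2.0])
HIGH = np.array([1.0, 1.5, 2.0, 2.0, 1.5, 2.0])

q_table = np.zeros(BINS + [N_ACTIONS])

def discretize(obs):
    """Map an 8-dimensional observation to a tuple of table indices."""
    obs = np.asarray(obs, dtype=float)
    n = np.array(BINS[:6])
    frac = (np.clip(obs[:6], LOW, HIGH) - LOW) / (HIGH - LOW)
    idx = np.minimum((frac * n).astype(int), n - 1)  # top edge goes in the last bin
    legs = (obs[6:8] > 0.5).astype(int)
    return tuple(int(i) for i in idx) + tuple(int(i) for i in legs)

def q_update(q, state, action, reward, next_state, done):
    """One tabular Q-learning step."""
    best_next = 0.0 if done else q[next_state].max()
    target = reward + GAMMA * best_next
    q[state + (action,)] += ALPHA * (target - q[state + (action,)])
```

With 8 bins per continuous component the table has 8^6 * 2 * 2 = 1,048,576 states, or 4,194,304 entries across four actions. With 20 bins it has 256,000,000 states and 1,024,000,000 entries, which is roughly 8 GB as 64-bit floats.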

The results will not win any SpaceX contracts just yet but they do result in soft-ish landings between the flags more often than not.

I found hovering to be a problem so added some handling to exit the episode after around 500 steps.
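The step cap can be sketched roughly like this. The environment interface here (`reset()` returning an observation, `step(action)` returning `(obs, reward, done)`) is a simplified, hypothetical one rather than any particular library's signature, and `HoverEnv` is a made-up stub that never terminates, standing in for a lander that hovers forever.

```python
def run_episode(env, choose_action, max_steps=500):
    """Run one episode, cutting it off after max_steps.

    Returns (total_reward, steps_taken, truncated).
    """
    obs = env.reset()
    total = 0.0
    for step in range(1, max_steps + 1):
        obs, reward, done = env.step(choose_action(obs))
        total += reward
        if done:
            return total, step, False   # episode ended on its own
    return total, max_steps, True       # hit the cap, e.g. endless hovering

class HoverEnv:
    """Hypothetical stub: every step costs a little and nothing ever ends."""
    def reset(self):
        return 0
    def step(self, action):
        return 0, -0.5, False
```

Returning the `truncated` flag separately lets the update rule treat a cut-off episode differently from a real landing or crash, if that turns out to matter.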

At this point, I normally start looking at what others have done, and was surprised not to see more examples demonstrating tabular Q learning in this scenario (despite the issues with the continuous observation space).

Will look at deep RL next but found it interesting to try the tabular approach first.

Edit: grammar

1