Frozen Lake with Q-Learning!


In the last few weeks, we've written two simple games in Haskell: Frozen Lake and Blackjack. These games are both toy examples from the OpenAI Gym. Now that we've written the games, it's time to explore more advanced ways to write agents for them.

In this article, we'll explore the concept of Q-Learning. We've talked about this idea on the MMH blog before, but now we'll see it in action in a simpler context. We'll write a little bit of Python code, following some examples for Frozen Lake. Then we'll try to implement the same ideas in Haskell. Along the way, we'll see more patterns emerge about our games' interfaces.

We won't be using TensorFlow in this article. But we'll soon explore ways to augment our agent's capabilities with this library! To learn about Haskell and TensorFlow, download our TensorFlow guide!

Making a Q-Table

Let's start by taking a look at this basic Python implementation of Q-Learning for Frozen Lake. This will show us the basic ideas of Q-Learning. We start out by defining a few global parameters, as well as Q, a variable that will hold a table of values.

import gym
import numpy as np

epsilon = 0.9
min_epsilon = 0.01
decay_rate = 0.9
total_episodes = 10000
max_steps = 100
learning_rate = 0.81
gamma = 0.96

env = gym.make('FrozenLake-v0')
Q = np.zeros((env.observation_space.n, env.action_space.n))

Recall that our environment has an action space and an observation space. For this basic version of the Frozen Lake game, an observation is a discrete integer value from 0 to 15, representing the tile our character is standing on. The action space is an integer from 0 to 3, one for each of the four directions we can move. So our "Q-table" will be an array with 16 rows and 4 columns.

How does this help us choose our move? Well, each cell in this table has a score. This score tells us how good a particular move is for a particular observation state. So we could define a choose_action function in a simple way like so:

def choose_action(observation):
  return np.argmax(Q[observation, :])

This will look at the different values in the row for this observation and choose the index of the highest value. So if the first value (index 0) in this row is the highest, we'll return 0, indicating we should move left. If the second value is highest, we'll return 1, indicating a move down.

But we don't want to choose our moves deterministically! Our Q-Table starts out in the "untrained" state. And we need to actually find the goal at least once to start back-propagating rewards into our maze. This means we need to build some kind of exploration into our system. So each turn, we can make a random move with probability epsilon.

def choose_action(observation):
  action = 0
  if np.random.uniform(0, 1) < epsilon:
    action = env.action_space.sample()
  else:
    action = np.argmax(Q[observation, :])
  return action

As we learn more, we'll diminish the exploration probability. We'll see this below!

Updating the Table

Now, we also want to be able to update our table. To do this, we'll write a function that follows the Q-learning rule. It will take two observations, the reward for the second observation, and the action we took to get there.

def learn(observation, observation2, reward, action):
  prediction = Q[observation, action]
  target = reward + gamma * np.max(Q[observation2, :])
  Q[observation, action] = prediction + \
                           learning_rate * (target - prediction)

For more details on what happens here, read our Q-Learning primer. But there's one general rule.

Suppose we move from observation O1 to observation O2 with action A. We want the Q-table value for the pair (O1, A) to move closer to the best value we can get from O2, plus whatever reward we pick up along the way. Reaching the goal square gives a reward of 1, so the squares near the goal should end up with values close to this reward!
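In symbols, the update our learn function performs is the standard Q-Learning rule, written here with the variable names from the code:

Q[O1, A] <- Q[O1, A] + learning_rate * (reward + gamma * max(Q[O2, :]) - Q[O1, A])

The learning_rate controls how far we move toward the target, and gamma discounts the value of future squares.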

Playing the Game

Playing the game now is straightforward, following the examples we've done before. We'll have a certain number of episodes. Within each episode, we make our move, and use the reward to "learn" for our Q-table.

for episode in range(total_episodes):
  obs = env.reset()
  t = 0
  if episode % 100 == 99:
    epsilon *= decay_rate
    epsilon = max(epsilon, min_epsilon)

  while t < max_steps:
    action = choose_action(obs)
    obs2, reward, done, info = env.step(action)
    learn(obs, obs2, reward, action)
    obs = obs2
    t += 1

    if done:
      if reward > 0.0:
        print("Win")
      else:
        print("Lose")
      break

Notice also how we drop the exploration rate epsilon every 100 episodes. We can run this, and we'll observe that we lose a lot at first. But by the end, we're winning more often than not! At the end of all the episodes, it's a good idea to save the Q-table in some sensible way.

Haskell: Adding a Q-Table

To translate this into Haskell, we first need to account for our new pieces of state. Let's extend our environment type to include two more fields. One will be for our Q-table. We'll use an array for this, since it gives us convenient syntax for accessing and updating elements. The other will be the current exploration rate:

data FrozenLakeEnvironment = FrozenLakeEnvironment
  { ...
  , qTable :: A.Array (Word, Word) Double
  , explorationRate :: Double
  }
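For reference, here's one way we might build the all-zeros starting table, mirroring the numpy.zeros call from the Python version. The name initialQTable is just for illustration (it isn't part of the article's code), and it assumes the basic 4x4 board, so 16 observations and 4 actions:

import qualified Data.Array as A

-- An all-zeros Q-table for the 4x4 board: 16 observations, 4 actions.
initialQTable :: A.Array (Word, Word) Double
initialQTable = A.listArray ((0, 0), (15, 3)) (replicate 64 0.0)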

Now we'll want to write two primary functions. First, we'll want to choose our action using the Q-Table. Second, we want to be able to update the Q-Table so we can "learn" a good path.

Both of these will use this helper function. It takes an Observation and the current Q-Table and produces the best score we can get from that location, together with the table index that achieves it (from which we can recover the action). Note the use of a tuple section to produce the indices; this requires the TupleSections extension.

maxScore ::
  Observation ->
  A.Array (Word, Word) Double ->
  (Double, (Word, Word))
maxScore obs table = maximum valuesAndIndices
  where
    indices = (obs, ) <$> [0..3]
    valuesAndIndices = (\i -> (table A.! i, i)) <$> indices
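As a quick sanity check, here's what this gives on the all-zeros starting table (using the hypothetical initialQTable from above). Every score is 0.0, so the tuple comparison falls back to comparing indices, and unlike numpy.argmax (which returns the first maximum), we get the last one:

>>> maxScore 0 initialQTable
(0.0,(0,3))

This tie-breaking difference doesn't matter much once the table contains real values.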

Using the Q-Table

Now let's see how we produce our actions using this table. As with most of our state functions, we'll start by retrieving the environment. Then we'll get our first roll to see if this is an exploration turn or not.

chooseActionQTable ::
  (MonadState FrozenLakeEnvironment m) => m Action
chooseActionQTable = do
  fle <- get
  let (exploreRoll, gen') = randomR (0.0, 1.0) (randomGenerator fle)
  if exploreRoll < explorationRate fle
    ...

If we're exploring, we do another random roll to pick an action and replace the generator. Otherwise we'll get the best scoring move and derive the Action from the returned index. In both cases, we use toEnum to turn the number into a proper Action.

chooseActionQTable ::
  (MonadState FrozenLakeEnvironment m) => m Action
chooseActionQTable = do
  fle <- get
  let (exploreRoll, gen') = randomR (0.0, 1.0) (randomGenerator fle)
  if exploreRoll < explorationRate fle
    then do
      let (actionRoll, gen'') = randomR (0, 3) gen'
      put $ fle { randomGenerator = gen'' }
      return (toEnum actionRoll)
    else do
      let maxIndex = snd $ snd $
                       maxScore (currentObservation fle) (qTable fle)
      put $ fle { randomGenerator = gen' }
      return (toEnum (fromIntegral maxIndex))
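In both branches, toEnum relies on the Enum instance for the Action type we defined in the earlier articles. As a reminder, it looks roughly like this (the constructor names here are assumptions for illustration; the important part is that the derived Enum instance maps 0, 1, 2, and 3 to left, down, right, and up, just like the Gym environment):

data Action = MoveLeft | MoveDown | MoveRight | MoveUp
  deriving (Show, Eq, Enum)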

The last big step is to write our learning function. Remember this takes two observations, a reward, and an action. We start by getting our predicted value for the original observation. That is, what score did we expect when we made this move?

learnQTable :: (MonadState FrozenLakeEnvironment m) =>
  Observation -> Observation -> Double -> Action -> m ()
learnQTable obs1 obs2 reward action = do
  fle <- get
  let q = qTable fle
      actionIndex = fromIntegral . fromEnum $ action
      prediction = q A.! (obs1, actionIndex)
  ...

Now we specify our target. This combines the reward (if any) and the greatest score we can get from our new observed state. We use these values to get a newValue, which we put into the Q-Table at the original index. Then we put the new table into our state.

learnQTable :: (MonadState FrozenLakeEnvironment m) =>
  Observation -> Observation -> Double -> Action -> m ()
learnQTable obs1 obs2 reward action = do
  fle <- get
  let q = qTable fle
      actionIndex = fromIntegral . fromEnum $ action
      prediction = q A.! (obs1, actionIndex)
      target = reward + gamma * (fst $ maxScore obs2 q)
      newValue = prediction + learningRate * (target - prediction)
      newQ = q A.// [((obs1, actionIndex), newValue)]
  put $ fle { qTable = newQ }
  where
    gamma = 0.96
    learningRate = 0.81

And just like that, we're pretty much done! We can slide these new functions right into our existing functions!

Conclusion

The rest of the code is straightforward enough. We make a couple of tweaks as necessary to our gameLoop so that it actually calls our training function. Then we just update the exploration rate at appropriate intervals. Take a look at our code on Github for more details! This week's code is in FrozenLake2.hs.
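Here's a minimal sketch of what that exploration rate update might look like. The function name decayEpsilon is hypothetical, and the constants simply mirror the Python parameters above; the actual tweaks are in FrozenLake2.hs:

import Control.Monad (when)
import Control.Monad.State (MonadState, modify)

-- Every 100 episodes, multiply epsilon by the decay rate (0.9),
-- but never let it fall below the minimum (0.01).
decayEpsilon :: (MonadState FrozenLakeEnvironment m) => Int -> m ()
decayEpsilon episodeNum = when (episodeNum `mod` 100 == 99) $
  modify $ \fle -> fle
    { explorationRate = max 0.01 (explorationRate fle * 0.9) }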

We've now got an agent that can play Frozen Lake coherently using Q-Learning! Next time, we'll try to adapt this agent for Blackjack as well. We'll see the similarities between the two games. Then we'll start formulating some ideas to combine the approaches.
