Generalizing Our Environments

In our previous episode, we used Q-Learning to find a solution for the Frozen Lake scenario. We also have a Blackjack game that shares a lot of core ideas with Frozen Lake.

So in this part, we'll start by applying our Q-Learning solution to the Blackjack game. This will highlight the similarities in the code between the two games, but we'll also see a few differences. The similarities will lead us to create a typeclass for our environment concept, and each difference between the two systems will suggest an expression that must be part of that class. Let's explore the implications of this.

Adding to the Environment

Once again, we will need to express our Q-table and the exploration rate as part of the environment. But this time, the index of our Q-Table will need to be a bit more complex. Remember that our observation now has three different parts: the player's score, whether the player has an ace, and the dealer's show-card. We can turn each of these into a Word and combine them with the action itself. This gives us an index with four Word values.

We want to give this array bounds that match the highest possible value in each of those fields.

data BlackjackEnvironment = BlackjackEnvironment
  { ...
  , qTable :: A.Array (Word, Word, Word, Word) Double
  , explorationRate :: Double
  } deriving (Show)

basicEnv :: IO BlackjackEnvironment
basicEnv = do
  gen <- Rand.getStdGen
  let (d, newGen) = shuffledDeck gen
  return $ BlackjackEnvironment
    ...
    (A.listArray ((0,0,0,0), (30, 1, 12, 1)) (repeat 0.0))
    1.0

While we're at it, let's create a function to turn an Observation/Action combination into an index.

makeQIndex :: BlackjackObservation -> BlackjackAction
  -> (Word, Word, Word, Word)
makeQIndex (BlackjackObservation pScore hasAce dealerCard) action =
  ( pScore
  , if hasAce then 1 else 0
  , fromIntegral . fromEnum $ dealerCard
  , fromIntegral . fromEnum $ action
  )

With the help of this function, it's pretty easy to reuse most of our code from last time! The action choice function and the learning function look almost the same, so review last week's article (or the code on GitHub) for the details.
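
For reference, here's a rough sketch of what the learning function might look like with our new index. The hyperparameter values (learningRate and gamma) are only illustrative, and we're assuming BlackjackAction derives Enum and Bounded so we can enumerate every possible action:

learnQTable :: (Monad m) =>
  BlackjackObservation -> BlackjackObservation -> Double -> BlackjackAction ->
  StateT BlackjackEnvironment m ()
learnQTable obs1 obs2 reward action = do
  env <- get
  let q = qTable env
      actionIndex = makeQIndex obs1 action
      -- Our current estimate for this observation/action pair
      prediction = q A.! actionIndex
      -- The best value available from the new observation
      futureValue = maximum
        [q A.! makeQIndex obs2 a | a <- [minBound .. maxBound]]
      target = reward + gamma * futureValue
      newValue = prediction + learningRate * (target - prediction)
      newQ = q A.// [(actionIndex, newValue)]
  put $ env { qTable = newQ }
  where
    gamma = 0.96        -- discount factor (illustrative)
    learningRate = 0.81 -- learning rate (illustrative)

The action choice function follows the same epsilon-greedy pattern as before: roll against the exploration rate, then either pick a random action or take the best action according to this same index.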

Using the Same Game Loop

With our basic functions out of the way, let's now turn our attention to the game loop and running functions. For the game loop, we don't have anything too complicated. It's a step-by-step process.

  1. Retrieve the current observation
  2. Choose the next action
  3. Use this action to step the environment
  4. Use our "learning" function to update the Q-Table
  5. If we're done, return the reward. Otherwise recurse.

Here's what it looks like. Recall that we're taking our action choice function as an input. All our functions live in the same monad, so this is pretty easy.

gameLoop :: (MonadIO m) =>
  StateT BlackjackEnvironment m BlackjackAction ->
  StateT BlackjackEnvironment m (BlackjackObservation, Double)
gameLoop chooseAction = do
  oldObs <- currentObservation <$> get
  newAction <- chooseAction
  (newObs, reward, done) <- stepEnv newAction
  learnQTable oldObs newObs reward newAction
  if done
    then do
      if reward > 0.0
        then liftIO $ putStrLn "Win"
        else liftIO $ putStrLn "Lose"
      return (newObs, reward)
    else gameLoop chooseAction

Now to produce our final output and run game iterations, we need a little wrapper code. We create (and reset) our initial environment. Then we pass it to an action that runs the game loop and reduces the exploration rate when necessary.

playGame :: IO ()
playGame = do
  env <- basicEnv
  env' <- execStateT resetEnv env
  void $ execStateT stateAction env'
  where
    numEpisodes = 10000
    decayRate = 1.0
    minEpsilon = 0.01

    stateAction :: StateT BlackjackEnvironment IO ()
    stateAction = do
      rewards <- forM [1..numEpisodes] $ \i -> do
        resetEnv
        when (i `mod` 100 == 99) $ do
          bje <- get
          let e = explorationRate bje
          let newE = max minEpsilon (e * decayRate)
          put $ bje { explorationRate = newE }
        (_, reward) <- gameLoop chooseActionQTable
        return reward
      lift $ print (sum rewards)

Now we can play our game! Even with learning, we'll still only get around 40% of the points available. Blackjack is a tricky, luck-based game, so this isn't too surprising.

Constructing a Class

Now if you look very carefully at the code above, it should almost work for Frozen Lake as well! We'd only need to adjust a few names and types. This tells us there's a general structure shared between our different games. And we can capture that structure with a class.

Let's look at the common elements between our environments. These are all functions we call from the game loop or runner:

  1. Resetting the environment
  2. Stepping the environment (with an action)
  3. Rendering the environment (if necessary)
  4. Applying some learning method to the new data
  5. Diminishing the exploration rate

So our first attempt at this class might look like this, showing only the most important functions:

class Environment e where
  resetEnv :: (Monad m) => StateT e m Observation
  stepEnv :: (Monad m) => Action
    -> StateT e m (Observation, Double, Bool)
  renderEnv :: (MonadIO m) => StateT e m ()
  learnEnv :: (Monad m) =>
    Observation -> Observation -> Double -> Action -> StateT e m ()

instance Environment FrozenLakeEnvironment where
  ...

instance Environment BlackjackEnvironment where
  ...

We can make two clear observations about this class. First, we need to generalize the Observation and Action types! These are different in our two games and this isn't reflected above. Second, we're forcing ourselves to use the State monad over our environment. This isn't necessarily wise. It might force us to add extra fields to the environment type that don't belong there.

The solution to the first issue is to give this class associated type families. Then we can associate the proper data types for observations and actions with each environment. The solution to the second issue is that our class should be over a monad instead of the environment itself.

Remember, a monad provides the context in which a computation takes place. So in our case, our game, with all its stepping and learning, is that context!

Doing this gives us more flexibility for figuring out what data should live in which types. It makes it easier to separate the game's internal state from auxiliary state, like the exploration rate.

Here's our second try, with associated types and a monad.

newtype Reward = Reward Double

class (MonadIO m) => EnvironmentMonad m where
  type Observation m :: *
  type Action m :: *
  resetEnv :: m (Observation m)
  currentObservation :: m (Observation m)
  stepEnv :: (Action m) -> m (Observation m, Reward, Bool)
  renderEnv :: m ()
  learnEnv :: 
    (Observation m) -> (Observation m) ->
    Reward -> (Action m) -> m () 
  explorationRate :: m Double
  reduceExploration :: Double -> Double -> m ()

There are a couple undesirable parts of this. Our monad has to be capable of IO to account for rendering. But it's possible for us to play the game without needing to render. In fact, it's also possible for us to play the game without learning!

So we can separate this into more typeclasses! We'll have two "subclasses" of our Environment. We'll make a separate class for rendering. This will be the only class that needs an IO constraint. Then we'll have a class for learning functionality. This will allow us to "run" the game in different contexts and limit the reach of these effects.

newtype Reward = Reward Double

class (Monad m) => EnvironmentMonad m where
  type Observation m :: *
  type Action m :: *
  currentObservation :: m (Observation m) 
  resetEnv :: m (Observation m)
  stepEnv :: (Action m) -> m (Observation m, Reward, Bool)

class (MonadIO m, EnvironmentMonad m) => 
  RenderableEnvironment m where
    renderEnv :: m ()

class (EnvironmentMonad m) => LearningEnvironment m where
  learnEnv ::
    (Observation m) -> (Observation m) ->
    Reward -> (Action m) -> m () 
  explorationRate :: m Double
  reduceExploration :: Double -> Double -> m ()
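
To make this more concrete, here's a minimal sketch of how Blackjack might fit the base class (we'll write the real instances next time). It assumes the TypeFamilies and GeneralizedNewtypeDeriving extensions, and the playerObservation field is a hypothetical stand-in for however the environment stores its current observation:

{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE GeneralizedNewtypeDeriving #-}

-- Wrap our existing state monad in a newtype...
newtype Blackjack a = Blackjack (StateT BlackjackEnvironment IO a)
  deriving (Functor, Applicative, Monad, MonadIO,
            MonadState BlackjackEnvironment)

-- ...and attach the observation and action types that belong with it.
instance EnvironmentMonad Blackjack where
  type Observation Blackjack = BlackjackObservation
  type Action Blackjack = BlackjackAction
  currentObservation = gets playerObservation
  resetEnv = ...  -- reshuffle the deck and deal a new hand
  stepEnv action = ...  -- our existing step logic, re-wrapped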

Conclusion

Next week we'll explore how to implement these classes for our different games! We'll end up with a totally generic function for playing the game. We'll have a version with learning and a version without!

The next step after this will be to attach more sophisticated learning mechanisms. Soon, we'll explore how to expand our Q-Learning beyond simple discrete states. The way to do this is to use tensors! So in a couple weeks, we'll explore how to use TensorFlow to construct a function for Q-Learning. To get ready, download our Haskell TensorFlow Guide!
