Setting Up Our Model with Look-Ahead

Last week we went over some of the basics of Temporal Difference (TD) learning. We explored a bit of the history, and compared it to its cousin, Q-Learning. Now let's start getting some code out there. Since there's a lot in common with Q-Learning, we'll want a similar structure.

This is at least the third different model we've defined over the course of this series, so we can now start to see some patterns in how we develop these algorithms. Here's a quick outline before we get started:

  1. Define the inputs and outputs of the system.
  2. Define the data model. This should contain the weight variables we are trying to learn. By including them in the model, we can output our results later. It should also contain important Session actions, such as training.
  3. Create the model.
  4. Run iterations using our model.

We'll follow this outline throughout the article!

If you're new to Haskell and machine learning, a lot of the code in this article won't make sense right away. You should start off a little easier with our Haskell AI Series, and download our Haskell Tensor Flow Guide.

Inputs and Outputs

For our world features, we'll stick with our hand-crafted feature set, but simplified. Recall that we selected 8 different features for every location our bot could move to. We'll stick with those 8 features, but we only need them for the bot's current location. We'll factor in look-ahead by advancing the world for each of our different moves, so the "features" of adjacent squares are irrelevant. This vectorization is easy enough to get using produceLocationFeatures:

vectorizeWorld8 :: World -> V.Vector Float
vectorizeWorld8 w = V.fromList (fromIntegral <$>
  [ lfOnActiveEnemy standStill
  , lfShortestPathLength standStill
  , lfManhattanDistance standStill
  , lfEnemiesOnPath standStill
  , lfNearestEnemyDistance standStill
  , lfNumNearbyEnemies standStill
  , lfStunAvailable standStill
  , lfDrillsRemaining standStill
  ])
  where
    -- Features for the bot's current location, with no move applied.
    standStill = produceLocationFeatures
      (playerLocation . worldPlayer $ w) w False

We also don't need to be as concerned about exploring the maze with this agent. We'll be defining what its possible moves are at every turn. This is a simple matter of using this function we have from our game:

possibleMoves :: World -> [PlayerMove]
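
To make the look-ahead concrete, here's a quick sketch of how possibleMoves combines with stepWorld from our game code to produce the candidate worlds we'll score. The helper name lookAheadWorlds is hypothetical; it's the same pattern we'll use inside runWorldIteration later on:

lookAheadWorlds :: World -> [World]
lookAheadWorlds w = fst . flip stepWorld w <$> possibleMoves w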

We should also take this opportunity to specify the dimensions of our network. We'll use 20 hidden units:

inputDimen :: Int64
inputDimen = 8

hiddenDimen :: Int64
hiddenDimen = 20

outputDimen :: Int64
outputDimen = 1

Define the Model

Now let's define our data model. As in the past, we'll use a dense (fully-connected) neural network with one hidden layer. This means we'll expose two sets of weights and biases:

data TDModel = TDModel
  { tdHiddenWeights :: Variable Float
  , tdHiddenBias :: Variable Float
  , tdOutputWeights :: Variable Float
  , tdOutputBias :: Variable Float
  ...
  }

We'll also have two different actions to take with our tensor graph, as we had with Q-Learning. The first will evaluate a single world state. The second will take both an expected score and the actual score for a world state, compare them, and train our model:

data TDModel = TDModel
  { ...
  , tdEvaluateWorldStep :: TensorData Float -> Session (Vector Float)
  , tdTrainStep :: TensorData Float -> TensorData Float -> Session ()
  }

Building the Model

Now we need to construct this model. We'll start off as always by initializing random variables for our weights and biases. We'll also make a placeholder for our world input:

createTDModel :: Session TDModel
createTDModel = do
  (worldInputVector :: Tensor Value Float) <-
    placeholder (Shape [1, inputDimen])
  hiddenWeights <- truncatedNormal (vector [inputDimen, hiddenDimen])
    >>= initializedVariable
  hiddenBias <- truncatedNormal (vector [hiddenDimen])
    >>= initializedVariable
  outputWeights <- truncatedNormal (vector [hiddenDimen, outputDimen])
    >>= initializedVariable
  outputBias <- truncatedNormal (vector [outputDimen])
    >>= initializedVariable
  ...

Each layer of our dense network multiplies the input by the weight matrix and adds the bias vector. Between the layers, we'll apply relu activation. We conclude by defining our evaluation step, which runs the output layer with an input feed:

createTDModel :: Session TDModel
createTDModel = do
  ...
  let hiddenLayerResult = relu $
        (worldInputVector `matMul` (readValue hiddenWeights))
        `add` (readValue hiddenBias)
  let outputLayerResult =
        (hiddenLayerResult `matMul` (readValue outputWeights))
        `add` (readValue outputBias)
  let evaluateStep = \inputFeed -> runWithFeeds
        [feed worldInputVector inputFeed] outputLayerResult
  ...
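
As a quick sanity check on the dimensions (we feed a single world state at a time, matching the Shape [1, inputDimen] placeholder), the tensor shapes flow like this:

-- worldInputVector  : [1, inputDimen]            = [1, 8]
-- hiddenWeights     : [inputDimen, hiddenDimen]  = [8, 20]
-- hiddenLayerResult : [1, hiddenDimen]           = [1, 20]
-- outputWeights     : [hiddenDimen, outputDimen] = [20, 1]
-- outputLayerResult : [1, outputDimen]           = [1, 1]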

We'll leave the training step undefined for now. We'll work on that next time.

createTDModel :: Session TDModel
createTDModel = do
  ...
  return $ TDModel
    { tdHiddenWeights = hiddenWeights
    , tdHiddenBias = hiddenBias
    , tdOutputWeights = outputWeights
    , tdOutputBias = outputBias
    , tdEvaluateWorldStep = evaluateStep
    , tdTrainStep = undefined
    }
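
To see how the pieces fit together so far, here's a minimal usage sketch. The evaluateWorld helper is hypothetical and builds a fresh, untrained model, just to show that the types line up (runSession and encodeTensorData come from the TensorFlow library):

evaluateWorld :: World -> IO (Vector Float)
evaluateWorld w = runSession $ do
  model <- createTDModel
  let worldData = encodeTensorData
        (Shape [1, inputDimen]) (vectorizeWorld8 w)
  tdEvaluateWorldStep model worldData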

Running World Iterations

Much of the skeleton and support code remains the same from Q-Learning. But let's go over the details of running a single iteration on one of our worlds. This function will take our model as a parameter, as well as a random move chance. (Recall that adding randomness to our moves will help us avoid a stagnant model). It will be stateful over the World and a random generator.

runWorldIteration :: Float -> TDModel
  -> StateT (World, StdGen) Session Bool
runWorldIteration randomChance model = do
  ...

We'll start off by getting all the possible moves from our current position. We'll step the world forward for each one of these moves. Then we'll feed the resulting worlds into our model. This will give us the scores for every move:

runWorldIteration :: Float -> TDModel
  -> StateT (World, StdGen) Session Bool
runWorldIteration randomChance model = do
  (currentWorld, gen) <- get
  let allMoves = possibleMoves currentWorld
  let newWorlds = fst <$> map ((flip stepWorld) currentWorld) allMoves
  (allScores :: Vector Float) <-
    Data.Vector.fromList <$> (forM newWorlds $ \w -> do 
      let worldData = encodeTensorData
            (Shape [1, inputDimen]) (vectorizeWorld8 w)
      scoreVector <- lift $ (tdEvaluateWorldStep model) worldData
      return $ Data.Vector.head scoreVector)
  ...

Now we need to take a similar action to what we had with Q-Learning. We'll roll the dice, and either select the move with the best score or pick a random index:

runWorldIteration :: Float -> TDModel
  -> StateT (World, StdGen) Session Bool
runWorldIteration randomChance model = do
  ...
  let (chosenIndex, newGen) = bestIndexOrRandom allScores gen
  ...
  where
    bestIndexOrRandom :: Vector Float -> StdGen -> (Int, StdGen)
    bestIndexOrRandom scores gen =
      let (randomMoveRoll, gen') = randomR (0.0, 1.0) gen
          (randomIndex, gen'') = randomR (0, Data.Vector.length scores - 1) gen'
      in  if randomMoveRoll < randomChance
            then (randomIndex, gen'')
            else (maxIndex scores, gen')

Now that we have our "chosen" move and its score, we'll encode that score as data to pass to the training step. The exception is when the game ends: in that case, we have a "true" score of 1 or 0 to give. While we're at it, we can also calculate the continuationAction, which either returns a boolean to end the game or loops again by calling runWorldIteration.

runWorldIteration :: Float -> TDModel
  -> StateT (World, StdGen) Session Bool
runWorldIteration randomChance model = do
  ...
  let nextWorld = newWorlds !! chosenIndex
  put (nextWorld, newGen)
  let (chosenNextScore, continuationAction) =
        case worldResult nextWorld of
          GameLost -> (0.0, return False)
          GameWon -> (1.0, return True)
          GameInProgress -> ( allScores ! chosenIndex
                            , runWorldIteration randomChance model)
  let nextScoreData = encodeTensorData
        (Shape [1]) (Data.Vector.singleton chosenNextScore)
  ...

We'll also encode the evaluation of our current world. Then we'll pass these values to our training step, and run the continuation!

runWorldIteration :: Float -> TDModel
  -> StateT (World, StdGen) Session Bool
runWorldIteration randomChance model = do
  ...
  let currentWorldData = encodeTensorData
        (Shape [1, inputDimen]) (vectorizeWorld8 currentWorld)
  currentScoreVector <- lift $
    (tdEvaluateWorldStep model) currentWorldData
  let currentScoreData = encodeTensorData
        (Shape [1]) currentScoreVector

  lift $ (tdTrainStep model) nextScoreData currentScoreData

  continuationAction
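
Putting this iteration inside a full training loop mirrors our Q-Learning skeleton. Here's a rough sketch; the name trainGames is hypothetical, and each game starts from a fresh copy of the initial world with a new random generator:

-- Uses forM_ and void from Control.Monad, evalStateT from
-- Control.Monad.State, liftIO from Control.Monad.IO.Class,
-- and newStdGen from System.Random.
trainGames :: Float -> TDModel -> World -> Int -> Session ()
trainGames randomChance model initialWorld numGames =
  forM_ [1..numGames] $ \_ -> do
    gen <- liftIO newStdGen
    void $ evalStateT
      (runWorldIteration randomChance model)
      (initialWorld, gen)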

What's Next?

We've now got the basic framework set up for our TD agent. Next time, we'll start digging into the actual formula we use to learn the weights. It's a little more complicated than the loss functions we've dealt with in the past.

If you want to get started with Haskell and Tensor Flow, download our Haskell Tensor Flow Guide. It will help you learn the basics of this complicated library!
