Refine Your Sparse PySC2 Agent

Steven Brown · Published in ITNEXT · 4 min read · Apr 3, 2018


In my last tutorial I showed you how to build a PySC2 agent that learned from sparse rewards. That agent was able to win around 25% of the time, but would lose almost 50% of the time.

In this tutorial we will build on the previous agent, and with a few small changes we will be able to boost the win rate to over 70%.

1. Ignore Learning When State Does Not Change

While working on the previous agent, I output the Q-table data to a CSV:
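If you want to inspect the Q-table the same way, here is a minimal sketch, assuming the QLearningTable keeps its values in a pandas DataFrame named q_table (as in the previous tutorial) and that the agent holds the table as self.qlearn:

# dump the learned values so they can be opened in a spreadsheet;
# q_table is assumed to be the pandas DataFrame from the previous tutorial
self.qlearn.q_table.to_csv('sparse_agent_q_table.csv')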

I refreshed the CSV as the agent learned, and came up with a few theories as to how it could be improved.

The first thing I noticed is that because my state was so simple, it was very common for the agent to take an action and land back in the same state. Each time the agent progresses from one state to the next, the learning step takes the maximum value of the next state, discounts it, and then moves the value for the chosen action in the previous state a little bit towards that target.

While this is generally not a huge issue when an action moves the agent from one state to the next, when the agent frequently lands back in the same state it tends to push the values of less valuable actions up towards the most valuable action, while the most valuable action gets pulled downwards. Over time, all of the action values for a state will drift towards zero if actions leave the agent in the same state more often than they move it to a significantly different one.

In order to counter this, we can simply prevent the agent from learning any time the action does not alter the state. The code is very simple: just add two lines at the start of the learn() method of the QLearningTable:
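A sketch of what that looks like, assuming the learn(s, a, r, s_) signature from the previous tutorial:

def learn(self, s, a, r, s_):
    # skip the update entirely if the action left the state unchanged
    if s == s_:
        return

    # ... the rest of the method continues as before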

This will instantly abort the learning step when the states are identical.

2. Prevent Invalid Actions

Watching the agent, I noticed that on several occasions it would get caught repeatedly trying invalid actions. Quite often only half of the possible actions were actually usable, so the agent spent time attempting to learn from actions that could not produce any result.

By filtering these invalid actions we can keep the agent focused on trying actions that should lead to a change in state, reducing exploration and improving the learning time.

First we modify the constructor of QLearningTable:
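One plausible sketch of the change, with the constructor arguments from the previous tutorial; keeping the exclusions in a dict keyed by state is an assumption that makes the later lookup in learn() easy:

import numpy as np
import pandas as pd

class QLearningTable:
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        self.actions = actions          # list of action indices
        self.lr = learning_rate
        self.gamma = reward_decay
        self.epsilon = e_greedy
        self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64)
        self.disallowed_actions = {}    # state -> actions that were invalid in that state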

On the last line we create a record of the disallowed actions, keyed by state so that learn() can look them up later.

Then, in the choose_action() method we accept a list of invalid actions for the given state:
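Sketched against the choose_action() method from the previous tutorial, the new parameter and the bookkeeping might look like this (the empty default list is an assumption):

def choose_action(self, observation, excluded_actions=[]):
    self.check_state_exist(observation)

    # remember which actions were invalid for this state, so that learn()
    # can ignore them when this state later appears as the future state
    self.disallowed_actions[observation] = excluded_actions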

Then we filter those actions from the possible choices, so the agent will not take an invalid action:
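Continuing inside choose_action(), one way to do the filtering; the tie-breaking shuffle mirrors the previous tutorial, and pandas' Series.drop handles the exclusion:

    if np.random.uniform() < self.epsilon:
        # greedy: consider only the Q-values of the actions that are still valid
        state_action = self.q_table.loc[observation, :]
        state_action = state_action.drop(labels=excluded_actions, errors='ignore')

        # shuffle so ties between equal values are broken at random
        state_action = state_action.reindex(np.random.permutation(state_action.index))
        action = state_action.idxmax()
    else:
        # explore: pick a random action, but never an invalid one
        action = np.random.choice([a for a in self.actions if a not in excluded_actions])

    return action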

Then we make a very important change. Since the invalid actions never get chosen, their rewards never change. If they start at 0 then they could become the highest value action for that state if all other actions have negative values. To get around this, we filter the invalid actions from the future state’s rewards in the learn() method:
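Inside learn(), after the two-line guard from earlier, the future state's values can be filtered before taking the maximum. A sketch, with self.disallowed_actions keyed by state as assumed above and the 'terminal' marker carried over from the previous tutorial:

    q_predict = self.q_table.loc[s, a]

    # drop the actions that were invalid in the future state, so an
    # untouched value of 0 can never win the max over negative values
    s_rewards = self.q_table.loc[s_, :]

    if s_ in self.disallowed_actions:
        s_rewards = s_rewards.drop(labels=self.disallowed_actions[s_], errors='ignore')

    if s_ != 'terminal':
        q_target = r + self.gamma * s_rewards.max()
    else:
        q_target = r

    self.q_table.loc[s, a] += self.lr * (q_target - q_predict)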

Next we want to build a list of invalid actions. Inside the agent’s step() method, collect some data we will use to make our decisions:
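A sketch of the data gathered near the top of step(). The building counts follow the pixel-counting approach from the earlier tutorials, and the 'player' observation indices are the ones used by the pysc2 version from that series, so treat the exact constants as assumptions:

unit_type = obs.observation['screen'][_UNIT_TYPE]

depot_y, depot_x = (unit_type == _TERRAN_SUPPLY_DEPOT).nonzero()
supply_depot_count = int(round(len(depot_y) / 69))

barracks_y, barracks_x = (unit_type == _TERRAN_BARRACKS).nonzero()
barracks_count = int(round(len(barracks_y) / 137))

supply_used = obs.observation['player'][3]    # food used
supply_limit = obs.observation['player'][4]   # food cap
army_supply = obs.observation['player'][5]    # food used by army units
worker_supply = obs.observation['player'][6]  # food used by workers

supply_free = supply_limit - supply_used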

Now we start excluding actions. First of all, if we have reached our supply depot limit of 2, or we have no workers to build a supply depot, let's remove the agent's ability to build a supply depot:
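A sketch of that check, assuming excluded_actions starts as an empty list and that index 1 is the build-supply-depot entry in the previous tutorial's smart actions:

excluded_actions = []

if supply_depot_count == 2 or worker_supply == 0:
    # at the depot cap, or nobody available to build one
    excluded_actions.append(1)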

If we have no supply depots, or we have reached our barracks limit of 2, or we have no workers to build barracks, let’s not allow the agent to build a barracks:
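The barracks check follows the same pattern (index 2 is assumed to be the build-barracks action):

if supply_depot_count == 0 or barracks_count == 2 or worker_supply == 0:
    # no depot yet, already at the barracks cap, or no worker available
    excluded_actions.append(2)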

If we don’t have any barracks or we have reached our supply limit, we don’t want to train marines:
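And for marine training (index 3 assumed), using the free supply computed earlier:

if supply_free == 0 or barracks_count == 0:
    # supply blocked, or nowhere to train marines
    excluded_actions.append(3)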

Lastly if we have no marines, we don’t want to attack anything:
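Finally, with no marines there is nothing to attack with; the four attack-quadrant actions are assumed to sit at indices 4 through 7:

if army_supply == 0:
    # exclude every attack action while we have no marines
    excluded_actions.extend([4, 5, 6, 7])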

Now that we have our list of invalid actions, we can feed it in when choosing the action:
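With the updated choose_action() this is a one-line change (current_state and rl_action are the variables from the previous tutorial):

rl_action = self.qlearn.choose_action(str(current_state), excluded_actions)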

3. Add Our Unit Locations to the State

This one is less data driven and more of a theory. If our agent doesn’t know where its units are, how does it know which location might be the best to attack?

Adding our unit locations to the state is quite simple. First we increase the state size to 12:
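Assuming the state was built with numpy in the previous tutorial, the change is just the size of the array:

current_state = np.zeros(12)  # four extra slots for our own unit quadrants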

Next we use the same logic for identifying enemy locations, except we filter the minimap for our units:
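A sketch of that, mirroring the enemy-location code from the previous tutorial. _PLAYER_SELF is assumed to be 1 in the player_relative minimap layer, the minimap is assumed to be 64x64 (hence the division by 32), and base_top_left is the flag the agent already tracks:

green_squares = np.zeros(4)
friendly_y, friendly_x = (obs.observation['minimap'][_PLAYER_RELATIVE] == _PLAYER_SELF).nonzero()
for i in range(0, len(friendly_y)):
    y = int(math.ceil((friendly_y[i] + 1) / 32))
    x = int(math.ceil((friendly_x[i] + 1) / 32))

    # mark the quadrant (2x2 grid) that contains this friendly pixel
    green_squares[((y - 1) * 2) + (x - 1)] = 1

# mirror the quadrants so the state is always relative to our base corner
if not self.base_top_left:
    green_squares = green_squares[::-1]

for i in range(0, 4):
    current_state[i + 8] = green_squares[i]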

As with the enemy locations, this divides the minimap into four quadrants, setting the value to 1 for any quadrant containing friendly units.

4. Try It!

Go ahead and give it a run:

python -m pysc2.bin.agent \
--map Simple64 \
--agent refined_agent.SparseAgent \
--agent_race T \
--max_agent_steps 0 \
--norender

Believe it or not, my agent was able to achieve a 67% win rate (averaged across the previous 100 games) after just 512 games. It took 1965 games to reach 71%, but you can see from the graph at the start of this article that it was doing well most of the way. The previous agent peaked at 59%.

Here’s a graph to compare with the previous agent, showing the win rate over time rather than over the last 100 games:

As you can see, the win rate was still trending upwards.

All code from this tutorial can be found here.

If you like this tutorial, please support me on Patreon. Also please join me on Discord, or follow me on Twitch, Medium, GitHub, Twitter and YouTube.
