PROJECT STATUS - 5/26/2017

Video

PROJECT SUMMARY

The goal of this project is to have an agent survive for as long as possible in a 10x10 block environment with a maze and three zombies placed somewhere in the maze. The agent starts off in the middle of this environment. The zombies are placed at three pre-chosen spots. The agent must survive for as long as possible. See figure below for visualization:

Figure 1: Picture of 10x10 maze environment for Malmo-Minecraft agent. Black spots indicate walls. White spots indicate free path. Green spot indicates starting position of agent. Red spot indicates starting position of zombies.

APPROACH

We are using reinforcement learning to learn how to survive in the maze, specifically, we attempt to learn a Q-function to approximate a Q-table.

Markov Decison Process (MDP)

We first describe our Markov Decision Process.

Our reward function is -1000 for death.

For our states and actions, we start off with 10x10 = 100 grid positions, 4 possible discrete actions (movenorth 1, movesouth 1, movewest 1, moveeast 1), and 4 total entities. Using a Q-table, we would have to keep track of possibly 4 x 100⁴ possible state-action pairs (4 for each action and 100x100x100x100 for each possible permutation of positions of each entity). This is too large a number for us to expect to keep track of, so we approximate our state-action pairs with a list of 72 parameters

To represent the state in our approximation, we use 72 basis functions as features:

More precisely, we represent our state by breaking it down into 18 partitions: 9 dynamic partitions that move with the agent so that the agent is always at the center of this sub-space, and 9 stationary partitions that allow the agent to know which region of the environment the agent is in. We are currently experimenting with a couple different stationary sets of partitions. This is all visualized as followed in the figures below, along with descriptions for them:

Figure 2: Picture of 9 dynamic partitions as basis functions. Green spot indicates position of agent. Red spots indicate position of zombies. The basis function for each partiton returns an integer indicating how many zombies are in that particular partition. Possible integers include: 0,1,2,3.

Figure 3: Picture of 9 dynamic partitions as basis functions. Green spot indicates position of agent. The partitions move with the agent so that the agent is always at its center.
This allows us to keep track of the proximity of zombies to the agent.

Figure 4: Picture of 9 stationary partitions as basis functions. These partitions never move. Based off the “tile” basis functions from Sutton and Barto’s Reinforcement Learning: An Introduction . The basis function for each partiton returns an 1 if the agent is in it, 0 otherwise. They do not keep track of zombie positions.

Figure 5: Picture of 9 stationary partitions as basis functions. These partitions never move. These circular partitions have radius 2.50 unit blocks. Based off the “coarse” basis function from Sutton and Barto’s Reinforcement Learning: An Introduction . The basis function for each partiton returns an 1 if the agent is in it, 0 otherwise. They do not keep track of zombie positions.

For each of our 4 actions we allow a set of 18 basis functions, thus giving us 72 total basis functions and 72 total parameters.

So our MDP consists of reward function of -1000 for death and these 72 basis functions and parameters.

Algorithm

We now show the algorithm we use to learn the parameters for the above function approximation. A full description of the algorithm can be found in Sutton and Barto’s Reinforcement Learning: An Introduction chapter 8. One key thing to note is that epsilon and alpha are monotonically decreasing functions of k.

Figure 6: An online linear, gradient descent algorithm TD(lambda) for approximating Q(s,a) with eligibility tracing and an epsilon-greedy exploration.

EVALUATION

We now describe our evaluation process for our project. We break this evaluation into two main parts - with a maze and without a maze. Within these two parts, we elaborate on the performance of a baseline, non-intelligent agent and an intelligent agent with different basis functions. All of our individual evaluations are trained on 100 iterations/episodes. We did this because of time constraints with the deadline, but plan on expanding this number later.

For each episode, we measure the number of commands sent by the agent while it is alive. We then plot all of these data points for each evaluation onto a scatterplot and fit a 1-degree polynomial (using numpy) to the data. Be aware that each plot contains a lot of noise.

No Maze

To start our evaluation, we set our agent in a grid environment without a maze.

Figure 7: 10x10 maze environment without a maze. Green spots indicate starting position of agent. Red sports indicate starting position of zombies.

Baseline - Random

Our first agent consisted of completely random motions. We expected to see no improvement by the random moving agent and after running the program we see that performance slightly declines. We believe this may just be noise and that running this program with more episodes would show a more constant performance.

From the plots we see that the average number of commands alive each episode by the agent is approximately 23 commands. This will serve as a baseline to compare our “intelligent” agents with. The results can be seen in the figure below:

Figure 8: Scatter plot of random agent performance in a no maze environment with linear fit. We see negative performance in this plot over time.

Baseline - Handcode (Mob Fun Algorithm)

The second baseline we used was the algorithm used in mob_fun.py from the Python_Examples folder given by Malmo. We expected, again, to see no improvement by the agent since we are updating no parameters or Q-table and after running the program we see that (although, again, with a lot of background). The agent’s moves are a deterministic function of its position and the relative positions of the zombies.

From the plots we see the average performance of this agent is ~22-23 commands alive per episode. This is, surprisingly, on pace with the random moving agent. We believe that this may be so because the random moving agent is hard to catch because it moves quickly and randomly and that perhaps the optimal amount of commands any agent can stay alive before being converged on by the zombies is ~22-23 commands. This will also serve as a baseline to compare to our “intelligent” agent. The results can be seen in the figure below:

Figure 9: Scatter plot of handcoded agent performance in a no maze environment with linear fit. The agent shows no improvement and stays alive for about 22-23 commands per episode on average.

Learner - 9 Dynamic BFs + 9 Tile Stationary BFs

The first “intelligent” agent we evaluate is an agent using the Q-learning algorithm (figure 6) with 9 dynamic basis functions (figure 2) and 9 tile stationary basis functions (figure 4).

We expected to see improvement in this agent and hoped to see it meet or exceed the performance of the two baselines. From the plots, we see that performance did improve marginally as time continues, but the performance never caught up to that of the handcoded or random agent.

The agent stayed alive for ~15-16 commands per episode. The results can be seen in the figure below:

Figure 10: Scatter plot with linear fit of Q-learner with tile stationary BFs in no maze environment. The agents shows marginal improvement and stays alive for ~15-16 commands per episode.

Learner - 9 Dynamic BFs + 9 Coarse Stationary BFs (as seen in video)

The second “intelligent” agent we evaluate is an agent using the Q-learning algorithm (figure 6) with 9 dynamic basis functions (figure 2) and 9 coarse stationary basis functions (figure 5).

The performance of this agent did improve more considerably than the learner with the tile stationary BFs, but it still never caught up to the performance of the handcoded or random agent. We believe the improved performance of this learner over that of the tile learner can be perhaps because of the overlap in the coarse stationary BFs. This can possibly give the agent a more precise indication of his/her position and allow him/her to navigate the environment better.

The agent started out with about 17 commands alive per episode at the beginning of training and ended up at about 20 commands alive per episode. Looking at the agent in training (and as seen in the video), the agent is able to learn to manuver to the bottom right corner of the environment where the zombies are furthest away. Once there, it appears to try to avoid the zombies, but does not succeed. The results can be seen in the figure below:

Figure 11: Scatter plot with linear fit of Q-learner with coarse stationary BFs in a no maze environment. The agents does improvement and manages to stay alive for ~20-21 commands per episode by the end of training. We again see a lot of noise present in the data.

Learner - 9 Dynamic BFs

The final “intelligent” agent we evaluate is an agent using the Q-learning algorithm (figure 6) with 9 dynamic basis functions (figure 2) and 0 stationary BFs.

This was an interesting case because here the agent would only know the relative positions of zombies. It would have no knowledge of its own position in the environment.

The performance of this agent was poor and actually worsened over time. As we see in the plots, the agent starts off with ~17 commands per episode before death. But over time, the agents learns a set of parameters that lead it to ~14-15 commands per episode.

The results can be seen in the figure below:

Figure 12: Scatter plot with linear fit of Q-learner with no stationary BFs in a no maze environment. The agent shows regressing performance and manages to stay alive for only ~14-15 commands per episode by the end of training.