About

In this simulation, I train an agent to balance a ball that is launched onto a platform, keeping it away from the machinery below for 15 seconds. Although the task appears straightforward, it presents a couple of new challenges that haven't come up in my previous agents. This agent uses Q-Learning with linear function approximation and tile coding to handle the state space. You can take control of the platform with the toggle and use 'W' and 'D' to rotate the platform left and right.

The task: actions, states, and rewards

The agent receives a +1 reward per step and a -1000 reward if the ball crashes into the machinery. There are three possible actions: tilting the platform to the left, tilting it to the right, or doing nothing. The state, or what the agent sees, consists of 5 variables: the x and y positions of the ball, the x and y velocities of the ball, and the current tilt angle of the platform. Unlike my other agents, where all state variables were discrete (in other words, they could only be certain specific numbers), the state variables here are continuous. This means a variable could take any value between -8 and 8, like 3.0001 or -5.946. In my previous agents, I used a table where each state had its own entry. But because we're dealing with a continuous range here (meaning 1.23 and 1.24 are different, with endless possibilities in between), a table can't work. I needed a way to shrink the state space so it wasn't infinite, and this is where tile coding comes in.
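To make that layout concrete, here is a tiny Python sketch of the action set, reward scheme, and state shape. The names and example values are my own, not the simulation's actual code.

    ACTIONS = ("tilt_left", "tilt_right", "do_nothing")   # the 3 discrete actions

    def reward(crashed: bool) -> float:
        """+1 for every step the ball stays in play, -1000 if it hits the machinery."""
        return -1000.0 if crashed else 1.0

    # The state is 5 continuous values, each roughly in [-8, 8]:
    # (ball_x, ball_y, ball_vx, ball_vy, platform_angle)
    example_state = (3.0001, -5.946, 0.42, -0.17, 0.08)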

Tile Coding

Tile coding is a form of state aggregation, which lets us convert a continuous range into a manageable number of discrete sets. Consider the x-position, which can fall anywhere between -8 and 8. Instead of handling an infinite number of possible values, we can group them into, say, 16 distinct buckets: all values between -8 and -7 go into one bucket, values between -7 and -6 into another, and so on. This transforms an infinitely variable range into just 16 identifiable, manageable states. This agent uses 8 tiles, dividing each continuous range into 8 segments. With 5 variables, each divided into 8 tiles, we end up with 32,768 unique combinations (8^5). When we feed a state into the tile coding function, it returns a single number representing one of those combinations.

On top of tiles we use a technique known as 'tilings'. After dividing the variables into tiles, we do it again with the tile boundaries shifted slightly: instead of one tile from -8 to -7 and another from -7 to -6, the next set of tiles might run from -7.9 to -6.9 and from -6.9 to -5.9. In the original set, a value of -7.1 is treated the same as -7.8, even though there's a 0.7 difference, because they fall within the same tile, while -7.1 and -6.9 are treated as totally different since they land in different tiles, even though they are only 0.2 apart. Using several tilings with different shifts lets us handle these small numerical variations more gracefully; think of it as collecting slightly different views of the same state. This agent uses 8 tiles and 4 tilings, so when we feed a state into the tile coding function, it returns an array of 4 integers, each one indicating which unique combination was activated in its respective tiling.
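Below is a minimal tile-coding sketch under the assumptions above: 5 state variables, each clipped to roughly [-8, 8], 8 tiles per variable, and 4 tilings, each shifted by a fraction of a tile width. It is illustrative rather than the exact code this agent uses, and the names (tile_code, N_TILES, N_TILINGS) are my own.

    # A minimal tile coder (illustrative, not the agent's exact code).
    LOW, HIGH = -8.0, 8.0
    N_TILES = 8                                  # tiles per variable
    N_TILINGS = 4                                # number of offset tilings
    TILE_WIDTH = (HIGH - LOW) / N_TILES          # 2.0 units per tile

    def tile_code(state):
        """Return one integer per tiling, each identifying which of the
        8**5 = 32,768 tile combinations the state activates in that tiling."""
        indices = []
        for t in range(N_TILINGS):
            offset = t * TILE_WIDTH / N_TILINGS  # shift this tiling's boundaries
            combo = 0
            for value in state:
                value = max(LOW, min(value, HIGH))       # clip to the expected range
                tile = int((value - LOW + offset) / TILE_WIDTH)
                tile = min(tile, N_TILES - 1)            # keep the top edge in-bounds
                combo = combo * N_TILES + tile           # pack 5 tile indices in base 8
            indices.append(combo)
        return indices

    print(tile_code((3.0001, -5.946, 0.42, -0.17, 0.08)))  # -> 4 integers in [0, 32768)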

Linear function approximation

Since I am building up to using neural networks, I wanted to use linear function approximation for this environment. In this setup, every combination of tiles is given its own weight, for a total of 8^5, or 32,768, weights (8 tiles for each of the 5 variables). Because Q-Learning assigns a value to every state-action pair, this expands to 8^5 × 3 actions, or 98,304 weights in total. Every time we take an action, the weights for that action and the currently active tiles get updated to reflect the reward we just received.
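To sketch how those weights are used and updated, the snippet below builds on the tile_code function from the previous section. To keep the weight count at 8^5 × 3 as described, one weight table is shared across the 4 tilings; a separate table per tiling is an equally common choice. The step size and discount factor are assumed values, not necessarily what the real agent was trained with.

    import numpy as np

    N_ACTIONS = 3                    # tilt left, tilt right, do nothing
    ALPHA = 0.1 / N_TILINGS          # step size, scaled by the active tile count (assumed)
    GAMMA = 0.99                     # discount factor (assumed)

    weights = np.zeros((N_TILES ** 5, N_ACTIONS))   # 32,768 x 3 = 98,304 weights

    def q_value(active_tiles, action):
        """Q(s, a) is the sum of the weights of the active tiles for that action."""
        return sum(weights[tile, action] for tile in active_tiles)

    def q_learning_update(state, action, reward, next_state, done):
        """One Q-Learning step: nudge the active weights toward the TD target."""
        active = tile_code(state)
        if done:
            target = reward
        else:
            next_active = tile_code(next_state)
            target = reward + GAMMA * max(q_value(next_active, a) for a in range(N_ACTIONS))
        td_error = target - q_value(active, action)
        for tile in active:                          # only the currently active tiles change
            weights[tile, action] += ALPHA * td_error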

Epsilon Decay

This is the first environment where I've implemented epsilon decay. Up until now, all my agents have used an epsilon-greedy policy. If an agent only ever picks the best (greedy) action, its learning is stunted, because learning really comes from trying new things (exploration). So, in an epsilon-greedy policy with epsilon set at 0.1, there's a 90% chance we select the best known action for a given state and a 10% chance we take a completely random action, hoping to discover something new. Once training is finished, we drop exploration and only take the best action in every state. In all my prior agents, epsilon was constant at 0.1. Epsilon decay, on the other hand, starts epsilon at 1 and gradually reduces it to a tiny amount or to zero. For this agent, training spanned 100,000 episodes, with epsilon starting at 1 (nothing but random actions) and slowly decreasing to 0 over 99,000 episodes, leaving the final 1,000 episodes for testing the now fully greedy policy.
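As a sketch, again building on the earlier snippets: the exact decay shape isn't specified above, so a simple linear schedule is assumed here.

    import random

    TOTAL_EPISODES = 100_000
    DECAY_EPISODES = 99_000          # epsilon reaches 0 here; the final 1,000 episodes are greedy

    def epsilon_for(episode):
        """Linear decay (assumed shape) from 1.0 down to 0.0 over DECAY_EPISODES."""
        return max(0.0, 1.0 - episode / DECAY_EPISODES)

    def choose_action(state, episode):
        """Epsilon-greedy: random action with probability epsilon, otherwise greedy."""
        if random.random() < epsilon_for(episode):
            return random.randrange(N_ACTIONS)                           # explore
        active = tile_code(state)
        return max(range(N_ACTIONS), key=lambda a: q_value(active, a))   # exploit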

Problems / Curiosities

  • Initially, after training, I was confused about why the agent would immediately try to throw the ball into the machinery without even attempting to balance it. It turns out I had set the reward to -1 per timestep out of habit. This meant the longer the ball was in play, the worse the agent thought it had performed, so to optimise the reward it had to throw the ball to its death as fast as it could.
  • After this I still struggled to train the agent. It would balance some of the time, but it wasn't consistently keeping the ball out of the machinery. I didn't know if the problem was the number of tiles, the number of tilings, or any other parameter for that matter. Eventually, I automated the testing to cycle through a series of runs with varying numbers of tiles, tilings, episodes, epsilon decay schedules, and discount factors. Through this I found the reason I had so much trouble: I wasn't training the agent for long enough. Initially, I had tested different parameters while training for only about 5,000 episodes, under the assumption that this was enough. It wasn't enough for the agent to fully explore the environment and learn the best policy. After increasing the episode count to 60,000+, things were much better. The final agent was trained on 100,000 episodes, which took around 30 minutes to complete.
  • You'll notice that sometimes, while balancing the ball, the agent constantly rocks the platform left and right to keep the ball stationary. Humans typically wouldn't use this strategy; we'd keep the platform stable, since exerting effort (pressing buttons to go left and right) consumes energy. This agent, however, neither gains nor loses reward for moving the platform, so it doesn't “tire” and has no reason to avoid balancing by rocking back and forth. To steer its behaviour towards a more human-like approach, you could introduce a small penalty for every platform movement, or provide extra reward when the platform stays still, encouraging minimal movement (see the sketch after this list).
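For that last point, here is a small reward-shaping sketch; the penalty size is a made-up value that would need tuning against the +1 survival reward.

    MOVE_PENALTY = 0.05   # hypothetical value, small next to the +1 per-step reward

    def shaped_reward(base_reward, action):
        """Subtract a small penalty whenever the chosen action tilts the platform."""
        if action in ("tilt_left", "tilt_right"):   # action names from the first sketch
            return base_reward - MOVE_PENALTY
        return base_reward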


