Introduction

Nim is a strategic game where players take turns removing objects from distinct heaps, aiming to avoid taking the last object. The origins of the game can be traced back to ancient China, and it has been played in various forms for over two thousand years.

In this version of Nim, there are three piles, each filled with a random number of tokens. On your turn, you must remove one or more tokens from a single pile: take as many as you like, but only from that one pile. Click the hourglass in the corner to end your turn. The player who picks up the last token loses.
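
For the curious, here is a minimal sketch of those rules in Python. The list-of-piles representation and the function names are my own for illustration, not taken from the game's code:

    # A board is just a list of three pile sizes, e.g. [5, 3, 1].

    def legal_moves(piles):
        """All (pile_index, tokens_to_take) pairs allowed from this position."""
        return [(i, take)
                for i, count in enumerate(piles)
                for take in range(1, count + 1)]

    def apply_move(piles, pile_index, take):
        """Return the new board after removing `take` tokens from one pile."""
        new_piles = list(piles)
        new_piles[pile_index] -= take
        return new_piles

    def mover_has_lost(piles_after_move):
        """Whoever picks up the last token loses, so an empty board
        means the player who just moved has lost."""
        return sum(piles_after_move) == 0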

Reinforcement Learning

The reinforcement learning agent competes against itself over many episodes. During each game, it carefully tracks the board's state and the actions it chooses. Every action it takes is associated with a value: -1 indicates a poor choice, while +1 signifies an excellent move. At the end of each game, the values of the actions taken are adjusted - becoming closer to +1 if the agent won, and nearer to -1 if it lost. After many rounds of self-play and learning, the agent sharpens its strategy to play its best game. All it needs to do is assess the current state of the board and select the action with the highest value (closest to +1) to play optimally.
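
As a rough sketch of what that training loop might look like, in Python and reusing the helpers from the sketch above (the table structure, learning rate, and exploration rate here are my assumptions, not the game's actual implementation):

    import random
    from collections import defaultdict

    values = defaultdict(float)   # (state, action) -> value between -1 and +1
    ALPHA = 0.1                   # how far a value moves on each update
    EPSILON = 0.2                 # chance of trying a random move while training

    def choose_action(state):
        moves = legal_moves(list(state))
        if random.random() < EPSILON:
            return random.choice(moves)                        # explore
        return max(moves, key=lambda a: values[(state, a)])    # exploit

    def train_episode():
        piles = [random.randint(1, 5) for _ in range(3)]
        history = {0: [], 1: []}          # state-action pairs taken by each player
        player = 0
        while sum(piles) > 0:
            state = tuple(piles)
            action = choose_action(state)
            history[player].append((state, action))
            piles = apply_move(piles, *action)
            if sum(piles) == 0:           # this player took the last token and lost
                loser, winner = player, 1 - player
            player = 1 - player
        # Nudge every visited pair toward +1 for the winner and -1 for the loser.
        for key in history[winner]:
            values[key] += ALPHA * (1.0 - values[key])
        for key in history[loser]:
            values[key] += ALPHA * (-1.0 - values[key])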

Afterstates

This agent was trained differently from my previous projects, as it uses afterstates. The problem with state-action value functions, like the one described above, is that each state-action pair holds its own value. However, different states and actions can sometimes lead to identical resulting positions, or "afterstates".

Let's visualize this with a single pile of tokens:

  • Start State: 5 tokens, Action: Take 4, Result: 1 token left
  • Start State: 4 tokens, Action: Take 3, Result: 1 token left
  • Start State: 3 tokens, Action: Take 2, Result: 1 token left
  • Start State: 2 tokens, Action: Take 1, Result: 1 token left

In this case, although the starting states and actions differ, the results are identical: every move leads to the same single afterstate. With a state-action value function, these four pairs would each carry their own value, when logically they should share the same value, since they all end up in exactly the same position.

To use afterstates, instead of assigning values to specific state-action pairs, we assign values directly to states, more precisely, to the afterstates. At the end of each episode, the value of each afterstate visited during the game is adjusted based on the outcome. When the agent then needs to pick a move, it loops through every legal action, looks up the value of the afterstate each one leads to, and selects the action whose afterstate has the highest value. Note that afterstates are only useful in games where we know the result of an action before taking it, such as Tic-Tac-Toe or Chess, since we need to know the state we will reach after the move.
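
Here is a rough sketch of how that changes the code, again in Python and reusing the earlier helpers; the names and learning rate are mine, not the actual implementation:

    afterstate_values = defaultdict(float)   # board reached after a move -> value

    def best_action(piles):
        """Greedy play: try every legal move and pick the one whose
        resulting afterstate has the highest learned value."""
        return max(legal_moves(piles),
                   key=lambda a: afterstate_values[tuple(apply_move(piles, *a))])

    def update_afterstates(visited_afterstates, won, alpha=0.1):
        """At the end of an episode, nudge every visited afterstate toward
        +1 if this player won, or -1 if it lost."""
        target = 1.0 if won else -1.0
        for afterstate in visited_afterstates:
            afterstate_values[afterstate] += alpha * (target - afterstate_values[afterstate])

In the single-pile example above, all four moves land on the same afterstate, so they share a single entry in the value table instead of four separate state-action values.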

Agent vs Agent: Are afterstates any good?

Round 1

I wanted to find out if using afterstates would give any significant advantage. So, I trained two agents: one using afterstates and another using the standard Monte Carlo state-action method. Initially, I set the board in a fixed 5-3-1 configuration. After training individually for 500 episodes, they faced off in 500 matches. The results were neck-and-neck, with the afterstate agent clinching 251 wins against 249.

Round 2

In the second round, I introduced an element of unpredictability by randomizing the board setup in each game. The agents again trained on their own for 500 episodes before competing in 500 matches on these random boards. This time, the afterstate agent had a clear upper hand, securing a resounding 469 victories to the state-action agent's 31. The afterstate method proved more efficient: by collapsing many state-action pairs into a single afterstate value, it learned more from the same 500 episodes.

Round 3

Next, I wanted to see how the state-action agent would fare with extended training. Keeping the afterstate agent's training at 500 episodes, I boosted the state-action agent's training to 5000 episodes. Despite training for ten times longer, it managed to win only 167 games, while the afterstate agent won 333. Further increasing its training to 7500 episodes finally brought a better balance, resulting in a tight competition where the state-action agent won 254 games to the afterstate agent's 246.

In the end, the use of afterstates proved to be successful, enabling the afterstate agent to achieve similar performance to the state-action agent with just 1/15th of the training episodes. In the game above, you play against this afterstate agent.
