You and the agent alternate turns, clicking to pluck 1, 2, or 3 petals. After your turn, tap the flower's center for the agent's turn. If you click the last petal, you lose.

Reinforcement learning and how it works

Sunflower is inspired by the classic game of 21. Players count up in increments of 1, 2, or 3, and whoever reaches 21 first loses. You'll face off against a reinforcement learning agent I've trained on this game. Through self-play, the agent has learned to select actions based on the current state and the rewards it has received.

  • States (21 total): Number of remaining petals.
  • Actions (3 choices): Pluck 1, 2, or 3 petals.
  • Rewards (2 outcomes): +1 for a win, -1 for a loss.
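
As a rough sketch (not the game's actual code), the whole problem fits in one small value table; the names here (NUM_PETALS, ACTIONS, values) are just illustrative:

```python
# Hypothetical setup sketch -- one value per (state, action) pair.
NUM_PETALS = 21                 # states: 1..21 petals remaining
ACTIONS = (1, 2, 3)             # actions: pluck 1, 2, or 3 petals

# Start every state-action value at 0 (neutral, neither winning nor losing).
values = {(s, a): 0.0 for s in range(1, NUM_PETALS + 1) for a in ACTIONS}
```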

On each turn, the agent decides how many petals to pluck based on the current state. Every state-action pair holds a value that tells us how good it is to take that action in that state. Once the game (an episode) has finished, the agent goes back through the state-action pairs it visited: if it won, each of those pairs receives a reward of +1 and its value is shifted closer to +1; if it lost, their values are shifted closer to -1. After the agent has played itself many times (in this case, 10,000 episodes), the values settle down and look pretty stable. When you play, the agent checks the current state (how many petals remain) and picks the action whose value is closest to +1, i.e. the one that has been most successful. This type of learning is called Monte Carlo, because it learns from the outcomes of complete episodes.
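
Here's a minimal sketch of that Monte Carlo update and the greedy action choice, building on the table above. The step-size average (step_size=0.1) and the function names are my assumptions for illustration; the game's actual update rule may differ.

```python
def update_after_episode(values, history, won, step_size=0.1):
    """Shift the value of every (state, action) pair visited this episode
    toward +1 on a win or -1 on a loss."""
    reward = 1.0 if won else -1.0
    for state, action in history:
        values[(state, action)] += step_size * (reward - values[(state, action)])

def greedy_action(values, petals_remaining):
    """Pick the legal action whose value is currently highest (closest to +1)."""
    legal = [a for a in (1, 2, 3) if a <= petals_remaining]
    return max(legal, key=lambda a: values[(petals_remaining, a)])
```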

What went wrong

  • Backwards rewards: I struggled for a while wondering why the agent I played against was god awful; it performed better when it took random actions. It turns out I was wrongly giving the +1 reward to the agent that plucked the last petal. The agent that lost got rewarded and the agent that won got punished, so I was essentially training an agent to be the worst possible player of this game. That awful player still somehow managed to beat me a few times too.
  • Keeping history: I also struggled for a while wondering why the agent's actions weren't smart or daft; they were seemingly random. It turns out I wasn't clearing the history between episodes, so the history kept growing. After enough episodes, every possible state-action pair was in it, and whenever the agent won or lost, every pair was shifted closer to +1 or -1 rather than just the pairs visited in the current episode. That's what produced the seemingly random behaviour. A corrected self-play loop is sketched below.
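
For illustration only, here is roughly what the corrected self-play loop looks like with both fixes applied: the history is created fresh each episode, and the winner (not the player who plucked the last petal) gets the +1. It reuses the hypothetical values table and update_after_episode from the sketches above; the epsilon exploration is also my assumption, not something described in the post.

```python
import random

def play_one_episode(values, epsilon=0.1):
    """Self-play one game; returns each player's visited (state, action)
    pairs and the index of the loser (who plucked the last petal)."""
    histories = ([], [])                        # fresh history each episode -- the fix
    petals, player = 21, 0
    while True:
        legal = [a for a in (1, 2, 3) if a <= petals]
        if random.random() < epsilon:           # explore occasionally (assumption)
            action = random.choice(legal)
        else:                                    # otherwise act greedily
            action = max(legal, key=lambda a: values[(petals, a)])
        histories[player].append((petals, action))
        petals -= action
        if petals == 0:                          # plucking the last petal loses
            return histories, player
        player = 1 - player

def train(values, episodes=10_000):
    for _ in range(episodes):
        histories, loser = play_one_episode(values)
        for player in (0, 1):
            # Winner's pairs move toward +1, loser's toward -1 -- the other fix.
            update_after_episode(values, histories[player], won=(player != loser))
```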