Motivation

The goal of this project was to develop an AI agent capable of autonomously cleaning a room—detecting trash, picking it up, and dumping it into a trashcan—using Reinforcement Learning (RL). The idea was to train the agent to make decisions and improve its behavior over time based on rewards and penalties.

Reinforcement Learning

The agents learned through Reinforcement Learning: each action was either rewarded or penalized. Each new generation of agents built on the behavior of previous ones, gradually developing more efficient strategies.

In a typical RL setup, an agent interacts with an environment, receives rewards and state observations based on its actions, and uses this feedback to refine its future decisions.
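The loop described above can be sketched in a few lines of Python. This is only an illustration, not the project's training code: the corridor environment, the class and function names, and the use of tabular Q-learning are all assumptions chosen to keep the example small. The project's actual learning algorithm is not stated here.

```python
import random


class CorridorEnv:
    """Toy stand-in environment (hypothetical): a 1-D corridor, goal in the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action 0 = move left, action 1 = move right
        move = 1 if action == 1 else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        done = self.pos == self.length - 1
        # Small per-step penalty, larger reward for reaching the goal.
        reward = 1.0 if done else -0.01
        return self.pos, reward, done


def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Run the observe -> act -> reward -> update loop with tabular Q-learning."""
    rng = random.Random(seed)
    env = CorridorEnv()
    q = [[0.0, 0.0] for _ in range(env.length)]  # q[state][action]
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Explore occasionally (or on ties); otherwise pick the best-known action.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = env.step(action)
            # Feedback from the environment refines the estimate for this action.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q
```

After training, the learned values should favor moving right in every non-goal cell, which is the toy equivalent of the agent settling on behavior that earns more reward.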

Designing the Environment

The simulated learning environment included three key components:

Agent: The AI entity performing actions within the environment.
Brain: The decision-making model, defining inputs and outputs.
Academy: The system that orchestrated training, observations, and decision processes.

The agent tracked eight observations, including its position vector, distance and direction from the trashcan, and whether it had collected all the trash. It also used five raycasts to simulate vision within a 120-degree field of view. Items were tagged and spawned randomly within the environment.
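To make the observation setup concrete, here is a rough Python sketch. It rests on two assumptions that are not confirmed above: that the eight values split into position (3), distance (1), direction (3), and a collected-all flag (1), and that the five rays are spaced evenly across the 120-degree field of view. The function names are made up for illustration.

```python
import math


def ray_angles(num_rays=5, fov_degrees=120.0):
    """Ray angles in degrees relative to the agent's forward direction.

    Assumes the rays are evenly spaced across the field of view.
    """
    if num_rays == 1:
        return [0.0]
    half = fov_degrees / 2.0
    step = fov_degrees / (num_rays - 1)
    return [-half + i * step for i in range(num_rays)]


def build_observation(agent_pos, trashcan_pos, all_trash_collected):
    """Assemble an 8-value observation vector (assumed layout: 3 + 1 + 3 + 1)."""
    offset = [t - a for a, t in zip(agent_pos, trashcan_pos)]
    distance = math.sqrt(sum(d * d for d in offset))
    # Unit vector pointing at the trashcan; zero vector if the agent is on top of it.
    direction = [d / distance for d in offset] if distance > 0 else [0.0, 0.0, 0.0]
    flag = 1.0 if all_trash_collected else 0.0
    return [*agent_pos, distance, *direction, flag]


angles = ray_angles()  # [-60.0, -30.0, 0.0, 30.0, 60.0]
```

Under the even-spacing assumption, adjacent rays sit 30 degrees apart, so small items far from the agent can fall between two rays until the agent turns.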

The reward function played a crucial role—it acted as a feedback mechanism to help the agent learn what behaviors were desirable. The goal was for the agent to maximize total rewards by learning the correct sequence of actions.

Challenges

The first iteration yielded strange yet expected results. The initial reward system did nothing to discourage the agent from standing still, so it often did just that. Adding penalties for inactivity and collisions encouraged the agent to explore and improve.
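The kind of reward shaping described above might look like the following Python sketch. The event fields, the function name, and every numeric value are hypothetical placeholders, not the values used in the project.

```python
from dataclasses import dataclass


@dataclass
class StepEvents:
    """What happened during one simulation step (hypothetical structure)."""
    moved: bool = True
    collided: bool = False
    picked_up_trash: bool = False
    dumped_trash: bool = False


def step_reward(events, pickup_reward=1.0, dump_reward=1.0,
                idle_penalty=0.01, collision_penalty=0.1):
    """Reward for a single step; all magnitudes are illustrative assumptions."""
    reward = 0.0
    if events.picked_up_trash:
        reward += pickup_reward
    if events.dumped_trash:
        reward += dump_reward
    if not events.moved:
        reward -= idle_penalty       # discourages standing still
    if events.collided:
        reward -= collision_penalty  # discourages bumping into things
    return reward
```

With idle_penalty set to 0.0, an agent that never moves earns exactly zero per step, so standing still is never worse than wandering without finding trash. Once the penalty is in place, every idle step costs something, and exploring becomes the better option.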

Another major challenge was training time: teaching a single agent in a single environment was slow. To speed things up, I ran training in parallel across multiple environments, which significantly reduced overall training time.
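The idea behind training across multiple environments can be sketched as follows. This is a simplified illustration under stated assumptions: ToyEnv is a made-up stand-in for one copy of the room, and the copies are stepped one after another in a single process rather than truly concurrently. The point it shows is that N copies contribute N times as much experience per round to one shared batch.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for one copy of the cleaning room."""

    def __init__(self, seed, episode_length=10):
        self.rng = random.Random(seed)
        self.episode_length = episode_length
        self.steps = 0

    def reset(self):
        self.steps = 0
        return 0.0

    def step(self, action):
        self.steps += 1
        obs = self.rng.random()
        reward = 1.0 if action == 1 else 0.0
        done = self.steps >= self.episode_length
        return obs, reward, done


def collect_batch(envs, policy, steps_per_env):
    """Step every environment copy in lockstep and pool the transitions."""
    observations = [env.reset() for env in envs]
    batch = []
    for _ in range(steps_per_env):
        for i, env in enumerate(envs):
            action = policy(observations[i])
            next_obs, reward, done = env.step(action)
            batch.append((observations[i], action, reward, next_obs, done))
            # Restart a finished copy so it keeps producing experience.
            observations[i] = env.reset() if done else next_obs
    return batch


envs = [ToyEnv(seed) for seed in range(4)]
batch = collect_batch(envs, policy=lambda obs: 1, steps_per_env=25)
```

Here four copies yield 100 transitions in the same number of rounds that a single copy would need for 25, and because each copy has its own random seed, the pooled batch covers more varied situations.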

Result

Strategy #1: Collect all the litter first, then dump it (triggered by a moderate reward for dumping).

Strategy #2: Dump trash after every pickup (triggered by increasing the dump reward value).

After many iterations, reward adjustments, and hours of training, I ended up with two separate agent brains, each using a distinct approach to complete the task.