Basics of Reinforcement Learning


Questions and Answers

What is the primary goal of Q-learning in reinforcement learning?

  • To estimate the Q-values for state-action pairs (correct)
  • To update the policy based on the rewards and transitions
  • To select actions based on the current policy
  • To model the environment's dynamics

Which algorithm is used in continuous action spaces and involves an actor-critic architecture?

  • SARSA
  • DDPG (correct)
  • Q-learning
  • DQL

What is the primary difference between on-policy and off-policy reinforcement learning algorithms?

  • On-policy algorithms update the policy based on the rewards and transitions, while off-policy algorithms do not
  • On-policy algorithms require prior knowledge of the environment's dynamics, while off-policy algorithms do not
  • On-policy algorithms are used in continuous action spaces, while off-policy algorithms are used in discrete action spaces
  • On-policy algorithms involve selecting actions based on the current policy, while off-policy algorithms do not (correct)

Which algorithm uses a neural network to estimate the Q-values for state-action pairs?

  • DQL (correct)

What is the role of the actor-critic architecture in DDPG?

  • To learn the policy and value function (correct)

Which algorithm updates the Q-values based on the rewards and transitions between states and actions experienced during interactions with the environment?

  • SARSA (correct)

What is the primary characteristic of model-free reinforcement learning algorithms?

  • They do not require prior knowledge of the environment's dynamics (correct)

Which algorithm involves estimating the Q-values for state-action pairs and updating them based on the rewards and transitions?

  • Q-learning (correct)

What is the primary goal of DQL in reinforcement learning?

  • To estimate the Q-values for state-action pairs (correct)

Which algorithm is commonly used in discrete action spaces?

  • DQL (correct)


Study Notes

Basics of Reinforcement Learning

  • Reinforcement learning is inspired by how humans learn from trial and error, taking actions, receiving feedback, and updating their decisions accordingly.
  • An agent interacts with an environment, takes actions based on its policy, and receives feedback in the form of rewards or penalties.
  • The agent's goal is to learn an optimal policy that maximizes cumulative rewards over time.

Key Components of Reinforcement Learning

  • States: Represent the current situation or context of the environment, such as a game board or a robot's current position.
  • Actions: Decisions or choices the agent makes in response to a given state, such as moves in a game or movement commands for a robot.
  • Rewards: Feedback provided to the agent by the environment after it takes an action in a certain state, guiding the agent's learning by indicating the desirability of certain actions.
  • Policy: The strategy or rule that the agent uses to determine which action to take in a given state, which can be deterministic or stochastic.
  • Value Function: Estimates the expected cumulative rewards the agent can receive from a certain state following a certain policy, helping the agent make decisions by evaluating the desirability of different states or state-action pairs.

Artificial Intelligence Sample Scenario

  • Robotic Vacuum Cleaner: A simple example of reinforcement learning, where the vacuum cleaner needs to learn how to navigate a room and clean up dirt patches efficiently.
  • States: Different positions of the vacuum cleaner in the room.
  • Actions: Movement directions of the vacuum cleaner or cleaning actions.
  • Rewards: Positive for successfully cleaning a dirt patch, negative for bumping into obstacles, and neutral for simply moving around the room.
  • Policy: Initially, a rule-based strategy, such as moving towards the nearest dirt patch, and can be refined over time through learning.
  • Value Function: Estimates the expected cumulative rewards for different states or state-action pairs, guiding the vacuum cleaner's decisions on which actions to take.
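The vacuum cleaner's reward signal can be sketched as a small function. The event names and the exact values (+1, -1, 0) below are assumptions for illustration; the lesson only specifies positive, negative, and neutral outcomes.

```python
# Illustrative reward function for the robotic-vacuum scenario.
# Event names and numeric values are assumptions, not from the lesson.
def reward(event):
    if event == "cleaned_dirt":
        return 1.0    # positive: successfully cleaned a dirt patch
    if event == "hit_obstacle":
        return -1.0   # negative: bumped into an obstacle
    return 0.0        # neutral: simply moving around the room

# One hypothetical episode of events and its cumulative reward.
episode = ["moved", "cleaned_dirt", "hit_obstacle", "cleaned_dirt"]
print(sum(reward(e) for e in episode))
```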

Reinforcement Learning Algorithms

  • Q-Learning: An off-policy algorithm that estimates the Q-values for state-action pairs and updates them based on the rewards and transitions it observes.
  • SARSA: An on-policy algorithm that updates Q-values based on the rewards and transitions experienced while following the current policy.
  • DDPG (Deep Deterministic Policy Gradient): An off-policy actor-critic algorithm that uses neural networks to approximate both the policy (actor) and the Q-values (critic), commonly used in continuous action spaces.
  • DQL (Deep Q-Learning): An off-policy algorithm that uses a neural network to estimate the Q-values for state-action pairs, selecting actions with the highest Q-values and updating them based on rewards and transitions; commonly used in discrete action spaces.
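The Q-learning update described above can be written in a few lines. This is a minimal tabular sketch: the states, actions, hyperparameter values, and the single hand-supplied transition are all illustrative assumptions.

```python
from collections import defaultdict

alpha, gamma = 0.5, 0.9   # learning rate and discount factor (assumed values)
Q = defaultdict(float)     # Q-values for (state, action) pairs, default 0.0

def q_update(s, a, r, s_next, actions):
    """Off-policy Q-learning update: bootstrap from the best next action,
    regardless of which action the behavior policy would actually pick.
    (SARSA, being on-policy, would instead use the action actually taken
    in s_next.)"""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

actions = ["move", "clean"]
# Experience one transition: cleaning in a dirty state earned reward 1.
q_update("dirty", "clean", 1.0, "clean_room", actions)
print(Q[("dirty", "clean")])
```

Since all Q-values start at zero, this single update moves Q("dirty", "clean") toward the observed reward by a step of size `alpha`.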
