
Q-Learning: A Model-Free Reinforcement Learning Algorithm

Q-Learning is a model-free reinforcement learning algorithm used to find the optimal policy in a Markov Decision Process (MDP). Unlike value iteration, which requires knowledge of transition probabilities, Q-Learning learns directly from experience.


Mathematical Foundation

Q-Learning is based on the Bellman equation and updates its value estimates with the following rule:

Q(s,a) ← Q(s,a) + α [R(s,a) + γ max_{a′} Q(s′,a′) − Q(s,a)]

Where:

  • Q(s,a) is the Q-value: the expected cumulative (discounted) reward of taking action a in state s and acting optimally afterwards.
  • α is the learning rate (controls how much new information overrides old information).
  • R(s,a) is the reward received after taking action a in state s.
  • γ is the discount factor (it controls how much future rewards matter).
  • max_{a′} Q(s′,a′) is the best Q-value in the next state s′ (choosing the best future action).
  • The difference [R(s,a) + γ max_{a′} Q(s′,a′) − Q(s,a)] is the temporal difference (TD) error; a short code sketch of this update follows the list.
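
To make the formula concrete, here is a minimal sketch of a single update step. The function name td_update and its arguments are illustrative, not part of any library; it assumes a tabular Q stored as a NumPy array indexed by (state, action).

import numpy as np

def td_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-Learning update on a tabular Q (NumPy array indexed by [state, action])."""
    td_target = r + gamma * np.max(Q[s_next])   # R(s,a) + γ max_{a′} Q(s′,a′)
    td_error = td_target - Q[s, a]              # temporal difference (TD) error
    Q[s, a] += alpha * td_error                 # move the estimate toward the target by α
    return Q

With α = 1 the new target completely replaces the old estimate; smaller values blend new experience into the running estimate gradually.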

Q-Learning Algorithm

  1. Initialize the Q-table Q(s,a) arbitrarily (usually zeros).
  2. For each episode (until convergence):
    • Start from an initial state s.
    • Select an action a using an exploration strategy (e.g., ε-greedy).
    • Take action a, observe the reward R and the next state s′.
    • Update the Q-value using the formula.
    • Repeat until reaching a terminal state.
  3. Derive the optimal policy by choosing the action with the highest Q-value for each state (a minimal sketch of this loop is shown below).
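
A minimal, self-contained sketch of this loop is given below. Everything in it is invented for illustration: the toy "corridor" task (five states in a row, reward 1 for reaching the rightmost state), the step() helper, and the parameter values are assumptions, not part of this lesson's MDP. The complete worked example at the end of the lesson applies the same loop to the three-state MDP used throughout.

import numpy as np
import random

def step(state, action, n_states=5):
    """Toy deterministic corridor (illustrative only): action 0 = left, 1 = right.
    Reward 1 for reaching the rightmost state, 0 otherwise."""
    next_state = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward, next_state == n_states - 1  # (s′, R, terminal?)

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))      # 1. Initialize the Q-table with zeros
alpha, gamma, epsilon = 0.1, 0.9, 0.2

for _ in range(500):                     # 2. Loop over episodes
    s, done, t = 0, False, 0
    while not done and t < 100:          # safety cap on episode length
        t += 1
        # ε-greedy action selection (greedy ties broken at random)
        if random.random() < epsilon:
            a = random.randrange(n_actions)
        else:
            a = int(random.choice(np.flatnonzero(Q[s] == Q[s].max())))
        s_next, r, done = step(s, a)
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])  # update rule
        s = s_next

print(np.argmax(Q, axis=1))              # 3. Greedy policy; non-terminal states should prefer action 1 (right)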

Key Features

  • Doesn’t require a model of the environment (unlike Value Iteration).
  • Off-Policy: the update uses the greedy action in the next state (the max over a′), so Q-Learning can learn the optimal policy while following a different, more exploratory behavior policy.
  • Exploration-Exploitation Tradeoff: Uses an ε-greedy strategy to balance learning about the environment (exploration) and using known information (exploitation); a small selection-rule sketch follows this list.
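
As a small sketch of the ε-greedy rule (the name epsilon_greedy is illustrative, not a library function): with probability ε the agent explores by picking a random action, otherwise it exploits its current Q-estimates.

import numpy as np
import random

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    """Return a random action with probability epsilon, else the greedy action."""
    if random.random() < epsilon:
        return random.randrange(n_actions)   # explore
    return int(np.argmax(Q[state]))          # exploit

In practice, ε is often decayed over the course of training so the agent explores broadly at first and relies more on its learned Q-values later.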

Example

The example below defines a small three-state MDP, solves it with both Value Iteration (for comparison) and Q-Learning, and plots the resulting optimal policy.

import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
import random

def value_iteration(states, actions, transition_probs, rewards, gamma=0.9, theta=1e-6):
    """Performs Value Iteration Algorithm"""
    V = np.zeros(len(states))  # Initialize value function
    policy = np.zeros(len(states), dtype=int)  # Best action per state
    
    while True:
        delta = 0
        for s in range(len(states)):
            v_old = V[s]
            action_values = []
            for a in range(len(actions)):
                v_new = sum([transition_probs[s][a][s_next] * (rewards[s] + gamma * V[s_next]) 
                             for s_next in range(len(states))])
                action_values.append(v_new)
            
            V[s] = max(action_values)
            policy[s] = np.argmax(action_values)
            delta = max(delta, abs(v_old - V[s]))
        
        if delta < theta:
            break
    
    return V, policy

def q_learning(states, actions, transition_probs, rewards, gamma=0.9, alpha=0.1, epsilon=0.1, episodes=1000):
    """Performs Q-Learning Algorithm"""
    Q = np.zeros((len(states), len(actions)))  # Initialize Q-table
    
    for _ in range(episodes):
        state = random.choice(states)  # Start at a random state
        while state != len(states) - 1:  # Terminal state check
            # Choose action using epsilon-greedy strategy
            if random.uniform(0, 1) < epsilon:
                action = random.choice(actions)
            else:
                action = np.argmax(Q[state])
            
            # Sample the next state from the transition distribution
            # (deterministic here: each row of transition_probs has a single entry equal to 1)
            next_state = np.random.choice(len(states), p=transition_probs[state][action])
            reward = rewards[state]  # Reward associated with the current state
            
            # Q-value update rule
            Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
            
            state = next_state  # Move to next state
    
    policy = np.argmax(Q, axis=1)  # Derive greedy policy from Q-table (the terminal state's row is never updated, so its entry is arbitrary)
    return Q, policy

def plot_mdp(states, policy):
    """Visualizes the MDP transitions and optimal policy"""
    G = nx.DiGraph()
    state_labels = {0: "s1", 1: "s2", 2: "s3"}
    action_labels = {0: "Left", 1: "Right", 2: "Stay"}
    pos = {0: (0, 0), 1: (2, 0), 2: (4, 0)}
    
    for s in range(len(states)):
        action = policy[s]
        if action == 1 and s < len(states) - 1:  # Right move
            G.add_edge(state_labels[s], state_labels[s+1], label=action_labels[action])
        elif action == 0 and s > 0:  # Left move
            G.add_edge(state_labels[s], state_labels[s-1], label=action_labels[action])
    
    plt.figure(figsize=(6, 3))
    nx.draw(G, pos, with_labels=True, node_color='lightblue', node_size=2000, edge_color='gray', font_size=12)
    edge_labels = {(u, v): d['label'] for u, v, d in G.edges(data=True)}
    nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels, font_size=10)
    plt.title("MDP Transition Diagram with Optimal Policy")
    plt.show()

# Define MDP elements
states = [0, 1, 2]  # s1, s2, s3
actions = [0, 1, 2]  # Left, Right, Stay

# Transition Probabilities (s, a, s_next)
transition_probs = [
    [[1, 0, 0], [0, 1, 0], [1, 0, 0]],  # From s1
    [[1, 0, 0], [0, 0, 1], [0, 1, 0]],  # From s2
    [[0, 1, 0], [0, 0, 1], [0, 0, 1]]   # From s3
]

# Rewards per state
rewards = [0, -1, 1]

gamma = 0.9  # Discount factor
alpha = 0.1  # Learning rate
epsilon = 0.1  # Exploration rate
episodes = 1000  # Number of episodes

# Run Value Iteration
V, policy_vi = value_iteration(states, actions, transition_probs, rewards, gamma)
print("Optimal Value Function (Value Iteration):", V)
print("Optimal Policy (Value Iteration):", policy_vi)

# Run Q-Learning
Q, policy_ql = q_learning(states, actions, transition_probs, rewards, gamma, alpha, epsilon, episodes)
print("Optimal Q-Values (Q-Learning):\n", Q)
print("Optimal Policy (Q-Learning):", policy_ql)

# Plot the MDP
plot_mdp(states, policy_vi)