Q-Learning: A Model-Free Reinforcement Learning Algorithm

Q-Learning is a model-free reinforcement learning algorithm used to find the optimal policy in a Markov Decision Process (MDP). Unlike value iteration, which requires knowledge of transition probabilities, Q-Learning learns directly from experience.


Mathematical Foundation

Q-Learning is based on the Bellman Equation, applied as the following update rule:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ R(s,a) + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$

Where:

  • Q(s,a) is the Q-value (the expected return of taking action a in state s).
  • α is the learning rate (it controls how much new information overrides old information).
  • R(s,a) is the reward received after taking action a in state s.
  • γ is the discount factor (it controls how much future rewards matter).
  • max_{a'} Q(s',a') is the best Q-value in the next state s' (choosing the best future action).
  • The bracketed difference R(s,a) + γ max_{a'} Q(s',a') − Q(s,a) is the temporal difference (TD) error.
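
As a minimal sketch of how this update looks in code (assuming the Q-table is a NumPy array indexed by state and action; the name q_update is illustrative, not part of any library):

import numpy as np

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
  """Apply one Q-Learning update to the Q-table and return it."""
  td_target = reward + gamma * np.max(Q[next_state]) # R(s,a) + gamma * max_a' Q(s',a')
  td_error = td_target - Q[state, action]            # temporal difference (TD) error
  Q[state, action] += alpha * td_error               # move Q(s,a) a step toward the target
  return Q

A larger alpha overwrites old estimates faster, while a gamma closer to 1 makes future rewards count for more.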

Q-Learning Algorithm

  1. Initialize the Q-table Q(s,a) arbitrarily (usually zeros).
  2. For each episode (until convergence):
    • Start from an initial state s.
    • Select an action a using an exploration strategy (e.g., ε-greedy).
    • Take action a, observe the reward R and the next state s'.
    • Update the Q-value using the formula above.
    • Repeat until reaching a terminal state.
  3. Derive the optimal policy by choosing the action with the highest Q-value for each state.
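
Putting the steps together, a bare-bones training loop might look like the sketch below. The three-state transition table, rewards, and hyperparameters are illustrative placeholders; a fuller, commented version appears in the Example section.

import numpy as np
import random

n_states, n_actions = 3, 3
terminal_state = n_states - 1
# Illustrative deterministic transitions: next_state_of[state][action]
next_state_of = [[0, 1, 0],
                 [0, 2, 1],
                 [1, 2, 2]]
rewards = [0, -1, 1] # reward associated with each state
Q = np.zeros((n_states, n_actions)) # Step 1: initialize the Q-table with zeros
alpha, gamma, epsilon, episodes = 0.1, 0.9, 0.1, 1000

for _ in range(episodes): # Step 2: run episodes until the estimates settle
  state = random.randrange(n_states) # start from a random state
  while state != terminal_state:
    # epsilon-greedy: explore with probability epsilon, otherwise exploit
    if random.random() < epsilon:
      action = random.randrange(n_actions)
    else:
      action = int(np.argmax(Q[state]))
    next_state = next_state_of[state][action]
    # update Q(s, a) with the TD rule
    Q[state, action] += alpha * (rewards[state] + gamma * np.max(Q[next_state]) - Q[state, action])
    state = next_state

policy = np.argmax(Q, axis=1) # Step 3: best action per state
print("Learned policy:", policy)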

Key Features

  • Doesn’t require a model of the environment (unlike Value Iteration).
  • Off-Policy: Learns from actions not necessarily following the current policy.
  • Exploration-Exploitation Tradeoff: Uses an ε-greedy strategy to balance trying new actions (exploration) and using what it has already learned (exploitation).

Example

The following script runs both Value Iteration (which uses the full transition model) and Q-Learning (which learns from sampled transitions) on a small three-state MDP, prints the resulting policies, and visualizes the Value Iteration policy as a transition diagram.

import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
import random

def value_iteration(states, actions, transition_probs, rewards, gamma=0.9, theta=1e-6):
  """Performs Value Iteration Algorithm"""
  V = np.zeros(len(states)) # Initialize value function
  policy = np.zeros(len(states), dtype=int) # Best action per state
  
  while True:
    delta = 0
    for s in range(len(states)):
      v_old = V[s]
      action_values = []
      for a in range(len(actions)):
        v_new = sum([transition_probs[s][a][s_next] * (rewards[s] + gamma * V[s_next]) 
               for s_next in range(len(states))])
        action_values.append(v_new)
      
      V[s] = max(action_values)
      policy[s] = np.argmax(action_values)
      delta = max(delta, abs(v_old - V[s]))
    
    if delta < theta:
      break
  
  return V, policy

def q_learning(states, actions, transition_probs, rewards, gamma=0.9, alpha=0.1, epsilon=0.1, episodes=1000):
  """Performs Q-Learning Algorithm"""
  Q = np.zeros((len(states), len(actions))) # Initialize Q-table
  
  for _ in range(episodes):
    state = random.choice(states) # Start at a random state
    while state != len(states) - 1: # Terminal state check
      # Choose action using epsilon-greedy strategy
      if random.uniform(0, 1) < epsilon:
        action = random.choice(actions)
      else:
        action = np.argmax(Q[state])
      
      # Get the next state and reward; transitions in this MDP are deterministic
      # (probabilities are 0 or 1), so argmax returns the actual successor state
      next_state = np.argmax(transition_probs[state][action])
      reward = rewards[state]
      
      # Q-value update rule
      Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
      
      state = next_state # Move to next state
  
  policy = np.argmax(Q, axis=1) # Derive policy from Q-table
  return Q, policy

def plot_mdp(states, policy):
  """Visualizes the MDP transitions and optimal policy"""
  G = nx.DiGraph()
  state_labels = {0: "s1", 1: "s2", 2: "s3"}
  action_labels = {0: "Left", 1: "Right", 2: "Stay"}
  pos = {"s1": (0, 0), "s2": (2, 0), "s3": (4, 0)} # Node positions keyed by label
  
  for s in range(len(states)):
    action = policy[s]
    if action == 1 and s < len(states) - 1: # Right move
      G.add_edge(state_labels[s], state_labels[s+1], label=action_labels[action])
    elif action == 0 and s > 0: # Left move
      G.add_edge(state_labels[s], state_labels[s-1], label=action_labels[action])
  
  plt.figure(figsize=(6, 3))
  nx.draw(G, pos, with_labels=True, node_color='lightblue', node_size=2000, edge_color='gray', font_size=12)
  edge_labels = {(u, v): d['label'] for u, v, d in G.edges(data=True)}
  nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels, font_size=10)
  plt.title("MDP Transition Diagram with Optimal Policy")
  plt.show()

# Define MDP elements
states = [0, 1, 2] # s1, s2, s3
actions = [0, 1, 2] # Left, Right, Stay

# Transition Probabilities (s, a, s_next)
transition_probs = [
  [[1, 0, 0], [0, 1, 0], [1, 0, 0]], # From s1
  [[1, 0, 0], [0, 0, 1], [0, 1, 0]], # From s2
  [[0, 1, 0], [0, 0, 1], [0, 0, 1]]  # From s3
]

# Rewards per state
rewards = [0, -1, 1]

gamma = 0.9 # Discount factor
alpha = 0.1 # Learning rate
epsilon = 0.1 # Exploration rate
episodes = 1000 # Number of episodes

# Run Value Iteration
V, policy_vi = value_iteration(states, actions, transition_probs, rewards, gamma)
print("Optimal Value Function (Value Iteration):", V)
print("Optimal Policy (Value Iteration):", policy_vi)

# Run Q-Learning
Q, policy_ql = q_learning(states, actions, transition_probs, rewards, gamma, alpha, epsilon, episodes)
print("Optimal Q-Values (Q-Learning):\n", Q)
print("Optimal Policy (Q-Learning):", policy_ql)

# Plot the MDP
plot_mdp(states, policy_vi)