Q-Learning: A Model-Free Reinforcement Learning Algorithm
Q-Learning is a model-free reinforcement learning algorithm used to find the optimal policy in a Markov Decision Process (MDP). Unlike value iteration, which requires knowledge of transition probabilities, Q-Learning learns directly from experience.
Mathematical Foundation
Q-Learning is based on the Bellman Equation. Its update rule is:

Q(s, a) ← Q(s, a) + α [ R(s, a) + γ max_{a′} Q(s′, a′) − Q(s, a) ]
Where:
- Q(s,a) is the Q-value: the current estimate of the expected cumulative discounted reward for taking action a in state s.
- α is the learning rate (controls how much new information overrides old information).
- R(s,a) is the reward received after taking action a in state s.
- γ is the discount factor (controls how much future rewards matter).
- max_{a′} Q(s′, a′) is the best Q-value in the next state s′ (choosing the best future action).
- The bracketed difference [R(s, a) + γ max_{a′} Q(s′, a′) − Q(s, a)] is the temporal difference (TD) error.
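To see the update in action, here is a minimal numeric sketch. The values α = 0.1, γ = 0.9, R = 1.0, Q(s, a) = 0.5, and max_{a′} Q(s′, a′) = 2.0 are illustrative assumptions, not taken from the worked example further below.

alpha, gamma = 0.1, 0.9   # assumed learning rate and discount factor
q_sa = 0.5                # assumed current estimate of Q(s, a)
reward = 1.0              # assumed reward R(s, a)
max_q_next = 2.0          # assumed max over a' of Q(s', a')

td_error = reward + gamma * max_q_next - q_sa   # 1.0 + 0.9*2.0 - 0.5 = 2.3
q_sa += alpha * td_error                        # 0.5 + 0.1*2.3 ≈ 0.73
print(q_sa)                                     # ≈ 0.73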
Q-Learning Algorithm
- Initialize the Q-table Q(s,a) arbitrarily (usually zeros).
- For each episode (until convergence):
  - Start from an initial state s.
  - Select an action a using an exploration strategy (e.g., ϵ-greedy).
  - Take action a, observe the reward R and the next state s′.
  - Update the Q-value using the update rule above.
  - Repeat until reaching a terminal state.
- Derive the optimal policy by choosing the action with the highest Q-value for each state.
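These steps translate into a compact, environment-agnostic loop. The sketch below is only a sketch: it assumes a hypothetical env object exposing reset() and step(action) (the latter returning the next state, the reward, and a done flag), which is not part of the worked example further below.

import numpy as np

def q_learning_sketch(env, n_states, n_actions, episodes=500,
                      alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-Learning against an assumed env.reset()/env.step() interface."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))            # 1. initialize the Q-table
    for _ in range(episodes):                      # 2. loop over episodes
        state = env.reset()                        # start from an initial state
        done = False
        while not done:                            # until a terminal state
            if rng.random() < epsilon:             # epsilon-greedy action selection
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)   # act and observe
            # TD update toward reward + gamma * best Q-value in the next state
            Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
            state = next_state
    return np.argmax(Q, axis=1), Q                 # greedy policy and Q-table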
Key Features
- Doesn’t require a model of the environment (unlike Value Iteration).
- Off-Policy: the update targets the greedy (max-valued) action in the next state, so Q-Learning can learn the optimal policy even while it follows a different, exploratory behaviour policy.
- Exploration-Exploitation Tradeoff: uses an ϵ-greedy strategy to balance learning (exploration) and using known information (exploitation).
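As a concrete illustration of that tradeoff, here is a minimal ϵ-greedy action selector; the helper name epsilon_greedy and the NumPy Q-table argument are illustrative assumptions, not part of the example below.

import random
import numpy as np

def epsilon_greedy(Q, state, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.uniform(0, 1) < epsilon:
        return random.randrange(Q.shape[1])   # explore: try something new
    return int(np.argmax(Q[state]))           # exploit: use current knowledge

# For example, with epsilon = 0.1 roughly 10% of the picks are random:
Q = np.zeros((3, 3))
action = epsilon_greedy(Q, state=0, epsilon=0.1)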
Example
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
import random
def value_iteration(states, actions, transition_probs, rewards, gamma=0.9, theta=1e-6):
    """Performs Value Iteration Algorithm"""
    V = np.zeros(len(states))                  # Initialize value function
    policy = np.zeros(len(states), dtype=int)  # Best action per state
    while True:
        delta = 0
        for s in range(len(states)):
            v_old = V[s]
            action_values = []
            for a in range(len(actions)):
                v_new = sum([transition_probs[s][a][s_next] * (rewards[s] + gamma * V[s_next])
                             for s_next in range(len(states))])
                action_values.append(v_new)
            V[s] = max(action_values)
            policy[s] = np.argmax(action_values)
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            break
    return V, policy
def q_learning(states, actions, transition_probs, rewards, gamma=0.9, alpha=0.1, epsilon=0.1, episodes=1000):
    """Performs Q-Learning Algorithm"""
    Q = np.zeros((len(states), len(actions)))  # Initialize Q-table
    for _ in range(episodes):
        state = random.choice(states)          # Start at a random state
        while state != len(states) - 1:        # Terminal state check
            # Choose action using epsilon-greedy strategy
            if random.uniform(0, 1) < epsilon:
                action = random.choice(actions)
            else:
                action = np.argmax(Q[state])
            # Get next state and reward
            next_state = np.argmax(transition_probs[state][action])
            reward = rewards[state]
            # Q-value update rule
            Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
            state = next_state                 # Move to next state
    policy = np.argmax(Q, axis=1)              # Derive policy from Q-table
    return Q, policy
def plot_mdp(states, policy):
    """Visualizes the MDP transitions and optimal policy"""
    G = nx.DiGraph()
    state_labels = {0: "s1", 1: "s2", 2: "s3"}
    action_labels = {0: "Left", 1: "Right", 2: "Stay"}
    # Node positions keyed by label so they match the node names used below
    pos = {"s1": (0, 0), "s2": (2, 0), "s3": (4, 0)}
    for s in range(len(states)):
        action = policy[s]
        if action == 1 and s < len(states) - 1:   # Right move
            G.add_edge(state_labels[s], state_labels[s + 1], label=action_labels[action])
        elif action == 0 and s > 0:               # Left move
            G.add_edge(state_labels[s], state_labels[s - 1], label=action_labels[action])
    plt.figure(figsize=(6, 3))
    nx.draw(G, pos, with_labels=True, node_color='lightblue', node_size=2000, edge_color='gray', font_size=12)
    edge_labels = {(u, v): d['label'] for u, v, d in G.edges(data=True)}
    nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels, font_size=10)
    plt.title("MDP Transition Diagram with Optimal Policy")
    plt.show()
# Define MDP elements
states = [0, 1, 2] # s1, s2, s3
actions = [0, 1, 2] # Left, Right, Stay
# Transition Probabilities (s, a, s_next)
transition_probs = [
    [[1, 0, 0], [0, 1, 0], [1, 0, 0]],  # From s1
    [[1, 0, 0], [0, 0, 1], [0, 1, 0]],  # From s2
    [[0, 1, 0], [0, 0, 1], [0, 0, 1]]   # From s3
]
# Rewards per state
rewards = [0, -1, 1]
gamma = 0.9 # Discount factor
alpha = 0.1 # Learning rate
epsilon = 0.1 # Exploration rate
episodes = 1000 # Number of episodes
# Run Value Iteration
V, policy_vi = value_iteration(states, actions, transition_probs, rewards, gamma)
print("Optimal Value Function (Value Iteration):", V)
print("Optimal Policy (Value Iteration):", policy_vi)
# Run Q-Learning
Q, policy_ql = q_learning(states, actions, transition_probs, rewards, gamma, alpha, epsilon, episodes)
print("Optimal Q-Values (Q-Learning):\n", Q)
print("Optimal Policy (Q-Learning):", policy_ql)
# Plot the MDP
plot_mdp(states, policy_vi)
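Note that this q_learning implementation peeks at transition_probs to compute the next state, so the transition table doubles as a simulator; in a genuinely model-free setting the agent would only observe (state, action, reward, next state) samples by interacting with the environment. Because the transitions of this toy MDP are deterministic, taking np.argmax over the relevant transition row is enough to pick the successor state.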