A Markov Decision Process (MDP) is a mathematical framework used for decision-making in situations where outcomes are partly random and partly controlled by a decision-maker. MDPs provide a formal foundation for modeling sequential decision problems, particularly in reinforcement learning (RL).
Components of an MDP
An MDP is defined by a tuple (S, A, P, R, γ):
- S (State Space): A set of all possible states in the environment.
- A (Action Space): A set of all possible actions an agent can take.
- P (Transition Probability): The probability of moving from one state to another given an action, defined as: P(s′ ∣ s, a) = Pr(St+1 = s′ ∣ St = s, At = a), where s′ is the next state, s is the current state, and a is the action taken.
- R (Reward Function): A function that provides the expected immediate reward when transitioning from state s to s′ after taking action a, denoted as: R(s, a) = E[r ∣ s, a].
- γ (Discount Factor): A value between 0 and 1 that determines the importance of future rewards. A higher γ values future rewards more, whereas a lower γ favors immediate rewards.
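To make the tuple concrete, these components can be written out directly as data structures, as in the small sketch below. The two-state weather example is purely hypothetical (the state names, action names, and numbers are invented for illustration); it only shows the shape each component takes.

# Hypothetical two-state MDP written out as plain Python structures
states = ["sunny", "rainy"]    # S: state space
actions = ["walk", "drive"]    # A: action space

# P: transition_probs[s][a] is a probability distribution over next states
transition_probs = {
    "sunny": {"walk":  {"sunny": 0.8, "rainy": 0.2},
              "drive": {"sunny": 0.9, "rainy": 0.1}},
    "rainy": {"walk":  {"sunny": 0.3, "rainy": 0.7},
              "drive": {"sunny": 0.5, "rainy": 0.5}},
}

# R: expected immediate reward for taking action a in state s
rewards = {("sunny", "walk"): 2.0, ("sunny", "drive"): 1.0,
           ("rainy", "walk"): -1.0, ("rainy", "drive"): 0.5}

gamma = 0.9                    # γ: discount factor between 0 and 1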
Markov Property
The process follows the Markov Property, meaning that the future state depends only on the current state and action, not on the sequence of past states. Pr(St+1∣St,At)=Pr(St+1∣S1,A1,…,St,At)
This memorylessness simplifies computations and allows for efficient policy optimization.
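The memorylessness is also visible in code: simulating one step of the process needs only the current state and the chosen action, never the history. The sketch below assumes the hypothetical transition_probs dictionary from the example above; the helper step is illustrative only.

import random

def step(state, action, transition_probs):
    """Sample the next state from P(. | s, a); only (state, action) is needed."""
    dist = transition_probs[state][action]
    return random.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]

# step("sunny", "walk", transition_probs) has the same distribution
# no matter how the process arrived at "sunny".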
Solving MDPs
The goal in an MDP is to find an optimal policy π(s), which maps each state to an action so as to maximize the expected cumulative discounted reward. Two standard approaches are:
- Value Iteration: V(s) = max_a Σ_{s′} P(s′ ∣ s, a) [R(s, a, s′) + γ V(s′)], applied repeatedly until the values converge.
- Policy Iteration (a short sketch follows this list):
  - Policy Evaluation: Compute the value function for the current policy.
  - Policy Improvement: Update the policy greedily with respect to the computed values.
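Value Iteration is implemented in full in the Implementation section below. Policy Iteration is not, so the following is a minimal sketch under the same conventions used there (state-indexed rewards and a nested transition_probs[s][a][s'] table); the function name policy_iteration and the tiny example MDP are illustrative, not taken from a library.

import numpy as np

def policy_iteration(transition_probs, rewards, gamma=0.9, theta=1e-6):
    """Alternate policy evaluation and greedy policy improvement until stable."""
    n_states, n_actions = len(transition_probs), len(transition_probs[0])
    policy = np.zeros(n_states, dtype=int)  # start from an arbitrary policy
    V = np.zeros(n_states)
    while True:
        # Policy Evaluation: sweep until V is consistent with the current policy
        while True:
            delta = 0
            for s in range(n_states):
                a = policy[s]
                v_new = sum(transition_probs[s][a][s2] * (rewards[s] + gamma * V[s2])
                            for s2 in range(n_states))
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy Improvement: act greedily with respect to the evaluated V
        stable = True
        for s in range(n_states):
            q = [sum(transition_probs[s][a][s2] * (rewards[s] + gamma * V[s2])
                     for s2 in range(n_states))
                 for a in range(n_actions)]
            best_a = int(np.argmax(q))
            if best_a != policy[s]:
                policy[s], stable = best_a, False
        if stable:  # no action changed, so the policy is optimal
            return V, policy

# Hypothetical 2-state, 2-action example: transition_probs[s][a][s'] and rewards[s]
example_P = [[[0.9, 0.1], [0.2, 0.8]],
             [[0.5, 0.5], [0.1, 0.9]]]
example_R = [1.0, -1.0]
V, pi = policy_iteration(example_P, example_R)
print("Values:", V, "Policy:", pi)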
Applications of MDPs
- AI and Reinforcement Learning: Formulating problems in game playing, self-driving cars, and chatbots.
- Robotics: Decision-making for autonomous agents.
- Finance: Optimal portfolio management.
- Healthcare: Treatment planning for patients.
Implementation
The following program solves a small three-state MDP with Value Iteration and tabular Q-Learning, and then visualizes the resulting policy as a transition diagram.
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
import random
def value_iteration(states, actions, transition_probs, rewards, gamma=0.9, theta=1e-6):
    """Performs Value Iteration Algorithm"""
    V = np.zeros(len(states))                  # Initialize value function
    policy = np.zeros(len(states), dtype=int)  # Best action per state
    while True:
        delta = 0
        for s in range(len(states)):
            v_old = V[s]
            action_values = []
            for a in range(len(actions)):
                v_new = sum([transition_probs[s][a][s_next] * (rewards[s] + gamma * V[s_next])
                             for s_next in range(len(states))])
                action_values.append(v_new)
            V[s] = max(action_values)
            policy[s] = np.argmax(action_values)
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            break
    return V, policy
def q_learning(states, actions, transition_probs, rewards, gamma=0.9, alpha=0.1, epsilon=0.1, episodes=1000):
    """Performs Q-Learning Algorithm"""
    Q = np.zeros((len(states), len(actions)))  # Initialize Q-table
    for _ in range(episodes):
        state = random.choice(states)    # Start at a random state
        while state != len(states) - 1:  # Terminal state check
            # Choose action using epsilon-greedy strategy
            if random.uniform(0, 1) < epsilon:
                action = random.choice(actions)
            else:
                action = np.argmax(Q[state])
            # Get next state and reward (transitions here are deterministic,
            # so argmax picks the single next state with probability 1)
            next_state = np.argmax(transition_probs[state][action])
            reward = rewards[state]
            # Q-value update rule
            Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
            state = next_state  # Move to next state
    policy = np.argmax(Q, axis=1)  # Derive policy from Q-table
    return Q, policy
def plot_mdp(states, policy):
    """Visualizes the MDP transitions and optimal policy"""
    G = nx.DiGraph()
    state_labels = {0: "s1", 1: "s2", 2: "s3"}
    action_labels = {0: "Left", 1: "Right", 2: "Stay"}
    # Positions must be keyed by the node names used in the graph
    pos = {"s1": (0, 0), "s2": (2, 0), "s3": (4, 0)}
    for s in range(len(states)):
        action = policy[s]
        if action == 1 and s < len(states) - 1:  # Right move
            G.add_edge(state_labels[s], state_labels[s + 1], label=action_labels[action])
        elif action == 0 and s > 0:  # Left move
            G.add_edge(state_labels[s], state_labels[s - 1], label=action_labels[action])
    plt.figure(figsize=(6, 3))
    nx.draw(G, pos, with_labels=True, node_color='lightblue', node_size=2000, edge_color='gray', font_size=12)
    edge_labels = {(u, v): d['label'] for u, v, d in G.edges(data=True)}
    nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels, font_size=10)
    plt.title("MDP Transition Diagram with Optimal Policy")
    plt.show()
# Define MDP elements
states = [0, 1, 2] # s1, s2, s3
actions = [0, 1, 2] # Left, Right, Stay
# Transition Probabilities (s, a, s_next)
transition_probs = [
[[1, 0, 0], [0, 1, 0], [1, 0, 0]], # From s1
[[1, 0, 0], [0, 0, 1], [0, 1, 0]], # From s2
[[0, 1, 0], [0, 0, 1], [0, 0, 1]] # From s3
]
# Rewards per state
rewards = [0, -1, 1]
gamma = 0.9 # Discount factor
alpha = 0.1 # Learning rate
epsilon = 0.1 # Exploration rate
episodes = 1000 # Number of episodes
# Run Value Iteration
V, policy_vi = value_iteration(states, actions, transition_probs, rewards, gamma)
print("Optimal Value Function (Value Iteration):", V)
print("Optimal Policy (Value Iteration):", policy_vi)
# Run Q-Learning
Q, policy_ql = q_learning(states, actions, transition_probs, rewards, gamma, alpha, epsilon, episodes)
print("Optimal Q-Values (Q-Learning):\n", Q)
print("Optimal Policy (Q-Learning):", policy_ql)
# Plot the MDP
plot_mdp(states, policy_vi)