Reinforcement Learning Fundamentals

Markov Decision Processes (MDPs)

A Markov Decision Process (MDP) is a mathematical framework used for decision-making in situations where outcomes are partly random and partly controlled by a decision-maker. MDPs provide a formal foundation for modeling sequential decision problems, particularly in reinforcement learning (RL).

Components of an MDP

An MDP is defined by a tuple (S, A, P, R, γ):

  1. S (State Space): A set of all possible states in the environment.
  2. A (Action Space): A set of all possible actions an agent can take.
  3. P (Transition Probability): The probability of moving from one state to another given an action, defined as P(s′ ∣ s, a) = Pr(S_{t+1} = s′ ∣ S_t = s, A_t = a), where s′ is the next state, s is the current state, and a is the action taken.
  4. R (Reward Function): A function that gives the expected immediate reward for taking action a in state s, denoted R(s, a) = E[r ∣ s, a].
  5. γ (Discount Factor): A value between 0 and 1 that determines the importance of future rewards. A higher γ values future rewards more, whereas a lower γ favors immediate rewards.
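
To make the tuple concrete, the sketch below writes the five components out as plain Python data structures. This is a minimal illustration, not part of the formal definition: the three-state layout, the variable names, and the specific numbers are assumptions made here. The implementation at the end of this lesson follows the same idea but simplifies R to a per-state reward.

# Minimal sketch of the (S, A, P, R, γ) tuple as plain Python data (illustrative values)
S = ["s1", "s2", "s3"]               # state space
A = ["Left", "Right", "Stay"]        # action space

# P[s][a] maps each possible next state s' to Pr(s' | s, a)
P = {
    "s1": {"Left": {"s1": 1.0}, "Right": {"s2": 1.0}, "Stay": {"s1": 1.0}},
    "s2": {"Left": {"s1": 1.0}, "Right": {"s3": 1.0}, "Stay": {"s2": 1.0}},
    "s3": {"Left": {"s2": 1.0}, "Right": {"s3": 1.0}, "Stay": {"s3": 1.0}},
}

# R[(s, a)] is the expected immediate reward E[r | s, a]; unlisted pairs default to 0
R = {("s2", "Right"): 1.0, ("s2", "Left"): -1.0}

gamma = 0.9  # discount factor: how strongly future rewards count

# Sanity check: every P[s][a] must be a probability distribution over next states
assert all(abs(sum(dist.values()) - 1.0) < 1e-9
           for acts in P.values() for dist in acts.values())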

Markov Property

The process follows the Markov Property, meaning that the future state depends only on the current state and action, not on the sequence of past states: Pr(S_{t+1} ∣ S_t, A_t) = Pr(S_{t+1} ∣ S_1, A_1, …, S_t, A_t).

This memorylessness simplifies computations and allows for efficient policy optimization.
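
A quick way to see the memorylessness is to sample a trajectory: each next state is drawn from a distribution indexed only by the current state and action, and the earlier history is never consulted. The sketch below uses a small made-up transition table purely for illustration.

import random

# Illustrative transition table: Pr(s' | s, a) is indexed only by (s, a)
P = {
    ("s1", "go"): [("s1", 0.2), ("s2", 0.8)],
    ("s2", "go"): [("s2", 0.5), ("s3", 0.5)],
    ("s3", "go"): [("s3", 1.0)],
}

def step(state, action):
    """Sample the next state from Pr(· | state, action); the history plays no role."""
    next_states, probs = zip(*P[(state, action)])
    return random.choices(next_states, weights=probs, k=1)[0]

state, trajectory = "s1", ["s1"]
for _ in range(5):
    state = step(state, "go")   # only the current state is passed in
    trajectory.append(state)
print(trajectory)               # e.g. ['s1', 's2', 's2', 's3', 's3', 's3']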


Solving MDPs

The goal in an MDP is to find an optimal policy π(s), which maps each state to an action so as to maximize the expected cumulative discounted reward. Two standard approaches are used (a short policy iteration sketch follows the list below):

  1. Value Iteration: repeatedly apply the update V(s) = max_a ∑_{s′} P(s′ ∣ s, a) [R(s, a, s′) + γ V(s′)] until the value function converges.
  2. Policy Iteration:
    • Policy Evaluation: Compute value function for a given policy.
    • Policy Improvement: Update the policy based on the computed values.
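
Value iteration is implemented in full in the code at the end of this lesson. For comparison, here is a minimal policy iteration sketch under the same assumptions used there (tabular states and actions, transition_probs indexed as [s][a][s_next], and a per-state reward); the function name and layout are choices made for this example, not part of the lesson's code.

import numpy as np

def policy_iteration(states, actions, transition_probs, rewards, gamma=0.9, theta=1e-6):
    """Alternates policy evaluation and policy improvement until the policy is stable."""
    policy = np.zeros(len(states), dtype=int)   # start from an arbitrary policy
    V = np.zeros(len(states))

    while True:
        # Policy Evaluation: compute V for the current policy (iterative sweeps)
        while True:
            delta = 0
            for s in range(len(states)):
                a = policy[s]
                v_new = sum(transition_probs[s][a][s_next] * (rewards[s] + gamma * V[s_next])
                            for s_next in range(len(states)))
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break

        # Policy Improvement: act greedily with respect to the evaluated V
        stable = True
        for s in range(len(states)):
            action_values = [sum(transition_probs[s][a][s_next] * (rewards[s] + gamma * V[s_next])
                                 for s_next in range(len(states)))
                             for a in range(len(actions))]
            best_action = int(np.argmax(action_values))
            if best_action != policy[s]:
                stable = False
            policy[s] = best_action

        if stable:          # no action changed, so the policy is optimal
            return V, policy

Called with the transition_probs and rewards defined in the implementation below, this should converge to the same policy as value_iteration, since both methods solve the same Bellman optimality conditions.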

Applications of MDPs

  • AI and Reinforcement Learning: Formulating problems in game playing, self-driving cars, and chatbots.
  • Robotics: Decision-making for autonomous agents.
  • Finance: Optimal portfolio management.
  • Healthcare: Treatment planning for patients.

Implementation

import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
import random

def value_iteration(states, actions, transition_probs, rewards, gamma=0.9, theta=1e-6):
    """Performs Value Iteration Algorithm"""
    V = np.zeros(len(states))  # Initialize value function
    policy = np.zeros(len(states), dtype=int)  # Best action per state
    
    while True:
        delta = 0
        for s in range(len(states)):
            v_old = V[s]
            action_values = []
            for a in range(len(actions)):
                v_new = sum([transition_probs[s][a][s_next] * (rewards[s] + gamma * V[s_next]) 
                             for s_next in range(len(states))])
                action_values.append(v_new)
            
            V[s] = max(action_values)
            policy[s] = np.argmax(action_values)
            delta = max(delta, abs(v_old - V[s]))
        
        if delta < theta:
            break
    
    return V, policy

def q_learning(states, actions, transition_probs, rewards, gamma=0.9, alpha=0.1, epsilon=0.1, episodes=1000):
    """Performs Q-Learning Algorithm"""
    Q = np.zeros((len(states), len(actions)))  # Initialize Q-table
    
    for _ in range(episodes):
        state = random.choice(states)  # Start each episode at a random state
        while state != len(states) - 1:  # Treat the last state (s3) as terminal for the episode
            # Choose action using epsilon-greedy strategy
            if random.uniform(0, 1) < epsilon:
                action = random.choice(actions)
            else:
                action = np.argmax(Q[state])
            
            # Get next state and reward (transitions here are deterministic,
            # so argmax picks the successor state that has probability 1)
            next_state = np.argmax(transition_probs[state][action])
            reward = rewards[state]
            
            # Q-value update rule
            Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
            
            state = next_state  # Move to next state
    
    policy = np.argmax(Q, axis=1)  # Derive policy from Q-table
    return Q, policy

def plot_mdp(states, policy):
    """Visualizes the MDP transitions and optimal policy"""
    G = nx.DiGraph()
    state_labels = {0: "s1", 1: "s2", 2: "s3"}
    action_labels = {0: "Left", 1: "Right", 2: "Stay"}
    pos = {0: (0, 0), 1: (2, 0), 2: (4, 0)}
    
    for s in range(len(states)):
        action = policy[s]
        if action == 1 and s < len(states) - 1:  # Right move
            G.add_edge(state_labels[s], state_labels[s+1], label=action_labels[action])
        elif action == 0 and s > 0:  # Left move
            G.add_edge(state_labels[s], state_labels[s-1], label=action_labels[action])
    
    plt.figure(figsize=(6, 3))
    nx.draw(G, pos, with_labels=True, node_color='lightblue', node_size=2000, edge_color='gray', font_size=12)
    edge_labels = {(u, v): d['label'] for u, v, d in G.edges(data=True)}
    nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels, font_size=10)
    plt.title("MDP Transition Diagram with Optimal Policy")
    plt.show()

# Define MDP elements
states = [0, 1, 2]  # s1, s2, s3
actions = [0, 1, 2]  # Left, Right, Stay

# Transition Probabilities (s, a, s_next)
transition_probs = [
    [[1, 0, 0], [0, 1, 0], [1, 0, 0]],  # From s1
    [[1, 0, 0], [0, 0, 1], [0, 1, 0]],  # From s2
    [[0, 1, 0], [0, 0, 1], [0, 0, 1]]   # From s3
]

# Rewards per state
rewards = [0, -1, 1]

gamma = 0.9  # Discount factor
alpha = 0.1  # Learning rate
epsilon = 0.1  # Exploration rate
episodes = 1000  # Number of episodes

# Run Value Iteration
V, policy_vi = value_iteration(states, actions, transition_probs, rewards, gamma)
print("Optimal Value Function (Value Iteration):", V)
print("Optimal Policy (Value Iteration):", policy_vi)

# Run Q-Learning
Q, policy_ql = q_learning(states, actions, transition_probs, rewards, gamma, alpha, epsilon, episodes)
print("Optimal Q-Values (Q-Learning):\n", Q)
print("Optimal Policy (Q-Learning):", policy_ql)

# Plot the MDP
plot_mdp(states, policy_vi)