Why Initialize Weights in a Neural Network
Initializing weights and biases is a crucial step in building a neural network. Proper initialization helps ensure that the network converges to a good solution and does so efficiently. Let’s explore the reasons in detail:
1. Breaking Symmetry
If all weights are initialized to the same value (e.g., zeros), every neuron in a layer produces the same output and receives the same gradient during backpropagation. The neurons therefore update identically and remain copies of one another, so the layer can never learn more than a single feature. This is called symmetry, and it prevents the network from learning effectively.
By initializing weights to small random values, each neuron will receive different gradients and learn different features.
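A minimal NumPy sketch (using a toy 3-input, 2-neuron tanh layer chosen purely for illustration) makes the symmetry problem concrete: with zero weights the two neurons are indistinguishable, while small random weights give them different outputs and therefore different gradients.
import numpy as np

np.random.seed(0)
x = np.random.randn(5, 3)                       # 5 samples, 3 input features

# Zero initialization: both hidden neurons compute identical outputs,
# so backpropagation would update them identically (the symmetry problem).
W_zero = np.zeros((3, 2))
h_zero = np.tanh(x @ W_zero)
print(np.allclose(h_zero[:, 0], h_zero[:, 1]))  # True -> neurons are symmetric

# Small random initialization: the two neurons produce different outputs
# and will therefore receive different gradients.
W_rand = np.random.randn(3, 2) * 0.01
h_rand = np.tanh(x @ W_rand)
print(np.allclose(h_rand[:, 0], h_rand[:, 1]))  # False -> symmetry is broken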
2. Efficient Training
Proper initialization helps the network converge faster. If weights are too large, the pre-activations grow large in magnitude and push activation functions like sigmoid or tanh into their saturated regions, where gradients become very small (vanishing gradients). If weights are too small, the activations and gradients are also small, which slows down learning.
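The effect of the weight scale is easy to check empirically. The sketch below (with arbitrary layer sizes, purely for illustration) pushes unit-variance inputs through a tanh layer at two weight scales: the large scale saturates most units, where tanh's gradient is nearly zero, while the tiny scale leaves the activations, and hence the gradients, very small.
import numpy as np

np.random.seed(1)
x = np.random.randn(1000, 100)                 # batch of unit-variance inputs

for scale in (0.01, 1.0):                      # "too small" vs "too large" weights
    W = np.random.randn(100, 100) * scale
    a = np.tanh(x @ W)
    saturated = np.mean(np.abs(a) > 0.99)      # fraction of near-flat tanh outputs
    print(f"scale={scale}: mean |activation|={np.abs(a).mean():.3f}, "
          f"saturated fraction={saturated:.2f}")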
3. Avoiding Vanishing/Exploding Gradients
Poor initialization can lead to vanishing or exploding gradients, which make training difficult. With very large weights, the gradients can grow exponentially with depth during backpropagation (exploding gradients); with very small weights, they can shrink exponentially (vanishing gradients).
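The same compounding shows up in the gradients during backpropagation, since they pass through the same weight matrices in reverse. The sketch below (a purely linear toy stack with hypothetical sizes) tracks the forward signal, which follows the same multiplicative dynamics: a per-layer factor below 1 shrinks it exponentially, a factor above 1 blows it up.
import numpy as np

np.random.seed(2)
n, depth = 64, 50
x = np.random.randn(1, n)

for scale in (0.5, 1.5):                       # shrinking vs growing per-layer factor
    h = x.copy()
    for _ in range(depth):
        W = np.random.randn(n, n) * scale / np.sqrt(n)   # random layer, no activation
        h = h @ W
    print(f"scale={scale}: signal norm after {depth} layers = {np.linalg.norm(h):.2e}")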
Common Initialization Techniques
- Random Initialization:
- Initialize weights with small random values, often drawn from a normal distribution (see the comparison sketch after this list).
- He Initialization (for ReLU and its variants):
- Weights are initialized from a normal distribution with mean 0 and variance 2/n, where n is the number of input neurons.
np.random.randn(input_size, hidden_size) * np.sqrt(2 / input_size)
- Xavier Initialization (for sigmoid, tanh):
- Weights are initialized from a normal distribution with mean 0 and variance 1/n, where n is the number of input neurons; the original Glorot formulation uses the average of the input and output neuron counts instead.
np.random.randn(input_size, hidden_size) * np.sqrt(1 / input_size)
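To compare the schemes side by side, the sketch below (layer sizes are arbitrary, chosen only for illustration) pushes unit-variance inputs through one layer under each initialization and prints the spread of the pre-activations: a fixed small scale like 0.01 ignores the layer width and shrinks the signal in this example, Xavier keeps its spread near 1 for sigmoid/tanh-style units, and He compensates for the half of the signal that ReLU discards.
import numpy as np

np.random.seed(3)
fan_in, fan_out = 512, 512
x = np.random.randn(1000, fan_in)              # unit-variance inputs

inits = {
    "small random (x0.01)": np.random.randn(fan_in, fan_out) * 0.01,
    "Xavier  sqrt(1/fan_in)": np.random.randn(fan_in, fan_out) * np.sqrt(1 / fan_in),
    "He      sqrt(2/fan_in)": np.random.randn(fan_in, fan_out) * np.sqrt(2 / fan_in),
}

for name, W in inits.items():
    pre = x @ W                                # pre-activations of the layer
    relu_out = np.maximum(0, pre)              # ReLU output, relevant for He init
    print(f"{name:24s} pre-act std={pre.std():.3f}  ReLU-out std={relu_out.std():.3f}")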
Bias Initialization
Biases are often initialized to zero because they are added to the weighted sum of inputs and do not suffer from the symmetry problem. Initializing biases to small random values is also common; the added noise can help in some scenarios.
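Both options are one-liners in NumPy (hidden_size here is just a placeholder value):
import numpy as np

hidden_size = 4
b_zero = np.zeros((1, hidden_size))                 # standard choice: all-zero biases
b_noisy = np.random.randn(1, hidden_size) * 0.01    # small random alternative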
Example Code for Weight and Bias Initialization
Here’s an example showing how to initialize weights and biases using the Xavier initialization for a network with a single hidden layer:
import numpy as np
def initialize_parameters(input_size, hidden_size, output_size):
    # Xavier initialization for the weights; biases start at zero
    W1 = np.random.randn(input_size, hidden_size) * np.sqrt(1 / input_size)
    b1 = np.zeros((1, hidden_size))
    W2 = np.random.randn(hidden_size, output_size) * np.sqrt(1 / hidden_size)
    b2 = np.zeros((1, output_size))
    return W1, b1, W2, b2
# Example usage
input_size = 3 # Number of input features
hidden_size = 4 # Number of neurons in the hidden layer
output_size = 1 # Number of output neurons
W1, b1, W2, b2 = initialize_parameters(input_size, hidden_size, output_size)
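As a quick sanity check, the initialized parameters can be run through a single forward pass. The tanh hidden layer and sigmoid output used below are assumptions for illustration; the function above only fixes the layer sizes.
X = np.random.randn(10, input_size)            # a batch of 10 random samples

Z1 = X @ W1 + b1                               # hidden pre-activations, shape (10, 4)
A1 = np.tanh(Z1)                               # hidden activations (assumed tanh)
Z2 = A1 @ W2 + b2                              # output pre-activations, shape (10, 1)
A2 = 1 / (1 + np.exp(-Z2))                     # sigmoid output (assumed)
print(A1.shape, A2.shape)                      # (10, 4) (10, 1)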
Summary
- Symmetry Breaking: Random initialization ensures that neurons learn different features.
- Efficient Training: Proper initialization keeps activations and gradients in a useful range, leading to faster convergence.
- Avoiding Vanishing/Exploding Gradients: Appropriate initialization prevents the gradients from becoming too small or too large.
By carefully initializing weights and biases, we set up the neural network for effective training, leading to better performance and faster convergence.