Understanding Gradient Descent
Explore the optimization algorithms powering modern machine learning through interactive visualizations and mathematical insights.
1. What Is Gradient Descent?
At its core, gradient descent seeks to minimize a differentiable function $J(\theta)$ (e.g., a loss function) by iteratively updating parameters $\theta$ in the direction of steepest descent:
- $\nabla_\theta J(\theta)$ is the gradient: a vector of partial derivatives giving the slope of $J$ w.r.t. each parameter.
- $\eta$ (learning rate) controls the step size.
Each update "walks" the parameters downhill on the loss surface until (ideally) reaching a minimum.
Interactive Gradient Descent Visualization
Select an algorithm to see how different optimizers navigate the loss landscape.
2. Why Learning Rate Matters
Too Small Learning Rate
- Painfully slow convergence
- Long training times
- May get stuck in plateaus
Too Large Learning Rate
- Overshoots the minimum
- Can diverge completely
- Unstable training
3. Core Variants
3.1 Batch Gradient Descent
Computes gradient over the entire training set each step.
Pros
- Stable updates
- True gradient direction
Cons
- Expensive for large datasets
- Memory-bound
for epoch in range(num_epochs):
grad = compute_gradient(data=X, labels=y, params=θ)
θ = θ - η * grad
3.2 Stochastic Gradient Descent (SGD)
Uses one randomly sampled example per update.
Pros
- Cheap updates
- Can escape shallow local minima
Cons
- Noisy trajectory
- Zig-zag steps around optimum
for epoch in range(num_epochs):
for xi, yi in shuffle(data, labels):
grad_i = compute_gradient(xi, yi, θ)
θ = θ - η * grad_i
3.3 Mini-Batch Gradient Descent
Compromise: compute gradient over a small batch (e.g., 32–512 samples).
4. Momentum and Nesterov Accelerated Gradient
4.1 Momentum
Adds an exponentially decaying "velocity" term $v$ to smooth updates:
$\gamma$ (e.g., 0.9) retains past velocity, helping traverse flat regions faster and dampen oscillations.
class MomentumOptimizer:
def __init__(self, learning_rate=0.01, momentum=0.9):
self.lr = learning_rate
self.momentum = momentum
self.velocity = 0
def update(self, params, gradients):
self.velocity = self.momentum * self.velocity + self.lr * gradients
params -= self.velocity
return params
4.2 Nesterov Accelerated Gradient (NAG)
Looks ahead by computing the gradient at the "future" position:
class NesterovOptimizer:
def __init__(self, learning_rate=0.01, momentum=0.9):
self.lr = learning_rate
self.momentum = momentum
self.velocity = 0
def update(self, params, gradient_fn):
# Look ahead
future_params = params - self.momentum * self.velocity
gradients = gradient_fn(future_params)
self.velocity = self.momentum * self.velocity + self.lr * gradients
params -= self.velocity
return params
5. Adaptive Learning-Rate Methods
| Variant | Key Idea |
|---|---|
| AdaGrad | Accumulates squared gradients to shrink η over time. |
| RMSProp | Uses exponential decay on squared gradients to avoid AdaGrad's rapid decay. |
| Adam | Combines momentum (first moment) and RMSProp (second moment). |
| AdaDelta | Extension of RMSProp that removes the need for a global η. |
5.1 Adam (Adaptive Moment Estimation)
The most popular adaptive optimizer combining momentum and adaptive learning rates:
Default parameters: $\beta_1\approx0.9$, $\beta_2\approx0.999$, $\epsilon\approx10^{-8}$
class AdamOptimizer:
def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
self.lr = learning_rate
self.beta1 = beta1
self.beta2 = beta2
self.epsilon = epsilon
self.m = 0 # First moment
self.v = 0 # Second moment
self.t = 0 # Time step
def update(self, params, gradients):
self.t += 1
# Update biased first moment estimate
self.m = self.beta1 * self.m + (1 - self.beta1) * gradients
# Update biased second raw moment estimate
self.v = self.beta2 * self.v + (1 - self.beta2) * (gradients ** 2)
# Compute bias-corrected first moment estimate
m_hat = self.m / (1 - self.beta1 ** self.t)
# Compute bias-corrected second raw moment estimate
v_hat = self.v / (1 - self.beta2 ** self.t)
# Update parameters
params -= self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)
return params
6. Choosing the Right Optimizer
| Scenario | Recommended |
|---|---|
| Small dataset / convex problems | Batch GD or SGD |
| Deep neural networks | Adam or RMSProp |
| Very large datasets | Mini-batch SGD w/ momentum |
| Memory-constrained / online learning | SGD |
| Need fine-tuned generalization | SGD w/ momentum (often matches Adam) |
7. Practical Tips
Essential Optimization Strategies
- Learning-Rate Scheduling: Step decay, cosine annealing, or warm restarts can boost performance.
- Gradient Clipping: Especially in RNNs, to prevent exploding gradients.
- Weight Decay: Equivalent to L2 regularization; often applied as a separate term in the optimizer.
- Hyperparameter Search: Even Adam needs η tuning; try grid/random search or automated schedulers.
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR
# Initialize optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Step decay scheduler
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
# Cosine annealing scheduler
# scheduler = CosineAnnealingLR(optimizer, T_max=100)
for epoch in range(num_epochs):
train_one_epoch(model, optimizer)
scheduler.step() # Update learning rate
8. Conclusion
Gradient descent started as a simple, intuitive idea—and through decades of research has evolved into a family of powerful optimizers. Understanding their mechanics helps you make informed choices for your models, saving training time and improving convergence.
Further Reading & Resources
- Ian Goodfellow et al., Deep Learning (2016): Chapter on optimization.
- Sebastian Ruder's blog: "An overview of gradient descent optimization algorithms."
- TensorFlow/PyTorch docs for built-in optimizer implementations.
Happy optimizing! 🚀