Explore the optimization algorithms powering modern machine learning through interactive visualizations and mathematical insights.
At its core, gradient descent minimizes a differentiable function $J(\theta)$ (e.g., a loss function) by iteratively updating the parameters $\theta$ in the direction of steepest descent:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta J(\theta_t)$$

where $\eta$ is the learning rate. Each update "walks" the parameters downhill on the loss surface until (ideally) reaching a minimum.
Batch gradient descent computes the gradient over the entire training set at each step:

```python
for epoch in range(num_epochs):
    grad = compute_gradient(data=X, labels=y, params=θ)  # full-dataset gradient
    θ = θ - η * grad                                     # η is the learning rate
```
Stochastic gradient descent (SGD) uses one randomly sampled example per update:

```python
for epoch in range(num_epochs):
    for xi, yi in shuffle(data, labels):          # visit examples in random order
        grad_i = compute_gradient(xi, yi, θ)      # single-example gradient
        θ = θ - η * grad_i
```
Mini-batch gradient descent is the compromise between the two: compute the gradient over a small batch (e.g., 32–512 samples) at each step, as sketched below.
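A minimal sketch of the mini-batch loop, written in the same pseudocode style as above (`compute_gradient`, `θ`, and `η` are assumed to be defined as before; `shuffle` is assumed to return both arrays in a shared random order, and `batch_size` is an illustrative choice):

```python
batch_size = 64  # illustrative value in the 32–512 range

for epoch in range(num_epochs):
    X_shuf, y_shuf = shuffle(X, y)                # reshuffle every epoch
    for i in range(0, len(X_shuf), batch_size):
        X_batch = X_shuf[i:i + batch_size]
        y_batch = y_shuf[i:i + batch_size]
        grad = compute_gradient(data=X_batch, labels=y_batch, params=θ)
        θ = θ - η * grad
```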
Momentum adds an exponentially decaying "velocity" term $v$ to smooth updates:

$$v_t = \gamma v_{t-1} + \eta \, \nabla_\theta J(\theta), \qquad \theta \leftarrow \theta - v_t$$

The momentum coefficient $\gamma$ (e.g., 0.9) retains past velocity, helping traverse flat regions faster and dampening oscillations.
```python
class MomentumOptimizer:
    def __init__(self, learning_rate=0.01, momentum=0.9):
        self.lr = learning_rate
        self.momentum = momentum
        self.velocity = 0

    def update(self, params, gradients):
        # v ← γ·v + η·∇J(θ)
        self.velocity = self.momentum * self.velocity + self.lr * gradients
        # θ ← θ − v  (params is expected to be a NumPy array)
        params -= self.velocity
        return params
```
Nesterov momentum looks ahead by computing the gradient at the "future" position:

$$v_t = \gamma v_{t-1} + \eta \, \nabla_\theta J(\theta - \gamma v_{t-1}), \qquad \theta \leftarrow \theta - v_t$$
```python
class NesterovOptimizer:
    def __init__(self, learning_rate=0.01, momentum=0.9):
        self.lr = learning_rate
        self.momentum = momentum
        self.velocity = 0

    def update(self, params, gradient_fn):
        # Look ahead
        future_params = params - self.momentum * self.velocity
        gradients = gradient_fn(future_params)
        self.velocity = self.momentum * self.velocity + self.lr * gradients
        params -= self.velocity
        return params
```
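Note that `update` takes a gradient *function* rather than a precomputed gradient, because the gradient must be evaluated at the look-ahead point. A quick, hypothetical usage sketch on a toy quadratic loss (the loss and numbers are illustrative only):

```python
import numpy as np

# Toy loss J(θ) = ½‖θ‖², whose gradient is simply θ
grad_fn = lambda p: p

opt = NesterovOptimizer(learning_rate=0.1, momentum=0.9)
params = np.array([2.0, -3.0])
for step in range(100):
    params = opt.update(params, grad_fn)

print(params)  # should approach the minimum at [0, 0]
```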
| Variant | Key Idea |
|---|---|
| AdaGrad | Accumulates squared gradients to shrink η over time. |
| RMSProp | Keeps an exponential moving average of squared gradients, avoiding AdaGrad's ever-shrinking effective learning rate. |
| Adam | Combines momentum (first moment) and RMSProp (second moment). |
| AdaDelta | Extension of RMSProp that removes the need for a global η. |
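To make the AdaGrad/RMSProp contrast concrete, here is a minimal RMSProp sketch in the same style as the optimizer classes above (parameter names and defaults are illustrative):

```python
import numpy as np

class RMSPropOptimizer:
    def __init__(self, learning_rate=0.01, decay=0.9, epsilon=1e-8):
        self.lr = learning_rate
        self.decay = decay       # AdaGrad would accumulate squared gradients without decay
        self.epsilon = epsilon
        self.sq_avg = 0          # exponential moving average of squared gradients

    def update(self, params, gradients):
        self.sq_avg = self.decay * self.sq_avg + (1 - self.decay) * gradients ** 2
        # Per-parameter step: larger where recent gradients have been small
        params -= self.lr * gradients / (np.sqrt(self.sq_avg) + self.epsilon)
        return params
```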
Adam is the most popular adaptive optimizer, combining momentum and per-parameter adaptive learning rates:
Default parameters: $\beta_1\approx0.9$, $\beta_2\approx0.999$, $\epsilon\approx10^{-8}$
```python
import numpy as np

class AdamOptimizer:
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = 0  # First moment
        self.v = 0  # Second moment
        self.t = 0  # Time step

    def update(self, params, gradients):
        self.t += 1
        # Update biased first moment estimate
        self.m = self.beta1 * self.m + (1 - self.beta1) * gradients
        # Update biased second raw moment estimate
        self.v = self.beta2 * self.v + (1 - self.beta2) * (gradients ** 2)
        # Compute bias-corrected first moment estimate
        m_hat = self.m / (1 - self.beta1 ** self.t)
        # Compute bias-corrected second raw moment estimate
        v_hat = self.v / (1 - self.beta2 ** self.t)
        # Update parameters
        params -= self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)
        return params
```
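As a quick sanity check, the class above can be exercised on a toy quadratic objective (the target values and hyperparameters here are illustrative, not from the text):

```python
import numpy as np

# Toy objective J(θ) = ½‖θ − target‖², so the gradient is (θ − target)
target = np.array([1.0, -2.0])
params = np.zeros(2)

adam = AdamOptimizer(learning_rate=0.05)
for step in range(1000):
    grads = params - target
    params = adam.update(params, grads)

print(params)  # should end up close to [1.0, -2.0]
```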
| Scenario | Recommended |
|---|---|
| Small dataset / convex problems | Batch GD or SGD |
| Deep neural networks | Adam or RMSProp |
| Very large datasets | Mini-batch SGD w/ momentum |
| Memory-constrained / online learning | SGD |
| Best final generalization | SGD w/ momentum (often matches Adam) |
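For example, the SGD-with-momentum setup recommended above looks like this in PyTorch (the hyperparameter values are illustrative, and `model` is assumed to be defined elsewhere, as in the scheduler example below):

```python
import torch.optim as optim

# SGD with momentum; weight decay is optional L2 regularization
optimizer = optim.SGD(
    model.parameters(),
    lr=0.1,             # illustrative starting learning rate
    momentum=0.9,
    weight_decay=5e-4,  # illustrative regularization strength
)
```

Whichever optimizer you choose, it is usually paired with a learning-rate schedule, as in the example below.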
```python
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR

# Initialize optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Step decay scheduler: multiply the learning rate by 0.1 every 30 epochs
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

# Cosine annealing scheduler (alternative):
# scheduler = CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)
    scheduler.step()  # Update learning rate
```
Gradient descent started as a simple, intuitive idea—and through decades of research has evolved into a family of powerful optimizers. Understanding their mechanics helps you make informed choices for your models, saving training time and improving convergence.
Happy optimizing! 🚀