Understanding Gradient Descent

Explore the optimization algorithms powering modern machine learning through interactive visualizations and mathematical insights.

1. What Is Gradient Descent?

At its core, gradient descent seeks to minimize a differentiable function $J(\theta)$ (e.g., a loss function) by iteratively updating parameters $\theta$ in the direction of steepest descent:

$$\theta \leftarrow \theta - \eta \,\nabla_\theta J(\theta)$$

Each update "walks" the parameters downhill on the loss surface until (ideally) reaching a minimum.
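
To make the update rule concrete, here is a single step on the toy loss $J(\theta) = \theta^2$; the starting value and learning rate are illustrative choices, not taken from the text above.

Single Update Step Example
theta = 3.0                   # current parameter value
eta = 0.1                     # learning rate
grad = 2 * theta              # gradient of J(theta) = theta^2
theta = theta - eta * grad    # one gradient descent step
print(theta)                  # 2.4 -- one step closer to the minimum at theta = 0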

Interactive visualization: select an algorithm to see how different optimizers navigate the loss landscape.

2. Why Learning Rate Matters

Too Small Learning Rate

  • Painfully slow convergence
  • Long training times
  • May get stuck in plateaus

Too Large Learning Rate

  • Overshoots the minimum
  • Can diverge completely
  • Unstable training
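
Both failure modes are easy to reproduce on a toy one-dimensional loss; the learning rates below are illustrative values chosen to exaggerate each behavior.

Learning Rate Comparison Sketch
def run_gd(eta, steps=20, theta=5.0):
    """Minimize J(theta) = theta^2 with a fixed learning rate."""
    for _ in range(steps):
        theta -= eta * 2 * theta          # gradient of theta^2 is 2 * theta
    return theta

print(run_gd(eta=0.01))   # ~3.3   -- too small: barely moves in 20 steps
print(run_gd(eta=0.5))    # 0.0    -- well chosen: jumps straight to the minimum
print(run_gd(eta=1.1))    # ~191.7 -- too large: every step overshoots and diverges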

3. Core Variants

3.1 Batch Gradient Descent

Computes gradient over the entire training set each step.

Pros

  • Stable updates
  • True gradient direction

Cons

  • Expensive for large datasets
  • Memory-bound

Batch Gradient Descent Implementation
for epoch in range(num_epochs):
    # Gradient of the loss over the entire training set
    # (compute_gradient is a placeholder for evaluating the full gradient of J at theta)
    grad = compute_gradient(data=X, labels=y, params=theta)
    theta = theta - eta * grad

3.2 Stochastic Gradient Descent (SGD)

Uses one randomly sampled example per update.

Pros

  • Cheap updates
  • Can escape shallow local minima

Cons

  • Noisy trajectory
  • Zig-zag steps around optimum

SGD Implementation
import numpy as np

for epoch in range(num_epochs):
    # Visit training examples one at a time, in a fresh random order each epoch
    for i in np.random.permutation(len(X)):
        grad_i = compute_gradient(X[i], y[i], theta)   # gradient from a single example
        theta = theta - eta * grad_i

3.3 Mini-Batch Gradient Descent

Compromise: compute gradient over a small batch (e.g., 32–512 samples).

Sweet Spot: Balances computational efficiency and stable updates while leveraging vectorized hardware.
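
A minimal sketch of the mini-batch loop, in the same pseudocode style as the batch and SGD snippets above (X, y, theta, eta, num_epochs, and compute_gradient are assumed to be defined as before; the batch size of 64 is just an example):

Mini-Batch Gradient Descent Implementation
import numpy as np

batch_size = 64                                              # commonly 32-512
for epoch in range(num_epochs):
    order = np.random.permutation(len(X))                    # reshuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        grad = compute_gradient(X[batch], y[batch], theta)   # gradient over one mini-batch
        theta = theta - eta * grad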

4. Momentum and Nesterov Accelerated Gradient

4.1 Momentum

Adds an exponentially decaying "velocity" term $v$ to smooth updates:

$$\begin{aligned} v &\leftarrow \gamma\,v + \eta\,\nabla_\theta J(\theta)\\ \theta &\leftarrow \theta - v \end{aligned}$$

$\gamma$ (e.g., 0.9) retains past velocity, helping traverse flat regions faster and dampen oscillations.

Momentum Implementation
class MomentumOptimizer:
    def __init__(self, learning_rate=0.01, momentum=0.9):
        self.lr = learning_rate
        self.momentum = momentum    # gamma in the update rule above
        self.velocity = 0           # accumulated, exponentially decaying velocity

    def update(self, params, gradients):
        # Blend the previous velocity with the current gradient, then step along it
        self.velocity = self.momentum * self.velocity + self.lr * gradients
        params -= self.velocity
        return params

4.2 Nesterov Accelerated Gradient (NAG)

Looks ahead by computing the gradient at the "future" position:

$$\begin{aligned} v &\leftarrow \gamma\,v + \eta\,\nabla_\theta J(\theta - \gamma\,v)\\ \theta &\leftarrow \theta - v \end{aligned}$$

Nesterov Accelerated Gradient (NAG) Implementation
class NesterovOptimizer:
    def __init__(self, learning_rate=0.01, momentum=0.9):
        self.lr = learning_rate
        self.momentum = momentum
        self.velocity = 0
    
    def update(self, params, gradient_fn):
        # Look ahead
        future_params = params - self.momentum * self.velocity
        gradients = gradient_fn(future_params)
        
        self.velocity = self.momentum * self.velocity + self.lr * gradients
        params -= self.velocity
        return params

5. Adaptive Learning-Rate Methods

Variant     Key Idea
AdaGrad     Accumulates squared gradients to shrink η over time.
RMSProp     Uses exponential decay on squared gradients to avoid AdaGrad's rapid decay.
Adam        Combines momentum (first moment) and RMSProp (second moment).
AdaDelta    Extension of RMSProp that removes the need for a global η.
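
To make the shared idea concrete, here is a minimal RMSProp sketch in the same style as the optimizer classes above; the decay rate and epsilon are commonly used defaults rather than values prescribed by any particular library.

RMSProp Implementation (sketch)
import numpy as np

class RMSPropOptimizer:
    def __init__(self, learning_rate=0.01, decay=0.9, epsilon=1e-8):
        self.lr = learning_rate
        self.decay = decay          # decay rate for the squared-gradient average
        self.epsilon = epsilon
        self.sq_avg = 0             # exponentially decaying average of g^2

    def update(self, params, gradients):
        # Running, exponentially weighted average of squared gradients
        self.sq_avg = self.decay * self.sq_avg + (1 - self.decay) * (gradients ** 2)
        # Per-parameter step: large recent gradients shrink the effective learning rate
        params -= self.lr * gradients / (np.sqrt(self.sq_avg) + self.epsilon)
        return params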

5.1 Adam (Adaptive Moment Estimation)

The most popular adaptive optimizer combining momentum and adaptive learning rates:

$$\begin{aligned} m_t &= \beta_1\,m_{t-1} + (1-\beta_1)\,g_t \\ v_t &= \beta_2\,v_{t-1} + (1-\beta_2)\,g_t^2 \\ \hat m_t &= \frac{m_t}{1 - \beta_1^t}, \quad \hat v_t = \frac{v_t}{1 - \beta_2^t} \\ \theta &\leftarrow \theta - \eta\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} \end{aligned}$$

Default parameters: $\beta_1\approx0.9$, $\beta_2\approx0.999$, $\epsilon\approx10^{-8}$

Adam Implementation
import numpy as np

class AdamOptimizer:
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = 0  # First moment
        self.v = 0  # Second moment
        self.t = 0  # Time step
    
    def update(self, params, gradients):
        self.t += 1
        
        # Update biased first moment estimate
        self.m = self.beta1 * self.m + (1 - self.beta1) * gradients
        
        # Update biased second raw moment estimate
        self.v = self.beta2 * self.v + (1 - self.beta2) * (gradients ** 2)
        
        # Compute bias-corrected first moment estimate
        m_hat = self.m / (1 - self.beta1 ** self.t)
        
        # Compute bias-corrected second raw moment estimate
        v_hat = self.v / (1 - self.beta2 ** self.t)
        
        # Update parameters
        params -= self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)
        return params
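
A quick usage sketch of the class above on the toy loss $J(\theta) = \theta^2$; the learning rate, starting point, and step count are illustrative.

Adam Usage Example
import numpy as np

params = np.array([5.0])
adam = AdamOptimizer(learning_rate=0.1)

for step in range(200):
    gradients = 2 * params              # gradient of J(theta) = theta^2
    params = adam.update(params, gradients)

print(params)                           # hovers near the minimum at 0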

6. Choosing the Right Optimizer

Scenario                                Recommended
Small dataset / convex problems         Batch GD or SGD
Deep neural networks                    Adam or RMSProp
Very large datasets                     Mini-batch SGD w/ momentum
Memory-constrained / online learning    SGD
Need fine-tuned generalization          SGD w/ momentum (often matches Adam)

Tip: Monitor training and validation loss curves—no optimizer can replace good learning-rate scheduling, early stopping, or regularization.

7. Practical Tips

Essential Optimization Strategies

  • Learning-Rate Scheduling: Step decay, cosine annealing, or warm restarts can boost performance.
  • Gradient Clipping: Especially in RNNs, to prevent exploding gradients.
  • Weight Decay: Equivalent to L2 regularization for plain SGD; with adaptive optimizers, decoupled weight decay (as in AdamW) often works better. Both clipping and decay appear in the sketch after the scheduling example below.
  • Hyperparameter Search: Even Adam needs η tuning; try grid/random search or automated schedulers.

Learning Rate Scheduling Example
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR

# Initialize the optimizer (model, train_one_epoch, and num_epochs are defined elsewhere)
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Step decay scheduler
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

# Cosine annealing scheduler
# scheduler = CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)
    scheduler.step()  # Update learning rate
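
Gradient clipping and decoupled weight decay from the list above each take only a line or two in PyTorch; the max_norm and weight_decay values here are illustrative, and model, loss_fn, dataloader, and num_epochs are assumed to exist as in the scheduling example.

Gradient Clipping and Weight Decay Example
import torch
import torch.optim as optim

# AdamW applies decoupled weight decay alongside the Adam update
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

for epoch in range(num_epochs):
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        # Rescale gradients so their global norm is at most 1.0
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()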

8. Conclusion

Gradient descent started as a simple, intuitive idea—and through decades of research has evolved into a family of powerful optimizers. Understanding their mechanics helps you make informed choices for your models, saving training time and improving convergence.

Further Reading & Resources

  • Ian Goodfellow et al., Deep Learning (2016): Chapter on optimization.
  • Sebastian Ruder's blog: "An overview of gradient descent optimization algorithms."
  • TensorFlow/PyTorch docs for built-in optimizer implementations.

Happy optimizing! 🚀