Gradient Descent Explained

Interactive visualizer showing how gradient descent navigates a loss landscape step by step. Adjust learning rate, toggle momentum, and watch convergence in real time. No data leaves your browser.

Gradient Descent Visualizer

What Is Gradient Descent?

Gradient descent is the backbone of nearly every machine learning training algorithm. At its core, it is an optimization procedure that iteratively adjusts model parameters to minimize a loss function. Imagine standing on a hilly landscape in dense fog. You cannot see the lowest valley, but you can feel the slope beneath your feet. Gradient descent works the same way: at each step it computes the direction of steepest descent and moves the parameters downhill by a small amount proportional to the learning rate.

Formally, given a loss function L(theta) and parameters theta, the update rule is:

theta_new = theta_old - lr * gradient(L(theta_old))

The gradient is the vector of partial derivatives of the loss with respect to each parameter. It points in the direction of steepest increase, so we subtract it to move toward the minimum. The learning rate lr controls the step size. Too large and you overshoot; too small and training takes prohibitively long or gets stuck in a shallow local minimum.
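The update rule can be sketched in a few lines of plain Python. This is a minimal illustration, not the visualizer's own code: the function name `gradient_descent` and the bowl-shaped loss L(x, y) = (x - 3)^2 + (y + 1)^2 are assumptions chosen only because the minimum at (3, -1) is easy to verify by hand.

```python
def gradient_descent(grad, theta, lr=0.1, steps=100):
    """Repeatedly apply theta_new = theta_old - lr * gradient(L(theta_old))."""
    for _ in range(steps):
        g = grad(theta)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta


# Hypothetical loss for illustration: L(x, y) = (x - 3)^2 + (y + 1)^2
def grad_bowl(theta):
    x, y = theta
    return [2 * (x - 3), 2 * (y + 1)]  # partial derivatives w.r.t. x and y


theta = gradient_descent(grad_bowl, [0.0, 0.0], lr=0.1, steps=100)
print(theta)  # ends very close to the minimum at (3, -1)
```

On this loss each step shrinks the distance to the minimum by a constant factor of 1 - 2 * lr, which is why a moderate learning rate converges quickly here.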

Variants of Gradient Descent

Batch Gradient Descent

Batch (or vanilla) gradient descent computes the gradient over the entire training dataset before making a single parameter update. This gives the exact gradient of the training loss rather than a noisy estimate, but it is computationally expensive for large datasets: every update requires a full pass through the data, and a naive implementation needs the whole dataset in memory. In practice, batch gradient descent is rarely used for large-scale problems, but it is useful for understanding the theoretical behavior of the algorithm.

Stochastic Gradient Descent (SGD)

Stochastic gradient descent computes the gradient from a single randomly selected training example per update. This makes each step extremely fast but introduces high variance into the gradient estimate. The noisy updates can actually be beneficial: they help the optimizer escape shallow local minima and saddle points. However, SGD rarely converges smoothly and requires careful learning rate scheduling to achieve good results. It is the simplest form of gradient descent and remains a strong baseline, especially when combined with momentum.

Mini-Batch Gradient Descent

Mini-batch gradient descent is the practical sweet spot used in virtually all deep learning. It computes the gradient over a small random subset of the data (typically 32 to 256 samples) per update. This balances the gradient accuracy of batch methods with the speed and regularizing noise of SGD. Mini-batches also exploit hardware parallelism: GPUs process batches of data much more efficiently than individual samples. Batch size is a hyperparameter that affects both training speed and generalization performance.
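The three variants above differ only in how many samples feed each gradient estimate, so one loop can cover all of them. The sketch below fits a line y = w * x + b with mean squared error; the function name `minibatch_gd`, the toy data, and the hyperparameter values are assumptions made for illustration.

```python
import random


def minibatch_gd(xs, ys, batch_size=4, lr=0.05, epochs=200, seed=0):
    """Fit y = w * x + b by mini-batch gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # reshuffle before every epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            gw = gb = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                gw += 2 * err * xs[i]  # d(err^2)/dw
                gb += 2 * err          # d(err^2)/db
            # Average over the batch, then take one gradient step.
            w -= lr * gw / len(batch)
            b -= lr * gb / len(batch)
    return w, b


# Toy, noise-free data generated from y = 2x + 1
xs = [i / 10 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
w, b = minibatch_gd(xs, ys)
print(w, b)  # close to 2.0 and 1.0
```

Setting `batch_size=len(xs)` turns this into batch gradient descent, and `batch_size=1` turns it into single-sample SGD.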

SGD with Momentum

Momentum modifies SGD by adding a fraction of the previous update vector to the current step. The velocity accumulates over time, accelerating the optimizer in directions where the gradient is consistent and damping oscillations in directions of high curvature. Standard momentum applies a coefficient beta (typically 0.9) to the previous velocity. Nesterov accelerated gradient (NAG) is a refinement that computes the gradient at the look-ahead position, which improves the theoretical convergence rate on smooth convex problems and often helps in practice.

v_t = beta * v_{t-1} + gradient(L(theta))
theta = theta - lr * v_t
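The two momentum equations translate almost line for line into code. The following is a sketch under assumptions: the name `sgd_momentum` and the elongated "ravine" loss L(x, y) = 0.5 * (x^2 + 25 * y^2) are hypothetical, and the gradient is exact rather than sampled, so it isolates the effect of the velocity term.

```python
def sgd_momentum(grad, theta, lr=0.01, beta=0.9, steps=500):
    """v_t = beta * v_{t-1} + gradient;  theta = theta - lr * v_t."""
    v = [0.0] * len(theta)  # velocity starts at zero
    for _ in range(steps):
        g = grad(theta)
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        theta = [t - lr * vi for t, vi in zip(theta, v)]
    return theta


# Hypothetical ravine: L(x, y) = 0.5 * (x^2 + 25 * y^2), minimum at (0, 0)
def grad_ravine(theta):
    x, y = theta
    return [x, 25.0 * y]


theta = sgd_momentum(grad_ravine, [5.0, 1.0])
plain = sgd_momentum(grad_ravine, [5.0, 1.0], beta=0.0)  # no momentum
print(theta, plain)
```

With `beta=0.0` the same function reduces to plain gradient descent, which makes it easy to compare how far each variant gets along the shallow x direction in the same number of steps.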

Adam (Adaptive Moment Estimation)

Adam is one of the most widely used optimizers in modern deep learning. It combines the benefits of momentum (first moment) with RMSProp-style adaptive learning rates (second moment). Each parameter gets its own effective learning rate that scales inversely with the root mean square of its recent gradients. Parameters with consistently large gradients get smaller effective learning rates, while parameters with small or sparse gradients get larger ones. Both moment estimates are initialized at zero, which biases them toward zero early in training; the bias correction terms compensate for this and matter most in the first few steps.

m_t = beta1 * m_{t-1} + (1 - beta1) * g_t       // first moment
v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2      // second moment
m_hat = m_t / (1 - beta1^t)                        // bias correction
v_hat = v_t / (1 - beta2^t)
theta = theta - lr * m_hat / (sqrt(v_hat) + eps)

Common defaults are lr=0.001, beta1=0.9, beta2=0.999, and eps=1e-8. Adam often converges faster than SGD early in training but can generalize slightly worse on some tasks. AdamW decouples weight decay from the gradient-based update, which often narrows that gap.
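The Adam equations above can be written out directly. This is a minimal sketch using plain Python lists; the function name `adam` and the bowl-shaped test loss are assumptions, and real frameworks add features (weight decay, parameter groups) that are omitted here.

```python
import math


def adam(grad, theta, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam update with bias-corrected first and second moment estimates."""
    m = [0.0] * len(theta)  # first moment (mean of gradients)
    v = [0.0] * len(theta)  # second moment (mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1 - beta1 ** t) for mi in m]  # bias correction
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        theta = [th - lr * mh / (math.sqrt(vh) + eps)
                 for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta


# Hypothetical loss: L(x, y) = (x - 3)^2 + (y + 1)^2, minimum at (3, -1)
def grad_bowl(theta):
    x, y = theta
    return [2 * (x - 3), 2 * (y + 1)]


first = adam(grad_bowl, [0.0, 0.0], steps=1)
final = adam(grad_bowl, [0.0, 0.0], lr=0.05, steps=2000)
print(first, final)
```

After a single step, bias correction makes m_hat equal to the gradient and v_hat equal to its square, so every parameter moves by roughly lr regardless of the gradient's magnitude. That is the per-parameter scaling described above.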

Learning Rate Selection

The learning rate is arguably the single most important hyperparameter in gradient-based optimization. It governs the trade-off between convergence speed and stability. A learning rate that is too high causes the loss to oscillate or diverge. A learning rate that is too low leads to painfully slow convergence and can trap the optimizer in suboptimal local minima.
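A one-dimensional example makes the trade-off concrete. For the assumed loss L(x) = x^2 the gradient is 2x, so each update multiplies x by (1 - 2 * lr): the iterate shrinks when 0 < lr < 1 and grows without bound when lr > 1. The helper name `run` is hypothetical.

```python
def run(lr, x0=1.0, steps=20):
    """Gradient descent on L(x) = x^2, whose gradient is 2x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # equivalent to x = (1 - 2 * lr) * x
    return x


for lr in (0.001, 0.1, 0.9, 1.1):
    print(lr, run(lr))
```

The smallest rate barely moves in 20 steps, 0.1 converges smoothly, 0.9 converges while flipping sign on every step, and 1.1 diverges.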

Several strategies exist for setting and adjusting the learning rate:

Grid search: Try values on a logarithmic scale (1e-5, 1e-4, 1e-3, 1e-2, 1e-1) and pick the one that produces the fastest stable decrease in validation loss.
Learning rate finder: Gradually increase the learning rate from a very small value over one epoch while recording the loss. Plot loss vs. learning rate and choose the rate just before the loss starts increasing sharply.
Warm-up: Start with a very low learning rate and gradually increase it over the first few hundred or thousand steps. This stabilizes training, especially for transformer architectures and large batch sizes.
Cosine annealing: After warm-up, decrease the learning rate following a cosine schedule down to near zero. This produces smooth convergence without abrupt schedule changes.
Step decay: Multiply the learning rate by a factor (e.g., 0.1) at fixed epoch milestones. Simple and effective for many problems.
Reduce on plateau: Monitor validation loss and reduce the learning rate by a factor when it stops improving. Adaptive and requires less manual tuning.
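Two of the schedules listed above, warm-up followed by cosine annealing and step decay, reduce to small pure functions of the step or epoch index. The sketch below is one possible formulation; the function names, the linear warm-up shape, and the default milestones are assumptions rather than a fixed standard.

```python
import math


def warmup_cosine(step, total_steps, base_lr=1e-3, warmup_steps=100, min_lr=0.0):
    """Linear warm-up to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


def step_decay(epoch, base_lr=0.1, factor=0.1, milestones=(30, 60)):
    """Multiply the learning rate by `factor` at each milestone epoch."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** passed


for s in (0, 50, 100, 550, 1000):
    print(s, warmup_cosine(s, total_steps=1000))
for e in (0, 30, 60):
    print(e, step_decay(e))
```

Keeping the schedule as a function of the step counter makes it easy to plot before training and to resume from a checkpoint without extra state.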

Convergence Criteria

How do you know when gradient descent has finished? In practice, you monitor several signals:

Loss plateau: When the training loss changes by less than a threshold (e.g., 1e-6) over several consecutive epochs, the optimizer has likely reached a minimum or a flat region where further progress at the current learning rate is negligible.
Validation performance: If validation loss stops improving or begins increasing while training loss continues to drop, you are overfitting and should stop training (early stopping).
Gradient magnitude: When the norm of the gradient approaches zero, the optimizer is at a critical point (minimum, maximum, or saddle point).
Maximum iterations: Set a hard cap on epochs to prevent runaway training. This is a safety measure, not a convergence signal.
Parameter stability: When parameters change by negligible amounts between updates, the model has effectively converged.
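Several of these signals can be combined in one training loop. The sketch below checks gradient magnitude, a loss plateau with a patience counter, and a hard iteration cap; validation-based early stopping is left out because it needs a held-out dataset. The function name `minimize`, the tolerance values, and the test loss are assumptions.

```python
import math


def minimize(loss, grad, theta, lr=0.1, max_iters=10_000,
             grad_tol=1e-6, loss_tol=1e-12, patience=5):
    """Gradient descent that reports which stopping criterion fired."""
    prev = loss(theta)
    stall = 0  # consecutive iterations with a negligible loss change
    for it in range(1, max_iters + 1):
        g = grad(theta)
        if math.sqrt(sum(gi * gi for gi in g)) < grad_tol:
            return theta, it, "gradient norm below tolerance"
        theta = [t - lr * gi for t, gi in zip(theta, g)]
        cur = loss(theta)
        stall = stall + 1 if abs(prev - cur) < loss_tol else 0
        if stall >= patience:
            return theta, it, "loss plateau"
        prev = cur
    return theta, max_iters, "max iterations reached"  # safety cap only


# Hypothetical loss: L(x, y) = (x - 3)^2 + (y + 1)^2
def bowl_loss(theta):
    return (theta[0] - 3) ** 2 + (theta[1] + 1) ** 2


def bowl_grad(theta):
    return [2 * (theta[0] - 3), 2 * (theta[1] + 1)]


theta, iters, reason = minimize(bowl_loss, bowl_grad, [0.0, 0.0])
print(theta, iters, reason)
```

Returning the reason alongside the result makes it clear whether the run converged or simply hit the cap.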

Common Pitfalls

Even experienced practitioners fall into gradient descent traps. Here are the most common issues and their solutions:

Exploding gradients: In deep networks (especially RNNs), gradients can grow exponentially through backpropagation. Use gradient clipping (cap the gradient norm at a threshold, typically 1.0) to prevent this.
Vanishing gradients: Gradients shrink exponentially in deep networks using sigmoid or tanh activations. Switch to ReLU-family activations, use residual connections, or apply batch normalization.
Saddle points: In high-dimensional spaces, most critical points are saddle points rather than local minima. SGD noise and momentum help escape them, and Adam's per-parameter scaling can speed up the escape by taking larger steps along directions where gradients are small.
Poor initialization: Starting with all-zero weights in a neural network means all neurons learn the same thing (symmetry problem). Use Xavier or He initialization to break symmetry.
Not shuffling data: If training data is ordered (e.g., all class-A samples first), gradients will be biased. Always shuffle before each epoch.
Feature scale mismatch: When features have vastly different scales, the loss landscape becomes elongated, causing slow convergence. Normalize or standardize features before training.
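Two of the fixes above are short enough to show directly: clipping the gradient by its norm and standardizing a feature column. This is a minimal sketch on plain Python lists; the helper names `clip_by_norm` and `standardize` are assumptions, and deep learning frameworks ship their own equivalents.

```python
import math


def clip_by_norm(grads, max_norm=1.0):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # already within the threshold
    scale = max_norm / norm
    return [g * scale for g in grads]  # same direction, shorter length


def standardize(column):
    """Shift a feature column to zero mean and scale it to unit variance."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    if std == 0.0:
        return [0.0] * n  # constant feature carries no information
    return [(x - mean) / std for x in column]


clipped = clip_by_norm([3.0, 4.0])       # norm 5 is rescaled to norm 1
untouched = clip_by_norm([0.3, 0.4])     # norm 0.5 is left alone
scaled = standardize([1.0, 2.0, 3.0])
print(clipped, untouched, scaled)
```

Clipping by norm preserves the gradient's direction, which is why it is usually preferred over clipping each component independently.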

Gradient Descent in Practice

When training a real model, gradient descent is just one piece of the puzzle. You also need proper data preprocessing (normalization, handling missing values), a well-chosen architecture, regularization (dropout, weight decay, data augmentation), and a validation strategy (hold-out set or k-fold cross-validation). The optimizer choice matters, but it is rarely the bottleneck. Focus on data quality, feature engineering, and architecture selection first. Use Adam as a sensible default optimizer, switch to SGD with momentum if you need better generalization, and pair either with a learning rate schedule.

For the interactive visualizer above, try experimenting with different settings: increase the learning rate on the ravine surface to see oscillation, switch between SGD and Adam on the multi-modal surface to see how adaptive methods handle complex landscapes, and toggle momentum to observe how velocity accumulation affects convergence speed. All computations run entirely in your browser with no data sent to any server.

Frequently Asked Questions

What is gradient descent in machine learning?

Gradient descent is an iterative optimization algorithm used to minimize a loss function by repeatedly adjusting parameters in the direction of steepest descent. It computes the gradient (partial derivatives) of the loss with respect to each parameter, then updates each parameter by subtracting the gradient multiplied by a learning rate. It is the foundation of training nearly all modern machine learning models including neural networks.

What is the difference between SGD, mini-batch, and batch gradient descent?

Batch gradient descent computes the gradient over the entire dataset per update, giving stable but slow convergence. Stochastic Gradient Descent (SGD) uses a single random sample per update, which is fast but noisy. Mini-batch gradient descent is the practical middle ground: it uses a small batch (typically 32-256 samples) per update, balancing computational efficiency with gradient stability. Most deep learning uses mini-batch.

How do I choose a learning rate for gradient descent?

Start with common defaults like 0.001 for Adam or 0.01 for SGD. Use learning rate schedulers (cosine annealing, step decay, or warm-up + decay) to adjust during training. The learning rate finder technique (gradually increasing LR and plotting loss) helps identify the optimal range. Too high causes divergence; too low causes slow convergence or getting stuck in local minima.

What is momentum in gradient descent?

Momentum adds a fraction (typically 0.9) of the previous update vector to the current gradient step. This accelerates convergence in consistent gradient directions and dampens oscillations in noisy or ravine-like loss surfaces. It helps the optimizer build velocity and escape shallow local minima. Nesterov momentum is a variant that computes the gradient at the look-ahead position for even faster convergence.

Why does gradient descent sometimes not converge?

Common causes of non-convergence include: learning rate too high (parameters overshoot the minimum), exploding gradients in deep networks (use gradient clipping), vanishing gradients (use ReLU activations or residual connections), saddle points in high dimensions, poor weight initialization, and non-convex loss landscapes with many local minima. Adaptive optimizers like Adam mitigate some of these issues, such as poorly scaled gradients, but they do not replace gradient clipping, sound initialization, or a sensible learning rate.

About the Author

Michael Lip builds open-source ML tools and developer utilities at zovo.one. ml0x is part of the Zovo Tools network, a collection of free, privacy-first tools for developers and data scientists. No tracking, no accounts required, no data leaves your browser.

Last updated: May 25, 2026