Learning Rate Finder
Simulate an LR range test to find the optimal learning rate for your model. Set a minimum and maximum learning rate, choose the number of steps, and watch the loss curve unfold on a logarithmic scale. The tool automatically detects the steepest descent point and suggests the best learning rate. Compare warm restarts versus one-cycle scheduling side by side. All computation runs locally in your browser with zero server calls.
What Is a Learning Rate Range Test?
The learning rate range test, introduced by Leslie Smith in his 2015 paper "Cyclical Learning Rates for Training Neural Networks," is a systematic method for finding the optimal learning rate for your model. Instead of guessing or doing a full grid search over many training runs, you perform a single abbreviated training run where the learning rate is increased exponentially from a very small value to a very large value. By recording the training loss at each step, you generate a characteristic curve that reveals three distinct phases of learning rate behavior.
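The exponential sweep described above can be sketched in a few lines. This is a minimal illustration, not the code this tool runs; the function name `exponential_lr_schedule` and its parameters are hypothetical.

```python
def exponential_lr_schedule(lr_min: float, lr_max: float, num_steps: int) -> list[float]:
    """Return num_steps learning rates spaced evenly on a log scale."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    if not 0 < lr_min < lr_max:
        raise ValueError("need 0 < lr_min < lr_max")
    ratio = lr_max / lr_min
    # Each step multiplies the previous LR by the same constant factor.
    return [lr_min * ratio ** (i / (num_steps - 1)) for i in range(num_steps)]


# Example: 8 steps across 7 orders of magnitude, so one decade per step.
for step, lr in enumerate(exponential_lr_schedule(1e-7, 1.0, 8)):
    print(f"step {step}: lr = {lr:.1e}")
```

In a real range test you would train on one mini-batch at each of these learning rates and record the loss.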
In the first phase (learning rate too small), the loss decreases very slowly or remains essentially flat. The model is making progress, but the updates are so tiny that learning is impractically slow. In the second phase (optimal range), the loss drops steeply. This is where the learning rate is large enough to make meaningful progress but small enough to remain stable. In the third phase (learning rate too large), the loss starts increasing sharply or oscillating wildly. The updates have become so large that they overshoot minima and the optimization diverges.
The steepest descent point on this curve, where the loss is decreasing most rapidly, corresponds to the learning rate that produces the most efficient learning. The standard recommendation is to pick this learning rate or one slightly below it. A related heuristic starts from the learning rate at which the loss bottoms out and divides it by 3 to 10. Either way, the result leaves a margin of safety while still benefiting from fast convergence.
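One way to locate the steepest descent point is to take finite differences of the (smoothed) loss with respect to log10 of the learning rate. The sketch below assumes that approach; `suggest_lr` is a hypothetical helper and the loss values are made up for illustration.

```python
import math


def suggest_lr(lrs, losses):
    """Return the LR where the loss falls fastest per unit of log10(lr)."""
    if len(lrs) != len(losses) or len(lrs) < 3:
        raise ValueError("need at least 3 matching (lr, loss) points")
    best_lr, best_slope = None, 0.0
    for i in range(1, len(lrs) - 1):
        # Central difference: slope of loss with respect to log10(lr).
        d_loss = losses[i + 1] - losses[i - 1]
        d_log_lr = math.log10(lrs[i + 1]) - math.log10(lrs[i - 1])
        slope = d_loss / d_log_lr
        if best_lr is None or slope < best_slope:
            best_lr, best_slope = lrs[i], slope
    return best_lr


# Toy curve: flat, then a steep drop, then divergence.
lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
losses = [2.30, 2.28, 2.10, 1.40, 1.10, 3.50]
print(suggest_lr(lrs, losses))  # 0.01
```

In this toy curve the lowest loss occurs at 1e-1, one decade above the steepest descent point at 1e-2.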
Why the Learning Rate Matters So Much
The learning rate is arguably the single most important hyperparameter in deep learning. It controls the step size of gradient descent updates: too small and training takes forever (or gets stuck in a poor local minimum), too large and training diverges or oscillates without converging. Unlike other hyperparameters that affect model capacity (like width, depth, or regularization strength), the learning rate directly determines whether training succeeds or fails. A model with perfect architecture but a terrible learning rate will produce garbage results.
Research has shown that the optimal learning rate depends on many factors: the loss landscape geometry, batch size, model architecture, optimizer choice, and even the current stage of training. What works for a ResNet-50 on ImageNet will not necessarily work for a Transformer on a text corpus. This is why empirical methods like the LR range test are so valuable: they probe the actual loss landscape of your specific model on your specific data, rather than relying on generic rules of thumb.
The relationship between learning rate and batch size is particularly important. When you increase the batch size by a factor of k, the variance of the gradient estimate falls by roughly a factor of k (its standard deviation by sqrt(k)), which means you can often increase the learning rate as well. The linear scaling rule suggests a factor of k; the square-root rule suggests sqrt(k), which is more conservative and often more reliable. The LR range test is run at your chosen batch size, so its result already reflects it, which makes the test a practical approach for any training setup.
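To make the two scaling rules concrete, here is a small sketch that compares them. The helper `scaled_lr` and the example numbers are illustrative assumptions, and either result is only a starting guess to confirm with a range test.

```python
import math


def scaled_lr(base_lr, base_batch, new_batch, rule="sqrt"):
    """Scale a known-good LR when the batch size changes by k = new/base."""
    k = new_batch / base_batch
    if rule == "linear":
        return base_lr * k
    if rule == "sqrt":
        return base_lr * math.sqrt(k)
    raise ValueError("rule must be 'linear' or 'sqrt'")


# Going from batch size 256 to 1024 gives k = 4.
print(scaled_lr(0.1, 256, 1024, rule="linear"))  # about 0.4
print(scaled_lr(0.1, 256, 1024, rule="sqrt"))    # about 0.2
```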
Understanding Smoothed Loss Curves
Raw training loss is inherently noisy due to mini-batch stochasticity. Each batch produces a slightly different loss value depending on which samples happen to be in that batch. This noise can obscure the underlying trend, making it difficult to identify the optimal learning rate. Exponential moving average (EMA) smoothing addresses this by computing a weighted average where recent values have more influence than older ones.
The smoothing factor (beta) controls how much smoothing is applied. A value of 0.0 means no smoothing (raw loss), while 0.99 produces very heavy smoothing. For LR range tests, a smoothing factor between 0.8 and 0.95 typically works well. Too little smoothing leaves the curve too noisy to interpret. Too much smoothing introduces lag, causing the detected optimal point to shift toward higher learning rates than is actually optimal. The bias-corrected EMA (dividing by 1 - beta^t) compensates for the initialization bias in early steps.
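A bias-corrected EMA takes only a few lines. The sketch below follows the formula in the paragraph above; the function name `smooth_losses` is a hypothetical choice.

```python
def smooth_losses(losses, beta=0.9):
    """Bias-corrected exponential moving average of a loss sequence."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must be in [0, 1)")
    avg = 0.0
    smoothed = []
    for t, loss in enumerate(losses, start=1):
        avg = beta * avg + (1.0 - beta) * loss
        # Dividing by (1 - beta^t) removes the bias toward the zero start value.
        smoothed.append(avg / (1.0 - beta ** t))
    return smoothed


raw = [2.31, 2.25, 2.40, 2.10, 2.22, 1.95]
print([round(v, 3) for v in smooth_losses(raw, beta=0.9)])
```

With beta=0.0 the function returns the raw losses unchanged. Thanks to the bias correction, the first smoothed value equals the first raw loss rather than starting near zero.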
One-Cycle Policy vs Warm Restarts
Once you have identified the optimal learning rate, you need to choose a scheduling strategy for the full training run. Two of the most effective modern approaches are the one-cycle policy and warm restarts with cosine annealing.
The one-cycle policy (Smith, 2018), as described here, consists of a single cycle with two phases. First, the learning rate warms up linearly from a low value (typically 1/25th of the max LR) to the maximum over the first 30-45% of training. Then it anneals following a cosine schedule back down to near zero over the remaining steps. Simultaneously, the momentum is cycled inversely: it starts high (0.95), decreases during warmup, then increases again during annealing. This approach often achieves "super-convergence," reaching equivalent or better accuracy in 5-10x fewer epochs than traditional training.
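The schedule described above can be sketched as a pure function of the step index. This is a simplified illustration rather than any library's implementation. The low momentum value of 0.85 and the parameter names are assumptions, since the text only specifies the 0.95 starting value.

```python
import math


def one_cycle(step, total_steps, max_lr, pct_start=0.3, div_factor=25.0,
              final_lr=0.0, mom_high=0.95, mom_low=0.85):
    """Return (lr, momentum) for a 0-based step of a one-cycle schedule."""
    warmup_steps = max(1, round(pct_start * total_steps))
    start_lr = max_lr / div_factor
    if step < warmup_steps:
        # Phase 1: linear warmup of the LR, linear decrease of momentum.
        frac = step / warmup_steps
        lr = start_lr + frac * (max_lr - start_lr)
        momentum = mom_high - frac * (mom_high - mom_low)
    else:
        # Phase 2: cosine anneal of the LR, momentum rises back up.
        frac = (step - warmup_steps) / max(1, total_steps - 1 - warmup_steps)
        cos = 0.5 * (1.0 + math.cos(math.pi * frac))  # goes from 1 to 0
        lr = final_lr + cos * (max_lr - final_lr)
        momentum = mom_high - cos * (mom_high - mom_low)
    return lr, momentum


for s in (0, 15, 30, 65, 99):
    lr, mom = one_cycle(s, total_steps=100, max_lr=0.01)
    print(f"step {s:3d}: lr = {lr:.5f}, momentum = {mom:.3f}")
```

Here `max_lr` would be the value suggested by the range test.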
The warm restart strategy (SGDR, Loshchilov & Hutter, 2016) uses cosine annealing to decrease the learning rate from the initial value to near zero, then abruptly resets it back to the initial value and repeats. Each restart allows the optimizer to potentially escape local minima and explore new regions of the loss landscape. The restart period can be fixed or increasing (multiplied by a factor T_mult after each cycle). This approach is particularly effective when combined with snapshot ensembling, where you save the model at the end of each cycle and average the predictions.
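A minimal sketch of cosine annealing with warm restarts, including the optional T_mult growth of the restart period, might look like this. The function `sgdr_lr` and its argument names are illustrative, and `t_mult` is assumed to be an integer of at least 1.

```python
import math


def sgdr_lr(step, base_lr, first_cycle_steps, t_mult=1, min_lr=0.0):
    """Learning rate at a 0-based step under cosine annealing with restarts."""
    if first_cycle_steps < 1 or t_mult < 1:
        raise ValueError("first_cycle_steps and t_mult must be >= 1")
    cycle_len = first_cycle_steps
    pos = step
    # Walk forward through completed cycles to find the position in the current one.
    while pos >= cycle_len:
        pos -= cycle_len
        cycle_len *= t_mult
    cos = 0.5 * (1.0 + math.cos(math.pi * pos / cycle_len))
    return min_lr + (base_lr - min_lr) * cos


# With first_cycle_steps=10 and t_mult=2, restarts happen at steps 10 and 30.
for s in (0, 5, 9, 10, 20, 29, 30):
    print(f"step {s:2d}: lr = {sgdr_lr(s, 0.1, 10, t_mult=2):.4f}")
```

For snapshot ensembling, you would save a checkpoint on the last step of each cycle, just before the learning rate jumps back up.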
In practice, the one-cycle policy tends to produce better single-model results with faster convergence, while warm restarts with snapshot ensembling produce better ensemble results. For most practitioners training a single model, one-cycle is the recommended default. For competition settings where ensemble performance matters, warm restarts with snapshot ensembling are a powerful technique.
Advanced LR Scheduling Strategies
Beyond one-cycle and warm restarts, several other scheduling strategies are worth understanding. Step decay reduces the learning rate by a fixed factor (typically 0.1) at predetermined epoch milestones. This was the standard approach for years (e.g., dividing by 10 at epochs 30, 60, and 90 for ImageNet training) and remains competitive for many tasks. Exponential decay continuously reduces the learning rate by multiplying by a factor less than 1 at each step or epoch. Polynomial decay follows a polynomial schedule, with the power parameter controlling the curvature. ReduceLROnPlateau monitors a validation metric and reduces the learning rate when improvement stalls, making it adaptive to the actual training dynamics.
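Three of these schedules are simple enough to write as closed-form functions of the epoch. The sketch below is illustrative only: the function names and default values (such as gamma=0.95 for exponential decay and power=2.0 for polynomial decay) are assumptions, not recommendations from the text.

```python
import bisect


def step_decay(epoch, base_lr, milestones=(30, 60, 90), gamma=0.1):
    """Multiply the LR by gamma once for each milestone already reached."""
    drops = bisect.bisect_right(milestones, epoch)
    return base_lr * gamma ** drops


def exponential_decay(epoch, base_lr, gamma=0.95):
    """Multiply the LR by gamma (< 1) once per epoch."""
    return base_lr * gamma ** epoch


def polynomial_decay(epoch, base_lr, total_epochs, power=2.0, end_lr=0.0):
    """Decay from base_lr to end_lr; power controls the curvature."""
    frac = min(epoch, total_epochs) / total_epochs
    return end_lr + (base_lr - end_lr) * (1.0 - frac) ** power


for e in (0, 29, 30, 60, 90):
    print(e, step_decay(e, 0.1), exponential_decay(e, 0.1),
          polynomial_decay(e, 0.1, total_epochs=100))
```

ReduceLROnPlateau is omitted because it depends on a running validation metric rather than on the epoch number alone.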
Modern optimizers like Adam, AdamW, and LAMB incorporate per-parameter adaptive learning rates, which makes training somewhat less sensitive to the global learning rate choice. However, even with adaptive optimizers, the base learning rate still matters significantly, and the LR range test remains valuable. Research by Zhang et al. (2019) showed that AdamW with a properly tuned learning rate schedule matches or outperforms SGD with momentum on many tasks, challenging the earlier belief that adaptive optimizers generalize worse.
Practical Tips for LR Range Tests
- Run the range test on your full dataset (or a representative subset) with the exact batch size and optimizer you plan to use for training. These factors affect the optimal LR.
- Use 100-300 steps covering 5-6 orders of magnitude. Starting at 1e-7 and ending between 1 and 10 usually covers the useful range for most architectures and optimizers.
- Apply EMA smoothing with beta=0.9 to see the trend clearly. Adjust if the curve is still too noisy or overly smoothed.
- The optimal LR is at the steepest descent, not at the minimum loss. The minimum is often already in the instability zone.
- For transfer learning, the optimal LR for fine-tuning is typically 10-100x smaller than for training from scratch, because the pretrained weights are already near a good minimum.
- Repeat the test 2-3 times with different random seeds to verify the optimal range is consistent. If results vary widely, use the more conservative (lower) estimate.
- When using learning rate warmup (recommended for Transformers and large batch training), the LR range test gives you the peak learning rate. The warmup period typically covers 5-10% of total training steps.
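As an illustration of the warmup tip, the sketch below ramps linearly to the peak learning rate found by the range test over the first 5% of training. Holding the rate constant after warmup is a simplifying assumption made here for brevity; in practice a decay schedule usually follows. The name `warmup_lr` is hypothetical.

```python
def warmup_lr(step, peak_lr, total_steps, warmup_frac=0.05):
    """Linear warmup to peak_lr over the first warmup_frac of training."""
    warmup_steps = max(1, round(warmup_frac * total_steps))
    if step < warmup_steps:
        # step is 0-based, so the last warmup step lands exactly on peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr  # assumption: constant afterwards; usually a decay follows


# 1000 training steps with 5% warmup gives 50 warmup steps.
for s in (0, 24, 49, 500):
    print(f"step {s:3d}: lr = {warmup_lr(s, 1e-3, 1000):.6f}")
```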
Frequently Asked Questions
What is a learning rate range test?
A learning rate range test is a technique where you gradually increase the learning rate from a very small value to a very large value over a short training run (typically a fraction of one epoch), recording the loss at each step. Plotting loss vs learning rate on a log scale reveals the optimal learning rate at the steepest descent point. This gives you a data-driven starting point instead of guessing.
How do I pick the optimal learning rate from the curve?
Look for the point where the smoothed loss curve is decreasing most steeply. This is the steepest descent point, found by computing the slope of the loss curve with respect to the log of the learning rate and selecting the learning rate with the most negative slope. Alternatively, a common rule of thumb is to pick an LR about 10x smaller than where the minimum loss occurs.
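The 10x rule of thumb is easy to apply programmatically. The sketch below assumes you already have matching lists of learning rates and smoothed losses; `lr_from_min_loss` is a hypothetical helper and the numbers are invented for illustration.

```python
def lr_from_min_loss(lrs, losses, divisor=10.0):
    """Take the LR at the lowest recorded loss and back off by `divisor`."""
    if len(lrs) != len(losses) or not lrs:
        raise ValueError("lrs and losses must be non-empty and the same length")
    i_min = min(range(len(losses)), key=losses.__getitem__)
    return lrs[i_min] / divisor


lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
losses = [2.30, 2.28, 2.10, 1.40, 1.10, 3.50]
print(lr_from_min_loss(lrs, losses))  # roughly 0.01
```

When this heuristic and the steepest descent point disagree by a wide margin, the lower of the two is the safer starting point.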
What is the difference between warm restarts and one-cycle policy?
Warm restarts periodically reset the learning rate back to a high value and anneal it down again, creating multiple training cycles. The one-cycle policy uses a single cycle: warm up to a maximum LR, then anneal down to near zero. One-cycle typically achieves super-convergence with fewer epochs, while warm restarts are better for snapshot ensembling.
Why should I use a log scale for the learning rate axis?
Learning rates span several orders of magnitude (e.g., 1e-7 to 1). On a linear scale, almost everything would be compressed near zero. A log scale spaces these values evenly, so you can clearly see the flat region, optimal region, and divergence region of the loss curve.
How many steps should I use for an LR range test?
Typically 100 to 300 steps is sufficient. You need enough steps to cover 5-6 orders of magnitude and see the full shape of the loss curve. With too many steps, the model trains substantially while the learning rate is still low, so the curve reflects accumulated training progress rather than the effect of the learning rate alone. Running through 10-25% of one epoch usually works well.
About the Author
Michael Lip builds open-source ML tools and developer utilities at zovo.one. ml0x is part of the Zovo Tools network, a collection of free, privacy-first tools for developers and data scientists. No tracking, no accounts required, no data leaves your browser.
Last updated: May 25, 2026