Overfitting vs Underfitting

Interactive bias-variance tradeoff demo. Adjust the polynomial degree to see how model complexity affects fitting. Watch training and test errors change in real time. All computations run locally in your browser.

[Interactive demo: Bias-Variance Tradeoff. Readouts: Train MSE, Test MSE, Fit Status. Control: polynomial Degree slider (default 3).]

Adjust the polynomial degree slider to explore underfitting, good fit, and overfitting regimes.

The Bias-Variance Tradeoff

The bias-variance tradeoff is one of the most fundamental concepts in machine learning. For squared-error loss, a model's expected prediction error can be decomposed into three components: bias, variance, and irreducible noise. Understanding this decomposition is essential for diagnosing model problems and choosing appropriate solutions.

Bias is the error introduced by approximating a complex real-world problem with a simplified model. A linear model trying to fit a quadratic relationship has high bias. It consistently misses the true pattern regardless of the training data. High-bias models are said to underfit the data.

Variance is the error introduced by the model's sensitivity to fluctuations in the training set. A degree-15 polynomial fit to 20 noisy data points will produce wildly different curves depending on which 20 points are sampled. The model captures noise as if it were signal. High-variance models are said to overfit the data.

Irreducible noise is the inherent randomness in the data that no model can explain. This sets a floor on achievable test error. A model whose training error drops below this floor is fitting noise, which increases variance.

The total expected squared error is: Error = Bias^2 + Variance + Noise. As model complexity increases, bias decreases (the model can represent more complex patterns) but variance increases (the model becomes more sensitive to the particular training sample). The noise term does not depend on the model, so the best complexity is the one that minimizes Bias^2 + Variance.
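The decomposition can be checked numerically. The sketch below is a toy simulation, not the demo's own code: the true function x^2, the noise level, the query point x0 = 0.9, and both models (a constant predictor and a 1-nearest-neighbor predictor) are assumptions chosen only to contrast a rigid model with a flexible one.

```python
import random
import statistics

def true_f(x):
    return x * x  # assumed ground-truth function for this toy example

def sample_training_set(rng, n=20, noise_sd=0.3):
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def predict_constant(xs, ys, x0):
    # Rigid model: ignore x and predict the mean training target.
    return statistics.fmean(ys)

def predict_nearest(xs, ys, x0):
    # Flexible model: copy the target of the nearest training point.
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x0))
    return ys[i]

def bias2_and_variance(predict, x0=0.9, trials=500, seed=0):
    """Estimate bias^2 and variance at x0 over many resampled training sets."""
    rng = random.Random(seed)
    preds = []
    for _ in range(trials):
        xs, ys = sample_training_set(rng)
        preds.append(predict(xs, ys, x0))
    mean_pred = statistics.fmean(preds)
    bias2 = (mean_pred - true_f(x0)) ** 2
    variance = statistics.pvariance(preds)
    return bias2, variance

const_bias2, const_var = bias2_and_variance(predict_constant)
nn_bias2, nn_var = bias2_and_variance(predict_nearest)
print(f"constant model  : bias^2={const_bias2:.3f} variance={const_var:.3f}")
print(f"1-nearest-nbr   : bias^2={nn_bias2:.3f} variance={nn_var:.3f}")
```

Under these assumptions the constant model should show the larger bias term and the nearest-neighbor model the larger variance term, which is the pattern the formula describes.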

What Is Overfitting?

Overfitting occurs when a model learns the noise and random fluctuations in the training data rather than the underlying pattern. An overfit model performs excellently on training data but poorly on unseen data. In the visualization above, increase the polynomial degree to 12 or higher and observe how the curve passes through every training point but oscillates wildly between them. Those wild oscillations predict nonsensical values for new data points.

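Overfitting is more likely when the model has many free parameters relative to the number of training examples, when the training data is noisy, or when training continues long after validation performance has stopped improving.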

A classic sign of overfitting is a large gap between training and validation performance. If your model achieves 99% accuracy on training data but only 60% on validation data, it has memorized the training set rather than learning generalizable patterns.

What Is Underfitting?

Underfitting occurs when a model is too simple to capture the underlying pattern in the data. It performs poorly on both training and test data. In the visualization above, set the polynomial degree to 1. The straight line cannot capture the curved relationship, resulting in high error on both training and test data.
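A small offline version of this experiment can be sketched in plain Python. Everything here is an assumption made for illustration: the target curve sin(3x), the noise level, the sample sizes, and the degrees 1, 3, and 9. The fit uses the normal equations, which is adequate for a toy problem but not a numerically robust choice for high degrees.

```python
import math
import random

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (toy-scale only)."""
    m = degree + 1
    # Augmented matrix [A^T A | A^T y], where A[i][j] = xs[i] ** j.
    aug = [[sum(x ** (r + c) for x in xs) for c in range(m)]
           + [sum(y * x ** r for x, y in zip(xs, ys))] for r in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[r][m] / aug[r][r] for r in range(m)]

def predict(coeffs, x):
    return sum(c * x ** j for j, c in enumerate(coeffs))

def mse(coeffs, xs, ys):
    return sum((predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def make_data(rng, n, noise_sd=0.2):
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [math.sin(3.0 * x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = make_data(rng, 20)
test_x, test_y = make_data(rng, 200)

results = {}  # degree -> (train MSE, test MSE)
for degree in (1, 3, 9):
    coeffs = fit_polynomial(train_x, train_y, degree)
    results[degree] = (mse(coeffs, train_x, train_y), mse(coeffs, test_x, test_y))
    print(f"degree {degree}: train MSE={results[degree][0]:.3f}  "
          f"test MSE={results[degree][1]:.3f}")
```

Training error can only fall as the degree rises, because each higher-degree model contains the lower-degree ones. Test error is the number to watch: the straight line should miss the curve on both sets, while the highest degree typically shows a widening gap between its training and test error.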

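Underfitting is more likely when the model is too simple for the relationship it must capture, when regularization is too strong, or when training stops before the model has learned the signal.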

Regularization Techniques

Regularization adds constraints or penalties to the learning process to reduce overfitting. The key insight is that simpler models generalize better. Regularization explicitly favors simplicity.

L1 Regularization (Lasso)

L1 adds the sum of absolute values of weights to the loss function: Loss = Original_Loss + lambda * sum(|w|). L1 drives some weights exactly to zero, effectively performing automatic feature selection. It produces sparse models where irrelevant features are eliminated. Use L1 when you suspect many features are irrelevant.
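One way to see why L1 produces exact zeros is the one-dimensional case. For the loss 0.5 * (w - z)^2 + lambda * |w|, where z is the unregularized weight, the minimizer is the soft-thresholding function sketched below. The example values are made up for illustration.

```python
def soft_threshold(z, lam):
    """Minimizer over w of 0.5 * (w - z)**2 + lam * abs(w)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # any |z| <= lam is snapped exactly to zero

# Hypothetical unregularized weights, one per feature.
unregularized = [2.5, -0.3, 0.05, -1.2, 0.4]
lam = 0.5
sparse = [soft_threshold(z, lam) for z in unregularized]
print(sparse)
```

Weights whose magnitude is at most lambda become exactly zero, and the rest shrink toward zero by lambda. That thresholding behavior is the source of L1's sparsity.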

L2 Regularization (Ridge)

L2 adds the sum of squared weights to the loss: Loss = Original_Loss + lambda * sum(w^2). L2 penalizes large weights but does not drive them to zero. It distributes the weight values more evenly, preventing any single feature from dominating. L2 is the default regularization for most tasks. With plain stochastic gradient descent it is equivalent to weight decay in neural networks; with adaptive optimizers such as Adam the two are not the same, which is why decoupled weight decay (AdamW) exists.
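The link between the L2 penalty and weight decay can be verified for a single plain gradient-descent step. This is a minimal sketch with made-up numbers; the function names are illustrative, and the equivalence shown applies to plain SGD only.

```python
def sgd_step_l2(w, grad, lr, lam):
    # The gradient of lam * w**2 is 2 * lam * w; add it to the data gradient.
    return w - lr * (grad + 2.0 * lam * w)

def sgd_step_weight_decay(w, grad, lr, decay):
    # Shrink the weight multiplicatively, then take the ordinary gradient step.
    return (1.0 - lr * decay) * w - lr * grad

w, grad, lr, lam = 0.8, 0.1, 0.05, 0.01
with_penalty = sgd_step_l2(w, grad, lr, lam)
with_decay = sgd_step_weight_decay(w, grad, lr, decay=2.0 * lam)
print(with_penalty, with_decay)
```

The two updates agree when the decay coefficient equals 2 * lambda, since w - lr * (g + 2 * lambda * w) rearranges to (1 - 2 * lr * lambda) * w - lr * g.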

Elastic Net

Elastic Net combines L1 and L2 regularization: Loss = Original_Loss + lambda1 * sum(|w|) + lambda2 * sum(w^2). This gets the feature selection benefit of L1 with the stability of L2. Particularly useful when features are correlated.
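The combined penalty is simple to compute directly. The sketch below uses the formula as written above; the weight vector and the lambda values are arbitrary examples.

```python
def elastic_net_penalty(weights, lam1, lam2):
    """lam1 * sum(|w|) + lam2 * sum(w^2), added to the original loss."""
    l1_term = sum(abs(w) for w in weights)
    l2_term = sum(w * w for w in weights)
    return lam1 * l1_term + lam2 * l2_term

weights = [0.5, -2.0, 0.0, 1.5]
penalty = elastic_net_penalty(weights, lam1=0.1, lam2=0.01)
lasso_only = elastic_net_penalty(weights, lam1=0.1, lam2=0.0)
ridge_only = elastic_net_penalty(weights, lam1=0.0, lam2=0.01)
print(penalty, lasso_only, ridge_only)
```

Setting lambda2 to zero recovers the L1 penalty and setting lambda1 to zero recovers the L2 penalty, so the two coefficients control how much of each behavior the model gets.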

Dropout

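Dropout randomly sets a fraction of neuron activations to zero during each training step. This prevents neurons from co-adapting and forces the network to learn redundant representations. At test time, all neurons are active. To keep the expected activation the same in both modes, the original formulation scales test-time outputs by the keep probability (1 minus the dropout rate), while most modern frameworks use inverted dropout, which scales the surviving activations up by 1 / (1 - rate) during training and leaves test-time outputs unchanged. Typical dropout rates are 0.2-0.5. Dropout is one of the most effective regularizers for neural networks.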
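A minimal sketch of dropout on a list of activations follows. It assumes the "inverted" formulation, which rescales surviving activations during training so that nothing needs rescaling at test time; the function name and signature are illustrative, not taken from any library.

```python
import random

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of values and rescale the rest."""
    if not training or rate == 0.0:
        return list(activations)  # test time: every neuron is active, unscaled
    keep = 1.0 - rate
    # Each activation survives with probability `keep` and is scaled by 1 / keep.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
acts = [1.0] * 10000
train_out = dropout(acts, rate=0.5, training=True, rng=rng)
test_out = dropout(acts, rate=0.5, training=False, rng=rng)
print(sum(train_out) / len(train_out), sum(test_out) / len(test_out))
```

The rescaling keeps the average activation roughly equal between training and test mode, which is the point of the scaling step in either formulation.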

Data Augmentation

Instead of constraining the model, data augmentation increases the effective training set size by creating modified copies of existing data. For images: rotation, flipping, cropping, color jitter. For text: synonym replacement, back-translation. For tabular data: SMOTE for class imbalance, mixup for interpolated examples. More data is almost always the best cure for overfitting.
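As one concrete case, mixup builds a new example as a convex combination of two existing ones. The sketch below assumes feature vectors as plain lists and one-hot labels, with the mixing weight drawn from a Beta(alpha, alpha) distribution; the sample rows are invented for illustration.

```python
import random

def mixup(x_a, y_a, x_b, y_b, alpha, rng):
    """Blend two (features, one-hot label) pairs with a Beta-distributed weight."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam

rng = random.Random(0)
# Hypothetical rows: two features each, with one-hot labels for two classes.
x_new, y_new, lam = mixup([1.0, 4.0], [1.0, 0.0], [3.0, 0.0], [0.0, 1.0],
                          alpha=0.4, rng=rng)
print(x_new, y_new, lam)
```

The blended label is soft rather than one-hot, so the loss function used with mixup has to accept probability targets.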

Cross-Validation

Cross-validation provides a robust estimate of model performance by using all data for both training and validation. The most common variant is k-fold cross-validation:

  1. Split the data into k equal folds (typically k=5 or k=10).
  2. For each fold: train the model on the other k-1 folds, evaluate on the held-out fold.
  3. Average the k evaluation scores to get the final performance estimate.
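The three steps above can be sketched without any ML library. The index splitter is generic; the "model" is a hypothetical stand-in (predict the training mean) so that the example stays self-contained, and the data values are made up.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Striding gives k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, val in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

def mean_baseline_mse(ys, train, val):
    # Stand-in for "train a model on `train`, then evaluate it on `val`".
    mean = sum(ys[i] for i in train) / len(train)
    return sum((ys[i] - mean) ** 2 for i in val) / len(val)

ys = [1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 1.0, 1.4, 0.7, 1.1]
splits = list(k_fold_indices(len(ys), k=5))
scores = [mean_baseline_mse(ys, train, val) for train, val in splits]
cv_score = sum(scores) / len(scores)
print(f"5-fold CV MSE: {cv_score:.4f}")
```

Shuffling before splitting is a design choice that suits independent samples; it should be skipped for ordered data such as time series.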

Cross-validation is more reliable than a single train-test split, especially on small datasets. It ensures every data point is used for both training and validation. Stratified k-fold preserves class proportions in each fold, which is important for imbalanced datasets.

Leave-one-out cross-validation (LOOCV) uses k = n (one fold per data point). It is computationally expensive, since it requires n model fits, and gives a nearly unbiased estimate of generalization error, although that estimate can itself have high variance. Time-series cross-validation respects temporal ordering by always training on past data and validating on future data.
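Time-series cross-validation can be sketched as an expanding window over row indices. The function name, its parameters, and the fixed validation size are assumptions for illustration; rows are assumed to be sorted by time.

```python
def expanding_window_splits(n, n_splits, val_size):
    """Yield (train, val) index lists where training always precedes validation."""
    for i in range(n_splits):
        val_end = n - (n_splits - 1 - i) * val_size
        val_start = val_end - val_size
        if val_start <= 0:
            raise ValueError("not enough rows for the requested splits")
        yield list(range(0, val_start)), list(range(val_start, val_end))

splits = list(expanding_window_splits(n=10, n_splits=3, val_size=2))
for train, val in splits:
    print("train:", train, "val:", val)
```

Each successive split trains on a longer history and validates on the block that immediately follows it, so no future row ever leaks into training.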

Early Stopping

Early stopping monitors validation performance during training and halts when it begins to degrade. The procedure is straightforward: after each epoch, evaluate the model on a validation set. If validation loss has not improved for a specified number of epochs (the patience parameter), stop training and restore the weights from the best epoch.
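The patience rule can be sketched as a loop over per-epoch validation losses. The loss values are invented, and a real training loop would also save and restore model weights at the best epoch, which this sketch only tracks by index.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (best_epoch, stopped_at); stopped_at is None if training ran out."""
    best_loss = float("inf")
    best_epoch = None
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: this is the epoch whose weights would be kept.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, None

val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.56, 0.60]
best_epoch, stopped_at = early_stopping_epoch(val_losses, patience=3)
print(f"best epoch={best_epoch}, stopped at epoch={stopped_at}")
```

With these values the best loss occurs at epoch 3 (zero-based), and training halts at epoch 6 after three epochs without improvement.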

Early stopping acts as an implicit regularizer. As training progresses, the model first learns the true signal (both training and validation loss decrease) and then begins to memorize noise (training loss continues to decrease but validation loss increases). By restoring the weights from the point where validation loss was lowest, early stopping keeps the learned signal without the memorized noise.

Typical patience values range from 5 to 20 epochs. Set patience too low and you risk stopping prematurely (underfitting). Set it too high and you waste computation. A common pattern is to combine early stopping with a learning rate reducer: first reduce the learning rate when validation loss plateaus, then stop if it still does not improve.

Practical Decision Framework

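When diagnosing a model, start with these questions:

  1. Is training error high? If so, the model is likely underfitting: increase model capacity or reduce regularization.
  2. Is training error low but validation error much higher? If so, the model is likely overfitting: add regularization, collect or augment data, or stop training earlier.
  3. Are both errors low and close together? If so, the fit is reasonable, and the remaining error may be close to the irreducible noise floor.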
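A rough version of this diagnosis can be written as a small function. The thresholds are assumptions that depend entirely on the task: `target_error` is whatever training error counts as acceptable, and `gap_tolerance` is how large a train-validation gap is tolerated.

```python
def diagnose(train_error, val_error, target_error, gap_tolerance):
    """Classify a fit from its training and validation error rates."""
    if train_error > target_error:
        return "underfitting"  # the model cannot even fit the training data
    if val_error - train_error > gap_tolerance:
        return "overfitting"   # fits training data but does not generalize
    return "good fit"

# 99% training accuracy with 60% validation accuracy, expressed as error rates.
print(diagnose(0.01, 0.40, target_error=0.05, gap_tolerance=0.05))
print(diagnose(0.30, 0.32, target_error=0.05, gap_tolerance=0.05))
print(diagnose(0.03, 0.05, target_error=0.05, gap_tolerance=0.05))
```

Checking training error first matters: a model with high training error and a small gap is underfitting, even though its two scores look consistent.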

Frequently Asked Questions

What is the difference between overfitting and underfitting?

Overfitting occurs when a model learns noise rather than the underlying pattern, resulting in excellent training performance but poor generalization. Underfitting occurs when a model is too simple to capture the pattern, resulting in poor performance on both training and test data. The goal is the sweet spot between the two.

What is the bias-variance tradeoff?

The bias-variance tradeoff is the fundamental tension in ML. Bias is error from overly simplistic assumptions (underfitting). Variance is error from sensitivity to training data fluctuations (overfitting). Total error equals bias squared plus variance plus irreducible noise. Reducing bias typically increases variance and vice versa.

How do I detect overfitting?

Compare training and validation performance. If training loss is much lower than validation loss, or training accuracy is much higher than validation accuracy, the model is overfitting. Learning curves that diverge between training and validation sets are a clear signal.

What regularization techniques prevent overfitting?

Common techniques include L1 (Lasso), which drives some weights to zero; L2 (Ridge), which penalizes large weights; dropout, which randomly deactivates neurons during training; early stopping, which halts training once validation loss stops improving; and data augmentation, which creates modified copies of existing training examples.

What is cross-validation and when should I use it?

Cross-validation splits data into K folds, trains on K-1 folds and validates on the remaining one, rotating through all folds. K-fold CV (typically K=5 or K=10) provides a more reliable estimate of generalization than a single train-test split, especially on small datasets.

About the Author

Michael Lip builds open-source ML tools and developer utilities at zovo.one. ml0x is part of the Zovo Tools network, a collection of free, privacy-first tools for developers and data scientists. No tracking, no accounts required, no data leaves your browser.

Last updated: May 25, 2026