Loss Function Explorer

Compare MSE, MAE, Huber, Cross-Entropy, Hinge, Focal, Log-Cosh, and Quantile loss on interactive canvas plots. Adjust Huber delta, Focal gamma, and Quantile tau with sliders to see how parameter choices reshape each curve. Drag the probe slider to read off loss values at any prediction point. All computation runs locally in your browser.

Interactive plot (not reproduced here). x-axis: residual error (prediction − target); y-axis: loss value. Slider defaults: Huber δ (transition point) 1.00, Focal γ (focusing parameter) 2.0, Quantile τ (quantile level) 0.50, x-axis range ±4.0.

Why Loss Functions Matter

The loss function (also called cost function or objective function) is the mathematical expression that quantifies how wrong a model's predictions are. It is the single number that gradient descent minimizes. The choice of loss function is not just a technical detail — it encodes your assumptions about the data distribution, your tolerance for outliers, and what kind of errors matter most for your application. A poorly chosen loss function can cause a model to converge to a solution that is technically optimal by the metric but practically useless.

Every loss function makes implicit statistical assumptions. Minimizing MSE is equivalent to maximum likelihood estimation under a Gaussian noise model with fixed variance. Minimizing MAE corresponds to a Laplace noise model. Cross-entropy is the negative log-likelihood for a Bernoulli or categorical distribution. Understanding these connections helps you choose the right loss for your data-generating process and explains why certain losses are more robust to specific types of noise or outliers.
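The link between MSE and the Gaussian assumption can be written out in a few lines. As a sketch, assume targets y_i = f(x_i) + ε_i with independent noise ε_i ~ N(0, σ²) and a fixed σ; the symbols f, σ, and n are notation introduced here for illustration:

```latex
-\log p(y \mid x)
  = \sum_{i=1}^{n} \left[ \frac{\bigl(y_i - f(x_i)\bigr)^2}{2\sigma^2} + \frac{1}{2}\log\bigl(2\pi\sigma^2\bigr) \right]
  = \frac{n}{2\sigma^2}\,\mathrm{MSE} + \frac{n}{2}\log\bigl(2\pi\sigma^2\bigr),
\qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2 .
```

Because σ is held constant, the second term does not depend on f, so minimizing the negative log-likelihood and minimizing MSE select the same f. Repeating the argument with a Laplace density yields a sum of absolute residuals, which is MAE up to scale.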

Regression Losses

Mean Squared Error (MSE) squares the residual, heavily penalizing large errors. It is differentiable everywhere and corresponds to a Gaussian noise assumption. The gradient is proportional to the residual, which makes it straightforward to optimize but sensitive to outliers: a single large error can dominate the total loss and pull the model far from the true solution.
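A minimal sketch of MSE and its gradient in plain Python may make the outlier effect concrete. The residual values are hypothetical, and the function names are chosen here for illustration:

```python
def mse(residuals):
    """Mean of squared residuals."""
    return sum(r * r for r in residuals) / len(residuals)


def mse_grad(residuals):
    """Gradient of MSE with respect to each prediction: 2 * r / n."""
    n = len(residuals)
    return [2.0 * r / n for r in residuals]


# Hypothetical residuals; the last one is an outlier.
residuals = [0.5, -0.5, 1.0, 10.0]

# Fraction of the total squared error that comes from the outlier alone.
outlier_share = (10.0 ** 2) / sum(r * r for r in residuals)

print(mse(residuals))
print(outlier_share)
```

In this example one of four points accounts for more than 98% of the total squared error, and its gradient entry is correspondingly the largest.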

Mean Absolute Error (MAE) takes the absolute value of the residual. It is robust to outliers because large errors are not squared, but it has a non-differentiable kink at zero and a constant gradient magnitude everywhere (subgradient is ±1), which can cause oscillation near the optimum and slow convergence with standard gradient descent.
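One way to see the robustness of MAE is to fit a single constant prediction to data containing an outlier. The sketch below uses a simple grid search over hypothetical data (not gradient descent, and not the tool's own code): the MAE fit lands on the median, while the MSE fit is dragged to the mean.

```python
def mae(residuals):
    """Mean of absolute residuals."""
    return sum(abs(r) for r in residuals) / len(residuals)


def mse(residuals):
    """Mean of squared residuals, for comparison."""
    return sum(r * r for r in residuals) / len(residuals)


def mae_subgrad(r):
    """Subgradient of |r|: the sign of r, with 0 chosen at the kink."""
    return (r > 0) - (r < 0)


def best_constant(loss, targets, candidates):
    """Return the candidate constant prediction with the lowest loss."""
    return min(candidates, key=lambda c: loss([c - t for t in targets]))


# Hypothetical targets with one outlier.
targets = [1.0, 2.0, 3.0, 4.0, 100.0]
candidates = [i / 10 for i in range(0, 1001)]  # grid from 0.0 to 100.0

mae_fit = best_constant(mae, targets, candidates)  # the median
mse_fit = best_constant(mse, targets, candidates)  # the mean

print(mae_fit, mse_fit)
```

The subgradient helper shows why step sizes matter for MAE: its magnitude is 1 however close the prediction is to the target, so a fixed learning rate keeps overshooting near the optimum.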

Huber Loss combines the strengths of both: it is quadratic (like MSE) for residuals within delta of zero and linear (like MAE) beyond that. This makes it robust to outliers while retaining smooth, well-scaled gradients near zero. The delta parameter controls the transition point. Setting delta very small makes the loss behave like a scaled MAE; setting it very large makes it behave like MSE. Huber loss is a common choice in robust regression.
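The piecewise definition can be sketched as follows. This uses the standard textbook form with a factor of 1/2 on the quadratic branch, which is an assumption about the convention and may differ from the plot's scaling:

```python
def huber(r, delta=1.0):
    """Huber loss for a single residual r.

    Quadratic for |r| <= delta, linear beyond it. The linear branch is
    shifted so the two pieces meet with matching value and slope.
    """
    a = abs(r)
    if a <= delta:
        return 0.5 * r * r
    return delta * (a - 0.5 * delta)


print(huber(0.5))              # inside the quadratic region
print(huber(3.0))              # in the linear region
print(huber(3.0, delta=10.0))  # large delta: still quadratic at r = 3
```

At r = delta both branches give 0.5 * delta², so the curve has no jump or kink at the transition.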

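Log-Cosh Loss is log(cosh(residual)). It behaves like half the squared error for small residuals and like the absolute error minus a constant (log 2) for large residuals, and it is twice differentiable everywhere, an advantage over Huber loss for second-order optimization methods. It is less sensitive to outliers than MSE. A naive log(cosh(x)) can overflow for large |x|, so implementations typically use an algebraically equivalent stable form.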
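A possible way to evaluate log-cosh without overflow uses the identity log(cosh(a)) = a + log(1 + exp(−2a)) − log 2 for a = |r|. The sketch below is one such formulation, not necessarily the one this tool uses:

```python
import math


def log_cosh(r):
    """Numerically stable log(cosh(r)).

    cosh(a) = exp(a) * (1 + exp(-2a)) / 2, so
    log(cosh(a)) = a + log1p(exp(-2a)) - log(2).
    exp(-2a) is at most 1 for a >= 0, so it cannot overflow.
    """
    a = abs(r)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)


print(log_cosh(0.01))    # close to 0.5 * 0.01**2
print(log_cosh(1000.0))  # close to 1000 - log(2); math.cosh(1000.0) would overflow
```

The two printed values illustrate the two regimes: quadratic near zero and linear with an offset of log 2 far from zero.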

Quantile Loss enables quantile regression: instead of predicting the conditional mean, you predict a specific quantile of the target distribution. Tau=0.5 gives the median (the loss is then half the absolute error, so it has the same minimizer as MAE). Tau=0.9 penalizes underestimates more heavily, producing a 90th percentile prediction. This is essential for prediction intervals and applications where asymmetric error costs matter (e.g., overestimating demand is cheaper than underestimating it).
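The asymmetry is easiest to see in code. This sketch writes the quantile (pinball) loss in terms of the residual r = prediction − target, matching the plot's x-axis; many references use the opposite sign convention, so the branch order here is an assumption tied to that choice:

```python
def quantile_loss(r, tau=0.5):
    """Pinball loss for a residual r = prediction - target."""
    if r < 0:
        # Underestimate (prediction below target): weighted by tau.
        return tau * (-r)
    # Overestimate (prediction at or above target): weighted by 1 - tau.
    return (1.0 - tau) * r


# With tau = 0.9, missing low by 2 costs about nine times more than missing high by 2.
print(quantile_loss(-2.0, tau=0.9))
print(quantile_loss(2.0, tau=0.9))
```

With tau = 0.5 both branches have weight 0.5, so the loss is half the absolute residual.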

Classification Losses

Binary Cross-Entropy is the negative log-likelihood for binary classification. It is derived from the Bernoulli distribution and is the canonical loss for sigmoid output layers. Confident wrong predictions, where the model assigns high probability to the wrong class, receive the largest penalties. Unlike MSE applied to probability outputs, cross-entropy provides strong gradients even when the model is confidently wrong, making training more stable.
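A minimal sketch of binary cross-entropy for a single prediction follows. The clipping constant eps is an assumption added here to keep log(0) out of the computation; its value is illustrative:

```python
import math


def binary_cross_entropy(p, y, eps=1e-12):
    """Negative Bernoulli log-likelihood for probability p and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clip so log() never sees 0
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


# True label is 1 in every case; the loss grows sharply as p moves toward 0.
for p in (0.9, 0.5, 0.1, 0.01):
    print(p, binary_cross_entropy(p, 1))
```

The loss is unbounded as p approaches the wrong extreme, which is what "larger penalties for confident wrong predictions" means in practice.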

Hinge Loss is used for SVMs and max-margin classifiers. It penalizes predictions that are on the wrong side of the margin (a score below 1 for the positive class). Once a prediction is correct by a sufficient margin (score ≥ 1), the loss is zero, so the model does not try to push correct predictions further. This means only the examples on or inside the margin (the support vectors) influence the solution, and it gives a different inductive bias than probabilistic classifiers.
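As a sketch, hinge loss for one example can be written with labels encoded as −1 and +1 (the usual SVM convention, assumed here) and a raw, unbounded classifier score:

```python
def hinge(score, label):
    """Hinge loss for a raw score and a label in {-1, +1}."""
    return max(0.0, 1.0 - label * score)


print(hinge(2.0, +1))   # correct with margin: zero loss
print(hinge(0.5, +1))   # correct but inside the margin: small loss
print(hinge(-1.0, +1))  # wrong side: loss grows linearly
```

Any example with label * score ≥ 1 contributes nothing to the loss or its gradient.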

Focal Loss was introduced by Lin et al. (2017) for object detection to address extreme class imbalance. It adds a modulating factor (1 - p_t)^gamma that down-weights easy examples. When gamma=0, focal loss equals binary cross-entropy. As gamma increases, easy correctly classified examples contribute less to the total loss, forcing the model to focus on hard, misclassified examples. It is now widely used beyond detection for any task with severe class imbalance.
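The modulating factor can be sketched directly. This version omits the optional alpha class weight, and the eps clipping is an illustrative assumption:

```python
import math


def focal_loss(p, y, gamma=2.0, eps=1e-12):
    """Focal loss without alpha weighting, for probability p and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# gamma = 0 recovers plain cross-entropy.
print(focal_loss(0.3, 1, gamma=0.0), -math.log(0.3))

# An easy example (p_t = 0.9) is scaled by (1 - 0.9)^2, about 0.01, at gamma = 2.
print(focal_loss(0.9, 1, gamma=2.0) / focal_loss(0.9, 1, gamma=0.0))
```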

Choosing the Right Loss Function

For regression with clean data: start with MSE. With significant outliers: use Huber or MAE. When you need prediction intervals or asymmetric error costs: use quantile loss. For binary classification: use binary cross-entropy. For multi-class classification: use categorical cross-entropy. For imbalanced classification: try focal loss. For SVMs or large-margin classifiers: hinge loss. When you need second-order optimization: prefer smooth losses like Log-Cosh over MAE.

Always validate your choice empirically. Run experiments with two or three candidate loss functions on a held-out validation set and choose based on the metric that matters for your application, not the training loss itself. If F1 is your deployment metric and a model trained with MSE achieves a better F1 score than one trained with cross-entropy, the MSE-trained model is the correct choice.

Frequently Asked Questions

Why is MSE more sensitive to outliers than MAE?

MSE squares the residual error, so an outlier with error=10 contributes 100 to the loss while an inlier with error=1 contributes only 1. The ratio is 100:1. With MAE, the same outlier contributes 10 vs 1 for the inlier — a ratio of 10:1. Squaring amplifies large errors quadratically, causing the optimizer to disproportionately focus on fitting outliers. In datasets with heavy-tailed noise or label errors, this can severely degrade model quality.

What does the Huber delta parameter control?

The Huber delta parameter sets the boundary between the MSE regime and the MAE regime. For residuals smaller than delta in absolute value, Huber loss behaves quadratically (like MSE). For residuals larger than delta, it behaves linearly (like MAE). A smaller delta makes the loss more MAE-like and robust to outliers. A larger delta makes it more MSE-like and sensitive to all errors. The right value depends on the scale of the residuals; typical values range from 0.5 to 2.0 when targets are roughly unit-scale, and delta should be tuned on a validation set.
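Another way to read delta is through the gradient: the derivative of Huber loss is the residual clipped to the range [−delta, delta]. The sketch below assumes the standard form with a factor of 1/2 on the quadratic branch and checks the clipped form against a finite difference:

```python
def huber(r, delta=1.0):
    """Huber loss: quadratic within delta of zero, linear beyond."""
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)


def huber_grad(r, delta=1.0):
    """Derivative of Huber loss: the residual clipped to [-delta, delta]."""
    return max(-delta, min(delta, r))


def finite_diff(f, r, h=1e-6):
    """Central finite-difference approximation of f'(r)."""
    return (f(r + h) - f(r - h)) / (2.0 * h)


for r in (0.3, 5.0, -5.0):
    print(r, huber_grad(r), finite_diff(huber, r))
```

Seen this way, delta is the largest pull any single example can exert on the model, which is why smaller values give more robustness to outliers.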

When should I use Focal loss over standard cross-entropy?

Use focal loss when you have significant class imbalance — particularly when the negative class vastly outnumbers the positive class, as in object detection (many background regions, few objects). Standard cross-entropy can be dominated by easy negative examples, causing the model to predict the majority class. Focal loss down-weights those easy examples via the (1-p_t)^gamma factor, forcing the model to focus on hard, informative examples. A good starting point is gamma=2, alpha=0.25.
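The effect on an imbalanced batch can be illustrated with made-up numbers. The sketch assumes the alpha-balanced form, where positives are weighted by alpha and negatives by 1 − alpha; the batch (1000 easy negatives, 10 hard positives) and the predicted probabilities are hypothetical:

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def alpha_focal(p, y, gamma=2.0, alpha=0.25):
    """Alpha-balanced focal loss: alpha_t * (1 - p_t)^gamma * cross-entropy."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return alpha_t * (1.0 - p_t) ** gamma * cross_entropy(p, y)


# Hypothetical batch: 1000 easy negatives (p = 0.05) and 10 hard positives (p = 0.2).
n_easy, n_hard = 1000, 10

ce_easy = n_easy * cross_entropy(0.05, 0)
ce_hard = n_hard * cross_entropy(0.2, 1)
fl_easy = n_easy * alpha_focal(0.05, 0)
fl_hard = n_hard * alpha_focal(0.2, 1)

print("cross-entropy:", ce_easy, ce_hard)
print("focal:        ", fl_easy, fl_hard)
```

Under cross-entropy the easy negatives contribute most of the batch loss; under focal loss with these settings the ten hard positives do.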

How does quantile loss enable prediction intervals?

By training two models with quantile loss at tau=0.1 and tau=0.9, you get predictions for the 10th and 90th percentiles of the target distribution. The region between them forms an 80% prediction interval. Unlike confidence intervals for the mean, these intervals account for the full spread of the target distribution at any given input. This is more informative than point predictions for decision-making under uncertainty, especially when the variance of the target is input-dependent (heteroscedastic noise).
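A stripped-down sketch of the idea fits two constant predictions instead of two full models, using a grid search over hypothetical targets 1 to 99. With no input features the fitted values are simply the empirical 10th and 90th percentiles, but the mechanism (minimize pinball loss at each tau) is the same one a quantile regression model would use:

```python
def pinball(pred, target, tau):
    """Pinball loss with residual r = pred - target."""
    r = pred - target
    return tau * (-r) if r < 0 else (1.0 - tau) * r


def fit_constant_quantile(targets, tau, candidates):
    """Pick the constant prediction that minimizes total pinball loss."""
    return min(candidates, key=lambda c: sum(pinball(c, t, tau) for t in targets))


# Hypothetical targets and a candidate grid from 0.0 to 100.0 in steps of 0.5.
targets = list(range(1, 100))
candidates = [c / 2 for c in range(0, 201)]

lower = fit_constant_quantile(targets, 0.1, candidates)
upper = fit_constant_quantile(targets, 0.9, candidates)

# Fraction of targets that fall inside [lower, upper].
coverage = sum(lower <= t <= upper for t in targets) / len(targets)
print(lower, upper, coverage)
```

Coverage comes out close to 80% on the data used for fitting; on new data it should be checked against a held-out set.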

Why does cross-entropy work better than MSE for classification?

When using MSE on sigmoid/softmax outputs for classification, the gradient becomes very small when the model is confidently wrong (because sigmoid saturates near 0 or 1). This creates a vanishing gradient problem that slows learning. Cross-entropy avoids this: its gradient with respect to the pre-activation logit simplifies to (prediction - target), which remains large even when the model is confidently wrong. Cross-entropy is also the statistically motivated loss for probabilistic classifiers under maximum likelihood estimation.
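The two gradients can be compared numerically. This sketch assumes a single sigmoid output, a squared-error loss written as 0.5 * (p − y)², and a hypothetical logit of −6 for an example whose true label is 1:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_mse_wrt_logit(z, y):
    """d/dz of 0.5 * (sigmoid(z) - y)^2, which equals (p - y) * p * (1 - p)."""
    p = sigmoid(z)
    return (p - y) * p * (1.0 - p)


def grad_ce_wrt_logit(z, y):
    """d/dz of binary cross-entropy on sigmoid(z), which equals p - y."""
    return sigmoid(z) - y


# Confidently wrong: the model outputs p close to 0.0025 for a positive example.
z, y = -6.0, 1
print(grad_mse_wrt_logit(z, y))  # tiny, because p * (1 - p) is near zero
print(grad_ce_wrt_logit(z, y))   # close to -1
```

The extra p * (1 − p) factor in the MSE gradient is the sigmoid's derivative, and it is what shrinks the gradient when the output saturates.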

About the Author

Michael Lip builds open-source ML tools and developer utilities at zovo.one. ml0x is part of the Zovo Tools network, a collection of free, privacy-first tools for developers and data scientists. No tracking, no accounts required, no data leaves your browser.

Last updated: May 28, 2026