ML Metrics Explained

Choosing the right evaluation metric is as important as choosing the right algorithm. The wrong metric can lead you to optimize for the wrong objective, producing a model that looks good on paper but fails in production. The sections below group metrics by task type and give each metric's formula and guidance on when to use it.

Why Metric Choice Matters

The evaluation metric defines what "good" means for your model. It is the objective function that drives model selection, hyperparameter tuning, and deployment decisions. Different metrics can lead to fundamentally different models because they prioritize different aspects of performance. A model optimized for accuracy will behave differently from one optimized for recall, even when trained on the same data.

Consider spam detection. If you optimize for accuracy on a dataset where 98% of emails are legitimate, a model that never flags anything as spam achieves 98% accuracy. But it catches zero spam. If you optimize for recall on the spam class, you catch more spam but might flag legitimate emails. If you optimize for precision, you reduce false alarms but might miss spam. The right metric depends on the business cost of each error type: how much does a missed spam cost vs. a legitimate email sent to the spam folder?
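The spam scenario above can be reproduced in a few lines. This is a minimal sketch on a made-up inbox of 1,000 emails (the counts and variable names are illustrative assumptions, not data from a real system):

```python
# Hypothetical inbox: 1,000 emails, 2% spam (label 1), 98% legitimate (label 0).
y_true = [1] * 20 + [0] * 980

# A "model" that never flags anything as spam.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the spam class: spam caught / all spam.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
spam_recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}, spam recall = {spam_recall:.2f}")
# prints: accuracy = 0.98, spam recall = 0.00
```

The high accuracy comes entirely from the majority class, which is why the choice of metric has to follow the cost of each error type.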

This decision should involve domain experts and stakeholders, not just data scientists. The metric should reflect the real-world impact of model predictions. In healthcare, missing a disease (false negative) may be far worse than a false alarm (false positive). In criminal justice, a false positive (wrongly flagging someone) may be worse than a false negative. These asymmetric costs must be encoded into the evaluation metric.

Binary Classification Metrics

Accuracy is the simplest metric: the fraction of correct predictions. Use it when classes are balanced and all errors are equally costly. Its formula is (TP + TN) / (TP + TN + FP + FN). Accuracy fails on imbalanced datasets because a naive model can achieve high accuracy by predicting the majority class.

Precision (Positive Predictive Value) measures the fraction of positive predictions that are correct: TP / (TP + FP). High precision means when the model says "positive," it is usually right. Use precision when false positives are costly, such as in recommender systems (users trust recommendations) or legal document review (flagged documents need human review).

Recall (Sensitivity, True Positive Rate) measures the fraction of actual positives correctly identified: TP / (TP + FN). High recall means the model catches most positive cases. Use recall when false negatives are costly, such as in medical screening (do not miss a disease), fraud detection (do not miss fraudulent transactions), or safety systems.

F1 Score is the harmonic mean of precision and recall: 2 * P * R / (P + R). It balances both metrics and penalizes extreme tradeoffs. Use F1 when you need a single number that accounts for both precision and recall, especially on imbalanced datasets. The F-beta score generalizes this with a parameter beta that controls the relative weight: F-beta = (1 + beta^2) * P * R / (beta^2 * P + R). F2 weights recall twice as much as precision; F0.5 weights precision twice as much.
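To make the formulas above concrete, here is a small sketch that computes accuracy, precision, recall, and F-beta from confusion-matrix counts. The function name `classification_metrics`, the example counts, and the choice to return 0.0 when a denominator is empty are assumptions made for illustration.

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Accuracy, precision, recall, and F-beta from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumed convention: return 0.0 when there are no predicted or no actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_beta": f_beta}

# Hypothetical counts: 80 TP, 20 FP, 40 FN, 860 TN.
for beta in (0.5, 1.0, 2.0):
    m = classification_metrics(tp=80, fp=20, fn=40, tn=860, beta=beta)
    print(f"beta={beta}: precision={m['precision']:.3f} "
          f"recall={m['recall']:.3f} F-beta={m['f_beta']:.3f}")
```

Because precision (0.8) is higher than recall (about 0.667) in this example, F0.5 comes out above F1 and F2 below it, which matches the weighting described above.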

AUC-ROC (Area Under the ROC Curve) measures the probability that the model ranks a random positive example higher than a random negative example. It is threshold-independent, making it useful for comparing models regardless of the operating threshold. AUC of 0.5 is random; 1.0 is perfect. Use it when you want to evaluate discriminative ability across all possible thresholds.
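The ranking interpretation of AUC-ROC can be checked directly by counting correctly ordered (positive, negative) pairs. The sketch below assumes ties count as half a correct pair and uses a quadratic loop for clarity; library implementations typically use a sort-based method instead. The function name and scores are illustrative.

```python
def auc_roc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. The O(n_pos * n_neg) loop is for
    illustration only.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
print(auc_roc(y_true, scores))  # 8 of 9 pairs are ordered correctly
```

No threshold appears anywhere in the computation, which is what makes the metric threshold-independent.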

AUC-PR (Area Under the Precision-Recall Curve) is more informative than AUC-ROC on highly imbalanced datasets. It focuses on the positive class and is less influenced by the large number of true negatives that can inflate AUC-ROC.

MCC (Matthews Correlation Coefficient) uses all four confusion matrix values and returns a balanced measure from -1 to +1: (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)). It is often recommended as a reliable single metric for binary classification because it stays informative on imbalanced datasets, where accuracy and F1 can be misleading.
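As a sketch of how MCC behaves, the following computes it from confusion-matrix counts using the standard formula (TP * TN - FP * FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)). Returning 0.0 when the denominator is zero is an assumed convention, and the counts are made up.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: 0.0 when a row or column of the matrix is empty.
    return numerator / denominator if denominator else 0.0

# Always-negative model on a 2%-positive dataset: 98% accuracy, MCC of 0.
print(mcc(tp=0, fp=0, fn=20, tn=980))

# A model that finds most positives with few false alarms.
print(mcc(tp=80, fp=20, fn=40, tn=860))
```

The first call shows why MCC is useful on imbalanced data: a model that only predicts the majority class scores no better than chance.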

Log Loss (Binary Cross-Entropy) evaluates the quality of predicted probabilities, not just class labels. It heavily penalizes confident wrong predictions. Use it when calibrated probability estimates matter, such as in medical risk scores or insurance pricing.
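The penalty on confident mistakes is easiest to see numerically. This minimal sketch implements mean binary cross-entropy in plain Python; the clipping constant `eps` and the example probabilities are assumptions chosen for illustration.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; y_prob holds the predicted P(y = 1)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]
cautious = [0.7, 0.3, 0.7, 0.3]           # all correct, moderately confident
overconfident = [0.99, 0.01, 0.99, 0.99]  # last prediction is confidently wrong

print(log_loss(y_true, cautious))
print(log_loss(y_true, overconfident))
```

The single confident error in the second set of predictions outweighs the three near-perfect ones, so its log loss is several times higher than that of the cautious predictions.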

Multi-Class Classification Metrics

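Multi-class metrics extend binary metrics to problems with more than two classes by computing a per-class score (treating each class in turn as the positive class) and then averaging. Three averaging strategies exist: macro averaging takes the unweighted mean of the per-class scores, micro averaging pools the TP, FP, and FN counts across classes before computing the metric, and weighted averaging takes the mean of the per-class scores weighted by each class's number of true examples.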
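As a rough sketch of how macro, micro, and weighted averaging differ, the following computes F1 three ways on a small imbalanced three-class example. The helper names and the toy labels are assumptions; F1 is computed as 2TP / (2TP + FP + FN), which is algebraically the same as the harmonic mean of precision and recall.

```python
from collections import Counter

def per_class_counts(y_true, y_pred, labels):
    """One-vs-rest TP/FP/FN counts for each class."""
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[p]["fp"] += 1  # predicted class gains a false positive
            counts[t]["fn"] += 1  # true class gains a false negative
    return counts

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def averaged_f1(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    counts = per_class_counts(y_true, y_pred, labels)
    support = Counter(y_true)
    per_class = {c: f1(**counts[c]) for c in labels}
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    micro = f1(
        sum(v["tp"] for v in counts.values()),
        sum(v["fp"] for v in counts.values()),
        sum(v["fn"] for v in counts.values()),
    )
    return {"macro": macro, "micro": micro, "weighted": weighted}

# Class "a" dominates; the single "b" example is misclassified as "a".
y_true = ["a"] * 8 + ["b"] + ["c"]
y_pred = ["a"] * 8 + ["a"] + ["c"]
print(averaged_f1(y_true, y_pred))
```

Macro averaging is pulled down the most by the missed minority class because every class counts equally, while micro averaging is dominated by the majority class.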

Cohen's Kappa measures agreement between predictions and labels, adjusted for chance agreement: (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. Kappa of 1.0 indicates perfect agreement; 0 indicates agreement equal to chance. It is more informative than accuracy for imbalanced multi-class problems.
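A minimal sketch of Cohen's Kappa, using the usual definition (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected from the label and prediction frequencies alone. The function name, the toy data, and the handling of the degenerate case are assumptions.

```python
from collections import Counter

def cohens_kappa(y_true, y_pred):
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_freq = Counter(y_true)
    pred_freq = Counter(y_pred)
    # Chance agreement: probability both pick the same class independently.
    expected = sum(true_freq[c] * pred_freq[c] for c in true_freq) / (n * n)
    if expected == 1.0:
        return 0.0  # assumed convention when only one class ever occurs
    return (observed - expected) / (1 - expected)

# Always predicting the majority class: 90% accuracy, but kappa of 0.
y_true = ["a"] * 9 + ["b"]
y_pred = ["a"] * 10
print(cohens_kappa(y_true, y_pred))
```

Here the 90% accuracy is exactly what chance agreement would produce given the class frequencies, so kappa reports no agreement beyond chance.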

Top-K Accuracy considers a prediction correct if the true class is among the model's top K predictions. It is used in image classification (ImageNet results are commonly reported with top-5 accuracy) and in recommendation systems where users see multiple suggestions.
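Top-K accuracy can be sketched as follows, assuming each example comes with one score per class and labels are class indices. The function name and the score table are illustrative.

```python
def top_k_accuracy(y_true, class_scores, k):
    """y_true: class indices; class_scores: one list of per-class scores per example."""
    hits = 0
    for label, scores in zip(y_true, class_scores):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(y_true)

y_true = [0, 1, 2, 2]
class_scores = [
    [0.6, 0.3, 0.1],  # true class ranked first
    [0.5, 0.4, 0.1],  # true class ranked second
    [0.2, 0.5, 0.3],  # true class ranked second
    [0.5, 0.4, 0.1],  # true class ranked last
]
for k in (1, 2, 3):
    print(k, top_k_accuracy(y_true, class_scores, k))
```

With K equal to the number of classes the metric is always 1.0, so K should be small relative to the label space for the number to be meaningful.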

Regression Metrics

MSE (Mean Squared Error) is the average of squared prediction errors: mean((y - y_hat)^2). Squaring penalizes large errors disproportionately. Use MSE when large errors are particularly costly. The gradient of MSE is smooth, making it well-suited for gradient-based optimization.

RMSE (Root Mean Squared Error) is sqrt(MSE), bringing the error back to the original units of the target variable. It is more interpretable than MSE because you can compare it directly to the target values. For example, an RMSE of $5,000 on house prices means typical prediction errors are on the order of $5,000, with large misses weighted more heavily than small ones.

MAE (Mean Absolute Error) is the average of absolute prediction errors: mean(|y - y_hat|). It weights each error in proportion to its size rather than its square, so it is more robust to outliers than MSE/RMSE. Use MAE when outliers should not dominate the evaluation. The median absolute error is even more robust.
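The difference between MSE, RMSE, and MAE shows up when the same total error is spread evenly versus concentrated in one prediction. This sketch uses made-up targets and predictions; the function names are illustrative.

```python
import math

def mse(y, y_hat):
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    return math.sqrt(mse(y, y_hat))

def mae(y, y_hat):
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

y_true = [100, 200, 300, 400]
even_errors = [110, 190, 310, 390]  # every prediction is off by 10
one_outlier = [100, 200, 300, 440]  # a single prediction is off by 40

for name, preds in (("even", even_errors), ("outlier", one_outlier)):
    print(name, "MSE:", mse(y_true, preds),
          "RMSE:", rmse(y_true, preds), "MAE:", mae(y_true, preds))
```

Both sets of predictions have an MAE of 10, but the RMSE doubles from 10 to 20 when the error is concentrated in one large miss.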

R-squared (Coefficient of Determination) measures the proportion of variance explained by the model: 1 - SS_res / SS_tot. R-squared of 1.0 means the model explains all variance; 0 means it is no better than predicting the mean. It can be negative if the model performs worse than the mean prediction. Adjusted R-squared penalizes for the number of features to prevent overfitting in high-dimensional settings.
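The three reference points for R-squared (1, 0, and negative) can be reproduced with a short sketch. The adjusted version uses the common formula 1 - (1 - R^2)(n - 1) / (n - p - 1), where n is the number of samples and p the number of features; the function names and data are assumptions, and the code does not handle a constant target (SS_tot of zero).

```python
def r_squared(y, y_hat):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # residual sum of squares
    ss_tot = sum((a - mean_y) ** 2 for a in y)            # total sum of squares
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_features):
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

y = [1.0, 2.0, 3.0, 4.0, 5.0]
print(r_squared(y, y))                          # perfect predictions
print(r_squared(y, [3.0] * 5))                  # always predicting the mean
print(r_squared(y, [5.0, 4.0, 3.0, 2.0, 1.0]))  # worse than the mean
print(adjusted_r_squared(0.9, n_samples=11, n_features=5))
```

The reversed predictions give a negative value because their squared error is larger than that of the constant mean prediction.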

MAPE (Mean Absolute Percentage Error) expresses errors as percentages: mean(|y - y_hat| / |y|) * 100. It is useful when relative error matters more than absolute error (a 10% error on a $100 item vs. a $10,000 item). MAPE is undefined when any actual value is zero. Symmetric MAPE (sMAPE) addresses some of MAPE's asymmetry issues.
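A short sketch of MAPE and one common sMAPE variant. Several sMAPE definitions are in use; the one assumed here is mean(2 * |y - y_hat| / (|y| + |y_hat|)) * 100. Function names and numbers are illustrative.

```python
def mape(y, y_hat):
    if any(a == 0 for a in y):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100 * sum(abs(a - b) / abs(a) for a, b in zip(y, y_hat)) / len(y)

def smape(y, y_hat):
    # Assumed variant: absolute error over the mean of |actual| and |predicted|.
    # Not defined when an actual value and its prediction are both zero.
    return 100 * sum(2 * abs(a - b) / (abs(a) + abs(b))
                     for a, b in zip(y, y_hat)) / len(y)

# Same relative error on a cheap and an expensive item.
print(mape([100, 10000], [110, 9000]))

# Same 50-unit miss, but MAPE depends on which value is the actual.
print(mape([100], [150]), mape([150], [100]))
print(smape([100], [150]), smape([150], [100]))
```

In the last two lines, MAPE scores the same 50-unit miss as 50% when the actual is 100 and about 33.3% when the actual is 150, while this sMAPE variant gives 40% in both cases.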

Ranking Metrics

NDCG (Normalized Discounted Cumulative Gain) measures ranking quality with position-aware weighting. Items ranked higher contribute more to the score. DCG is the sum over positions i (starting at 1) of rel_i / log2(i + 1), and NDCG divides DCG by the ideal DCG (the DCG of a perfect ranking). NDCG@K evaluates only the top K positions. It is a standard metric for search engines and recommendation systems.
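The DCG formula above translates almost directly into code. This sketch assumes the input list contains the graded relevance of every judged item in ranked order, so the ideal ranking is just that list sorted in descending order; the function names and relevance grades are illustrative.

```python
import math

def dcg_at_k(relevances, k):
    # Positions are 1-indexed: position 1 has discount log2(2) = 1.
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    # Assumed convention: 0.0 when no item has positive relevance.
    return dcg_at_k(relevances, k) / ideal if ideal else 0.0

ranked_relevances = [3, 2, 3, 0, 1, 2]  # relevance grade of each result, in ranked order
print(ndcg_at_k(ranked_relevances, k=6))
print(ndcg_at_k(ranked_relevances, k=3))
```

A ranking that is already sorted by relevance scores exactly 1.0, and moving a highly relevant item further down the list lowers the score.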

MAP (Mean Average Precision) is the mean of average precision scores across all queries. AP for a single query is the average of precision values at each position where a relevant item appears. MAP rewards models that rank all relevant items highly, not just the first one. It is common in information retrieval.
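Average precision and MAP can be sketched as below for binary relevance. The sketch assumes every relevant item for a query appears somewhere in its ranked list (otherwise AP should be divided by the total number of relevant items); names and data are illustrative.

```python
def average_precision(relevant_flags):
    """relevant_flags: 0/1 relevance of each result, in ranked order."""
    hits = 0
    precisions = []
    for position, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / position)  # precision at this position
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries):
    return sum(average_precision(q) for q in queries) / len(queries)

queries = [
    [1, 0, 1, 0],  # relevant items at positions 1 and 3: AP = (1/1 + 2/3) / 2
    [0, 1, 0, 0],  # one relevant item at position 2:     AP = 1/2
]
print(mean_average_precision(queries))
```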

MRR (Mean Reciprocal Rank) is the average of 1/rank for the first relevant item across queries. If the first relevant result is at position 3, the reciprocal rank is 1/3. MRR focuses on where the first correct answer appears, making it suitable for question-answering and navigational search.
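A minimal MRR sketch for binary relevance lists. Scoring a query with no relevant result as 0 is an assumed convention, and the example queries are made up.

```python
def reciprocal_rank(relevant_flags):
    """1 / position of the first relevant result, or 0.0 if there is none."""
    for position, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            return 1.0 / position
    return 0.0  # assumed convention for queries with no relevant result

def mean_reciprocal_rank(queries):
    return sum(reciprocal_rank(q) for q in queries) / len(queries)

queries = [
    [0, 0, 1, 0],  # first relevant result at position 3: RR = 1/3
    [1, 0, 0],     # first relevant result at position 1: RR = 1
    [0, 0, 0],     # nothing relevant:                    RR = 0
]
print(mean_reciprocal_rank(queries))
```

Anything ranked after the first relevant result has no effect on the score, which is the main difference from MAP.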

Precision@K and Recall@K measure the fraction of the top K results that are relevant, and the fraction of all relevant items that appear in the top K, respectively. These are simple and intuitive but do not account for the ordering within the top K.
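Precision@K and Recall@K reduce to set membership checks on the top K results. In this sketch the document IDs and function names are hypothetical, and Recall@K is defined as 0.0 when there are no relevant items.

```python
def precision_at_k(ranked_items, relevant_items, k):
    top_k = ranked_items[:k]
    return sum(item in relevant_items for item in top_k) / k

def recall_at_k(ranked_items, relevant_items, k):
    if not relevant_items:
        return 0.0  # assumed convention when nothing is relevant
    top_k = ranked_items[:k]
    return sum(item in relevant_items for item in top_k) / len(relevant_items)

ranked = ["d3", "d7", "d1", "d9", "d4"]  # system output, best first
relevant = {"d1", "d3", "d5", "d8"}      # ground-truth relevant set

print(precision_at_k(ranked, relevant, k=3))  # 2 of the top 3 are relevant
print(recall_at_k(ranked, relevant, k=3))     # 2 of the 4 relevant items retrieved

# Reordering within the top K does not change either metric.
reordered = ["d7", "d1", "d3", "d9", "d4"]
print(precision_at_k(reordered, relevant, k=3))
```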

Practical Guidelines for Metric Selection

Frequently Asked Questions

How do I choose the right evaluation metric for my ML model?

Start with the task type: classification, regression, or ranking. For classification, use accuracy if classes are balanced, F1 if they are imbalanced, and AUC-ROC for threshold-independent comparison. For regression, use RMSE if large errors matter and MAE for robustness to outliers. For ranking, use NDCG or MAP. Your metric should reflect the real-world cost of different error types in your application.

What is the difference between MSE, RMSE, and MAE?

MSE averages squared errors, penalizing large errors heavily. RMSE is sqrt(MSE), bringing the error back to the original units. MAE averages absolute errors, weighting each error in proportion to its size, which makes it more robust to outliers. Use RMSE when large errors are particularly bad; use MAE when you want outlier robustness.

What is AUC-ROC and when should I use it?

AUC-ROC measures a classifier's ability to distinguish between classes across all thresholds. It ranges from 0 to 1: 0.5 corresponds to random ranking, 1.0 to perfect ranking, and values below 0.5 mean the model ranks worse than random. Use it for threshold-independent comparison. For heavily imbalanced data, prefer AUC-PR instead.

What is R-squared and can it be negative?

R-squared measures the proportion of variance explained by the model: 1.0 is perfect, 0.0 means no better than predicting the mean. Yes, it can be negative when the model performs worse than predicting the mean, which signals that the model fits the evaluation data poorly.

What metrics should I use for ranking and recommendation systems?

Use NDCG for position-weighted relevance scoring, MAP for binary relevance across all queries, MRR for first-relevant-item focus, and Precision@K/Recall@K for top-K evaluation. Also consider diversity, novelty, and coverage alongside accuracy metrics.

About the Author

Michael Lip builds open-source ML tools and developer utilities at zovo.one. ml0x is part of the Zovo Tools network, a collection of free, privacy-first tools for developers and data scientists. No tracking, no accounts required, no data leaves your browser.

Last updated: May 25, 2026