Confusion Matrix Calculator
Enter your classification results to compute precision, recall, F1 score, accuracy, specificity, and Matthews Correlation Coefficient. Visualize the confusion matrix as an SVG heatmap and plot your model's operating point in ROC space. Supports binary and multi-class (up to 5 classes). All computation runs locally in your browser.
Understanding the Confusion Matrix
A confusion matrix is the foundational tool for evaluating classification models. Unlike single-number metrics such as accuracy, the confusion matrix reveals the complete picture of how a model's predictions relate to actual outcomes. For binary classification, the matrix is a 2x2 table with four cells that capture every possible outcome of a prediction.
True Positives (TP) are cases where the model correctly predicts the positive class. The patient has the disease and the model says "positive." True Negatives (TN) are cases where the model correctly predicts the negative class. The patient is healthy and the model says "negative." False Positives (FP), also called Type I errors or false alarms, occur when the model incorrectly predicts positive. The patient is healthy but the model says "positive." False Negatives (FN), also called Type II errors or misses, occur when the model incorrectly predicts negative. The patient has the disease but the model says "negative."
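The four cells can be tallied directly from paired actual and predicted labels. The following is a minimal sketch, not the calculator's own code: the function name `confusion_counts`, the `positive` parameter, and the sample labels are all assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for paired binary labels.

    `positive` marks which label value counts as the positive class
    (an illustrative parameter, not part of the original tool).
    """
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative (Type I error)
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive (Type II error)
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn


# Made-up labels: 1 = has the disease, 0 = healthy.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)
```

These four counts are the values the binary input form expects.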
The relative cost of FP vs FN depends entirely on the application. In cancer screening, false negatives (missing a cancer diagnosis) are far more dangerous than false positives (unnecessary follow-up tests). In spam filtering, false positives (blocking legitimate emails) may be more costly than false negatives (letting spam through). Understanding these trade-offs is essential for choosing the right threshold and the right evaluation metric.
Precision, Recall, and the F1 Score
Precision (also called Positive Predictive Value) is defined as TP / (TP + FP). It answers the question: "Of all the items the model predicted as positive, what fraction are actually positive?" High precision means the model rarely produces false alarms. Precision is critical in applications like search engines (users expect top results to be relevant) and legal document review (flagged documents should actually be relevant).
Recall (also called Sensitivity or True Positive Rate) is defined as TP / (TP + FN). It answers the question: "Of all the items that are actually positive, what fraction did the model successfully identify?" High recall means the model rarely misses positive cases. Recall is critical in medical diagnosis (you don't want to miss a disease), fraud detection (you don't want to miss fraudulent transactions), and safety systems (you don't want to miss defective products).
Precision and recall are usually in tension. Lowering the classification threshold (predicting positive more aggressively) can only keep recall the same or raise it, and it typically lowers precision because more false positives get through. Raising the threshold does the opposite. The F1 Score is the harmonic mean of precision and recall: F1 = 2 * (Precision * Recall) / (Precision + Recall). It ranges from 0 to 1, where 1 indicates perfect precision and recall. The harmonic mean penalizes large gaps between precision and recall more than the arithmetic mean would, making F1 a good single-number summary when you need to balance both metrics.
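To make the harmonic-mean behavior concrete, here is a small sketch that computes precision, recall, and F1 from raw counts and compares F1 with the arithmetic mean. The function name and the counts are hypothetical, and returning 0.0 when a ratio is undefined (zero denominator) is one common convention assumed here, not something the text above specifies.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion matrix counts.

    Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# A very conservative model: no false alarms, but it misses 90 of 100 positives.
p, r, f1 = precision_recall_f1(tp=10, fp=0, fn=90)
arithmetic_mean = (p + r) / 2

print(f"precision={p:.2f} recall={r:.2f}")
print(f"arithmetic mean={arithmetic_mean:.2f} F1={f1:.2f}")
```

With precision 1.0 and recall 0.1, the arithmetic mean is 0.55 while F1 is about 0.18, which reflects how strongly the harmonic mean is pulled toward the weaker of the two values.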
Accuracy and Its Limitations
Accuracy is the simplest metric: (TP + TN) / (TP + TN + FP + FN), the fraction of all predictions that are correct. It is intuitive and appropriate when classes are balanced and all errors are equally costly. However, accuracy is misleading on imbalanced datasets. Consider a dataset with 99% negative examples and 1% positive. A model that always predicts "negative" achieves 99% accuracy but catches zero positive cases, making it useless for the actual task.
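The always-negative example can be checked with a few lines of arithmetic. This sketch assumes a hypothetical dataset of 1,000 examples (990 negative, 10 positive) to match the 99% / 1% split described above; the function name is illustrative.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# A model that always predicts "negative" on 990 negatives and 10 positives.
tp, fn = 0, 10   # every positive example is missed
tn, fp = 990, 0  # every negative example is trivially correct

acc = accuracy(tp, tn, fp, fn)
recall = tp / (tp + fn)

print(f"accuracy={acc:.2f} recall={recall:.2f}")
```

The accuracy comes out at 0.99 while recall is 0.0, so the headline number hides the fact that no positive case is ever found.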
Specificity (True Negative Rate) is defined as TN / (TN + FP). It answers: "Of all the items that are actually negative, what fraction did the model correctly identify as negative?" Specificity complements recall: recall focuses on catching positives, specificity focuses on correctly identifying negatives. Together, they define the model's operating point in ROC space.
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is considered by many researchers to be the most informative single metric for binary classification. It is defined as: MCC = (TP * TN - FP * FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). MCC ranges from -1 (complete disagreement) through 0 (no better than random prediction) to +1 (perfect prediction). Its key advantage over accuracy and F1 is that it uses all four quadrants of the confusion matrix, so it is much harder to inflate on highly imbalanced datasets. A high MCC requires the model to perform well on both positive and negative classes. The formula is undefined when any of the four sums in the denominator is zero (for example, when the model never predicts the positive class); a common convention is to report 0 in that case.
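The MCC formula translates almost line for line into code. The sketch below is an illustration under stated assumptions: the function name `mcc` and the counts are made up, and returning 0.0 when the denominator is zero is a convention chosen here, since the formula itself is undefined in that case.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from the four confusion matrix cells."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: an empty row or column is scored as 0.
        return 0.0
    return numerator / denominator


# An always-negative model on 990 negatives and 10 positives:
# accuracy would be 0.99, but MCC gives it no credit.
always_negative = mcc(tp=0, tn=990, fp=0, fn=10)

# A model that does reasonably well on both classes.
reasonable = mcc(tp=90, tn=85, fp=15, fn=10)

print(f"always negative: {always_negative:.3f}")
print(f"reasonable model: {reasonable:.3f}")
```

The second model scores roughly 0.75, while the always-negative model scores 0 despite its high accuracy.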
ROC Space and the Operating Point
The ROC (Receiver Operating Characteristic) space is a 2D plot with False Positive Rate (FPR = FP / (FP + TN) = 1 - Specificity) on the X-axis and True Positive Rate (TPR = Recall = TP / (TP + FN)) on the Y-axis. Each classifier at a specific threshold maps to a single point in this space. The point (0, 1) in the top-left corner represents perfect classification. The diagonal line from (0, 0) to (1, 1) represents random guessing. Points above the diagonal are better than random; points below are worse. By varying the classification threshold, you trace out a ROC curve, and the area under this curve (AUC-ROC) is a threshold-independent measure of model quality.
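To illustrate how varying the threshold traces out points in ROC space, here is a minimal sketch. The helper names `roc_point` and `roc_points`, the scores, and the thresholds are all hypothetical; an example counts as predicted positive when its score is at or above the threshold, which is an assumption of this sketch.

```python
def roc_point(tp, tn, fp, fn):
    """Return (FPR, TPR) for one set of confusion matrix counts."""
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    return fpr, tpr


def roc_points(y_true, scores, thresholds):
    """Compute one (FPR, TPR) point per threshold."""
    points = []
    for threshold in thresholds:
        tp = tn = fp = fn = 0
        for actual, score in zip(y_true, scores):
            predicted = 1 if score >= threshold else 0
            if predicted == 1 and actual == 1:
                tp += 1
            elif predicted == 1 and actual == 0:
                fp += 1
            elif predicted == 0 and actual == 1:
                fn += 1
            else:
                tn += 1
        points.append(roc_point(tp, tn, fp, fn))
    return points


# Made-up labels and model scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores, thresholds=[0.9, 0.5, 0.3, 0.0])

for fpr, tpr in points:
    side = "above" if tpr > fpr else "on"
    print(f"FPR={fpr:.2f} TPR={tpr:.2f} ({side} the diagonal)")
```

As the threshold drops from 0.9 to 0.0, the point moves from (0, 0) toward (1, 1). A single confusion matrix corresponds to just one of these points.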
Multi-Class Confusion Matrices
For problems with more than two classes, the confusion matrix generalizes to an NxN table where N is the number of classes. Row i, column j contains the count of examples that belong to class i but were predicted as class j. The diagonal entries are correct predictions, and off-diagonal entries are specific misclassifications. Per-class precision and recall are computed by treating each class as a one-vs-rest binary problem. Macro-averaged metrics average across classes (treating all classes equally), while weighted averages account for class support (the number of true instances for each class).
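The one-vs-rest computation can be sketched as follows, using the convention stated above (rows are actual classes, columns are predicted classes). The function name and the 3x3 matrix are made up for illustration, and undefined ratios are reported as 0.0 by assumption.

```python
def per_class_metrics(matrix):
    """Return a list of (precision, recall, support), one tuple per class.

    matrix[i][j] is the number of class-i examples predicted as class j.
    """
    n = len(matrix)
    results = []
    for k in range(n):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                        # rest of row k
        fp = sum(matrix[i][k] for i in range(n)) - tp   # rest of column k
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        results.append((precision, recall, tp + fn))
    return results


# Hypothetical 3-class confusion matrix (rows: actual, columns: predicted).
matrix = [
    [8, 2, 0],
    [1, 6, 3],
    [0, 2, 18],
]
metrics = per_class_metrics(matrix)
total_support = sum(support for _, _, support in metrics)

# Macro average: every class counts equally.
macro_recall = sum(recall for _, recall, _ in metrics) / len(metrics)
# Weighted average: each class counts in proportion to its support.
weighted_recall = sum(recall * support for _, recall, support in metrics) / total_support

print(f"macro recall={macro_recall:.3f} weighted recall={weighted_recall:.3f}")
```

Here the largest class also has the highest recall, so the weighted average (0.800) sits above the macro average (about 0.767). Support-weighted recall always equals overall accuracy, because each class's recall times its support is just that class's count of correct predictions.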
Practical Tips for Evaluation
- Always start by looking at the full confusion matrix before reducing to a single metric. The distribution of errors often reveals actionable insights.
- For imbalanced datasets, prefer F1 or MCC over accuracy. AUC-ROC is also useful, though it can look optimistic when positives are very rare.
- Choose your primary metric based on the relative cost of false positives vs false negatives in your specific application.
- Report confidence intervals using bootstrap resampling, not just point estimates.
- Use stratified cross-validation to get reliable metric estimates, especially with small or imbalanced datasets.
- When comparing models, use the same evaluation set and consider statistical significance (paired t-test or McNemar's test).
- For multi-class problems, examine per-class metrics to identify which classes the model struggles with most.
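The bootstrap tip above can be sketched with the standard library alone. This is a percentile bootstrap, one of several bootstrap interval methods; the function names, the default of 1,000 resamples, the fixed seed, and the sample labels are all assumptions made for illustration.

```python
import random


def bootstrap_ci(y_true, y_pred, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a metric on paired labels."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


def accuracy_from_labels(y_true, y_pred):
    """Fraction of predictions that match the actual labels."""
    return sum(a == p for a, p in zip(y_true, y_pred)) / len(y_true)


# Made-up evaluation set: 100 examples, 80 predicted correctly.
y_true = [1] * 50 + [0] * 50
y_pred = [1] * 40 + [0] * 10 + [0] * 40 + [1] * 10

point = accuracy_from_labels(y_true, y_pred)
lower, upper = bootstrap_ci(y_true, y_pred, accuracy_from_labels)
print(f"accuracy={point:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

Any function that takes the two label lists and returns a number can be passed as `metric`, so the same helper works for F1 or MCC. Metrics that can be undefined on a resample (for example, one that happens to contain no positives) need a fallback value.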
Frequently Asked Questions
What is a confusion matrix?
A confusion matrix is a table that summarizes the performance of a classification model by comparing predicted labels against actual labels. For binary classification, it contains four cells: True Positives, False Positives, True Negatives, and False Negatives. It reveals not just overall accuracy but the specific types of errors the model makes.
What is the difference between precision and recall?
Precision measures the fraction of positive predictions that are actually correct: TP / (TP + FP). Recall measures the fraction of actual positives that are correctly identified: TP / (TP + FN). High precision means few false alarms; high recall means few missed positives.
When should I use F1 score vs accuracy?
Use accuracy when classes are balanced and both types of errors are equally costly. Use F1 score when classes are imbalanced or when you need to balance precision and recall. F1 is the harmonic mean of precision and recall, so it penalizes models that sacrifice one for the other.
What is Matthews Correlation Coefficient (MCC)?
MCC is a balanced metric that uses all four confusion matrix values and returns a value between -1 and +1. A value of +1 indicates perfect prediction, 0 indicates performance no better than random prediction, and -1 indicates total disagreement. Many researchers consider MCC the most informative single metric for binary classification because a high score requires good performance on both the positive and the negative class, even when the classes are imbalanced.
How do I interpret the ROC point on the chart?
The ROC space plots True Positive Rate (recall) on the Y-axis against False Positive Rate (1 - specificity) on the X-axis. The ideal point is the top-left corner (0, 1). The diagonal line represents random guessing. The closer your model's point is to the top-left corner, the better its discrimination ability.
About the Author
Michael Lip builds open-source ML tools and developer utilities at zovo.one. ml0x is part of the Zovo Tools network, a collection of free, privacy-first tools for developers and data scientists. No tracking, no accounts required, no data leaves your browser.
Last updated: May 25, 2026