K-Fold Cross Validation Explained
Cross validation is the gold standard for estimating how well a machine learning model generalizes to unseen data. This interactive simulator lets you configure K, dataset size, and stratification, then visualize how data is split across folds and how accuracy varies. Compare different K values to understand the bias-variance tradeoff in model evaluation.
What Is Cross Validation?
Cross validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The basic idea is simple: instead of relying on a single train/test split (which may give a lucky or unlucky estimate), split the data multiple ways, train and evaluate the model on each split, and aggregate the results. This produces a more robust and reliable estimate of model performance on unseen data.
The most common form is K-Fold cross validation. The dataset is randomly partitioned into K subsets of approximately equal size, called folds. The model is trained and evaluated K times. In each iteration, one fold serves as the validation set while the remaining K-1 folds form the training set. The final performance metric is the average of all K evaluation scores. Every data point therefore appears in exactly one validation fold and in K-1 training sets, which makes full use of the available data.
Cross validation serves two primary purposes. First, it provides a reliable estimate of how well a model will perform on unseen data, which is essential for model selection (choosing between different algorithms or architectures). Second, it provides a measure of variance in that estimate (the standard deviation across folds), which tells you how sensitive the model is to the particular training data it receives. A high standard deviation suggests the model is unstable and may benefit from regularization or more data.
How K-Fold Cross Validation Works
The K-Fold CV procedure follows these steps:
- Shuffle the dataset randomly (optional but recommended to break any ordering effects).
- Split into K folds of approximately equal size. If N is not evenly divisible by K, some folds will have one extra sample.
- For each fold i from 1 to K: Set fold i aside as the validation set. Train the model on all other K-1 folds combined. Evaluate the model on fold i and record the score.
- Compute the average and standard deviation of the K scores. Report as: mean ± std.
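The computational cost is roughly K times that of a single train/evaluate cycle. For K=5, you train 5 models; for K=10, you train 10 models. Each model sees slightly different training data (one fold is withheld), so the K scores provide a sample from the distribution of possible model performances. The mean estimates expected performance; the standard deviation describes how much the score varies from fold to fold, which gives a rough indication of the uncertainty in that estimate. Because the training sets overlap, the K scores are not independent, so treat the standard deviation as a guide rather than an exact error bar.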
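The procedure above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names (`kfold_indices`, `cross_validate`) and the toy majority-class model are assumptions made for the example.

```python
import random
import statistics

def kfold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # the first `extra` folds get one more sample
        folds.append(indices[start:start + size])
        start += size
    return folds

def cross_validate(X, y, k, fit, score, seed=0):
    """Run K-Fold CV and return the list of K validation scores."""
    folds = kfold_indices(len(X), k, seed)
    scores = []
    for i, val_idx in enumerate(folds):
        # Training set: every fold except fold i
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        scores.append(score(model, [X[j] for j in val_idx], [y[j] for j in val_idx]))
    return scores

# Toy model (hypothetical): always predict the most common training label.
def fit_majority(X_train, y_train):
    return statistics.mode(y_train)

def accuracy(model, X_val, y_val):
    return sum(1 for label in y_val if label == model) / len(y_val)

X = list(range(23))            # 23 samples, not divisible by 5
y = [0] * 15 + [1] * 8
scores = cross_validate(X, y, k=5, fit=fit_majority, score=accuracy)
print(f"{statistics.mean(scores):.3f} ± {statistics.stdev(scores):.3f}")
```

With N=23 and K=5, three folds hold 5 samples and two hold 4, matching the "one extra sample" rule in the steps above.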
Choosing the Right K
The choice of K involves a fundamental tradeoff. There are two sources of error in the CV estimate: bias (how far the estimate is from the true generalization performance) and variance (how much the estimate fluctuates across different data splits).
Low K (e.g., K=2): Each training set contains only 50% of the data, so the model is trained on significantly less data than the full dataset. This introduces pessimistic bias because the model underperforms relative to one trained on the full dataset. However, the two training sets do not overlap and the validation sets are large, which tends to give lower variance. K=2 is rarely used in practice because the bias is too large.
High K (e.g., K=N, Leave-One-Out): Each training set contains N-1 samples, nearly the full dataset, so the bias is minimal. However, the K training sets overlap almost completely (they differ by only one sample), making the K models highly correlated. This correlation inflates the variance of the mean estimate. LOOCV also has the highest computational cost (N training runs).
K=5 or K=10: Empirical work, notably Ron Kohavi's 1995 study, suggests that K=5 and K=10 offer a good tradeoff between bias and variance for many problems. K=10 has slightly lower bias than K=5, but the difference is usually small. K=5 is often preferred when computation is a concern. Both values are widely used as defaults in machine learning practice and in libraries such as scikit-learn.
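A little arithmetic makes the tradeoff concrete. The sketch below assumes N is divisible by K and uses a hypothetical helper, `cv_profile`, to compute the share of data each model trains on, how much two different training sets overlap, and how many models must be fit.

```python
def cv_profile(n, k):
    """Return (training fraction, pairwise training-set overlap, number of fits)
    for K-Fold CV on n samples, assuming n is divisible by k."""
    fold_size = n // k
    train_size = n - fold_size
    shared = n - 2 * fold_size   # samples present in both of two different training sets
    return train_size / n, shared / train_size, k

n = 1000
for k in (2, 5, 10, n):          # K = n is Leave-One-Out
    frac, overlap, fits = cv_profile(n, k)
    print(f"K={k:>4}: train on {frac:.1%} of data, "
          f"pairwise training overlap {overlap:.1%}, {fits} fits")
```

At K=2 the training sets share nothing and each uses half the data; at K=N they use almost all the data but are nearly identical, which is the correlation described above.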
Stratified K-Fold Cross Validation
Standard K-Fold assigns data points to folds randomly, without considering class labels. In balanced datasets (roughly equal number of samples per class), this works fine because each fold will naturally contain a representative mix of classes. However, with imbalanced datasets, random splitting can create folds where some classes are underrepresented or absent entirely.
Stratified K-Fold solves this by ensuring each fold has approximately the same class distribution as the full dataset. If the dataset has 90% negative and 10% positive samples, each fold will also have approximately 90/10 distribution. This produces more stable and reliable CV estimates, especially for minority class metrics like recall and F1.
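One simple way to get this behavior is to shuffle each class separately and deal its samples round-robin across the folds. The function below, `stratified_kfold`, is an illustrative sketch under that assumption and is not how any particular library implements it.

```python
import random
from collections import Counter, defaultdict

def stratified_kfold(labels, k, seed=0):
    """Assign sample indices to k folds so each fold roughly mirrors the class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal this class round-robin, continuing where the previous class stopped
        for pos, idx in enumerate(members):
            folds[(offset + pos) % k].append(idx)
        offset += len(members)
    return folds

labels = [0] * 90 + [1] * 10     # 90% negative, 10% positive
for fold in stratified_kfold(labels, k=5):
    print(Counter(labels[i] for i in fold))
```

In this example every fold receives 18 negatives and 2 positives, preserving the 90/10 ratio.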
Stratified K-Fold should be the default for classification problems. For regression problems, where there are no discrete classes, you can stratify on binned target values (divide the continuous target into quantiles and treat each quantile as a "class"). In scikit-learn, StratifiedKFold implements stratified splitting for classification, and StratifiedShuffleSplit provides repeated random stratified train/test splits.
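For the regression case, the binning step might look like the sketch below. It assigns bins by rank so they come out near-equal in size; `quantile_bins` and the example prices are hypothetical, and the resulting bin indices would then be passed as the labels to a stratified splitter.

```python
def quantile_bins(targets, n_bins):
    """Map each continuous target to a bin index 0..n_bins-1 by rank,
    so that bins are near-equal in size."""
    order = sorted(range(len(targets)), key=lambda i: targets[i])
    bins = [0] * len(targets)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(targets)
    return bins

prices = [12.0, 250.0, 31.5, 8.2, 99.0, 47.3, 500.0, 18.9]
bins = quantile_bins(prices, n_bins=4)
print(bins)   # → [0, 3, 1, 0, 2, 2, 3, 1]
```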
Other Cross Validation Variants
- Repeated K-Fold: Runs K-Fold multiple times with different random shuffles, then averages all scores. Reduces variance further at the cost of more computation. Common choices: 5x2 CV (5 repeats of 2-fold) or 10x10 CV (10 repeats of 10-fold).
- Leave-One-Out (LOOCV): K equals N (the dataset size). Each iteration trains on N-1 samples and validates on 1. Minimizes bias but maximizes computation and can give a high-variance estimate. Useful mainly for very small datasets (for example, N < 50).
- Leave-P-Out: Generalizes LOOCV by withholding P samples in each iteration. Combinatorially explosive: C(N, P) iterations. Rarely used in practice.
- Group K-Fold: Ensures that data from the same group (e.g., same patient, same user) never appears in both training and validation. Essential when data points within a group are correlated.
- Time Series Split: For temporal data. Uses expanding or sliding windows to ensure the model always trains on past data and validates on future data. Prevents temporal leakage.
- Nested Cross Validation: Uses an inner CV loop for hyperparameter tuning and an outer CV loop for performance estimation. Prevents information leakage from hyperparameter selection into the performance estimate. Essential for rigorous model comparison.
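As an illustration of the Group K-Fold idea from the list above, the sketch below assigns whole groups to folds, placing the largest groups first into whichever fold is currently smallest. The greedy balancing rule, the function name `group_kfold`, and the patient IDs are assumptions made for this example.

```python
from collections import defaultdict

def group_kfold(groups, k):
    """Assign whole groups to k folds, balancing fold sizes greedily (largest group first)."""
    members = defaultdict(list)
    for idx, g in enumerate(groups):
        members[g].append(idx)
    folds = [[] for _ in range(k)]
    for g in sorted(members, key=lambda name: (-len(members[name]), str(name))):
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[g])
    return folds

# Hypothetical patient IDs: several samples per patient
groups = ["p1", "p1", "p1", "p2", "p2", "p3", "p4", "p4", "p5"]
for i, val_idx in enumerate(group_kfold(groups, k=3)):
    val_groups = {groups[j] for j in val_idx}
    train_groups = {g for j, g in enumerate(groups) if j not in val_idx}
    assert not (val_groups & train_groups)   # no patient appears on both sides
    print(i, sorted(val_groups))
```

Because a group is never split, correlated samples from the same patient cannot leak from training into validation.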
Common Mistakes and Best Practices
The most dangerous mistake in cross validation is data leakage, where information from the validation set leaks into the training process. This happens when preprocessing steps (feature scaling, feature selection, imputation) are performed on the entire dataset before splitting. The correct approach is to fit all preprocessing on the training fold only, then apply (transform) to the validation fold. In scikit-learn, use Pipeline objects to ensure this happens automatically.
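The difference between leaky and correct preprocessing is easy to see with feature scaling. The sketch below uses two hypothetical helpers, `fit_scaler` and `transform`, and made-up numbers; it shows only the principle that a Pipeline enforces automatically.

```python
import statistics

def fit_scaler(values):
    """Learn mean and (population) standard deviation from the given values."""
    return statistics.mean(values), statistics.pstdev(values)

def transform(values, scaler):
    mean, std = scaler
    return [(v - mean) / std for v in values]

train = [1.0, 2.0, 3.0, 4.0]
val = [10.0]                     # an extreme value that lands in the validation fold

# Correct: statistics come from the training fold only, then are applied to validation
scaler = fit_scaler(train)
val_scaled = transform(val, scaler)

# Leaky: statistics computed on training and validation data together
leaky_scaler = fit_scaler(train + val)
val_leaky = transform(val, leaky_scaler)

print("train-only scaler:", scaler, "-> validation value", val_scaled)
print("leaky scaler:     ", leaky_scaler, "-> validation value", val_leaky)
```

In the leaky version the validation sample has already shifted the mean and standard deviation, so it looks far less unusual to the model than it would in deployment.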
Another common mistake is using the same CV scores for both model selection and performance reporting. If you try many model configurations and report the best CV score, that score is optimistically biased because you picked the one that happened to be highest. Use nested CV instead: an inner loop for model selection and an outer loop for unbiased performance estimation.
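A compact sketch of nested CV follows. It assumes a toy 1-D k-nearest-neighbour classifier whose only hyperparameter is the number of neighbours; the function names, the synthetic data, and the grid `(1, 3, 5)` are all illustrative choices rather than a recommended setup.

```python
import random
import statistics

def make_folds(indices, k, seed):
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_accuracy(train, val, n_neighbors):
    """Accuracy of a 1-D k-nearest-neighbour majority vote; each item is (x, label)."""
    correct = 0
    for x, label in val:
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:n_neighbors]
        vote = 1 if 2 * sum(lbl for _, lbl in nearest) > n_neighbors else 0
        correct += (vote == label)
    return correct / len(val)

def cv_score(data, indices, k, n_neighbors, seed):
    """Mean K-Fold accuracy using only the samples listed in `indices`."""
    folds = make_folds(indices, k, seed)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(knn_accuracy([data[j] for j in train_idx],
                                   [data[j] for j in val_idx], n_neighbors))
    return statistics.mean(scores)

def nested_cv(data, outer_k=5, inner_k=3, grid=(1, 3, 5), seed=0):
    outer_scores = []
    outer_folds = make_folds(range(len(data)), outer_k, seed)
    for i, val_idx in enumerate(outer_folds):
        train_idx = [j for f, fold in enumerate(outer_folds) if f != i for j in fold]
        # Inner loop: pick the hyperparameter using the outer-training data only
        best = max(grid, key=lambda n: cv_score(data, train_idx, inner_k, n, seed + 1))
        # Outer loop: score the chosen setting on data the selection never saw
        outer_scores.append(knn_accuracy([data[j] for j in train_idx],
                                         [data[j] for j in val_idx], best))
    return outer_scores

# Synthetic two-class data (assumed for illustration)
rng = random.Random(42)
data = ([(rng.gauss(0.0, 1.0), 0) for _ in range(30)]
        + [(rng.gauss(3.0, 1.0), 1) for _ in range(30)])
scores = nested_cv(data)
print(f"{statistics.mean(scores):.3f} ± {statistics.stdev(scores):.3f}")
```

The reported scores come only from the outer validation folds, which play no part in choosing the number of neighbours.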
- Always use stratified K-Fold for classification tasks.
- Use K=5 as a default, K=10 for smaller datasets.
- Put all preprocessing inside the CV loop (use Pipelines).
- Report mean ± std, not just the mean.
- For time series data, use TimeSeriesSplit, never random K-Fold.
- For grouped data (multiple samples per subject), use GroupKFold.
- Use nested CV when simultaneously selecting hyperparameters and estimating performance.
- Set a random seed for reproducibility.
Frequently Asked Questions
What is K-Fold cross validation?
K-Fold cross validation splits the dataset into K folds of approximately equal size. The model is trained K times, each time using K-1 folds for training and the remaining fold for validation. The final performance estimate is the average of all K scores, giving a more reliable estimate than a single train-test split.
How do I choose the value of K?
K=5 or K=10 are the most common choices. K=5 provides a good balance between bias and variance. K=10 gives lower bias but slightly higher variance. For small datasets, use K=10 or LOOCV. For large datasets, K=5 is usually sufficient.
What is the difference between stratified and regular K-Fold?
Regular K-Fold randomly assigns data points to folds without considering class labels. Stratified K-Fold ensures each fold has approximately the same proportion of each class as the full dataset. Stratified K-Fold is the default choice for classification problems, especially with imbalanced data.
Why not just use a simple train/test split?
A single split gives only one estimate, which can be lucky or unlucky. Cross validation provides K estimates, giving you a mean and standard deviation. The standard deviation reveals how sensitive your model is to the training data. Cross validation also makes fuller use of the data: every sample is used for training in K-1 iterations and for validation exactly once.
Can I use cross validation for time series data?
Standard K-Fold should NOT be used for time series data because it violates temporal ordering. Instead, use time series specific methods like expanding window, sliding window, or TimeSeriesSplit in scikit-learn, which preserve temporal order and prevent data leakage.
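An expanding-window split can be sketched in a few lines. The function below assumes the samples are already sorted by time and uses equal-sized validation blocks taken from the end of the series; the name `expanding_window_splits` is hypothetical.

```python
def expanding_window_splits(n, n_splits):
    """Yield (train_indices, val_indices) pairs where training always precedes validation.
    Assumes the n samples are already in chronological order."""
    fold_size = n // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = n - (n_splits - i + 1) * fold_size
        yield list(range(train_end)), list(range(train_end, train_end + fold_size))

for train, val in expanding_window_splits(n=10, n_splits=3):
    print(train, val)
# → [0, 1, 2, 3] [4, 5]
# → [0, 1, 2, 3, 4, 5] [6, 7]
# → [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]
```

Each training window grows to include the previous validation block, and no validation index is ever earlier than a training index.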
About the Author
Michael Lip builds open-source ML tools and developer utilities at zovo.one. ml0x is part of the Zovo Tools network, a collection of free, privacy-first tools for developers and data scientists. No tracking, no accounts required, no data leaves your browser.
Last updated: May 25, 2026