Dataset Splitter Calculator
Configure train, validation, and test split ratios for your machine learning dataset. Preview how stratification preserves class distributions across splits, verify minimum samples per class, and generate production-ready Python code. Supports binary and multi-class datasets with up to 10 classes. All computation runs locally in your browser.
Why Data Splitting Matters
Proper data splitting is one of the most important steps in any machine learning pipeline, yet it is frequently done incorrectly. The fundamental goal is to evaluate your model on data it has never seen during training or any decision-making process. Without a clean separation between training and evaluation data, your reported model performance will be optimistically biased, sometimes dramatically so. Models that appear to achieve 99% accuracy during development may perform at 70% in production because the evaluation was contaminated.
The standard approach divides your dataset into three parts. The training set (typically 60-80% of data) is used to fit model parameters: learning the weights of a neural network, the split points of a decision tree, or the coefficients of a linear model. The validation set (typically 10-20%) is used during development for hyperparameter tuning, model selection, architecture decisions, and early stopping. The test set (typically 10-20%) is held out completely until the very end and used exactly once to estimate the final generalization performance that you report in papers, to stakeholders, or use for deployment decisions.
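The three-way division described above can be sketched with two calls to scikit-learn's train_test_split. This is a minimal illustration on synthetic data, not the code this tool generates; the helper name three_way_split, the 70/15/15 defaults, and the seed are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def three_way_split(X, y, val_frac=0.15, test_frac=0.15, seed=42):
    """Hypothetical helper: stratified train/validation/test split."""
    n = len(y)
    # Integer sizes avoid floating-point surprises in the second split.
    n_val = round(val_frac * n)
    n_test = round(test_frac * n)
    # First carve off the test set, then split the remainder.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=n_test, stratify=y, random_state=seed
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=n_val, stratify=y_rest, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)

# Synthetic data: 1,000 samples, 15% positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([1] * 150 + [0] * 850)

train, val, test = three_way_split(X, y)
for name, (_, labels) in [("train", train), ("val", val), ("test", test)]:
    print(name, len(labels), round(float(labels.mean()), 3))
```

Passing integer sizes rather than fractions is a design choice for the sketch: it makes the resulting split sizes easy to predict and document.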
A common mistake is using only a train/test split and then tuning hyperparameters against the test set. Every time you evaluate on the test set and make a decision (change the learning rate, add a layer, try a different model), you are implicitly fitting to the test set. After dozens of such decisions, the test performance is no longer a reliable estimate of true generalization. The validation set exists specifically to absorb this optimization pressure, keeping the test set pristine.
Understanding Split Ratios
The optimal split ratio depends primarily on your dataset size. For large datasets (100,000+ samples), you can afford to allocate a smaller percentage to validation and test because even a small percentage yields thousands of samples, providing reliable performance estimates. Ratios like 90/5/5 or 95/2.5/2.5 are common in large-scale deep learning. Andrew Ng recommends that in the era of big data, you should have just enough validation and test data to evaluate your model, and put everything else in training.
For medium datasets (1,000-100,000 samples), the classic 70/15/15 or 80/10/10 split works well. This provides enough training data for the model to learn meaningful patterns while keeping enough evaluation data for reliable metrics. For small datasets (under 1,000 samples), a single holdout split may have high variance: different random splits can produce very different performance estimates. In this case, cross-validation is strongly preferred, where you rotate the validation fold across the entire dataset and average the results.
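To see how the ratios above translate into absolute sample counts, here is a small sketch in plain Python. The function name split_sizes and the rule that the training set absorbs any rounding remainder are assumptions made for illustration.

```python
def split_sizes(n_samples, ratios=(0.70, 0.15, 0.15)):
    """Hypothetical helper: integer (train, val, test) sizes summing to n_samples."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    n_val = round(n_samples * ratios[1])
    n_test = round(n_samples * ratios[2])
    # The training set absorbs whatever is left after rounding.
    n_train = n_samples - n_val - n_test
    return n_train, n_val, n_test

examples = [
    (500, (0.70, 0.15, 0.15)),        # small dataset
    (50_000, (0.80, 0.10, 0.10)),     # medium dataset
    (1_000_000, (0.90, 0.05, 0.05)),  # large dataset
]
for n, ratios in examples:
    print(n, ratios, split_sizes(n, ratios))
```

With one million samples, even a 5% test share is 50,000 samples, which is why large datasets can afford small evaluation percentages.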
The minimum test set size depends on how precisely you need to estimate the metric. For binary classification accuracy, you need approximately 384 test samples for a 95% confidence interval of plus or minus 5 percentage points (based on the normal approximation to the binomial, at the worst-case accuracy of 50%). For a tighter interval of plus or minus 1 percentage point, you need approximately 9,604 samples. For rare classes, you need enough test samples of the minority class to compute meaningful precision and recall.
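The sample-size figures above follow from the normal-approximation formula n = z^2 * p * (1 - p) / E^2, where E is the desired half-width of the interval. The sketch below reproduces them; the function name required_test_size is an assumption, and p = 0.5 is the worst case that gives the largest n.

```python
def required_test_size(margin, p=0.5, z=1.96):
    """Approximate test samples needed for a CI half-width of `margin`.

    Normal approximation to the binomial: n = z^2 * p * (1 - p) / margin^2.
    p=0.5 is the worst case; z=1.96 corresponds to 95% confidence.
    """
    return z ** 2 * p * (1 - p) / margin ** 2

n_5pt = required_test_size(0.05)
n_1pt = required_test_size(0.01)
print(round(n_5pt))  # about 384
print(round(n_1pt))  # about 9604
```

If you expect accuracy far from 50%, passing that value as p gives a smaller requirement, but the approximation becomes less reliable as p approaches 0 or 1.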
Stratified Splitting Explained
Stratified splitting ensures that each partition (train, validation, test) has approximately the same class distribution as the original dataset. If your dataset is 85% negative and 15% positive, a stratified split guarantees each partition is approximately 85/15. This is critical for classification tasks because a purely random split can produce partitions with very different class ratios, especially when classes are imbalanced or the dataset is small.
Consider a dataset with 1,000 samples where 50 belong to the positive class (5% prevalence). A random 80/20 split might put anywhere from 2 to 18 positive samples in the test set of 200 (expected: 10). With only 2 positive test samples, precision and recall become meaningless. Stratified splitting guarantees approximately 10 positive samples in the test set, providing a more reliable evaluation. In scikit-learn, train_test_split(X, y, stratify=y) enables stratification with a single parameter.
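The spread of positive counts under a purely random split can be checked directly: the number of positives that land in the test set follows a hypergeometric distribution. The sketch below uses only the standard library; the function name positives_in_test_pmf is an assumption made for illustration.

```python
from math import comb

def positives_in_test_pmf(n_total, n_pos, n_test):
    """P(k positives land in a random test set) for each feasible k."""
    denom = comb(n_total, n_test)
    return {
        k: comb(n_pos, k) * comb(n_total - n_pos, n_test - k) / denom
        for k in range(0, min(n_pos, n_test) + 1)
    }

# The example from the text: 1,000 samples, 50 positives, 200 in the test set.
pmf = positives_in_test_pmf(1000, 50, 200)
expected = sum(k * p for k, p in pmf.items())
p_at_most_5 = sum(p for k, p in pmf.items() if k <= 5)

print(f"expected positives in test: {expected:.2f}")
print(f"P(5 or fewer positives):    {p_at_most_5:.4f}")
```

A stratified split removes this variability by fixing the count at the expected value.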
Beyond class labels, you may need to account for other factors when splitting. In medical imaging, group by patient ID so that all of a patient's images land in the same split; otherwise the same patient can appear in both train and test (data leakage). In multi-site studies, stratify by site to ensure each split represents all sites. In time series, use temporal splits rather than random splits. For regression tasks, you can stratify on binned target values to ensure the range of target values is represented in all splits.
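For the regression case, one possible way to stratify on binned targets is to cut the target into quantile bins and pass the bin labels to train_test_split. The data below is synthetic, and the choice of five quantile bins is an assumption for the sketch rather than a recommendation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # skewed continuous target

# Quantile edges give five bins with roughly equal counts.
edges = np.quantile(y, [0.2, 0.4, 0.6, 0.8])
y_bins = np.digitize(y, edges)  # integer bin label 0..4 for each sample

# Stratify on the bin labels; the model still trains on the continuous y.
X_train, X_test, y_train, y_test, bins_train, bins_test = train_test_split(
    X, y, y_bins, test_size=0.2, stratify=y_bins, random_state=0
)
print(np.bincount(bins_test))
```

The bins are used only to drive the split; they are discarded afterwards.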
Cross-Validation: Beyond Single Splits
K-fold cross-validation divides the dataset into K folds of roughly equal size, trains the model K times (each time using K-1 folds for training and 1 fold for validation), and averages the K performance estimates. The standard choice is K=5 or K=10. Every sample is used for training in K-1 rounds and for validation exactly once, which typically produces lower-variance performance estimates than a single holdout split.
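The fold rotation can be illustrated without any library. The generator below is a teaching sketch (the name k_fold_indices is an assumption); in practice scikit-learn's KFold does the same job.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled K-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(23, k=5))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

Each index appears in exactly one validation fold, so averaging the K scores covers the whole dataset once.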
Stratified K-fold preserves class distributions within each fold. Repeated stratified K-fold runs the procedure multiple times with different random splits and averages across all repetitions, further reducing variance. Leave-one-out cross-validation (LOOCV) uses K=N (each sample is its own fold). It gives a nearly unbiased estimate, but the estimate can have high variance and the compute cost is the highest, since the model is trained N times. LOOCV is mainly appropriate for very small datasets (under 50 samples).
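The three variants map onto scikit-learn splitter classes. The sketch below, on a toy dataset of 40 samples with 10 positives, shows how many model fits each one implies and that stratification spreads the positives evenly; the dataset and parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import (
    LeaveOneOut,
    RepeatedStratifiedKFold,
    StratifiedKFold,
)

X = np.zeros((40, 2))
y = np.array([0] * 30 + [1] * 10)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
loo = LeaveOneOut()

# Number of train/validate rounds each strategy requires.
n_fits = {
    "stratified": skf.get_n_splits(X, y),
    "repeated": rskf.get_n_splits(X, y),
    "leave-one-out": loo.get_n_splits(X),
}
print(n_fits)

# Positives per validation fold under stratified K-fold.
fold_positives = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]
print(fold_positives)
```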
Even when using cross-validation for model selection and hyperparameter tuning, you should still hold out a final test set that is never used in the cross-validation loop. This nested approach (outer split for final evaluation, inner cross-validation for model selection) provides an honest estimate of generalization performance. In scikit-learn, one way to implement this pattern is to hold out a test set with train_test_split and run cross_val_score or GridSearchCV on the remaining data only; fully nested cross-validation wraps the inner search in an outer cross-validation loop.
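A minimal sketch of the outer-holdout, inner-cross-validation pattern follows. The synthetic data, the logistic regression model, and the grid of C values are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Outer split: the test set is set aside before any tuning happens.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Inner loop: 5-fold cross-validation on the development data only.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_dev, y_dev)

# The test set is scored once, with the selected configuration.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```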
Data Leakage: The Silent Killer
Data leakage occurs when information from the evaluation set leaks into the training process, producing artificially inflated performance metrics. It is among the most common and most damaging mistakes in machine learning evaluation, and it can be subtle and hard to detect. Common sources include fitting preprocessing steps on the entire dataset before splitting, allowing duplicate or near-duplicate samples across splits, and using future information in time series predictions.
Preprocessing leakage is the most frequent form. If you standardize features (subtract mean, divide by standard deviation) using the entire dataset, the training set's statistics are contaminated by information from the test set. The correct approach is to fit the scaler on the training set only, then transform both training and test sets using the training set's statistics. In scikit-learn, always use Pipeline to chain preprocessing and modeling steps, ensuring that cross_val_score correctly fits preprocessing within each fold.
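Both halves of that advice can be sketched briefly: fitting the scaler on the training split only, and wrapping the scaler in a pipeline so cross-validation refits it inside every fold. The synthetic data and the choice of logistic regression are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Manual version: statistics come from the training split only.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)  # reuses the training mean and std

# Pipeline version: the scaler is refit on the training part of each fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X_train, y_train, cv=5)
print(scores.round(3))
```

The scaled test features will generally not have exactly zero mean, which is expected: they were standardized with the training statistics.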
Sample leakage occurs when related samples appear in both train and test sets. Medical images from the same patient, text documents with overlapping content, or augmented versions of the same image can all cause leakage. Use group-aware splits (GroupKFold in scikit-learn) to ensure that all samples from the same group (patient, document, session) stay in the same split.
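A group-aware split can be checked directly by confirming that no group ID appears on both sides of any fold. The toy setup below (20 hypothetical patients with 5 samples each) is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
groups = np.repeat(np.arange(20), 5)  # 20 patients, 5 samples each
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

overlaps = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups):
    # Count group IDs that appear in both the train and test side.
    shared = set(groups[train_idx]) & set(groups[test_idx])
    overlaps.append(len(shared))

print(overlaps)
```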
Temporal leakage occurs in time series when future data is used to predict the past. Always split time series chronologically, with training data from earlier periods and test data from later periods. Never shuffle time series data before splitting. Use TimeSeriesSplit in scikit-learn for temporal cross-validation.
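The chronological constraint is easy to verify with TimeSeriesSplit: in every fold, all training indices precede all test indices. The sketch assumes the rows are already sorted from oldest to newest.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(24).reshape(-1, 1)  # rows assumed sorted oldest to newest

folds = list(TimeSeriesSplit(n_splits=4).split(X))
for train_idx, test_idx in folds:
    print(
        f"train {train_idx.min()}..{train_idx.max()}  "
        f"test {test_idx.min()}..{test_idx.max()}"
    )

# True when no fold trains on data that comes after its test window.
no_future_data = all(tr.max() < te.min() for tr, te in folds)
print(no_future_data)
```

The training window grows from fold to fold while the test window moves forward in time.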
Special Considerations for Deep Learning
Deep learning introduces additional splitting considerations. Large models with millions of parameters need more training data, so allocate as much as possible to the training set. The validation set serves double duty: it is used for early stopping (stopping training when validation loss stops decreasing) and for hyperparameter tuning. Some practitioners use separate validation sets for each purpose to avoid overfitting the early stopping decision to the hyperparameter tuning set, though this is rarely necessary in practice.
Data augmentation should be applied only to the training set, never to validation or test sets. If you augment before splitting, augmented versions of the same original sample may appear in both train and test, creating leakage. In practice, augmentation is typically applied on-the-fly during training inside the data loading pipeline (for example, torchvision transforms applied by a PyTorch Dataset that a DataLoader iterates over).
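The ordering matters more than the augmentation itself. The framework-free sketch below splits the original samples first and only then augments the training side; the augment function (small Gaussian noise) and the toy data are hypothetical stand-ins for a real augmentation pipeline.

```python
import random

def augment(sample, rng):
    """Hypothetical augmentation: add small noise to each feature."""
    return [v + rng.gauss(0.0, 0.01) for v in sample]

rng = random.Random(0)
originals = {i: [float(i), float(i) * 2] for i in range(10)}  # id -> features

# Step 1: split the ORIGINAL sample IDs.
ids = list(originals)
rng.shuffle(ids)
train_ids, test_ids = ids[:8], ids[8:]

# Step 2: augment only the training originals.
train_set = [(i, originals[i]) for i in train_ids]
train_set += [(i, augment(originals[i], rng)) for i in train_ids]
test_set = [(i, originals[i]) for i in test_ids]

print(len(train_set), len(test_set))
```

Because augmented copies inherit the ID of their source sample, it is straightforward to confirm that no original contributes to both sets.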
For transfer learning and fine-tuning, the split strategy depends on how much data you have. With very few samples (under 100), freeze the pretrained backbone and only train the classifier head, using cross-validation for evaluation. With moderate data (100-10,000), fine-tune with a standard train/val/test split. With abundant data (10,000+), you can afford to train longer and use smaller validation/test ratios.
Practical Recommendations
- Always use stratified splitting for classification tasks. It costs nothing and prevents many problems.
- For datasets under 1,000 samples, use stratified K-fold cross-validation (K=5 or K=10) instead of a single holdout split.
- Never look at or evaluate on the test set until you have finalized your model. Treat it as sealed.
- Fit all preprocessing (scaling, encoding, imputation) on the training set only. Use sklearn Pipeline to automate this.
- For time series, always split chronologically. Never shuffle before splitting.
- Check for duplicate or near-duplicate samples across splits. Even partial overlap can inflate metrics.
- Report confidence intervals on your test metrics, not just point estimates. Bootstrap resampling of the test set is the simplest approach.
- Document your splitting strategy, random seed, and split sizes. Reproducibility requires exact specification of the data split.
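As an illustration of the bootstrap recommendation in the list above, here is a minimal percentile-bootstrap sketch for test accuracy using only the standard library. The function name bootstrap_accuracy_ci, the 2,000 resamples, and the toy predictions are assumptions for the example.

```python
import random

def bootstrap_accuracy_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy on a fixed test set."""
    rng = random.Random(seed)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    n = len(correct)
    stats = []
    for _ in range(n_boot):
        # Resample the test set with replacement and recompute accuracy.
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(resample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return sum(correct) / n, lower, upper

# Toy test set: 200 samples, 180 predicted correctly.
y_true = [1] * 90 + [0] * 110
y_pred = [1] * 80 + [0] * 10 + [0] * 100 + [1] * 10

acc, lower, upper = bootstrap_accuracy_ci(y_true, y_pred)
print(f"accuracy {acc:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

Fixing the seed keeps the reported interval reproducible, in line with the documentation advice above.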
Frequently Asked Questions
What is the best train/test split ratio?
The most common split is 80/20 for train/test or 70/15/15 for train/validation/test. For large datasets (100K+), use 90/5/5 or 98/1/1. For small datasets (under 1,000), use cross-validation instead of a single holdout split.
What is stratified splitting and when should I use it?
Stratified splitting ensures each split maintains the same class distribution as the original dataset. Always use it for classification tasks, especially with imbalanced classes, to prevent unreliable evaluation from skewed splits.
Why do I need a validation set separate from the test set?
The validation set is used during development for hyperparameter tuning and model selection. The test set is used only once at the end. Using the test set for decisions during development contaminates it and produces optimistically biased performance estimates.
How do I handle time series data splitting?
Time series must be split chronologically, not randomly: training data comes from the earliest period, validation from the middle, and test from the most recent. Use TimeSeriesSplit in sklearn for temporal cross-validation.
What is data leakage and how does splitting prevent it?
Data leakage occurs when information from outside the training set influences model building. Proper splitting with preprocessing fit only on training data prevents leakage. Use sklearn Pipeline with cross_val_score for leak-free workflows.
About the Author
Michael Lip builds open-source ML tools and developer utilities at zovo.one. ml0x is part of the Zovo Tools network, a collection of free, privacy-first tools for developers and data scientists. No tracking, no accounts required, no data leaves your browser.
Last updated: May 25, 2026