Hyperparameter Tuning Guide

Compare grid search, random search, and Bayesian optimization side by side. Use the interactive tuner below to visualize how each strategy explores a 2D parameter space, watch convergence curves in real time, and understand why random search usually beats grid search. All computation runs locally in your browser.

What Are Hyperparameters?

Machine learning models have two types of parameters. Model parameters are learned from data during training: weights in a neural network, split points in a decision tree, or coefficients in linear regression. Hyperparameters are set before training begins and control how the learning algorithm operates. They cannot be estimated from the training data directly and must be specified by the practitioner or found through systematic search.
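The distinction can be made concrete with a minimal sketch. The function name `train_linear` and the toy data are assumptions made for illustration: the weight `w` is a model parameter learned from the data, while `learning_rate` and `n_steps` are hyperparameters fixed before training starts.

```python
def train_linear(xs, ys, learning_rate, n_steps):
    """Fit y ~ w * x by gradient descent on mean squared error.

    learning_rate and n_steps are hyperparameters (chosen before training);
    w is a model parameter (learned from the data).
    """
    w = 0.0
    n = len(xs)
    for _ in range(n_steps):
        # Gradient of mean((w * x - y)^2) with respect to w.
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w


# Toy data generated by y = 2x, so the ideal learned weight is 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w_good = train_linear(xs, ys, learning_rate=0.01, n_steps=200)  # converges toward 2
w_bad = train_linear(xs, ys, learning_rate=0.2, n_steps=200)    # step too large, diverges
```

Same data and same algorithm, but one learning rate recovers the weight and the other diverges. That sensitivity is what tuning is meant to address.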

Common hyperparameters include the learning rate (how fast the model updates its weights), regularization strength (how much the model penalizes complexity), the number of trees in a random forest, the depth of those trees, batch size for stochastic gradient descent, the number of hidden layers and neurons in a neural network, and the kernel type and parameters for support vector machines. The choice of hyperparameters can make the difference between a model that achieves 70% accuracy and one that achieves 95% accuracy on the same dataset.

The fundamental challenge of hyperparameter tuning is that the relationship between hyperparameters and model performance is not known in advance, is usually non-convex, and is often noisy. You cannot compute the gradient of validation accuracy with respect to learning rate in any straightforward way. This makes hyperparameter optimization a black-box optimization problem, and strategies for exploring the search space differ widely in efficiency.

Grid Search: The Exhaustive Approach

Grid search evaluates every combination of hyperparameter values from a predefined set. If you specify 5 values for learning rate and 5 values for regularization, grid search evaluates all 25 combinations. For three parameters with 5 values each, it evaluates 125 combinations. In general, k values for each of d parameters require k^d evaluations. This exponential growth is a form of the curse of dimensionality and is the fundamental limitation of grid search.
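As a rough illustration, grid search fits in a few lines of standard-library Python. The names `grid_search`, `toy_objective`, and `grid_size` are hypothetical, and the objective is a stand-in for a real train-and-validate run, not an actual model.

```python
import itertools
import math


def grid_search(objective, grid):
    """Evaluate every combination in `grid`; return (best_params, best_score, n_evals)."""
    names = list(grid)
    best_params, best_score, n_evals = None, float("-inf"), 0
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(params)
        n_evals += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_evals


def toy_objective(params):
    # Hypothetical stand-in for validation accuracy; peaks at lr=0.01, reg=0.1.
    return -(math.log10(params["lr"]) + 2) ** 2 - (math.log10(params["reg"]) + 1) ** 2


def grid_size(values_per_param, n_params):
    """Number of evaluations a full grid requires."""
    return values_per_param ** n_params


grid_2d = {
    "lr": [1e-4, 1e-3, 1e-2, 1e-1, 1.0],
    "reg": [1e-3, 1e-2, 1e-1, 1.0, 10.0],
}
best_params, best_score, n_evals = grid_search(toy_objective, grid_2d)
```

Here `n_evals` is 25 for the two-parameter grid, and `grid_size(5, 3)` gives the 125 combinations mentioned above.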

The advantage of grid search is its simplicity and completeness within the specified grid. You are guaranteed to find the best combination among the values you specified. In scikit-learn, GridSearchCV implements grid search with cross-validation in a few lines of code. Grid search is appropriate when you have only 1-2 hyperparameters to tune, each with a small number of candidate values, and model training is fast enough that the total compute cost is acceptable.

Beyond poor scalability, the critical weakness of grid search is that it wastes evaluations. Bergstra and Bengio demonstrated in their seminal 2012 paper that for most machine learning problems, only a small subset of hyperparameters significantly affects performance, and which subset matters varies from one dataset to another. Grid search allocates equal resolution to all parameters, including those that do not matter. If learning rate is the only parameter that truly affects your model, a 5x5 grid search tests only 5 distinct learning rate values despite using 25 evaluations.

Random Search: Surprisingly Effective

Random search samples hyperparameter combinations randomly from specified distributions rather than from a fixed grid. For continuous parameters, you define a distribution (uniform, log-uniform, or normal); for categorical parameters, you define a set of choices with optional weights. Each evaluation uses an independently drawn combination.
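A minimal sketch of such a sampler, using only the standard library, is shown below. The `space` format (tuples of distribution kind plus arguments) and the function name `sample_config` are assumptions made for this example and do not correspond to any particular library's API.

```python
import math
import random


def sample_config(space, rng):
    """Draw one configuration; each parameter is sampled independently."""
    config = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "uniform":
            _, low, high = spec
            config[name] = rng.uniform(low, high)
        elif kind == "loguniform":
            _, low, high = spec
            # Sample uniformly in log space, then map back.
            config[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
        elif kind == "choice":
            _, options, weights = spec
            config[name] = rng.choices(options, weights=weights, k=1)[0]
        else:
            raise ValueError(f"unknown distribution: {kind}")
    return config


# Hypothetical search space for illustration.
space = {
    "learning_rate": ("loguniform", 1e-5, 1e-1),
    "dropout": ("uniform", 0.0, 0.5),
    "kernel": ("choice", ["rbf", "linear", "poly"], [0.5, 0.3, 0.2]),
}
rng = random.Random(0)  # fixed seed so the run is reproducible
configs = [sample_config(space, rng) for _ in range(25)]
```

Each entry in `configs` would then be passed to a train-and-validate function, and because the draws are independent, the evaluations can run in any order.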

Random search is usually more efficient than grid search for the same compute budget. The key insight from Bergstra and Bengio (2012) is that random search explores more distinct values of each parameter. With a 5x5 grid, you test only 5 distinct values per parameter. With 25 random samples from continuous ranges, you test 25 distinct values per parameter (fewer only when a parameter is discrete and samples collide). When only one or two parameters matter, every random evaluation tries a new value of those parameters, whereas a grid spends most of its budget repeating the same few values while varying parameters that have little effect.
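The difference in coverage can be checked directly. The sketch below assumes two parameters searched on a log scale from 1e-4 to 1 and counts how many distinct learning-rate values each strategy tries with a budget of 25 evaluations; the ranges and the seed are illustrative choices.

```python
import itertools
import random

# 5x5 grid: 25 evaluations in total.
lr_values = [0.0001, 0.001, 0.01, 0.1, 1.0]
reg_values = [0.0001, 0.001, 0.01, 0.1, 1.0]
grid_points = list(itertools.product(lr_values, reg_values))

# 25 random evaluations, each parameter drawn log-uniformly from the same range.
rng = random.Random(42)
random_points = [
    (10 ** rng.uniform(-4, 0), 10 ** rng.uniform(-4, 0)) for _ in range(25)
]

# Distinct learning-rate values actually tested by each strategy.
grid_unique_lr = len({lr for lr, _ in grid_points})      # 5
random_unique_lr = len({lr for lr, _ in random_points})  # one per sample for continuous draws
```

If the second parameter turns out to be irrelevant, the grid has effectively run a 5-point search over the learning rate, while random search has run a 25-point one.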

A useful rule of thumb follows from simple probability. If each random sample independently has a 5% chance of landing in the best 5% of the search space, then 60 samples include at least one such configuration with probability 1 - 0.95^60, or about 95%. This holds regardless of the number of dimensions, although it says nothing about how good that top 5% region actually is. You can therefore often get good results with far fewer evaluations than a complete grid would require. Random search is also trivially parallelizable: all evaluations are independent and can run on separate machines or GPUs simultaneously.
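The 60-evaluation figure can be reproduced with a short calculation, assuming samples are independent and "top 5%" means the best 5% of the search space by volume. The function names are chosen for this sketch.

```python
import math


def prob_hit_top_fraction(n_samples, top_fraction=0.05):
    """P(at least one of n independent samples lands in the top fraction)."""
    return 1.0 - (1.0 - top_fraction) ** n_samples


def samples_needed(confidence=0.95, top_fraction=0.05):
    """Smallest n for which prob_hit_top_fraction(n) >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - top_fraction))


p60 = prob_hit_top_fraction(60)  # about 0.954
n_min = samples_needed()         # 59
```

Tightening the target raises the cost quickly: `samples_needed(0.95, 0.01)` shows how many samples a top-1% target requires under the same assumptions.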

Bayesian Optimization: The Intelligent Search

Bayesian optimization is a sequential model-based optimization strategy that uses the results of past evaluations to guide future ones. Instead of searching blindly (grid) or randomly, Bayesian optimization builds a probabilistic model of the objective function and uses it to make informed decisions about where to evaluate next.

The process works in three steps repeated iteratively. First, a surrogate model approximates the objective function based on all evaluations so far. Common surrogate models include Gaussian Processes (used in scikit-optimize and GPyOpt), Tree-structured Parzen Estimators (TPE, used in Optuna and Hyperopt), and Random Forests (used in SMAC). Second, an acquisition function determines the most promising point to evaluate next by balancing exploration (sampling uncertain regions) and exploitation (sampling near the current best). Popular acquisition functions include Expected Improvement (EI), Upper Confidence Bound (UCB), and Probability of Improvement (PI). Third, the suggested point is evaluated (the model is trained and validated), and the result is added to the observation history.
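To make the acquisition step concrete, here is a sketch of Expected Improvement and Upper Confidence Bound for a maximization problem. It assumes the surrogate returns a Gaussian predictive mean and standard deviation at each candidate point. The candidate values are invented for illustration, and the surrogate model itself is not implemented here.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best_so_far, xi=0.0):
    """EI for maximization, given the surrogate's mean and std at one point."""
    if sigma <= 0.0:
        return max(mu - best_so_far - xi, 0.0)
    z = (mu - best_so_far - xi) / sigma
    return (mu - best_so_far - xi) * normal_cdf(z) + sigma * normal_pdf(z)


def upper_confidence_bound(mu, sigma, kappa=2.0):
    """UCB for maximization; kappa controls the weight given to exploration."""
    return mu + kappa * sigma


# Hypothetical surrogate predictions (mean, std) at three candidate points.
candidates = {
    "near_best": (0.90, 0.01),  # matches the incumbent, low uncertainty
    "uncertain": (0.85, 0.10),  # lower mean, high uncertainty
    "poor": (0.60, 0.02),       # confidently bad
}
best_so_far = 0.90
scores = {
    name: expected_improvement(mu, sigma, best_so_far)
    for name, (mu, sigma) in candidates.items()
}
next_point = max(scores, key=scores.get)
```

With these numbers, EI prefers the uncertain candidate over the one that merely matches the incumbent, which is the exploration and exploitation trade-off in miniature.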

Bayesian optimization is often far more sample-efficient than grid or random search. On low-dimensional problems it can find near-optimal configurations in tens of evaluations, where grid or random search may need hundreds, although the gain depends on the dimensionality and noise of the objective. This makes it a strong choice when each evaluation is expensive, such as training a deep neural network for hours on a GPU cluster. Tools like Optuna, Ray Tune, Hyperopt, and Weights & Biases Sweeps make Bayesian optimization accessible with minimal code changes.

Choosing the Right Strategy

The choice of hyperparameter tuning strategy depends on four factors: the number of hyperparameters, the cost of each evaluation, the available compute budget, and whether evaluations can be parallelized.

1-2 parameters, cheap evaluations: Grid search. Simple, complete, and fast enough. Use GridSearchCV in scikit-learn.
2-5 parameters, moderate cost: Random search. Better coverage per evaluation, trivially parallel. Use RandomizedSearchCV in scikit-learn.
3+ parameters, expensive evaluations: Bayesian optimization. Maximum sample efficiency. Use Optuna or Ray Tune.
Neural architecture search, massive budget: Population-based training (PBT) or evolutionary strategies. Use Ray Tune's PBT scheduler.

A common practical workflow is to start with random search to identify promising regions of the search space, then switch to Bayesian optimization to fine-tune within those regions. Many practitioners also use successive halving or its extension Hyperband to quickly discard poor configurations by training them for only a few epochs before allocating more resources to promising candidates.

Advanced Techniques

Successive Halving and Hyperband address the problem that most hyperparameter configurations are obviously bad, and you do not need to fully train a model to know it will not perform well. Successive halving starts by training many configurations for a small number of epochs, then promotes the top half and doubles their training budget, repeating until one configuration remains. Hyperband runs multiple rounds of successive halving with different initial budgets to balance early stopping aggressiveness with thorough evaluation.
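Successive halving as described above can be sketched as follows. The `toy_evaluate` learning curve is a hypothetical stand-in for partially training a model, and the sketch re-evaluates survivors from scratch each round rather than resuming training, which a real implementation would usually avoid.

```python
import math
import random


def successive_halving(configs, evaluate, min_budget=1):
    """Keep the top half each round and double the budget until one config remains.

    evaluate(config, budget) returns a score (higher is better).
    Returns (best_config, history), where history lists (n_configs, budget) per round.
    """
    survivors = list(configs)
    budget = min_budget
    history = []
    while len(survivors) > 1:
        history.append((len(survivors), budget))
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[: max(1, len(ranked) // 2)]
        budget *= 2
    return survivors[0], history


def toy_evaluate(lr, budget):
    # Hypothetical learning curve: quality depends on how far log10(lr) is from -2
    # and improves with budget (epochs). Not a real training run.
    quality = 1.0 / (1.0 + (math.log10(lr) + 2.0) ** 2)
    return quality * (1.0 - math.exp(-budget / 4.0))


rng = random.Random(0)
candidates = [10 ** rng.uniform(-5, 0) for _ in range(16)]
best_lr, history = successive_halving(candidates, toy_evaluate)
# history is [(16, 1), (8, 2), (4, 4), (2, 8)]
```

Starting from 16 candidates, only two are ever trained at the largest budget. Hyperband would repeat this procedure with several different starting points for the number of candidates and the initial budget.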

Multi-fidelity optimization combines Bayesian optimization with early stopping. BOHB (Bayesian Optimization and Hyperband) replaces Hyperband's random sampling of configurations with a TPE-style density model, and has shown strong efficiency on many benchmarks. In Ray Tune it is available as the HyperBandForBOHB scheduler paired with the TuneBOHB search algorithm. In favorable cases this approach can reduce hyperparameter search costs by 10-100x compared to standard grid or random search, though the savings depend on how well early performance predicts final performance.

Transfer learning for hyperparameters leverages knowledge from previous tuning runs on related tasks. If you tuned a ResNet on CIFAR-10, the optimal learning rate and weight decay are likely in a similar range for CIFAR-100. Tools like Optuna support warm-starting from previous studies, and research on meta-learning for hyperparameters aims to predict good initial configurations from dataset characteristics alone.

Common Pitfalls

Overfitting the validation set: Extensive hyperparameter search on a fixed validation set can overfit to that specific split. Use nested cross-validation: outer loop for performance estimation, inner loop for hyperparameter selection.
Wrong parameter scales: Learning rate and regularization strength should be searched on a log scale (e.g., 1e-5 to 1e-1), not linear. Searching linearly wastes evaluations in regions that produce nearly identical results.
Too narrow a range: If the best value is at the boundary of your search range, expand it. The optimal configuration should be in the interior of your search space.
Ignoring interactions: Some hyperparameters interact strongly. Learning rate and batch size are coupled: larger batches often require larger learning rates. Search them jointly, not independently.
Not using early stopping: For iterative models (neural networks, gradient boosting), always use early stopping during tuning. It reduces per-evaluation cost and acts as implicit regularization.
Tuning too many parameters at once: Start with the most impactful parameters (learning rate, model size) and fix the rest at reasonable defaults. Add more parameters to the search only after the primary ones are in a good range.
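The scale pitfall is easy to see numerically. The sketch below draws 1000 values from the range 1e-5 to 1e-1 both linearly and on a log scale, then counts how many fall in each decade. The helper name `decade_counts` and the sample size are choices made for this example.

```python
import math
import random


def decade_counts(samples, low_exp=-5, high_exp=-1):
    """Count samples per decade [10^k, 10^(k+1)) for k from low_exp to high_exp - 1."""
    counts = {k: 0 for k in range(low_exp, high_exp)}
    for x in samples:
        # Clamp so that values exactly on a range boundary land in an edge bucket.
        k = min(max(math.floor(math.log10(x)), low_exp), high_exp - 1)
        counts[k] += 1
    return counts


rng = random.Random(0)
n = 1000
linear = [rng.uniform(1e-5, 1e-1) for _ in range(n)]
log_scale = [10 ** rng.uniform(-5, -1) for _ in range(n)]

linear_counts = decade_counts(linear)    # roughly 90% of samples fall in [1e-2, 1e-1)
log_counts = decade_counts(log_scale)    # roughly a quarter of samples per decade
```

Linear sampling spends about nine tenths of the budget on the top decade and almost none below 1e-3, while log-scale sampling gives every order of magnitude roughly equal attention.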

Practical Recipes

For XGBoost / LightGBM: First tune learning_rate (log-uniform 0.01-0.3) and n_estimators (100-2000) with early stopping. Then tune max_depth (3-10), min_child_weight (1-10), and subsample (0.5-1.0). Finally, tune regularization: reg_alpha (1e-5 to 10, log) and reg_lambda (1e-5 to 10, log). This staged approach reduces the effective dimensionality at each step.

For neural networks: Learning rate is the single most important hyperparameter. Use a learning rate finder (start very small, increase exponentially, plot loss) to identify the right order of magnitude. Then tune batch size (often a power of 2 from 16 to 512), architecture (layers, units), dropout (0.0-0.5), and weight decay (1e-6 to 1e-2, log). Use cosine annealing or one-cycle learning rate schedules, which reduce sensitivity to the initial learning rate.
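The two schedules mentioned above reduce to short formulas. This sketch only generates learning-rate values and assumes a separate training loop would apply them. The function names and the specific ranges are illustrative, not taken from any framework.

```python
import math


def lr_finder_schedule(lr_min, lr_max, n_steps):
    """Learning rates increasing exponentially from lr_min to lr_max (n_steps >= 2)."""
    ratio = lr_max / lr_min
    return [lr_min * ratio ** (i / (n_steps - 1)) for i in range(n_steps)]


def cosine_annealing(lr_max, lr_min, step, total_steps):
    """Cosine decay from lr_max at step 0 down to lr_min at step == total_steps."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Learning rate finder: one value per order of magnitude from 1e-7 to 1.
finder_lrs = lr_finder_schedule(1e-7, 1.0, n_steps=8)

# Cosine annealing sampled at the start, midpoint, and end of a 100-step run.
cosine_lrs = [cosine_annealing(1e-2, 1e-5, s, 100) for s in (0, 50, 100)]
```

In a learning rate finder you would train for one mini-batch at each value in `finder_lrs` (typically with many more steps than eight), record the loss, and pick a rate somewhat below the point where the loss starts to climb.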

For random forests: The most important parameter is max_features (try sqrt, log2, 0.3, 0.5, 0.7, 1.0). Then n_estimators (more is generally better until diminishing returns; 100-1000). max_depth and min_samples_split control overfitting. Random forests are relatively robust to hyperparameters compared to gradient boosting or neural networks.

Frequently Asked Questions

What is hyperparameter tuning in machine learning?

Hyperparameter tuning is the process of finding the optimal configuration of parameters that are set before training begins, rather than learned from data. These include learning rate, regularization strength, number of hidden layers, batch size, and tree depth. Unlike model parameters, hyperparameters control the learning process itself and must be selected through systematic search or optimization strategies.

What is the difference between grid search and random search?

Grid search exhaustively evaluates every combination of specified hyperparameter values, guaranteeing the best result within the grid but scaling exponentially with dimensions. Random search samples parameter combinations randomly from defined distributions, which is more efficient because it explores more unique values per parameter. For the same compute budget, random search typically finds better configurations.

How does Bayesian optimization work for hyperparameter tuning?

Bayesian optimization builds a probabilistic surrogate model of the objective function. After each evaluation, it updates the model and uses an acquisition function (like Expected Improvement) to decide which point to evaluate next, balancing exploration of uncertain regions with exploitation of promising areas. This makes it far more sample-efficient than grid or random search.

When should I use grid search vs Bayesian optimization?

Use grid search when you have few hyperparameters (1-2) and cheap evaluations. Use random search as a baseline with 2-5 parameters. Use Bayesian optimization when training is expensive, you have many hyperparameters, or you need the best performance with limited compute. For deep learning with long GPU training runs, Bayesian optimization, often combined with early stopping, is usually a better choice than grid search.

What are common hyperparameters to tune for different models?

For random forests: n_estimators, max_depth, min_samples_split, max_features. For gradient boosting: learning_rate, n_estimators, max_depth, subsample, reg_alpha, reg_lambda. For neural networks: learning_rate, batch_size, layers, dropout, weight decay. For SVMs: C, kernel, gamma. For models that have a learning rate (gradient boosting and neural networks), tune it first, as it typically has the largest impact.

About the Author

Michael Lip builds open-source ML tools and developer utilities at zovo.one. ml0x is part of the Zovo Tools network, a collection of free, privacy-first tools for developers and data scientists. No tracking, no accounts required, no data leaves your browser.

Last updated: May 25, 2026