The complete quick reference for ML algorithms, evaluation metrics, sklearn code snippets, hyperparameters, data preprocessing, and common pitfalls. One page. Everything you need.
Pick your data type and goal. Get algorithm recommendations with reasoning.
Supervised learning algorithms learn a mapping from input features X to an output y using labeled training data. The two main tasks are classification (predicting a discrete label) and regression (predicting a continuous value).
When to use: Continuous target variable with approximately linear relationship to features. Good baseline for any regression task. Works best when features are not highly correlated with each other (low multicollinearity).
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
# Coefficients show each feature's weight (magnitudes are comparable only if features share a scale)
print(model.coef_) # feature weights
print(model.intercept_) # bias term
Regularized variants: Use Ridge (L2) when features are correlated, Lasso (L1) when you want feature selection, ElasticNet for both.
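Sketch: the difference between the L2 and L1 penalties can be seen on synthetic data. The data, the alpha values, and the variable names below are assumptions chosen for illustration, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumption): 10 features, only the first two drive y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge shrinks coefficients but keeps them non-zero;
# Lasso drives coefficients of uninformative features to exactly zero
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print(f"Zero coefficients: Ridge={n_zero_ridge}, Lasso={n_zero_lasso}")
```

The zeroed coefficients are what makes Lasso usable for feature selection; in practice alpha is usually tuned with cross-validation.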
When to use: Binary or multiclass classification. Strong baseline for any classification task. Excellent when you need probability estimates and interpretable coefficients. Works well with high-dimensional sparse data (text classification).
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test) # probability estimates
# C controls regularization strength (smaller = stronger)
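When to use: When interpretability is critical. Decision trees can in principle handle mixed feature types and missing values, though scikit-learn's trees need categorical features encoded as numbers, and native missing-value support depends on the scikit-learn version. No feature scaling needed. Use as a building block for understanding your data before trying ensembles.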
from sklearn.tree import DecisionTreeClassifier, export_text
model = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10)
model.fit(X_train, y_train)
# Print the tree rules
print(export_text(model, feature_names=feature_names))
Key limitation: Prone to overfitting without depth constraints. Always set max_depth or min_samples_leaf.
When to use: Excellent general-purpose algorithm for tabular data. Robust to overfitting, handles nonlinear relationships, provides feature importance scores. Great when you need a strong model without much tuning. Parallelizable.
from sklearn.ensemble import RandomForestClassifier
import numpy as np
model = RandomForestClassifier(
n_estimators=200,
max_depth=15,
min_samples_leaf=5,
max_features='sqrt',
n_jobs=-1,
random_state=42
)
model.fit(X_train, y_train)
# Feature importance
importances = model.feature_importances_
sorted_idx = np.argsort(importances)[::-1]
for i in sorted_idx[:10]:
    print(f"{feature_names[i]}: {importances[i]:.4f}")
When to use: Small to medium datasets with clear margin of separation. Effective in high-dimensional spaces (text classification). Kernel trick enables non-linear decision boundaries. Less effective on noisy data with overlapping classes or datasets with more than 100K samples (slow training).
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
# SVM requires feature scaling!
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = SVC(C=1.0, kernel='rbf', probability=True)
model.fit(X_train_scaled, y_train)
y_prob = model.predict_proba(X_test_scaled)
When to use: Small datasets, simple decision boundaries, or as a baseline. Non-parametric with no training phase. Struggles with high-dimensional data (curse of dimensionality) and large datasets (slow predictions). Requires feature scaling.
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5, weights='distance')
model.fit(X_train_scaled, y_train) # must scale features first
# Finding optimal k: try odd values to avoid ties
# Typical range: 3 to sqrt(n)
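Sketch: the rule of thumb above (odd values of k from 3 up to about sqrt(n)) can be turned into a candidate list for cross-validation. The helper name `candidate_k_values` is hypothetical.

```python
import math

def candidate_k_values(n_samples):
    """Odd k values from 3 up to roughly sqrt(n_samples)."""
    upper = max(3, math.isqrt(n_samples))
    return [k for k in range(3, upper + 1) if k % 2 == 1]

print(candidate_k_values(400))  # sqrt(400) = 20, so odd values 3..19
```

Each candidate would then be scored with cross-validation and the best-scoring k kept.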
When to use: The go-to algorithm for tabular data competitions and production systems. Sequentially builds trees that correct previous errors. Frequently appears in winning Kaggle solutions for tabular problems. XGBoost, LightGBM, and CatBoost are the three major implementations. LightGBM is typically the fastest, CatBoost handles categoricals best, and XGBoost is the most widely deployed.
# Option A: sklearn API
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(
n_estimators=200, learning_rate=0.1, max_depth=5
)
# Option B: XGBoost (recommended)
import xgboost as xgb
model = xgb.XGBClassifier(
n_estimators=500,
learning_rate=0.05,
max_depth=6,
subsample=0.8,
colsample_bytree=0.8,
eval_metric='logloss',
early_stopping_rounds=50,
n_jobs=-1
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])
# Option C: LightGBM (fastest)
import lightgbm as lgb
model = lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.03,
    num_leaves=31,
    subsample=0.8,
    subsample_freq=1  # subsample is ignored unless subsample_freq > 0
)
Tip: Always use early stopping with boosting. Set n_estimators high and let early stopping find the optimal number. Reduce learning_rate and increase n_estimators for better performance.
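Sketch: the logic behind early stopping, written out in plain Python. This is a simplified illustration of patience-based stopping, not the internal implementation of any boosting library; the function name and the loss values are hypothetical.

```python
def best_iteration(val_losses, patience):
    """Return the 0-based index of the best round under patience-based early stopping."""
    best_idx, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss = i, loss
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_idx

# Hypothetical validation losses: improve, drift upward, then one late improvement
losses = [0.70, 0.55, 0.48, 0.45, 0.46, 0.47, 0.49, 0.52, 0.40]
print(best_iteration(losses, patience=3))
print(best_iteration(losses, patience=10))
```

With a short patience the late improvement in the last round is never reached, which is the trade-off when choosing the number of early stopping rounds.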
When to use: Large datasets (100K+ samples), image recognition, natural language processing, speech, and any task where feature engineering is difficult. Requires significant data and compute. For tabular data, gradient boosting usually wins unless you have millions of rows. Use sklearn's MLPClassifier for quick experiments; switch to PyTorch/TensorFlow for production deep learning.
from sklearn.neural_network import MLPClassifier
model = MLPClassifier(
hidden_layer_sizes=(128, 64, 32),
activation='relu',
solver='adam',
learning_rate_init=0.001,
early_stopping=True,
validation_fraction=0.1,
max_iter=500,
random_state=42
)
model.fit(X_train_scaled, y_train) # must scale features first
# For deep learning: use PyTorch or TensorFlow
# import torch.nn as nn
# model = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), ...)
Unsupervised learning finds patterns in data without labeled targets. The two primary tasks are clustering (grouping similar data points) and dimensionality reduction (compressing features while preserving structure).
When to use: Clusters are roughly spherical and similarly sized. You know (or can estimate) the number of clusters. Fast and scalable to large datasets. Sensitive to initialization (use k-means++) and outliers.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Find optimal k using elbow method or silhouette score
scores = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    scores.append(silhouette_score(X_scaled, labels))
# Best k = max silhouette score (index 0 corresponds to k=2)
best_k = scores.index(max(scores)) + 2
model = KMeans(n_clusters=best_k, random_state=42, n_init=10)
clusters = model.fit_predict(X_scaled)
When to use: Clusters have arbitrary shapes, unknown number of clusters, or data contains outliers/noise. Does not require specifying k in advance. Automatically detects noise points. Struggles with varying density clusters (use HDBSCAN instead).
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
import numpy as np
# Find optimal eps using k-distance graph
nn = NearestNeighbors(n_neighbors=5)
nn.fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
k_distances = np.sort(distances[:, -1]) # find the "elbow"
model = DBSCAN(eps=0.5, min_samples=5)  # replace 0.5 with the elbow value read from k_distances
labels = model.fit_predict(X_scaled)
n_clusters = len(set(labels) - {-1})
n_noise = (labels == -1).sum()
print(f"Clusters: {n_clusters}, Noise points: {n_noise}")
When to use: Reduce feature count while preserving maximum variance. Preprocessing step before other algorithms. Visualization (project to 2D/3D). Dealing with multicollinearity. Fast and deterministic. Does not work well when the important structure is non-linear.
from sklearn.decomposition import PCA
import numpy as np
# Keep 95% of variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(f"Reduced from {X_scaled.shape[1]} to {X_reduced.shape[1]} features")
# Explained variance per component
print(np.cumsum(pca.explained_variance_ratio_))
# For visualization only
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)
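Sketch: how a variance target such as 95% translates into a number of components. The ratios below are hypothetical stand-ins for `pca.explained_variance_ratio_`.

```python
from itertools import accumulate

# Hypothetical explained_variance_ratio_ values, in descending order
ratios = [0.55, 0.25, 0.10, 0.06, 0.03, 0.01]
cumulative = list(accumulate(ratios))

# Smallest number of components whose cumulative variance reaches the target
target = 0.95
n_components = next(i + 1 for i, c in enumerate(cumulative) if c >= target)
print(n_components)
```

Here three components explain about 90% of the variance, so a fourth is needed to reach the target.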
When to use: Visualizing high-dimensional data in 2D/3D. Preserves local neighborhood structure. Excellent for exploring cluster structure visually. Do not use for preprocessing before ML models: the result is stochastic, scikit-learn's TSNE cannot transform new data points, and global distances are not preserved. Slow on large datasets.
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
# Apply PCA first for speed if d > 50
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X_scaled)
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X_pca)
# perplexity: try values between 5-50
# Higher = more global structure, lower = more local detail
When to use: Modern alternative to t-SNE that is faster, preserves more global structure, and supports transforming new data points. Works for both visualization and as a preprocessing step. Better at preserving the relative distances between clusters. Requires the umap-learn package.
import umap
reducer = umap.UMAP(
n_components=2,
n_neighbors=15,
min_dist=0.1,
metric='euclidean',
random_state=42
)
X_2d = reducer.fit_transform(X_scaled)
# Unlike t-SNE, UMAP can transform new data
X_new_2d = reducer.transform(X_new_scaled)
# n_neighbors: controls local vs global (higher = more global)
# min_dist: how tightly points cluster (lower = tighter)
Choosing the right metric is as important as choosing the right algorithm. A model optimized for accuracy might be useless for imbalanced classes. This section covers all the metrics you need, organized by task type, with when-to-use guidance and sklearn code.
| Metric | Formula | When to Use | Range |
|---|---|---|---|
| Accuracy | (TP + TN) / Total | Balanced classes only. Misleading when one class dominates. | 0 to 1 |
| Precision | TP / (TP + FP) | When false positives are costly (spam detection, fraud alerts). | 0 to 1 |
| Recall | TP / (TP + FN) | When false negatives are costly (disease screening, safety). | 0 to 1 |
| F1 Score | 2 · (P · R) / (P + R) | Balance precision and recall. Use when classes are imbalanced. | 0 to 1 |
| AUC-ROC | Area under ROC curve | Overall discrimination ability. Threshold-independent. Insensitive to class ratio, but can look optimistic under severe imbalance (also check precision-recall AUC). | 0 to 1 (0.5 = random) |
| Log Loss | -mean(y·log(p) + (1 - y)·log(1 - p)) | When you care about calibrated probabilities, not just labels. | 0 to inf |
| Cohen's Kappa | (Accuracy - Expected) / (1 - Expected) | Accounts for chance agreement. Good for imbalanced multiclass. | -1 to 1 |
from sklearn.metrics import (
accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, classification_report,
confusion_matrix, log_loss
)
# Full classification report (precision, recall, f1 per class)
print(classification_report(y_test, y_pred))
# Individual metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred, average='weighted'):.4f}")
print(f"Recall: {recall_score(y_test, y_pred, average='weighted'):.4f}")
print(f"F1: {f1_score(y_test, y_pred, average='weighted'):.4f}")
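# Binary case: pass the positive-class column; for multiclass pass the full y_prob with multi_class='ovr'
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob[:, 1]):.4f}")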
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Rows = actual, Columns = predicted
# [[TN, FP], [FN, TP]]
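Sketch: the formulas in the table, computed by hand from confusion-matrix counts. The counts are a hypothetical imbalanced problem (10 positives, 990 negatives), chosen to show accuracy looking strong while precision and recall are not.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical counts: the model finds only half of the 10 positives
metrics = classification_metrics(tp=5, fp=5, fn=5, tn=985)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy comes out at 0.990 while precision, recall, and F1 are all 0.500.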
| Metric | Formula | When to Use | Best Value |
|---|---|---|---|
| MSE | mean((y - ŷ)²) | Default regression metric. Penalizes large errors heavily. | 0 |
| RMSE | sqrt(MSE) | Same as MSE but in original units. More interpretable. | 0 |
| MAE | mean(abs(y - ŷ)) | Robust to outliers. When all errors matter equally. | 0 |
| R² Score | 1 - SS_res / SS_tot | How much variance your model explains. Baseline: 0 (mean predictor). | 1 |
| MAPE | mean(abs(y - ŷ) / abs(y)) | Percentage error. Scale-independent. Undefined when y=0. | 0% |
from sklearn.metrics import (
mean_squared_error, mean_absolute_error,
r2_score, mean_absolute_percentage_error
)
import numpy as np
print(f"MSE: {mean_squared_error(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.4f}")
print(f"R2: {r2_score(y_test, y_pred):.4f}")
print(f"MAPE: {mean_absolute_percentage_error(y_test, y_pred):.2%}")
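Sketch: the regression formulas from the table, computed by hand on a tiny hypothetical example so the arithmetic can be checked.

```python
import math

y_true = [100.0, 200.0, 300.0, 400.0]  # hypothetical targets
y_hat = [110.0, 190.0, 300.0, 440.0]   # hypothetical predictions

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_hat)]

mse = sum(e ** 2 for e in errors) / n
rmse = math.sqrt(mse)
mae = sum(abs(e) for e in errors) / n

mean_y = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n

print(f"MSE={mse}, RMSE={rmse:.3f}, MAE={mae}, R2={r2:.3f}, MAPE={mape:.2%}")
```

The single error of 40 contributes 1600 of the 1800 total squared error, which is why MSE and RMSE react to it much more strongly than MAE does.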
Data preprocessing often takes up most of the work in a machine learning project, and its quality strongly affects model performance. Always split into train/test first, then fit preprocessing on the training set only, to avoid data leakage.
Many algorithms (SVM, KNN, Neural Networks, PCA, Gradient Descent-based) require features on the same scale. Tree-based algorithms (Random Forest, XGBoost) do not require scaling.
| Method | Formula | When to Use |
|---|---|---|
| StandardScaler | (x - mean) / std | Default choice. Assumes roughly normal distribution. Sensitive to outliers. |
| MinMaxScaler | (x - min) / (max - min) | When you need values in [0, 1]. Good for neural networks. Very sensitive to outliers. |
| RobustScaler | (x - median) / IQR | Data contains outliers. Uses median and interquartile range. |
| MaxAbsScaler | x / max(abs(x)) | Sparse data. Does not destroy sparsity. |
from sklearn.preprocessing import StandardScaler, RobustScaler
# CRITICAL: fit on train only, transform both
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit + transform
X_test_scaled = scaler.transform(X_test) # transform only!
# With outliers? Use RobustScaler
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
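Sketch: the three main scaling formulas applied by hand to a hypothetical feature with one outlier. The quartile method used here is an assumption, so results may differ slightly from a library scaler on other data.

```python
import statistics

values = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical feature with one outlier

def standard_scale(xs):
    # Population standard deviation, which is what StandardScaler uses
    mean, std = statistics.fmean(xs), statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]

def min_max_scale(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def robust_scale(xs):
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in xs]

standard = standard_scale(values)
min_max = min_max_scale(values)
robust = robust_scale(values)
print("standard:", [round(v, 2) for v in standard])
print("min-max: ", [round(v, 2) for v in min_max])
print("robust:  ", [round(v, 2) for v in robust])
```

Min-max scaling squeezes the four regular values into a sliver near 0 because the outlier defines the range, while the robust version keeps them evenly spread and leaves the outlier far away.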
| Method | When to Use | Output |
|---|---|---|
| OrdinalEncoder | Ordinal categories (low/med/high). Tree-based models. | Integer per category |
| OneHotEncoder | Nominal categories (<15 unique values). Linear models, SVM, KNN. | Binary column per category |
| TargetEncoder | High-cardinality categoricals (100+ values). Gradient boosting models. | Float (mean target per category) |
| LabelEncoder | Target variable encoding only. Not for features. | Integer |
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
# Best practice: use ColumnTransformer for mixed types
preprocessor = ColumnTransformer(transformers=[
('num', StandardScaler(), numerical_cols),
('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_cols)
])
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)
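Sketch: the idea behind target encoding, reduced to its simplest form. This is a deliberately simplified illustration; library implementations such as scikit-learn's TargetEncoder also apply smoothing and internal cross-fitting. The function names and the data are hypothetical.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets):
    """Learn the mean target per category from training rows only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    global_mean = sum(targets) / len(targets)
    return {c: sums[c] / counts[c] for c in sums}, global_mean

def apply_target_encoding(categories, mapping, global_mean):
    """Unseen categories fall back to the global training mean."""
    return [mapping.get(c, global_mean) for c in categories]

train_city = ["paris", "paris", "rome", "rome", "rome", "oslo"]
train_y = [1, 0, 1, 1, 0, 0]
mapping, global_mean = fit_target_encoding(train_city, train_y)

encoded_test = apply_target_encoding(["rome", "lima"], mapping, global_mean)
print(encoded_test)
```

Because the mapping is built from the target, it has to be learned on the training split only; fitting it on all rows would leak test labels into the features.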
from sklearn.impute import SimpleImputer, KNNImputer
# Strategy: 'mean', 'median', 'most_frequent', 'constant'
imputer = SimpleImputer(strategy='median')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)
# For more sophisticated imputation
knn_imputer = KNNImputer(n_neighbors=5)
X_train_imputed = knn_imputer.fit_transform(X_train)
# When to use what:
# - median: numeric features with outliers
# - mean: numeric features, normally distributed
# - most_frequent: categorical features
# - KNNImputer: when missing values can be estimated from similar rows (scale features first; slow on large data)
# - Consider a binary "is_missing" indicator column for important features (add_indicator=True)
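Sketch: median imputation plus a missing-value indicator, written by hand to show what the two output columns contain. The helper names and the data are hypothetical; in scikit-learn, `SimpleImputer(add_indicator=True)` covers the same idea.

```python
import statistics

def fit_median(column):
    """Median of the observed (non-missing) values."""
    return statistics.median(v for v in column if v is not None)

def impute_with_indicator(column, fill_value):
    """Return the filled column and a 0/1 flag marking originally missing rows."""
    filled = [fill_value if v is None else v for v in column]
    is_missing = [int(v is None) for v in column]
    return filled, is_missing

train_age = [22, None, 35, 41, None, 29]
test_age = [None, 50]

median_age = fit_median(train_age)  # learned from training rows only
train_filled, train_flag = impute_with_indicator(train_age, median_age)
test_filled, test_flag = impute_with_indicator(test_age, median_age)
print(median_age, train_filled, train_flag)
print(test_filled, test_flag)
```

The indicator lets a model learn from the fact that a value was missing, which the filled value alone would hide.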
Reducing the number of features can improve model performance, reduce overfitting, and speed up training. Three approaches follow: filter, wrapper, and embedded methods.
from sklearn.feature_selection import (
SelectKBest, f_classif, mutual_info_classif,
RFE, SequentialFeatureSelector
)
# Method 1: Filter — fast, univariate
selector = SelectKBest(score_func=f_classif, k=20)
X_selected = selector.fit_transform(X_train, y_train)
# Method 2: Wrapper — uses model performance
from sklearn.ensemble import RandomForestClassifier
selector = RFE(RandomForestClassifier(n_estimators=100), n_features_to_select=15)
X_selected = selector.fit_transform(X_train, y_train)
# Method 3: Embedded — L1 regularization (built into model)
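# Note: Lasso is a regressor; for a classification target, LogisticRegression(penalty='l1', solver='liblinear') is the closer fit
from sklearn.linear_model import Lasso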
from sklearn.feature_selection import SelectFromModel
lasso = Lasso(alpha=0.01)
selector = SelectFromModel(lasso)
X_selected = selector.fit_transform(X_train, y_train)
# Check which features were selected
selected_mask = selector.get_support()
selected_features = [f for f, s in zip(feature_names, selected_mask) if s]
You have picked an algorithm and preprocessed your data. Now you need to validate that your model generalizes and tune its hyperparameters for best performance. This section covers the essential techniques.
Never evaluate on training data. Never rely on a single train/test split. Cross-validation gives a more reliable estimate of generalization performance than any single split.
from sklearn.model_selection import (
cross_val_score, StratifiedKFold, TimeSeriesSplit
)
# Standard: 5-fold stratified cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='f1_weighted')
print(f"F1: {scores.mean():.4f} +/- {scores.std():.4f}")
# Time series: NEVER shuffle, use temporal split
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X_train, y_train, cv=tscv, scoring='neg_mean_squared_error')
# Common scoring values:
# Classification: 'accuracy', 'f1', 'f1_weighted', 'roc_auc', 'precision', 'recall'
# Regression: 'neg_mean_squared_error', 'neg_mean_absolute_error', 'r2'
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
# Grid Search: exhaustive, use when few parameters
param_grid = {
'n_estimators': [100, 200, 500],
'max_depth': [5, 10, 15, None],
'min_samples_leaf': [1, 5, 10]
}
grid = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid, cv=5, scoring='f1_weighted', n_jobs=-1
)
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}")
print(f"Best CV F1: {grid.best_score_:.4f}")
# Randomized Search: faster, use when many parameters
from scipy.stats import randint, uniform
import xgboost as xgb
param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(3, 20),
    'learning_rate': uniform(0.01, 0.3),  # uniform(loc, scale): samples from [0.01, 0.31]
    'subsample': uniform(0.6, 0.4)  # samples from [0.6, 1.0]
}
search = RandomizedSearchCV(
xgb.XGBClassifier(), param_dist,
n_iter=100, cv=5, scoring='f1_weighted',
random_state=42, n_jobs=-1
)
search.fit(X_train, y_train)
Tip: For serious tuning, consider Optuna or Hyperopt instead of GridSearch. They use Bayesian-style optimization to find good hyperparameters in fewer trials. Optuna's pruning feature stops unpromising trials early, which can cut search time substantially.
Learning curves diagnose whether your model suffers from high bias (underfitting) or high variance (overfitting), and whether more data would help.
from sklearn.model_selection import learning_curve
import numpy as np
train_sizes, train_scores, val_scores = learning_curve(
model, X_train, y_train,
train_sizes=np.linspace(0.1, 1.0, 10),
cv=5, scoring='f1_weighted', n_jobs=-1
)
# Interpretation:
# Gap between train and val = overfitting (high variance)
# Fix: more data, regularization, simpler model
# Both scores low = underfitting (high bias)
# Fix: more features, complex model, less regularization
# Val score plateaus = more data won't help
Always use sklearn Pipelines to chain preprocessing and modeling. This prevents data leakage, simplifies deployment, and makes cross-validation correct.
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
# Define preprocessing for different column types
numeric_pipe = Pipeline([
('impute', SimpleImputer(strategy='median')),
('scale', StandardScaler())
])
categorical_pipe = Pipeline([
('impute', SimpleImputer(strategy='most_frequent')),
('encode', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer([
('num', numeric_pipe, numerical_cols),
('cat', categorical_pipe, categorical_cols)
])
# Full pipeline: preprocess + model
pipeline = Pipeline([
('preprocess', preprocessor),
('model', RandomForestClassifier(n_estimators=200, random_state=42))
])
# Now cross-validation is fully correct
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='f1_weighted')
# Train final model
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
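Sketch: tuning a model inside a Pipeline. The synthetic data, the step names, and the candidate values for C are assumptions for illustration; the point is the `<step name>__<parameter>` naming that GridSearchCV uses to reach into a pipeline step.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption) standing in for X_train, y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
grid = GridSearchCV(pipe, {"model__C": [0.1, 1.0, 10.0]}, cv=5, scoring="f1_weighted")
grid.fit(X_demo, y_demo)
print(grid.best_params_, round(grid.best_score_, 4))
```

Because the scaler lives inside the pipeline, it is refit on the training portion of every fold, so the search itself stays free of leakage.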
These are the mistakes that silently destroy model performance. They will give you artificially high scores during development and catastrophic failures in production. Learn to spot and prevent them.
The number one killer of ML projects. Data leakage occurs when information from the test set bleeds into training. Common causes: (1) Fitting a scaler or imputer on the entire dataset before splitting. (2) Including features derived from the target (e.g., customer lifetime value to predict churn). (3) Time-series data split randomly instead of temporally. (4) Duplicate records spanning train and test. Prevention: Always split first, then preprocess. Use sklearn Pipelines. For time series, use TimeSeriesSplit. Check for duplicates with df.duplicated().
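Sketch: two of the leakage causes above, made concrete in plain Python. The numbers and rows are hypothetical; with pandas, `df.duplicated()` would serve for the duplicate check.

```python
import statistics

# Cause 1: statistics computed on all rows absorb information from the test set
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 90.0, 100.0]
train, test = data[:8], data[8:]  # the last two rows are the test set

leaky_mean = statistics.fmean(data)   # uses test rows: leakage
clean_mean = statistics.fmean(train)  # training rows only
print(f"leaky mean={leaky_mean}, clean mean={clean_mean}")

# Cause 4: the same record appearing in both train and test
train_rows = [(1, "a"), (2, "b"), (3, "c")]
test_rows = [(3, "c"), (4, "d")]
overlap = set(train_rows) & set(test_rows)
print(f"rows present in both splits: {overlap}")
```

A scaler fitted with the leaky mean would centre the training data using values the model is not supposed to have seen.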
When one class dominates (99% negative, 1% positive), a model that always predicts "negative" gets 99% accuracy while finding no positives at all. Solutions: (1) Use appropriate metrics: F1, AUC-ROC, precision-recall curves, not accuracy. (2) Use class_weight='balanced' in sklearn models. (3) Resample: SMOTE for oversampling minority, random undersampling for majority. (4) Stratified cross-validation with StratifiedKFold. (5) Adjust decision threshold based on precision-recall tradeoff.
# Handling class imbalance
from sklearn.utils.class_weight import compute_class_weight
import numpy as np
# Option 1: built-in class weights
model = RandomForestClassifier(class_weight='balanced')
# Option 2: manual weights
weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
model = xgb.XGBClassifier(scale_pos_weight=weights[1]/weights[0])
# Option 3: SMOTE oversampling (requires imbalanced-learn; resample the training set only, never the test set)
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
# Option 4: threshold tuning
y_prob = model.predict_proba(X_test)[:, 1]
threshold = 0.3 # lower threshold to catch more positives
y_pred_adjusted = (y_prob >= threshold).astype(int)
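Sketch: what lowering the decision threshold does to precision and recall. The labels and probabilities are hypothetical, and the helper name is an assumption.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall of the positive class at a given probability threshold."""
    y_pred = [int(p >= threshold) for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]

for threshold in (0.5, 0.3):
    p, r = precision_recall_at(y_true, y_prob, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Moving the threshold from 0.5 to 0.3 catches the positive scored at 0.40, raising recall, at the cost of one more false positive. The threshold should be chosen on validation data, not on the test set.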
Training score is high, test score is low. The model memorized noise instead of learning patterns. Signs: Large gap between train and validation scores. Model performance degrades on new data. Solutions: (1) Get more training data. (2) Reduce model complexity (lower max_depth, fewer features, fewer layers). (3) Add regularization (L1/L2 penalties, dropout). (4) Use cross-validation to detect early. (5) Apply early stopping (boosting, neural networks). (6) Use data augmentation.
SVM, KNN, Neural Networks, PCA, and K-Means are all sensitive to feature scale, whether through distances between data points, variances, or gradient magnitudes. If one feature ranges from 0 to 1000 while another ranges from 0 to 1, the first feature will dominate. Always scale features before using these algorithms. Tree-based models (Random Forest, XGBoost) do not require scaling.
Accuracy weights every sample equally, so frequent classes dominate the score even though rare classes may be more important or harder to predict. Use average='weighted' for F1/precision/recall to account for class frequency, or average='macro' to treat all classes as equally important regardless of frequency. Always print the full classification_report to inspect per-class performance.
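Sketch: the arithmetic behind the macro and weighted averages, using hypothetical per-class F1 scores and class counts.

```python
# Hypothetical per-class F1 scores and supports (number of true samples per class)
per_class_f1 = {"majority": 0.95, "minority": 0.40}
support = {"majority": 900, "minority": 100}

# Macro: plain mean over classes, every class counts the same
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Weighted: mean over classes weighted by class frequency
total = sum(support.values())
weighted_f1 = sum(per_class_f1[c] * support[c] for c in per_class_f1) / total

print(f"macro F1={macro_f1:.3f}, weighted F1={weighted_f1:.3f}")
```

The weighted average (0.895) hides the weak minority class, while the macro average (0.675) exposes it.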
Highly correlated features (multicollinearity) inflate feature importance, make coefficients unstable in linear models, and waste computation. Detection: Correlation matrix (df.corr()), VIF (Variance Inflation Factor). Solutions: Drop one of correlated pairs (threshold: |r| > 0.9), use PCA, use L1/L2 regularization, or use tree-based models (more robust to correlation).
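Sketch: detecting a correlated pair by hand. The feature values are hypothetical. For exactly two features, the VIF reduces to 1 / (1 - r²); with more features it requires regressing each feature on all the others.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

feature_a = [1.0, 2.0, 3.0, 4.0, 5.0]
feature_b = [2.1, 3.9, 6.2, 7.8, 10.1]  # roughly 2 * feature_a

r = pearson_r(feature_a, feature_b)
vif = 1 / (1 - r ** 2)  # two-feature case only
print(f"r={r:.4f}, VIF={vif:.1f}, drop one: {abs(r) > 0.9}")
```

A VIF above roughly 5 to 10 is a common rule of thumb for problematic multicollinearity; this pair is far beyond that.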
When working with time-series data, you must split temporally. Random shuffling creates look-ahead bias where your model uses future information to predict the past. Use TimeSeriesSplit for cross-validation. Never use features that incorporate future values (e.g., a rolling average that includes future dates).
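Sketch: an expanding-window split in plain Python, similar in spirit to TimeSeriesSplit. The function name is hypothetical and the fold sizing is a simplification, so indices may not match the library in every case.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices); training rows always precede test rows."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    for start in range(first_test_start, n_samples, test_size):
        yield list(range(start)), list(range(start, start + test_size))

splits = list(expanding_window_splits(n_samples=10, n_splits=3))
for train_idx, test_idx in splits:
    print(f"train={train_idx} test={test_idx}")
```

Every fold trains only on rows that come before its test rows, which is what rules out look-ahead bias. The rows must already be sorted by time for this to hold.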
It depends on your data type and goal. For tabular data classification, start with Gradient Boosting (XGBoost/LightGBM). For regression, try Linear Regression first, then Random Forest. For clustering, use K-Means for spherical clusters or DBSCAN for arbitrary shapes. For images, use Convolutional Neural Networks. For text, use transformer-based models or TF-IDF with Logistic Regression. Use the Algorithm Selector Tool above to get personalized recommendations.
Precision measures how many of your positive predictions were correct (TP / (TP + FP)). Recall measures how many actual positives you found (TP / (TP + FN)). Use precision when false positives are costly (spam detection). Use recall when false negatives are costly (disease screening). F1 score is the harmonic mean of both, giving a balanced single number.
Use cross-validation to detect overfitting early. Apply regularization (L1/L2). Reduce model complexity (fewer features, shallower trees). Increase training data or use data augmentation. Use dropout for neural networks. Apply early stopping during training. Ensure your validation set is representative of production data.
Data leakage occurs when information from outside the training dataset is used to create the model. Common causes: using test data during preprocessing (fit_transform on the whole dataset instead of fit on train only), including features derived from the target variable, and using future data to predict past events. Always split data before any preprocessing steps and use sklearn Pipelines.
Random Forest trains trees independently in parallel, making it faster and more robust to hyperparameter choices. Gradient Boosting trains trees sequentially, correcting previous errors, and usually achieves higher accuracy but is slower and more prone to overfitting. Start with Random Forest for baselines and switch to Gradient Boosting (XGBoost, LightGBM) when you need maximum performance.