Machine Learning Cheatsheet 2026

The complete quick reference for ML algorithms, evaluation metrics, sklearn code snippets, hyperparameters, data preprocessing, and common pitfalls. One page. Everything you need.

By Michael Lip · Updated May 16, 2026 · ~20 min read

Table of Contents

  Algorithm Selector Tool
  1. Supervised Learning
  2. Unsupervised Learning
  3. Evaluation Metrics
  4. Data Preprocessing
  5. Model Selection & Tuning
  6. Common Pitfalls
  7. FAQ

Algorithm Selector Tool

Pick your data type and goal. Get algorithm recommendations with reasoning.


1. Supervised Learning

Supervised learning algorithms learn a mapping from input features X to an output y using labeled training data. The two main tasks are classification (predicting a discrete label) and regression (predicting a continuous value).

1.1 Linear Regression

Linear Regression

Regression · Parametric · Linear

When to use: Continuous target variable with an approximately linear relationship to the features. Good baseline for any regression task. Works best when features are not highly correlated with each other (low multicollinearity).

Train: O(n·d²) · Predict: O(d) · Space: O(d)
fit_intercept=True positive=False
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Coefficients indicate feature importance only if features are on comparable scales
print(model.coef_)       # feature weights
print(model.intercept_)  # bias term

Regularized variants: Use Ridge (L2) when features are correlated, Lasso (L1) when you want feature selection, ElasticNet for both.
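To make the difference between the penalties concrete, here is a minimal sketch on synthetic data. The dataset from make_regression and the alpha values are illustrative assumptions, not tuned recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 20 features, only 5 carry signal
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)  # penalties assume comparable scales

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=2.0).fit(X, y)

# L2 shrinks coefficients toward zero; L1 can set them exactly to zero
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print(f"Ridge zero coefficients: {n_zero_ridge}")
print(f"Lasso zero coefficients: {n_zero_lasso}")
```

Features whose Lasso coefficient is exactly zero are effectively removed from the model, which is why L1 doubles as feature selection.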

1.2 Logistic Regression

Logistic Regression

Classification · Parametric · Linear

When to use: Binary or multiclass classification. Strong baseline for any classification task. Excellent when you need probability estimates and interpretable coefficients. Works well with high-dimensional sparse data (text classification).

Train: O(n·d·k) · Predict: O(d·k) · Space: O(d·k)
C=1.0 penalty='l2' solver='lbfgs' max_iter=100
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)  # probability estimates

# C controls regularization strength (smaller = stronger)

1.3 Decision Trees

Decision Tree

Classification & Regression · Non-parametric · Tree-based

When to use: When interpretability is critical. No feature scaling needed. In scikit-learn, trees accept missing values (NaN) natively since version 1.3, but categorical features must still be encoded as numbers. Use as a building block for understanding your data before trying ensembles.

Train: O(n·d·log n) · Predict: O(log n) · Space: O(nodes)
max_depth=None min_samples_split=2 min_samples_leaf=1 criterion='gini' max_features=None
from sklearn.tree import DecisionTreeClassifier, export_text

model = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10)
model.fit(X_train, y_train)

# Print the tree rules
print(export_text(model, feature_names=feature_names))

Key limitation: Prone to overfitting without depth constraints. Always set max_depth or min_samples_leaf.
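The effect of depth constraints can be checked with a quick comparison. This is a sketch on synthetic, deliberately noisy data (the make_classification settings are assumptions); the exact scores will vary with the dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (assumption): flip_y mislabels about 10% of samples
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10,
                                random_state=0).fit(X_train, y_train)

for name, tree in [("unconstrained", full), ("constrained", pruned)]:
    print(f"{name}: depth={tree.get_depth()}, "
          f"train={tree.score(X_train, y_train):.3f}, "
          f"test={tree.score(X_test, y_test):.3f}")
```

The unconstrained tree fits the training set perfectly, noise included; the gap between its train and test scores is the overfitting the constraints are meant to limit.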

1.4 Random Forest

Random Forest

Classification & Regression · Ensemble · Bagging

When to use: Excellent general-purpose algorithm for tabular data. Robust to overfitting, handles nonlinear relationships, provides feature importance scores. Great when you need a strong model without much tuning. Parallelizable.

Train: O(T·n·d·log n) · Predict: O(T·log n) · Space: O(T·nodes)
n_estimators=100 max_depth=None max_features='sqrt' min_samples_leaf=1 n_jobs=-1 oob_score=False
from sklearn.ensemble import RandomForestClassifier
import numpy as np

model = RandomForestClassifier(
    n_estimators=200,
    max_depth=15,
    min_samples_leaf=5,
    max_features='sqrt',
    n_jobs=-1,
    random_state=42
)
model.fit(X_train, y_train)

# Feature importance
importances = model.feature_importances_
sorted_idx = np.argsort(importances)[::-1]
for i in sorted_idx[:10]:
    print(f"{feature_names[i]}: {importances[i]:.4f}")

1.5 Support Vector Machines (SVM)

SVM

Classification & Regression · Kernel-based

When to use: Small to medium datasets with clear margin of separation. Effective in high-dimensional spaces (text classification). Kernel trick enables non-linear decision boundaries. Less effective on noisy data with overlapping classes or datasets with more than 100K samples (slow training).

Train: O(n²·d) to O(n³) · Predict: O(sv·d) · Space: O(sv·d)
C=1.0 kernel='rbf' gamma='scale' class_weight=None
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# SVM requires feature scaling!
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = SVC(C=1.0, kernel='rbf', probability=True)
model.fit(X_train_scaled, y_train)
y_prob = model.predict_proba(X_test_scaled)

1.6 K-Nearest Neighbors (KNN)

K-Nearest Neighbors

Classification & Regression · Instance-based · Lazy Learner

When to use: Small datasets, simple decision boundaries, or as a baseline. Non-parametric with no training phase. Struggles with high-dimensional data (curse of dimensionality) and large datasets (slow predictions). Requires feature scaling.

Train: O(1) · Predict: O(n·d) · Space: O(n·d)
n_neighbors=5 weights='uniform' metric='minkowski' p=2
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5, weights='distance')
model.fit(X_train_scaled, y_train)  # must scale features first

# Finding optimal k: try odd values to avoid ties
# Typical range: 3 to sqrt(n)
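One way to run that search is cross-validation over the odd values of k. The sketch below uses synthetic data and a default 5-fold split, both assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Odd k from 3 up to roughly sqrt(n) = 20
candidate_ks = list(range(3, 21, 2))
cv_scores = {}
for k in candidate_ks:
    # Scaling sits inside the pipeline so each fold is scaled on its own training part
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print(f"best k = {best_k}, CV accuracy = {cv_scores[best_k]:.3f}")
```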

1.7 Gradient Boosting

Gradient Boosting (XGBoost / LightGBM / CatBoost)

Classification & Regression · Ensemble · Boosting

When to use: The go-to algorithm for tabular data competitions and production systems. Sequentially builds trees that correct previous errors. A frequent top performer in Kaggle competitions on tabular data. XGBoost, LightGBM, and CatBoost are the three major implementations. LightGBM is usually the fastest to train, CatBoost has the strongest built-in handling of categorical features, and XGBoost is the most widely deployed.

Train: O(T·n·d·log n) · Predict: O(T·depth) · Space: O(T·leaves)
n_estimators=100 learning_rate=0.1 max_depth=6 subsample=0.8 colsample_bytree=0.8 reg_alpha=0 reg_lambda=1
# Option A: sklearn API
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=5
)

# Option B: XGBoost (recommended)
import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='logloss',
    early_stopping_rounds=50,
    n_jobs=-1
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])

# Option C: LightGBM (fastest)
import lightgbm as lgb

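model = lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.03,
    num_leaves=31,
    subsample=0.8,
    subsample_freq=1  # subsample only takes effect when subsample_freq > 0
)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(stopping_rounds=50)]
)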

Tip: Always use early stopping with boosting. Set n_estimators high and let early stopping find the optimal number. Reduce learning_rate and increase n_estimators for better performance.

1.8 Neural Networks

Neural Networks (MLP / Deep Learning)

Classification & Regression · Non-linear · Deep Learning

When to use: Large datasets (100K+ samples), image recognition, natural language processing, speech, and any task where feature engineering is difficult. Requires significant data and compute. For tabular data, gradient boosting usually wins unless you have millions of rows. Use sklearn's MLPClassifier for quick experiments; switch to PyTorch/TensorFlow for production deep learning.

Train: O(epochs·n·layers·neurons²) · Predict: O(layers·neurons²)
hidden_layer_sizes=(100,) activation='relu' solver='adam' learning_rate_init=0.001 batch_size='auto' max_iter=200 early_stopping=True
from sklearn.neural_network import MLPClassifier

model = MLPClassifier(
    hidden_layer_sizes=(128, 64, 32),
    activation='relu',
    solver='adam',
    learning_rate_init=0.001,
    early_stopping=True,
    validation_fraction=0.1,
    max_iter=500,
    random_state=42
)
model.fit(X_train_scaled, y_train)  # must scale features first

# For deep learning: use PyTorch or TensorFlow
# import torch.nn as nn
# model = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), ...)

2. Unsupervised Learning

Unsupervised learning finds patterns in data without labeled targets. The two primary tasks are clustering (grouping similar data points) and dimensionality reduction (compressing features while preserving structure).

2.1 K-Means Clustering

K-Means

Clustering · Centroid-based · Partitional

When to use: Clusters are roughly spherical and similarly sized. You know (or can estimate) the number of clusters. Fast and scalable to large datasets. Sensitive to initialization (use k-means++) and outliers.

Train: O(n·k·d·iterations) · Predict: O(k·d)
n_clusters=8 init='k-means++' n_init='auto' max_iter=300
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Find optimal k using elbow method or silhouette score
scores = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    scores.append(silhouette_score(X_scaled, labels))

# Best k = max silhouette score
best_k = scores.index(max(scores)) + 2
model = KMeans(n_clusters=best_k, random_state=42, n_init=10)  # same n_init as the search above
clusters = model.fit_predict(X_scaled)

2.2 DBSCAN

DBSCAN

Clustering · Density-based

When to use: Clusters have arbitrary shapes, unknown number of clusters, or data contains outliers/noise. Does not require specifying k in advance. Automatically detects noise points. Struggles with varying density clusters (use HDBSCAN instead).

Train: O(n·log n) with spatial indexing · Space: O(n)
eps=0.5 min_samples=5 metric='euclidean'
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
import numpy as np

# Find optimal eps using k-distance graph
nn = NearestNeighbors(n_neighbors=5)
nn.fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
k_distances = np.sort(distances[:, -1])  # plot this curve and find the "elbow"

# eps=0.5 is a placeholder: set it to the k-distance value at the elbow
model = DBSCAN(eps=0.5, min_samples=5)
labels = model.fit_predict(X_scaled)

n_clusters = len(set(labels) - {-1})
n_noise = (labels == -1).sum()
print(f"Clusters: {n_clusters}, Noise points: {n_noise}")

2.3 PCA (Principal Component Analysis)

PCA

Dimensionality Reduction · Linear · Variance-preserving

When to use: Reduce feature count while preserving maximum variance. Preprocessing step before other algorithms. Visualization (project to 2D/3D). Dealing with multicollinearity. Fast and deterministic. Does not work well when the important structure is non-linear.

Train: O(min(n·d², n²·d)) · Transform: O(n·d·k)
n_components=None svd_solver='auto' whiten=False
from sklearn.decomposition import PCA
import numpy as np

# Keep 95% of variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(f"Reduced from {X_scaled.shape[1]} to {X_reduced.shape[1]} features")

# Explained variance per component
print(np.cumsum(pca.explained_variance_ratio_))

# For visualization only
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)

2.4 t-SNE

t-SNE

Dimensionality Reduction · Non-linear · Visualization

When to use: Visualizing high-dimensional data in 2D/3D. Preserves local neighborhood structure. Excellent for exploring cluster structure visually. Do not use for preprocessing before ML models (non-deterministic, no inverse transform, does not preserve global distances). Slow on large datasets.

Train: O(n²) or O(n·log n) with Barnes-Hut · No transform method
n_components=2 perplexity=30 learning_rate='auto' max_iter=1000
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Apply PCA first for speed if d > 50
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X_scaled)

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X_pca)

# perplexity: try values between 5-50
# Higher = more global structure, lower = more local detail

2.5 UMAP

UMAP

Dimensionality Reduction · Non-linear · Manifold Learning

When to use: Modern alternative to t-SNE that is faster, preserves more global structure, and supports transforming new data points. Works for both visualization and as a preprocessing step. Better at preserving the relative distances between clusters. Requires the umap-learn package.

Train: O(n^1.14) empirically · Transform: O(n) for new points
n_components=2 n_neighbors=15 min_dist=0.1 metric='euclidean'
import umap

reducer = umap.UMAP(
    n_components=2,
    n_neighbors=15,
    min_dist=0.1,
    metric='euclidean',
    random_state=42
)
X_2d = reducer.fit_transform(X_scaled)

# Unlike t-SNE, UMAP can transform new data
X_new_2d = reducer.transform(X_new_scaled)

# n_neighbors: controls local vs global (higher = more global)
# min_dist: how tightly points cluster (lower = tighter)

3. Evaluation Metrics

Choosing the right metric is as important as choosing the right algorithm. A model optimized for accuracy might be useless for imbalanced classes. This section covers all the metrics you need, organized by task type, with when-to-use guidance and sklearn code.

3.1 Classification Metrics

Metric | Formula | When to Use | Range
Accuracy | (TP + TN) / Total | Balanced classes only. Misleading when one class dominates. | 0 to 1
Precision | TP / (TP + FP) | When false positives are costly (spam detection, fraud alerts). | 0 to 1
Recall | TP / (TP + FN) | When false negatives are costly (disease screening, safety). | 0 to 1
F1 Score | 2 · (P · R) / (P + R) | Balance precision and recall. Use when classes are imbalanced. | 0 to 1
AUC-ROC | Area under ROC curve | Overall discrimination ability. Threshold-independent. Insensitive to class proportions, but can look optimistic under heavy imbalance; also check the precision-recall curve. | 0 to 1 (0.5 = random)
Log Loss | -mean(y·log(p) + (1 - y)·log(1 - p)) | When you care about calibrated probabilities, not just labels. | 0 to inf
Cohen's Kappa | (Accuracy - Expected) / (1 - Expected) | Accounts for chance agreement. Good for imbalanced multiclass. | -1 to 1
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, classification_report,
    confusion_matrix, log_loss
)

# Full classification report (precision, recall, f1 per class)
print(classification_report(y_test, y_pred))

# Individual metrics
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred, average='weighted'):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred, average='weighted'):.4f}")
print(f"F1:        {f1_score(y_test, y_pred, average='weighted'):.4f}")
print(f"AUC-ROC:   {roc_auc_score(y_test, y_prob[:, 1]):.4f}")  # binary case; for multiclass pass y_prob and multi_class='ovr'

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Rows = actual, Columns = predicted
# [[TN, FP], [FN, TP]]
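As a worked example of the formulas, the following sketch computes the main metrics by hand from confusion-matrix counts. The counts are hypothetical, chosen to mimic an imbalanced problem.

```python
# Hypothetical confusion-matrix counts for a 1,000-sample binary problem
tp, fp, fn, tn = 40, 10, 20, 930

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.970 precision=0.800 recall=0.667 f1=0.727
```

Accuracy looks excellent here, yet recall shows that a third of the actual positives are missed, which is why accuracy alone is misleading on imbalanced data.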

3.2 Regression Metrics

Metric | Formula | When to Use | Best Value
MSE | mean((y - ŷ)²) | Default regression metric. Penalizes large errors heavily. | 0
RMSE | sqrt(MSE) | Same as MSE but in original units. More interpretable. | 0
MAE | mean(|y - ŷ|) | Robust to outliers. When all errors matter equally. | 0
R² Score | 1 - SS_res / SS_tot | How much variance your model explains. Baseline: 0 (mean predictor). | 1
MAPE | mean(|y - ŷ| / |y|) | Percentage error. Scale-independent. Undefined when y=0. | 0%
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error,
    r2_score, mean_absolute_percentage_error
)
import numpy as np

print(f"MSE:  {mean_squared_error(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
print(f"MAE:  {mean_absolute_error(y_test, y_pred):.4f}")
print(f"R2:   {r2_score(y_test, y_pred):.4f}")
print(f"MAPE: {mean_absolute_percentage_error(y_test, y_pred):.2%}")
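The regression formulas can also be checked by hand. This sketch uses four hypothetical target/prediction pairs and only the standard library.

```python
import math

y_true = [3.0, 5.0, 7.0, 9.0]   # hypothetical targets
y_hat = [2.0, 5.0, 8.0, 12.0]   # hypothetical predictions

n = len(y_true)
residuals = [t - p for t, p in zip(y_true, y_hat)]   # 1, 0, -1, -3

mse = sum(r ** 2 for r in residuals) / n
rmse = math.sqrt(mse)
mae = sum(abs(r) for r in residuals) / n

mean_y = sum(y_true) / n
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MSE={mse:.2f} RMSE={rmse:.3f} MAE={mae:.2f} R2={r2:.2f}")
# MSE=2.75 RMSE=1.658 MAE=1.25 R2=0.45
```

The single error of 3 contributes 9 of the 11 units of squared error, which illustrates how MSE and RMSE weight large errors more heavily than MAE does.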

4. Data Preprocessing

Data preprocessing is often the bulk of the work in a machine learning project (80% is a commonly quoted rule of thumb), and its quality strongly affects model performance. Always split into train/test first, then fit every preprocessing step on the training set only to avoid data leakage.

4.1 Feature Scaling

Many algorithms (SVM, KNN, Neural Networks, PCA, Gradient Descent-based) require features on the same scale. Tree-based algorithms (Random Forest, XGBoost) do not require scaling.

Method | Formula | When to Use
StandardScaler | (x - mean) / std | Default choice. Assumes roughly normal distribution. Sensitive to outliers.
MinMaxScaler | (x - min) / (max - min) | When you need values in [0, 1]. Good for neural networks. Very sensitive to outliers.
RobustScaler | (x - median) / IQR | Data contains outliers. Uses median and interquartile range.
MaxAbsScaler | x / max(|x|) | Sparse data. Does not destroy sparsity.
from sklearn.preprocessing import StandardScaler, RobustScaler

# CRITICAL: fit on train only, transform both
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform
X_test_scaled = scaler.transform(X_test)         # transform only!

# With outliers? Use RobustScaler
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
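To see why outliers matter for the choice of scaler, the sketch below applies the StandardScaler and RobustScaler formulas by hand to one hypothetical feature. It uses only the standard library, so it illustrates the arithmetic rather than the sklearn classes themselves.

```python
import statistics

values = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical feature with one outlier

# StandardScaler-style: (x - mean) / population std
mean = statistics.fmean(values)
std = statistics.pstdev(values)
standard = [(x - mean) / std for x in values]

# RobustScaler-style: (x - median) / IQR
median = statistics.median(values)
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
robust = [(x - median) / (q3 - q1) for x in values]

print([round(v, 2) for v in standard])
print([round(v, 2) for v in robust])
```

With mean/std scaling, the outlier inflates the standard deviation and the four ordinary values end up squeezed into a narrow band; with median/IQR scaling they stay evenly spread.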

4.2 Encoding Categorical Variables

Method | When to Use | Output
OrdinalEncoder | Ordinal categories (low/med/high). Tree-based models. | Integer per category
OneHotEncoder | Nominal categories (<15 unique values). Linear models, SVM, KNN. | Binary column per category
TargetEncoder | High-cardinality categoricals (100+ values). Gradient boosting models. | Float (mean target per category)
LabelEncoder | Target variable encoding only. Not for features. | Integer
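from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

# Best practice: use ColumnTransformer for mixed types
# numerical_cols / categorical_cols: lists of column names in X_train
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numerical_cols),
    # drop='first' is omitted: combined with handle_unknown='ignore', an unseen
    # category would encode exactly like the dropped one
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
])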

X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

4.3 Handling Missing Values

from sklearn.impute import SimpleImputer, KNNImputer

# Strategy: 'mean', 'median', 'most_frequent', 'constant'
imputer = SimpleImputer(strategy='median')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# For more sophisticated imputation
knn_imputer = KNNImputer(n_neighbors=5)
X_train_imputed = knn_imputer.fit_transform(X_train)

# When to use what:
# - median: numeric features with outliers
# - mean: numeric features, roughly normally distributed
# - most_frequent: categorical features
# - KNNImputer: when a missing value can be estimated from similar rows (correlated features)
# - Consider a binary "is_missing" indicator (add_indicator=True) when missingness itself may be informative
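The binary "is_missing" indicator does not have to be built by hand: SimpleImputer can append it via add_indicator=True. A minimal sketch on a tiny hypothetical matrix:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Tiny hypothetical matrix with one missing value per column
X_demo = np.array([[1.0, np.nan],
                   [2.0, 4.0],
                   [np.nan, 6.0]])

imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X_demo)

# First two columns: imputed features; last two: 1.0 where a value was missing
print(X_out)
```

Indicator columns are only created for features that had missing values during fit, so the output width depends on the training data.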

4.4 Feature Selection

Reducing the number of features can improve model performance, reduce overfitting, and speed up training. Three common approaches: filter, wrapper, and embedded methods.

from sklearn.feature_selection import (
    SelectKBest, f_classif, mutual_info_classif,
    RFE, SequentialFeatureSelector
)

# Method 1: Filter — fast, univariate
selector = SelectKBest(score_func=f_classif, k=20)
X_selected = selector.fit_transform(X_train, y_train)

# Method 2: Wrapper — uses model performance
from sklearn.ensemble import RandomForestClassifier
selector = RFE(RandomForestClassifier(n_estimators=100), n_features_to_select=15)
X_selected = selector.fit_transform(X_train, y_train)

# Method 3: Embedded: L1 regularization (built into the model)
# Lasso assumes a numeric target and scaled features
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel
lasso = Lasso(alpha=0.01)
selector = SelectFromModel(lasso)
X_selected = selector.fit_transform(X_train, y_train)

# Check which features were selected
selected_mask = selector.get_support()
selected_features = [f for f, s in zip(feature_names, selected_mask) if s]

5. Model Selection & Tuning

You have picked an algorithm and preprocessed your data. Now you need to validate that your model generalizes and tune its hyperparameters for best performance. This section covers the essential techniques.

5.1 Cross-Validation

Never evaluate on training data. Never rely on a single train/test split. Cross-validation gives a reliable estimate of generalization performance.

from sklearn.model_selection import (
    cross_val_score, StratifiedKFold, TimeSeriesSplit
)

# Standard: 5-fold stratified cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='f1_weighted')
print(f"F1: {scores.mean():.4f} +/- {scores.std():.4f}")

# Time series: NEVER shuffle, use temporal split
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X_train, y_train, cv=tscv, scoring='neg_mean_squared_error')

# Common scoring values:
# Classification: 'accuracy', 'f1', 'f1_weighted', 'roc_auc', 'precision', 'recall'
# Regression: 'neg_mean_squared_error', 'neg_mean_absolute_error', 'r2'
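To see what a temporal split looks like, this sketch prints the indices TimeSeriesSplit produces for 12 time-ordered observations (the array is a stand-in for real data).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(12).reshape(-1, 1)  # 12 observations in time order

splits = list(TimeSeriesSplit(n_splits=3).split(X_demo))
for train_idx, test_idx in splits:
    print(f"train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each training window ends before its test window begins, and the training window grows from one split to the next, so no fold ever trains on observations from its own future.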

5.2 Hyperparameter Tuning

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import xgboost as xgb

# Grid Search: exhaustive, use when few parameters
param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [5, 10, 15, None],
    'min_samples_leaf': [1, 5, 10]
}
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=5, scoring='f1_weighted', n_jobs=-1
)
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}")
print(f"Best CV F1:  {grid.best_score_:.4f}")

# Randomized Search: faster, use when many parameters
from scipy.stats import randint, uniform
param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(3, 20),
    'learning_rate': uniform(0.01, 0.3),
    'subsample': uniform(0.6, 0.4)
}
search = RandomizedSearchCV(
    xgb.XGBClassifier(), param_dist,
    n_iter=100, cv=5, scoring='f1_weighted',
    random_state=42, n_jobs=-1
)
search.fit(X_train, y_train)

Tip: For serious tuning, consider Optuna or Hyperopt instead of GridSearchCV. They use Bayesian-style optimization to find good hyperparameters in fewer trials, and Optuna's pruning can cut search time substantially by stopping unpromising trials early.

5.3 Learning Curves

Learning curves diagnose whether your model suffers from high bias (underfitting) or high variance (overfitting), and whether more data would help.

from sklearn.model_selection import learning_curve
import numpy as np

train_sizes, train_scores, val_scores = learning_curve(
    model, X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring='f1_weighted', n_jobs=-1
)

# Interpretation:
# Gap between train and val = overfitting (high variance)
#   Fix: more data, regularization, simpler model
# Both scores low = underfitting (high bias)
#   Fix: more features, complex model, less regularization
# Val score plateaus = more data won't help
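The returned score arrays have one row per training size and one column per fold, so the diagnosis usually starts by averaging over folds. A sketch, assuming synthetic noisy data and an unconstrained decision tree chosen because it overfits visibly:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, flip_y=0.1,
                           random_state=0)

sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring="accuracy")

train_mean = train_scores.mean(axis=1)  # average over the 5 folds
val_mean = val_scores.mean(axis=1)
gap = train_mean - val_mean             # large gap suggests high variance

for n, tr, va, g in zip(sizes, train_mean, val_mean, gap):
    print(f"n={int(n):4d} train={tr:.3f} val={va:.3f} gap={g:.3f}")
```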

5.4 Complete Pipeline

Always use sklearn Pipelines to chain preprocessing and modeling. This prevents data leakage, simplifies deployment, and makes cross-validation correct.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Define preprocessing for different column types
numeric_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler())
])

categorical_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', numeric_pipe, numerical_cols),
    ('cat', categorical_pipe, categorical_cols)
])

# Full pipeline: preprocess + model
pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('model', RandomForestClassifier(n_estimators=200, random_state=42))
])

# Now cross-validation is fully correct
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='f1_weighted')

# Train final model
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

6. Common Pitfalls

These are the mistakes that silently destroy model performance. Many of them produce artificially high scores during development and serious failures in production. Learn to spot and prevent them.

Data Leakage

The number one killer of ML projects. Data leakage occurs when information from the test set bleeds into training. Common causes: (1) Fitting a scaler or imputer on the entire dataset before splitting. (2) Including features derived from the target (e.g., customer lifetime value to predict churn). (3) Time-series data split randomly instead of temporally. (4) Duplicate records spanning train and test. Prevention: Always split first, then preprocess. Use sklearn Pipelines. For time series, use TimeSeriesSplit. Check for duplicates with df.duplicated().
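Leakage through preprocessing can be demonstrated on data that contains no signal at all. In the sketch below (an illustrative setup, with pure-noise features and arbitrary labels as stated assumptions), selecting features on the full dataset before cross-validation produces a score that looks far better than chance, while the Pipeline version does not.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5000))   # pure noise features
y = np.array([0, 1] * 25)         # labels unrelated to X

# Leaky: features are chosen using all labels, including future test folds
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000),
                        X_leaky, y, cv=5).mean()

# Correct: selection is refit inside every training fold
pipe = make_pipeline(SelectKBest(f_classif, k=20),
                     LogisticRegression(max_iter=1000))
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {leaky:.2f}")
print(f"honest CV accuracy: {honest:.2f}")
```

Since the labels are unrelated to the features, any accuracy well above 0.5 in the leaky variant is an artifact of the selection step having seen the test folds.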

Class Imbalance

When one class dominates (99% negative, 1% positive), a model that always predicts "negative" gets 99% accuracy. Solutions: (1) Use appropriate metrics: F1, AUC-ROC, precision-recall curves, not accuracy. (2) Use class_weight='balanced' in sklearn models. (3) Resample the training data only: SMOTE for oversampling the minority class, random undersampling for the majority class. (4) Stratified cross-validation with StratifiedKFold. (5) Adjust the decision threshold based on the precision-recall tradeoff.

# Handling class imbalance
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight
import numpy as np
import xgboost as xgb

# Option 1: built-in class weights
model = RandomForestClassifier(class_weight='balanced')

# Option 2: manual weights
weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
model = xgb.XGBClassifier(scale_pos_weight=weights[1]/weights[0])

# Option 3: SMOTE oversampling
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Option 4: threshold tuning (the model must already be fitted)
y_prob = model.predict_proba(X_test)[:, 1]
threshold = 0.3  # lower threshold to catch more positives; choose it on a validation set
y_pred_adjusted = (y_prob >= threshold).astype(int)

Overfitting

Training score is high, test score is low. The model memorized noise instead of learning patterns. Signs: Large gap between train and validation scores. Model performance degrades on new data. Solutions: (1) Get more training data. (2) Reduce model complexity (lower max_depth, fewer features, fewer layers). (3) Add regularization (L1/L2 penalties, dropout). (4) Use cross-validation to detect early. (5) Apply early stopping (boosting, neural networks). (6) Use data augmentation.

Not Scaling Features Before Distance-Based Algorithms

SVM, KNN, and K-Means depend on distances between data points; PCA depends on feature variances, and neural networks train poorly with gradient-based optimizers when feature scales differ widely. If one feature ranges from 0 to 1000 while another ranges from 0 to 1, the first feature will dominate. Always scale features before using these algorithms. Tree-based models (Random Forest, XGBoost) do not require scaling.

Using Accuracy for Multiclass Problems

Accuracy counts every sample equally, so frequent classes dominate the score and rare classes that may be more important or harder to predict are hidden. Use average='macro' for F1/precision/recall to treat all classes as equally important regardless of frequency, or average='weighted' to weight each class by its frequency. Always print the full classification_report to inspect per-class performance.
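The difference between the two averaging modes is easiest to see with numbers. The per-class F1 scores and supports below are hypothetical.

```python
# Hypothetical per-class F1 scores and supports (sample counts)
per_class = {
    "A": (0.95, 900),
    "B": (0.50, 80),
    "C": (0.20, 20),
}

total = sum(support for _, support in per_class.values())
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)
weighted_f1 = sum(f1 * support for f1, support in per_class.values()) / total

print(f"macro F1 = {macro_f1:.3f}, weighted F1 = {weighted_f1:.3f}")
# macro F1 = 0.550, weighted F1 = 0.899
```

The weighted average is dominated by the large class A and hides the poor results on B and C; the macro average exposes them.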

Ignoring Feature Correlations

Highly correlated features (multicollinearity) distort feature importance (credit is split or assigned arbitrarily among the correlated features), make coefficients unstable in linear models, and waste computation. Detection: Correlation matrix (df.corr()), VIF (Variance Inflation Factor). Solutions: Drop one feature from each correlated pair (a common threshold is |r| > 0.9), use PCA, use L1/L2 regularization, or use tree-based models (more robust to correlation).
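Dropping one feature from each correlated pair can be sketched with NumPy alone. The three synthetic columns and the rule of dropping the later column of a pair are assumptions for illustration; with a pandas DataFrame the same logic applies to df.corr().

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
x1 = 2.0 * x0 + 0.01 * rng.normal(size=200)  # near-duplicate of x0
x2 = rng.normal(size=200)                     # independent feature
X_demo = np.column_stack([x0, x1, x2])

corr = np.abs(np.corrcoef(X_demo, rowvar=False))

# Walk the upper triangle; drop the later feature of each highly correlated pair
threshold = 0.9
to_drop = set()
n_features = corr.shape[0]
for i in range(n_features):
    for j in range(i + 1, n_features):
        if i not in to_drop and corr[i, j] > threshold:
            to_drop.add(j)

kept = [k for k in range(n_features) if k not in to_drop]
print(f"dropped columns: {sorted(to_drop)}, kept columns: {kept}")
```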

Training on Future Data (Time Series)

When working with time-series data, you must split temporally. Random shuffling creates look-ahead bias where your model uses future information to predict the past. Use TimeSeriesSplit for cross-validation. Never use features that incorporate future values (e.g., a rolling average that includes future dates).
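A rolling feature stays leak-free when its window covers only observations strictly before the current time step. The helper below is a hypothetical, standard-library sketch of that rule.

```python
def past_rolling_mean(values, window):
    """Mean of the `window` observations strictly before each time step."""
    out = []
    for t in range(len(values)):
        if t < window:
            out.append(None)  # not enough history yet
        else:
            past = values[t - window:t]  # excludes values[t] and anything later
            out.append(sum(past) / window)
    return out

series = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical observations in time order
features = past_rolling_mean(series, window=3)
print(features)  # [None, None, None, 2.0, 3.0]
```

A centered window, by contrast, would average in values from after the time step being predicted, which is exactly the look-ahead bias described above.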

7. Frequently Asked Questions

Which machine learning algorithm should I use?

It depends on your data type and goal. For tabular data classification, start with Gradient Boosting (XGBoost/LightGBM). For regression, try Linear Regression first, then Random Forest. For clustering, use K-Means for spherical clusters or DBSCAN for arbitrary shapes. For images, use Convolutional Neural Networks. For text, use transformer-based models or TF-IDF with Logistic Regression. Use the Algorithm Selector Tool above to get personalized recommendations.

What is the difference between precision and recall?

Precision measures how many of your positive predictions were correct (TP / (TP + FP)). Recall measures how many actual positives you found (TP / (TP + FN)). Use precision when false positives are costly (spam detection). Use recall when false negatives are costly (disease screening). F1 score is the harmonic mean of both, giving a balanced single number.

How do I prevent overfitting in machine learning?

Use cross-validation to detect overfitting early. Apply regularization (L1/L2). Reduce model complexity (fewer features, shallower trees). Increase training data or use data augmentation. Use dropout for neural networks. Apply early stopping during training. Ensure your validation set is representative of production data.

What is data leakage in machine learning?

Data leakage occurs when information from outside the training dataset is used to create the model. Common causes: using test data during preprocessing (fit_transform on the whole dataset instead of fit on train only), including features derived from the target variable, and using future data to predict past events. Always split data before any preprocessing steps and use sklearn Pipelines.

When should I use Random Forest vs Gradient Boosting?

Random Forest trains trees independently in parallel, making it faster and more robust to hyperparameter choices. Gradient Boosting trains trees sequentially, correcting previous errors, and usually achieves higher accuracy but is slower and more prone to overfitting. Start with Random Forest for baselines and switch to Gradient Boosting (XGBoost, LightGBM) when you need maximum performance.
