Q: How do I encode categorical variables for machine learning?

One-hot encoding creates one binary column per category and works for low-cardinality features (<20 categories). Label encoding assigns integers and is suitable for ordinal categories, provided the integers follow the category order. Target encoding replaces each category with the mean target value and handles high cardinality well, but it risks target leakage unless the means are computed out-of-fold (use cross-validation). Frequency encoding replaces each category with its occurrence count. For tree models, label encoding often suffices. For linear models, one-hot encoding is usually necessary.

Feature Engineering Checklist

Q: What is feature engineering in machine learning?

Feature engineering is the process of creating, transforming, and selecting input features to improve model performance. It includes scaling numeric features, encoding categorical variables, extracting information from text and dates, handling missing data, and creating interaction features. Good feature engineering often matters more than algorithm choice and can dramatically improve model accuracy.

Q: Should I normalize or standardize my features?

Standardization (zero mean, unit variance via StandardScaler) is a sensible default: it is less distorted by outliers than min-max scaling, and scale-sensitive algorithms (SVM, regularized logistic regression, PCA) generally need it to work well. Normalization (scaling to the 0-1 range via MinMaxScaler) is useful when you need bounded values (neural networks, image data) or when the distribution is not Gaussian. Tree-based models (Random Forest, XGBoost) do not require scaling.
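As a rough illustration of the two scalers, here is a minimal standard-library sketch. The function names and the sample incomes are made up for this example; in practice StandardScaler and MinMaxScaler do the same arithmetic and also remember the fitted statistics for new data.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score scaling: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (divides by n)
    return [(v - mu) / sigma for v in values]

def min_max_scale(values):
    """Rescale so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical sample with one large outlier.
incomes = [30.0, 35.0, 40.0, 45.0, 500.0]

z_scores = standardize(incomes)      # mean 0, standard deviation 1
unit_range = min_max_scale(incomes)  # bounded to [0, 1]
```

With this sample, the four ordinary values land below 0.04 after min-max scaling because the outlier defines the top of the range. Both functions assume the column is not constant; a zero spread would divide by zero.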

Q: How should I handle missing data?

The best approach depends on the missing data mechanism. For MCAR (Missing Completely At Random): mean/median imputation is usually adequate. For MAR (Missing At Random): use model-based imputation (KNN, iterative imputer). For MNAR (Missing Not At Random): the missingness itself is informative, so create a binary indicator column. Adding a missing indicator feature is a cheap safeguard with any imputation method, because missingness itself can be predictive.

Q: What are the most impactful feature engineering techniques?

The highest-impact techniques are: target encoding for high-cardinality categoricals, interaction features between related columns, aggregation features (group-by statistics), time-based features (recency, frequency, cyclical encoding), and domain-specific transformations (log for skewed distributions, ratios between related features). Feature selection via mutual information or recursive feature elimination removes noise and improves generalization.

Why Feature Engineering Matters

Feature engineering is often the difference between a mediocre model and a great one. While algorithms like gradient boosting and neural networks are powerful, they can only work with the information provided to them. Transforming raw data into informative features allows models to learn patterns more efficiently, reduces overfitting, and improves generalization to unseen data.

Practitioners often find that feature engineering has a larger impact on model performance than algorithm selection or hyperparameter tuning, particularly on tabular data. A well-engineered feature set with logistic regression can outperform a poorly prepared dataset fed to a complex deep learning model. The saying "garbage in, garbage out" applies doubly in machine learning.

Numeric Feature Engineering

Numeric features are the most straightforward to work with, but several transformations can significantly improve model performance. Standardization (z-score normalization) rescales features to zero mean and unit variance, which is critical for distance-based algorithms (SVM, KNN), gradient-based optimization (neural networks), and regularized models (Ridge, Lasso). Log transformation compresses right-skewed distributions (income, prices, counts), making them more symmetric and reducing the influence of outliers.
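The effect of a log transform on a right-skewed column can be sketched as follows. The prices are invented for illustration, and log1p (the log of 1 + x) is one common choice because it tolerates zeros; it is not the only option.

```python
import math
from statistics import mean, median

# Hypothetical right-skewed sample: most prices are small, one is huge.
prices = [12.0, 15.0, 18.0, 22.0, 30.0, 45.0, 900.0]

# log1p(x) = log(1 + x), defined at x = 0 and invertible with expm1.
log_prices = [math.log1p(p) for p in prices]

# A mean far above the median is a rough sign of right skew.
raw_ratio = mean(prices) / median(prices)
log_ratio = mean(log_prices) / median(log_prices)

# The transform is reversible, so predictions can be mapped back.
recovered = [math.expm1(v) for v in log_prices]
```

On this sample the mean-to-median ratio drops from well above 6 to close to 1, which is the compression the paragraph above describes. The sketch assumes non-negative inputs.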

Binning converts continuous features into discrete intervals. Equal-width bins are simple but can create empty bins with skewed data. Equal-frequency (quantile) bins ensure each bin has the same number of samples. Custom bins based on domain knowledge (e.g., age groups) can capture meaningful thresholds. Polynomial features create interaction and power terms (x1*x2, x1^2) that help linear models capture nonlinear relationships, though they increase dimensionality rapidly.
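The difference between equal-width and equal-frequency bins shows up clearly on skewed data. This is a minimal sketch with made-up ages and hypothetical helper names; ties in the quantile version are broken by input order, which a library implementation would handle more carefully.

```python
def equal_width_bins(values, n_bins):
    """Split the range [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Clamp so the maximum value falls into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def quantile_bins(values, n_bins):
    """Assign bins by rank so each bin holds about the same number of rows."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

# Hypothetical skewed sample: seven small values and one large one.
ages = [1, 2, 3, 4, 5, 6, 7, 100]

width_bins = equal_width_bins(ages, 4)  # [0, 0, 0, 0, 0, 0, 0, 3]
freq_bins = quantile_bins(ages, 4)      # [0, 0, 1, 1, 2, 2, 3, 3]
```

Equal-width binning leaves bins 1 and 2 empty here, while quantile binning puts two rows in every bin.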

Categorical Feature Engineering

Categorical features require encoding into numeric representations. The choice of encoding method significantly affects model performance. One-hot encoding creates a binary column per category and is the safest default for low-cardinality features (<20 categories). It does not impose any ordinality but can create very high-dimensional feature spaces for high-cardinality columns.
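One-hot encoding can be sketched in a few lines. The helper names and colors below are illustrative only; the key design point is that the category list is learned from training data and then reused, so a category never seen in training encodes as all zeros.

```python
def fit_categories(train_values):
    """Learn the column order from the training data only."""
    return sorted(set(train_values))

def one_hot(values, categories):
    """One binary column per known category; unseen values become all zeros."""
    return [[1 if v == c else 0 for c in categories] for v in values]

colors_train = ["red", "green", "blue", "green"]
categories = fit_categories(colors_train)  # ['blue', 'green', 'red']

encoded_train = one_hot(colors_train, categories)
encoded_new = one_hot(["purple"], categories)  # category not seen in training
```

The number of output columns equals the number of distinct categories, which is why this approach becomes unwieldy for high-cardinality columns.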

Target encoding replaces each category with the mean of the target variable for that category. It is extremely effective for high-cardinality features (zip codes, product IDs) but must use cross-validation to prevent target leakage. A regularized variant blends the category mean with the global mean based on sample size.
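A minimal sketch of out-of-fold target encoding with the regularized blend might look like this. Everything here is an assumption made for illustration: the function names, the smoothing formula (sum + m * global mean) / (count + m), and the interleaved fold assignment, which presumes the rows are already shuffled.

```python
def smoothed_means(categories, targets, global_mean, smoothing):
    """Per-category mean pulled toward the global mean for small categories."""
    sums, counts = {}, {}
    for c, y in zip(categories, targets):
        sums[c] = sums.get(c, 0.0) + y
        counts[c] = counts.get(c, 0) + 1
    return {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in sums
    }

def target_encode_oof(categories, targets, n_folds=2, smoothing=1.0):
    """Encode each row using statistics from the other folds only."""
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        train_idx = [i for i in range(n) if i % n_folds != fold]
        held_out = [i for i in range(n) if i % n_folds == fold]
        train_targets = [targets[i] for i in train_idx]
        global_mean = sum(train_targets) / len(train_targets)
        means = smoothed_means(
            [categories[i] for i in train_idx], train_targets,
            global_mean, smoothing,
        )
        for i in held_out:
            # Categories absent from the training folds fall back to the global mean.
            encoded[i] = means.get(categories[i], global_mean)
    return encoded

# Hypothetical data: a zip-code-like column and a binary target.
zips = ["A", "A", "A", "A", "B", "B"]
bought = [1, 0, 1, 1, 0, 1]
encoded = target_encode_oof(zips, bought, n_folds=2, smoothing=1.0)
```

Because each row's own target never contributes to its encoded value, the feature cannot simply memorize the label, which is the leakage the paragraph above warns about.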

Frequency encoding replaces each category with its occurrence count or proportion in the training set. It captures the intuition that rare categories behave differently from common ones. Ordinal encoding assigns integers based on a natural ordering (low/medium/high, education levels) and preserves rank information that one-hot encoding destroys.
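Both encodings reduce to a lookup table, as this sketch suggests. The browser names, the fallback of 0.0 for unseen categories, and the SIZE_ORDER mapping are illustrative assumptions rather than fixed conventions.

```python
from collections import Counter

def fit_frequency(train_values):
    """Map each category to its share of the training rows."""
    counts = Counter(train_values)
    total = len(train_values)
    return {category: n / total for category, n in counts.items()}

browsers_train = ["chrome", "chrome", "chrome", "firefox", "safari"]
freq_map = fit_frequency(browsers_train)

# Categories never seen in training get 0.0 here (an assumed fallback).
browsers_new = ["chrome", "safari", "opera"]
freq_encoded = [freq_map.get(b, 0.0) for b in browsers_new]

# Ordinal encoding: the integers must follow the natural order.
SIZE_ORDER = {"low": 0, "medium": 1, "high": 2}
ordinal_encoded = [SIZE_ORDER[s] for s in ["medium", "low", "high"]]
```

Fitting the frequency map on the training set only keeps test-set information out of the feature.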

Text Feature Engineering

Text data requires the most transformation to become useful features. A bag-of-words model with TF-IDF weighting represents documents as sparse vectors of term frequencies, weighted by inverse document frequency to downweight words that appear in most documents. It is simple, interpretable, and works surprisingly well as a baseline. Word embeddings (Word2Vec, GloVe, FastText) represent words as dense vectors that capture semantic relationships. Document embeddings can be created by averaging word vectors.
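To make the TF-IDF weighting concrete, here is a sketch of the plain textbook variant: term frequency times log(N / document frequency). The tokenizer (lowercase, split on whitespace) and the sample documents are assumptions for this example, and library implementations typically add smoothing and normalization on top of this.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

docs = ["the cat sat", "the dog barked", "the cat purred"]
vectors = tfidf(docs)
```

In the first document, "the" appears in every document and gets weight zero, "cat" appears in two of three and gets a small weight, and "sat" is unique to that document and gets the largest weight.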

Modern NLP typically uses transformer embeddings (BERT, sentence-transformers) that capture contextualized semantics. For feature engineering specifically, useful text-derived features include: text length, word count, character count, average word length, punctuation count, uppercase ratio, named entity counts, sentiment scores, and topic probabilities from LDA.
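Several of the simpler text-derived features listed above need nothing beyond string operations. This sketch uses a hypothetical helper name and a made-up review string; named entities, sentiment, and topic probabilities require dedicated models and are not shown.

```python
import string

def text_stats(text):
    """Cheap surface-level features computed from raw text."""
    words = text.split()
    letters = [ch for ch in text if ch.isalpha()]
    return {
        "char_count": len(text),
        "word_count": len(words),
        # Word length here includes any punctuation attached to the word.
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "uppercase_ratio": (
            sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
        ),
    }

stats = text_stats("WOW, this is GREAT!!!")
```

Features like the uppercase ratio and punctuation count can carry signal for tasks such as spam or sentiment detection, where emphasis matters.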

Date/Time Feature Engineering

Temporal features encode rich information that simple timestamps obscure. Extract cyclical components (hour of day, day of week, month, quarter) using sine/cosine encoding to preserve the circular nature: sin(2*pi*hour/24) and cos(2*pi*hour/24). This ensures that hour 23 is close to hour 0, which simple integer encoding misses.
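The sine/cosine formulas above translate directly into code. In this sketch the helper names are made up, and the distance function is only there to check that neighboring hours end up close together on the circle.

```python
import math

def cyclical_encode(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

midnight = cyclical_encode(0, 24)
one_am = cyclical_encode(1, 24)
noon = cyclical_encode(12, 24)
eleven_pm = cyclical_encode(23, 24)

# 23:00 is as close to 00:00 as 01:00 is, and far closer than 12:00.
gap_late = distance(eleven_pm, midnight)
gap_early = distance(one_am, midnight)
gap_noon = distance(noon, midnight)
```

The same function works for other cycles by changing the period, for example 7 for day of week or 12 for month.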

Recency features measure time elapsed since an event (days since last purchase, hours since registration). Lag features for time series capture previous values (value_t-1, value_t-7). Rolling statistics (rolling mean, std, min, max over a window) capture trends and volatility. Is-holiday and is-weekend binary flags capture behavioral patterns linked to calendar events.
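Lag and rolling features can be sketched on a plain list. The function names and sales figures are illustrative, and the lag helper assumes 0 <= k <= len(series). Positions without enough history are marked None rather than filled.

```python
def lag(series, k):
    """Value from k steps earlier; the first k positions have no history."""
    return [None] * k + series[:len(series) - k]

def rolling_mean(series, window):
    """Mean of the current value and the window - 1 values before it."""
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            chunk = series[i + 1 - window:i + 1]
            out.append(sum(chunk) / window)
    return out

# Hypothetical daily sales.
sales = [10, 12, 9, 15, 20]

lag_1 = lag(sales, 1)            # [None, 10, 12, 9, 15]
roll_3 = rolling_mean(sales, 3)  # None, None, then three 3-day means
```

When the series itself is the prediction target, computing the rolling mean on the lagged series rather than the raw one keeps the current value out of its own feature.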

Missing Data Strategies

Missing data is ubiquitous in real-world datasets, and handling it correctly is crucial. First, understand the missing mechanism: MCAR (Missing Completely At Random) means missingness is unrelated to any data. MAR (Missing At Random) means missingness depends on observed data. MNAR (Missing Not At Random) means missingness depends on the missing value itself.

For MCAR, simple imputation (mean, median, mode) works. For MAR, model-based imputation (KNN imputer, iterative imputer/MICE) uses other features to predict missing values. For MNAR, the missingness itself is informative, so always create a missing indicator binary column alongside any imputation. Tree-based models can often learn to split on missingness directly. Never drop rows or columns without understanding the impact on your dataset.
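Simple imputation combined with a missing indicator can be sketched like this. The names and ages are made up, None stands in for a missing value, and the median is learned from training data only so the same fill value can be reused on new data.

```python
from statistics import median

def fit_median(train_values):
    """Median of the observed (non-missing) training values."""
    return median(v for v in train_values if v is not None)

def impute_with_indicator(values, fill_value):
    """Fill missing entries and record where they were."""
    imputed = [fill_value if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing

ages_train = [25, None, 40, 31, None]
fill = fit_median(ages_train)  # 31

ages_imputed, ages_missing = impute_with_indicator(ages_train, fill)
```

The indicator column preserves the fact that a value was absent, so a model can still use the missingness even after the gap has been filled.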

Feature Selection Methods

After engineering features, selecting the most relevant ones reduces noise, prevents overfitting, and speeds up training. Three approaches exist:

Filter methods: Rank features by statistical measures (correlation, mutual information, chi-squared, ANOVA F-statistic). Fast and model-agnostic, but ignore feature interactions.
Wrapper methods: Evaluate feature subsets using the model itself (recursive feature elimination, forward/backward selection). Usually more accurate but computationally expensive. RFE with cross-validation is a widely used, robust choice.
Embedded methods: Feature selection happens during model training. L1 regularization drives irrelevant weights to zero. Tree-based feature importance ranks features by their split contribution. SHAP values, computed after training, often give more consistent importance estimates than split-based importance.

A practical workflow: remove zero-variance features first, then use correlation analysis to handle multicollinearity, then apply mutual information for initial ranking, and finally use RFE or SHAP for final selection.
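The first two steps of this workflow (zero-variance removal, then a correlation filter) can be sketched without any libraries. The column names, the data, and the 0.95 threshold are assumptions for illustration; mutual information ranking and RFE or SHAP need a model or a library and are not shown.

```python
from statistics import mean, pvariance

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length columns."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def filter_features(columns, corr_threshold=0.95):
    """Drop constant columns, then the later column of each correlated pair."""
    # Step 1: a column with zero variance carries no information.
    varying = {name: col for name, col in columns.items() if pvariance(col) > 0}
    # Step 2: keep a column only if it is not highly correlated
    # with any column that has already been kept.
    selected = []
    for name, col in varying.items():
        if all(abs(pearson(col, varying[s])) < corr_threshold for s in selected):
            selected.append(name)
    return selected

# Hypothetical feature table stored as {column name: values}.
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],  # near-duplicate of height_cm
    "constant": [1.0, 1.0, 1.0, 1.0],       # zero variance
    "weight_kg": [70.0, 55.0, 90.0, 62.0],
}
selected = filter_features(columns)
```

Which column of a correlated pair survives depends on column order in this sketch; in practice you might keep the one that is easier to interpret or has fewer missing values.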

About the Author

Michael Lip builds open-source ML tools and developer utilities at zovo.one. ml0x is part of the Zovo Tools network, a collection of free, privacy-first tools for developers and data scientists. No tracking, no accounts required, no data leaves your browser.

Last updated: May 25, 2026