How do I handle mixed feature types (numeric and categorical) in a pipeline?

Use sklearn's ColumnTransformer to apply different preprocessing to different column types. Define a numeric pipeline (SimpleImputer + StandardScaler) and a categorical pipeline (SimpleImputer with most_frequent + OneHotEncoder), then combine them in a ColumnTransformer. Wrap the ColumnTransformer and your model in a Pipeline. This ensures each feature type gets appropriate preprocessing, and the entire workflow is fit-transform safe for cross-validation.

ML Pipeline Builder

Design complete machine learning pipelines visually. Add preprocessing, feature engineering, model selection, and evaluation steps, then generate production-ready Python code for scikit-learn or PyTorch. The pipeline builder validates step ordering, checks for common mistakes, and produces clean, documented code you can copy directly into your project. All computation runs locally in your browser.

What Is an ML Pipeline?

A machine learning pipeline is a sequence of data processing and modeling steps organized into a single, reproducible workflow. Instead of writing separate scripts for data cleaning, feature engineering, model training, and evaluation, a pipeline chains these steps together so that data flows through each transformation automatically. This is not just a convenience but a necessity for correct machine learning practice, because pipelines prevent data leakage, ensure reproducibility, and simplify deployment.

In scikit-learn, the Pipeline class is the standard implementation. A Pipeline takes a list of (name, step) tuples, where each step except the last must be a transformer (implementing fit and transform), and the last step can be either a transformer or an estimator (implementing fit and predict). When you call pipeline.fit(X_train, y_train), each transformer is fit and then transforms the data before passing it to the next step. When you call pipeline.predict(X_new), transformers apply their learned transformations (without refitting) and the final estimator makes predictions.
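The fit and predict flow described above can be sketched as follows. This is a minimal illustration, assuming synthetic data from make_classification and a two-step pipeline; the step names "scaler" and "classifier" are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A list of (name, step) tuples: one transformer, then the final estimator.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# fit: the scaler is fit on X_train, transforms it, then the classifier is fit.
pipeline.fit(X_train, y_train)

# predict: the scaler reuses its learned statistics (no refit), then the
# classifier predicts on the transformed data.
predictions = pipeline.predict(X_test)
```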

This architecture is powerful because it guarantees that preprocessing learned from training data is applied identically to test data and new data in production. Without a pipeline, it is alarmingly easy to introduce subtle bugs: fitting a scaler on the full dataset before splitting (data leakage), forgetting to apply the same encoding to production data, or using different imputation strategies in training and inference. A pipeline eliminates these risks by encapsulating the entire workflow in a single object.
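One way to check the leakage guarantee is to inspect what the scaler inside a fitted pipeline actually learned. The sketch below assumes randomly generated data and a KNeighborsClassifier as a stand-in model; it compares the scaler's stored mean against the training-set mean and the full-dataset mean.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Randomly generated features and labels (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(100, 2))
y = (X[:, 0] > 10.0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsClassifier()),
])
pipe.fit(X_train, y_train)

# The scaler's statistics come from the training rows only, so the test
# rows had no influence on preprocessing.
fitted_mean = pipe.named_steps["scaler"].mean_
train_only = np.allclose(fitted_mean, X_train.mean(axis=0))
uses_full_data = np.allclose(fitted_mean, X.mean(axis=0))
```

Fitting StandardScaler on X before the split would instead store the full-dataset mean, which is the leakage pattern the paragraph warns about.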

Anatomy of a Production Pipeline

A typical production ML pipeline has four phases: data preprocessing, feature engineering, model training, and evaluation. Each phase contains one or more steps that must execute in the correct order.

Data preprocessing handles raw data quality issues. Missing value imputation fills NaN entries with the mean, median, most frequent value, or a user-specified constant. Feature scaling normalizes numeric features to a common range: StandardScaler (zero mean, unit variance) for algorithms sensitive to feature magnitude (SVM, KNN, neural networks, regularized logistic regression), or MinMaxScaler (0-1 range) when you need bounded values. Categorical encoding converts text labels to numbers: OneHotEncoder for nominal features (color, country) and OrdinalEncoder for ordered features (low/medium/high). Outlier handling can also be part of preprocessing; note that steps inside an sklearn Pipeline cannot drop rows, so outliers are usually clipped or transformed in the pipeline, while row removal happens before the data enters it.
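The individual preprocessing transformers can be tried in isolation before they are combined. The following sketch uses tiny hand-written arrays (hypothetical values) to show what each one produces.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import (
    MinMaxScaler,
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler,
)

# Imputation: the NaN is replaced by the column mean of the observed values.
ages = np.array([[20.0], [np.nan], [40.0]])
imputed = SimpleImputer(strategy="mean").fit_transform(ages)

# Scaling: two common choices applied to the same column.
standardized = StandardScaler().fit_transform(imputed)  # zero mean, unit variance
bounded = MinMaxScaler().fit_transform(imputed)         # rescaled to [0, 1]

# Nominal feature: one column per category, in sorted category order.
colors = np.array([["red"], ["green"], ["red"]])
one_hot = OneHotEncoder(handle_unknown="ignore").fit_transform(colors).toarray()

# Ordered feature: the category order is given explicitly so that
# low < medium < high maps to 0 < 1 < 2.
sizes = np.array([["low"], ["high"], ["medium"]])
ordinal = OrdinalEncoder(
    categories=[["low", "medium", "high"]]
).fit_transform(sizes)
```

Passing the category order to OrdinalEncoder matters: without it, categories are sorted alphabetically, which would place "high" before "low".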

Feature engineering creates new features from existing ones to help the model learn better representations. PolynomialFeatures generates interaction terms and polynomial terms (x1*x2, x1^2). Dimensionality reduction via PCA or TruncatedSVD reduces the number of features while preserving variance, which helps with the curse of dimensionality and speeds up training. Feature selection using SelectKBest, SelectFromModel, or recursive feature elimination removes irrelevant or redundant features, reducing overfitting and improving interpretability.
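A short sketch of these three feature engineering steps, assuming a synthetic four-feature classification dataset and f_classif as the scoring function for SelectKBest (both are choices made for this example):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import PolynomialFeatures

X, y = make_classification(
    n_samples=100, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

# Degree-2 expansion without the bias column:
# 4 original + 4 squared + 6 pairwise interaction terms = 14 columns.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Project onto the 2 directions of highest variance.
X_pca = PCA(n_components=2).fit_transform(X)

# Keep the 2 features with the highest ANOVA F-score against the target.
X_best = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)
```

Each of these objects is a transformer, so any of them can be placed as a step inside a Pipeline ahead of the model.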

Model training fits the chosen algorithm to the preprocessed and engineered features. For classification: logistic regression, random forest, gradient boosting (XGBoost, LightGBM), SVM, or neural networks. For regression: linear regression, ridge, lasso, random forest regressor, gradient boosting regressor, or neural networks. The choice of model depends on dataset size, feature types, interpretability requirements, and performance needs.

Evaluation measures model performance on held-out data. For classification: accuracy, precision, recall, F1 score, AUC-ROC, and confusion matrix. For regression: MSE, RMSE, MAE, R-squared, and residual analysis. Cross-validation provides more robust estimates than a single train/test split. Always evaluate on data that was not used for any training or tuning decisions.
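Cross-validated evaluation of a whole pipeline might look like the sketch below. It assumes a synthetic binary classification problem and uses cross_validate so that several metrics are computed in one pass; preprocessing is refit on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# 5-fold cross-validation; each fold fits the scaler and the classifier
# on the training portion and scores on the held-out portion.
scores = cross_validate(
    pipeline, X, y, cv=5, scoring=["accuracy", "f1", "roc_auc"]
)
mean_accuracy = scores["test_accuracy"].mean()
mean_auc = scores["test_roc_auc"].mean()
```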

Handling Mixed Feature Types

Real-world datasets almost always contain a mix of numeric and categorical features. Applying StandardScaler to a column of strings raises an error, and applying OneHotEncoder to a continuous numeric column creates one sparse indicator column per distinct value, which is rarely useful. Scikit-learn's ColumnTransformer solves this by applying different transformations to different column subsets, then concatenating the results.

A typical pattern defines a numeric pipeline (SimpleImputer with mean strategy followed by StandardScaler) and a categorical pipeline (SimpleImputer with most_frequent strategy followed by OneHotEncoder with handle_unknown='ignore'). The ColumnTransformer applies each sub-pipeline to its respective columns. The entire ColumnTransformer is then wrapped in an outer Pipeline with the model as the final step. This nested structure ensures that all preprocessing, including the column-specific transformations, is handled correctly during cross-validation and deployment.
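The nested structure described above could be written roughly as follows. The DataFrame, its column names (age, income, city), and the LogisticRegression model are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values in both column types.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 38],
    "income": [40000, 52000, 61000, np.nan, 83000, 57000],
    "city": ["Paris", "Berlin", np.nan, "Paris", "Rome", "Berlin"],
})
y = [0, 0, 1, 1, 1, 0]

numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Route each sub-pipeline to its own columns; outputs are concatenated.
preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, ["age", "income"]),
    ("cat", categorical_pipeline, ["city"]),
])

# Outer pipeline: preprocessing first, model last.
model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression()),
])
model.fit(df, y)

# "Madrid" was never seen in training; handle_unknown="ignore" encodes it
# as all zeros in the city columns instead of raising an error.
new_rows = pd.DataFrame({"age": [29], "income": [48000], "city": ["Madrid"]})
prediction = model.predict(new_rows)
```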

Identifying which columns are numeric vs categorical can be done manually by specifying column names, or dynamically using make_column_selector(dtype_include=np.number) and make_column_selector(dtype_include=object). Dynamic selection is more robust because it adapts to different datasets without code changes, which is important when the pipeline processes data from multiple sources or when schema changes over time.
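Dynamic column selection can be sketched as below. The column names are hypothetical, and the string column is created with an explicit object dtype as an assumption of this example, so that the dtype_include=object selector matches regardless of how the installed pandas version infers string columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "height_cm": [170.0, 182.5, 165.0],
    "weight_kg": [65.0, 80.0, 58.5],
    # Explicit object dtype (assumption: keeps the example version-independent).
    "country": pd.Series(["FR", "DE", "FR"], dtype=object),
})

numeric_selector = make_column_selector(dtype_include=np.number)
categorical_selector = make_column_selector(dtype_include=object)

# A selector is a callable that returns the matching column names.
numeric_columns = numeric_selector(df)
categorical_columns = categorical_selector(df)

# Selectors can be passed in place of explicit column lists; they are
# evaluated against the DataFrame at fit time.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_selector),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_selector),
])
transformed = preprocessor.fit_transform(df)
```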

Pipeline Best Practices

Always use Pipeline for preprocessing + model: Never fit a scaler outside the pipeline. Fitting preprocessing on the full dataset before splitting is one of the most common sources of data leakage in machine learning projects.
Name your steps meaningfully: Use descriptive names like 'imputer', 'scaler', 'classifier' instead of generic names. This makes it easier to access individual steps and tune hyperparameters with double-underscore syntax (e.g., classifier__max_depth).
Use ColumnTransformer for mixed types: Do not apply numeric transformations to categorical columns or vice versa. ColumnTransformer handles this cleanly.
Set handle_unknown='ignore' on encoders: Production data may contain category values not seen during training. OneHotEncoder with handle_unknown='ignore' gracefully handles this by creating zero vectors for unknown categories.
Use memory caching for expensive transformations: Pipeline(steps, memory='cache_dir') caches fitted transformers to disk, avoiding redundant computation during hyperparameter search.
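Serialize the entire pipeline: Save the fitted pipeline with joblib.dump(pipeline, 'model.joblib'). This captures all preprocessing steps and the trained model in a single file, which simplifies deployment. Load the file with the same scikit-learn version that was used to save it, and only load files from sources you trust.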
Version your pipelines: Use MLflow, DVC, or Weights & Biases to track pipeline configurations, datasets, and performance metrics across experiments.
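Two of the practices above, descriptive step names with double-underscore parameter access and memory caching, can be combined in a hyperparameter search. This is a sketch under assumptions: the data is synthetic, the step names "reducer" and "classifier" are arbitrary, and the cache lives in a temporary directory that is removed afterwards.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Fitted transformers are cached on disk, so a PCA fit on the same fold and
# parameters is reused across different classifier settings.
cache_dir = mkdtemp()
pipeline = Pipeline(
    [
        ("reducer", PCA()),
        ("classifier", RandomForestClassifier(n_estimators=20, random_state=0)),
    ],
    memory=cache_dir,
)

# <step name>__<parameter name> addresses a parameter of a named step.
param_grid = {
    "reducer__n_components": [3, 5],
    "classifier__max_depth": [2, 4],
}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)
best_params = search.best_params_

# Clean up the temporary cache directory.
rmtree(cache_dir, ignore_errors=True)
```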

Common Pipeline Patterns

The Basic Pipeline handles the simplest case: numeric features only, no missing values. It chains StandardScaler and a model. This is appropriate for clean, well-preprocessed datasets where you only need scaling.

The Standard Pipeline adds missing value imputation before scaling. SimpleImputer with mean strategy handles NaN values, then StandardScaler normalizes, then the model trains. This covers most numeric-only datasets.

The Full Pipeline handles real-world messy data with mixed types. A ColumnTransformer routes numeric features through imputation + scaling and categorical features through imputation + encoding. Feature selection or PCA optionally reduces dimensionality. The model trains on the processed features. This is the pattern you should use for production systems.

The Deep Learning Pipeline preprocesses data using sklearn transformers, then feeds the processed features to a PyTorch or TensorFlow model. For tabular data, sklearn preprocessing is often superior to learned embeddings for features that are naturally numeric. For images, text, or sequences, the preprocessing is framework-specific (torchvision transforms, tokenizers). The pipeline builder generates appropriate code for each framework.

Scikit-learn vs PyTorch Pipelines

Scikit-learn pipelines are ideal for tabular data, traditional ML algorithms, and when you need integration with cross-validation and hyperparameter search. The Pipeline API is mature and well-documented, and when a pipeline is passed to cross_val_score or GridSearchCV, its preprocessing is refit on each training fold so that held-out data never influences it. For most tabular ML tasks (classification and regression on structured data), scikit-learn pipelines are the right choice.

PyTorch pipelines are the better fit when your model is a deep neural network or a custom architecture, when you need GPU acceleration, or when working with non-tabular data (images, text, audio, graphs). PyTorch does not have a built-in Pipeline class equivalent, so pipelines are typically implemented as transforms applied inside a Dataset, batched by a DataLoader, with the model defined as an nn.Module. For preprocessing tabular features before feeding to a neural network, it is common to use sklearn preprocessing, convert to tensors, then use a PyTorch DataLoader.
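The sklearn-then-PyTorch handoff for tabular data can be sketched as follows. Everything here is illustrative: the data is random, the network is a small nn.Sequential, and the training loop runs only a few epochs to show the mechanics rather than to reach a good model.

```python
import numpy as np
import torch
from sklearn.preprocessing import StandardScaler
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Random tabular data with a binary target (illustrative only).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(64, 4))
y_train = (X_train[:, 0] > 0).astype(np.float32)

# Step 1: sklearn preprocessing, fit on the training data only.
scaler = StandardScaler().fit(X_train)

# Step 2: convert the processed arrays to float32 tensors.
features = torch.tensor(scaler.transform(X_train), dtype=torch.float32)
targets = torch.tensor(y_train).unsqueeze(1)  # shape (64, 1)

# Step 3: batch with a DataLoader.
loader = DataLoader(TensorDataset(features, targets), batch_size=16, shuffle=True)

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(5):
    for batch_features, batch_targets in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(batch_features), batch_targets)
        loss.backward()
        optimizer.step()
```

At inference time the same fitted scaler must be applied to new data before it is converted to tensors, so the scaler needs to be saved alongside the model weights.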

The pipeline builder generates appropriate code for both frameworks. For sklearn, it produces a Pipeline or Pipeline + ColumnTransformer that can be used directly with cross_val_score, GridSearchCV, and joblib serialization. For PyTorch, it produces a custom Dataset class, DataLoader configuration, nn.Module definition, training loop, and evaluation code.

Deploying ML Pipelines

Deployment is where pipelines truly shine. A fitted sklearn Pipeline is a single Python object that encapsulates the entire prediction workflow: receive raw input, preprocess it exactly as during training, and return predictions. This eliminates the training-serving skew problem where production preprocessing differs from training preprocessing.

The simplest deployment approach is joblib.dump(pipeline, 'model.joblib') to serialize the pipeline, then joblib.load('model.joblib') in a Flask or FastAPI endpoint. The endpoint receives JSON input, converts it to a DataFrame, calls pipeline.predict(df), and returns the result. This approach is usually sufficient for low-to-medium traffic; the throughput it sustains depends on model complexity, payload size, and hardware.
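The save, load, and predict round trip can be sketched without a web framework. The example assumes a hypothetical two-column dataset and a dict that represents what a decoded JSON request body might look like; the web endpoint itself is left out.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Training side: fit and serialize the whole pipeline.
train = pd.DataFrame({
    "age": [22, 35, 47, 58],
    "income": [30000, 52000, 61000, 90000],
})
labels = [0, 0, 1, 1]
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
]).fit(train, labels)

model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(pipeline, model_path)

# Serving side: load once at startup, then reuse for every request.
loaded = joblib.load(model_path)

# Hypothetical decoded JSON payload; wrapping it in a list yields a
# one-row DataFrame with the same column names used during training.
payload = {"age": 41, "income": 58000}
request_df = pd.DataFrame([payload])
result = loaded.predict(request_df).tolist()
```

Converting the prediction with tolist() gives plain Python integers, which can be serialized back to JSON directly.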

For higher scale, use specialized serving frameworks. MLflow provides model packaging, versioning, and serving with a REST API. BentoML offers optimized serving with batching and autoscaling. AWS SageMaker, Google Vertex AI, and Azure ML provide managed endpoints with monitoring, A/B testing, and auto-scaling. For edge deployment, ONNX conversion (skl2onnx) converts sklearn pipelines to a portable format that runs in an optimized runtime; this can reduce inference latency substantially, though the gain depends on the model and batch size.

Regardless of deployment method, always include monitoring: track prediction latency, input data distributions (detect drift), prediction distributions (detect model degradation), and error rates. A pipeline that worked in development can silently degrade in production if the input data distribution shifts.
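As one possible way to detect input drift, a feature's production values can be compared against its training values with a statistical test. The sketch below uses a two-sample Kolmogorov-Smirnov test from SciPy; the test choice, the has_drifted helper, and the alpha threshold are assumptions of this example, not something the text prescribes.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_values = rng.normal(loc=0.0, scale=1.0, size=1000)
# Simulated production data whose mean has shifted by one standard deviation.
production_values = rng.normal(loc=1.0, scale=1.0, size=1000)


def has_drifted(reference, current, alpha=0.01):
    """Return True when the KS test rejects 'both samples share a distribution'."""
    result = ks_2samp(reference, current)
    return bool(result.pvalue < alpha)


drift_detected = has_drifted(training_values, production_values)
```

In practice such a check would run per feature on a schedule, and with large samples even small, harmless shifts become statistically significant, so the threshold needs tuning against what actually affects model quality.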

Testing ML Pipelines

ML pipelines should be tested like any software component. Unit tests verify individual steps: does the imputer handle all-NaN columns? Does the encoder handle unknown categories? Integration tests verify the full pipeline: can it fit, transform, and predict without errors? Performance tests verify model quality: does the pipeline achieve acceptable metrics on a validation set?

A practical test suite for an ML pipeline includes: (1) smoke test with a tiny dataset (10 rows) to verify the pipeline runs end-to-end, (2) schema test to verify input/output shapes match expectations, (3) determinism test to verify the same random seed produces identical results, (4) edge case tests with missing values, empty strings, extreme outliers, and unseen categories, and (5) performance regression test to verify metrics do not degrade below a threshold when retraining on updated data.
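A few of the tests listed above could be written as plain functions with assertions, which a runner such as pytest would discover. The build_pipeline and make_data helpers, the column names, and the RandomForestClassifier are hypothetical and exist only for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline():
    """Hypothetical pipeline under test: mixed-type preprocessing + model."""
    numeric = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ])
    categorical = Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocessor = ColumnTransformer([
        ("num", numeric, ["amount"]),
        ("cat", categorical, ["channel"]),
    ])
    model = RandomForestClassifier(n_estimators=10, random_state=0)
    return Pipeline([("preprocessor", preprocessor), ("classifier", model)])


def make_data(n_rows=10, seed=0):
    """Tiny synthetic dataset for fast tests."""
    rng = np.random.default_rng(seed)
    X = pd.DataFrame({
        "amount": rng.normal(100, 20, n_rows),
        "channel": rng.choice(["web", "store"], n_rows),
    })
    y = np.arange(n_rows) % 2
    return X, y


def test_smoke_and_schema():
    # (1) runs end-to-end on 10 rows; (2) one prediction per input row.
    X, y = make_data()
    predictions = build_pipeline().fit(X, y).predict(X)
    assert predictions.shape == (len(X),)


def test_determinism():
    # (3) same seed and data give identical probabilities.
    X, y = make_data()
    first = build_pipeline().fit(X, y).predict_proba(X)
    second = build_pipeline().fit(X, y).predict_proba(X)
    assert np.array_equal(first, second)


def test_edge_cases():
    # (4) missing value, extreme outlier, and unseen category at predict time.
    X, y = make_data()
    fitted = build_pipeline().fit(X, y)
    edge = pd.DataFrame({"amount": [np.nan, 1e9], "channel": ["phone", "web"]})
    assert fitted.predict(edge).shape == (2,)


# Run directly when not using a test runner.
for test in (test_smoke_and_schema, test_determinism, test_edge_cases):
    test()
```

The performance regression test (5) is omitted here because it needs a real validation set and an agreed metric threshold.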

Frequently Asked Questions

What is an ML pipeline and why should I use one?

An ML pipeline chains preprocessing, feature engineering, and modeling into a single reproducible workflow. It prevents data leakage by ensuring preprocessing is fit only on training data, makes code cleaner and easier to deploy, and works seamlessly with cross-validation and hyperparameter search.

What is the difference between a Transformer and an Estimator in sklearn?

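In scikit-learn's terminology, an estimator is any object that implements fit(). A transformer is an estimator that also implements transform() (e.g., StandardScaler, OneHotEncoder), and a predictor is an estimator that also implements predict() (e.g., RandomForestClassifier). A Pipeline chains transformers followed by a final estimator, which is usually a predictor.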

How do I handle mixed feature types in a pipeline?

Use ColumnTransformer to apply different preprocessing to different column types. Define separate sub-pipelines for numeric features (imputer + scaler) and categorical features (imputer + encoder), then combine them.

What preprocessing steps should every ML pipeline include?

At minimum: missing value imputation, feature scaling (for scale-sensitive algorithms), and categorical encoding. Optional but recommended: feature selection, dimensionality reduction, and outlier handling.

How do I deploy an sklearn Pipeline to production?

Save with joblib.dump(pipeline, 'model.joblib'). In production, load with joblib.load and call pipeline.predict(new_data). To serve predictions over HTTP, wrap the loaded pipeline in a Flask or FastAPI endpoint (optionally in a container), use a serving framework such as MLflow or BentoML, or deploy to a managed platform such as AWS SageMaker.

About the Author

Michael Lip builds open-source ML tools and developer utilities at zovo.one. ml0x is part of the Zovo Tools network, a collection of free, privacy-first tools for developers and data scientists. No tracking, no accounts required, no data leaves your browser.

Last updated: May 25, 2026