Transfer Learning Guide

Select your source task, target task, dataset size, and compute budget to get a personalized transfer learning strategy recommendation. Explore the model compatibility matrix showing how 10 popular pretrained models perform across 8 common target tasks. Calculate the optimal layer freezing strategy for your specific scenario. All processing is local to your browser.

Strategy Decision Tree

Model Compatibility Matrix

Compatibility score (1-5 stars) for pretrained models across target tasks. Higher = better expected transfer performance.

Layer Freezing Strategy Calculator

Legend: Frozen (keep pretrained weights), Fine-tuned (low LR), Trained (full LR)

Fine-Tuning Learning Rate Estimator

Outputs: Frozen Layers LR, Middle Layers LR, Classification Head LR, Warmup Steps

What Is Transfer Learning?

Transfer learning is the machine learning technique of reusing a model trained on one task as the starting point for a model on a different but related task. Instead of training from randomly initialized weights, you begin with weights that already encode useful representations learned from a large dataset. This is arguably the most impactful practical technique in modern deep learning, enabling state-of-the-art results on tasks where training from scratch would require orders of magnitude more data and compute.

The theoretical foundation of transfer learning rests on the observation that neural networks learn hierarchical features. In vision models, early convolutional layers learn edge detectors and texture patterns that are universal across image domains. Middle layers compose these into parts (wheels, eyes, leaves) that are domain-specific but task-agnostic. Final layers combine parts into task-specific classifications. By transferring early and middle layer representations, you avoid relearning universal features and focus new training on task-specific adaptation.

In NLP, the same principle applies through language models like BERT and GPT. Pretraining on large text corpora learns syntactic structure, semantic relationships, world knowledge, and reasoning patterns. Fine-tuning on a specific task (sentiment analysis, question answering, named entity recognition) adapts these general language capabilities to the target domain with far less labeled data than training from scratch.

The Three Transfer Learning Strategies

Feature extraction is the simplest strategy: freeze all pretrained layers and train only a new classification head (typically one or two fully connected layers). The pretrained model acts as a fixed feature extractor, converting inputs into high-dimensional representations that the new head learns to classify. This works best when the source and target domains are similar and the target dataset is small (hundreds to low thousands of examples). Training is fast because only the head parameters are updated.
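To make the "only the head parameters are updated" point concrete, the arithmetic below counts the trainable parameters of a single linear head on a frozen backbone. The ResNet-50 figures (2048-dimensional pooled features, roughly 23.5 million backbone parameters) are approximate, and the 10-class target task is a hypothetical example.

```python
def linear_head_params(feature_dim: int, num_classes: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return feature_dim * num_classes + num_classes


# Assumed, approximate figures for illustration only.
BACKBONE_PARAMS = 23_500_000  # rough size of a ResNet-50 backbone
FEATURE_DIM = 2048            # width of ResNet-50's pooled features
NUM_CLASSES = 10              # hypothetical target task

head = linear_head_params(FEATURE_DIM, NUM_CLASSES)
trainable_fraction = head / (BACKBONE_PARAMS + head)

print(f"trainable head parameters: {head:,}")
print(f"fraction of all parameters trained: {trainable_fraction:.4%}")
```

Under these assumptions, well under 1% of the parameters receive gradient updates, which is why feature extraction trains quickly.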

Fine-tuning unfreezes some or all pretrained layers and trains them with a small learning rate while training the new head with a larger learning rate. This allows the model to adapt its learned representations to the target domain while still benefiting from the pretrained initialization. Fine-tuning is the most versatile strategy and works well across a wide range of dataset sizes and domain similarities. The key decisions are how many layers to unfreeze and what learning rates to use.

Training from scratch initializes all weights randomly and trains the entire model on the target dataset. This is appropriate only when the target dataset is both very large and very different from the source domain. Even then, practitioners often find that pretrained initialization still helps, because optimization from a pretrained starting point tends to be easier than from random initialization. The bar for choosing to train from scratch over fine-tuning is very high.
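The three strategies can be summarized as a small decision rule. This is a minimal sketch, not the tool's actual logic: the function name `choose_strategy` and both numeric cutoffs are assumptions chosen to illustrate "hundreds to low thousands" and "very large".

```python
SMALL_DATASET = 5_000          # assumed ceiling for "hundreds to low thousands"
VERY_LARGE_DATASET = 1_000_000  # assumed floor for "very large"


def choose_strategy(num_examples: int, domain_similarity: str) -> str:
    """Pick a transfer strategy from dataset size and domain similarity.

    domain_similarity must be 'high', 'medium', or 'low'.
    """
    if domain_similarity not in {"high", "medium", "low"}:
        raise ValueError("domain_similarity must be 'high', 'medium', or 'low'")
    # Small dataset and similar domain: a frozen backbone is usually enough.
    if num_examples <= SMALL_DATASET and domain_similarity == "high":
        return "feature_extraction"
    # Only very large AND very different datasets justify random initialization.
    if num_examples >= VERY_LARGE_DATASET and domain_similarity == "low":
        return "from_scratch"
    # Everything in between defaults to fine-tuning.
    return "fine_tuning"


print(choose_strategy(800, "high"))
print(choose_strategy(50_000, "medium"))
print(choose_strategy(5_000_000, "low"))
```

Fine-tuning is the fallback branch on purpose, reflecting that it is the most versatile of the three.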

Domain Similarity and Its Impact

The effectiveness of transfer learning depends critically on the similarity between the source and target domains. When domains are similar (ImageNet to wildlife photos, Wikipedia text to news articles), even feature extraction with a frozen backbone works well. When domains are moderately different (ImageNet to satellite imagery, general text to legal documents), fine-tuning the upper layers bridges the gap. When domains are very different (natural images to X-ray scans, English text to code), either extensive fine-tuning or domain-specific pretraining is needed.

Research by Yosinski et al. (2014) demonstrated this gradient of transferability. They systematically measured how transferable each layer of a neural network is by transferring subsets of layers between related and unrelated tasks. They found that first-layer features are general and transfer well regardless of task, while higher-layer features become increasingly task-specific and transfer less well. This finding directly informs the layer freezing strategy: freeze more layers when domains are similar, fewer when they differ.

Negative transfer occurs when transfer learning hurts performance compared to training from scratch. This typically happens when the source and target domains are so different that the pretrained features are misleading rather than helpful. Negative transfer is rare when using established pretrained models (ImageNet-pretrained CNNs, BERT) but can occur with very domain-specific source models or when fine-tuning hyperparameters are poorly chosen, especially a learning rate that is too high and destroys the pretrained features.

Layer Freezing Strategies

Layer freezing determines which parameters are updated during fine-tuning. A common approach is gradual unfreezing (Howard & Ruder, 2018): start by training only the head, then progressively unfreeze the pretrained layers from the last one back toward the input over the course of training. This lets the head adapt first, so the gradients reaching the pretrained layers are more stable by the time those layers are unfrozen. An alternative is to unfreeze all layers from the start but use discriminative learning rates: earlier layers get a learning rate 2-10x smaller than later layers. This achieves a similar effect more simply.
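Both ideas reduce to simple per-layer bookkeeping. The sketch below treats the model as an ordered list of layer groups (index 0 nearest the input, last index the head). The function names, the one-group-per-epoch schedule, and the per-group factor of 2.0 are illustrative assumptions.

```python
from typing import List


def discriminative_lrs(head_lr: float, num_groups: int, factor: float = 2.0) -> List[float]:
    """Learning rate per layer group, ordered from input (index 0) to head.

    Each group trains `factor` times slower than the group after it.
    """
    return [head_lr / factor ** (num_groups - 1 - i) for i in range(num_groups)]


def trainable_groups(epoch: int, num_groups: int) -> List[bool]:
    """Gradual unfreezing: epoch 0 trains only the head, and each later
    epoch unfreezes one more group, moving from the head toward the input."""
    unfrozen = min(epoch + 1, num_groups)
    return [i >= num_groups - unfrozen for i in range(num_groups)]


print(discriminative_lrs(1e-3, num_groups=3))
for epoch in range(4):
    print(epoch, trainable_groups(epoch, num_groups=4))
```

In a real framework, the booleans would toggle whether each group's parameters receive gradients, and the rates would be passed to the optimizer as one parameter group per layer group.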

The optimal number of frozen layers depends on three factors. First, domain similarity: higher similarity means more layers can remain frozen. Second, dataset size: larger datasets allow more layers to be trained without overfitting. Third, model depth: deeper models have more layers that learn increasingly abstract features, so the middle layers may need more adaptation. As a rough guideline for a 50-layer model: freeze 80-90% of layers for tiny datasets with high domain similarity, 50-70% for medium datasets, and 0-30% for large datasets with low domain similarity.
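The rough guideline above translates directly into layer counts. The percentages below are copied from the guideline. The scenario keys and the function name `frozen_layer_range` are hypothetical labels for this sketch.

```python
from typing import Tuple

# Percent of layers to freeze (low, high), taken from the rough guideline.
FREEZE_PERCENT = {
    "tiny": (80, 90),    # tiny dataset, high domain similarity
    "medium": (50, 70),  # medium dataset
    "large": (0, 30),    # large dataset, low domain similarity
}


def frozen_layer_range(num_layers: int, scenario: str) -> Tuple[int, int]:
    """Return the (min, max) number of layers to freeze, counted from the input."""
    low, high = FREEZE_PERCENT[scenario]
    return round(num_layers * low / 100), round(num_layers * high / 100)


for name in FREEZE_PERCENT:
    print(name, frozen_layer_range(50, name))
```

For a 50-layer model this gives 40-45 frozen layers in the tiny case, 25-35 in the medium case, and 0-15 in the large case.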

Pretrained Model Selection

Choosing the right pretrained model involves balancing accuracy, efficiency, and compatibility with your target task. For computer vision, the landscape includes ResNet (reliable, well-studied, good baseline), EfficientNet (strong accuracy-efficiency tradeoff for classification), ConvNeXt (modern CNN with accuracy competitive with ViT), Vision Transformer, or ViT (strong on large datasets, but needs more data than CNNs to train well), and CLIP (vision-language model, excellent for zero-shot and few-shot scenarios).

For NLP: BERT/RoBERTa (bidirectional encoder, best for classification and extraction tasks), T5/FLAN-T5 (encoder-decoder, flexible for any text-to-text task), GPT-2/LLaMA (decoder-only, best for generation), and DeBERTa (enhanced BERT with disentangled attention, often best on benchmarks). Model size matters: BERT-base (110M params) is sufficient for many tasks and trainable on consumer GPUs, while BERT-large (340M) or larger models need more compute but give better accuracy.
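A back-of-the-envelope memory estimate helps when matching model size to a GPU. The sketch below assumes full fine-tuning in 32-bit floats with an Adam-style optimizer, which needs about 16 bytes per parameter (4 for the weight, 4 for its gradient, 8 for the two moment buffers). It deliberately ignores activation memory, which depends on batch size and sequence length and is often the larger share, so treat the result as a lower bound.

```python
BYTES_PER_PARAM = 4 + 4 + 8  # fp32 weight + gradient + two Adam moment buffers


def training_state_gib(num_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Memory for weights, gradients, and optimizer state, in GiB.

    Excludes activations, so real usage is higher.
    """
    return num_params * bytes_per_param / 2**30


for name, params in [("BERT-base", 110_000_000), ("BERT-large", 340_000_000)]:
    print(f"{name}: about {training_state_gib(params):.1f} GiB before activations")
```

Under these assumptions BERT-base needs roughly 1.6 GiB and BERT-large roughly 5.1 GiB before activations are counted.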

Practical Fine-Tuning Recipes

Common Pitfalls and Solutions

Frequently Asked Questions

What is transfer learning and why does it work?

Transfer learning reuses a model trained on one task as a starting point for a different task. It works because neural networks learn hierarchical features: early layers capture universal patterns useful across tasks, while later layers learn task-specific features. Transferring these representations lets the new model reuse what was learned from large datasets and expensive training runs.

When should I use feature extraction vs fine-tuning vs training from scratch?

Use feature extraction (freeze all layers, train the head only) for small, similar-domain datasets. Use fine-tuning (unfreeze some layers, low LR) for medium-to-large datasets or different domains. Train from scratch only for very large datasets that are very different from the source. Fine-tuning rarely performs worse than training from scratch.

How many layers should I freeze during fine-tuning?

With tiny datasets, freeze 80-90% of layers. With medium datasets, unfreeze the top 30-50%. With large datasets, unfreeze everything but use discriminative learning rates (lower for early layers). Start frozen and gradually unfreeze while monitoring validation performance.

Which pretrained model should I use for my task?

Vision: EfficientNet for efficiency, ConvNeXt/ViT for accuracy. Detection: COCO-pretrained YOLO/DETR. NLP: BERT/RoBERTa for classification, T5 for text-to-text, GPT-2/LLaMA for generation. Match model size to your compute budget.

What learning rate should I use for fine-tuning?

Use a rate 10-100x smaller than you would for training from scratch. NLP: 2e-5 to 5e-5. Vision backbone: 1e-5 to 3e-5. Vision head: 1e-4 to 3e-4. Use discriminative rates (lower for early layers) and always apply warmup for the first 5-10% of training steps.
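The warmup advice can be sketched as a small schedule function. This is a minimal illustration: the name `lr_at_step` is hypothetical, and holding the rate constant after warmup is an assumption made for brevity (many recipes decay it instead).

```python
def lr_at_step(step: int, total_steps: int, peak_lr: float,
               warmup_fraction: float = 0.1) -> float:
    """Linear warmup to peak_lr over the first warmup_fraction of training,
    then hold peak_lr constant (the constant hold is an assumption)."""
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Ramp linearly so the final warmup step reaches peak_lr exactly.
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


# Example: 1,000 training steps, NLP-style peak rate of 2e-5, 10% warmup.
for step in (0, 49, 99, 500):
    print(step, lr_at_step(step, total_steps=1000, peak_lr=2e-5))
```

Starting from a small rate keeps the first updates from a freshly initialized head from disturbing the pretrained weights too much.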

About the Author

Michael Lip builds open-source ML tools and developer utilities at zovo.one. ml0x is part of the Zovo Tools network, a collection of free, privacy-first tools for developers and data scientists. No tracking, no accounts required, no data leaves your browser.

Last updated: May 25, 2026