Transfer Learning Guide

Q: What is transfer learning and why does it work?

Transfer learning is the practice of reusing a model trained on one task (source) as the starting point for a different task (target). It works because neural networks learn hierarchical features: early layers capture general low-level patterns (edges and textures in images, word-level and syntactic patterns in text) that are useful across many tasks, while later layers learn task-specific features. By transferring these learned representations, you reuse the knowledge gained from large datasets and expensive training runs, and typically reach better performance with less data and compute on your target task.

Q: When should I use feature extraction vs fine-tuning vs training from scratch?

Use feature extraction (freeze all pretrained layers, train only a new head) when your target dataset is small and similar to the source domain. Use fine-tuning (unfreeze some or all layers with a small learning rate) when your dataset is medium-to-large or moderately different from the source domain. Train from scratch only when your dataset is very large AND very different from the source domain (e.g., medical 3D scans when the source is ImageNet photos). In practice, fine-tuning is the most common choice and rarely performs worse than training from scratch.

Q: How many layers should I freeze during fine-tuning?

The number of layers to freeze depends on dataset size and domain similarity. With very small datasets (hundreds of examples), freeze most layers and train only the last 1-2 blocks plus the classification head. With medium datasets (thousands), unfreeze the top 30-50% of layers. With large datasets (tens of thousands or more), unfreeze everything but use a lower learning rate for earlier layers (discriminative fine-tuning). A practical approach is to start fully frozen, then gradually unfreeze layers from the output side toward the input while monitoring validation performance.

Q: Which pretrained model should I use for my task?

For image classification, start with EfficientNet-B0 to B3 for efficiency or ConvNeXt/ViT for maximum accuracy. For object detection, use models pretrained on COCO (YOLOv8, DETR). For NLP, use BERT or RoBERTa for classification/NER, T5 for sequence-to-sequence tasks, and GPT-2/LLaMA for text generation. For medical imaging, consider models pretrained on RadImageNet or BiomedCLIP. Match the model size to your compute budget: larger models usually give better accuracy but need more memory and training time.

Q: What learning rate should I use for fine-tuning?

Fine-tuning learning rates are typically 10-100x smaller than those used when training from scratch. Common ranges: 1e-5 to 5e-5 for BERT-style NLP models, 1e-4 to 1e-3 for the new head of a CNN image model with a frozen backbone, and 1e-5 to 1e-4 for the unfrozen layers of a CNN. Use discriminative learning rates (lower LR for earlier layers, higher for later layers) to preserve learned features while adapting task-specific layers. Use learning rate warmup for the first 5-10% of training steps, and consider a cosine or linear decay schedule afterward.
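Discriminative learning rates can be sketched as a simple geometric decay from the head toward the input. This is a minimal illustration, not a prescribed recipe: the function name discriminative_lrs and the per-layer decay factor of 2.0 are assumptions chosen for the example.

```python
def discriminative_lrs(num_layers, top_lr, decay=2.0):
    """Return one learning rate per layer, index 0 = earliest layer.

    The last layer gets top_lr; each earlier layer gets the rate of the
    layer after it divided by `decay` (an assumed, tunable factor).
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if decay < 1:
        raise ValueError("decay must be >= 1 so earlier layers get smaller rates")
    return [top_lr / (decay ** (num_layers - 1 - i)) for i in range(num_layers)]


# Hypothetical 4-layer model with 1e-4 on the last layer.
lrs = discriminative_lrs(num_layers=4, top_lr=1e-4, decay=2.0)
for index, lr in enumerate(lrs):
    print(f"layer {index}: lr = {lr:.2e}")
```

In a framework whose optimizer accepts per-group learning rates, each value would typically go into its own parameter group.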

What Is Transfer Learning?

Transfer learning is the machine learning technique of reusing a model trained on one task as the starting point for a model on a different but related task. Instead of training from randomly initialized weights, you begin with weights that already encode useful representations learned from a large dataset. This is arguably the most impactful practical technique in modern deep learning, enabling state-of-the-art results on tasks where training from scratch would require orders of magnitude more data and compute.

The theoretical foundation of transfer learning rests on the observation that neural networks learn hierarchical features. In vision models, early convolutional layers learn edge detectors and texture patterns that are universal across image domains. Middle layers compose these into parts (wheels, eyes, leaves) that are domain-specific but task-agnostic. Final layers combine parts into task-specific classifications. By transferring early and middle layer representations, you avoid relearning universal features and focus new training on task-specific adaptation.

In NLP, the same principle applies through language models like BERT and GPT. Pretraining on large text corpora learns syntactic structure, semantic relationships, world knowledge, and reasoning patterns. Fine-tuning on a specific task (sentiment analysis, question answering, named entity recognition) adapts these general language capabilities to the target domain with far less labeled data than training from scratch.

The Three Transfer Learning Strategies

Feature extraction is the simplest strategy: freeze all pretrained layers and train only a new classification head (typically one or two fully connected layers). The pretrained model acts as a fixed feature extractor, converting inputs into high-dimensional representations that the new head learns to classify. This works best when the source and target domains are similar and the target dataset is small (hundreds to low thousands of examples). Training is fast because only the head parameters are updated.
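A quick parameter count shows why training only the head is fast. The figures below are approximate values for a ResNet-50-style backbone and a hypothetical 10-class target task; treat them as illustrative assumptions rather than exact numbers for any particular checkpoint.

```python
# Approximate, illustrative figures (assumptions, not exact checkpoint values).
BACKBONE_PARAMS = 23_500_000  # frozen pretrained layers
FEATURE_DIM = 2048            # size of the pooled feature vector
NUM_CLASSES = 10              # hypothetical target task


def linear_head_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


head = linear_head_params(FEATURE_DIM, NUM_CLASSES)
total = BACKBONE_PARAMS + head
trainable_fraction = head / total

print(f"head parameters:    {head:,}")
print(f"total parameters:   {total:,}")
print(f"trainable fraction: {trainable_fraction:.4%}")
```

Under these assumptions, well under 1% of the parameters receive gradient updates, so each optimizer step is much cheaper than in full fine-tuning.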

Fine-tuning unfreezes some or all pretrained layers and trains them with a small learning rate while training the new head with a larger learning rate. This allows the model to adapt its learned representations to the target domain while still benefiting from the pretrained initialization. Fine-tuning is the most versatile strategy and works well across a wide range of dataset sizes and domain similarities. The key decisions are how many layers to unfreeze and what learning rates to use.

Training from scratch initializes all weights randomly and trains the entire model on the target dataset. This is appropriate only when the target dataset is very large AND very different from the source domain. Even in this case, practitioners often find that pretrained initialization still helps, because the optimization landscape from a pretrained starting point is more favorable than from random initialization. The bar for choosing to train from scratch over fine-tuning is very high.
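The choice among the three strategies can be written as a small rule of thumb. This is only a sketch: the function name choose_strategy, the 0.0-1.0 similarity scale, and every numeric threshold are assumptions made for illustration, not measured cut-offs.

```python
def choose_strategy(num_examples, similarity):
    """Suggest a strategy from dataset size and domain similarity.

    similarity is a rough 0.0-1.0 judgment (1.0 = same domain as the source).
    All thresholds are illustrative assumptions, not measured values.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be between 0.0 and 1.0")
    is_small = num_examples < 2_000
    is_very_large = num_examples >= 1_000_000
    if is_very_large and similarity < 0.2:
        # Very large AND very different: the only case for random init.
        return "train from scratch"
    if is_small and similarity >= 0.7:
        return "feature extraction"
    return "fine-tuning"


print(choose_strategy(500, 0.9))        # small, similar domain
print(choose_strategy(50_000, 0.5))     # medium, moderately different
print(choose_strategy(5_000_000, 0.1))  # very large, very different
```

Fine-tuning is the default branch on purpose, which mirrors the high bar for training from scratch.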

Domain Similarity and Its Impact

The effectiveness of transfer learning depends critically on the similarity between the source and target domains. When domains are similar (ImageNet to wildlife photos, Wikipedia text to news articles), even feature extraction with a frozen backbone works well. When domains are moderately different (ImageNet to satellite imagery, general text to legal documents), fine-tuning the upper layers bridges the gap. When domains are very different (natural images to X-ray scans, English text to code), either extensive fine-tuning or domain-specific pretraining is needed.

Research by Yosinski et al. (2014) demonstrated this gradient of transferability. They systematically measured how transferable each layer of a neural network is by transferring subsets of layers between related and unrelated tasks. They found that first-layer features are general and transfer well regardless of task, while higher-layer features become increasingly task-specific and transfer less well. This finding directly informs the layer freezing strategy: freeze more layers when domains are similar, fewer when they differ.

Negative transfer occurs when transfer learning hurts performance compared to training from scratch. This typically happens when the source and target domains are so different that the pretrained features are misleading rather than helpful. Negative transfer is rare when using established pretrained models (ImageNet-pretrained CNNs, BERT) but can occur with very domain-specific source models or when fine-tuning hyperparameters are poorly chosen (especially a learning rate that is too high, which destroys the pretrained features).

Layer Freezing Strategies

Layer freezing determines which parameters are updated during fine-tuning. The most common approach is gradual unfreezing (Howard & Ruder, 2018): start by training only the head, then progressively unfreeze earlier layers, working from the output side toward the input over the course of training. This lets the head adapt first, so that gradients are more stable by the time the earlier layers are unfrozen. An alternative is to unfreeze all layers from the start but use discriminative learning rates: earlier layers get a learning rate 2-10x smaller than later layers. This achieves a similar effect more simply.
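Gradual unfreezing can be sketched as a schedule that says which backbone layers are trainable at each epoch. The function name gradual_unfreeze_schedule and the choice of one epoch per stage are assumptions for illustration; in practice the stage length would be tuned against validation performance.

```python
def gradual_unfreeze_schedule(num_layers, epochs_per_stage=1):
    """Yield (epoch, trainable_layer_indices), index 0 = earliest layer.

    Stage 0 trains only the head (no backbone layers); each later stage
    unfreezes one more backbone layer, starting nearest the output.
    """
    epoch = 0
    for stage in range(num_layers + 1):
        first_trainable = num_layers - stage
        trainable = list(range(first_trainable, num_layers))
        for _ in range(epochs_per_stage):
            yield epoch, trainable
            epoch += 1


# Hypothetical 3-layer backbone, one epoch per stage.
schedule = list(gradual_unfreeze_schedule(num_layers=3))
for epoch, layers in schedule:
    label = layers if layers else "head only"
    print(f"epoch {epoch}: trainable backbone layers = {label}")
```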

The optimal number of frozen layers depends on three factors. First, domain similarity: higher similarity means more layers can remain frozen. Second, dataset size: larger datasets allow more layers to be trained without overfitting. Third, model depth: deeper models have more layers that learn increasingly abstract features, so the middle layers may need more adaptation. As a rough guideline for a 50-layer model: freeze 80-90% of layers for tiny datasets with high domain similarity, 50-70% for medium datasets, and 0-30% for large datasets with low domain similarity.
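The rough percentages above translate into layer counts as follows. The regime names and the helper frozen_layer_range are hypothetical labels introduced for this sketch; only the percentage ranges come from the guideline.

```python
# Percent of layers kept frozen, per the rough guideline in the text.
FREEZE_PERCENT = {
    "tiny": (80, 90),    # tiny dataset, high domain similarity
    "medium": (50, 70),  # medium dataset
    "large": (0, 30),    # large dataset, low domain similarity
}


def frozen_layer_range(total_layers, regime):
    """Return (min_frozen, max_frozen) layer counts for a regime."""
    low, high = FREEZE_PERCENT[regime]
    # Integer arithmetic avoids floating-point rounding surprises.
    return total_layers * low // 100, total_layers * high // 100


for regime in FREEZE_PERCENT:
    low, high = frozen_layer_range(50, regime)
    print(f"{regime}: freeze the first {low}-{high} of 50 layers")
```

The counts refer to the layers nearest the input, since those hold the most general features.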

Pretrained Model Selection

Choosing the right pretrained model involves balancing accuracy, efficiency, and compatibility with your target task. For computer vision, the landscape includes: ResNet (reliable, well-studied, good baseline), EfficientNet (strong accuracy-efficiency tradeoff for classification), ConvNeXt (modern CNN with accuracy comparable to ViT), Vision Transformer (ViT) (strongest when pretrained at large scale, and tends to need more data than CNNs), and CLIP (vision-language model, well suited to zero-shot and few-shot scenarios).

For NLP: BERT/RoBERTa (bidirectional encoder, best for classification and extraction tasks), T5/FLAN-T5 (encoder-decoder, flexible for any text-to-text task), GPT-2/LLaMA (decoder-only, best for generation), and DeBERTa (enhanced BERT with disentangled attention, often best on benchmarks). Model size matters: BERT-base (110M params) is sufficient for many tasks and trainable on consumer GPUs, while BERT-large (340M) or larger models need more compute but give better accuracy.
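A back-of-the-envelope estimate helps when matching model size to a GPU. The sketch below assumes full fine-tuning in 32-bit floats with AdamW and counts only weights, gradients, and optimizer state; the function name training_memory_gb is hypothetical, and activation memory (which depends on batch size and sequence length) is deliberately left out.

```python
def training_memory_gb(num_params, bytes_per_value=4):
    """Rough memory for weights, gradients, and AdamW state, in GB.

    Counts four values per parameter: the weight, its gradient, and the
    two moment estimates AdamW keeps. Activations are NOT included, so
    the result is a lower bound, not a full memory budget.
    """
    values_per_param = 4
    return num_params * values_per_param * bytes_per_value / 1e9


for name, params in [("BERT-base", 110_000_000), ("BERT-large", 340_000_000)]:
    gb = training_memory_gb(params)
    print(f"{name}: at least {gb:.2f} GB before activations")
```

Freezing layers lowers this figure, because frozen parameters need neither gradients nor optimizer state.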

Practical Fine-Tuning Recipes

Vision (ImageNet pretrained): Use AdamW optimizer, learning rate 1e-4 to 3e-4 for the head and 1e-5 to 3e-5 for the backbone. Train for 20-50 epochs with cosine LR decay. Use standard augmentation (random crop, horizontal flip, color jitter).
NLP (BERT pretrained): Use AdamW with learning rate 2e-5 to 5e-5, linear warmup over 6-10% of total steps, then linear decay. Train for 3-5 epochs. Use a batch size of 16-32. Weight decay of 0.01.
Low-resource scenarios (<1K samples): Use feature extraction first. If insufficient, fine-tune only the last 1-2 blocks. Apply heavy regularization (dropout 0.3-0.5, weight decay 0.1). Consider data augmentation to artificially expand the dataset.
Domain adaptation (very different domains): Consider intermediate pretraining on unlabeled data from the target domain (e.g., continue BERT pretraining on legal texts before fine-tuning for legal classification). This bridges the domain gap more effectively than direct fine-tuning.
Multi-task fine-tuning: Fine-tune on multiple related tasks simultaneously. This can improve performance on each individual task through shared representations and acts as implicit regularization.
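The warmup-then-decay schedule from the NLP recipe can be sketched as a pure function of the step count. This is a minimal illustration assuming a 10% warmup and decay to zero; the function name linear_warmup_decay, the 1,000-step run, and the 3e-5 peak are example values, not recommendations for a specific model.

```python
def linear_warmup_decay(step, total_steps, peak_lr, warmup_fraction=0.1):
    """Learning rate at `step`: linear ramp to peak_lr, then linear decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


# Hypothetical run: 1,000 optimizer steps with a 3e-5 peak.
for step in (0, 50, 100, 550, 1000):
    lr = linear_warmup_decay(step, total_steps=1000, peak_lr=3e-5)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

The rate peaks exactly where warmup ends, so the largest updates happen only after the new head has begun to settle.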

Common Pitfalls and Solutions

Catastrophic forgetting: Fine-tuning with too high a learning rate destroys pretrained features. Solution: use learning rates 10-100x smaller than training from scratch, and always use warmup.
Overfitting on small datasets: The fine-tuned model memorizes the training set. Solution: freeze more layers, increase regularization, use augmentation, try feature extraction instead.
Underfitting despite large dataset: The model capacity is insufficient, or too much of the network is frozen to adapt. Solution: use a larger pretrained model, unfreeze more layers, increase the learning rate slightly.
Distribution shift: The target data distribution differs significantly from what the model expects. Solution: at a minimum, normalize inputs using the same statistics as the pretrained model (e.g., ImageNet mean/std for ImageNet-pretrained CNNs); for larger shifts, unfreeze more layers.
Label mismatch: Source and target tasks have different label spaces or granularity. Solution: always replace the classification head. For hierarchical labels, consider initializing the new head using the closest source classes.
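The input-normalization fix mentioned under distribution shift can be sketched for a single pixel. The mean and standard deviation are the per-channel RGB values commonly used with ImageNet-pretrained models; the helper normalize_pixel is a hypothetical name, and a real pipeline would apply the same arithmetic to whole image tensors.

```python
# Per-channel RGB statistics commonly used with ImageNet-pretrained models.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb_uint8):
    """Scale an (R, G, B) pixel from 0-255 to 0-1, then standardize per channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb_uint8, IMAGENET_MEAN, IMAGENET_STD)
    )


# A pixel close to the dataset mean colour maps to values near zero.
near_mean = normalize_pixel((124, 116, 104))
white = normalize_pixel((255, 255, 255))
print(near_mean)
print(white)
```

Skipping this step feeds the backbone inputs on a different scale from the one it was trained on, which can degrade accuracy even when the images themselves are similar.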

Frequently Asked Questions

What is transfer learning and why does it work?

Transfer learning reuses a model trained on one task as a starting point for a different task. It works because neural networks learn hierarchical features: early layers capture universal patterns useful across tasks, while later layers learn task-specific features. Transferring representations leverages knowledge from large datasets and expensive training.

When should I use feature extraction vs fine-tuning vs training from scratch?

Feature extraction (freeze all, train head only) for small, similar-domain datasets. Fine-tuning (unfreeze some layers, low LR) for medium-to-large datasets or different domains. Training from scratch only for very large datasets that are very different from the source. Fine-tuning rarely performs worse than from scratch.

How many layers should I freeze during fine-tuning?

With tiny datasets, freeze 80-90% of layers. With medium datasets, unfreeze the top 30-50%. With large datasets, unfreeze everything but use discriminative learning rates (lower for early layers). Start frozen and gradually unfreeze while monitoring validation performance.

Which pretrained model should I use for my task?

Vision: EfficientNet for efficiency, ConvNeXt/ViT for accuracy. Detection: COCO-pretrained YOLO/DETR. NLP: BERT/RoBERTa for classification, T5 for text-to-text, GPT-2/LLaMA for generation. Match model size to your compute budget.

What learning rate should I use for fine-tuning?

10-100x smaller than training from scratch. NLP: 2e-5 to 5e-5. Vision backbone: 1e-5 to 3e-5. Vision head: 1e-4 to 3e-4. Use discriminative rates (lower for early layers) and always apply warmup for the first 5-10% of training steps.

About the Author

Michael Lip builds open-source ML tools and developer utilities at zovo.one. ml0x is part of the Zovo Tools network, a collection of free, privacy-first tools for developers and data scientists. No tracking, no accounts required, no data leaves your browser.

Last updated: May 25, 2026