Model Complexity Calculator

Build your neural network layer by layer and instantly see the parameter count, FLOPS, and memory footprint at each stage. Add Dense, Conv2D, LSTM, Multi-Head Attention, and BatchNorm layers with custom dimensions. Compare your architecture against popular models like ResNet, VGG, BERT, and GPT variants. Calculate memory requirements at different numerical precisions. Everything runs locally in your browser.

Counting Parameters in Neural Networks

Understanding the parameter count of a neural network is fundamental to estimating its memory requirements, computational cost, and modeling capacity. Each layer type has a precise formula for counting trainable parameters, and summing across all layers gives the total model size. This number largely determines how much GPU memory the weights consume, and it strongly influences how long training takes and how much data is needed to train the model without overfitting.

A Dense (fully connected) layer maps an input of dimension I to an output of dimension O. Each input connects to each output through a learned weight, giving I * O weights. Adding O bias terms, the total is I * O + O parameters. A Dense layer from 768 to 3072 (common in BERT) has 768 * 3072 + 3072 = 2,362,368 parameters. Dense layers are parameter-heavy because every input-output pair has a dedicated weight.
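The Dense formula can be checked in a few lines of Python. This is a minimal sketch, not the calculator's own code; the function name dense_params and its bias flag are illustrative assumptions.

```python
def dense_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Trainable parameters of a fully connected layer: I * O weights plus O biases."""
    weights = in_features * out_features
    biases = out_features if bias else 0
    return weights + biases


# The 768 -> 3072 projection from the text.
print(dense_params(768, 3072))  # 2362368
```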

A Conv2D layer with kernel size K, C_in input channels, and C_out output channels has K * K * C_in * C_out + C_out parameters. A 3x3 convolution from 64 to 128 channels has 3 * 3 * 64 * 128 + 128 = 73,856 parameters. Convolutions are parameter-efficient because the same kernel is applied across all spatial positions (weight sharing), but they can be computationally expensive because the kernel must be applied at every position in the feature map.
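The same arithmetic for a convolution, as a small sketch. The helper name conv2d_params is hypothetical, and it assumes a square kernel with no grouping, which is what the formula above describes.

```python
def conv2d_params(kernel_size: int, c_in: int, c_out: int, bias: bool = True) -> int:
    """Trainable parameters of a square-kernel, ungrouped 2D convolution."""
    weights = kernel_size * kernel_size * c_in * c_out
    biases = c_out if bias else 0
    return weights + biases


# The 3x3 convolution from 64 to 128 channels used in the text.
print(conv2d_params(3, 64, 128))  # 73856
```

Note that the count does not depend on the input resolution: the same 73,856 parameters are reused at every spatial position.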

An LSTM layer with input dimension I and hidden dimension H has 4 * (I * H + H * H + H) parameters. The factor of 4 comes from the four gate computations (input gate, forget gate, output gate, and the cell candidate), each of which has an input weight matrix (I x H), a recurrent weight matrix (H x H), and a bias vector (H). An LSTM with input 300 and hidden 512 has 4 * (300 * 512 + 512 * 512 + 512) = 1,665,024 parameters.
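A short sketch of the LSTM count follows. The name lstm_params and the biases_per_gate argument are assumptions made for illustration. The argument exists because some frameworks (PyTorch's nn.LSTM, for example) store two bias vectors per gate, which gives a slightly larger total than the formula above.

```python
def lstm_params(input_size: int, hidden_size: int, biases_per_gate: int = 1) -> int:
    """Trainable parameters of a single-layer, unidirectional LSTM."""
    per_gate = (
        input_size * hidden_size       # input-to-hidden weights
        + hidden_size * hidden_size    # hidden-to-hidden (recurrent) weights
        + biases_per_gate * hidden_size
    )
    return 4 * per_gate  # input gate, forget gate, cell candidate, output gate


print(lstm_params(300, 512))                     # 1665024, matches the text
print(lstm_params(300, 512, biases_per_gate=2))  # 1667072, two-bias convention
```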

Multi-Head Attention with model dimension D has approximately 4 * D * D parameters: three projection matrices for queries, keys, and values (each D x D) plus one output projection (D x D). The actual count also includes biases: 4 * (D * D + D). For BERT-base (D = 768), each attention layer has 4 * (768 * 768 + 768) = 2,362,368 parameters. With multiple heads, the projections are split so that each of the h heads works on D / h dimensions, but the total parameter count is unchanged.
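To see why the head count does not change the total, the sketch below builds the count head by head. The function name mha_params is hypothetical, and it assumes the standard layout in which queries, keys, and values all use the model dimension.

```python
def mha_params(d_model: int, num_heads: int, bias: bool = True) -> int:
    """Trainable parameters of standard multi-head self-attention."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    head_dim = d_model // num_heads
    # Each head projects d_model -> head_dim for Q, K, and V.
    qkv_weights = 3 * num_heads * d_model * head_dim
    out_weights = d_model * d_model          # output projection
    biases = 4 * d_model if bias else 0      # one bias vector per projection
    return qkv_weights + out_weights + biases


print(mha_params(768, 12))  # 2362368 (BERT-base uses 12 heads)
print(mha_params(768, 1))   # 2362368, same total with a single head
```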

BatchNorm and LayerNorm each have 2 * features parameters: a learned scale (gamma) and shift (beta) per feature dimension. A BatchNorm layer on 256 channels has 512 parameters. While these layers add minimal parameters, they are critical for training stability and convergence speed.
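The normalization count is simple enough to sketch in one function. The name norm_params and the affine flag are illustrative assumptions. The sketch counts trainable parameters only; BatchNorm also keeps a running mean and variance per feature, which are stored with the model but are not trained by gradient descent.

```python
def norm_params(num_features: int, affine: bool = True) -> int:
    """Trainable parameters of BatchNorm or LayerNorm: gamma and beta per feature."""
    return 2 * num_features if affine else 0


print(norm_params(256))  # 512
```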

Understanding FLOPS

FLOPS (floating-point operations, often written FLOPs) measure the computational cost of a forward pass through the network. On this page the term is a count of operations, not floating-point operations per second, the hardware throughput unit that shares the acronym. While parameter count tells you how big the model is, FLOPS tell you how expensive it is to run. A model with a large embedding table has many parameters but relatively few FLOPS (embeddings are just lookups). A model with many small convolutions applied to high-resolution feature maps has fewer parameters but many FLOPS.

For a Dense layer, FLOPS equal 2 * I * O (one multiply and one add per weight). For Conv2D, FLOPS equal 2 * K * K * C_in * C_out * H_out * W_out, where H_out and W_out are the spatial dimensions of the output feature map. This is why convolutions on large feature maps are computationally expensive. For attention, the dominant cost at long sequence lengths is the attention matrix computation: 2 * n^2 * d per head for the query-key scores, where n is sequence length and d is head dimension, plus another 2 * n^2 * d per head to apply those scores to the values. This quadratic scaling is why long sequences are so expensive for Transformers.
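The three FLOPS formulas can be written as small helpers. This is a sketch under the convention stated above (one multiply plus one add counts as two operations); the function names and the 56x56 and 512-token example sizes are assumptions chosen for illustration, and biases and softmax are ignored.

```python
def dense_flops(in_features: int, out_features: int) -> int:
    """Forward FLOPS of a Dense layer for one input vector."""
    return 2 * in_features * out_features


def conv2d_flops(kernel_size: int, c_in: int, c_out: int, h_out: int, w_out: int) -> int:
    """Forward FLOPS of a square-kernel convolution for one image."""
    return 2 * kernel_size * kernel_size * c_in * c_out * h_out * w_out


def attention_score_flops(seq_len: int, head_dim: int) -> int:
    """FLOPS to compute the n x n query-key score matrix for one head."""
    return 2 * seq_len * seq_len * head_dim


print(dense_flops(768, 3072))            # 4718592
print(conv2d_flops(3, 64, 128, 56, 56))  # 462422016
print(attention_score_flops(512, 64))    # 33554432
print(attention_score_flops(1024, 64))   # 134217728, 4x for 2x the sequence length
```

The convolution example has about 32 times fewer parameters than the Dense example but roughly 100 times more FLOPS, because its kernel is applied at all 3,136 output positions.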

FLOPS do not directly correspond to wall-clock time because modern GPUs achieve different FLOPS rates for different operations. Matrix multiplications on tensor cores achieve near-peak throughput, while element-wise operations and memory-bound operations are much slower. This is why a model with high FLOPS in well-structured matrix operations (Dense, Conv, Attention) may actually run faster than a model with fewer FLOPS but many element-wise operations.

Memory Footprint and Precision

The memory footprint of a model depends on the numerical precision used to store its parameters. In fp32 (32-bit floating point), each parameter occupies 4 bytes. In fp16 or bf16 (16-bit), each parameter occupies 2 bytes. In int8 (8-bit integer quantization), each parameter occupies 1 byte. Halving the precision roughly halves the memory, enabling larger models or larger batch sizes on the same hardware.
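As a rough sketch of the weight-memory calculation, the helper below multiplies the parameter count by the bytes per parameter. The names BYTES_PER_PARAM and weight_memory_mb are hypothetical, and the sketch assumes decimal megabytes (1 MB = 10^6 bytes); a tool that uses 2^20 bytes per MB would report slightly smaller numbers.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_mb(num_params: int, precision: str) -> float:
    """Memory needed to store the weights alone, in decimal megabytes."""
    return num_params * BYTES_PER_PARAM[precision] / 1e6


# A 110M-parameter model (roughly BERT-base) at each precision.
for precision in ("fp32", "fp16", "int8"):
    print(precision, weight_memory_mb(110_000_000, precision), "MB")
```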

For inference, the memory requirement is straightforward: parameters times bytes per parameter, plus activation memory for the current batch. For training, memory is much higher: you need space for parameters, gradients (same size as parameters), optimizer states (two extra values per parameter for Adam, or 8 bytes per parameter in fp32), and activations for the entire batch (needed for backpropagation). A 7B parameter model needs approximately 14 GB for fp16 inference, but fp32 training with Adam takes 16 bytes per parameter (4 for weights, 4 for gradients, 8 for optimizer states), or about 112 GB before activations are counted.
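The per-parameter training cost can be estimated with a short sketch. The function name and its default byte counts are assumptions: fp32 weights, fp32 gradients, and the two fp32 moment estimates that Adam keeps per parameter. Activation memory is deliberately left out because it depends on batch size, sequence length, and architecture.

```python
def training_memory_gb(
    num_params: int,
    weight_bytes: int = 4,     # fp32 weights
    grad_bytes: int = 4,       # fp32 gradients
    optimizer_bytes: int = 8,  # Adam: two fp32 moments per parameter
) -> float:
    """Weights + gradients + optimizer state in decimal GB, excluding activations."""
    return num_params * (weight_bytes + grad_bytes + optimizer_bytes) / 1e9


print(training_memory_gb(7_000_000_000))  # 112.0
```

Under these assumptions a 7B model needs about 112 GB before any activations are stored, compared with 14 GB for its fp16 weights at inference time.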

Quantization reduces memory at the cost of precision. int8 quantization halves the memory compared to fp16, often with small accuracy loss for inference (degradation under 1% is commonly reported, though results vary by model and method). 4-bit quantization (NF4, GPTQ) halves memory again and is increasingly popular for deploying large language models on consumer GPUs. The tradeoff between precision and accuracy depends on the model architecture, the task, and the quantization method: vision models are often reported to be more robust to quantization than language models.

Architecture Comparison

In computer vision, the evolution of architectures shows a clear trend toward higher efficiency: achieving better accuracy with fewer parameters and FLOPS. VGG-16 (2014) uses 138M parameters, with the vast majority (about 124M) in its three fully connected layers. ResNet-50 (2015) achieves better accuracy with only 25.6M parameters by replacing the large fully connected layers with global average pooling followed by a single classification layer, and by using residual connections. EfficientNet-B0 (2019) achieves ResNet-50 level accuracy with just 5.3M parameters through neural architecture search and compound scaling.

In NLP, the trend is toward larger models rather than more efficient ones. BERT-base (2018) has 110M parameters, GPT-2 (2019) ranges from 124M to 1.5B, GPT-3 (2020) has 175B, and the largest reported models, typically sparse mixture-of-experts designs, exceed 1 trillion parameters. However, distillation and pruning techniques can compress these models by 2-10x with minimal accuracy loss. DistilBERT retains 97% of BERT's language understanding performance with 40% fewer parameters. TinyBERT achieves similar accuracy with 7.5x fewer parameters through task-specific distillation.

Practical Tips for Architecture Selection

Frequently Asked Questions

How do I count parameters in a neural network?

Dense layers: I*O + O. Conv2D: K*K*C_in*C_out + C_out. LSTM: 4*(I*H + H*H + H). Multi-Head Attention: 4*(D*D + D). BatchNorm/LayerNorm: 2*features. Sum all layer parameters for the total count.

What is the difference between parameters and FLOPS?

Parameters measure model size (number of trainable weights, determining memory). FLOPS measure computational cost per forward pass (determining speed). A model can have few parameters but high FLOPS or vice versa. Both affect practical deployment constraints.

How much GPU memory does a model need?

For inference: parameters * bytes_per_param (4 for fp32, 2 for fp16, 1 for int8). For training, add gradients (same size), optimizer states (8 bytes/param for Adam), and activations (scales with batch size). A 7B model needs ~14 GB for fp16 inference and ~112 GB for fp32 training with Adam, before activations.

How do FLOPS scale with model size?

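For Transformers, forward-pass FLOPS per token are approximately 2 * num_parameters, because each weight takes part in one multiply and one add. The attention score computation is extra and scales quadratically with sequence length. For CNNs, FLOPS depend on spatial resolution, kernel size, and channels. Within one architecture family, larger models require proportionally more compute per sample.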
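The rule of thumb for Transformers can be sketched as follows. The function name is hypothetical, and the estimate covers the forward pass only; it ignores the sequence-length-dependent attention term and treats every parameter as taking part in one multiply-add per token, which overcounts embedding tables.

```python
def transformer_forward_flops(num_params: int, num_tokens: int = 1) -> int:
    """Approximate forward-pass FLOPS: about 2 operations per parameter per token."""
    return 2 * num_params * num_tokens


print(transformer_forward_flops(124_000_000))        # 248000000 for one token
print(transformer_forward_flops(124_000_000, 1024))  # 253952000000 for 1024 tokens
```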

How do popular architectures compare in complexity?

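VGG-16: 138M params, 15.5 GFLOPS. ResNet-50: 25.6M params, 4.1 GFLOPS. EfficientNet-B0: 5.3M params, 0.4 GFLOPS. These commonly quoted figures are for a 224x224 input and count one multiply-accumulate as a single operation; under the two-operations-per-weight convention used above, they roughly double. BERT-base: 110M params. GPT-2: 124M-1.5B params. Modern architectures achieve much better accuracy/parameter ratios than older ones.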

About the Author

Michael Lip builds open-source ML tools and developer utilities at zovo.one. ml0x is part of the Zovo Tools network, a collection of free, privacy-first tools for developers and data scientists. No tracking, no accounts required, no data leaves your browser.

Last updated: May 25, 2026