Attention Mechanism Visualizer

Enter a sentence and watch how a transformer assigns attention weights between every pair of tokens. Explore self-attention (how each token attends to all others in the same sequence) and cross-attention (encoder-decoder interaction). The heatmap updates in real time. All computation runs locally in your browser with zero server calls.

Input & Configuration

Stats: Tokens | Avg Attention Entropy | Max Attention Weight | Most Attended Token

Attention Heatmap

Click "Visualize Attention" to generate the heatmap.
Row = query token (attends FROM). Column = key token (attends TO). Color scale: 0 → 1.

Top Attention Arcs (Selected Token)

Click a token in the heatmap row labels to inspect its outgoing attention.

What Is the Attention Mechanism?

The attention mechanism is the core building block of transformer models, the architecture introduced in the landmark paper "Attention Is All You Need" (Vaswani et al., 2017). Attention itself predates that paper, where it was used alongside recurrent encoder-decoder networks; the transformer dispenses with recurrence and builds on attention instead. It allows every position in a sequence to directly attend to every other position in a single operation, overcoming the sequential bottleneck of recurrent architectures. The mechanism computes a weighted sum of value vectors, where the weights are determined by the compatibility between query and key vectors.

The fundamental computation is: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Here Q (queries), K (keys), and V (values) are linear projections of the input, and d_k is the dimension of the key vectors. Each entry of QK^T is the dot product of one query with one key, a score for how much that query should attend to that key. Dividing by sqrt(d_k) keeps the scores from growing with the dimension, which would otherwise push softmax into saturation. The softmax is applied to each row, converting that query's scores into a probability distribution over keys, and each output row is the corresponding weighted sum of value vectors.
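The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not the visualizer's own code: the function names and the toy Q, K, and V matrices are made up for the example, and a real model would obtain Q, K, and V from learned projections.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for list-of-lists matrices.

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v.
    Returns (output, weights), where weights is n_q x n_k.
    """
    d_k = len(K[0])
    weights = []
    for q in Q:
        # One scaled score per key for this query.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # Each output row is a weighted sum of the value vectors.
    output = [
        [sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))]
        for row in weights
    ]
    return output, weights

# Toy inputs (illustrative numbers only).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = scaled_dot_product_attention(Q, K, V)
```

Here `weights` plays the role of the heatmap: one row per query, one column per key, with every row summing to 1.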

Self-Attention vs Cross-Attention

In self-attention, the queries, keys, and values all come from the same sequence. This allows the model to build rich contextual representations by letting each token gather information from every token in the sequence, itself included. For example, in the sentence "The bank by the river was steep," the word "bank" can attend strongly to "river" to resolve its meaning. Self-attention is used in both the encoder and decoder of standard transformers (in the decoder it is masked so that a position cannot attend to later positions), and exclusively throughout encoder-only models like BERT.

In cross-attention, queries come from one sequence (typically the decoder) while keys and values come from another (typically the encoder output). This is how a translation model connects source and target languages: each position in the translation can selectively attend to relevant positions in the source sentence. The cross-attention heatmap reveals which source tokens the model focuses on when generating each target token.
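The difference between the two modes shows up in the shape of the weight matrix. The sketch below is illustrative only: the state vectors are made-up numbers, `attention_weights` is a hypothetical helper, and the learned query/key projections are omitted for brevity.

```python
import math

def attention_weights(queries, keys):
    """Row-wise softmax of scaled dot products: one row per query."""
    d_k = len(keys[0])
    rows = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

# Hypothetical hidden states: 3 source (encoder) tokens, 2 target (decoder) tokens.
encoder_states = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
decoder_states = [[1.0, 0.0], [0.0, 1.0]]

# Self-attention: queries and keys come from the same sequence (3 x 3).
self_weights = attention_weights(encoder_states, encoder_states)

# Cross-attention: queries from the decoder, keys from the encoder (2 x 3).
cross_weights = attention_weights(decoder_states, encoder_states)
```

A self-attention heatmap is therefore always square, while a cross-attention heatmap has one row per target token and one column per source token.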

Multi-Head Attention

Rather than performing a single attention function, transformers use multi-head attention: running h parallel attention operations with different learned projections, then concatenating the results. Different heads tend to specialize. In practice, some heads capture syntactic dependencies (subject-verb agreement), others capture semantic relationships (coreference, synonymy), others focus on positional proximity, and some develop more abstract patterns. This visualizer simulates four such heads with different characteristic behaviors.
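As a rough sketch of the project, attend, concatenate pipeline: the code below uses seeded random matrices as stand-ins for learned projection weights, and every name in it (`multi_head_attention`, `W_O`, `random_matrix`, and so on) is chosen for illustration. The final output projection follows the original transformer design. This is not how the visualizer simulates its four heads.

```python
import math
import random

def matmul(A, B):
    """Multiply two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; returns one output row per query."""
    d_k = len(K[0])
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        out.append([sum(wi * v[c] for wi, v in zip(w, V)) for c in range(len(V[0]))])
    return out

def multi_head_attention(X, heads, W_O):
    """Run one attention per head, concatenate per token, then project."""
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for (W_q, W_k, W_v) in heads
    ]
    # Concatenate the head outputs along the feature dimension.
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)          # random weights stand in for learned ones
d_model, num_heads = 4, 2
d_head = d_model // num_heads
heads = [
    tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
    for _ in range(num_heads)
]
W_O = random_matrix(num_heads * d_head, d_model, rng)
X = random_matrix(3, d_model, rng)   # 3 tokens, each a d_model vector
Y = multi_head_attention(X, heads, W_O)
```

Because each head has its own projections, each head produces its own attention pattern over the same tokens, which is why a per-head heatmap can look very different from head to head.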

Reading the Heatmap

Each cell (i, j) in the heatmap shows how much token i attends to token j. Brighter green indicates higher attention weight. A bright diagonal means each token attends strongly to itself (common in early layers). Bright cells far from the diagonal reveal long-range dependencies. A nearly uniform row means the token is distributing attention evenly across the sequence rather than focusing on anything in particular (high entropy). A very peaked row means the token attends strongly to one specific token (low entropy, high focus).

The attention entropy metric in the stats row measures how concentrated or diffuse the attention distribution is on average. Lower entropy means sharper, more focused attention. Higher entropy means the model is looking broadly at many tokens simultaneously. Temperature controls softmax sharpness: lower temperature produces sharper (more peaked) attention distributions.
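One common way to quantify this is the Shannon entropy of each row, averaged over rows. The sketch below assumes natural-log entropy (nats) and hypothetical function names; the visualizer may use a different base or normalization.

```python
import math

def row_entropy(weights):
    """Shannon entropy (in nats) of one attention row; 0 * log(0) is treated as 0."""
    return -sum(w * math.log(w) for w in weights if w > 0)

def average_attention_entropy(matrix):
    """Mean row entropy over all query tokens."""
    return sum(row_entropy(row) for row in matrix) / len(matrix)

uniform_row = [0.25, 0.25, 0.25, 0.25]   # attention spread evenly
peaked_row = [0.97, 0.01, 0.01, 0.01]    # attention focused on one token

h_uniform = row_entropy(uniform_row)
h_peaked = row_entropy(peaked_row)
h_average = average_attention_entropy([uniform_row, peaked_row])
```

For a row over n tokens the entropy ranges from 0 (all weight on one token) to log(n) (perfectly uniform), so the uniform row above reaches the maximum of log(4).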

Practical Implications of Attention Patterns

Analyzing attention weights can reveal what a model has learned. Probing studies have shown that specific BERT heads capture syntactic structure like dependency parsing arcs, that some heads track sentence boundaries, and that certain heads specialize in tracking which pronouns refer to which entities. However, attention weights are not a direct explanation of model predictions: high attention to a token does not necessarily mean that token causally influenced the output. Gradient-based attribution methods are generally regarded as a more principled basis for explanation, but attention visualizations remain valuable for building intuition about transformer behavior.

For practitioners, understanding attention patterns helps with debugging (why is the model confused here?), architecture decisions (how many heads are needed?), and pruning (which heads can be removed without hurting performance?). Research has found that many attention heads in large pretrained transformers are redundant and can be pruned with minimal accuracy loss.

Frequently Asked Questions

What does the attention heatmap actually show?

Each row represents a query token (the token that is attending) and each column represents a key token (the token being attended to). The brightness of cell (i, j) shows how much of its attention token i allocates to token j. Each row sums to 1.0 because the softmax output is a probability distribution. A bright cell means strong focus; a dim cell means little to no attention.

Why do some heads show diagonal patterns?

A bright diagonal means each token attends primarily to itself. This is common in early transformer layers and in certain specialized heads. It can indicate that the token's representation at that layer does not strongly depend on context. In contrast, off-diagonal bright cells indicate that a token is drawing information from distant positions in the sequence, which is what makes transformers powerful for capturing long-range dependencies.

What is attention entropy and why does it matter?

Attention entropy measures how spread out (uniform) or concentrated (peaked) the attention distribution is for a given query token. High entropy means the token is attending roughly equally to many tokens and is not focusing on any specific context. Low entropy means the token is strongly focused on one or a few specific tokens. Heads involved in precise reference resolution (like coreference) tend to show lower entropy, while heads that integrate broad context tend to show higher entropy.

How does temperature affect attention?

Temperature divides the logits before softmax. Lower temperature (e.g., 0.5) sharpens the distribution: the highest-scoring key gets an even higher probability and others get suppressed, producing more focused attention. Higher temperature (e.g., 2.0) flattens the distribution, so attention spreads more evenly across tokens. In real transformers, the 1/sqrt(d_k) scaling factor acts like a fixed temperature of sqrt(d_k). The temperature slider here lets you explore how this parameter shapes attention behavior.
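The effect is easy to reproduce with a temperature-scaled softmax. This is a minimal sketch with made-up scores and a hypothetical function name, not the slider's actual implementation.

```python
import math

def softmax_with_temperature(scores, temperature=1.0):
    """Softmax of scores / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]                              # illustrative raw scores
sharp = softmax_with_temperature(scores, 0.5)         # lower temperature
base = softmax_with_temperature(scores, 1.0)
flat = softmax_with_temperature(scores, 2.0)          # higher temperature
```

Comparing `max(sharp)`, `max(base)`, and `max(flat)` shows the top key's share of attention shrinking as the temperature rises, while each distribution still sums to 1.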

Can I use attention weights to explain model predictions?

Attention weights are often misinterpreted as explanations. They show where the model looks, but not what it uses. Research (Jain & Wallace, 2019; Wiegreffe & Pinter, 2019) has debated whether attention is explanation. The broad takeaway is that attention can provide useful intuition but is not a reliable causal explanation on its own. Dedicated attribution methods, such as the gradient-based Integrated Gradients or the Shapley-value-based SHAP, provide more principled attribution. Use attention visualizations for exploration and hypothesis generation, not for definitive explanations.

About the Author

Michael Lip builds open-source ML tools and developer utilities at zovo.one. ml0x is part of the Zovo Tools network, a collection of free, privacy-first tools for developers and data scientists. No tracking, no accounts required, no data leaves your browser.

Last updated: May 28, 2026