The Attention Mechanism Explained Visually
Attention is the core innovation behind transformers. Let me break it down without the math jargon.
The Intuition
Imagine reading a sentence: "The cat sat on the mat because it was tired."
When you read "it," your brain instantly connects it to "cat." That's attention—selectively focusing on relevant parts of the input.
How It Works
```text
Input: "The cat sat on the mat"
            ↓
  [Query]  [Key]  [Value]
            ↓
     Attention Scores
            ↓
     Weighted Output
```
Step 1: Create Q, K, V
Each word gets three representations, produced by multiplying its embedding with learned weight matrices (see the sketch after this list):
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information do I provide?"
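Here's a minimal NumPy sketch of this step. Everything in it is a stand-in: the embeddings `X` and the projection matrices `W_q`, `W_k`, `W_v` are random toy values just to make the shapes concrete; in a real model they're learned.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_k = 6, 8, 8          # 6 tokens: "The cat sat on the mat"
X = rng.normal(size=(seq_len, d_model))  # toy token embeddings

# Learned projection matrices (randomly initialized stand-ins here)
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q   # "What am I looking for?"
K = X @ W_k   # "What do I contain?"
V = X @ W_v   # "What information do I provide?"
```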
Step 2: Calculate Scores
```python
from scipy.special import softmax

scores = Q @ K.T / np.sqrt(d_k)               # scaled dot product
attention_weights = softmax(scores, axis=-1)  # each row sums to 1
output = attention_weights @ V                # weighted sum of values
```

The division by sqrt(d_k) keeps the dot products from growing as the vectors get longer; without it, softmax saturates and the weights collapse toward one-hot.
Step 3: Weighted Combination
The output is a weighted sum of values, where weights come from how well queries match keys.
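To see what "weighted sum" means concretely, here's a toy calculation with hand-picked weights (illustrative numbers, not real model outputs):

```python
import numpy as np

values = np.array([[1.0, 0.0],    # value vector for "cat"
                   [0.0, 1.0],    # value vector for "the"
                   [0.5, 0.5]])   # value vector for "mat"
weights = np.array([0.7, 0.2, 0.1])  # attention weights, sum to 1

output = weights @ values  # 0.7*cat + 0.2*the + 0.1*mat
print(output)              # [0.75 0.25]
```

The output lands closest to "cat" because that's where the weight mass is.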
Multi-Head Attention
Instead of one attention operation, transformers use multiple "heads":
```text
Head 1: Focuses on syntax
Head 2: Focuses on semantics
Head 3: Focuses on position
...
```
This lets the model capture different types of relationships simultaneously.
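Here's a minimal NumPy sketch of the standard wiring, reusing the scaled dot-product attention from Step 2. The loop over heads is for readability (real implementations batch it with tensor reshapes), and `W_o` is the usual learned output projection, named by me:

```python
import numpy as np
from scipy.special import softmax

def attention(Q, K, V):
    """Scaled dot-product attention, as in Step 2."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Project once, split into heads, attend per head, then merge."""
    d_model = X.shape[-1]
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        heads.append(attention(Q[:, cols], K[:, cols], V[:, cols]))
    # Concatenate head outputs and mix them with the output projection
    return np.concatenate(heads, axis=-1) @ W_o
```

Applied to the `X`, `W_q`, `W_k`, `W_v` from Step 1 with `num_heads=2`, each head attends in its own 4-dimensional slice of the projections; that independence is what frees the heads to specialize.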
Why It Matters
Attention enables:
- Long-range dependencies (which RNNs struggle to carry across many steps)
- Parallel processing of all tokens at once (unlike sequential models)
- Interpretability (you can visualize what attends to what, as sketched below)
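That last point is easy to demonstrate: the attention weights are just a matrix, so a heatmap shows who attends to whom. This sketch assumes the `attention_weights` matrix and the six-token sentence from the earlier steps:

```python
import matplotlib.pyplot as plt

tokens = ["The", "cat", "sat", "on", "the", "mat"]

fig, ax = plt.subplots()
ax.imshow(attention_weights, cmap="viridis")  # rows: queries, cols: keys
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens)
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)
ax.set_xlabel("Attended-to token (key)")
ax.set_ylabel("Attending token (query)")
plt.show()
```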
Practical Example
In translation, when generating "chat" (French for "cat"), attention spreads roughly like this:
- High attention on "cat" (semantic match)
- Some attention on "the" (article agreement)
- Low attention on "mat" (irrelevant)
This selective focus is what makes modern NLP work.