Attention Is All You Need — The Paper That Rewired AI
The 2017 paper that killed RNNs, invented the Transformer, and launched the modern AI era — explained for beginners and intermediates with 11 original manga panels.
The seven-word title that ended one era of AI and launched another. A beginner-friendly, technically honest tour through the paper that birthed every LLM you've ever heard of.

The Dark Ages of Language AI
To understand why "Attention Is All You Need" was a thunderclap, you first need to appreciate how painful life was before it. Cast your mind back to 2016. AI researchers around the world were working incredibly hard on language problems — translation, summarisation, question answering — but they were doing so with a fundamental handicap baked into their tools.
The dominant models at the time were Recurrent Neural Networks (RNNs) and their smarter cousin, the Long Short-Term Memory network (LSTM). Both were designed to handle sequences: text goes in word by word, the model builds up a hidden memory state as it reads, and produces an output at the end.
The intuition seems sensible. After all, we read left to right (in English). Why shouldn't a machine? The problem, as we'll see, was catastrophic at scale.
How an RNN Actually Works
Imagine you're a gold fish with a tiny little notepad. Every time you read a new word, you scribble something on your notepad, then erase half of it to make room for the next word. By the time you reach the end of a 500-word paragraph, your notepad is a smeared mess of partial impressions. That's an RNN.
More precisely: an RNN processes tokens one at a time, left to right. At each step, it combines the current word's embedding with a hidden state vector (its "memory" from all previous words) and produces a new hidden state. That hidden state is passed forward to the next step.
Information degrades as it passes along the chain. Early words become "forgotten."
LSTMs: A Better Notepad, Same Problem
LSTMs (invented by Hochreiter & Schmidhuber in 1997) were the RNN's upgrade. Instead of one hidden state, they have three "gates" — input, forget, and output — plus a separate "cell state" that acts as a longer-term memory. They were genuinely better at remembering things across longer sequences.
But LSTMs didn't solve the core architectural problem. They still processed one token at a time, sequentially. And at massive scale, that was the killer.

Three Problems That Crippled the Old Models
Before we get to the solution, let's be precise about the pain. The pre-Transformer era had three interconnected crises, and solving any one of them would have been significant. The Transformer paper solved all three simultaneously.
Problem 1: Sequential Bottleneck
RNNs and LSTMs process tokens one at a time. Step 2 cannot begin until Step 1 finishes. This means you cannot parallelize training across GPU cores. Training was agonisingly slow — weeks or months for large models.
Problem 2: Vanishing Gradients
When you train a neural network with backpropagation, you compute gradients (error signals) and push them backwards through the chain. In a long sequence, those gradients shrink exponentially as they travel. Early tokens barely learn anything.
Problem 3: Long-Range Amnesia
In the sentence "The trophy didn't fit in the suitcase because it was too big" — what does "it" refer to? The trophy. A human knows instantly. An RNN processing hundreds of words between "trophy" and "it" often forgot the connection entirely.
The Telephone Game at Scale
The vanishing gradient problem is best understood through the "telephone game" (Chinese Whispers). You whisper a sentence to the first person in a chain. By the time it reaches the 20th person, the message is garbled beyond recognition. In an RNN, the gradient signal is that whisper — and long sequences were destroying it.
LSTMs reduced the garbling with their gating mechanisms, but didn't eliminate it. And crucially, every single token in the sequence still had to wait for the one before it to finish processing. At a time when researchers were starting to dream about training on billions of words, this was a scaling cliff.

A Spark Before the Fire — Bahdanau Attention (2014)
To be accurate about the history: the 2017 paper didn't invent attention from scratch. In 2014, Dzmitry Bahdanau and colleagues published a paper that added an "attention mechanism" on top of existing encoder-decoder RNNs for machine translation.
The idea was elegant: when generating each output word, instead of squishing the entire input sentence into one fixed-size vector, the model learns to "look back" at different parts of the input and assign weights — attention scores — to each input word. Generate "Hund" in German? Pay more attention to "dog" in the English source.
Bahdanau et al. (2014) showed attention worked. But they bolted it on top of RNNs — the sequential backbone was still there, just with a better look-back mechanism. It was like putting a turbocharged engine in a horse-drawn carriage.
The 2017 breakthrough came when Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin at Google Brain asked a radical question: what if we got rid of the carriage altogether?
"We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely." — Vaswani et al., Attention Is All You Need (2017)
June 2017 — The Paper Drops
Eight Google Brain researchers quietly uploaded a pre-print to arXiv on June 12, 2017. The title was almost cheeky — "Attention Is All You Need" — a pun on The Beatles' "All You Need Is Love" and a direct challenge to the field: attention mechanisms alone are sufficient. No recurrence. No convolutions. Just attention.
The abstract was direct. They proposed the Transformer architecture, showed it achieved state-of-the-art on English-to-German translation (28.4 BLEU, surpassing the previous best by more than 2 BLEU points), trained it in a fraction of the time, and made a claim that would prove prophetic: this architecture was far more parallelisable and required significantly less time to train.
At the time, the machine learning community took notice but didn't immediately grasp the full magnitude. It looked like a better translation model. What it actually was, in retrospect: the foundation of every major AI system built in the next decade.

Self-Attention — Every Word Watches Every Word
This is the heart of it. Everything else in the Transformer paper is (brilliant) supporting machinery. Self-attention is the engine.
Here's the core idea in plain English: when processing any word, the model looks at every other word in the sentence simultaneously and calculates a relevance score. Instead of passing information along a chain one step at a time, every token "talks to" every other token in parallel.
The Library Analogy
Imagine a library where every book can send a little messenger to every other book, asking: "Hey, are you relevant to me?" Each pair of books gives an answer — a number from 0 to 1. The books with higher scores get to "share" more of their information when the library compiles its final report.
In the sentence "The cat sat on the mat because it was tired" — when the model processes "it", the self-attention mechanism computes a score between "it" and every other word. The word "cat" gets a very high score (because "it" refers to the cat), while words like "the" and "on" get low scores. This is done in one parallel operation — no sequential chain required.
The Math Behind It (Don't Panic)
The paper formalises this with three vectors derived from each word's embedding: a Query (Q), a Key (K), and a Value (V). Think of it like a search engine:
Query (Q): "What am I looking for?" — what the current word wants to know.
Key (K): "What do I offer?" — what each other word has to advertise.
Value (V): "Here's my actual content" — what each word shares if chosen.
Attention score = softmax( Q · Kᵀ / √d_k ) · V
The division by √d_k (square root of the dimension) is a stabilising trick — without it, the dot products can get very large and the softmax function becomes extremely "peaky" (everything goes to one word), which hurts training. The softmax then converts raw scores into a probability distribution — so all the weights add up to 1.0.
"it" pays most attention to "cat" (0.72 weight) — this is how the model resolves co-reference. Higher = brighter.
The critical breakthrough isn't just that these scores are computed — it's how: all pairs are computed in parallel using matrix multiplication. A sentence of 512 tokens doesn't require 512 sequential steps. It requires one big matrix operation that modern GPUs execute extremely fast. This is the parallelisation breakthrough that made scaling possible.

Multi-Head Attention — Many Perspectives at Once
Here's where the paper goes from clever to ingenious. A single self-attention computation gives you one view of how words relate. But language is rich — words relate to each other in many different ways simultaneously.
Consider the sentence "She gave him the book she wrote":
— "she" and "him" have a grammatical subject/object relationship
— "she" (first occurrence) and "she" (second) have a co-reference relationship
— "book" and "wrote" have a semantic relationship (you write books)
— "gave" and "book" have a verb-object relationship
One attention head would have to pick one of these. Multi-head attention runs several attention computations in parallel, each in a different "subspace" of the representation. The results are then concatenated and projected back to the original dimension.
The original Transformer used 8 attention heads. Modern LLMs like GPT-3 use 96, and models like Claude use even more. Each head develops its own specialisation during training — not by design, but emergently, because the model learns that different heads can capture different useful patterns.
Multi-head attention is like having a team of editors review your essay simultaneously. One editor focuses on grammar, one on logical flow, one on vocabulary, one on argument structure. You get all their feedback at once, then synthesise it. No editor has to wait for the previous one to finish.

Positional Encoding — Teaching Order Without Recurrence
Here's a subtle but critical problem. In an RNN, word order is implicit — you literally process word 1, then word 2, then word 3. The order is baked into the architecture. But in a Transformer, all words are processed in parallel. If you showed it "Dog bites man" and "Man bites dog" simultaneously, the attention mechanism alone would see the same set of words and might produce the same result.
That's obviously catastrophic for language. "The bank by the river" and "river by the bank the" mean very different things.
The solution: Positional Encoding. Before feeding word embeddings into the Transformer, you add a unique positional signal to each one. The paper uses a clever combination of sine and cosine functions at different frequencies:
PE(pos, 2i) = sin(pos / 10000^(2i/d))PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Where pos is the word's position and i is the dimension. The result: each position gets a unique, smooth vector that the model can learn to interpret. The sine/cosine waves at different frequencies are like a musical chord unique to each seat in the stadium.
Why sinusoids and not just the number 1, 2, 3...? Because sinusoids generalise. They allow the model to learn relative positions (word 2 is one step after word 1) not just absolute ones. And they handle sequences longer than those seen in training gracefully, because the wave patterns extend naturally.
Modern variants like RoPE (Rotary Position Embedding, used in Llama and GPT-NeoX) and ALiBi have since improved on the original scheme — but they're all descendants of this 2017 insight.

The Full Transformer — Encoder + Decoder
The original paper was designed for sequence-to-sequence tasks — specifically machine translation. The architecture has two halves that work together: an Encoder that reads the input (the English sentence) and a Decoder that generates the output (the German translation).
The Encoder
The encoder stack (6 identical layers in the original paper) processes the entire input sentence in parallel. Each layer has two sub-components: (1) multi-head self-attention (all words attend to all other words) and (2) a feed-forward neural network applied to each position independently. Both sub-components use residual connections (the input is added back to the output) and layer normalisation — both stability tricks borrowed from computer vision.
The Decoder
The decoder is similar but has three sub-components per layer. The first is masked self-attention — like encoder self-attention, but masked so that when generating word N, the model can only attend to words 1 through N-1 (it can't cheat by looking at future words). The second is cross-attention — the decoder attends to the encoder's output, connecting the input sentence to the generation process. The third is the same feed-forward network as in the encoder.
BERT (2018) uses only the encoder — great for understanding tasks (classification, named entity recognition). GPT-1/2/3/4 use only the decoder — great for generation tasks (writing, code, conversation). The full encoder-decoder design lives on in models like T5 and BART, used heavily for translation and summarisation.
One more key ingredient: Feed-Forward layers. After each attention block, every position's representation passes through a small, identical 2-layer neural network. In the original paper, the inner dimension of this network was 2048 — 4× the model's embedding dimension of 512. In GPT-3, it's 4× 12,288 = 49,152. These layers are believed to act as "fact storage" — where knowledge learned during training gets encoded.

The Impact Timeline — 2017 to Now
The Transformer paper wasn't just a research curiosity. It was a platform. Within a year, the entire field had pivoted. Within five years, it had generated a trillion-dollar industry. Here's the direct lineage:

Is It The Most Important AI Paper Ever Written?
Let's be honest about this. The question is fascinating precisely because it's not entirely settled — there are serious candidates.
The Case For: Yes, Unambiguously
No single paper has had a more direct and immediate commercial and scientific impact in the modern AI era. Every frontier LLM in existence today is a Transformer. The trillion-dollar AI industry of the mid-2020s is built on this foundation. With 200,000+ citations, it's a runaway leader in citation counts for an ML paper. The research it unlocked — in text, image (ViT), audio (Whisper), protein structure (AlphaFold 2), video (Sora), code (Copilot) — spans essentially every domain of AI.
The Case For Other Contenders
Backpropagation (Rumelhart et al., 1986) — Without the ability to train neural networks at all, there's nothing to build on.
ImageNet + AlexNet (Krizhevsky et al., 2012) — The moment deep learning proved itself to the world, launching the modern deep learning era.
Word2Vec (Mikolov et al., 2013) — Showed that word embeddings encode semantic meaning; a prerequisite for Transformer input representations.
Scaling Laws (Kaplan et al., 2020) — Proved that LLM capabilities grow predictably with compute and data, enabling the investment thesis behind GPT-3 and everything after.
RLHF (Christiano et al., 2017) — The alignment technique that turned raw LLMs into assistants humans actually want to use.
The honest verdict: In the specific context of modern generative AI — LLMs, multimodal models, and the AI products billions of people use daily — "Attention Is All You Need" is the clearest single point of origin. Without backprop it couldn't exist, but without this paper, it wouldn't have become what it is. It's the right answer to the question "which paper made today's AI possible?"
"We are all standing on the shoulders of eight people who asked: what if recurrence isn't actually necessary?" — A reasonable paraphrase of the entire modern AI research community

Where Do We Go From Here?
The Transformer is dominant but not invincible. Researchers are actively working on what comes next — and several serious challengers are emerging.
The Current Limitations
Self-attention has a quadratic complexity problem. If your sequence has N tokens, the attention matrix is N × N. Double the sequence length, quadruple the compute. For long documents — books, codebases, hours of audio — this becomes brutally expensive. The context window you experience in Claude or GPT-4 represents enormous engineering effort to extend what was originally a very limited range.
What's Being Explored
Mamba / State Space Models (SSMs) process sequences in linear time, not quadratic — a genuine architectural alternative that some researchers believe could eventually rival or exceed Transformers for long-context tasks. Flash Attention (Dao et al., 2022) is an algorithmic optimisation that makes standard attention dramatically more memory-efficient without changing the math. Mixture of Experts (MoE) architectures — used in GPT-4 and Gemini — activate only a subset of parameters per token, allowing models with trillions of total parameters to run at the cost of a much smaller model.
Multimodality is the frontier. The Transformer's attention mechanism generalises naturally to images (patch tokens), audio (spectrogram tokens), video (frame tokens), and structured data. A single Transformer can in principle process all of these simultaneously — and models like GPT-4o and Gemini Ultra are moving rapidly in this direction.
The question researchers are now asking: is intelligence primarily a function of architecture, or of scale and data? The scaling laws suggest it's mostly the latter. If that's true, the Transformer need not be dethroned — it just needs to be fed more.
We are, by most accounts, somewhere in the middle of the most important technological transition in human history. Every model at the frontier — the ones writing code, passing medical exams, generating video — shares a common ancestor. Eight researchers. One arXiv pre-print. Twelve hundred lines of Python. June 12, 2017.

The Seven Words That Changed Everything
We started this piece in 2016, watching RNNs and LSTMs struggle through their sequential chains, watching gradient signals vanish like whispers in a long corridor. We watched researchers work incredibly hard to coax these architectures to handle longer contexts, more complex language, bigger training sets — and hit wall after wall.
Then eight people asked a simple question — what if you just... paid attention? — and rewired the entire field.
The Transformer is not magic. It's mathematics: query-key-value lookups, scaled dot-product attention, layer normalisation, residual connections, feed-forward networks. Every piece is graspable. The genius was in the combination — and in the willingness to abandon the assumption that sequences must be processed sequentially.
The models you interact with today — the ones that draft your emails, explain your code, answer your questions about transformer architecture with exhaustive detail — are all running on this foundation. Claude (the AI that helped outline this post) is a Transformer. GPT-4 is a Transformer. Grok is a Transformer. Gemini is a Transformer. They are, in the deepest technical sense, all direct descendants of that arXiv upload.
Understanding "Attention Is All You Need" is not just historical curiosity. It's the grammar of modern AI. Once you understand it, you have a lens through which almost everything in the field makes sense — the scaling laws, the context window debates, the encoder vs. decoder architecture choices, the multimodal experiments, the efficiency research.
The paper is free. The arXiv link still works. It's 15 pages and reads more clearly than most ML papers. If this post piqued your curiosity: go read it. You'll understand it now.
Test what you read
Quick quiz

About the Author
Ajay Walia
AI {IT Architect} focusing on local-first multi-agent AI engineering, zero-data-egress systems. Ideator, Creator and Executor on Curious Bit.
Keep Reading

I Built, My Own Screenshot App for macOS (No More Clunky Screenshots)
How I got fed up with macOS native screenshot chaos, missed Greenshot from my Windows days, and built a lightweight menu bar screenshot app using Swift and Claude.

I Built This Blog Without Writing a Single Line of Code (Almost)
A comic-style walkthrough of how I built curiousbit.netlify.app using Claude, Codex, Hugo, Tailwind, and Netlify — zero frontend experience required.

OpenAI Just Killed the Voice Assistant — And Built Something Far More Dangerous
GPT-Realtime-2 doesn't just answer questions — it reasons out loud, calls tools mid-sentence, and translates 70 languages live. The voice assistant era is over. The voice agent era has begun.