How AI Models Understand Language Using Self-Attention
This patent describes a neural network architecture, known as the Transformer, that uses a "self-attention" mechanism to process sequences of information, like words in a sentence, by weighing the importance of different parts of the input to generate an output sequence.
Original patent title: “Attention-based sequence transduction neural networks”
What this patent covers
The actual claim
This patent covers the Transformer, the neural network architecture behind most major AI language models built since 2018. The core idea: instead of reading text one word at a time, as the recurrent models that dominated before it did, the Transformer reads the entire input at once and works out which words matter to which other words. It does this through "self-attention": for every word in a sentence, it asks which other words it should pay attention to when interpreting that word. In "The bank is on the river," self-attention lets the model link "bank" to "river" and infer that it means a riverbank, not a financial institution. This happens for all words in parallel rather than sequentially. Concretely, each word produces a query, every word produces a key and a value, and how well a query matches each key determines how much of the corresponding value flows into that word's output. The model uses an encoder (which processes the input) and a decoder (which generates the output), with self-attention layers in both. Positional encoding tells the model where each word sits in the sequence, since attention on its own carries no notion of word order.
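To make self-attention concrete, here is a minimal sketch of a single attention head in plain Python. It follows the scaled dot-product formulation from the 'Attention Is All You Need' paper rather than the patent's claim language. The function names, the toy vectors, and the identity projection matrices are all assumptions chosen for illustration; a real model learns its projection matrices during training.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: one row per position (for example, one embedding per word).
    w_q, w_k, w_v: projection matrices (learned in a real model,
    hand-picked here for illustration).
    Returns (outputs, weights); weights[i][j] is how much position i
    attends to position j.
    """
    queries = matmul(x, w_q)
    keys = matmul(x, w_k)
    values = matmul(x, w_v)
    scale = math.sqrt(len(keys[0]))
    weights = []
    for q in queries:
        # Score this position's query against every position's key.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights.append(softmax(scores))
    # Each output row is a weighted average of the value rows.
    outputs = matmul(weights, values)
    return outputs, weights


# Three toy 2-dimensional "word" vectors; identity projections keep the
# arithmetic easy to follow by hand.
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, identity, identity, identity)
```

In this toy input, `weights[0]` shows the first vector attending equally to itself and to the third vector, and less to the second, because it overlaps with the first and third but not the second. The loop over queries is written out for readability; practical implementations compute all positions in one batched matrix product, which is where the parallelism comes from.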
What this patent does NOT cover
The boundaries
- Language models that process text sequentially, one word or token at a time, rather than in parallel (this rules out RNNs and LSTMs, the dominant architectures before the Transformer)
- Attention mechanisms where one sequence queries a different sequence: the claim is specific to self-attention within a single sequence, not cross-attention between encoder and decoder
- Convolutional approaches that look at fixed-size windows of text rather than the full sequence at once
- Models without an explicit encoder-decoder structure: the claim is directed at sequence-to-sequence tasks like translation, not at pure generation models
- Any architecture that does not explicitly compute queries, keys, and values as the mechanism for determining relevance between positions
These exclusions are unique to PatentBrief — derived from the actual claim language, not patent-office boilerplate.
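The self-attention versus cross-attention boundary in the list above comes down to where the queries come from. The sketch below is an illustration under simplifying assumptions: projection matrices are omitted, so each row acts as its own query, key, and value, and the `attention` helper and both toy sequences are hypothetical names and data, not anything taken from the patent.

```python
import math


def attention(query_rows, memory_rows):
    """Dot-product attention: each query row takes a weighted average of memory rows.

    Projection matrices are omitted to keep the sketch short, so the rows
    themselves act as queries, keys, and values.
    """
    width = len(memory_rows[0])
    scale = math.sqrt(width)
    outputs = []
    for q in query_rows:
        scores = [sum(a * b for a, b in zip(q, m)) / scale for m in memory_rows]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * m[d] for w, m in zip(weights, memory_rows)) for d in range(width)]
        )
    return outputs


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical 3-token input
decoder_states = [[0.5, 0.5], [1.0, 0.0]]              # hypothetical 2-token partial output

# Self-attention: queries, keys, and values all come from the same sequence.
self_out = attention(encoder_states, encoder_states)

# Cross-attention: queries come from one sequence, keys and values from another.
cross_out = attention(decoder_states, encoder_states)
```

The computation is identical in both calls; only the source of the queries differs. Self-attention returns one output per position of the sequence it reads, while cross-attention returns one output per query position, each built from the other sequence's rows.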
What made this novel
Before the Transformer, the prevailing approach in AI language modeling was to read text roughly the way people do: left to right, one word at a time, building up a "memory" of what came before. The problem is that this memory degrades. By the time a model reached the end of a long sentence, it had often lost much of the beginning. The Transformer dropped that recurrence entirely. Instead of a running memory of the sequence, it has attention over the whole input at once. Every word can see every other word simultaneously, and the model learns which relationships matter. This made training much faster (the computation for all positions can run in parallel rather than step by step) and made the model much better at long-range dependencies. The paper that introduced this architecture, 'Attention Is All You Need,' published in 2017, became one of the most cited papers in machine learning within a few years.
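Because attention over the whole input has no built-in sense of order, word position has to be supplied separately. One way to do that, the sinusoidal scheme described in 'Attention Is All You Need', can be sketched as follows. The function name and the sizes used are assumptions for illustration, and this is only one of several possible positional schemes.

```python
import math


def positional_encoding(num_positions, d_model):
    """Sinusoidal position vectors: one row per position, d_model values per row.

    Even indices use sine and odd indices use cosine; each sine/cosine pair
    shares a wavelength, and wavelengths grow geometrically across the vector.
    """
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # (i - i % 2) maps indices 0,1 -> 0; 2,3 -> 2; and so on,
            # so each sine/cosine pair uses the same frequency.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# Hypothetical sizes: 4 positions, 6-dimensional vectors.
encoding = positional_encoding(4, 6)
```

In the paper's design, these vectors are added to the word embeddings before the first attention layer, so two occurrences of the same word at different positions enter the model as different vectors.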
Where you've seen this
Real-world examples
Google Translate
ChatGPT
Bard
Microsoft Copilot
BERT (Bidirectional Encoder Representations from Transformers)
GPT-3 and subsequent GPT models
DALL-E (for text-to-image generation)
Why it matters
The bigger picture
If you've used ChatGPT, Claude, Gemini, Copilot, or most other AI language tools built after 2018, you've used something built on this architecture. The Transformer did more than improve language AI: it displaced the recurrent architectures that came before it. BERT (Google, 2018), GPT-2 (OpenAI, 2019), GPT-3 (2020), GPT-4, Claude, and Gemini are all Transformers. All of them trace back to the eight researchers, most of them at Google at the time, whose work underlies this patent, filed in June 2018. The accompanying paper was presented at the NIPS conference in 2017. Within about two years, Transformer-based models had overtaken the previous state of the art across most language tasks. Within five years they underpinned an industry valued in the hundreds of billions of dollars. The patent itself is owned by Google, which means Google holds IP on an architecture that OpenAI, Anthropic, Meta, Mistral, and many other AI labs build on, although, as the boundaries above note, models without an encoder-decoder structure fall outside what this claim covers.
Filed: June 28, 2018
Granted: October 22, 2019
Patent Journey
From filing to today
Patent filed: 2018
Patent granted: 2019 (1 year after filing)
Active today: 2026
Expires: 2038
PatentBrief Score
Impact Score: High impact
Citation count: 33/40 (moderately cited)
Claim breadth: 20/20 (very broad protection)
Recency: 10/20 (granted 5–10 years ago)
Assignee scale: 20/20 (major technology company)
PatentBrief Impact Score — based on citation count, claim breadth, recency, and assignee scale. Not a legal assessment.
The original legal language
Original claims
33 claims as filed with the patent office.