How AI Models Understand Language Using Self-Attention
This patent describes a neural network architecture, known as the Transformer, that uses a "self-attention" mechanism to process sequences of information, like words in a sentence, by weighing the importance of different parts of the input to generate an output sequence.
Patent Number: US 10452978
Status: Active
Filing Date: June 28, 2018
Grant Date: October 22, 2019
Expiration: ~June 2038 (estimated)
Claims: 33
Assignee: Google LLC
Inventors: Noam M. Shazeer, Aidan Nicholas Gomez, Lukasz Mieczyslaw Kaiser, Jakob D. Uszkoreit, Llion Owen Jones, Niki J. Parmar, Illia Polosukhin, Ashish Teku Vaswani
Citations: 44 forward · 35 backward
What it covers
This patent covers the Transformer, the neural network architecture behind most major AI language models built since 2018. The core idea: instead of reading text one word at a time, as the recurrent models that dominated earlier did, the Transformer reads the entire input at once and works out which words matter to which other words. It does this through "self-attention": for every word in a sentence, it asks which other words it should pay attention to when interpreting that one. In "The bank is on the river," self-attention lets the model link "bank" to "river" and infer that it means a riverbank, not a financial institution. This happens for all words in parallel rather than sequentially. The model uses an encoder (which processes the input) and a decoder (which generates the output), with self-attention layers in both. Because the model no longer reads left to right, a positional encoding is added to tell it where each word sits in the sequence.
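To make the positional encoding idea concrete, here is a minimal sketch in plain Python. It assumes the sinusoidal scheme described in the "Attention Is All You Need" paper, which is one possible choice and not necessarily the only one the patent contemplates. The function names, the 4-dimensional embeddings, and their values are hypothetical and chosen only for illustration.

```python
import math


def positional_encoding(num_positions: int, d_model: int) -> list[list[float]]:
    """Return one d_model-sized vector per position (sinusoidal scheme, assumed)."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency; even uses sin, odd uses cos.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(embeddings: list[list[float]]) -> list[list[float]]:
    """Add the position vector to each token embedding, element by element."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]


# Hypothetical embeddings: the same word appears at positions 0 and 2.
tokens = [
    [0.5, 0.1, -0.3, 0.8],
    [0.0, 0.2, 0.4, -0.1],
    [0.5, 0.1, -0.3, 0.8],
]
encoded = add_positions(tokens)

# The two copies of the word started out identical; after adding positions they differ.
print(encoded[0] != encoded[2])  # prints True
```

Without a signal like this, the two identical word vectors would be indistinguishable to an attention layer, so the model could not tell which occurrence came first.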
What it doesn't cover
- Language models that process text sequentially, one word or token at a time, rather than in parallel (this rules out RNNs and LSTMs, the dominant architectures before 2018)
- Attention mechanisms where one sequence queries a different sequence: this patent specifically covers self-attention within a single sequence, not cross-attention between encoder and decoder
- Convolutional approaches that look at fixed-size windows of text rather than the full sequence at once
- Models without an explicit encoder-decoder structure: the patent covers sequence-to-sequence tasks like translation, not pure generation models
- Any architecture that does not explicitly compute queries, keys, and values as the mechanism for determining relevance between positions
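The query/key/value computation named in the last item, and the difference between self-attention and cross-attention, can be sketched as follows. This is an illustrative sketch, not the patent's claim language: it assumes the scaled dot-product form from the accompanying paper, and the projection matrices `W_Q`, `W_K`, `W_V` and all input values are hypothetical (in a real model they are learned).

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project(rows: list[list[float]], weights: list[list[float]]) -> list[list[float]]:
    """Multiply each row vector by a weight matrix."""
    return [
        [sum(r[k] * weights[k][j] for k in range(len(weights)))
         for j in range(len(weights[0]))]
        for r in rows
    ]


def attention(queries, keys, values):
    """Scaled dot-product attention: one output vector per query."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of every key position to this query.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        all_weights.append(weights)
        # Output is the weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs, all_weights


# Hypothetical projection matrices and inputs, for illustration only.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0]]
W_V = [[0.5, 0.0], [0.0, 0.5]]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[0.0, 2.0]]

# Self-attention: queries, keys, and values all come from the same sequence.
self_out, self_weights = attention(
    project(encoder_states, W_Q),
    project(encoder_states, W_K),
    project(encoder_states, W_V),
)

# Cross-attention: queries come from one sequence, keys and values from another.
cross_out, cross_weights = attention(
    project(decoder_states, W_Q),
    project(encoder_states, W_K),
    project(encoder_states, W_V),
)

print(len(self_weights), len(self_weights[0]))    # one weight row per position
print(len(cross_weights), len(cross_weights[0]))  # one weight row per decoder query
```

The same `attention` function serves both cases; the only difference is which sequence supplies the queries. No step in the loop depends on the result for another query, which is why all positions can be computed in parallel.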
The clever bit
Before the Transformer, the dominant assumption in AI language modeling was that you had to read text the way humans do: left to right, one word at a time, building up a "memory" of what came before. The problem: that memory degrades. By the time a model got to the end of a long sentence, it had mostly forgotten the beginning. The Transformer threw that out entirely. It keeps no running memory of the sequence; it has attention over the whole thing at once. Every word can see every other word simultaneously, and the model learns which relationships matter. This made it dramatically faster to train (because computation is no longer constrained by sequential processing) and dramatically better at long-range dependencies. The paper that introduced this architecture, "Attention Is All You Need," published in 2017, became one of the most cited papers in machine learning within a few years.
Why it matters
If you've used ChatGPT, Claude, Gemini, Copilot, or almost any AI language tool built after 2018, you've used something built on this architecture. The Transformer didn't just improve language AI; it displaced most of what came before it. BERT (Google, 2018), GPT-2 (OpenAI, 2019), GPT-3 (2020), GPT-4, Claude, Gemini: all Transformers. All of them trace back to the eight researchers named on this patent, which was filed in June 2018. The accompanying paper was reportedly at risk of rejection at a top ML conference, and the authors had to push to get it accepted. Within two years of publication, the architecture had made the previous state-of-the-art models obsolete. Within five years it had created an entirely new industry worth hundreds of billions of dollars. The patent itself is owned by Google, which means Google holds IP on the architecture from which the models built by OpenAI, Anthropic, Meta, Mistral, and other AI labs descend.
Real-world examples
1. Google Translate
2. ChatGPT
3. Bard (now Gemini)
4. Microsoft Copilot
5. BERT (Bidirectional Encoder Representations from Transformers)
6. GPT-3 and subsequent GPT models
7. DALL-E (for text-to-image generation)
Generated by PatentBrief · Not legal advice · patentbrief.org
US 10452978 · 2026