PatentBrief

How AI Models Understand Language Using Self-Attention

This patent describes a neural network architecture, known as the Transformer, that uses a "self-attention" mechanism to process sequences of information, like words in a sentence, by weighing the importance of different parts of the input to generate an output sequence.

Granted 2019 · Active · Expires 2038 · Owned by Google LLC · Invented by Noam M. Shazeer, Aidan Nicholas Gomez, Lukasz Mieczyslaw Kaiser + 5 more

Original patent title: “Attention-based sequence transduction neural networks”

What this patent covers

The actual claim

This patent covers the Transformer, the neural network architecture behind most major AI language models built since 2018. The core idea: instead of reading text one word at a time, left to right, as the recurrent models that dominated earlier work did, the Transformer reads the entire input at once and works out which words matter to which other words. It does this through "self-attention": for every word in a sentence, it asks which other words it should pay attention to when interpreting that word. In "The bank is on the river," self-attention lets the model link "bank" to "river" and infer that it means a riverbank, not a financial institution. This happens for all words in parallel rather than sequentially. The model uses an encoder (which processes the input) and a decoder (which generates the output), with self-attention layers in both. Because the model no longer reads left to right, positional encoding tells it where each word sits in the sequence.
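The mechanism described above can be sketched in a few lines of plain Python. This is a toy illustration, not the patent's claimed implementation: the two-dimensional embeddings for "bank", "is", and "river" are invented for the example, the query/key/value projection matrices are set to the identity (a real model learns them), and the positional encoding is the sinusoidal scheme from the 'Attention Is All You Need' paper, which this page does not say the claims require.

```python
import math

def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]

def self_attention(embeddings, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a single sequence."""
    queries = [matvec(w_q, x) for x in embeddings]
    keys = [matvec(w_k, x) for x in embeddings]
    values = [matvec(w_v, x) for x in embeddings]
    scale = math.sqrt(len(keys[0]))
    outputs, weight_rows = [], []
    for q in queries:
        # Every position scores every other position; no left-to-right pass.
        weights = softmax([dot(q, k) / scale for k in keys])
        mixed = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        outputs.append(mixed)
        weight_rows.append(weights)
    return outputs, weight_rows

def positional_encoding(position, d_model):
    """Sinusoidal position code: sine on even dimensions, cosine on odd ones."""
    code = []
    for i in range(d_model):
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        code.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return code

# Hypothetical toy data: "bank" and "river" are given similar vectors on purpose.
tokens = ["bank", "is", "river"]
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # assumption: stand-in for learned weights

outputs, attention = self_attention(toy_embeddings, identity, identity, identity)
bank_weights = dict(zip(tokens, attention[0]))
position_codes = [positional_encoding(p, 2) for p in range(len(tokens))]

print(bank_weights)     # how much "bank" attends to each token
print(position_codes)   # one distinct code per position
```

In this toy setup "bank" puts more weight on "river" than on "is" only because the invented vectors were chosen that way; in a trained model that preference comes from learned projections. In a full model the position codes would be added to the embeddings before attention is computed.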

What this patent does NOT cover

The boundaries

  • Language models that process text sequentially — one word or token at a time — rather than in parallel (this rules out RNNs and LSTMs, the dominant architectures before 2018)
  • Attention mechanisms where one sequence queries a different sequence — this patent specifically covers self-attention within a single sequence, not cross-attention between encoder and decoder
  • Convolutional approaches that look at fixed-size windows of text rather than the full sequence at once
  • Models without an explicit encoder-decoder structure — covers sequence-to-sequence tasks like translation, not pure generation models
  • Any architecture that does not explicitly compute queries, keys, and values as the mechanism for determining relevance between positions
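The self-attention versus cross-attention distinction drawn in the list above comes down to where the queries and the keys/values come from. The sketch below is illustrative only: `attend`, `source`, and `target` are hypothetical names, the vectors are made up, and the keys and values are taken to be the raw vectors themselves rather than learned projections.

```python
import math

def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query_vectors, memory_vectors):
    """Each query scores every memory vector and returns a weighted mix of them.

    Assumption: keys and values are the memory vectors themselves.
    """
    scale = math.sqrt(len(memory_vectors[0]))
    outputs, weight_rows = [], []
    for q in query_vectors:
        scores = [sum(a * b for a, b in zip(q, m)) / scale
                  for m in memory_vectors]
        weights = softmax(scores)
        mixed = [sum(w * m[i] for w, m in zip(weights, memory_vectors))
                 for i in range(len(memory_vectors[0]))]
        outputs.append(mixed)
        weight_rows.append(weights)
    return outputs, weight_rows

source = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # e.g. an input sentence, 3 positions
target = [[0.2, 0.8], [0.9, 0.1]]              # e.g. a partial output, 2 positions

# Self-attention: one sequence queries itself, giving a 3 x 3 weight grid.
_, self_weights = attend(source, source)

# Cross-attention: one sequence queries a different one, giving a 2 x 3 grid.
_, cross_weights = attend(target, source)

print(len(self_weights), len(self_weights[0]))
print(len(cross_weights), len(cross_weights[0]))
```

The arithmetic is identical in both calls; only the source of the queries changes.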

These exclusions are unique to PatentBrief — derived from the actual claim language, not patent-office boilerplate.

What made this novel

Before the Transformer, the dominant approach in AI language modeling was to read text the way humans do: left to right, one word at a time, building up a "memory" of what came before. The problem is that this memory degrades. By the time a model reached the end of a long sentence, it had mostly forgotten the beginning. The Transformer discarded that approach. Instead of carrying a running memory through the sequence, it applies attention over the whole sequence at once. Every word can see every other word simultaneously, and the model learns which relationships matter. This made it much faster to train (because training is no longer constrained by sequential processing) and much better at long-range dependencies. The paper that introduced this architecture, 'Attention Is All You Need,' published in 2017, became one of the most cited papers in machine learning within a few years.
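A rough way to see the "memory degrades" point is to compare how much of the first word survives in each approach. The sketch below is a deliberately crude model: the running state `h = decay * h + x` and the decay factor of 0.5 are assumptions chosen for illustration (real RNNs and LSTMs use learned gates), and the attention side assumes every word receives an equal score.

```python
def recurrent_share_of_first_token(num_tokens, decay=0.5):
    """Run h = decay * h + x over a sequence where only token 0 carries signal."""
    state = 0.0
    for position in range(num_tokens):
        signal = 1.0 if position == 0 else 0.0
        state = decay * state + signal  # one sequential step per token
    return state

def attention_share_of_first_token(num_tokens):
    """With equal scores, softmax attention gives every token weight 1/n."""
    return 1.0 / num_tokens

sentence_length = 10
recurrent_share = recurrent_share_of_first_token(sentence_length)
attention_share = attention_share_of_first_token(sentence_length)

print(f"recurrent state keeps {recurrent_share:.4f} of the first word")
print(f"attention can still give it a weight of {attention_share:.4f}")
```

Under these assumptions the first word's contribution to the recurrent state halves with every additional word, while its attention weight shrinks only in proportion to sentence length. The recurrent loop must also run one step after another, whereas attention scores for different word pairs do not depend on each other and can be computed in parallel.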

Attention-based sequence trans… (Primary claim) · ai ml · software · telecommunications · consumer electronics

Schematic visualization of the patent's claim structure. Hand-drawn diagrams in progress for each landmark patent.

Where you've seen this

Real-world examples

01 · Google Translate
02 · ChatGPT
03 · Bard
04 · Microsoft Copilot
05 · BERT (Bidirectional Encoder Representations from Transformers)
06 · GPT-3 and subsequent GPT models
07 · DALL-E (for text-to-image generation)

Why it matters

The bigger picture

If you've used ChatGPT, Claude, Gemini, Copilot, or most other AI language tools built after 2018, you've used something built on this architecture. The Transformer did more than improve language AI: it displaced the recurrent architectures that came before it. BERT (Google, 2018), GPT-2 (OpenAI, 2019), GPT-3 (2020), GPT-4, Claude, Gemini: all Transformers. All of them trace back to the eight researchers at Google named as inventors on this patent, which was filed in June 2018. The accompanying paper was reportedly at risk of rejection at a top ML conference before it was accepted. Within two years it had made the previous state-of-the-art models obsolete. Within five years it had created an entirely new industry worth hundreds of billions of dollars. The patent itself is owned by Google, which means Google holds IP on the core architecture that OpenAI, Anthropic, Meta, Mistral, and many other AI labs build on, although, as the boundaries above note, the claim is tied to an explicit encoder-decoder structure.

Filed: June 28, 2018

Granted: October 22, 2019

Patent Journey

From filing to today

Patent Filed: 2018
Patent Granted: 2019 · 1 yr after filing
Active Today: 2026
Expires: 2038
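The timeline is consistent with the general rule that a US utility patent runs for 20 years from its filing date. The sketch below reproduces that arithmetic from the filing and grant dates given on this page. It is a simplification: it ignores term adjustments, terminal disclaimers, maintenance-fee lapses, and any earlier priority application, so the result is a nominal date, not a legal determination.

```python
from datetime import date

# Dates as listed on this page.
filed = date(2018, 6, 28)
granted = date(2019, 10, 22)

# Assumption: a flat 20-year term measured from the filing date.
nominal_expiry = filed.replace(year=filed.year + 20)
days_to_grant = (granted - filed).days

print("nominal expiry:", nominal_expiry.isoformat())
print("days from filing to grant:", days_to_grant)
```

The nominal expiry falls in 2038, matching the timeline, and the gap between filing and grant is a little over one year.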

PatentBrief Score

Impact Score: 83/100 · High impact
Citation count: 33/40 · Moderately cited
Claim breadth: 20/20 · Very broad protection
Recency: 10/20 · Granted 5–10 years ago
Assignee scale: 20/20 · Major technology company

PatentBrief Impact Score — based on citation count, claim breadth, recency, and assignee scale. Not a legal assessment.
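The page does not state how the four components are combined, but the figures shown are consistent with a simple sum. The sketch below checks that reading; the additive formula is an assumption inferred from the numbers, not a documented PatentBrief method.

```python
# Component scores as (points awarded, maximum points), copied from the page.
components = {
    "Citation count": (33, 40),
    "Claim breadth": (20, 20),
    "Recency": (10, 20),
    "Assignee scale": (20, 20),
}

# Assumption: the overall score is the plain sum of the components.
total = sum(points for points, _ in components.values())
maximum = sum(cap for _, cap in components.values())

print(f"{total}/{maximum}")
```

The component maxima add up to 100 and the awarded points add up to the listed Impact Score.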

The original legal language

Original claims

33 claims as filed with the patent office.

Citations

Patent lineage

Cites earlier patents: 35 (earlier patents this invention cites as foundations)

Cited by later patents: 44 (later patents that build on this invention)

Last reviewed: May 25, 2026 · PatentBrief is not a law firm and this is not legal advice.