How Computers Find Similar Text Using Compact Data Structures
This patent describes a method for efficiently identifying similar text records, like documents or product reviews, by using special compact data structures that store text terms probabilistically and then analyzing them with machine learning.
Original patent title: “Scalable text analysis using probabilistic data structures”
Granted to Amazon Technologies Inc. in 2020 with 23 claims and 18 forward citations.
Key facts
Coverage
What does this patent actually cover?
Claim 1 describes a system that takes a piece of text, such as a product review, and uses a "hashing-based function" to map its words (e.g., "excellent") to specific positions in a "probabilistic data structure." This structure acts as a compact, approximate summary of many text records. When a word is mapped, the system updates an entry in the structure to indicate the word's presence. A single entry can represent more than one word, which is what keeps the structure small, at the cost of occasional ambiguity about which word set it. After updating, the system applies a "dimensionality reduction algorithm" to simplify the data, then feeds the result into a "similarity detection algorithm" to estimate how closely the new text resembles other texts it has seen. For example, it could find customer reviews that discuss similar product features.
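The hashing step can be illustrated with a minimal Python sketch. It assumes a Bloom-filter-style bit array, which is one common kind of probabilistic data structure; the page does not say which structure the patent uses, and the names `NUM_BITS`, `NUM_HASHES`, `positions`, `add_text`, and `might_contain` are hypothetical.

```python
import hashlib

NUM_BITS = 64    # assumed size of the bit array; real systems would use far more
NUM_HASHES = 2   # assumed number of hash functions applied to each term


def positions(term):
    """Map a term to NUM_HASHES positions in the bit array."""
    out = []
    for seed in range(NUM_HASHES):
        digest = hashlib.sha256(f"{seed}:{term}".encode("utf-8")).digest()
        out.append(int.from_bytes(digest[:8], "big") % NUM_BITS)
    return out


def add_text(bits, text):
    """Set the entries for every term in the text to mark the term as present."""
    for term in text.lower().split():
        for pos in positions(term):
            bits[pos] = 1


def might_contain(bits, term):
    """True if the term may be present. False positives are possible because
    one entry can stand for several terms; false negatives are not."""
    return all(bits[pos] for pos in positions(term.lower()))


summary = [0] * NUM_BITS
add_text(summary, "The battery life is excellent")
print(might_contain(summary, "excellent"))
```

Because several terms can land on the same entry, the structure never needs more than `NUM_BITS` entries regardless of vocabulary size, which is the trade-off the claim language points to.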
The gap
What does this patent NOT cover?
- Does not cover systems that store every single word explicitly in a traditional database for similarity comparison, as it relies on probabilistic storage where entries can represent more than one text term.
- Does not cover similarity detection that doesn't use a probabilistic data structure as the initial input for further analysis.
- Does not cover text analysis methods that do not involve applying a hashing-based function to text terms to update the data structure.
- Does not cover systems that omit the step of applying a dimensionality reduction algorithm on the probabilistic data structure before generating similarity indications.
- Does not cover combining data structures without using bit-level Boolean operations or vector instructions, as specified in Claim 3.
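The last exclusion refers to combining data structures with bit-level Boolean operations. As a rough illustration only, the Python sketch below stores each structure as an integer bit mask and merges two of them with bitwise OR. The choice of OR, the single hash per term, and the names `NUM_BITS` and `sketch` are assumptions made for brevity, not details taken from the claims.

```python
import hashlib

NUM_BITS = 128  # assumed width of each bit mask


def sketch(text):
    """Return a NUM_BITS-wide integer mask with one bit set per term."""
    mask = 0
    for term in text.lower().split():
        digest = hashlib.sha256(term.encode("utf-8")).digest()
        mask |= 1 << (int.from_bytes(digest[:8], "big") % NUM_BITS)
    return mask


review_a = sketch("excellent battery life")
review_b = sketch("excellent screen quality")

combined = review_a | review_b  # bits set in either sketch (union of terms)
shared = review_a & review_b    # bits set in both sketches

print(bin(combined).count("1"), bin(shared).count("1"))
```

Merging with OR gives the same mask as hashing both texts into one structure from the start, so sketches built on separate machines can be combined without revisiting the original text.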
What made this novel
The novelty lies in using probabilistic data structures, where multiple terms can share entries, as the direct input for machine learning algorithms like dimensionality reduction and similarity detection. This allows for highly scalable text analysis without needing to store full text or traditional, large term-frequency matrices.
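To make the pipeline concrete, here is a minimal Python sketch that feeds a hashed bit vector straight into a dimensionality reduction step and then a similarity measure. Random projection and cosine similarity are stand-ins chosen for illustration; the page does not name the specific algorithms, and all sizes and function names below are hypothetical.

```python
import hashlib
import math
import random

NUM_BITS = 512     # assumed width of each probabilistic sketch
NUM_HASHES = 2     # assumed hash functions per term
REDUCED_DIMS = 64  # assumed size after dimensionality reduction


def sketch(text):
    """Hash each distinct term into a NUM_BITS-long 0/1 vector."""
    bits = [0.0] * NUM_BITS
    for term in set(text.lower().split()):
        for seed in range(NUM_HASHES):
            digest = hashlib.sha256(f"{seed}:{term}".encode("utf-8")).digest()
            bits[int.from_bytes(digest[:8], "big") % NUM_BITS] = 1.0
    return bits


# Fixed Gaussian random matrix, seeded so every text is projected the same way.
_rng = random.Random(0)
PROJECTION = [[_rng.gauss(0.0, 1.0) for _ in range(NUM_BITS)]
              for _ in range(REDUCED_DIMS)]


def reduce_dims(bits):
    """Random projection: NUM_BITS values in, REDUCED_DIMS values out."""
    return [sum(w * b for w, b in zip(row, bits)) for row in PROJECTION]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity(text_a, text_b):
    """Similarity computed from the reduced sketches, never from the raw text."""
    return cosine(reduce_dims(sketch(text_a)), reduce_dims(sketch(text_b)))


review_1 = "battery life is excellent and the screen is bright"
review_2 = "battery life is excellent and the screen is sharp"
review_3 = "shipping was slow package arrived damaged yesterday"
print(similarity(review_1, review_2), similarity(review_1, review_3))
```

The similarity step only ever sees the fixed-size reduced vectors, so its cost does not grow with vocabulary size or document length.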
Where you've seen this
Real-world examples
- Amazon product recommendation systems
- Customer review analysis for sentiment and trends
- Content moderation for online platforms
- Document clustering in large datasets
- Spam detection in email services
Why it matters
The bigger picture
This patent is important for processing huge amounts of text data efficiently, which is common in cloud services and e-commerce. By using probabilistic data structures, it allows for faster and more resource-friendly analysis of customer reviews, product descriptions, or documents. This efficiency helps companies quickly identify trends, recommend products, or moderate content without needing vast storage for every single word.
Filed: June 14, 2016
Granted: December 29, 2020
Market context
Who's building on this
Companies in this space
Amazon Technologies Inc. is the assignee and continues to build on and use such technologies for its e-commerce, cloud computing (AWS), and digital content services. Other major cloud providers like Google and Microsoft, as well as companies in data analytics and AI, also develop and use similar scalable text processing techniques.
Market impact
This type of technology enables companies to process and understand massive volumes of unstructured text data more efficiently, which is crucial for modern internet services. It underpins features like personalized recommendations, improved search results, and automated content analysis, allowing for better user experiences and more targeted advertising across various platforms.
PatentBrief Score
Impact Score: Strong
- Citation count: 26/40 (moderately cited)
- Claim breadth: 15/20 (broad claims)
- Recency: 10/20 (granted 5–10 years ago)
- Assignee scale: 20/20 (major company or institution)
The PatentBrief Impact Score is based on citation count, claim breadth, recency, and assignee scale. It is not a legal assessment.
Heuristic Value Estimate
What this patent might be worth
$187K – $599K
Midpoint $374K · 10.0 yr remaining · industry ×1.6
Heuristic only — blends forward/backward citation counts, claim scope, time remaining, litigation history, and CPC-derived industry baseline. Real valuations need a professional appraisal.
The original legal language
Original claims
23 claims as filed with the patent office.
Cite this patent
Waugh, R. M. (2020). How Computers Find Similar Text Using Compact Data Structures (U.S. Patent No. 10,878,335). U.S. Patent and Trademark Office. https://patentbrief.org/patent/us/10878335/bert-bidirectional-encoder-representations
Auto-generated from the patent record. Double-check author order and the issue date against the official USPTO document before submitting.
Frequently Asked Questions
Who owns patent US 10878335?
Amazon Technologies Inc owns this patent, granted in 2020.
When does this patent expire?
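Under the standard U.S. term of 20 years from the filing date (June 14, 2016), this patent is expected to expire around June 14, 2036, subject to any patent term adjustment and payment of maintenance fees. After that, the invention enters the public domain.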
What is patent US 10878335 cited by?
This patent has been cited by 18 later patents that build on its ideas.