# How Computers Find Similar Text Using Compact Data Structures

> This patent describes a method for efficiently identifying similar text records, like documents or product reviews, by using special compact data structures that store text terms probabilistically and then analyzing them with machine learning.

- **Patent:** US 10878335
- **Original title:** Scalable text analysis using probabilistic data structures
- **Owner:** Amazon Technologies Inc
- **Granted:** 2020
- **Status:** Active
- **Times cited:** 18
- **Field:** software, ai_ml, ecommerce, telecommunications, consumer_electronics

## What it does

This system (Claim 1) takes a text record, such as a product review, and applies a "hashing-based function" to map each of its terms (e.g., "excellent") to specific entries in a "probabilistic data structure." The structure serves as a compact, approximate summary of many text records. When a term is mapped, the system updates the corresponding entry to indicate the term's presence. Because a single entry can represent more than one term (Claim 1), the structure does not need a separate slot for every distinct word; the trade-off is that terms sharing an entry cannot always be told apart. After the update, the system applies a "dimensionality reduction algorithm" to the structure, then feeds the reduced result into a "similarity detection algorithm" that indicates how similar the new text is to other records it has seen. For example, it could find customer reviews that discuss similar product features.
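
The hashing and update steps can be sketched in a few lines of Python. This is only an illustration, not the patent's implementation: it assumes a Bloom-filter-style bit array, and the names `NUM_BITS`, `NUM_HASHES`, `term_positions`, `add_record`, and `might_contain`, the use of SHA-256, and whitespace tokenization are all hypothetical choices made for the example.

```python
import hashlib
from typing import List

NUM_BITS = 64    # assumed structure size; a real system would likely use far more
NUM_HASHES = 2   # assumed number of entries touched per term


def term_positions(term: str) -> List[int]:
    """Map a term to NUM_HASHES entry positions using salted SHA-256 digests."""
    positions = []
    for seed in range(NUM_HASHES):
        digest = hashlib.sha256(f"{seed}:{term}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % NUM_BITS)
    return positions


def add_record(bits: int, text: str) -> int:
    """Return the bit array with every term of `text` marked as present."""
    for term in text.lower().split():
        for pos in term_positions(term):
            bits |= 1 << pos  # different terms may set the same bit
    return bits


def might_contain(bits: int, term: str) -> bool:
    """True if all of the term's entries are set (false positives are possible)."""
    return all((bits >> pos) & 1 for pos in term_positions(term.lower()))


structure = add_record(0, "Excellent battery life and excellent screen")
print(might_contain(structure, "excellent"))   # a term that was added is always found
print(bin(structure).count("1"), "of", NUM_BITS, "bits set")
```

The bit array never stores the words themselves, which is why a lookup can only answer "possibly present" or "definitely absent."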

## What it does NOT cover

- Does not cover systems that store every single word explicitly in a traditional database for similarity comparison, as it relies on probabilistic storage where entries can represent more than one text term.
- Does not cover similarity detection that doesn't use a probabilistic data structure as the initial input for further analysis.
- Does not cover text analysis methods that do not involve applying a hashing-based function to text terms to update the data structure.
- Does not cover systems that omit the step of applying a dimensionality reduction algorithm on the probabilistic data structure before generating similarity indications.
- Does not cover combining data structures without using bit-level Boolean operations or vector instructions, as specified in Claim 3.
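
To make the Claim 3 point concrete, here is a minimal sketch of combining two structures with a bit-level Boolean operation. It assumes the structures are equal-sized bit arrays built with the same hash function and that the combination is a bitwise OR; the summary above does not say which Boolean operation the claim uses, so that choice, along with the names `build` and `combine`, is an assumption for illustration.

```python
import hashlib

NUM_BITS = 64  # assumed size; both structures must share it to be combinable


def build(text: str) -> int:
    """Build a one-hash bit array (stored in a Python int) from a text record."""
    bits = 0
    for term in text.lower().split():
        digest = hashlib.sha256(term.encode("utf-8")).digest()
        bits |= 1 << (int.from_bytes(digest[:8], "big") % NUM_BITS)
    return bits


def combine(a: int, b: int) -> int:
    """Merge two bit arrays with a single bitwise OR."""
    return a | b


reviews_a = build("great camera great price")
reviews_b = build("poor battery great camera")
merged = combine(reviews_a, reviews_b)

# The merged structure matches one built from both records at once.
print(merged == build("great camera great price poor battery great camera"))
# Bits set in both inputs hint at shared terms (or hash collisions).
print(bin(reviews_a & reviews_b).count("1"), "bits in common")
```

Because the merge is a single OR over fixed-size words, it maps naturally onto the vector instructions the claim mentions, although this sketch does not use them.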

## The clever bit

The novelty lies in using probabilistic data structures, where multiple terms can share entries, as the direct input for machine learning algorithms like dimensionality reduction and similarity detection. This allows for highly scalable text analysis without needing to store full text or traditional, large term-frequency matrices.
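
As a rough illustration of feeding such a structure straight into the later stages, the sketch below treats each record's bit array as a feature vector, reduces it, and compares the results. The patent summary does not name specific algorithms, so random projection (for dimensionality reduction) and cosine similarity (for similarity detection) are stand-ins chosen for this example, and every name and constant here is hypothetical.

```python
import hashlib
import math
import random
from typing import List

NUM_BITS = 256      # assumed size of each probabilistic structure
REDUCED_DIMS = 16   # assumed target dimensionality


def to_bit_vector(text: str) -> List[float]:
    """Hash each term of a record into a fixed-length 0/1 vector."""
    vec = [0.0] * NUM_BITS
    for term in text.lower().split():
        digest = hashlib.sha256(term.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:8], "big") % NUM_BITS] = 1.0
    return vec


def make_projection(rows: int, cols: int, seed: int = 0) -> List[List[float]]:
    """Random Gaussian matrix used as a stand-in dimensionality reducer."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(vec: List[float], projection: List[List[float]]) -> List[float]:
    """Multiply the projection matrix by the bit vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in projection]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity, used here as the similarity indication."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


projection = make_projection(REDUCED_DIMS, NUM_BITS)
reviews = {
    "r1": "battery life is excellent and the screen is bright",
    "r2": "excellent battery life and a bright screen",
    "r3": "shipping was slow and the box arrived damaged",
}
reduced = {name: project(to_bit_vector(text), projection) for name, text in reviews.items()}

print("r1 vs r2:", round(cosine(reduced["r1"], reduced["r2"]), 3))
print("r1 vs r3:", round(cosine(reduced["r1"], reduced["r3"]), 3))
```

Records that share many terms would typically score higher than records that share few, though with so few reduced dimensions the scores are noisy. Note that no term-frequency matrix or raw text is kept after hashing.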

## Real-world examples

1. Amazon product recommendation systems
2. Customer review analysis for sentiment and trends
3. Content moderation for online platforms
4. Document clustering in large datasets
5. Spam detection in email services

## Why it matters

The approach targets text workloads too large to store and compare term by term, which are common in cloud services and e-commerce. Because the probabilistic data structure stays compact, analyzing customer reviews, product descriptions, or documents can take less storage and processing than methods that keep every word. This helps companies identify trends, recommend products, or moderate content quickly at scale.

## Frequently asked questions

### What does patent US 10878335 cover?

This patent describes a method for efficiently identifying similar text records, like documents or product reviews, by using special compact data structures that store text terms probabilistically and then analyzing them with machine learning.

### Who owns patent US 10878335?

Amazon Technologies Inc owns this patent, granted in 2020.

### When does this patent expire?

This patent is expected to expire on December 29, 2040, when the invention enters the public domain.

### What is patent US 10878335 cited by?

This patent has been cited by 18 later patents that build on its ideas.

### What problem does this patent solve?

It addresses the cost of analyzing huge amounts of text, a common need in cloud services and e-commerce. Instead of storing every word, the method summarizes text records in compact probabilistic data structures, allowing faster and more resource-friendly analysis of customer reviews, product descriptions, or documents.

### What does this patent NOT cover?

Among other exclusions, the patent does not cover systems that store every word explicitly in a traditional database for similarity comparison, because the claimed method relies on probabilistic storage in which an entry can represent more than one text term.

**Full plain-English explainer:** https://patentbrief.org/patent/us/10878335/bert-bidirectional-encoder-representations

**Original patent:** https://patents.google.com/patent/US10878335

---

_Source: PatentBrief — https://patentbrief.org. Patent facts are from public records; the plain-English explanation is PatentBrief's._


## Related patents

Semantically similar inventions in the PatentBrief corpus:

- [How Computers Find Hidden Connections Between Different Fields of Knowledge](https://patentbrief.org/patent/us/6523026/google-search-query-processing) — A method for finding related ideas in completely different subjects by using math to map how words appear together, even when the subjects use different vocabulary.
- [Teaching Computers to Understand Document Similarity Using AI](https://patentbrief.org/patent/us/10909459/federated-learning) — This patent describes a way to train a computer program (a neural network) to understand how similar documents are to each other, by showing it examples and teaching it to group similar ones together and separate dissimilar ones.
- [How Groupon Automatically Categorizes Merchant Services Using Text Analysis](https://patentbrief.org/patent/us/9330167/amazon-rds) — A system that automatically scans merchant websites and uses high-precision search queries to label their services, helping platforms like Groupon organize thousands of business listings.
- [How Computers Match and Join Messy Data from Different Sources](https://patentbrief.org/patent/us/9607103/amazon-athena) — A method for merging datasets by identifying related but non-identical items using flexible matching rules rather than strict equality.
- [How Computers Calculate Probabilities in Large Knowledge Bases](https://patentbrief.org/patent/us/9361579/large-scale-probabilistic-ontology-reasoning) — A method for finding answers in a database of uncertain facts by ignoring probabilities to find a solution first, then calculating how likely that solution is based on the underlying evidence.