# Teaching Computers to Understand Document Similarity Using AI

> This patent describes a way to train a computer program (a neural network) to understand how similar documents are to each other, by showing it examples and teaching it to group similar ones together and separate dissimilar ones.

- **Patent:** US 10909459
- **Original title:** Content embedding using deep metric learning algorithms
- **Owner:** Cognizant Technology Solutions US Corp
- **Granted:** 2021
- **Status:** Active
- **Times cited:** 53
- **Field:** software, ai_ml, telecommunications

## What it does

This patent explains how to train a neural network to build an embedding 'space' in which documents are positioned by meaning. Each training example consists of a target document (say, an article about dogs), a 'favored' document (another article about dogs), and at least two 'unfavored' documents (articles about cats, cars, or anything else). The network learns by pulling the favored document closer to the target in that space and pushing the unfavored documents farther away. It does this by adjusting its internal settings, called parameters, to minimize a 'loss' function, which measures how well the favored document is separated from the unfavored ones relative to the target.

For instance, a training example might include an article about 'Golden Retrievers' (target), another about 'Labradors' (favored), and articles about 'Siamese Cats' and 'Electric Cars' (unfavored). The system adjusts itself so that the 'Golden Retriever' and 'Labrador' articles end up close together in its internal representation, while the 'Siamese Cat' and 'Electric Car' articles end up far from the 'Golden Retriever' article.
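To make the idea of a 'loss' concrete, here is a minimal sketch in Python. It assumes each document has already been turned into a small vector, and it uses a softmax-style formula over distances, which is one common choice in deep metric learning. The patent's exact formula is not reproduced here, and the function name `k_way_loss` and the 2-D vectors are made up for illustration.

```python
import math


def k_way_loss(target, favored, unfavored):
    """Illustrative loss for one target, one favored, several unfavored docs.

    Assumption: loss = log(1 + sum_i exp(d_fav - d_unfav_i)), where d is the
    Euclidean distance to the target. It is small when the favored document
    is much closer to the target than every unfavored document.
    """
    d_fav = math.dist(target, favored)
    total = sum(math.exp(d_fav - math.dist(target, u)) for u in unfavored)
    return math.log1p(total)


# Hypothetical 2-D embeddings; a real network would produce many dimensions.
golden_retrievers = [1.0, 0.0]   # target
labradors = [0.9, 0.1]           # favored
siamese_cats = [0.0, 1.0]        # unfavored
electric_cars = [-1.0, 0.0]      # unfavored

# Favored document is near the target: low loss.
good = k_way_loss(golden_retrievers, labradors, [siamese_cats, electric_cars])

# Pretend the car article were the favored one: high loss.
bad = k_way_loss(golden_retrievers, electric_cars, [siamese_cats, labradors])

print(good < bad)  # True
```

Training would repeat this calculation over many examples and nudge the network's parameters in whatever direction lowers the loss. The sketch only scores a fixed arrangement and does not do that adjustment.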

## What it does NOT cover

- Does not cover methods that do not use a neural network for training.
- Does not cover training methods that do not involve a target document, a favored document, and at least two unfavored documents.
- Does not cover systems that do not calculate a 'loss' based on the distance between document representations.
- Does not cover methods where the computer program is not 'trained' using adjustable parameters.
- Does not cover creating an embedding space without using document vectors as input.

## The clever bit

The core idea is teaching the AI not just to recognize what a document is about, but to learn the *relative similarity* between documents. By explicitly training it to pull 'good' matches toward a reference document and push 'bad' matches away from it, the network learns graded distinctions in meaning that a plain classifier, which only assigns each document a label, does not capture.

## Real-world examples

1. Search engine result ranking
2. Product recommendation systems
3. Content similarity detection
4. Plagiarism detection tools
5. Customer feedback analysis
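
Several of the uses above, such as search ranking and similarity detection, come down to sorting documents by their distance from a query in the learned space. The following is a minimal sketch, assuming a trained network has already turned each document into a vector. The vectors, the document names, and the function `rank_by_similarity` are hypothetical.

```python
import math


def rank_by_similarity(query_vec, corpus):
    """Return document names sorted from nearest to farthest from the query."""
    return sorted(corpus, key=lambda name: math.dist(query_vec, corpus[name]))


# Hypothetical 2-D embeddings standing in for a trained network's output.
corpus = {
    "electric_cars": [-1.0, 0.0],
    "siamese_cats": [0.0, 1.0],
    "labradors": [0.9, 0.1],
}

query = [1.0, 0.0]  # e.g. the embedding of an article about Golden Retrievers
ranking = rank_by_similarity(query, corpus)
print(ranking)  # ['labradors', 'siamese_cats', 'electric_cars']
```

At larger scale, a sort over the whole corpus like this one is usually replaced by an approximate nearest-neighbor index, but the principle of ranking by distance is the same.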

## Why it matters

Learned similarity of this kind underpins many AI applications that organize large amounts of text or other data. It helps search engines, recommendation systems, and content moderation tools judge how closely different pieces of information are related in meaning.

## Frequently asked questions

### Who owns patent US 10909459?

Cognizant Technology Solutions US Corp owns this patent, granted in 2021.

### When does this patent expire?

This patent is expected to expire on February 2, 2041, when the invention enters the public domain.

### What is patent US 10909459 cited by?

This patent has been cited by 53 later patents that build on its ideas.

### What does this patent NOT cover?

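It does not cover methods that train without a neural network, training that lacks a target document, a favored document, and at least two unfavored documents, losses that are not based on the distance between document representations, or embedding spaces built without document vectors as input.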

**Full plain-English explainer:** https://patentbrief.org/patent/us/10909459/federated-learning

**Original patent:** https://patents.google.com/patent/US10909459

---

_Source: PatentBrief — https://patentbrief.org. Patent facts are from public records; the plain-English explanation is PatentBrief's._


## Related patents

Semantically similar inventions in the PatentBrief corpus:

- [How Facebook Uses Deep Learning to Predict What You Might Like](https://patentbrief.org/patent/us/10402750/automl-neural-architecture-search) — A method for training AI models to recommend new content by comparing a user's past interactions with unseen items in a social network.
- [How Computers Find Hidden Connections Between Different Fields of Knowledge](https://patentbrief.org/patent/us/6523026/google-search-query-processing) — A method for finding related ideas in completely different subjects by using math to map how words appear together, even when the subjects use different vocabulary.
- [How AI Learns New Tasks Using Old Data Labels](https://patentbrief.org/patent/us/11062228/gpt-3-few-shot-learning) — A method for helping AI models understand new topics by grouping similar labels from different datasets into a shared, broader category.
- [How Computers Find Similar Text Using Compact Data Structures](https://patentbrief.org/patent/us/10878335/bert-bidirectional-encoder-representations) — This patent describes a method for efficiently identifying similar text records, like documents or product reviews, by using special compact data structures that store text terms probabilistically and then analyzing them with machine learning.
- [How Computers Use Memory Networks to Answer Questions](https://patentbrief.org/patent/us/10664744/watson-question-answering-system-deepqa) — A method for AI to search through large amounts of stored information by repeatedly 'hopping' through memory to find the most relevant facts for answering a question.