PatentBrief · Patent Brief · US 10909459

Teaching Computers to Understand Document Similarity Using AI

This patent describes a way to train a computer program (a neural network) to understand how similar documents are to each other, by showing it examples and teaching it to group similar ones together and separate dissimilar ones.

Patent Number

US 10909459

Status

Active

Filing Date

June 9, 2017

Grant Date

February 2, 2021

Expiration

~June 2037 (estimated)

Claims

Assignee

Cognizant Technology Solutions US Corp

Inventors

Diego Guy M. Legrand, Nigel Duffy, Petr TSATSIN, Philip M. Long

Citations

53 forward · 103 backward

What it covers

This patent covers a method for training a computer program, specifically a neural network, to build an embedding 'space' in which documents are positioned according to their meaning. Each training example consists of a target document (say, an article about dogs), a 'favored' document (another article about dogs), and at least two 'unfavored' documents (articles about cats, cars, or anything else). The network learns by pulling the favored document closer to the target in that space and pushing the unfavored documents farther away. It does this by adjusting its internal settings, called parameters, to minimize a 'loss' function, which measures how well the favored document is separated from the unfavored ones relative to the target.

For instance, a training example might include an article about 'Golden Retrievers' (target), another about 'Labradors' (favored), and articles about 'Siamese Cats' and 'Electric Cars' (unfavored). Training adjusts the network so that the 'Golden Retriever' and 'Labrador' articles end up 'close' in its internal representation, while the 'Siamese Cat' and 'Electric Car' articles end up 'far' from the 'Golden Retriever' article.
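
To make the training idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the patent's actual method: this summary does not give the exact loss formula, network architecture, or distance measure, so the sketch assumes a one-layer linear 'network', squared Euclidean distance, and a loss of the form log(1 + sum over unfavored documents of exp(d(target, favored) - d(target, unfavored))). All function names, document vectors, and starting parameters are made up for the example.

```python
import math

def embed(weights, doc_vector):
    """Hypothetical one-layer linear 'network': maps a document vector to a point in the space."""
    return [sum(w * x for w, x in zip(row, doc_vector)) for row in weights]

def sq_dist(a, b):
    """Squared Euclidean distance between two embedded points (an assumed distance measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def loss(weights, target, favored, unfavored):
    """Assumed loss: log(1 + sum_i exp(d(target, favored) - d(target, unfavored_i)))."""
    t = embed(weights, target)
    d_fav = sq_dist(t, embed(weights, favored))
    return math.log1p(sum(math.exp(d_fav - sq_dist(t, embed(weights, u))) for u in unfavored))

def train_step(weights, target, favored, unfavored, lr=0.05, eps=1e-6):
    """One gradient-descent step; gradients are estimated by finite differences."""
    base = loss(weights, target, favored, unfavored)
    updated = [row[:] for row in weights]
    for i, row in enumerate(weights):
        for j in range(len(row)):
            bumped = [r[:] for r in weights]
            bumped[i][j] += eps
            grad = (loss(bumped, target, favored, unfavored) - base) / eps
            updated[i][j] = weights[i][j] - lr * grad
    return updated

# Toy 4-dimensional document vectors (made-up numbers, e.g. word-frequency features).
target = [1.0, 0.8, 0.0, 0.0]       # 'Golden Retrievers'
favored = [0.9, 1.0, 0.1, 0.0]      # 'Labradors'
unfavored = [
    [0.1, 0.0, 1.0, 0.2],           # 'Siamese Cats'
    [0.0, 0.1, 0.0, 1.0],           # 'Electric Cars'
]

# Arbitrary starting parameters for a 2-dimensional embedding space.
weights = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.2, -0.1, 0.3]]

initial_loss = loss(weights, target, favored, unfavored)
for _ in range(200):
    weights = train_step(weights, target, favored, unfavored)
final_loss = loss(weights, target, favored, unfavored)

print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

The loss shrinks as the 'Labrador' point moves close to the 'Golden Retriever' point relative to the 'Siamese Cat' and 'Electric Car' points. A real system would likely use a deeper network and compute gradients by backpropagation rather than finite differences, which are used here only to keep the sketch free of external libraries.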

What it doesn't cover

—Does not cover methods that do not use a neural network for training.
—Does not cover training methods that do not involve a target document, a favored document, and at least two unfavored documents.
—Does not cover systems that do not calculate a 'loss' based on the distance between document representations.
—Does not cover methods where the computer program is not 'trained' using adjustable parameters.
—Does not cover creating an embedding space without using document vectors as input.

The clever bit

The core idea is teaching the AI not just to recognize what a document is about, but to learn the *relative similarity* between documents. By explicitly training it to pull 'good' matches closer to a reference document and push 'bad' matches farther away, it learns a graded sense of how alike two documents are, something that simply sorting documents into fixed categories does not provide.

Why it matters

Techniques of this kind underpin many modern AI applications that need to understand and organize large amounts of text or other data. A well-trained embedding space helps search engines, recommendation systems, and content moderation tools grasp the meaning of different pieces of information and the relationships between them.

Real-world examples

1. Search engine result ranking
2. Product recommendation systems
3. Content similarity detection
4. Plagiarism detection tools
5. Customer feedback analysis

Generated by PatentBrief · Not legal advice · patentbrief.org

US 10909459 · 2026