Teaching Computers to Understand Document Similarity Using AI
This patent describes a way to train a computer program (a neural network) to understand how similar documents are to each other, by showing it examples and teaching it to group similar ones together and separate dissimilar ones.
Patent Number
US 10909459
Status
Active
Filing Date
June 9, 2017
Grant Date
February 2, 2021
Expiration
~June 2037 (estimated)
Claims
22
Assignee
Cognizant Technology Solutions US Corp
Inventors
Diego Guy M. Legrand, Nigel Duffy, Petr Tsatsin, Philip M. Long
Citations
53 forward · 103 backward
What it covers
This patent explains how to train a neural network to build an embedding 'space' in which documents are positioned according to their meaning. Each training example consists of a target document (say, an article about Golden Retrievers), a 'favored' document that should count as similar to it (an article about Labradors), and at least two 'unfavored' documents that should not (articles about Siamese Cats and Electric Cars). The network maps each document to a point in the space, and a 'loss' function scores how well the favored document is separated from the unfavored ones, measured by their distances to the target. Training adjusts the network's internal settings, called parameters, to reduce that loss. Over many examples, the Golden Retriever and Labrador articles end up 'close' in the network's internal representation, while the Siamese Cat and Electric Car articles end up 'far' from the Golden Retriever article.
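The loss described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's claimed formula: the softmax-over-negative-distances form, the Euclidean distance, and the name `tuple_loss` are all assumptions chosen for clarity, and the embeddings are hand-made 2-D points rather than neural network outputs.

```python
import numpy as np

def tuple_loss(target, favored, unfavored):
    """Softmax-style loss for one favored and several unfavored embeddings.

    Assumed form (not taken from the patent): negative distances to the
    target are treated as similarity scores, and the loss is the negative
    log-probability assigned to the favored document.
    """
    d_fav = np.linalg.norm(target - favored)
    d_unfav = np.linalg.norm(unfavored - target, axis=1)
    # Index 0 holds the favored document's score.
    scores = -np.concatenate(([d_fav], d_unfav))
    scores = scores - scores.max()  # shift for numerical stability
    log_prob_fav = scores[0] - np.log(np.exp(scores).sum())
    return float(-log_prob_fav)

# Hypothetical embeddings for the example in the text.
golden_retrievers = np.array([0.0, 0.0])   # target
labradors = np.array([0.1, 0.0])           # favored
others = np.array([[1.0, 0.0],             # Siamese Cats (unfavored)
                   [0.0, 2.0]])            # Electric Cars (unfavored)

good = tuple_loss(golden_retrievers, labradors, others)

# Same points, but with the favored document placed far from the target.
bad = tuple_loss(golden_retrievers,
                 np.array([0.0, 2.0]),
                 np.array([[0.1, 0.0], [1.0, 0.0]]))
```

Under these assumptions `good` comes out smaller than `bad`: the loss is low when the favored document sits nearer the target than every unfavored one, which is the arrangement training tries to reach.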
What it doesn't cover
- Does not cover methods that do not use a neural network for training.
- Does not cover training methods that do not involve a target document, a favored document, and at least two unfavored documents.
- Does not cover systems that do not calculate a 'loss' based on the distance between document representations.
- Does not cover methods where the computer program is not 'trained' using adjustable parameters.
- Does not cover creating an embedding space without using document vectors as input.
The clever bit
The core idea is teaching the AI not just to recognize what a document is about, but to learn the *relative similarity* between documents. By explicitly training it to pull 'good' matches closer to a reference and push 'bad' matches further away, it learns graded degrees of similarity, something a plain classifier that only assigns topic labels does not capture.
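To make the pull-closer, push-away idea concrete, here is a toy training loop. Everything in it is an assumption made for illustration and none of it comes from the patent text: the embedding is a single weight matrix rather than a deep network, gradients are estimated by finite differences rather than backpropagation, the loss is a softmax over negative distances, and the 4-dimensional document vectors are made up.

```python
import numpy as np

def distances(weights, target, favored, unfavored):
    """Distances from the target to the favored doc (index 0) and each unfavored doc."""
    t, f, u = target @ weights, favored @ weights, unfavored @ weights
    return np.concatenate(([np.linalg.norm(t - f)],
                           np.linalg.norm(u - t, axis=1)))

def loss(weights, target, favored, unfavored):
    """Negative log-probability of the favored doc under a softmax of -distance."""
    scores = -distances(weights, target, favored, unfavored)
    scores = scores - scores.max()  # shift for numerical stability
    return float(-(scores[0] - np.log(np.exp(scores).sum())))

def numerical_grad(fn, weights, eps=1e-5):
    """Central-difference gradient: fine for a toy, far too slow for a real network."""
    grad = np.zeros_like(weights)
    for idx in np.ndindex(*weights.shape):
        step = np.zeros_like(weights)
        step[idx] = eps
        grad[idx] = (fn(weights + step) - fn(weights - step)) / (2 * eps)
    return grad

def train(weights, target, favored, unfavored, lr=0.1, steps=200):
    """Plain gradient descent on the loss for a single training tuple."""
    fn = lambda w: loss(w, target, favored, unfavored)
    for _ in range(steps):
        weights = weights - lr * numerical_grad(fn, weights)
    return weights

# Made-up document vectors: the first two features stand for "dog" content.
target = np.array([1.0, 0.9, 0.0, 0.1])
favored = np.array([0.9, 1.0, 0.1, 0.0])
unfavored = np.array([[0.0, 0.1, 1.0, 0.9],
                      [0.1, 0.0, 0.8, 1.0]])

rng = np.random.default_rng(0)
initial = rng.normal(scale=0.1, size=(4, 2))  # 4 input features -> 2-D embedding
trained = train(initial, target, favored, unfavored)

before = distances(initial, target, favored, unfavored)
after = distances(trained, target, favored, unfavored)
```

Comparing `before` with `after` shows the relative effect: the training signal depends only on how the favored distance compares with the unfavored ones, not on any topic label.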
Why it matters
Learned embedding spaces of this kind are widely used in applications that organize large amounts of text or other data. They let search engines, recommendation systems, and content moderation tools compare pieces of information by meaning rather than by exact wording.
Real-world examples
1. Search engine result ranking
2. Product recommendation systems
3. Content similarity detection
4. Plagiarism detection tools
5. Customer feedback analysis
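Several of the uses listed above come down to the same operation: embed a query, then sort candidate documents by distance to it. The sketch below shows that step only. The function name `rank_by_distance`, the Euclidean distance, and the hand-made 2-D embeddings are assumptions for illustration; a real system would take its embeddings from a trained network.

```python
import numpy as np

def rank_by_distance(query_embedding, doc_embeddings, top_k=3):
    """Return indices of the top_k documents nearest the query, closest first."""
    dists = np.linalg.norm(doc_embeddings - query_embedding, axis=1)
    order = np.argsort(dists)[:top_k]
    return order.tolist()

# Hypothetical embeddings; the values are invented for this example.
titles = ["Labradors", "Siamese Cats", "Electric Cars", "Poodles"]
doc_embeddings = np.array([[0.9, 0.1],
                           [0.2, 0.8],
                           [-0.7, -0.6],
                           [0.8, 0.2]])
query = np.array([1.0, 0.0])  # stands in for an article on Golden Retrievers

indices = rank_by_distance(query, doc_embeddings, top_k=2)
print([titles[i] for i in indices])  # prints ['Labradors', 'Poodles']
```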
Generated by PatentBrief · Not legal advice · patentbrief.org