How Computers Match and Join Messy Data from Different Sources
A method for merging datasets by identifying related but non-identical items using flexible matching rules rather than strict equality.
Patent Number: US 9607103
Status: Active
Filing Date: January 23, 2013
Grant Date: March 28, 2017
Expiration: ~January 2033 (estimated)
Claims: 23
Assignee: Ab Initio Technology LLC
Inventors: Arlen Anderson
Citations: 15 forward · 99 backward
What it covers
This patent describes a way for computer systems to combine data from two different sources even when the values don't match exactly. Instead of testing for identical values, the system applies a 'variant relation' that decides whether two data elements are close enough to count as a match, for example by checking whether the distance between two values falls below a set threshold. Once the matches are identified, the system evaluates the records that contain them and joins those records into a new, combined dataset. For example, it could link 'John Smith' in one database with 'J. Smith' in another by recognizing them as variants of the same name under the defined similarity rules.
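A minimal sketch of this kind of threshold-based join, written in Python. It illustrates the general idea only and is not the patented implementation: the function names (`edit_distance`, `is_variant`, `fuzzy_join`), the choice of Levenshtein edit distance as the distance measure, the threshold of 3, and the sample records are all assumptions made for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def is_variant(a: str, b: str, threshold: int = 3) -> bool:
    """Hypothetical variant relation: case-insensitive distance below the threshold."""
    return edit_distance(a.lower(), b.lower()) < threshold


def fuzzy_join(left, right, key, threshold=3):
    """Pair each left record with every right record whose key is a variant of its own."""
    joined = []
    for left_row in left:
        for right_row in right:
            if is_variant(left_row[key], right_row[key], threshold):
                combined = {f"left_{k}": v for k, v in left_row.items()}
                combined.update({f"right_{k}": v for k, v in right_row.items()})
                joined.append(combined)
    return joined


# Made-up sample data from two hypothetical sources.
customers = [{"name": "John Smith", "city": "Boston"},
             {"name": "Mary Jones", "city": "Austin"}]
orders = [{"name": "Jon Smith", "total": 120},
          {"name": "Mary Jonas", "total": 75},
          {"name": "Alan Turing", "total": 10}]

result = fuzzy_join(customers, orders, key="name")
for row in result:
    print(row["left_name"], "<->", row["right_name"], row["right_total"])
```

With plain edit distance, 'John Smith' and 'J. Smith' are three edits apart, so this particular rule would not link them at a threshold of 3. Matching abbreviations like that would need a larger threshold or a name-aware similarity rule, which is why the variant relation is left configurable here.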
What it doesn't cover
- Does not cover simple database joins that rely on exact matches (e.g., matching primary keys that are identical).
- Does not cover manual data entry or manual reconciliation by people.
- Does not cover matching methods that are strictly limited to equivalence relations (relations that are reflexive, symmetric, and transitive, such as exact equality).
- Does not cover the storage hardware itself, only the logical method of processing and joining the data.
The clever bit
The system explicitly allows the variant relation to be a 'non-equivalence relation.' In particular, the relation does not have to be transitive: A can be a variant of B, and B a variant of C, without A being a direct variant of C. This lets the system chain matches through intermediate data elements to find connections that aren't directly obvious.
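To make the non-equivalence point concrete, here is a small Python sketch with a numeric threshold rule. The names (`is_variant`, `linked_through_variants`), the threshold of 5, and the sample values are assumptions for illustration, and the breadth-first search is one simple way to follow chains of variants, not necessarily the procedure the patent claims.

```python
from collections import deque


def is_variant(a: int, b: int, threshold: int = 5) -> bool:
    """Hypothetical variant relation: absolute difference below the threshold."""
    return abs(a - b) < threshold


def linked_through_variants(start, values, threshold=5):
    """Breadth-first search: collect every value reachable from start via variant pairs."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for v in values:
            if v not in seen and is_variant(current, v, threshold):
                seen.add(v)
                queue.append(v)
    return seen


values = [10, 14, 18, 40]

# 10 ~ 14 and 14 ~ 18, but 10 and 18 differ by 8, so the relation is not transitive.
direct = {v for v in values if is_variant(10, v)}
chained = linked_through_variants(10, values)

print(sorted(direct))   # values directly within the threshold of 10
print(sorted(chained))  # values reachable through intermediate variants
```

Here 18 is not a direct variant of 10, yet it is reached through the intermediate value 14, while 40 stays unconnected. Chaining like this can also pull in items that are quite far from the starting point, so a practical system would likely limit or score the chains.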
Why it matters
In large-scale data processing, data is rarely clean or perfectly formatted across different departments or companies. This patent provides a formal framework for 'fuzzy' data integration, which is essential for business intelligence, customer relationship management, and regulatory compliance where records must be consolidated despite inconsistent naming or formatting.
Real-world examples
1. Enterprise data integration platforms
2. Customer data platforms (CDP) for deduplication
3. Automated financial record reconciliation
4. Master data management systems
Generated by PatentBrief · Not legal advice · patentbrief.org