Aggregating multiple databases with Record Lineage
Aggregating multiple databases with Record Lineage enables data from different sources to be grouped and unified, while retaining the links between records and their original sources.
The need
Our customer wanted to publish, on a single portal, a database resulting from the pooling of records from 12 source databases.
As overlaps existed between the various source databases, it was necessary to deduplicate so that portal visitors would have a single view of each record.
In addition, since portal users can correct and/or enrich the published information (crowdsourcing), each entry in the aggregated database had to keep a link to the corresponding record(s) in the source databases (Record Lineage), so that corrections could be passed back to the source.
This use case focused on cultural venues. However, the same approach also applies to corporate or individual listings (CRM), product databases, etc.
Proposed solution
Verification and geolocation of postal addresses.
Verification of postal codes, translation of postal codes into INSEE codes.
Harmonization of data from each of the 12 source databases to obtain a single target format.
Multi-criteria (name, address) and multi-strategy (phonetic, Levenshtein distance, N-gram, etc.) deduplication.
Record Lineage: preservation throughout the processing chain of each record's identifier and its original source database.
Automation of the entire processing chain in both directions (source databases → aggregated database AND aggregated database → source databases) to propagate any updates and enrichments that may occur on either side.
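To make the deduplication and Record Lineage steps above more concrete, here is a minimal sketch in Python. It is not the implementation used in the project: the class names, field names, the greedy matching loop and the 0.8 threshold are all assumptions chosen for illustration. Only two of the strategies listed above are shown (Levenshtein distance and character trigrams), and each aggregated entry keeps the (source database, record identifier) pairs it was built from.

```python
from dataclasses import dataclass, field


@dataclass
class SourceRecord:
    source: str      # name of the source database (hypothetical field)
    source_id: str   # identifier of the record in that database
    name: str
    address: str


@dataclass
class AggregatedRecord:
    name: str
    address: str
    # Record Lineage: every (source, source_id) pair merged into this entry.
    lineage: list = field(default_factory=list)


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of padded character trigrams."""
    def grams(s: str) -> set:
        padded = f"  {s} "
        return {padded[i:i + 3] for i in range(len(padded) - 2)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)


def similarity(a: str, b: str) -> float:
    """Multi-strategy score: keep the best of the two measures."""
    a, b = normalise(a), normalise(b)
    return max(levenshtein_similarity(a, b), trigram_similarity(a, b))


def aggregate(records, threshold: float = 0.8):
    """Greedy multi-criteria deduplication (name AND address must match)."""
    aggregated = []
    for rec in records:
        for agg in aggregated:
            if (similarity(rec.name, agg.name) >= threshold
                    and similarity(rec.address, agg.address) >= threshold):
                agg.lineage.append((rec.source, rec.source_id))
                break
        else:
            # No match found: create a new aggregated entry.
            aggregated.append(AggregatedRecord(
                rec.name, rec.address, [(rec.source, rec.source_id)]))
    return aggregated


# Illustrative records from three hypothetical source databases.
records = [
    SourceRecord("db_a", "17", "Musée du Louvre", "Rue de Rivoli, 75001 Paris"),
    SourceRecord("db_b", "x9", "Musee du Louvre", "rue de Rivoli 75001 Paris"),
    SourceRecord("db_c", "3", "Opéra Garnier", "Place de l'Opéra, 75009 Paris"),
]
portal_entries = aggregate(records)
```

In this sketch the first two records merge into one portal entry whose lineage lists both source records, while the third stays separate. A production pipeline would typically add blocking (for example by postal code) to avoid comparing every pair of records.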
Benefits
A single view of each record on the portal, thanks to deduplication.
The possibility for the owners of the 12 source databases to crowdsource corrections and apply them to their own database.
Up-to-date data on the portal, including both the latest modifications made to the source databases AND corrections/enrichments by crowdsourcing.
Complete automation of the process, enabling corrections to be propagated in both directions at regular intervals.
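As an illustration of how Record Lineage lets a crowdsourced correction travel back to the source databases, the following sketch groups corrections by source. The function name, the dictionary layout and the sample identifiers are hypothetical. The source does not describe the actual exchange format.

```python
from collections import defaultdict


def route_corrections(corrections, lineage):
    """Group crowdsourced corrections by source database.

    corrections: {aggregated_id: {field_name: new_value}}
    lineage:     {aggregated_id: [(source, source_id), ...]}
    Returns {source: [{"source_id": ..., "changes": {...}}, ...]}.
    """
    per_source = defaultdict(list)
    for agg_id, changes in corrections.items():
        # Follow the lineage links back to every contributing source record.
        for source, source_id in lineage.get(agg_id, []):
            per_source[source].append(
                {"source_id": source_id, "changes": dict(changes)})
    return dict(per_source)


# Hypothetical lineage table and one correction made on the portal.
lineage = {
    "agg-1": [("db_a", "17"), ("db_b", "x9")],
    "agg-2": [("db_c", "3")],
}
corrections = {"agg-1": {"opening_hours": "09:00-18:00"}}
updates = route_corrections(corrections, lineage)
```

Here the correction on `agg-1` is routed to both `db_a` and `db_b`, and `db_c` receives nothing. Each owner could then review and apply the changes to their own database during the next scheduled run.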