Open Data, Intranet, Data Reuse, Teamwork
Aggregating Multiple Databases with Record Lineage
Our client wanted to publish, on a single portal, a database resulting from the pooling of records fetched from 12 source databases.
Since overlaps existed between the different source databases, it was necessary to deduplicate so that portal visitors had a single view of each record.
Additionally, since portal users were able to correct and/or enrich the published data (crowdsourcing), it was necessary to maintain, for each entry in the aggregated database, a link to the corresponding record(s) in the source databases (record lineage), so that corrections could be passed back to the sources.
Solution provided by Tale of Data
Verification + geolocation of postal addresses.
Verification of postal codes, translation of postal codes into INSEE codes.
Harmonization of data from each of the 12 source databases in order to obtain a single target format.
Multi-criteria (name, address) and multi-strategy (phonetics, Levenshtein distance, N-grams, etc.) deduplication.
Record Lineage: keep track of each record throughout the processing chain, maintaining a link to its unique identifier as well as its original source database.
Automation of the entire processing chain in both directions (source databases → aggregated database AND aggregated database → source databases) in order to propagate the updates that may occur on each side.
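The deduplication-with-lineage step can be sketched in Python. This is a minimal illustration, not the actual platform logic: the field names, the sample records, the similarity threshold, and the use of the standard-library SequenceMatcher (in place of the phonetic, Levenshtein, and N-gram strategies mentioned above) are all assumptions for the example.

```python
from difflib import SequenceMatcher

def normalize(rec):
    """Build a comparison key from name + address (illustrative fields)."""
    return f"{rec['name']} {rec['address']}".lower().strip()

def similar(a, b, threshold=0.85):
    # Levenshtein-style similarity; a real pipeline would combine
    # several strategies (phonetics, N-grams, ...) instead of just one.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

def deduplicate(records):
    """Merge near-duplicate records while keeping, for each merged entry,
    a lineage list of (source_db, source_id) pairs (record lineage)."""
    merged = []
    for rec in records:
        for entry in merged:
            if similar(rec, entry):
                entry['lineage'].append((rec['source_db'], rec['source_id']))
                break
        else:
            merged.append({'name': rec['name'], 'address': rec['address'],
                           'lineage': [(rec['source_db'], rec['source_id'])]})
    return merged

records = [
    {'source_db': 'db1', 'source_id': 'A1', 'name': 'Musee du Louvre',
     'address': 'Rue de Rivoli, Paris'},
    {'source_db': 'db7', 'source_id': 'X9', 'name': 'Musée du Louvre',
     'address': 'rue de Rivoli, Paris'},
]
result = deduplicate(records)
# one portal entry, with two lineage links back to the source databases
```

The lineage list is what makes the round trip possible: a crowdsourced correction on the portal entry can be routed back to every (source_db, source_id) pair it carries.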
A single view of each record on the portal, thanks to deduplication algorithms.
The possibility for the data owners of the 12 source databases to collect updates by crowdsourcing, in order to apply them to their database.
Up-to-date data on the portal including both the latest modifications made to the source databases AND the corrections / enhancements by crowdsourcing.
Complete automation of the process that allows updates to be propagated back and forth at regular intervals.
Standardization of data from heterogeneous sources
Our client, a major player in passenger and freight transport, wanted to reduce the time spent collecting the input data needed to complete a project: several weeks, or even months.
The client's Data teams therefore began to design an intranet portal on which internal project managers could find the data needed to carry out their projects in just a few clicks.
The problem: each department producing potentially reusable data published a data sheet describing it in its own format, so there were several hundred different formats.
The purpose of the portal was to allow visitors to search datasets produced by different departments. Harmonization of the data sheets was therefore an essential prerequisite for the success of the portal project.
Solution provided by Tale of Data
Specification of the single format for the data sheet (= pivot format).
Importing the pivot format into Tale of Data: Tale of Data uses the target format to automatically suggest the data transformations needed to go from each current format to the pivot format.
Use of Tale of Data by the client's Data team to create, for each input data sheet format, the list of data transformations required to obtain an output data sheet in the pivot format.
Automation of the entire process: every day, new data sheets are submitted by the various departments to the client's private cloud (Microsoft Azure). Tale of Data retrieves these sheets and automatically applies the relevant transformations (depending on the originating department and the nature of the data sheet).
Once in the pivot format, the records are deduplicated, then sent by Tale of Data to the portal (using the portal API) where they are indexed in order to be available for project managers.
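Conceptually, each source format gets an ordered list of transformations that maps its sheets onto the pivot schema. The sketch below illustrates the idea only: the pivot fields, the format name, and the two transformations (a field rename and a French-to-ISO date conversion) are hypothetical, not the client's actual schema.

```python
# Ordered transformation lists, one per source data sheet format,
# mapping each sheet onto a single pivot format (fields are illustrative).
PIVOT_FIELDS = ["title", "owner_department", "update_date"]

def rename(old, new):
    def step(sheet):
        sheet[new] = sheet.pop(old)
        return sheet
    return step

def normalize_date(field):
    def step(sheet):
        # e.g. "31/12/2023" (French) -> "2023-12-31" (ISO 8601)
        d, m, y = sheet[field].split("/")
        sheet[field] = f"{y}-{m}-{d}"
        return sheet
    return step

PIPELINES = {
    "dept_logistics_v2": [           # hypothetical source format name
        rename("titre", "title"),
        rename("service", "owner_department"),
        rename("maj", "update_date"),
        normalize_date("update_date"),
    ],
}

def to_pivot(source_format, sheet):
    """Apply the transformation list for this format, then keep pivot fields."""
    for step in PIPELINES[source_format]:
        sheet = step(sheet)
    return {k: sheet[k] for k in PIVOT_FIELDS}

sheet = {"titre": "Chantiers 2023", "service": "Logistique", "maj": "31/12/2023"}
pivot = to_pivot("dept_logistics_v2", sheet)
```

Because every format resolves to the same pivot schema, downstream steps (deduplication, indexing on the portal) only ever see one structure, regardless of which department produced the sheet.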
Tens of millions of euros saved thanks to a drastic reduction in the start-up time of new projects.
The portal is now routinely used by project managers to gather the data they need for their projects.
The rate of data reuse is increasing sharply: far fewer datasets are purchased from external service providers simply because a project manager had no way of knowing they were already owned by the company.
The standardization of places (locations of construction sites, warehouses, depots, etc.) makes it possible to launch accurate geospatial searches on the portal.
The chances of failure have been greatly reduced as new projects start faster and with the right input.
Reconciliation of automotive repositories
Our client, a major player in consumer credit, wanted to offer any buyer of a used vehicle a one-click financing plan online.
The partner websites selling used vehicles mainly used ARGUS (sometimes JATO) as their automotive repository. However, the algorithms for producing financing plans were based on another automotive repository: EUROTAX.
To allow the buyer to receive a financing plan within seconds, it was necessary to establish a unique match between the entries of the two repositories, which did not share any common key. The problem was complex due to the differences in how the two repositories describe vehicles.
Solution provided by Tale of Data
The use of special joins (called "full-text" joins) designed by Tale of Data, applied to approximately 100,000 entries per repository:
The creation of a composite key for each repository by concatenating several fields (e.g. model, long version label, number of doors, year of commissioning, ...)
Each composite key is matched with the composite keys from the other repository that have the most "words" in common. Additionally, words are weighted according to their frequency in the corpus of composite keys (principle: the rarer a word is in the corpus, the greater its weight, and therefore the more reliable the match).
Elimination of multiple matches using numerical arbitration fields (such as the price including tax or the CO2 emission level): these fields are not standardized enough to be included in the composite key, but they prove very effective when a vehicle from one repository is matched with several vehicles from the other. The algorithm then picks the candidate with the nearest price and, as a tiebreaker, the nearest CO2 emission rate.
Thanks to the involvement of business experts (who have in-depth knowledge of automotive repositories), the fields included in the composite key, as well as the arbitration fields, could be chosen optimally.
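The weighted full-text join described above can be sketched as follows. The field names, the IDF-style weighting formula, and the sample ARGUS/EUROTAX entries are all illustrative assumptions; Tale of Data implements this matching natively, and the sketch only arbitrates on price (the CO2 tiebreaker is omitted for brevity).

```python
import math
from collections import Counter

def composite_key(entry):
    """Concatenate several descriptive fields into a bag of words
    (field names are illustrative)."""
    return (f"{entry['model']} {entry['version']} "
            f"{entry['doors']} {entry['year']}").lower().split()

def idf_weights(keys):
    """The rarer a word is in the corpus of composite keys, the higher
    its weight, and therefore the more reliable a match on it."""
    df = Counter(w for key in keys for w in set(key))
    n = len(keys)
    return {w: math.log(n / df[w]) + 1.0 for w in df}

def full_text_join(left, right):
    """Match each left entry to the right entry with the highest weighted
    word overlap; break ties with the nearest-price arbitration field."""
    right_keys = [composite_key(r) for r in right]
    weights = idf_weights(right_keys)
    pairs = []
    for lrec in left:
        lk = set(composite_key(lrec))
        scores = [sum(weights.get(w, 1.0) for w in lk & set(rk))
                  for rk in right_keys]
        best = max(scores)
        candidates = [r for r, s in zip(right, scores) if s == best]
        # Arbitration: among equally scored candidates, nearest price wins
        winner = min(candidates, key=lambda r: abs(r['price'] - lrec['price']))
        pairs.append((lrec, winner))
    return pairs

argus = [{'model': 'Clio', 'version': 'Intens TCe 100', 'doors': 5,
          'year': 2020, 'price': 15000}]
eurotax = [{'model': 'Clio', 'version': 'Intens TCe 100', 'doors': 5,
            'year': 2020, 'price': 14900},
           {'model': 'Clio', 'version': 'Zen TCe 100', 'doors': 5,
            'year': 2020, 'price': 13900}]
pairs = full_text_join(argus, eurotax)
```

Note that price is deliberately kept out of the composite key: it is too noisy to contribute to word matching, but as a numeric arbitration field it reliably separates otherwise equally scored candidates.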
The rate of unique matches has increased as follows:
55%* without Tale of Data, with the first approach, which consisted of asking Data Scientists to code specific string-matching algorithms in Python.
95% with the composite key approach proposed by Tale of Data
The remaining 5% of multiple matches did not present a significant difference in terms of the financing plan generated; the Tale of Data approach was therefore validated by the client's business teams after a week.
*for 55% of vehicles from the first repository a unique match was found in the second repository.