Multiple database aggregation with record lineage
Our client wanted to publish, on a single portal, a database created by pooling records from 12 source databases.
Overlaps between different source databases made deduplication necessary to ensure that just one view of each record was offered to portal visitors.
Moreover, because portal users can correct and/or enrich the published information (crowdsourcing), each entry in the aggregated database had to keep a link to the corresponding record(s) in the source databases (record lineage), so that corrections could be passed back to the source.
Although this particular use case concerns cultural sites, it can be applied identically to lists of businesses and individuals (CRM), product databases, etc.
Verification and geolocation of postal addresses.
Validation of postcodes and translation of postcodes into INSEE codes.
Harmonization of the data from each of the 12 source databases to obtain a single target format.
Record lineage: preservation, throughout the processing chain, of each record's identifier and of its original source database.
Automation of the entire processing chain in both directions (source databases → aggregated database AND aggregated database → source databases) in order to propagate the updates and enrichments occurring at each end.
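To illustrate how deduplication and record lineage fit together, here is a minimal Python sketch. It is not the Tale of Data implementation: the field names (name, postcode, source, source_id), the exact-match deduplication key and both function names are assumptions made for the example. A real pipeline would rely on verified addresses and fuzzier matching.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(text.lower().split())


def aggregate_with_lineage(records):
    """Merge duplicate records into one entry each, remembering every source record.

    Each input record is a dict with hypothetical keys:
    source, source_id, name, postcode.
    """
    merged = {}
    for rec in records:
        # Assumed deduplication key: normalized name + postcode.
        key = (normalize(rec["name"]), rec["postcode"])
        entry = merged.setdefault(
            key,
            {"name": rec["name"], "postcode": rec["postcode"], "lineage": []},
        )
        # Lineage: the source database and the record identifier within it.
        entry["lineage"].append((rec["source"], rec["source_id"]))
    return list(merged.values())


def route_correction(entry, field, new_value):
    """Turn one portal correction into one update per contributing source record."""
    return [
        {"source": source, "source_id": source_id, "field": field, "value": new_value}
        for source, source_id in entry["lineage"]
    ]
```

Because each aggregated entry carries its list of (source, source_id) pairs, a correction made on the portal can be fanned out to every source database that contributed to that entry.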
Single view of each record on the portal, thanks to deduplication.
The owners of the 12 source databases had to be able to retrieve crowdsourcing corrections so that they could apply them to their own databases.
Up-to-date data on the portal including both the latest modifications made in the source databases AND corrections / enrichments by crowdsourcing.
Full process automation, propagating corrections in both directions at regular intervals.
Standardization of data from different sources
Our client, a major passenger and freight transport company, wanted to reduce the time (weeks or even months) required to collect the input data needed for a project.
The client's data teams therefore began designing an intranet portal where internal project managers could find, in just a few clicks, the data they needed for their projects.
The problem: each department producing potentially reusable data provided this information in a datasheet published in a format of its own, meaning that several hundred different formats existed.
The purpose of the portal was to allow cross-searches of the data sets generated by different departments. Datasheet harmonization was therefore essential to the success of the portal project.
Single format created for datasheets.
Format import: Tale of Data uses the target format to automatically suggest to the user the data transformations that will be needed to move from the current to the target format.
The client's data team used Tale of Data to produce, for each input datasheet format, lists of the data transformations that would be needed to create an output datasheet.
Full process automation: each day the various departments upload new datasheets to the client's private cloud (Microsoft Azure). Tale of Data retrieves these datasheets and automatically applies the relevant transformations to them (depending on the department of origin and the nature of the datasheet).
Once in pivot format, the datasheets are deduplicated and then sent by Tale of Data to the portal (via API), where they are indexed so that they can be searched.
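The idea of one list of transformations per input format, selected by department of origin and datasheet type, can be sketched as follows. This is an illustrative Python sketch only: the department names, datasheet types, field names and the helpers rename, strip_upper and to_pivot are all hypothetical, and Tale of Data's actual transformation engine is not described in this text.

```python
def rename(old, new):
    """Build a step that renames a field."""
    def step(row):
        row = dict(row)  # copy so the input row is left untouched
        row[new] = row.pop(old)
        return row
    return step


def strip_upper(field):
    """Build a step that trims and upper-cases a text field."""
    def step(row):
        row = dict(row)
        row[field] = row[field].strip().upper()
        return row
    return step


# One ordered list of transformations per (department, datasheet type).
# All keys and field names below are made up for the example.
TRANSFORMS = {
    ("freight", "site_survey"): [rename("Lieu", "location"), strip_upper("location")],
    ("passenger", "traffic_count"): [rename("station_name", "location")],
}


def to_pivot(row, department, kind):
    """Apply the transformation list registered for this datasheet format."""
    for step in TRANSFORMS[(department, kind)]:
        row = step(row)
    return row
```

Keeping the transformation lists in a lookup table means that supporting a new datasheet format only requires registering one more entry, not changing the pipeline itself.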
Tens of millions of euros saved thanks to a dramatic reduction in new project start-up times.
The portal is now regularly used by project managers to gather the data they need for their projects.
The rate of data reuse is rising fast and the number of datasets purchased from external providers has decreased significantly. This is because project managers previously had no way of knowing whether their company already had the data.
Because locations (of construction sites, warehouses, depots, etc.) have been standardized, targeted geospatial searches can be performed on data sets in the portal.
Risk of failure has been greatly reduced because projects now start faster and with the right input data.
Reconciliation of automotive standards
Our client, a leader in the field of consumer credit, wanted to offer an online financing plan to all buyers of used cars at the click of a mouse.
While partner sites selling used vehicles mostly use Argus (sometimes JATO), the algorithms creating our client's financing plan were based on EUROTAX.
In order to receive a financing plan in a matter of seconds, a unique match had to be made with entries in repositories that do not share a common key and whose different descriptions of vehicles make matching far from easy.
Use of special "full-text" joins designed by Tale of Data (about 100,000 entries per repository):
Composite key created for each repository by concatenating several fields (e.g. model, long version label, number of doors, launch year, etc.)
Each composite key is then matched against the composite keys of the other repositories, retaining the key(s) with which it shares the highest number of "words". Words are also weighted for rarity within the composite key (principle: the rarer a word is in the key, the more credible the match)
Elimination of multiple matches using numerical fields called arbitration fields (e.g. price including VAT or level of CO2 emissions). These fields are not standardized enough to be entered in the composite key, but are very helpful during selection when a vehicle from one repository is matched against several vehicles from a different repository. The selected vehicle will be the one closest in price and CO2 emissions.
The involvement of business experts (who have in-depth knowledge of automotive repositories) meant that decisions on which fields to include in the composite key and on arbitration fields were as accurate as possible.
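The matching steps above can be sketched in a few lines of Python. This is a simplified illustration, not Tale of Data's full-text join: the field names, the logarithmic rarity weight (an inverse-document-frequency style formula computed over the target repository's keys) and the way price and CO2 gaps are combined are all assumptions chosen for the example.

```python
import math
from collections import Counter

KEY_FIELDS = ("model", "version", "doors", "year")  # assumed composite-key fields


def composite_key(vehicle):
    """Concatenate the key fields and split the result into a set of lowercase words."""
    return set(" ".join(str(vehicle[f]) for f in KEY_FIELDS).lower().split())


def word_weights(vehicles):
    """Weight each word by rarity: words found in fewer keys get a higher weight."""
    n = len(vehicles)
    doc_freq = Counter(w for v in vehicles for w in composite_key(v))
    return {w: math.log(1 + n / count) for w, count in doc_freq.items()}


def match(vehicle, candidates, weights):
    """Return the candidate sharing the most (weighted) words with the vehicle.

    Ties are resolved with the arbitration fields: the candidate closest
    in price and CO2 emissions wins. Assumes candidates is non-empty.
    """
    key = composite_key(vehicle)
    scores = [
        sum(weights.get(w, 0.0) for w in key & composite_key(c)) for c in candidates
    ]
    best = max(scores)
    tied = [c for c, s in zip(candidates, scores) if math.isclose(s, best)]

    def arbitration_gap(c):
        # Relative gaps so that price and CO2 are on comparable scales.
        return (abs(c["price"] - vehicle["price"]) / vehicle["price"]
                + abs(c["co2"] - vehicle["co2"]) / vehicle["co2"])

    return min(tied, key=arbitration_gap)
```

Note that price and CO2 never enter the composite key in this sketch. They are only consulted when several candidates obtain the same key score, which mirrors the role of the arbitration fields described above.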
The unique match rate has risen:
From 55% with the original approach, in which the client's data scientists were asked to code string-matching algorithms in Python, algorithms that the business regularly rejected over a period of several months
To 95% using the composite key and business involvement approach suggested by Tale of Data
The remaining 5% of multiple matches presented no significant differences in terms of the financing plan generated. The Tale of Data approach was validated after one week by the client's business teams.