Organizations face many challenges related to the exponential growth of data. Setting aside issues related to data storage and protection, the most important issues are:
Data analysis: organizations must be able to analyze data to transform it into useful and actionable information, improve operations and make informed decisions,
Data quality, a prerequisite for analysis: it is essential to ensure the integrity of the data in order to guarantee accurate, relevant and appropriate results.
Data quality solutions: how to choose the right data quality tool?
What is data quality? 🤔
Data quality is a set of metrics that allow you to judge the relevance and usability of your data. Data quality means being able to measure the accuracy, completeness, integrity and timeliness of your data:
Accuracy means that the data is correct and consistent,
Completeness means that the data is not partial,
Integrity means that the data is protected against unauthorized changes, deletions and additions,
Timeliness means that the data is up to date.
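To make the idea of "measuring" more concrete, here is a minimal sketch of how two of these criteria (completeness and how up to date the data is) could be computed. The records, the field names and the 90-day freshness window are all hypothetical, chosen only for illustration.

```python
from datetime import date

# Hypothetical customer records; None marks a missing value.
records = [
    {"name": "Emma Dupond", "city": "Paris", "updated": date(2024, 5, 1)},
    {"name": "Paul Martin", "city": None, "updated": date(2021, 1, 15)},
    {"name": None, "city": "Lyon", "updated": date(2024, 4, 20)},
    {"name": "Lea Bernard", "city": "Lille", "updated": date(2023, 12, 2)},
]


def completeness(rows, field):
    """Share of rows in which `field` is filled in."""
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return filled / len(rows)


def timeliness(rows, field, today, max_age_days):
    """Share of rows whose `field` date is at most `max_age_days` old."""
    fresh = sum(1 for row in rows if (today - row[field]).days <= max_age_days)
    return fresh / len(rows)


# A fixed reference date keeps the example reproducible (an assumption).
today = date(2024, 6, 1)
name_completeness = completeness(records, "name")
freshness = timeliness(records, "updated", today, max_age_days=90)
print(f"name completeness: {name_completeness:.0%}, freshness: {freshness:.0%}")
```

Accuracy and integrity are harder to reduce to a single ratio, since they usually require a trusted reference source or an audit trail to compare against.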
In many organizations today, data is produced at high speed and on a large scale, making it difficult to manage and control. This data can be:
incomplete, incorrect or even aberrant,
recorded in different formats and in different storage systems, which complicates its interpretation.
To remedy these difficulties, implementing a data quality policy is a major priority. Decisions can only be well informed if the underlying data is of high quality, and this holds in every sector and discipline. Data quality processes are essential for confidence and accuracy, both in terms of the quantity of information collected and its reliability.
The more efficiently your data is collected, monitored, corrected and harmonized, the better your conclusions and the more relevant your decisions will be.
It is therefore fundamental to determine how to control and improve the quality of data, so that governance rules can be put in place to guarantee this quality in a sustainable manner.
Why is data quality a problem in business?
Data quality is actually a recurring problem for these main reasons:
Human input regularly creates new inconsistencies or duplicates (in CRM, ERP, HR software, etc.). Some of these errors can be avoided by advanced input controls (e.g. immediate verification of a city name or a postal code). However, not all errors can be avoided, especially those involving consistency between information entered in different fields.
Sensors are not immune to failure: they can emit outliers, or behave erratically in the interval between two measurements.
In Machine Learning, predictive models may have been trained on good-quality data, but once in production they are confronted with data they have never seen. If the quality of the input data degrades over time (missing data, outliers), the accuracy of the predictions, which is by nature very sensitive to data quality, will drop significantly. The predictive model can end up producing unreliable results.
Putting AI into production therefore requires continuous data quality control.
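As an illustration of what such a continuous check could look like, here is a minimal sketch that compares a production batch with statistics taken from the training data. The temperature readings, the function name `quality_report` and the threshold of three standard deviations are assumptions made for this example, not a recommended setting.

```python
import statistics


def quality_report(training_values, production_values, z_threshold=3.0):
    """Compare a production batch with statistics from the training data."""
    mean = statistics.mean(training_values)
    stdev = statistics.stdev(training_values)
    # Values that are present (None marks a missing reading).
    present = [v for v in production_values if v is not None]
    missing_rate = 1 - len(present) / len(production_values)
    # Flag readings that fall far outside what the model saw in training.
    outliers = [v for v in present if abs(v - mean) > z_threshold * stdev]
    return {"missing_rate": missing_rate, "outliers": outliers}


# Hypothetical temperature readings (in °C) used to train a model...
training = [20.1, 19.8, 20.5, 21.0, 20.3, 19.9, 20.7, 20.2]
# ...and a later production batch with one gap and one faulty reading.
production = [20.4, None, 20.0, 85.0, 20.6]

report = quality_report(training, production)
print(report)
```

In practice, a report like this would be computed on every incoming batch, with an alert raised when the missing rate or the number of outliers exceeds a level the team considers acceptable.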
Data quality: how to detect data entry errors?
The first step in a data quality control process is error detection to correct incomplete, incorrect or outlier data.
The main sources of anomalies in the data
Errors in data, even marginal ones, can have a huge impact on business decisions whenever those decisions are based on:
Dashboards built from data of insufficient quality, possibly with duplicates (e.g. duplicates in a customer database are a major obstacle to identifying the best customers, since there is no Single Customer View),
Predictive models, which are more technical (neural network, random forest, logistic regression) and are, in essence, extremely sensitive to inaccurate or incomplete data during the learning phase.
Data anomalies can come from a variety of sources: erroneous or illegible manual entries, transmission failures, conversion problems, incomplete or inadequate processes, etc. It is important to be able to identify the sources and types of errors in order to understand, prevent and correct them.
Implementing regular automated quality control rules then ensures that errors are caught and can be corrected before they affect decision making.
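Automated rules of this kind can be sketched as a simple list of named checks applied to every record. The rules, field names and sample rows below are hypothetical; in particular, the five-digit postal code assumes French-style codes and the dates are assumed to be ISO-formatted strings.

```python
import re

# Each rule is a (label, predicate) pair; the predicate returns True
# when the row is valid. All rules here are illustrative assumptions.
RULES = [
    ("email is well formed",
     lambda r: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", r["email"]) is not None),
    ("postal code has 5 digits",
     lambda r: re.fullmatch(r"\d{5}", r["postal_code"]) is not None),
    # Cross-field consistency: ISO dates compare correctly as strings.
    ("end date is not before start date",
     lambda r: r["start"] <= r["end"]),
]


def check(rows):
    """Return (row index, rule label) for every violated rule."""
    return [
        (i, label)
        for i, row in enumerate(rows)
        for label, rule in RULES
        if not rule(row)
    ]


rows = [
    {"email": "emma@example.com", "postal_code": "75001",
     "start": "2024-01-01", "end": "2024-06-30"},
    {"email": "paul.example.com", "postal_code": "7500",
     "start": "2024-03-01", "end": "2024-02-01"},
]

violations = check(rows)
for index, label in violations:
    print(f"row {index}: failed rule '{label}'")
```

Keeping the rules in a plain list makes it easy to add new ones and to run the same checks on a schedule, before the data reaches a dashboard or a model.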
Working on data quality means recognizing that it can be influenced by humans, but not only by humans. Data entry errors can also be caused by what is called "bad encoding" or poor transcription.
Data entry errors can be tricky to detect, particularly when they produce duplicates and, even more so, "near duplicates". For example, a mistyped letter (a typo) is extremely difficult, if not impossible, to detect with tools such as Excel or even SQL.
Improving data quality requires a certain mindset: recognizing that these errors can exist, even if you don't see them at first 😇.
Detect data quality problems with specialized tools
To go from "blind" to "seeing", it is possible to use solutions with artificial intelligence functions, such as fuzzy logic. This technique makes it possible to detect input errors when values are close to one another. These are what we call "near duplicates". Fuzzy logic makes it possible to compare names of people that have been entered differently, such as:
'Emma Dupont' and 'Emma Dupond'
'Emma Dupond' and 'Emma née Dupond' (the word 'née' is added)
'Evilatrie' or 'Malorie' or even 'Mallorie'.
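To give a feel for how near duplicates can be spotted, here is a minimal sketch based on approximate string matching with Python's standard `difflib` module. This is a simple stand-in and not the algorithm used by any particular product; the helper names and the 0.8 threshold are assumptions that would need tuning on real data.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio between 0 and 1, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def near_duplicates(names, threshold=0.8):
    """Return pairs of distinct names whose similarity reaches `threshold`."""
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if first != second and similarity(first, second) >= threshold:
                pairs.append((first, second))
    return pairs


names = [
    "Emma Dupont", "Emma Dupond", "Emma née Dupond",
    "Malorie", "Mallorie", "Paul Martin",
]

pairs = near_duplicates(names)
for first, second in pairs:
    print(f"{first!r} ~ {second!r} ({similarity(first, second):.2f})")
```

Comparing every pair is fine for a short list but grows quadratically with the number of records, which is one reason dedicated tools are useful on large datasets.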
Traditional tools, such as Excel, are not well suited to identifying 'matching' data. By using more advanced solutions, based on artificial intelligence, it is possible to:
detect anomalies much more efficiently, correct them, normalize textual data, deduplicate and thus improve data quality,
automate these detection/correction operations in order to integrate them into data pipelines.
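The normalization and deduplication steps mentioned above can be sketched as two small functions that could sit inside a data pipeline. The normalization choices (lowercasing, stripping accents, collapsing whitespace) and the sample city names are assumptions for illustration; real data often needs domain-specific rules on top.

```python
import unicodedata


def normalize(text):
    """Lowercase, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        c for c in decomposed if not unicodedata.combining(c)
    )
    return " ".join(without_accents.lower().split())


def deduplicate(values):
    """Keep the first occurrence of each value, comparing normalized forms."""
    seen = set()
    unique = []
    for value in values:
        key = normalize(value)
        if key not in seen:
            seen.add(key)
            unique.append(value)
    return unique


cities = ["Orléans", "orleans", "  ORLÉANS ", "Paris", "paris", "Lyon"]
unique_cities = deduplicate(cities)
print(unique_cities)
```

Because both functions are deterministic and free of side effects, they can be rerun on every new batch of data without manual intervention.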
If the very first step is awareness, i.e. admitting that you have anomalies in your data, you must also admit that this has a cost for the organization.
However, the real cost of data quality problems can be difficult to assess.
At Tale of Data, we propose to break these costs down along two dimensions. This can make it easier for you to measure the impact of data quality issues on your business:
The dimension: hidden costs / direct costs (direct in the sense of visible)
The dimension: operational / strategic
Here is a matrix to illustrate our point:
"Companies need to take pragmatic, targeted steps to improve their enterprise data quality if they want to accelerate the digital transformation of their organization," says Gartner, whose recent study estimates that poor data quality costs companies an average of $12.9 million per year.
But anomaly detection is not the only challenge of data quality: working with heterogeneous data is also a challenge that must be met.
How to deal with heterogeneous data?
Heterogeneous data management has become increasingly necessary with the explosion of data and the proliferation of data sources in organizations.
Data is very rarely analyzed on its own. To analyze it, it is often necessary to combine it with other data, to group it together or to enrich it.
To process heterogeneous data, make it more coherent and homogeneous, and thus make it easier to combine and use, two steps are necessary:
Identify the sources: it is important to first identify all the data sources and their respective formats. This is not the most exciting step, but it is the one that will determine the success of your project to bring this heterogeneous data up to quality.
Harmonize the format: This step consists of creating a common format for all data, no matt