Data quality is one of the many challenges every organization must confront in the face of exponential data growth. Setting aside the problems of data storage and protection, the most important issues are as follows:
data analysis: organizations need to be able to analyze data to transform it into useful, actionable information, improve operations and make informed decisions,
data quality, an essential prerequisite for analysis: data integrity must be ensured in order to guarantee accurate, relevant and appropriate results.
What is data quality? 🤔
Data quality is a set of metrics for judging the relevance and usability of your data. Managing it means being able to measure the accuracy, completeness, integrity and timeliness of your data (a minimal sketch of how some of these metrics can be computed follows the list below):
accuracy means that the data is correct and consistent,
completeness means that the data is not partial,
integrity means that the data is protected against unauthorized modification, deletion or addition,
timeliness means that the data is up to date.
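To make these metrics concrete, here is a minimal sketch in Python that computes completeness, accuracy and timeliness on a toy customer table (integrity, being a matter of access control, is not measured this way). All field names, sample values and validation rules are illustrative assumptions, not a prescribed implementation:

```python
from datetime import date, timedelta

# Toy customer records; field names and values are purely illustrative.
records = [
    {"name": "Emma Dupont", "zip_code": "75011", "last_update": date(2024, 5, 2)},
    {"name": "Jean Martin", "zip_code": None,    "last_update": date(2021, 1, 15)},
    {"name": "Ana Silva",   "zip_code": "ABCDE", "last_update": date(2024, 4, 30)},
]

def is_valid_zip(value):
    # Accuracy rule (assumed here): a French zip code is exactly five digits.
    return isinstance(value, str) and len(value) == 5 and value.isdigit()

total = len(records)

# Completeness: share of records where the field is filled in at all.
completeness = sum(r["zip_code"] is not None for r in records) / total

# Accuracy: share of records whose value passes the validation rule.
accuracy = sum(is_valid_zip(r["zip_code"]) for r in records) / total

# Timeliness: share of records updated within the last year.
horizon = date.today() - timedelta(days=365)
timeliness = sum(r["last_update"] >= horizon for r in records) / total

print(f"completeness={completeness:.0%} accuracy={accuracy:.0%} timeliness={timeliness:.0%}")
```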
In many organizations today, data is produced at high speed and on a massive scale, making it difficult to manage and control. Data can be:
incomplete, incorrect or even aberrant,
recorded in different formats and on different storage systems, which complicates their interpretation.
Overcoming these difficulties by implementing a data quality policy is a major challenge. High-quality data is the key to informed decision-making in all sectors and disciplines. Essential to confidence and accuracy, data quality processes are crucial to both the quantity and reliability of the information gathered.
The more efficiently your data is collected, monitored, corrected and harmonized, the better your conclusions and the more relevant your decisions.
It is therefore essential to determine how to control and improve data quality, in order to put in place the governance rules needed to guarantee this quality over the long term.
For a more detailed introduction to these fundamentals, and to discover how to apply them concretely in your company, see our dedicated article: 👉 Qu'est-ce que la Data Quality ? Tout savoir pour maîtriser la qualité des données en entreprise
That article also explores the strategic and operational challenges of data quality, while highlighting practical approaches and tools to support you in your efforts.
Why is data quality a business issue?
Data quality is in fact a recurring problem, for these main reasons:
Data entry regularly creates new inconsistencies or duplicates (in CRM, ERP, HR software...). Some of these errors can be avoided by advanced data entry checks (e.g. immediate verification of a town name or zip code).
However, not all errors can be avoided, especially those involving consistency between information entered in different fields. Our customers mainly identify this type of error when migrating data to a new tool.
In the IoT field, for example, sensors are not failure-free: they may emit outliers, or behave erratically between two measurements.
In Machine Learning, predictive models may have been trained on high-quality data, but putting them into production confronts them with data they have never seen before. If the quality of the input data declines over time (missing data, outliers), the accuracy of the predictions, which is by nature very sensitive to data quality, will drop significantly. The predictive model may end up producing almost arbitrary results.
Putting AI into production therefore requires continuous monitoring of data quality.
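As an illustration of what such continuous monitoring can look like, here is a minimal Python sketch that compares each incoming batch of a numeric feature against statistics captured on the training data. The baseline values, thresholds and sample batch are illustrative assumptions, not a prescribed method:

```python
import statistics

# Statistics captured on the training data (illustrative values).
BASELINE = {"mean": 42.0, "stdev": 5.0}

def check_batch(values, baseline=BASELINE, max_missing=0.05, max_drift=3.0):
    """Return a list of alerts for one incoming batch of a numeric feature."""
    alerts = []
    missing = sum(v is None for v in values) / len(values)
    if missing > max_missing:
        alerts.append(f"missing rate {missing:.1%} exceeds {max_missing:.0%}")
    present = [v for v in values if v is not None]
    if present:
        drift = abs(statistics.mean(present) - baseline["mean"]) / baseline["stdev"]
        if drift > max_drift:
            alerts.append(f"mean drifted {drift:.1f} standard deviations from the baseline")
    return alerts

# A degraded batch: many missing values and a shifted distribution.
batch = [None, None, 90.0, 88.5, None, 95.2, 91.0]
for alert in check_batch(batch):
    print("ALERT:", alert)
```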
Data quality: how to detect data entry errors?
The first step in a data quality control process is error detection, in order to correct incomplete, incorrect or aberrant data.
The main sources of data anomalies
Data errors, no matter how small, can have an enormous impact on company decisions when these decisions are based on:
dashboards built from data of insufficient quality, possibly containing duplicates (e.g. duplicates in a customer database are a major obstacle to identifying the best customers: the absence of a Single Customer View),
predictive models: the more technical ones (neural networks, random forests, logistic regression) are inherently highly sensitive to inaccurate or incomplete data during the learning phase.
Data anomalies can have a wide variety of sources: erroneous or illegible manual entries, transmission failures, conversion problems, incomplete or unsuitable processes, etc. It is important to be able to identify the sources and types of errors, so that they can be understood, prevented and corrected.
Implementing regular, automated quality control rules then ensures that errors are spotted and can be corrected before they affect decision-making.
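By way of illustration, here is a minimal sketch of such automated control rules in Python: each rule is a named predicate applied to every record, plus an exact-duplicate check on a business key. The rules, field names and sample records are illustrative assumptions:

```python
import re

# Named validation rules: each is a predicate applied to one record.
RULES = [
    ("zip code must be 5 digits", lambda r: re.fullmatch(r"\d{5}", r.get("zip_code") or "") is not None),
    ("email must contain '@'", lambda r: "@" in (r.get("email") or "")),
]

def run_controls(records, key="email"):
    violations, seen = [], set()
    for i, record in enumerate(records):
        for name, predicate in RULES:
            if not predicate(record):
                violations.append((i, name))
        # Exact-duplicate detection on a business key.
        if record.get(key) in seen:
            violations.append((i, f"duplicate {key}: {record.get(key)}"))
        seen.add(record.get(key))
    return violations

records = [
    {"zip_code": "75011", "email": "emma@example.com"},
    {"zip_code": "7501", "email": "emma@example.com"},  # bad zip + duplicate email
]
for row, problem in run_controls(records):
    print(f"row {row}: {problem}")
```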
Working on data quality means recognizing that it can be influenced by humans, but not exclusively. Data entry errors can also be caused by so-called "bad encoding" or poor transcription.
It can be tricky to detect typing errors, especially when you're dealing with duplicates, or even "near-duplicates". For example, when a letter is mistyped (a typo), the error is extremely difficult, if not impossible, to detect with tools such as Excel or even SQL.
To improve data quality, you need to be in a certain frame of mind: recognizing that these errors can exist, even if you can't see them at first glance 😇.
Detecting data quality problems with specialized functions
To go from "blind" to "sighted", it is possible to use solutions with artificial intelligence functions, such as fuzzy logic. This technique can detect input errors when two values are close to each other: these are the "near-duplicates". Fuzzy logic makes it possible to compare names of people that have been entered differently, such as the examples below (a sketch of this kind of comparison follows them):
'Emma Dupont' and 'Emma Dupond'
'Emma Dupond' and 'Emma née Dupond' (the word 'née' has been added)
'Malaurie', 'Malorie' or even 'Mallorie'.
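These examples can be reproduced with any fuzzy string-matching technique. As a stand-in for a dedicated fuzzy-logic engine, here is a minimal sketch using Python's standard difflib; the 0.7 threshold is an illustrative assumption to be tuned on real data:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Ratio in [0, 1]; 1.0 means the strings are identical.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.7  # assumed cut-off: tune on your own data

pairs = [
    ("Emma Dupont", "Emma Dupond"),
    ("Emma Dupond", "Emma née Dupond"),
    ("Malaurie", "Mallorie"),
]
for a, b in pairs:
    score = similarity(a, b)
    verdict = "possible near-duplicate" if score >= THRESHOLD else "distinct"
    print(f"{a!r} vs {b!r}: {score:.2f} -> {verdict}")
```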
Traditional tools, such as Excel, are ill-suited to identifying "matching" data. By using more advanced solutions based on artificial intelligence, it is possible to:
detect anomalies much more effectively, correct them, normalize textual data, deduplicate and thus improve data quality,
automate these detection/correction operations and integrate them into data pipelines.
If the very first step is awareness, i.e. admitting that you have anomalies in your data, you also have to admit that this has a cost for the organization.
However, the real cost of data quality problems can be difficult to assess.
At Tale of Data, we offer a two-dimensional breakdown of these costs. This can make it easier for you to measure the impact of data quality on your business:
the first axis: hidden costs vs. direct costs (direct in the sense of visible),
the second axis: operational vs. strategic.
Here is a matrix to illustrate what we mean:
"Companies need to take pragmatic, targeted steps to improve the quality of their corporate data if they are to accelerate the digital transformation of their organization," says Gartnerwhose recent study estimates that poor data quality costs companies an average of $12.9 million a year.
But detecting anomalies is not the only challenge for data quality: working with heterogeneous data is another.
How do you process heterogeneous data to improve data quality?
Heterogeneous data management has become increasingly necessary with the explosion of data and the proliferation of data sources in organizations.
Data is rarely analyzed on its own. To analyze it, it is often necessary to combine it with other data, to group it together or enrich it.
To process heterogeneous data, make it more coherent and consistent, and thus make it easier to combine and use, two steps are necessary:
identify the sources: it is important to first identify all data sources and their respective formats. It is not the most exciting step, but it is the one that will dictate the success of your heterogeneous data quality project.
harmonize the format: this stage involves creating a common format for all data, wherever it comes from. Choosing this format can be tricky, but it is crucial: it is what allows all your data to be interpreted by a computer system. Without this format harmonization, it is impossible to link data together. This is an important issue when you have to carry out actions such as quality control of product catalog data. It is therefore essential to transform, or "normalize", your data according to the standard you have chosen.
To illustrate harmonization, let's take the example of a company collecting product references from different suppliers. Harmonizing data means using the same format for each type of information.
You may have to decide on questions such as these (a sketch of the resulting normalization follows this list):
How many characters should a product reference contain: 8, 12 or more?
Will the character string consist exclusively of digits, letters, or a mix of the two?
Will the beginning of the character string have a particular meaning: will the first two letters encode the country of manufacture, a warehouse code, a supplier code?
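Once those choices are made, normalization can be automated. Here is a minimal sketch assuming a hypothetical standard (two uppercase letters for the country of manufacture, followed by six digits); both the standard and the sample references are illustrative assumptions:

```python
import re

# Hypothetical target standard: two uppercase letters (country of
# manufacture) followed by six digits, e.g. "FR001234".
STANDARD = re.compile(r"[A-Z]{2}\d{6}")

def normalize_reference(raw):
    """Coerce a supplier reference into the chosen standard, if possible."""
    cleaned = re.sub(r"[\s\-_.]", "", raw).upper()  # drop common separators
    if STANDARD.fullmatch(cleaned):
        return cleaned
    return None  # cannot be normalized automatically: flag for manual review

for raw in ["fr-001234", "FR 001234", "1234", "de_987654"]:
    print(f"{raw!r} -> {normalize_reference(raw)!r}")
```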
When handling heterogeneous data sources (i.e. data from different "silos"), you need to create mapping tables and "repositories of repositories". Fuzzy logic is indispensable for matching two representations of the same entity. For example, in the case of a product database, the solution you use must be capable of automatically matching (ideally with a confidence coefficient) the following two products, as illustrated in the sketch after them:
HUAWEI MediaPad M5 10.8
HUAWEI M5 10.8"
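One simple way to produce such a confidence coefficient, shown here purely as an illustration, is the Jaccard similarity of the two labels' token sets; a real matching engine would combine several signals of this kind:

```python
import re

def tokens(label):
    # Lowercase, strip quote/inch marks, split on whitespace.
    return set(re.sub(r"[\"']", "", label.lower()).split())

def match_confidence(a, b):
    """Jaccard similarity of token sets, used as a naive confidence score."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

a = "HUAWEI MediaPad M5 10.8"
b = 'HUAWEI M5 10.8"'
print(f"confidence = {match_confidence(a, b):.2f}")  # 3 shared tokens out of 4 -> 0.75
```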
Heterogeneous data processing is therefore essential for exploiting the richness of corporate data and building bridges between information from different sources.
A data quality policy, with the right metrics to assess accuracy and completeness, is necessary but not sufficient. To guarantee lasting quality, and therefore a sustainable policy, it is essential to succeed in the industrialization stage: there is no data quality without automating the quality control of incomplete or incorrect data from different storage systems.
Data quality: why automate your data processing? 🤷♂️
The first answer that comes to mind is that automating data quality reduces the drudgery and inefficiency of manual cleaning.
However, there are other reasons why automating Data Quality Management (DQM) is essential. A common misconception is that we can solve all data quality problems once and for all, and move on to other things. Unfortunately, this is not the case.
With the exponential growth in the volume of data produced by companies, the automation of Data Quality Management is no longer an option.
Another mistake with far-reaching consequences is to assume that if data quality processing needs to be automated, you might as well develop scripts or computer programs to solve data quality problems. This is counterproductive for several reasons:
First of all, data quality concerns the business at least as much as, if not more than, IT. An IT specialist will be able to identify and correct date-related problems, but problems specific to the company's business will escape them: without business expertise, they cannot identify that a value is an outlier. Often it is the combination of several fields that is anomalous (e.g. if field C1 contains the value V1 and field C2 contains the value V2, then the value of field C3 must be greater than or equal to V3).
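Such a cross-field rule is straightforward to express once the business knowledge exists. Here is a minimal sketch in which an invented rule (premium products sold in France must cost at least 100) stands in for the abstract C1/C2/C3 pattern above:

```python
# Invented business rule, mirroring the C1/C2/C3 pattern: if country is
# "FR" (C1=V1) and range is "PREMIUM" (C2=V2), unit_price must be >= 100 (C3>=V3).
def violates_business_rule(record):
    if record["country"] == "FR" and record["range"] == "PREMIUM":
        return record["unit_price"] < 100
    return False

orders = [
    {"country": "FR", "range": "PREMIUM", "unit_price": 150},
    {"country": "FR", "range": "PREMIUM", "unit_price": 80},  # anomaly
    {"country": "DE", "range": "PREMIUM", "unit_price": 80},  # rule does not apply
]
for i, order in enumerate(orders):
    if violates_business_rule(order):
        print(f"row {i}: cross-field anomaly {order}")
```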
Secondly, business needs evolve. As business rules change (regulations, directives, strategy), having to update the code each time quickly becomes a problem. Over time this becomes extremely expensive, all the more so as an IT department has a host of other tasks and does not always have the resources to respond to immediate business needs.
Last but not least, business users don't understand the code, so they have no way of evolving the solution themselves. They are therefore totally dependent on an IT department that is often overwhelmed by its other tasks.
At Tale of Data, we believe that a data quality solution must be usable by both the business and IT departments. That's why choosing the right solution is fundamental.
Data quality solutions: how to choose the right tool?
Today, most companies want to become "data-driven".
Conversely, none of them can achieve this without first being Data-Quality-Driven. We invite you to read our customer testimonial on implementing a data-driven strategy with Tale of Data.
There are now new-generation platforms that make this possible with far less time and money than was the case just a few years ago.
A data quality platform should offer, as a minimum, the following functionalities:
connect to all your data sources: databases, files, CRM, ERP,
automatically discover/audit data available within the company,
automatic detection of anomalies (including fuzzy logic for textual data),
offer powerful data rectification, standardization and deduplication functions,
enable business users to add customized control and validation rules,
enable IT and business departments to work together, because when it comes to data quality they can't succeed without each other,
automate and schedule processing chains: detection / correction / quality maintenance,
alert in real time when anomalies are detected: data inevitably deteriorates over time (a minimal sketch of such a scheduled check follows this list).
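Here, purely as an illustration, is what the last two points can look like in their simplest form: a check that a scheduler (cron, an orchestrator, or the platform's own planner) would trigger periodically. The detection rule and the sample records are assumptions for the sketch:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

def detect_anomalies(records):
    # Placeholder for the detection rules discussed earlier in this article.
    return [r for r in records if not r.get("email")]

def scheduled_quality_check(fetch_records):
    # In production, a scheduler would call this function at regular intervals.
    anomalies = detect_anomalies(fetch_records())
    if anomalies:
        logging.warning("quality alert: %d anomalous record(s)", len(anomalies))
    else:
        logging.info("quality check passed")

scheduled_quality_check(lambda: [{"email": "emma@example.com"}, {"email": ""}])
```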
👉 Data quality is not a "one-shot" operation, but a sustainable policy to be implemented over the long term.