
Data quality: how to build a good data quality strategy?




Data quality is one of the many challenges every organization must address in the face of exponential data growth. Setting aside the problems of data storage and protection, the most important issues are as follows:

  • Data analysis: organizations must be able to analyze data to transform it into useful and actionable information, improve operations and make informed decisions,

  • Data quality, a prerequisite for analysis: it is essential to ensure the integrity of the data in order to guarantee accurate, relevant and appropriate results.



What is data quality? 🤔


Data quality is a set of metrics used to judge the relevance and usability of your data. Dealing with it means being able to measure the accuracy, completeness, integrity and timeliness of your data:

  • accuracy means that the data is correct and consistent,

  • completeness means that the data is not partial,

  • integrity means that the data is protected against unauthorized modification, deletion or addition,

  • timeliness means that the data is up to date.
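
As an illustration, here is a minimal Python sketch of how two of these metrics, completeness and timeliness, could be measured on a tabular dataset with pandas. The column names, the sample data and the 365-day freshness threshold are illustrative assumptions, not part of any standard.

    import pandas as pd

    df = pd.DataFrame({
        "customer_id": [1, 2, 3, 4],
        "email": ["a@example.com", None, "c@example.com", None],
        "updated_at": pd.to_datetime(
            ["2024-01-10", "2020-06-01", "2024-03-05", "2019-11-20"]
        ),
    })

    # Completeness: share of non-missing values, per column.
    completeness = df.notna().mean()

    # Timeliness: share of rows refreshed within the last 365 days,
    # measured against a fixed "as of" date for reproducibility.
    as_of = pd.Timestamp("2024-06-30")
    timeliness = (df["updated_at"] >= as_of - pd.Timedelta(days=365)).mean()

    print(completeness)                     # email: 0.5 -> 50% complete
    print(f"timeliness: {timeliness:.0%}")  # 50% of rows are fresh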

In many organizations today, data is produced at high speed and on a large scale, making it difficult to manage and control. Indeed, this data can be:

  • incomplete, incorrect or even aberrant,

  • recorded in different formats and in different storage systems, which complicates their interpretation.


To overcome these difficulties, implementing a data quality policy is a major challenge. High-quality data is the key to informed decision-making in all sectors and disciplines. Essential to confidence and accuracy, data quality processes are crucial to both the quantity and reliability of the information gathered.


The more efficiently your data is collected, monitored, corrected and harmonized, the better your conclusions and the more relevant your decisions will be.


It is therefore essential to determine how to control and improve data quality, in order to put in place the governance rules needed to guarantee this quality over the long term.

Why is data quality a problem in business?


Data quality is actually a recurring problem for these main reasons:

  • Data entry regularly creates new inconsistencies or duplications (in CRM, ERP, HR software...). Some of these errors can be avoided by advanced data entry checks (e.g. immediate verification of a city name or zip code).

However, not all errors can be avoided, especially those involving consistency between information entered in different fields/zones. This type of error is mainly identified by our customers in situations of data migration to a new tool.
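
To make this concrete, here is a minimal sketch of such an entry check in Python, covering both a format check and a cross-field consistency check. The 5-digit zip format, the tiny city/prefix table and the function name are illustrative assumptions; a real system would validate against a full reference repository.

    import re

    ZIP_RE = re.compile(r"^\d{5}$")  # French-style 5-digit zip code
    # A real system would check against a full reference repository;
    # this tiny city -> zip-prefix table is a stand-in.
    CITY_ZIP_PREFIX = {"Paris": "75", "Lyon": "69", "Marseille": "13"}

    def check_entry(city: str, zip_code: str) -> list[str]:
        """Return the list of entry errors; an empty list means it passes."""
        errors = []
        if not ZIP_RE.match(zip_code):
            errors.append(f"zip code {zip_code!r} is not 5 digits")
        prefix = CITY_ZIP_PREFIX.get(city)
        if prefix and not zip_code.startswith(prefix):
            errors.append(f"zip code {zip_code!r} does not match city {city!r}")
        return errors

    print(check_entry("Paris", "75011"))  # [] -> passes
    print(check_entry("Lyon", "75001"))   # cross-field inconsistency caught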


  • In the IoT field, for example, sensors are not failure-free: they may emit outliers, or behave erratically between two measurements.
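
As a rough illustration of catching such sensor glitches, here is a sketch that flags a reading when it deviates strongly from its recent history. The window size and threshold are arbitrary assumptions; production systems typically use more robust statistical methods.

    import statistics

    def flag_outliers(readings, window=5, threshold=3.0):
        """Flag readings that deviate strongly from the recent history."""
        flags = []
        for i, value in enumerate(readings):
            context = readings[max(0, i - window):i] or [value]
            center = statistics.median(context)
            spread = statistics.pstdev(context) or 1.0  # avoid zero spread
            flags.append(abs(value - center) > threshold * spread)
        return flags

    temps = [20.1, 20.3, 20.2, 95.0, 20.4, 20.2]  # 95.0 is a sensor glitch
    print(flag_outliers(temps))  # [False, False, False, True, False, False]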


  • In Machine Learning, predictive models may have been trained on high-quality data, but putting them into production confronts them with data they have never seen before. If the quality of input data declines over time (missing data, outliers), the accuracy of predictions, by nature very sensitive to data quality, will drop significantly. The predictive model may end up doing just about anything.


Putting AI into production therefore requires continuous data quality control.
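
A minimal sketch of such continuous control follows, assuming a baseline of expected value ranges and tolerated missing rates recorded at training time. The feature names and thresholds are hypothetical.

    import pandas as pd

    def input_quality_report(batch: pd.DataFrame, baseline: dict) -> dict:
        """Compare a production batch against training-time expectations."""
        report = {}
        for col, (lo, hi, max_missing) in baseline.items():
            missing = batch[col].isna().mean()
            # between() is False for NaN, so missing values also count here
            out_of_range = (~batch[col].between(lo, hi)).mean()
            report[col] = {
                "missing_rate": round(missing, 2),
                "out_of_range_rate": round(out_of_range, 2),
                "alert": missing > max_missing or out_of_range > 0.05,
            }
        return report

    # (min, max, tolerated missing rate) per feature, from the training set.
    baseline = {"age": (18, 100, 0.01), "income": (0, 500_000, 0.05)}
    batch = pd.DataFrame({"age": [34, None, 250],
                          "income": [42_000, 38_000, 51_000]})
    print(input_quality_report(batch, baseline))  # "age" raises an alert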

Data quality: how to detect data entry errors?


The first step in a data quality control process is error detection, in order to correct incomplete, incorrect or aberrant data.


The main sources of anomalies in the data


Data errors, no matter how small, can have an enormous impact on company decisions when these decisions are based on:


  • dashboards built from data of insufficient quality, possibly containing duplicates (e.g. duplicates in a customer database are a major obstacle to identifying the best customers, due to the absence of a Single Customer View),


  • predictive models: whatever the technique (neural network, random forest, logistic regression), predictive models are inherently sensitive to inaccurate or incomplete data during the learning phase.

Data anomalies can have a wide variety of sources: erroneous or illegible manual entries, transmission failures, conversion problems, incomplete or unsuitable processes, etc. It is important to be able to identify the sources and types of errors in order to understand, prevent and correct them.


Implementing regular automated quality control rules then ensures that errors are caught and can be corrected before they affect decision making.


Working on data quality means recognizing that it can be influenced by humans, but not only by them: data entry errors can also be caused by what is called "bad encoding" or poor transcription.


It can be tricky to detect data entry errors, especially duplicates, and above all "near-duplicates". For example, when a single letter is mistyped (a typo), the error is extremely difficult, if not impossible, to detect with tools such as Excel or even SQL.


To improve data quality, you need to be in a certain frame of mind: recognizing that these errors can exist, even if you can't see them at first glance 😇.

Detecting data quality problems with specialized functions

To go from "blind" to "sighted", it is possible to use solutions with artificial intelligence functions, such as fuzzy logic. This technique can detect input errors when two values are very close to each other. We call these "near-duplicates". Fuzzy logic makes it possible to compare names of people that have been entered differently, such as:

  • 'Emma Dupont' and 'Emma Dupond'

  • 'Emma Dupond' and 'Emma née Dupond' (the word 'née' is added)

  • 'Malaurie' or 'Malorie' or even 'Mallorie'.
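
For illustration, even Python's standard library can score such near-duplicates. The sketch below uses difflib's SequenceMatcher as a stand-in for the fuzzy-matching functions of a dedicated solution, with an arbitrary 0.8 threshold:

    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        """Ratio in [0, 1]; near-duplicates score high despite typos."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    pairs = [
        ("Emma Dupont", "Emma Dupond"),
        ("Emma Dupond", "Emma née Dupond"),
        ("Malorie", "Mallorie"),
    ]
    for a, b in pairs:
        score = similarity(a, b)
        verdict = "near-duplicate" if score > 0.8 else "distinct"
        print(f"{a!r} vs {b!r}: {score:.2f} -> {verdict}")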

Traditional tools, such as Excel, are not well suited to identifying "matching" data. By using more advanced solutions, based on artificial intelligence, it is possible to:

  • detect anomalies much more efficiently, correct them, normalize textual data, deduplicate and thus improve data quality,

  • automate these detection/correction operations in order to integrate them into data pipelines.

If the very first step is awareness, i.e. admitting that you have anomalies in your data, you must also admit that this has a cost for the organization.

However, the real cost of data quality problems can be difficult to assess.

At Tale of Data, we offer a two-dimensional breakdown of these costs. This can make it easier for you to measure the impact of data quality on your business:

  1. hidden costs vs. direct costs (direct in the sense of visible),

  2. operational costs vs. strategic costs.


Here is a matrix to illustrate our point:


Figure: poor data quality matrix

"Companies need to take pragmatic, targeted steps to improve the quality of their corporate data if they are to accelerate the digital transformation of their organization," says Gartnerwhose recent study estimates that poor data quality costs companies an average of $12.9 million a year.


But detecting anomalies is not the only challenge for data quality: working with heterogeneous data is also one.

How do you process heterogeneous data to improve data quality?


Heterogeneous data management has become increasingly necessary with the explosion of data and the proliferation of data sources in organizations.

Data is very rarely analyzed on its own. To analyze it, it is often necessary to combine it with other data, to group it together or to enrich it.


To process heterogeneous data, make it more coherent and homogeneous, and thus facilitate its use in combination, two steps are necessary:

  • identify the sources: it's important to first identify all data sources and their respective formats. This is not the most exciting step, but it is the one that will dictate the success of your heterogeneous data quality project.


  • harmonize the format: this stage involves creating a common format for all data, wherever it comes from. Choosing this format can be tricky, but it is crucial: it will then be used so that all your data can be interpreted by a computer system. Without this format harmonization, it is impossible to link data together. This is an important issue when you need to carry out actions such as quality control of product catalog data. It is therefore essential to transform, or "normalize", your data according to the standard you have chosen.

To illustrate harmonization, let's take the example of a company collecting product references from different suppliers. Harmonizing the data consists of using the same format for each type of information.

You may have to decide on:

  • How many characters should a product reference contain: 8 or 12 or more?

  • Will the string consist exclusively of numbers? or letters? or a mix of both?

  • Will the beginning of the string have a particular meaning: the first 2 letters will be the country of manufacture, a warehouse code, the code of a supplier?
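
To make these choices concrete, here is a sketch of a normalization function under one hypothetical house standard (12 characters: a 2-letter country code, a 3-character supplier code, a 7-digit serial). The format itself is an assumption for illustration; whatever standard you choose, the point is that every source is pushed through the same function, so downstream systems see a single representation.

    def normalize_reference(country: str, supplier: str, serial: str) -> str:
        """Build a 12-character reference: 2-letter country code,
        3-character supplier code, 7-digit serial (hypothetical standard)."""
        country = country.strip().upper()[:2]
        supplier = supplier.strip().upper().ljust(3, "X")[:3]
        digits = "".join(ch for ch in serial if ch.isdigit()).zfill(7)[-7:]
        return f"{country}{supplier}{digits}"

    # Three supplier spellings of the same serial -> one canonical reference.
    print(normalize_reference("fr", "HU", "98-7654-3"))   # FRHUX9876543
    print(normalize_reference("FR ", "hu", "009876543"))  # FRHUX9876543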

When handling heterogeneous data sources (i.e. from different silos), you need to create mapping tables and repositories. Fuzzy logic is essential to match two representations of the same entity. For example, in the case of a product database, the solution you will use must be able to automatically match (with a confidence coefficient if possible) the following two products:

  • HUAWEI MediaPad M5 10.8

  • HUAWEI M5 10.8"
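
As an illustration, a very simple confidence coefficient can be computed by comparing normalized word tokens. The Jaccard similarity below is a stand-in for the fuzzy-matching logic of a real solution; the resulting score would then be compared to a business-chosen threshold.

    def token_confidence(a: str, b: str) -> float:
        """Jaccard similarity on lowercased tokens, as a rough confidence
        coefficient for a product match."""
        norm = lambda s: set(s.lower().replace('"', "").split())
        tokens_a, tokens_b = norm(a), norm(b)
        return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

    score = token_confidence('HUAWEI MediaPad M5 10.8', 'HUAWEI M5 10.8"')
    print(f"match confidence: {score:.2f}")  # 0.75 -> likely the same product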

Processing heterogeneous data is therefore essential to exploit the richness of the company's data and build bridges between information from different sources.


A data quality policy, with the right metrics to assess accuracy and completeness, is necessary, but not sufficient. To guarantee lasting quality, and therefore a sustainable policy, it is essential to succeed in the industrialization stage: there is no data quality without automating the quality control of incomplete or incorrect data from different storage systems.





Data quality: why automate your data processing? 🤷‍♂️


The first answer that comes to mind is that automating data quality reduces the drudgery and inefficiency of manual cleaning.


However, there are other reasons why automating Data Quality Management (DQM) is essential. A common misconception is that we can solve all data quality problems once and for all, and move on to other things. Unfortunately, this is not the case.


With the exponential growth in the volume of data produced by companies, the automation of Data Quality Management is no longer an option.

Another mistake with far-reaching consequences is to assume that if data quality processing needs to be automated, you might as well develop scripts or computer programs to solve data quality problems. This is counterproductive for several reasons:

  1. First of all, data quality concerns the business at least as much as, if not more than, IT. An IT specialist will be able to identify and correct date-related problems, but problems specific to the company's business will escape them: they will not be able to identify that a value is an outlier without business expertise. Often it is the correlation of several fields that is anomalous (e.g. if field C1 contains the value V1 and field C2 contains the value V2, then the value of field C3 must be greater than or equal to V3; a sketch of such a cross-field rule follows this list).

  2. Then, business needs evolve. As business rules change (regulations, directives, strategy), the need to update the code each time quickly becomes a problem. Over time, this becomes extremely expensive, all the more so as an IT department has a host of other tasks and does not always have the resources available to respond to immediate business needs.

  3. Last but not least, business users don't understand the code, so they have no way of upgrading the solution. They are therefore totally dependent on an IT department that is often overwhelmed by its other tasks.
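
To illustrate the cross-field rule from point 1, here is a minimal sketch. The field names C1/C2/C3 come from the example above, while the values are placeholders that only a business expert could supply:

    def check_cross_field_rule(record: dict) -> bool:
        """If field C1 equals V1 and field C2 equals V2, then field C3
        must be >= V3 (placeholder values, supplied by a business expert)."""
        V1, V2, V3 = "retail", "EU", 100
        if record["C1"] == V1 and record["C2"] == V2:
            return record["C3"] >= V3
        return True  # the rule does not apply to this record

    print(check_cross_field_rule({"C1": "retail", "C2": "EU", "C3": 40}))  # False
    print(check_cross_field_rule({"C1": "b2b", "C2": "EU", "C3": 40}))     # True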

At Tale of Data, we believe that a data quality solution must be usable by both the business and IT departments. That's why choosing the right solution is fundamental.

Data quality solutions: how to choose the right data quality tool?


Today, most companies want to become "data-driven".


Conversely, none can achieve this without first being "Data-Quality-Driven". We invite you to read our customer testimonial on implementing a data-driven strategy with Tale of Data.


New-generation platforms now make this possible in far less time and at far lower cost than just a few years ago.

A data quality platform should offer, as a minimum, the following functionalities:

  • connect to all your data sources: databases, files, CRM, ERP,

  • automatically discover/audit the data available within the company,

  • automatically detect anomalies (with, among others, fuzzy logic for textual data),

  • offer powerful data adjustment, standardization and deduplication capabilities,

  • allow business users to add custom control and validation rules,

  • enable IT and business departments to work together, because when it comes to data quality, they can't succeed without each other,

  • automate and plan processing chains: detection / correction / quality maintenance,

  • alert in real time when anomalies are detected: data inevitably degrades over time.
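
As a final illustration of the last two points, here is a minimal sketch of what an automated, alert-raising check chain might look like. The check names and the print-based alert channel are placeholder assumptions; a real platform would schedule this and route alerts to email, chat or monitoring tools.

    import pandas as pd

    def alert(message: str) -> None:
        # Stand-in for a real channel (email, chat, monitoring tool).
        print(f"[ALERT] {message}")

    def run_quality_checks(df: pd.DataFrame, checks: dict) -> list:
        """Run each named check; alert on any failure."""
        failures = [name for name, check in checks.items() if not check(df)]
        if failures:
            alert(f"data quality checks failed: {failures}")
        return failures

    # Hypothetical rules; in production this would run on a schedule.
    checks = {
        "no_missing_emails": lambda d: d["email"].notna().all(),
        "not_empty": lambda d: len(d) > 0,
    }
    df = pd.DataFrame({"email": ["a@example.com", None]})
    run_quality_checks(df, checks)  # -> [ALERT] ... ['no_missing_emails']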

👉 Data quality is not a "one-shot" operation, but a sustainable policy to be implemented over the long term.