Tale of Data glossary

A word you don't understand?

We are aware that not everyone speaks the language of data.

Find below the definitions of the words followed by *.

What if you could finally master the vocabulary of Data Quality?

Data quality is essential, but without a common language, data projects and AI initiatives can quickly become complex. "Le Langage de la Data Quality" was designed so that everyone, from business teams to technical experts, can speak the same language.

Download Le Langage de la Data Quality for free by filling in this form to simplify your understanding, improve collaboration between your teams and optimize your decisions.

Algorithm

Set of operating rules specific to a calculation; sequence of formal rules (source: Le Robert).

Fuzzy matching algorithm

Algorithmic procedure based on an approximate match between two inputs, rather than an exact match. In practice, Tale of Data provides several algorithms that take into account, for example, the specificities of French or English phonetics. Other approaches are also available, such as giving more weight to consonants, or using proven mathematical measures like the Levenshtein distance*.
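As an illustration of approximate matching, here is a minimal sketch using Python's standard difflib module. It is a stand-in technique, not the algorithm used by Tale of Data; the function names and the 0.8 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a score between 0.0 (nothing in common) and 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_match(query: str, candidates: list[str], threshold: float = 0.8):
    """Return the candidate closest to query, or None if none is close enough.

    The threshold is an illustrative assumption, not a recommended setting.
    """
    best, best_score = None, 0.0
    for candidate in candidates:
        score = similarity(query, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None


# "Jon Smith" is not an exact match for any candidate, but it is close to one.
print(fuzzy_match("Jon Smith", ["John Smith", "Jane Doe"]))
```

An exact comparison would reject "Jon Smith"; the approximate score lets the closest candidate through as long as it clears the threshold.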

API or Application Programming Interface

A software interface that "connects" one piece of software or service to another in order to exchange data and functionality.

API Call

Service request made to an API to retrieve or send data between different applications.

BAN - National Address Base

The Base Adresse Nationale is the official address database for France. It is an "open" database, meaning that access and use are free to users, whether private or public.

Relational database

In computing, a relational database is one in which information is organized in two-dimensional tables, also called relations. According to this relational model, a database consists of one or more relations (source: Wikipedia).

BCBS 239

Banking standard designed to increase banks' ability to aggregate financial risk data, produce reports and improve the quality of risk data.

Churn

The loss of customers or subscribers. The term is mainly used by telecom companies and banks. In particular, it is used to measure the average duration of a subscription to an offer or service (subscription to a TV sports package, a magazine, a newspaper, etc.). It is one of the main indicators of customer satisfaction (source: Journal du Net).

Cluster

Distributed mode of operation on several servers, enabling large amounts of data to be processed in parallel.

IRIS code

"Ilots Regroupés pour l'Information Statistique" (IRIS) are homogeneous territorial divisions created by INSEE. Each elementary unit groups together about 2,000 inhabitants.

Connectors

Means of connecting to a data source of a particular type (e.g. a SQL Server database, or an Azure Blob Storage file server, etc.) -> see the Architecture section.

Core banking legacy

A legacy system, or "inherited" system, is an IT system (such as an ERP) that still meets requirements, but can no longer evolve. The organization still relies on this system, but may be limited by its inability to interact with the latest analytical tools, such as those hosted in the cloud.

Crowdsourcing

A form of organization that calls on contributions from a large number of people to enrich and improve content. Wikipedia, for example, is an encyclopedia whose content is enriched by a very large number of contributors.

Data Catalog

A centralized metadata repository for managing, searching and documenting the data available in an organization, making it easier to discover and use.

Data Discovery

A method for exploring the data available in a computer system to discover its structure, content and interrelationships, thus facilitating data understanding and analysis.

Data driven

An adjective meaning "guided by data". It describes a company that relies on data analysis, rather than intuition, to make decisions and guide its development.

Data lake

A centralized storage space for large-scale storage of structured, semi-structured and unstructured data, facilitating subsequent analysis and processing.

Data Lineage

A representation that traces the origin and path of data through different systems and processes, ensuring transparency and facilitating compliance and impact analysis.

Data Observability

Ability to monitor and understand the state of data in a system, using metrics and visualizations to ensure its quality, integrity and performance.

Data Product

A set of organized, ready-to-use data, often combined with tools and interfaces that enable them to be used efficiently to meet specific needs.

Data Quality

A set of processes and techniques designed to ensure that data is accurate, complete, reliable and relevant to its intended use.

Data scientist

A data specialist who collects, processes, analyzes and makes sense of data to improve business performance.

Data Stories

Data-driven storytelling, using visualizations and analytics to communicate information and insights in a clear and engaging way.

Data visualization (dataviz)

A method of communicating figures or raw information by transforming them into easy-to-read visual objects: points, bars, curves, maps. The new version of Tale of Data will include a DataViz module. It will be accessible to all users of the solution, as well as to those wishing to use only this module.

Databases

Organized collections of data, stored and accessible electronically from a computer system, enabling data to be managed, manipulated and queried efficiently.

Deduplication

Method for eliminating duplicates.

Levenshtein distance

Measures the difference between two character strings. It is equal to the minimum number of characters that must be deleted, inserted or replaced to move from one string to the other (source: Wikipedia).
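The definition above can be sketched with the classic dynamic-programming computation. This is a minimal illustration; the function name is chosen for the example and does not refer to any Tale of Data API.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character deletions, insertions or
    replacements needed to turn a into b."""
    # previous[j] is the distance between the prefix of a processed so far
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # replace char_a (or keep it)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Two replacements (k to s, e to i) and one insertion (g) turn "kitten" into "sitting", hence a distance of 3. A distance of 0 means the strings are identical.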

Data to be enriched

This is the dataset in your possession (for example, the list of prospects in your CRM) to which you wish to add missing information in the form of new columns (for example, each company's headcount).

Enrichment data

This is a reference dataset, either internal (e.g. available in your MDM tool) or external (e.g. the SIRENE database), which contains additional information you need to increase your analysis capacity.

PI (Plant Information) data

This data, produced on industrial sites, comes from sensors installed on production sites and sent to a storage system.

Record

A row in a database table or file (as opposed to a column).

Data enrichment

Completing, improving and structuring data by using another source (repository, base file, etc.).
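A minimal sketch of enrichment as a key-based lookup is given below. The field names (siren, headcount) and the sample values are hypothetical and only serve the illustration; real enrichment often needs approximate matching when no clean shared key exists.

```python
def enrich(records, reference, key, new_fields):
    """Add new_fields from the reference dataset to each record sharing the same key.

    Records without a match in the reference receive None for the new fields.
    """
    lookup = {row[key]: row for row in reference}
    enriched = []
    for record in records:
        match = lookup.get(record[key], {})
        extra = {field: match.get(field) for field in new_fields}
        enriched.append({**record, **extra})
    return enriched


# Hypothetical data: a prospect list and a reference dataset.
prospects = [
    {"siren": "123", "name": "Acme"},
    {"siren": "999", "name": "Unknown Co"},
]
reference = [{"siren": "123", "headcount": 250, "city": "Lyon"}]

for row in enrich(prospects, reference, key="siren", new_fields=["headcount"]):
    print(row)
```

The prospect list plays the role of the data to be enriched, and the reference list plays the role of the enrichment data: only the requested column is added.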

Flow

User-constructed processing, enabling remediation, preparation and monitoring of data. A flow is by construction designed for production.

Flow Designer

Tale of Data software environment for developing Flows* to design transformations on data.

Geolocation

Technology used to determine the location of an object or person with a certain degree of accuracy (source: CNIL).

Artificial intelligence

A set of techniques that enable computers to simulate and reproduce human intelligence.

Fuzzy joins

Assembling several sources by matching their records with fuzzy matching algorithms.

Full-text join

Assembling multiple sources by performing an in-depth search of all specified textual data. This makes it possible, for example, to discover links between records in two tables that differ only in word order. A conventional exact-match algorithm cannot detect this type of correspondence, whereas it may be obvious to a human operator and to a full-text join algorithm.
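The word-order case can be illustrated with a toy sketch that compares the set of words in each value rather than the raw strings. This is a deliberately simplified approximation, not Tale of Data's implementation: real full-text search also relies on indexing and relevance scoring. All names are chosen for the example.

```python
def token_key(text: str) -> frozenset:
    """Order-independent key: the set of lowercase words in the text."""
    return frozenset(text.lower().split())


def full_text_join(left: list[str], right: list[str]) -> list[tuple[str, str]]:
    """Pair values from left and right that contain the same words in any order."""
    index = {}
    for value in right:
        index.setdefault(token_key(value), []).append(value)
    return [
        (left_value, right_value)
        for left_value in left
        for right_value in index.get(token_key(left_value), [])
    ]


print(full_text_join(["Dupont Jean", "Marie Curie"], ["Jean Dupont", "Pierre Curie"]))
```

An exact comparison of "Dupont Jean" and "Jean Dupont" fails, while the word-set comparison pairs them.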

Natural language

Means that the user doesn't need to know any computer language to use the solution. All functions are available via self-explanatory menus.

Machine Learning

Machine learning involves letting algorithms discover patterns in the dataset. Once this training has been completed, the algorithm will be able to find the patterns in a new data set.

Mass Data Discovery

A method of exploring a computer system to discover and map all the data present in the system. In particular, this enables an atlas of stored sensitive data (such as personal data) to be drawn up. It also enables the generation of a report analyzing the quality of stored data.

Metadata

Data used to characterize or describe other data, whether physical or digital (source: Larousse). Examples: file size, creation date, modification date, etc.

N-gram or N-Gram

Method used in Tale of Data to evaluate the similarity between several words or sentences. More generally, an N-gram is a sequence of N elements of the same type extracted from a text, a sequence or a signal; the elements can be words or letters (source: Wiktionary).
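As an illustration, the sketch below extracts character N-grams and compares two words by the share of N-grams they have in common (a Jaccard index). The choice of character bigrams and of the Jaccard index is an assumption made for the example; the source does not say which variant Tale of Data uses.

```python
def ngrams(text: str, n: int = 2) -> set[str]:
    """Set of all runs of n consecutive characters in the lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Shared N-grams divided by all distinct N-grams (Jaccard index)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two strings too short to have N-grams: treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(sorted(ngrams("data")))             # ['at', 'da', 'ta']
print(ngram_similarity("night", "nacht"))  # one shared bigram ("ht") out of seven
```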

Open Data

Literally "open data": data to which access is fully public and free of rights, as are its exploitation and reuse. The Base Adresse Nationale and the SIRENE database are examples of open data.

Pattern

A user-defined pattern that can be searched for in data, or used as part of data transformation.
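One common way to express such a pattern is a regular expression. The sketch below is an assumption made for illustration (the source does not say how patterns are written in Tale of Data): it defines a pattern for five-digit postal codes, then uses it both to search and to transform.

```python
import re

# Hypothetical pattern: exactly five digits standing alone as a word.
POSTCODE = re.compile(r"\b\d{5}\b")


def find_postcodes(text: str) -> list[str]:
    """Search: return every postal code found in the text."""
    return POSTCODE.findall(text)


def mask_postcodes(text: str) -> str:
    """Transform: replace every postal code with a placeholder."""
    return POSTCODE.sub("#####", text)


address = "12 rue de la Paix, 75002 Paris"
print(find_postcodes(address))  # ['75002']
print(mask_postcodes(address))  # 12 rue de la Paix, ##### Paris
```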

Phonetics / Phonetic algorithm / Phonetic analysis

Matching terms based on sound identity. Example: search for similarity between surnames with the sound [o], which can be spelled o, ô, au, eau.
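The example above can be sketched as a toy phonetic key that rewrites the spellings of the sound [o] to a single letter. It covers only the four spellings listed in this entry and ignores silent letters and every other sound, so it is an illustration of the principle rather than a usable phonetic algorithm; the function name is hypothetical.

```python
import re
import unicodedata

# Longest spelling first, so that "eau" is not partly consumed as "au".
O_SOUND = re.compile(r"eau|au|ô")


def o_sound_key(name: str) -> str:
    """Lowercase the name and write every listed spelling of [o] as 'o'."""
    # NFC normalization ensures "ô" is a single character before matching.
    normalized = unicodedata.normalize("NFC", name).lower()
    return O_SOUND.sub("o", normalized)


print(o_sound_key("Moreau"))   # moro
print(o_sound_key("Maurice"))  # morice
print(o_sound_key("Moreau") == o_sound_key("Moro"))  # True
```

Two names are considered phonetically equal when their keys are equal, even though their spellings differ.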

Data Preparation

The stage preceding data analysis. This stage comprises a number of tasks, such as data cleansing and data enrichment. Raw data undergo a number of processing steps to make them reliable and therefore usable. Data preparation is the key stage for valid data analysis, leading to data mastery.

Record Lineage

A representation offered by Tale of Data which, for a particular dataset, shows the list and chaining of the datasets and flows used to feed that dataset ("upstream flows"), as well as all the datasets and flows that depend on the selected dataset ("downstream flows"). This visualization mode makes it possible to understand the origin of the data (upstream view) and to establish the impact of a change in the selected dataset on the other datasets that depend on it (downstream view).
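The upstream and downstream views can be pictured as two traversals of a dependency graph. The sketch below is a hypothetical model written for illustration, not Tale of Data's internal representation: the dictionary maps each dataset to the datasets that feed it, and all dataset names are invented.

```python
def upstream(feeds: dict, dataset: str) -> set:
    """All datasets that directly or indirectly feed the given dataset."""
    seen = set()
    stack = list(feeds.get(dataset, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(feeds.get(node, []))
    return seen


def downstream(feeds: dict, dataset: str) -> set:
    """All datasets that directly or indirectly depend on the given dataset."""
    # Invert the edges, then reuse the same traversal.
    consumers = {}
    for target, sources in feeds.items():
        for source in sources:
            consumers.setdefault(source, []).append(target)
    return upstream(consumers, dataset)


# Hypothetical lineage: each key is fed by the datasets in its list.
feeds = {
    "clean_customers": ["crm_export"],
    "enriched_customers": ["clean_customers", "sirene"],
    "report": ["enriched_customers"],
}

print(sorted(upstream(feeds, "enriched_customers")))  # origin of the data
print(sorted(downstream(feeds, "clean_customers")))   # impact of a change
```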

Data reconciliation

Process of homogenizing data, grouping them according to their nature or source.

Data correction

Phase during which the "raw" data is analyzed and corrected. It is one of the actions involved in data preparation.

Repository

List of elements forming a reference system. Example: a product repository is a list of all products, with a certain number of attributes for each product.

Management rules

Guidelines that govern the activities of an organization or system. They are designed to ensure the consistency and conformity of operations, minimize the risk of error or fraud, and improve the quality of products or services.

Business rules

A set of transformation operations on data, defined by the Tale of Data user without writing code, i.e. with an intuitive interface and the ability to specify conditions for each operation, which can be as complex as required. Tale of Data makes it possible to obtain a readable summary of the rules that have been defined, and to reuse them in other Flows* and other data transformation operations.

Remediation

Solving quality problems in data.

Runtime

Tale of Data software environment for executing Flows* to perform transformations on data. Flows* can be triggered directly by the user, or scheduled in a highly flexible way.

Runtime Environment

The software environment in which programs run. This includes the operating system, libraries and tools needed to run applications.

SaaS or Software as a Service

Software delivery model in which the software is provided as a service, accessible via a web browser.

SaaS, On-Premise Single Server or Big Data Cluster

Software deployment models. SaaS (Software as a Service) enables access via the Internet, while On-Premise solutions are installed locally on a single server or on a cluster of servers to manage large quantities of data.

Script

A computer program that executes to perform an action or display a Web page.

Time series

A series of data indexed by time. A country's GDP or population trends are time series.

Shadow IT

All data and processing handled outside the IT department (e.g. unofficial MS Access databases, Excel files with macros, etc.). This data and software are invisible to the IT department, which creates security and non-compliance risks (GDPR).
