Tale of Data
Found a word you don't understand?
We realize that not everyone is fluent in data speak.
Find below the definitions of the words followed by *.
Matching/fuzzy matching algorithm: an algorithmic procedure based on an approximate match between two inputs, rather than on an exact match between them. In practice, Tale of Data provides a range of algorithms to deal with, for example, the peculiarities of French and English phonetics. Other approaches are also offered, e.g. weighting consonants more heavily or using tried and tested mathematical procedures such as Levenshtein distance*.
API (application programming interface): a software interface for connecting software or a service to other software or another service so that data and functionalities can be exchanged.
Relational database: in computer science, a database in which information is organized into two-dimensional tables called relations. In this relational model, a database is made up of one or more relations (source: Wikipedia).
BAN (French national address database): this database contains all official French addresses. It is an ‘open’ database because it can be accessed and used freely by private and public users.
BCBS 239: a banking standard intended to strengthen banks' ability to aggregate risk data, report on it and improve its quality.
Churn: the loss of clients and subscribers. The term is used mainly in telecoms and banking. Churn measures the average period of subscription to a particular offer or service (subscription to a TV sports package, magazine, newspaper, etc.). It is one of the main indicators of customer satisfaction (source: Journal du Net).
Cluster: an operating method that applies across several servers, making it possible to process a large amount of data simultaneously.
IRIS code: IRIS (aggregated units for statistical information) are standard-sized geographical units devised by INSEE. Each basic unit is designed to contain about 2,000 inhabitants.
Connector: a method of connecting to a particular type of data source (e.g. a SQL server database or a file server such as Azure Blob Storage) -> see the Architecture section.
Core banking legacy: A legacy system is an IT system (such as an ERP) that still meets requirements, but can no longer evolve. The organization still relies on this system, but may be limited by its inability to interact with the latest analytical tools, such as those hosted on the cloud.
Crowd sourcing: an organization system that uses the contributions of many different people to enrich and improve content. For example, the content of the Wikipedia encyclopedia is enriched by a very large number of contributors.
Data driven: an adjective describing a company that relies on the analysis of its data, rather than on intuition, to make decisions and guide its evolution.
Dataviz: a way of communicating raw figures and data by transforming them into easy-to-read visual objects, e.g. points, bars, curves and maps.
Note that the new version of Tale of Data will contain a DataViz module.
It will be accessible to all users of the solution and to those who only want to use this module.
Data scientist: a person who collects, processes and analyzes data and makes it talk in order to improve business performance.
Deduplication: a way of eliminating duplicates.
Levenshtein distance: a string metric for measuring the difference between two sequences. It is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other (source: Wikipedia).
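The definition above can be illustrated with a short Python sketch. It uses the classic dynamic-programming approach; the function name levenshtein is chosen for this illustration only and is not taken from Tale of Data.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    # previous[j] = distance between the prefix of a processed so far
    # (initially empty) and the first j characters of b
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters of a into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

A small distance relative to the length of the words suggests that two values are probably variants of the same entry.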
PI (plant information) data: this is generated on industrial sites from production site sensors and is sent to a storage system.
Data to be enriched: this is the data set you have (for example, the list of prospects in your CRM), to which you want to add missing information in the form of new columns (for example, the number of employees in the company).
Enrichment data: this is a reference dataset, either internal (e.g. available in your MDM tool) or external (e.g. the SIRENE database), that contains the additional information you need to increase your analysis capacity.
Record: rows (as opposed to columns) in a database or file.
Data enrichment: the process of completing, improving and structuring data by drawing on another source (repository, base file, etc.).
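As a rough illustration of enrichment, the Python sketch below adds a new column from a reference dataset to a list of prospects. All names and values (prospects, reference, company_id, employees) are hypothetical and exist only for this example.

```python
# Data to be enriched: a hypothetical list of prospects
prospects = [
    {"company_id": "A1", "name": "Acme"},
    {"company_id": "B2", "name": "Globex"},
]

# Enrichment data: a hypothetical reference dataset keyed by company_id
reference = {"A1": {"employees": 120}}


def enrich(rows, reference, key, new_columns):
    """Return copies of rows with new_columns filled from reference.
    Rows with no match in the reference get None in the new columns."""
    enriched = []
    for row in rows:
        extra = reference.get(row[key], {})
        added = {column: extra.get(column) for column in new_columns}
        enriched.append({**row, **added})
    return enriched


result = enrich(prospects, reference, "company_id", ["employees"])
for row in result:
    print(row)
```

The original rows are left untouched, so the enrichment can be rerun when the reference data changes.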
Flow: User-constructed processing that performs remediation, preparation, and data monitoring tasks. A flow is by construction designed for production.
Geolocation: a technology that locates objects and people with reasonable accuracy (source: CNIL).
Artificial intelligence (AI): a set of techniques that enables computers to simulate and reproduce human intelligence.
Fuzzy join: the assembly of several sources by matching them using fuzzy matching algorithms.
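A fuzzy join can be sketched in Python with the standard library's difflib module. This is only a stand-in: the similarity measure used by difflib is not one of the algorithms offered by Tale of Data, and the function name fuzzy_join and the 0.8 cutoff are assumptions made for the example.

```python
import difflib


def fuzzy_join(left, right, cutoff=0.8):
    """Pair each value in left with its closest value in right,
    or with None when nothing is similar enough."""
    # Compare in lower case, but return the original spelling
    lowered = {value.lower(): value for value in right}
    pairs = []
    for value in left:
        matches = difflib.get_close_matches(
            value.lower(), list(lowered), n=1, cutoff=cutoff
        )
        pairs.append((value, lowered[matches[0]] if matches else None))
    return pairs


crm_names = ["Jon Smith", "Ana Lopez", "Zed"]
reference_names = ["John Smith", "Anna Lopez", "Paul Martin"]
print(fuzzy_join(crm_names, reference_names))
```

Raising the cutoff reduces false matches at the cost of missing more distant variants.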
Full-text join: joining multiple sources by performing a deep search across all specified text data. This makes it possible, for example, to discover links between records in two tables that differ only in word order. A conventional algorithm cannot detect this type of correspondence, whereas it may be obvious to a human operator and to a full-text join algorithm.
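The word-order case mentioned above can be illustrated with a deliberately simplified Python sketch. It matches records whose sets of words are identical regardless of order and case; a real full-text search is far richer, and the names token_key and word_order_join are hypothetical.

```python
def token_key(text):
    """Order-insensitive key: the set of lower-cased words in text."""
    return frozenset(text.lower().split())


def word_order_join(left, right):
    """Pair records from left and right that contain the same words,
    whatever their order."""
    index = {}
    for record in right:
        index.setdefault(token_key(record), []).append(record)
    return [
        (l_record, r_record)
        for l_record in left
        for r_record in index.get(token_key(l_record), [])
    ]


table_a = ["Jean Dupont", "Marie Curie"]
table_b = ["DUPONT Jean", "Pierre Curie"]
print(word_order_join(table_a, table_b))
```

An exact string comparison would find no match between these two tables, because "Jean Dupont" and "DUPONT Jean" differ character by character.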
Natural language: means that the user does not need to know any computer languages to use the solution. The functions are all usable via explicit menus.
Machine learning: an approach that involves letting algorithms discover patterns in a dataset. Once this training is done, the algorithm is able to find the same patterns in a new dataset.
Mass data discovery: a process for exploring an IT system to discover and map all the data in it. A map of stored sensitive data (e.g. personal data) can then be produced along with a report analyzing the quality of the stored data.
Metadata: data used to characterize other physical or digital data (source: Larousse). Metadata can be used to describe other data, e.g. size of a file, creation date, modification date, etc.
N-gram or N-Gram: a method used in Tale of Data to evaluate the similarity between several words or between several sentences. More generally, it is a succession of N elements of the same type extracted from a text, a sequence or a signal; the elements can notably be words or letters (source: Wiktionary).
Open data: this is data to which access, exploitation and re-use is totally public and free of rights. The French BAN and SIRENE databases are examples of open data.
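One common way to turn N-grams into a similarity score is to compare the sets of character N-grams of two words. The Python sketch below uses bigrams (N = 2) and the Jaccard index; this is an assumption made for illustration, since the exact scoring used in Tale of Data is not described here.

```python
def char_ngrams(text, n=2):
    """Set of all runs of n consecutive characters in text (lower-cased)."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a, b, n=2):
    """Jaccard index of the two N-gram sets: shared N-grams divided by
    all distinct N-grams. 1.0 means identical sets, 0.0 means none shared."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two strings too short to have any N-gram
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(sorted(char_ngrams("data")))
print(ngram_similarity("night", "nacht"))
```

The same functions work on sentences; splitting on words instead of characters would give word-level N-grams.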
Pattern: A user-defined pattern that can be searched in the data, or used as part of its transformation.
Phonetics/phonetic algorithm/phonetic analysis: the matching of terms by sound, e.g. a search for similarity among surnames based on the sound [o], which in French can be written as o, ô, au or eau.
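The [o] example can be sketched as a toy phonetic key in Python. It handles only the four spellings listed above and ignores everything else (silent letters, other sounds), so it is far from a complete phonetic algorithm; the function name french_o_key is hypothetical.

```python
def french_o_key(name):
    """Toy phonetic key: rewrite the spellings 'eau', 'au' and 'ô' as 'o'
    so that names differing only in how [o] is written get the same key."""
    key = name.lower()
    # Longest spelling first, so 'eau' is not left as 'e' + 'o'
    for spelling in ("eau", "au", "ô"):
        key = key.replace(spelling, "o")
    return key


for surname in ["Moreau", "Morau", "Môro", "Moro", "Martin"]:
    print(surname, "->", french_o_key(surname))
```

Names that share a key can then be treated as candidate matches and checked with a stricter comparison.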
Data preparation: the step before data analysis. It consists of several tasks, such as data cleaning and enrichment. Raw data is processed to make it reliable and therefore usable.
Data preparation is a key step: an analysis is only valid if the data behind it is, and reliable data is also required for data control.
Record Lineage: a representation proposed by Tale of Data that shows, for a particular dataset, the list and chaining structure of the data used to feed this dataset (the "upstream streams"), as well as all the datasets and chaining that depend on the selected dataset (the "downstream streams"). This visualization mode makes it possible to understand the origin of the data (= upstream view) and to establish the impact of a change in the data concerned on the other datasets that depend on it (= downstream view).
Data reconciliation: a process related to the homogenization of data and to its grouping according to its nature or source.
Adjustment: the phase during which the "raw" data is analyzed for correction. It is one of the actions of data preparation.
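Lineage can be pictured as a directed graph in which each dataset points to the datasets it feeds. The Python sketch below walks that graph in both directions, taking "upstream" to mean the origin of the data and "downstream" to mean the datasets impacted by a change. The dataset names and the feeds structure are hypothetical and do not reflect how Tale of Data stores lineage.

```python
# Hypothetical lineage: each dataset maps to the datasets it feeds
feeds = {
    "crm_export": ["clean_customers"],
    "sirene_extract": ["clean_customers"],
    "clean_customers": ["sales_report", "churn_model"],
}


def downstream(dataset, feeds):
    """All datasets that depend, directly or indirectly, on dataset."""
    seen, stack = set(), [dataset]
    while stack:
        for child in feeds.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def upstream(dataset, feeds):
    """All datasets that feed, directly or indirectly, into dataset."""
    # Reverse every edge, then reuse the same traversal
    fed_by = {}
    for source, targets in feeds.items():
        for target in targets:
            fed_by.setdefault(target, []).append(source)
    return downstream(dataset, fed_by)


print(sorted(upstream("clean_customers", feeds)))
print(sorted(downstream("clean_customers", feeds)))
```

The downstream set answers the impact question (what is affected if this dataset changes), and the upstream set answers the origin question (where this data comes from).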
Repository: a list of items that make up a reference system. Example: a product repository is a list of all the products with a certain number of attributes.
Management rules: guidelines that govern the activities of an organization or system. They are intended to ensure consistency and compliance of operations, minimize the risk of error or fraud and improve the quality of products or services.
Business rules: a set of data transformation operations defined by the Tale of Data user without writing any code, i.e. by using an intuitive interface that allows conditions (which can be as complex as necessary) to be set for each operation. Tale of Data lets you generate an easy-to-read summary of defined rules and reuse them in other Flows* and other data transformations.
SaaS or Software as a Service: a system for providing software as a service, accessible via an Internet browser.
Time series: a time-linked data series. A country’s GDP and changes in its population are time series.
Script: a computer program that when executed allows an action to be performed or a web page to be displayed.
Shadow IT: all data and processing that take place on the fringes of the IT department (e.g. unofficial MS Access databases, Excel files with macros, etc.). This data and software is invisible to the IT department, which creates security and non-compliance risks (e.g. with the GDPR).