Data Generalization in Data Mining: 5 essential principles

Data is now one of the most essential components of success for any modern organization. When information is specific to a small number of individuals, it becomes sensitive. At the same time, data professionals use data mining to reveal hidden information in a company's databases.

Data mining relies on information shared by at least a minimum number of people, so that the discovered patterns are statistically significant. Sensitive, individual-level detail is therefore discarded in favor of reliable mining. This article introduces data generalization in data mining as a way of masking private, detailed information while still discovering useful patterns.

Data Mining

In technical terms, data mining is the process of extracting data from a larger data set to discover patterns and generate rules. It is regarded as a discipline within data science, and it is distinguished from predictive analytics in that it describes historical data, whereas predictive analytics aims to predict future outcomes. Databases store large amounts of data in precise detail; nonetheless, users often prefer to view summarized data sets in descriptive terms.

From the point of view of data analysis, data mining can be classified into the following two categories.

Descriptive Mining

This category of data mining describes task-relevant data sets in a summarized, concise, and informative form. It also presents the interesting general characteristics of the data.

Predictive Mining

This category analyzes the data in order to construct a set of models for the database and to predict the behavior and properties of new, unknown data sets.

Taking everything into consideration, data mining is very beneficial because it can summarize and present large data sets at a higher conceptual level. This requires data generalization, which is a vital functionality.

Data Generalization in Data Mining

Data generalization is the process of summarizing the general features of objects in a certain class and producing characteristic rules. In this process, users rely on concept hierarchies to convert low-level attribute values into high-level ones. For instance, age may be stored in the dataset as numeric values such as 20 and 40; generalization transforms these into values at a higher conceptual level, such as young and old, which are categorical.
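The age example can be sketched in a few lines of Python. This is only an illustration: the function name generalize_age and the cut-off of 40 are assumptions made for the example, not values given by any standard concept hierarchy.

```python
# Minimal sketch of a one-level concept hierarchy for age.
# The cut-off (40) is an illustrative assumption, not a standard value.
def generalize_age(age: int, cutoff: int = 40) -> str:
    """Map a numeric age (low level) to a categorical concept (high level)."""
    return "young" if age < cutoff else "old"


ages = [20, 40, 33, 58]
generalized = [generalize_age(a) for a in ages]
print(generalized)
```

A real hierarchy would usually have more than one level (for example age, then age range, then age group), with the same idea applied at each step.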

This transformation from a lower conceptual level to a higher one is very useful for getting a clearer picture of the data. Moreover, because data generalization replaces a data value with a less specific one, it helps protect individuals against re-identification attacks and the unintentional disclosure of their private information, while preserving the utility of the data.

Furthermore, the data associated with a user-specified class can be retrieved by a database query and run through a summarization module that extracts the essence of the data at different levels of abstraction. There are two main types of generalization: automated and declarative. Automated generalization blurs values until a specified target value is reached.

This type is often the better choice for companies because it balances privacy and accuracy: an algorithm applies only the minimum amount of distortion required to attain the stated value. Declarative generalization, on the other hand, lets users specify the bin sizes up front; however, this technique may distort the data and introduce bias. Data generalization is also very helpful in Online Analytical Processing (OLAP), a technology for answering multi-dimensional analytical queries quickly.
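The difference between the two types can be sketched as follows. The function names are hypothetical, and the stopping rule used for the automated variant (every bin must contain at least min_group values) is an assumption chosen for illustration; the article does not specify which target value the algorithm aims for.

```python
from collections import Counter


def declarative_bins(values, bin_size):
    """Declarative: the user fixes the bin size up front."""
    binned = []
    for v in values:
        low = (v // bin_size) * bin_size
        binned.append((low, low + bin_size - 1))
    return binned


def automated_bins(values, min_group):
    """Automated (assumed rule): widen bins only as far as needed
    so that every bin holds at least `min_group` values."""
    if any(v < 0 for v in values):
        raise ValueError("this sketch assumes non-negative integers")
    if len(values) < min_group:
        raise ValueError("not enough values to form a single group")
    size = 1
    while True:
        binned = declarative_bins(values, size)
        if min(Counter(binned).values()) >= min_group:
            return size, binned
        size *= 2  # blur a little more and try again


ages = [21, 23, 24, 37, 38, 52, 55, 56]
fixed = declarative_bins(ages, 10)
size, blurred = automated_bins(ages, min_group=2)
print(fixed)
print(size, blurred)
```

With a fixed bin size of 10, the value 37 lands in (30, 39) regardless of how many other records share that bin. The automated variant instead stops at the first bin width that satisfies the target, which is the sense in which it applies the minimum required distortion.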

Approaches of Data Generalization

There are two approaches to data generalization, each discussed in detail below.

Data cube approach (OLAP approach)

It can be regarded as a data-warehouse-based, materialized approach. It is precomputation-oriented: computations are carried out in advance and the results are stored in data cubes.

Strengths

Efficient implementation of data generalization.
Computation of many types of measures, for instance sum, count, max, and average.
Generalization and specialization can be carried out on a data cube by roll-up and drill-down, respectively.
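A minimal sketch of the idea, using plain Python dictionaries rather than a real OLAP engine: aggregates are precomputed for several levels of detail, so that roll-up and drill-down become simple lookups. The fact table, the dimension names, and the helper build_cuboid are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact table: (city, country, quarter, sales)
facts = [
    ("Paris", "France", "Q1", 100),
    ("Lyon", "France", "Q1", 50),
    ("Paris", "France", "Q2", 70),
    ("Berlin", "Germany", "Q1", 80),
    ("Berlin", "Germany", "Q2", 90),
]


def build_cuboid(rows, dims):
    """Precompute sum and count of sales grouped by the chosen dimensions."""
    cells = defaultdict(lambda: {"sum": 0, "count": 0})
    for city, country, quarter, sales in rows:
        record = {"city": city, "country": country, "quarter": quarter}
        key = tuple(record[d] for d in dims)
        cells[key]["sum"] += sales
        cells[key]["count"] += 1
    return dict(cells)


# Materialize the cuboids once; later queries only read from them.
levels = [("city", "quarter"), ("country", "quarter"), ("country",)]
cube = {dims: build_cuboid(facts, dims) for dims in levels}

# Roll-up: move from the city level to the more general country level.
france_q1 = cube[("country", "quarter")][("France", "Q1")]
# Drill-down: go back to the more specific city level.
paris_q1 = cube[("city", "quarter")][("Paris", "Q1")]
# Average is derived from the stored sum and count.
germany = cube[("country",)][("Germany",)]
germany_avg = germany["sum"] / germany["count"]

print(france_q1, paris_q1, germany_avg)
```

Storing sum and count rather than the average itself is a deliberate choice in this sketch: both can be combined across cells, whereas averages cannot be merged directly.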

Limitations

It handles only dimensions of simple non-numeric data and measures of simple aggregated numeric values.
It lacks intelligent analysis: it cannot tell users which dimensions should be used or what level the generalization should reach.

Attribute-oriented induction approach

This approach is relational-database-query-oriented and was first proposed in 1989 (KDD '89 workshop). It is a generalization-based, online data analysis technique that is not confined to categorical data or to particular measures.

The Basic Principles of Attribute-Oriented Induction

Data focusing: collect the task-relevant data, including the dimensions, by using a relational database query; the result is the initial relation.
Attribute-removal: remove attribute A if it has a large set of distinct values and either there is no generalization operator on A or A's higher-level concepts are expressed in terms of other attributes.
Attribute-generalization: if A has a large set of distinct values and there exists a set of generalization operators on A, select an operator and generalize A.
Attribute-threshold control: limit the number of distinct values an attribute may keep after generalization; the threshold is typically 2 to 8 and is either user-specified or a default.
Generalized relation threshold control: control the size of the final relation or rule set; identical generalized tuples are merged and their counts are accumulated for presentation.
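The principles above can be sketched as a single pass over a small relation. This is a simplified illustration under stated assumptions, not the full algorithm: the relation, the generalization operators, and the function name are hypothetical, each attribute is generalized only once rather than climbing a hierarchy repeatedly, and the size of the final relation is not checked against a relation threshold.

```python
from collections import Counter

# Hypothetical initial relation, i.e. the result of data focusing.
initial_relation = [
    {"name": "Ann", "major": "physics", "age": 21},
    {"name": "Bob", "major": "chemistry", "age": 23},
    {"name": "Cid", "major": "history", "age": 22},
    {"name": "Dee", "major": "physics", "age": 34},
    {"name": "Eli", "major": "literature", "age": 36},
]

# Hypothetical generalization operators, one per generalizable attribute.
operators = {
    "major": lambda m: "science" if m in {"physics", "chemistry"} else "arts",
    "age": lambda a: "20-29" if a < 30 else "30-39",
}


def attribute_oriented_induction(rows, operators, threshold):
    """Return the kept attributes and the merged generalized tuples with counts."""
    plan = []  # (attribute, operator or None)
    for attr in rows[0]:
        distinct = {row[attr] for row in rows}
        if len(distinct) <= threshold:
            plan.append((attr, None))             # few values: keep as is
        elif attr in operators:
            plan.append((attr, operators[attr]))  # attribute generalization
        # otherwise: attribute removal (many values, no operator)
    generalized = [
        tuple(op(row[attr]) if op else row[attr] for attr, op in plan)
        for row in rows
    ]
    # Merge identical generalized tuples and accumulate their counts.
    return [attr for attr, _ in plan], Counter(generalized)


attrs, relation = attribute_oriented_induction(initial_relation, operators, threshold=2)
print(attrs)
for values, count in relation.items():
    print(values, count)
```

Here name is removed because it has many distinct values and no operator, while major and age are generalized and the resulting identical tuples are merged.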

Presentation of Generalized Results

To present generalized data, users can choose among the following forms:

Generalized Relation

These are relations in which some or all attributes are generalized, with counts or other aggregation values accumulated.

Cross-Tabulation

The results are mapped into a cross-tabulation (similar to a contingency table).
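As a rough sketch, a generalized relation with two attributes and a count can be pivoted into a cross-tabulation with row and column totals. The data and the helper cross_tabulate are hypothetical.

```python
# Hypothetical generalized relation: (major, age_group) -> count
generalized_relation = {
    ("science", "20-29"): 2,
    ("science", "30-39"): 1,
    ("arts", "20-29"): 1,
    ("arts", "30-39"): 1,
}


def cross_tabulate(relation):
    """Pivot (row, column) -> count pairs into a table with totals."""
    rows = sorted({r for r, _ in relation})
    cols = sorted({c for _, c in relation})
    table = {r: {c: relation.get((r, c), 0) for c in cols} for r in rows}
    for r in rows:
        table[r]["total"] = sum(table[r][c] for c in cols)
    table["total"] = {c: sum(table[r][c] for r in rows) for c in cols + ["total"]}
    return table


table = cross_tabulate(generalized_relation)
for label, cells in table.items():
    print(label, cells)
```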

Visualization Techniques

This involves bar charts, pie charts, cubes, curves, and other visuals.

Quantitative characteristic rules

Generalized results are mapped into characteristic rules that carry the quantitative information associated with them.
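One possible way to attach quantitative information to a rule is to report, for each generalized tuple, its share of all tuples in the target class. The sketch below assumes that interpretation; the data, the attribute names, and the rule format are illustrative only.

```python
# Hypothetical generalized relation for a target class of students.
attributes = ["major", "age"]
generalized_relation = {
    ("science", "20-29"): 2,
    ("science", "30-39"): 1,
    ("arts", "20-29"): 1,
    ("arts", "30-39"): 1,
}


def tuple_weights(relation):
    """Share of the target class covered by each generalized tuple."""
    total = sum(relation.values())
    return {values: count / total for values, count in relation.items()}


def quantitative_rule(target_class, attributes, relation):
    """Render the generalized relation as one rule with percentages."""
    parts = []
    for values, weight in tuple_weights(relation).items():
        condition = " AND ".join(f"{a} = {v}" for a, v in zip(attributes, values))
        parts.append(f"({condition}) [{weight:.0%}]")
    return f"{target_class} => " + " OR ".join(parts)


weights = tuple_weights(generalized_relation)
rule = quantitative_rule("student", attributes, generalized_relation)
print(rule)
```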

To sum up, data generalization in data mining is a method that abstracts a large set of related information in a database from a low conceptual level to a higher one. More precisely, it summarizes the general features of objects in a target class and then produces characteristic rules. The modern data warehouse is a rich repository of data from several sources that supports a company's business. Data generalization is therefore an important step in data mining, with advantages for both businesses and individuals.
