Data Mining Process: Models, Process Steps & Challenges

This Tutorial on Data Mining Process Covers Data Mining Models, Steps and Challenges Involved in the Data Extraction Process:

Data Mining Techniques were explained in detail in our previous tutorial in this Complete Data Mining Training for All. Data Mining is a promising field in the world of science and technology.

Data Mining, also known as Knowledge Discovery in Databases (KDD), is the process of discovering useful information from large volumes of data stored in databases and data warehouses. This analysis supports decision-making processes within companies.

Data Mining is carried out using various techniques such as clustering, association, sequential pattern analysis, and decision trees.

Data Mining is a process of discovering interesting patterns and knowledge from large amounts of data. The data sources can include databases, data warehouses, the web, and other information repositories or data that are streamed into the system dynamically.

Why Do Businesses Need Data Extraction?

With the advent of Big Data, data mining has become more prevalent. Big data refers to extremely large sets of data that can be analyzed by computers to reveal patterns, associations, and trends that humans can understand. Big data encompasses information of widely varied types and content.

With data at this scale, simple statistics with manual intervention no longer work. The data mining process fulfills this need and drives the shift from simple data statistics to complex data mining algorithms.

The data mining process extracts relevant information from raw data such as transactions, photos, videos, and flat files, and automatically processes the information to generate reports that businesses can act on.

Thus, the data mining process is crucial for businesses to make better decisions by discovering patterns and trends in data, summarizing the data, and extracting relevant information.

For any business problem, the raw data is examined to build a model that describes the information and produces reports the business can use. Building a model from data sources and data formats is an iterative process, as raw data arrives from many different sources and in many forms.

Data volumes grow day by day, so a newly found data source can change the results.

Below is an outline of the process.

[Image: outline of the data mining process]

Many industries such as manufacturing, marketing, chemical, and aerospace are taking advantage of data mining. Thus the demand for standard and reliable data mining processes has increased drastically.

The important data mining models include:

CRISP-DM is a reliable data mining model consisting of six phases. It is a cyclical process that provides a structured approach to data mining. The six phases can be implemented in any order, but backtracking to previous steps and repeating actions is sometimes required.

The six phases of CRISP-DM include:

#1) Business Understanding: In this step, the goals of the businesses are set and the important factors that will help in achieving the goal are discovered.

#2) Data Understanding: This step collects all the data and populates it in the tool (if a tool is used). The data is listed with its data source, location, how it was acquired, and any issues encountered. Data is visualized and queried to check its completeness.

#3) Data Preparation: This step involves selecting the appropriate data, cleaning it, constructing attributes from the data, and integrating data from multiple databases.

#4) Modeling: In this step, a data mining technique such as a decision tree is selected, a test design for evaluating the chosen model is generated, models are built from the dataset, and the built models are assessed with experts to discuss the results.

#5) Evaluation: This step will determine the degree to which the resulting model meets the business requirements. Evaluation can be done by testing the model on real applications. The model is reviewed for any mistakes or steps that should be repeated.

#6) Deployment: In this step, a deployment plan is made, a strategy to monitor and maintain the data mining model's results is formed to check their usefulness, final reports are produced, and the whole process is reviewed to catch mistakes and see whether any step should be repeated.


SEMMA is another data mining methodology, developed by the SAS Institute. The acronym SEMMA stands for Sample, Explore, Modify, Model, Assess.

SEMMA makes it easy to apply exploratory statistical and visualization techniques, select and transform the most significant predictive variables, create a model from those variables, and check its accuracy. SEMMA is also a highly iterative cycle.

Steps in SEMMA

Both the SEMMA and CRISP-DM approaches support the Knowledge Discovery Process. Once models are built, they are deployed for business and research work.

The data mining process is divided into two parts i.e. Data Preprocessing and Data Mining. Data Preprocessing involves data cleaning, data integration, data reduction, and data transformation. The data mining part performs data mining, pattern evaluation and knowledge representation of data.


Why do we preprocess the data?

Many factors determine the usefulness of data, such as accuracy, completeness, consistency, and timeliness. Data is of high quality if it satisfies its intended purpose. Thus preprocessing is crucial in the data mining process. The major steps involved in data preprocessing are explained below.

Data cleaning is the first step in data mining. It is important because dirty data, if used directly in mining, can confuse procedures and produce inaccurate results.

Basically, this step involves the removal of noisy or incomplete data from the collection. Many methods that clean data automatically are available, but they are not robust.

This step carries out the routine cleaning work by:

(i) Fill The Missing Data:

Missing data can be filled by methods such as:
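As one illustration, a widely used method is mean imputation, where each missing value is replaced by the mean of the observed values for that attribute. A minimal sketch, assuming a hypothetical "age" column:

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    column_mean = sum(observed) / len(observed)
    return [column_mean if v is None else v for v in values]

# Hypothetical "age" column with two missing entries.
ages = [25, None, 31, 40, None, 28]
print(fill_missing_with_mean(ages))  # missing entries become 31.0
```

Other options include filling with a constant, the attribute median, or the most probable value predicted by a model.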

(ii) Remove The Noisy Data: Data containing random error is called noisy data.

Methods to remove noise are:

Binning: Binning methods sort values into buckets or bins. Smoothing is then performed by consulting the neighboring values.

Smoothing can be done by bin means, where each value in a bin is replaced by the mean of the bin; by bin medians, where each bin value is replaced by the bin median; or by bin boundaries, where the minimum and maximum values in the bin are taken as the bin boundaries and each bin value is replaced by the closest boundary value.
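The bin-mean and bin-boundary variants described above can be sketched as follows, using equal-depth bins over hypothetical sorted values:

```python
def smooth_by_bin_means(values, bin_size):
    """Sort values into equal-depth bins and replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        bin_vals = ordered[i:i + bin_size]
        bin_mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([bin_mean] * len(bin_vals))
    return smoothed

def smooth_by_bin_boundaries(values, bin_size):
    """Replace each value by the closer of its bin's minimum or maximum."""
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        bin_vals = ordered[i:i + bin_size]
        lo, hi = bin_vals[0], bin_vals[-1]
        smoothed.extend([lo if v - lo <= hi - v else hi for v in bin_vals])
    return smoothed

data = [4, 8, 9, 15, 21, 21, 24, 25, 26]
print(smooth_by_bin_means(data, 3))       # [7.0, 7.0, 7.0, 19.0, 19.0, 19.0, 25.0, 25.0, 25.0]
print(smooth_by_bin_boundaries(data, 3))  # [4, 9, 9, 15, 21, 21, 24, 24, 26]
```

Smoothing by bin medians works the same way, with the bin median substituted for the mean.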

When multiple heterogeneous data sources such as databases, data cubes or files are combined for analysis, this process is called data integration. This can help in improving the accuracy and speed of the data mining process.

Different databases use different naming conventions for variables, causing redundancies in the integrated data. Additional data cleaning can be performed to remove these redundancies and inconsistencies without affecting the reliability of the data.

Data Integration can be performed using data migration tools such as Oracle Data Service Integrator, Microsoft SQL, etc.
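At its core, integration joins records from heterogeneous sources on a shared key while unifying the naming conventions. A minimal sketch, with two hypothetical sources that name the customer key differently:

```python
# Records from two hypothetical sources that name the customer key differently.
crm_records = [{"cust_id": 1, "name": "Ana"}, {"cust_id": 2, "name": "Ben"}]
billing_records = [{"customer": 1, "balance": 120.0}, {"customer": 2, "balance": 75.5}]

def integrate(crm, billing):
    """Join the two sources on the customer key, unifying field names."""
    balances = {r["customer"]: r["balance"] for r in billing}
    return [
        {"customer_id": r["cust_id"], "name": r["name"],
         "balance": balances.get(r["cust_id"])}
        for r in crm
    ]

print(integrate(crm_records, billing_records))
```

Real migration tools add schema matching, conflict resolution, and redundancy detection on top of this basic join.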

This technique is applied to obtain a relevant subset of the data collection for analysis. The reduced representation is much smaller in volume while maintaining the integrity of the original data. Data reduction is performed using methods such as Naive Bayes, decision trees, neural networks, etc.

Some strategies of data reduction are:
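As one illustration, random sampling is a simple numerosity-reduction strategy: analysis runs on a representative sample instead of the full collection. A minimal sketch with hypothetical transaction records:

```python
import random

def sample_reduce(records, fraction, seed=42):
    """Numerosity reduction: keep a simple random sample of the records."""
    rng = random.Random(seed)          # fixed seed so the reduction is repeatable
    k = max(1, int(len(records) * fraction))
    return rng.sample(records, k)

transactions = list(range(10_000))     # stand-in for 10,000 raw transaction records
reduced = sample_reduce(transactions, 0.01)
print(len(reduced))  # 100 -- one percent of the original volume
```

Other reduction strategies, such as dimensionality reduction or data cube aggregation, shrink the number of attributes or pre-aggregate the measures instead of the number of records.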

In this process, data is transformed into a form suitable for the data mining process. Data is consolidated so that the mining process is more efficient and the patterns are easier to understand. Data transformation involves data mapping and code generation.

Strategies for data transformation are:
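One common transformation is min-max normalization, which rescales an attribute linearly into a target range such as [0, 1] so that attributes with large ranges do not dominate the mining step. A minimal sketch with hypothetical income values:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

# Hypothetical income attribute rescaled into [0, 1].
print(min_max_normalize([12_000, 35_000, 58_000, 98_000]))
```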

Data Mining is a process to identify interesting patterns and knowledge from a large amount of data. In this step, intelligent methods are applied to extract the data patterns. The data is represented in the form of patterns, and models are structured using classification and clustering techniques.
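As a sketch of the clustering side, a minimal one-dimensional k-means (used here purely for illustration) assigns each value to its nearest centroid and re-averages until the centroids settle:

```python
def kmeans_1d(values, centroids, iterations=10):
    """Minimal 1-D k-means: assign points to the nearest centroid, then re-average."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two hypothetical groups of readings, around 1.0 and around 8.0.
centers, groups = kmeans_1d([1.0, 1.2, 0.8, 8.0, 8.5, 7.9], [0.0, 10.0])
print(centers)  # roughly [1.0, 8.13]
```

Classification works analogously but learns from labeled examples rather than discovering the groups itself.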

This step involves identifying interesting patterns representing the knowledge based on interestingness measures. Data summarization and visualization methods are used to make the data understandable by the user.

Knowledge representation is a step where data visualization and knowledge representation tools are used to represent the mined data. Data is visualized in the form of reports, tables, etc.

An RDBMS represents data in the form of tables with rows and columns. Data can be accessed by writing database queries.
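For example, Python's standard sqlite3 module can query an in-memory relational table with SQL (the sales table below is hypothetical):

```python
import sqlite3

# Build a small in-memory relational table and query it with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 100.0), ("south", 250.0), ("north", 50.0)])
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 150.0), ('south', 250.0)]
conn.close()
```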

Relational database management systems such as Oracle support data mining using CRISP-DM. The facilities of the Oracle database are useful in data preparation and understanding. Oracle supports data mining through a Java interface, a PL/SQL interface, automated data mining, SQL functions, and graphical user interfaces.

A data warehouse is modeled on a multidimensional data structure called a data cube. Each cell in a data cube stores the value of some aggregate measure.

Data mining in multidimensional space is carried out in OLAP (Online Analytical Processing) style, which allows the exploration of multiple combinations of dimensions at varying levels of granularity.
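A roll-up over a tiny data cube can be sketched as follows, aggregating a sales measure over chosen dimensions (the region/quarter facts are hypothetical):

```python
from collections import defaultdict

# Fact records with two dimensions (region, quarter) and one measure (sales).
facts = [
    ("north", "Q1", 100), ("north", "Q2", 120),
    ("south", "Q1", 80),  ("south", "Q2", 90),
]

def cube_aggregate(records, dims):
    """Sum the sales measure over the kept dimensions (a roll-up when dims are dropped)."""
    cells = defaultdict(int)
    for region, quarter, sales in records:
        key = tuple(val for keep, val in zip(dims, (region, quarter)) if keep)
        cells[key] += sales
    return dict(cells)

print(cube_aggregate(facts, (True, True)))    # full (region, quarter) cells
print(cube_aggregate(facts, (True, False)))   # rolled up to region totals
```

Dropping a dimension corresponds to OLAP roll-up; keeping all dimensions gives the base cuboid at the finest granularity.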

Areas where data mining is widely used include:

#1) Financial Data Analysis: Data Mining is widely used in banking, investment, credit services, mortgage, automobile loans, and insurance & stock investment services. The data collected from these sources is complete, reliable and is of high quality. This facilitates systematic data analysis and data mining.

#2) Retail and Telecommunication Industries: The retail sector collects huge amounts of data on sales, customer shopping history, goods transportation, consumption, and service. Retail data mining helps identify customer buying behaviors, shopping patterns, and trends, improve the quality of customer service, and achieve better customer retention and satisfaction.

#3) Science and Engineering: Data mining in computer science and engineering can help to monitor system status, improve system performance, isolate software bugs, detect software plagiarism, and recognize system malfunctions.

#4) Intrusion Detection and Prevention: Intrusion is defined as any set of actions that threaten the integrity, confidentiality, or availability of network resources. Data mining methods can help intrusion detection and prevention systems to enhance their performance.

#5) Recommender Systems: Recommender systems help consumers by making product recommendations that are of interest to users.

Listed below are the various challenges involved in data mining.

Data Mining is an iterative process where the mining process can be refined, and new data can be integrated to get more efficient results. Data Mining meets the requirement of effective, scalable and flexible data analysis.

It can be considered a natural evolution of information technology. As a knowledge discovery process, data preparation and data mining tasks together complete the data mining process.

Data mining processes can be performed on any kind of data such as database data and advanced databases such as time series etc. The data mining process comes with its own challenges as well.

Stay tuned to our upcoming tutorial to know more about Data Mining Examples!!
