Data Mining

The term mining has been used for centuries in hundreds of different languages to mean more or less the same thing: the process of extraction. In today's world of computer science, the term keeps that meaning; what has changed is the material being mined. Data is a key driver in computing, and mining it has gained traction over the last few years. Data mining is still a fairly new technology, but what is it really? This article explains the ideas behind data mining and its benefits. Read ahead to find out.

Data mining is the process of extracting useful information from large data sets held in databases and data warehouses, using pattern discovery, analysis of the data from various perspectives, and machine learning. The definition is long and varies from author to author. Data mining draws on computer science and statistics to extract this information, with the overall goal of putting it to further use. Data mining is only one step in a larger process known as knowledge discovery in databases (KDD).

Data mining involves six steps, which are often described as tasks. Together they produce useful information that is summarized, informative, compact and visual. The six steps are discussed below.

Anomaly Detection
In this first step, unusual data patterns are identified, along with data errors that might require further investigation. By definition, anomaly detection is the process of identifying rare observations, items or events that raise suspicion because they differ from the bulk of the data. To put this in a real-world perspective, picture a bank analyzing its data: among hundreds of thousands of transactions, there is a single account holding over $1,000,000 that has not had a single transaction in the last 5 years. What could have happened to the account holder? This is an example of an anomaly. Once it is detected, action can be taken. Anomalies are also called outliers, novelties or noise.
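
To make the bank example concrete, here is a minimal Python sketch of one possible way to flag an anomaly. It assumes that a median-based "modified z-score" is a reasonable measure of how unusual a value is; the function name find_outliers, the 3.5 cutoff and the sample balances are illustrative choices of mine, not real bank data or a prescribed method.

```python
from statistics import median

def find_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    The modified z-score uses the median and the median absolute
    deviation (MAD), so one extreme value cannot hide itself by
    inflating the average.
    """
    values = list(values)
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values equal the median; nothing to compare against.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical account balances: one account is far from the rest.
balances = [1200, 950, 1800, 2100, 1500, 1000000, 1350, 1700]
print(find_outliers(balances))  # [1000000]
```

A real system would look at more than one number per account (for example, the time since the last transaction), but the idea is the same: measure how far each record sits from the bulk of the data and flag the ones that are too far.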

Association Rule Learning
This is where machine learning comes in. Using a specific algorithm, a computer is able to associate one variable with one or more other variables. By definition, association rule learning is a rule-based machine learning method used to discover relationships between variables in a large database. Note that data mining targets large databases rather than small ones. The aim is for the machine to find, at scale, the kind of connections a person might notice. Consider a grocery store as an example of where this step can be applied. Using this type of machine learning, the store keeper can find out which groceries are bought together. This helps the owner stock related items in the same area of the shelves, which makes shopping easier for customers and can increase their loyalty.
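
The grocery store idea can be sketched in a few lines of Python. This is a simplified illustration, not a full algorithm such as Apriori: it only looks at pairs of items and computes the two usual measures, support (how often the pair appears together) and confidence (how often the second item appears given the first). The function name pair_rules, the 0.5 support cutoff and the baskets are assumptions made up for the example.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent pairs."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicates within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # the pair is too rare to be interesting
        # One rule in each direction: a -> b and b -> a.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules

# Hypothetical shopping baskets.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
]
for a, b, support, confidence in pair_rules(baskets):
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
# bread -> butter: support=0.75, confidence=1.00
# butter -> bread: support=0.75, confidence=1.00
```

Here bread and butter appear together in three of the four baskets, so the store owner would have a reason to shelve them near each other.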

Clustering
This third step is a way of grouping objects so that objects in the same cluster are more similar to one another than to objects in other clusters. It is a fundamental task in data mining. More than one algorithm can be used to achieve it, and the algorithms differ in what they count as a cluster. Known structures in the data are not used: the groups are discovered, not supplied in advance.
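
As a rough illustration, below is a bare-bones Python version of k-means, which is just one of the many clustering algorithms available. The sketch assumes the number of clusters k is chosen by hand and, to stay deterministic, naively uses the first k points as starting centres; the function names and the sample points are invented for the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=10):
    """Group points into k clusters; returns (clusters, centroids)."""
    centroids = list(points[:k])  # naive start: the first k points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids

# Hypothetical 2-D observations forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
clusters, centroids = kmeans(points, k=2)
print(clusters)
# [[(1.0, 1.0), (1.5, 2.0), (1.0, 0.5)], [(8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]]
```

Notice that no labels were given to the algorithm. It worked out the two groups purely from how close the points are to each other.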

Classification
The fourth step in this process is not the same as clustering. In classification, observations are assigned to known categories. In clustering, there are no known categories; the algorithm just groups objects (data). Classification is the problem of identifying to which of a set of predefined categories a new observation belongs. A good real-life example is how email is categorized as spam or not spam.
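
The spam example can be sketched with a tiny naive Bayes classifier in Python. Naive Bayes is only one of many classification methods, picked here because it is short; real spam filters are far more elaborate. The function names train and classify, the labels "spam" and "ham", and the four training messages are all hypothetical.

```python
import math
from collections import Counter

def train(examples):
    """Count words per label from a list of (text, label) pairs."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the most probable label for text (with add-one smoothing)."""
    vocabulary = {word for counts in word_counts.values() for word in counts}
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Start from the log prior, then add the log likelihood of each word.
        score = math.log(label_counts[label] / total_examples)
        denominator = sum(counts.values()) + len(vocabulary)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labelled messages: the categories are known in advance.
training = [
    ("win a free prize now", "spam"),
    ("free money claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
model = train(training)
print(classify("claim your free prize", *model))      # spam
print(classify("agenda for the team lunch", *model))  # ham
```

Unlike the clustering case, the categories here are fixed before the algorithm sees any new message. It only decides which of them fits best.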

Regression
This fifth step finds a function that models the data with the least amount of error. The purpose of this task is to estimate relationships among data or data sets. Regression is a statistical technique that deals with the relationship between a dependent variable and one or more independent variables. The model treats the dependent variable as being influenced by the independent variable(s).
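
For the simplest case, one independent variable and a straight line, the "least error" function can be computed directly. The Python sketch below uses ordinary least squares, which is one common choice of error measure among several; the function name fit_line and the sample numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: x is the independent variable, y the dependent one.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0

# The fitted function can then be used to estimate y for an unseen x.
print(slope * 6 + intercept)  # 13.0
```

The sample data lies exactly on a line, so the fit is perfect. With real data the points scatter around the line, and the method picks the line with the smallest total squared error.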

Summarization
This is a straightforward step. By definition, it is the provision of a more compact representation of the data set, including visualization and report generation.
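
A small Python sketch shows what a "more compact" data set can look like: many individual records are reduced to a few figures per group. The function name summarize, the field names region and amount, and the records themselves are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def summarize(records, key, value):
    """Reduce records to count, total and mean of `value` for each `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    return {
        group: {"count": len(vals), "total": sum(vals), "mean": mean(vals)}
        for group, vals in groups.items()
    }

# Hypothetical sales records.
sales = [
    {"region": "north", "amount": 100},
    {"region": "south", "amount": 50},
    {"region": "north", "amount": 300},
]
report = summarize(sales, key="region", value="amount")
for region, stats in report.items():
    print(region, stats)
```

A table like this is the kind of output that would then feed a chart or a written report.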

In the end, the information generated through data mining must be valid and useful enough for decision making and prediction. These tasks are important in a variety of fields, including but not limited to medicine, business, AI, science and even surveillance. Put simply, wherever there is digital data, data mining is applicable. This is not all, though; data mining covers a lot more ground. I shall extend this article in the near future, but for now, if you have any concerns, be sure to leave them in the comment section below.
