Data Mining Techniques PDF

Lect.4: Data Mining Techniques

Overview
• Data mining process.
• Data mining techniques.
• Summary.

2.1 Data Mining Process

Before the actual data mining can occur, several preparatory processes are involved in a data mining implementation. They are presented in the following five steps: business research, data quality checks, data cleaning, data transformation, and data modeling.

• Step 1: Business Research – before you begin, you need a complete understanding of your enterprise's objectives, available resources, and current situation, in alignment with its requirements. This helps create a detailed data mining plan that effectively reaches the organization's goals.
• Step 2: Data Quality Checks – as data is collected from various sources, it needs to be checked and matched to ensure there are no bottlenecks in the data integration process. Quality assurance helps spot underlying anomalies in the data, such as missing values that require interpolation, keeping the data in top shape before it undergoes mining.
• Step 3: Data Cleaning – it is believed that 90% of the time is spent selecting, cleaning, formatting, and anonymizing data before mining.
• Step 4: Data Transformation – comprising five sub-stages, the processes here turn the data into final data sets. They are:
− Data Smoothing: noise is removed from the data.
− Data Summary: the data sets are aggregated.
− Data Generalization: low-level data is replaced with higher-level concepts.
− Data Normalization: the data is scaled into set ranges.
− Data Attribute Construction: the data sets must be expressed as a set of attributes before data mining.
• Step 5: Data Modeling – for better identification of data patterns, several mathematical models are applied to the dataset, based on several conditions.
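Two of the transformation sub-stages in Step 4, smoothing and normalization, can be illustrated with a minimal sketch. The notes do not say which methods are meant, so a centered moving average and min-max scaling are assumed here as common choices; the function names and the daily_sales data are hypothetical.

```python
def moving_average(values, window=3):
    """Data smoothing (assumed method): average each value with its neighbors."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)                 # clip the window at the edges
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Data normalization (assumed method): rescale linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:                              # constant data: nothing to scale
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


daily_sales = [10, 12, 50, 11, 13]            # hypothetical data with one noisy spike
smoothed = moving_average(daily_sales)        # the spike at index 2 is damped
normalized = min_max_normalize(daily_sales)   # every value now lies in [0, 1]
```

The order of the two operations depends on the project; here each is applied to the raw data separately so the effect of each step is visible on its own.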
2.2 Data Mining Techniques

• Data mining uses refined data analysis tools to find previously unknown, valid patterns and relationships in huge data sets.
• These tools can incorporate statistical models, machine learning techniques, and mathematical algorithms, such as neural networks or decision trees.
• Thus, data mining incorporates both analysis and prediction.
• Drawing on methods and technologies from the intersection of machine learning, database management, and statistics, data mining professionals have devoted their careers to better understanding how to process and draw conclusions from huge amounts of data. But what methods do they use to make it happen?
• In recent data mining projects, several major techniques have been developed and used, including association, classification, clustering, prediction, sequential patterns, outlier (anomaly) detection, feature selection, and regression.

[Figure: Data Mining Techniques: Association Learning, Clustering, Classification, Prediction, Regression, Outlier Rejection, Feature Selection]

2.2.1 Association Learning

• Association learning, or market-basket analysis, is used to analyze which things tend to occur together, either in pairs or in larger groups.
• A basic example: review the purchases people make at the cash register and see that people who buy milk also buy bread, or that people who buy diapers also buy baby formula.
• It is all about finding these associations among products, the idea being that you can leverage the information.
• You can either bundle these items together, or you can do "crazy stuff" like putting apples at one end of the store and oranges at the other end, so that as people travel through the store they buy all kinds of other things they did not plan on buying.
• Association learning goes beyond simple correlation, according to Giraud-Carrier, because it extends beyond pairs and can account for larger groupings of items.
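To make the market-basket idea concrete, the following minimal sketch scores a single rule "A implies B" with the three standard association measures: support (the share of baskets containing both items), confidence (the share of baskets containing A that also contain B), and lift (confidence divided by the share of baskets containing B). The function name rule_metrics and the basket data are hypothetical, chosen only for illustration.

```python
def rule_metrics(transactions, a, b):
    """Return (support, confidence, lift) for the rule a -> b."""
    n = len(transactions)
    count_a = sum(1 for t in transactions if a in t)
    count_b = sum(1 for t in transactions if b in t)
    count_ab = sum(1 for t in transactions if a in t and b in t)
    if n == 0 or count_a == 0 or count_b == 0:
        raise ValueError("both items must appear in at least one transaction")
    support = count_ab / n                # how often A and B occur together
    confidence = count_ab / count_a       # how often B occurs when A occurs
    lift = confidence / (count_b / n)     # confidence relative to B's base rate
    return support, confidence, lift


# Hypothetical shopping baskets, one set of items per checkout.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
    {"diapers", "baby formula"},
]
support, confidence, lift = rule_metrics(baskets, "milk", "bread")
```

A lift above 1 suggests the two items appear together more often than their individual frequencies would predict; a lift near 1 suggests no association.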
• Thus, association learning as a data mining technique helps to discover a link between two or more items. It finds hidden patterns in the data set.
• Association rules are if-then statements that help show the probability of interactions between data items within large data sets in different types of databases.
• Association rule mining has several applications and is commonly used to find sales correlations in transactional data or in medical data sets.
• The algorithm works on transaction data, for example a list of the grocery items you have bought over the last six months. It calculates the percentage of items being purchased together.
• There are three major measures: lift, support, and confidence. In the formulas, (Item A + Item B) stands for the number of transactions that contain both items.
− Lift: measures the strength of the confidence relative to how often item B is purchased overall. (Confidence) / ((Item B) / (Entire dataset))
− Support: measures how often the items are purchased together, compared with the overall dataset. (Item A + Item B) / (Entire dataset)
− Confidence: measures how often item B is purchased when item A is purchased as well. (Item A + Item B) / (Item A)

2.2.2 Clustering

• Recognizing distinct groups or sub-categories within data is called cluster detection. Machine learning algorithms detect significantly differing subgroups within a dataset.
• For example, something humans naturally do is separate things into groups. We put people in buckets: people we like, people we don't like; people we find attractive, people we don't find attractive.
• "In your head you have these notions of what makes people attractive or not, and it's based on who knows what."
• When people match those criteria, a person will sort them into the corresponding buckets.
• It is called clustering because people don't have labels on their heads identifying which bucket they belong in.
• Cluster detection uses data to sort things into buckets based on similarities.
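The idea of sorting unlabeled items into buckets by similarity can be sketched with k-means, one common clustering algorithm. The notes do not name a specific algorithm, so k-means is swapped in here purely as an illustration; the customer data, the two spending dimensions, and the starting centroids are all hypothetical.

```python
import math


def k_means(points, centroids, iterations=10):
    """Repeatedly assign points to the nearest centroid, then recenter."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the bucket of its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its bucket
        # (an empty bucket keeps its previous centroid).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical customers: (spend in category 1, spend in category 2).
customers = [(9, 1), (8, 2), (10, 1), (1, 9), (2, 8), (1, 10)]
centroids, clusters = k_means(customers, centroids=[customers[0], customers[3]])
```

No labels are supplied: the two buckets emerge only from the distances between points. In practice the number of clusters and the starting centroids have to be chosen, and different choices can give different groupings.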
• Another example of cluster detection would be analyzing the purchasing behavior of hobbyists: fishermen and gardeners would naturally have different purchasing habits based on their hobbies.
• Clustering is a division of information into groups of connected objects. It models data by its clusters.
• From a historical point of view, data modeling roots clustering in statistics, mathematics, and numerical analysis.
• From a machine learning point of view, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting framework represents a data concept.
• From a practical point of view, clustering plays an outstanding role in data mining applications such as scientific data exploration, text mining, information retrieval, spatial database applications, CRM, Web analysis, computational biology, medical diagnostics, and many more.
• In other words, clustering analysis is a data mining technique for identifying similar data. It helps to recognize the differences and similarities between data items.
• Clustering is very similar to classification, but it groups chunks of data together based on their similarities rather than on existing labels.

2.2.3 Classification

• Unlike cluster detection, classification deals with things that already have labels. These labeled examples are referred to as training data: existing information that can be trained on, or rather easily classified, with an algorithm.
• For example, spam filters identify differences between the content found in legitimate and spam messages. This is made possible by first labeling large sets of emails as spam.
• The classification technique is used to obtain important and relevant information about data and metadata. It helps to classify data into different classes.

2.2.4 Prediction

• Prediction uses a combination of other data mining techniques, such as trends, clustering, and classification.
• It analyzes past events or instances in the right sequence to predict a future event.

2.2.5 Regression

• The regression method is used to make predictions based on relationships within the data set.
• For example, future engagement on Facebook can be predicted from everything in the user's history: likes, photo tags, comments and interactions with other users, friend requests, and all other activity on the site.
• Another example would be using the relationship between income and education level to predict choice of neighborhood. Regression allows the relationships within the data to be analyzed and then used to predict future behavior.
• Regression analysis is the data mining process used to identify and analyze the relationship between variables in the presence of other factors.
• It is used to estimate the likelihood of a specific variable. Regression is primarily a form of planning and modeling.
• For example, we might use it to project certain costs, depending on other factors such as availability, consumer demand, and competition. Primarily, it estimates the relationship between two or more variables in the given data set.

2.2.6 Outlier/Anomaly Detection

• Anomaly detection, or outlier detection, can be used to determine when something is noticeably different from the regular pattern.
• This type of technique relates to the observation of data items in the data set that do not match an expected pattern or expected behavior.
• It may be used in various domains, such as intrusion detection and fraud detection.
• An outlier is a data point that diverges too much from the rest of the dataset.
• The majority of real-world datasets contain outliers.
• Outlier detection plays a significant role in data mining. It is valuable in numerous fields, such as network intrusion detection, credit or debit card fraud detection, and detecting outlying readings in wireless sensor network data.
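One simple way to flag a data point that "diverges too much from the rest of the dataset" is a z-score test: measure how many standard deviations each value lies from the mean. The notes do not prescribe a method, so this is only one assumed choice; the threshold of 2.0, the function name, and the transaction amounts are hypothetical.

```python
import statistics


def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)       # population standard deviation
    if std == 0:                          # all values identical: no outliers
        return []
    return [v for v in values if abs(v - mean) / std > threshold]


# Hypothetical card transaction amounts with one suspicious charge.
amounts = [52, 48, 50, 51, 49, 50, 47, 500]
outliers = z_score_outliers(amounts)
```

The mean and standard deviation are themselves pulled toward extreme values, so on small samples a lower threshold (as assumed here) or a median-based measure is often preferred.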
2.2.7 Feature Selection

• Feature selection is the process of automatically or manually selecting the features that contribute most to the prediction variable or output you are interested in.
• Having irrelevant features in your data can decrease the accuracy of models, because the model then learns from irrelevant features.

2.3 Summary

• Data mining process:
(1) Business Understanding/Problem Definition: the first step is to identify goals. Based on the defined goal, the correct series of tools can be applied to the data to build the corresponding behavioral model.
(2) Data Understanding/Data Exploration: if the quality of the data is not suitable for an accurate model, recommendations on future data collection and storage strategies can be made at this stage. For analysis, all data needs to be consolidated so that it can be treated consistently.
(3) Data Preparation: the purpose of this step is to clean and transform the data so that missing and invalid values are treated and all known valid values are made consistent for more robust analysis.
(4) Modeling: based on the data and the desired outcomes, a data mining algorithm or combination of algorithms is selected for analysis.
(5) Evaluation and Deployment: based on the results of the data mining algorithms, an analysis is conducted to determine key conclusions and create a series of recommendations for consideration.
• Data mining techniques:
