Data Mining Lecture PDF

Data Mining

Data mining is the art and science of discovering knowledge, insights, and patterns in data. It is the act of extracting useful patterns from an organized collection of data. Patterns must be valid, novel, potentially useful, and understandable. The implicit assumption is that data about the past can reveal patterns of activity that can be projected into the future.

Data mining is a multidisciplinary field that draws modeling and analytical techniques from statistics, computer science, and artificial intelligence. Past data can be of predictive value in many complex situations, especially where the pattern is not easily visible without a modeling technique.

Gathering and Selecting Data

There is an ever-growing avalanche of data, arriving with higher velocity, volume, and variety. One has to make judicious decisions about what to gather and what to ignore, based on the purpose of the data mining exercise. To learn from data, one needs to gather quality data effectively, clean and organize it, and then process it efficiently. Most organizations develop an enterprise data model (EDM), which is a unified, high-level model of all the data stored in an organization’s databases.

Gathering and curating data takes time and effort, particularly when it is unstructured or semi-structured. Unstructured data can come in many forms, such as blogs, images, videos, and chats. There are streams of unstructured social media data from blogs, chats, and tweets. There are also streams of machine-generated data from connected machines, RFID tags, the internet of things, and so on.

Data Cleansing and Preparation

The quality of data is critical to the success and value of the data mining project. The quality of incoming data varies by the source and nature of data. Data from internal operations is likely to be of higher quality, as it is more likely to be accurate and consistent.
Data from social media and other public sources is less under the control of the business and is less likely to be reliable. Data almost certainly needs to be cleansed and transformed before it can be used for data mining. Data cleansing and preparation is a labor-intensive or semiautomated activity that can take 60 to 70 percent of the time needed for a data mining project.

1. Duplicate data needs to be removed. The same data may be received from multiple sources. When merging the data sets, data must be de-duped.
2. Missing values need to be filled in, or those rows should be removed from analysis. Missing values can be filled in with average, modal, or default values.
3. Data elements may need to be transformed from one unit to another. For example, total costs of health care and the total number of patients may need to be reduced to cost per patient to allow comparability of that value.
4. Continuous values may need to be binned into a few buckets to help with some analyses. For example, work experience could be binned as low, medium, and high.
5. Data elements may need to be adjusted to make them comparable over time. For example, currency values may need to be adjusted for inflation by converting them to the same base year. They may also need to be converted to a common currency.
6. Outlier data elements need to be removed after careful review, to avoid skewing the results. For example, one big donor could skew the analysis of alumni donors in an educational setting.
7. Any biases in the selection of data should be corrected to ensure the data is representative of the phenomena under analysis. If the data includes many more members of one gender than is typical of the population of interest, then adjustments need to be applied to the data.
8. Data should be brought to the same granularity to ensure comparability.
Sales data may be available daily, but the salesperson compensation data may only be available monthly. To relate these variables, the data must be brought to the lowest common denominator, in this case, monthly.
9. Data may need to be selected to increase information density. Some data may not show much variability, because it was not properly recorded or for other reasons. This data may dull the effects of other differences in the data and should be removed to improve the information density of the data.

Outputs of Data Mining

The outputs of data mining will reflect the objective being served. There are many representations of the outputs of data mining. One popular form of data mining output is a decision tree. It is a hierarchically branched structure that helps visually follow the steps to make a model-based decision. The output can also be a regression equation or mathematical function that represents the best-fitting curve for the data. This equation may include linear and nonlinear terms. Business rules are an appropriate representation of the output of a market basket analysis exercise. These rules are if-then statements with some probability parameters associated with each rule.

Data Mining Techniques

There are two primary kinds of data mining processes: supervised learning and unsupervised learning.

Decision trees are the most popular data mining technique, for many reasons.
1. Decision trees are easy to understand and easy to use, by analysts as well as executives. They can also show high predictive accuracy.
2. They automatically select the most relevant variables for decision-making out of all the available variables.
3. Decision trees are tolerant of data quality issues and do not require much data preparation from the users.
4. Even nonlinear relationships can be handled well by decision trees.

Regression is relatively simple and is the most popular statistical data mining technique.
The goal is to fit a smooth, well-defined curve to the data. Regression analysis techniques, for example, can be used to model and predict energy consumption as a function of daily temperature.

An artificial neural network (ANN) is a sophisticated data mining technique from the artificial intelligence stream of computer science. It mimics the behavior of the human neural structure: neurons receive stimuli, process them, and communicate their results to other neurons successively, and eventually a neuron outputs a decision. ANNs are popular because they are eventually able to reach a high predictive accuracy. ANNs are also relatively simple to implement and are fairly tolerant of data quality issues. ANNs require a lot of training data to develop good predictive ability.

Cluster analysis is an exploratory learning technique that helps in identifying a set of similar groups in the data. It is used for automatic identification of natural groupings of things. Data instances that are similar to (or near) each other are categorized into one cluster, while data instances that are very different (or far away) from each other are categorized into separate clusters. Any number of clusters could be produced from the data. The K-means technique is popular and lets the user guide the selection of the right number (K) of clusters.

Assignment #05

1. What is data mining? What are supervised and unsupervised learning techniques?
2. Describe the key steps in the data mining process. Why is it important to follow these processes?
3. What is a confusion matrix?
4. Why is data preparation so important and time consuming?
5. What are some of the most popular data mining techniques?
6. What are the major mistakes to be avoided when doing data mining?
7. What are the key requirements for a skilled data analyst?
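
Worked example: data cleansing. As a supplement to the Data Cleansing and Preparation steps, the following is a minimal sketch of three of them (de-duplication, filling missing values with the average, and binning) using only the Python standard library. The records, the function names, and the cutoffs of 3 and 10 years are hypothetical and chosen only for illustration; a real project would pick them from the data and the purpose of the analysis.

```python
from statistics import mean

# Hypothetical records: (employee_id, years_of_experience).
# None marks a missing value; the third row duplicates the first.
raw_records = [
    (1, 2.0),
    (2, None),
    (1, 2.0),
    (3, 12.0),
    (4, 7.0),
]


def remove_duplicates(records):
    """Drop exact duplicate rows while preserving the original order."""
    seen = set()
    unique = []
    for row in records:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def fill_missing_with_mean(records):
    """Replace each missing value with the mean of the known values."""
    known = [value for _, value in records if value is not None]
    average = mean(known)
    return [(key, average if value is None else value) for key, value in records]


def bin_experience(years, low_cutoff=3.0, high_cutoff=10.0):
    """Bin a continuous value as low, medium, or high (illustrative cutoffs)."""
    if years < low_cutoff:
        return "low"
    if years < high_cutoff:
        return "medium"
    return "high"


def prepare(records):
    """Run the three steps in order: de-dupe, fill missing values, then bin."""
    deduped = remove_duplicates(records)
    filled = fill_missing_with_mean(deduped)
    return [(key, value, bin_experience(value)) for key, value in filled]


cleaned = prepare(raw_records)
for row in cleaned:
    print(row)
```

De-duplication runs first so that the repeated row does not count twice when the average is computed for the missing value.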

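
Worked example: K-means clustering. The lecture describes K-means only at a high level, so the sketch below assumes the standard assign-and-update loop: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until the assignments stop changing. The sample points, the choice of K = 2, and the starting centroids are hypothetical; in practice the starting centroids are usually chosen at random and K is guided by the analyst.

```python
import math


def assign_clusters(points, centroids):
    """Assign each point to the index of its nearest centroid."""
    labels = []
    for point in points:
        distances = [math.dist(point, centroid) for centroid in centroids]
        labels.append(distances.index(min(distances)))
    return labels


def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for index, old_centroid in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == index]
        if not members:
            # An empty cluster keeps its previous centroid.
            new_centroids.append(old_centroid)
            continue
        new_centroids.append(
            tuple(sum(axis) / len(members) for axis in zip(*members))
        )
    return new_centroids


def k_means(points, initial_centroids, max_iterations=100):
    """Alternate assignment and update steps until the labels stop changing."""
    centroids = list(initial_centroids)
    labels = assign_clusters(points, centroids)
    for _ in range(max_iterations):
        centroids = update_centroids(points, labels, centroids)
        new_labels = assign_clusters(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical 2-D data with two visibly separate groups (K = 2).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 3.0), (8.0, 8.0), (9.0, 9.0), (8.5, 10.0)]
labels, centroids = k_means(points, initial_centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

With these points, the three instances near (1, 2) fall into one cluster and the three near (8.5, 9) into the other, which matches the idea that nearby instances are grouped together and distant ones are separated.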