DMW - Data Mining Tasks, Chapter 3 (PDF)
Summary
This document provides an introduction to data mining tasks, focusing on exploratory data analysis, predictive modeling, and descriptive modeling, along with discovering patterns and rules. Specific techniques such as clustering, association rules, and decision trees are discussed. It covers topics that are relevant to data analysis processes in various fields.
Full Transcript
Chapter 3 DMW - Data Mining Tasks

Topics:
– Exploratory Data Analysis
– Predictive Modeling
– Descriptive Modeling
– Discovering Patterns and Rules

Exploratory Data Analysis (EDA)

Exploratory data analysis can be extremely valuable:
– You should always "look" at your data before applying any data mining algorithms.
– EDA is useful for data checking, e.g., finding that a variable is always integer-valued or positive, or that some variables are highly skewed.

In general, EDA helps to:
– maximize insight into a dataset;
– uncover/extract underlying structure;
– extract important variables;
– detect outliers and anomalies;
– test underlying assumptions.

Classical sequence of data analysis:
– Problem -> Data -> Model -> Analysis -> Conclusion
– The pattern is modeled before the analysis.

Exploratory sequence:
– Problem -> Data -> Analysis -> Model -> Conclusion
– The data are analyzed before a model or mining algorithm is applied.

EDA helps to get an overall sense of the dataset by computing summary statistics:
– number of distinct values, max, min, mean, median, variance, skewness, ...

EDA also employs a variety of (mostly graphical) techniques for visualization, such as scatter plots, mean plots, and standard deviation plots.

Predictive Modeling

A predictive model makes a prediction about values of data using known results found from historical data.
– Prediction methods use existing variables to predict unknown or future values of other variables: predict one variable Y given a set of other variables X, where X may be an n-dimensional vector.
– In effect this is function approximation: learning the relationship between Y and X.
There are many algorithms for predictive modeling in statistics and machine learning, including classification, regression, etc.
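As a concrete sketch of the data-checking step, the summary statistics listed in the EDA section above (distinct values, min, max, mean, median, variance, skewness) can be computed for a numeric column in plain Python. The sample list and the population-style skewness formula are illustrative choices, not taken from the slides:

```python
import statistics

def summarize(values):
    """Compute the EDA summary statistics for one numeric column."""
    n = len(values)
    mean = statistics.mean(values)
    var = statistics.pvariance(values)
    sd = var ** 0.5
    # Population skewness: mean of cubed standardized deviations.
    skew = sum(((v - mean) / sd) ** 3 for v in values) / n if sd else 0.0
    return {
        "distinct": len(set(values)),
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "median": statistics.median(values),
        "variance": var,
        "skewness": skew,
    }

# A small, deliberately right-skewed sample: positive skewness flags it,
# just as the EDA slide suggests checking for highly skewed variables.
print(summarize([1, 2, 2, 3, 4, 10]))
```

A positive skewness value here signals the long right tail that a histogram would also reveal.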
Often the emphasis is on predictive accuracy, with less emphasis on understanding the model.

Prediction Problems: Classification vs. Numeric Prediction

Classification:
– predicts categorical class labels (discrete or nominal);
– constructs a model from the training set and the values (class labels) of a classifying attribute, and uses it to classify new data/instances;
– is used to assign new instances of data to predefined class labels.

Numeric prediction:
– models continuous-valued functions, i.e., predicts unknown or missing numeric values, using the data available in databases, data warehouses, and the WWW.

Models and Patterns

A model is an abstract representation of a given training dataset. A very simple example is the linear model structure Y = aX + b, where the parameters a and b are determined from the data:
– Y = aX + b is the model structure;
– Y = 0.9X + 0.3 is a particular model.

Example training data:

  x    f(x)
  1    1
  2    4
  3    9
  4    16
  5    ?

A pattern represents "local structure" in a dataset.
– E.g., "if X > x then Y > y with probability p".
Given a finite sample of pairs, can we create a model that holds for future values? To guess the true function f, find some pattern (called a hypothesis) in the training examples, and assume that the pattern will hold for future examples too.

Predictive Modeling: Fraud Detection

Credit card fraud detection:
– Credit card losses in the US are over $1 billion per year.
– Roughly 1 in 50 transactions is fraudulent.
Approach:
– For each transaction, estimate p(fraudulent | transaction).
– The model is built on historical data of known fraud/non-fraud cases.
– High-probability transactions are investigated by fraud police.
Example:
– Fair-Isaac/HNC's fraud detection software, based on neural networks, led to reported fraud decreases of 30 to 50% (http://www.fairisaac.com/fairisaac).

Predictive Modeling: Customer Scoring

Example: a bank has a database of 1 million past customers, 10% of whom took out mortgages.
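The slides say the parameters a and b of Y = aX + b are "determined from the data" but do not name a method; a minimal sketch using ordinary least squares, one common choice, is:

```python
def fit_line(xs, ys):
    """Ordinary least-squares estimates of a and b in Y = aX + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Data generated from the slide's particular model Y = 0.9X + 0.3,
# so the fit should recover those parameters.
xs = [1, 2, 3, 4, 5]
ys = [1.2, 2.1, 3.0, 3.9, 4.8]
a, b = fit_line(xs, ys)
print(f"Y = {a:.2f}X + {b:.2f}")  # Y = 0.90X + 0.30
```

Note that this only fits the chosen model *structure*; if the true f is nonlinear (as in the x, f(x) = x² table above), a linear structure will fit poorly no matter how a and b are chosen.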
Use machine learning to rank new customers as a function of p(mortgage | customer data).
Customer data:
– history of transactions with the bank;
– other credit data (obtained from Experian, etc.);
– demographic data on the customer or where they live.
Techniques:
– binary classification: logistic regression, decision trees, etc.
– There are many, many applications of this nature.

Classification Example: Credit Scoring

Differentiate between low-risk and high-risk customers based on their income, savings, etc.
Discriminant rule: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk

Descriptive Modeling

The goal is to build a "descriptive" model of the underlying observations, e.g., a model that could simulate the data if needed. A descriptive model identifies patterns or relationships in data. Unlike a predictive model, a descriptive model serves as a way to explore the properties of the data examined, not to predict new properties.

[Figure: EM clustering after 25 iterations — scatter plot of Red Blood Cell Volume (x-axis, 3.3–4.0) vs. Red Blood Cell Hemoglobin Concentration (y-axis, 3.7–4.4).]

Description methods find human-interpretable patterns that describe the data and find its natural groupings. Methods used in descriptive modeling include clustering, summarization, etc.

Clustering

[Figure: toy clustering example on hair length — hair length 0 (short hair) grouped as "male", hair length 30 cm (long hair) grouped as "female".]

Example of Descriptive Modeling

Goal: learn directed relationships among p variables.
Techniques: directed (causal) graphs.
Challenge: distinguishing between correlation and causation.
– Example: Do yellow fingers cause lung cancer? Is there any relationship between yellow fingers and lung cancer? Here there is a hidden cause, smoking: smoking -> yellow fingers, and smoking -> cancer.
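The credit-scoring discriminant rule above translates directly into code; the threshold values standing in for θ1 and θ2 below are hypothetical placeholders, not values from the slides:

```python
THETA_1 = 40_000  # income threshold θ1 (hypothetical)
THETA_2 = 10_000  # savings threshold θ2 (hypothetical)

def credit_risk(income, savings):
    """IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk."""
    if income > THETA_1 and savings > THETA_2:
        return "low-risk"
    return "high-risk"

print(credit_risk(55_000, 12_000))  # low-risk
print(credit_risk(55_000, 5_000))   # high-risk
```

In practice the thresholds themselves would be learned from labeled historical data by one of the classification techniques listed above, rather than set by hand.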
Pattern (Association Rule) Discovery

Goal: to discover interesting "local" patterns (sequential patterns) in the data rather than to characterize the data globally.
– Also called link analysis (it uncovers relationships among data).
Given market basket data, we might discover that:
– if customers buy wine and bread, then they buy cheese with probability 0.9.
– Such rules are known as "association rules".
Methods used in pattern discovery include association rules, sequence discovery, etc.

Example of Pattern Discovery

Example in retail (from customer transactions to consumer behavior):
– People who bought "Da Vinci Code" also bought "The Five People You Meet in Heaven" (www.amazon.com).
Example (basketball player behavior):
– If player A is in the game, player B's scoring rate increases from 3.2 points per quarter to 8.7 points per quarter.

Example sequence data:

ADACABDABAABBDDBCADDDDBCDDBCCBBCC
DADADAADABDBBDABABBCDDDCDDA
BDCBBDBDBCBBABBBCBBABCBBACBBDBAACCADDADBDBBCBBCCBBBDCABDDBBA
DDBBBBCCACDABBABDDCDDBBABDBDDBDDBCACDBBCCBBACDCADCBACCADCCCACCD

What about the following?

DADCBCADADBAACCDDDCBDBDCCCCACACACCDABDDBCADADBCBDDADABCCABDAACA
BCABACBDDDCBADCBDADDDDCDDCADCCBBADABBAAADAAABCCBCABDBAADCBCDACB
CABABCCBACBDABDDDADAABADCDCCDBBCDBDADDCCBBCDBAADADBCAAAADBD
CADBDBBBCDCCBCCCDCCADAADACABDABAABBDDBCADDDDBCDDBCCBBCCDA
DADACCCDABAABBCBDBDBADBBBBCDADABABBDACDCDDDBBCDBBCBBCCDABCADDAD
BACBBBCCDBAAADDDBDDCABACBCADCDCBAAADCADDADAABBACCBB

Supervised vs. Unsupervised Learning

Supervised learning (classification):
– Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
– New data are classified based on the training set.
Unsupervised learning (clustering):
– The class labels of the training data are unknown.
– Given a set of measurements, observations, etc.
with the aim of establishing the existence of classes or clusters in the data.

Basic Data Mining Algorithms

Classification maps data into predefined groups or classes to enhance the prediction process (the data carry class labels).
– It is a supervised learning technique.
Clustering groups similar data together into clusters.
– It is used to find appropriate groupings of elements for a set of data. Unlike classification, clustering is a kind of undirected knowledge discovery or unsupervised learning: there is no target field, and the relationships among the data are identified by a bottom-up approach.
– It is an unsupervised learning technique.
Association rules (also known as market basket analysis):
– discover interesting associations between attributes.

Classification

Classification: Definition

Classification is a data mining (machine learning) technique used to predict group membership for data instances. Given a collection of records (the training set), each record contains a set of attributes, one of which is the class.
– Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model.
– Usually, the given dataset is divided into training and test sets; the training set is used to build the model and the test set to validate it.

Illustrating the Classification Task

A learning algorithm performs induction on the training set to learn a model; applying the model to the test set is deduction.

Training set:

  Tid  Attrib1  Attrib2  Attrib3  Class
  1    Yes      Large    125K     No
  2    No       Medium   100K     No
  3    No       Small    70K      No
  4    Yes      Medium   120K     No
  5    No       Large    95K      Yes
  6    No       Medium   60K      No
  7    Yes      Large    220K     No
  8    No       Small    85K      Yes
  9    No       Medium   75K      No
  10   No       Small    90K      Yes

Test set:

  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?
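The induction/deduction loop illustrated by the training and test tables can be sketched with a deliberately simple one-attribute majority-vote "model". This is a stand-in for a real learning algorithm, not a method the slides prescribe:

```python
from collections import Counter, defaultdict

# Training records from the table above: (Attrib1, Attrib2, Attrib3, Class).
train = [
    ("Yes", "Large", 125, "No"), ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"),   ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"),  ("No", "Medium", 60, "No"),
    ("Yes", "Large", 220, "No"), ("No", "Small", 85, "Yes"),
    ("No", "Medium", 75, "No"),  ("No", "Small", 90, "Yes"),
]

# Induction: for each value of Attrib1, remember the majority class.
counts = defaultdict(Counter)
for a1, a2, a3, cls in train:
    counts[a1][cls] += 1
model = {value: c.most_common(1)[0][0] for value, c in counts.items()}

# Deduction: apply the learned model to unlabeled test records.
test = [("No", "Small", 55), ("Yes", "Medium", 80)]
for record in test:
    print(record, "->", model[record[0]])
```

A real classifier (decision tree, k-NN, etc.) would use all attributes instead of one, but the two-phase structure — build a model from labeled data, then apply it to unseen records — is the same.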
Metrics for Performance Evaluation

Confusion matrix:

                       PREDICTED CLASS
                       Class=Yes   Class=No
  ACTUAL   Class=Yes   a (TP)      b (FN)
  CLASS    Class=No    c (FP)      d (TN)

The most widely used metric is accuracy:

  Accuracy = (a + d) / (a + b + c + d) * 100
           = (TP + TN) / (TP + TN + FP + FN) * 100

Classification Methods

Goal: predict the class C = f(x1, x2, ..., xn), where x1, x2, ..., xn are attributes and C is the class.
There are various classification methods. Popular classification techniques include the following:
– K-nearest neighbour: classifies based on a similarity measure.
– Decision tree classifier: divides the decision space into piecewise constant regions.
– Neural networks: partition the space by non-linear boundaries.
– Bayesian network: a probabilistic model.

Decision Tree Classifier

A decision tree performs classification by constructing a tree, based on the training instances, whose leaves carry class labels.
– The tree is traversed for each test instance to find a leaf, and the class of the leaf is the predicted class.
This is directed knowledge discovery in the sense that there is a specific field whose value we want to predict.
It is a widely used learning method. It has been applied, for example, to:
– classifying medical patients based on their disease;
– classifying equipment malfunctions by cause.

Decision Trees

A decision tree is a tree whose internal nodes are simple decision rules on one or more attributes and whose leaf nodes are predicted class labels, i.e., a Boolean classifier for the input instance. Given an instance of an object or situation, specified by a set of properties, the tree returns a "yes" or "no" decision about that instance.
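The accuracy metric above, with a = TP, b = FN, c = FP, d = TN, in code; the counts in the example call are made-up numbers for illustration:

```python
def accuracy(a, b, c, d):
    """Accuracy = (a + d) / (a + b + c + d) * 100,
    where a=TP, b=FN, c=FP, d=TN from the confusion matrix."""
    return (a + d) / (a + b + c + d) * 100

# Hypothetical confusion matrix: 50 TP, 10 FN, 5 FP, 35 TN.
print(accuracy(50, 10, 5, 35))
```

Accuracy alone can be misleading on skewed classes (e.g., the fraud example where only 1 in 50 transactions is fraudulent, so always predicting "not fraud" scores 98%), which is why the confusion matrix itself is worth inspecting.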
Example tree structure:

  Attribute_1
    value-1 -> Attribute_2
                 value-4 -> Class2
                 value-5 -> Class3
    value-2 -> Class1
    value-3 -> Attribute_2
                 value-6 -> Class4
                 value-7 -> Class5

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm):
– The tree is constructed in a top-down, recursive, divide-and-conquer manner.
– At the start, all the training examples are at the root.
– Attributes are categorical (continuous-valued attributes are discretized in advance).
– Examples are partitioned recursively based on selected attributes.
– Optimal attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain, Gini index).
Conditions for stopping the partitioning:
– All samples for a given node belong to the same class.
– There are no remaining attributes for further partitioning; in this case majority voting is employed to label the leaf.

Attribute Selection Measure: Information Gain

Select the attribute with the highest information gain.
– First, compute the disorder using entropy: the expected information needed to classify objects into classes.
– Second, measure the information gain: by how much the disorder of a set would be reduced by knowing the value of a particular attribute.
With the information gain measure we want a large gain, which is the same as a small average disorder after the split.
Problem: information gain is biased towards attributes with many values, without considering the size and number of the partitions at each node.
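The disorder measure described above, the entropy D(n+, n-) of a set with n+ positive and n- negative examples, can be sketched as:

```python
from math import log2

def entropy(pos, neg):
    """Disorder D(n+, n-) = -(n+/n)log2(n+/n) - (n-/n)log2(n-/n)."""
    n = pos + neg
    result = 0.0
    for k in (pos, neg):
        if k:  # skip empty classes: 0 * log2(0) is taken as 0
            p = k / n
            result -= p * log2(p)
    return result

print(entropy(0, 5))  # 0.0 -> pure set, all examples in one class
print(entropy(5, 5))  # 1.0 -> maximally mixed, half and half
```

The two printed cases are exactly the boundary properties listed for D: zero disorder when all examples share a class, and disorder 1 for an even split.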
Attribute Selection Measure: Gini Index

– An alternative to information gain that measures the impurity of attributes in the classification task.
– Select the attribute with the smallest Gini value.

Entropy

The entropy measures the disorder of a set S containing a total of n examples, of which n+ are positive and n- are negative:

  Entropy(S) = D(n+, n-) = -(n+/n) * log2(n+/n) - (n-/n) * log2(n-/n)

Some useful properties of the entropy:
– D(n, m) = D(m, n)
– D(0, m) = 0; D(S) = 0 means that all the examples in S have the same class.
– D(m, m) = 1; D(S) = 1 means that half the examples in S are of one class and half are of the opposite class.

Information Gain

The information gain measures the expected reduction in entropy due to splitting on an attribute A. If the parent node p is split into k partitions and ni is the number of records in partition i, then:

  GAIN_split = Entropy(S) - sum_{i=1..k} (ni/n) * Entropy(i)

where Entropy(i) = D(ni+, ni-) is the entropy of partition i.
Information gain measures the reduction in entropy achieved because of the split; choose the split that achieves the most reduction.

Example: Decision Tree for "buys_computer"

Use the training dataset below to construct a decision tree:

  age      income  student  credit_rating  buys_computer
  >40      low     yes      fair           yes
  >40      low     yes      excellent      no
  31…40    low     yes      excellent      yes
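A short sketch putting the three selection measures together. The split counts used in the example call (a parent with 9 yes / 5 no records divided into partitions of (2,3), (4,0), and (3,2)) are an assumed illustration, since only a few rows of the buys_computer table survive above:

```python
from math import log2

def entropy(pos, neg):
    """D(n+, n-): disorder of a set with pos positive and neg negative records."""
    n = pos + neg
    return -sum((k / n) * log2(k / n) for k in (pos, neg) if k)

def gini(pos, neg):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = pos + neg
    return 1 - (pos / n) ** 2 - (neg / n) ** 2

def info_gain(parent, partitions):
    """GAIN_split = Entropy(S) - sum_i (ni/n) * Entropy(i)."""
    n = sum(p + q for p, q in partitions)
    remainder = sum((p + q) / n * entropy(p, q) for p, q in partitions)
    return entropy(*parent) - remainder

parent = (9, 5)                      # 9 yes, 5 no (assumed)
parts = [(2, 3), (4, 0), (3, 2)]     # yes/no counts per attribute value (assumed)
print(round(info_gain(parent, parts), 3))
print(round(gini(9, 5), 3))
```

To select the root attribute, the induction algorithm would run info_gain (or gini) once per candidate attribute and split on the attribute with the largest gain (or smallest Gini).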