Summary

This document explains different forms of input in data mining. It explores concepts, instances, and attributes. It discusses various styles of learning like classification, association, and clustering. Additionally, it touches upon topics like data preparation, different data types (e.g., sparse data), and problems related to missing or inaccurate data.

Full Transcript

Input Representations

Before looking at how DM methods operate, let us look at the different forms the input might take. Then, in Chapter 3, we will look at the different forms the output might take.

The input takes the form of concepts, instances, and attributes. We call the thing that is to be learned a concept description. The idea of a concept, like the very idea of learning in the first place, is hard to pin down precisely, and we won’t spend time philosophizing about just what it is and isn’t. In a sense, what we are trying to find (the result of the learning process) is a description of the concept that is intelligible in that it can be understood, discussed, and disputed, and operational in that it can be applied to actual examples.

Instances

The information that the learner is given takes the form of a set of instances. In the examples in Chapter 1, each instance was an individual, independent example of the concept to be learned. Of course, there are many things you might like to learn for which the raw data cannot be expressed as individual, independent instances. Perhaps background knowledge should be taken into account as part of the input. Perhaps the raw data is an agglomerated mass that cannot be fragmented into individual instances. Perhaps it is a single sequence, say a time sequence, that cannot meaningfully be cut into pieces.

This book is about simple, practical methods of data mining, and we focus on situations where the information can be supplied in the form of individual examples. However, we do introduce one slightly more complicated scenario where the examples for learning contain multiple instances.

Each instance is characterized by the values of attributes that measure different aspects of the instance. Many data mining schemes deal only with numeric and nominal (aka categorical) attributes.
Finally, we examine the question of preparing input for data mining and introduce a simple format for representing the input information as a text file.

Styles of Learning

1. In classification learning, the learning scheme is presented with a set of classified examples from which it is expected to learn a way of classifying unseen examples.
2. In association learning, any association among features is sought, not just ones that predict a particular class value.
3. In clustering, groups of examples that belong together are sought.
4. In numeric prediction, the outcome to be predicted is not a discrete class but a numeric quantity.

Regardless of the type of learning involved, we call the thing to be learned the concept and the output produced by a learning scheme the concept description.

Classification Learning

Most of the examples in Chapter 1 are classification problems (e.g. weather, contact lens, iris, labor negotiation).
Tricky: multilabelled instances.
Supervised learning.
– Training set, test set
– Generalization
– Cross-validation

Association Learning

Most data sets in Chapter 1 can also be used for association learning. Here, the problem is to discover any structure in the data that is “interesting.” Because association rules can predict any attribute, there are potentially far more association rules than classification rules, and the challenge is to avoid being swamped by them.

For this reason, association rules are often limited to those that apply to a certain minimum number of examples (called the support), say 80% of the dataset, and that have greater than a certain minimum accuracy level (called the confidence), say 95%. Even then, there are usually lots of them, and they have to be examined manually to determine whether they are meaningful or not.

Association rules usually involve only nonnumeric attributes; thus, you wouldn’t normally look for association rules in the iris dataset.
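As a rough illustration of the support and confidence thresholds described above, the following Python sketch computes both figures for a single rule. It assumes support is measured as the fraction of all instances for which the whole rule holds, which is one common reading of the definition. The function names and the toy records are hypothetical and made up purely for illustration.

```python
def matches(instance, conditions):
    """True if the instance has every attribute value listed in conditions."""
    return all(instance.get(attr) == value for attr, value in conditions.items())


def support_and_confidence(instances, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent.

    support:    fraction of all instances where antecedent and consequent both hold
                (an assumed reading of "number of examples the rule applies to")
    confidence: fraction of antecedent-matching instances where the consequent holds
    """
    covered = [inst for inst in instances if matches(inst, antecedent)]
    correct = [inst for inst in covered if matches(inst, consequent)]
    support = len(correct) / len(instances) if instances else 0.0
    confidence = len(correct) / len(covered) if covered else 0.0
    return support, confidence


# Made-up toy records, for illustration only (not the Chapter 1 weather data).
toy = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
]

# Rule: if windy = false then play = yes
support, confidence = support_and_confidence(
    toy, {"windy": "false"}, {"play": "yes"}
)
print(f"support={support:.2f} confidence={confidence:.2f}")
# prints: support=0.40 confidence=0.67
```

A rule miner would evaluate many candidate rules this way and keep only those that clear both minimums, which is why the thresholds matter for keeping the output manageable.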
Clustering

When there is no specified class, clustering is used to group items that seem to fall naturally together. Imagine a version of the iris data in which iris type is omitted. Then it is likely that the 150 instances fall into natural clusters corresponding to the three iris types. The challenge is to find these clusters and assign the instances to them, and to be able to assign new instances to the clusters as well. It may be that one or more of the iris types splits naturally into subtypes, in which case the data will exhibit more than three natural clusters.

The success of clustering is often measured subjectively in terms of how useful the result appears to be to a human user. It may be followed by a second step of classification learning in which rules are learned that give an intelligible description of how new instances should be placed into the clusters.

Numeric Prediction

Numeric prediction is a variant of classification learning in which the outcome is a numeric value rather than a category. The CPU performance problem is one example. The weather data can also be recast into a form in which what is to be predicted is not play or don’t play but rather the time (in minutes) to play.

With numeric prediction problems, as with other machine learning situations, the predicted value for new instances is often of less interest than the structure of the description that is learned, expressed in terms of what the important attributes are and how they relate to the numeric outcome.

Input?

The input to a machine learning scheme is a set of instances (a.k.a. examples). A dataset can generally be represented as a matrix of instances versus attributes (a relation in DB terminology).
Instance versus example.

Multi-Instance Example

In some situations, instead of the individual instances being examples of the concept, each individual example comprises a set of instances. Example: a drug molecule can assume alternative shapes by rotating its bonds.
If one of these shapes has a certain property, then the molecule as a whole is considered to have it.

Attributes

What if different instances have different attributes? Or the presence of one attribute depends on the value of another (married and spouse name)?
Type
– Approach 1: nominal vs numeric
– Approach 2: nominal, ordinal, interval, ratio. Examples: ordinal (cold, warm, hot); interval (year, or temperature in Fahrenheit); ratio (distance).
What about (true, false)?
Interval versus ratio: ratio quantities have a defined zero point.
Problematic: days of week, orange-red-green.
Meta-data.

Data Preparation

In practice, preparing input for a DM investigation usually consumes the bulk of the effort. Real data is often disappointingly low in quality, and careful checking (called data cleaning or scrubbing) pays off.

Gather the data

The first step is to identify which data is relevant and pull it all together. E.g. in a marketing study, data will be needed from the Sales, Customer Billing, and Customer Service departments.
Integrating data from different sources poses challenges:
– Different styles of record-keeping, different conventions, different time periods, different degrees of data aggregation, different primary keys, and different kinds of error.
It may be necessary to go outside the organization to collect some relevant data (e.g. weather data) that is not normally collected by the organization. This is called overlay data.
Degree of aggregation.
– E.g. a milk farmer will take daily milking machine data (typically recorded twice a day) and aggregate it.
– Cell phone usage.
– By the week, month, quarter? Avg (mean), max-min.

ARFF attribute types

numeric, nominal, string, date, relation.
Strings are stored internally in a string table.
date: 2004-03-27T17:22:51
relation: allows multi-instance examples

Sparse Data

E.g. market basket data, document word count data.
Unspecified values are zero, not missing.

Normalization

May be carried out by the DM algorithm.
Fixed range (e.g. 0-1) by dividing by the maximum, or by subtracting the minimum and dividing by the range.
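The two fixed-range rescalings just mentioned can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed implementation: the function names and the sample values are hypothetical, and dividing by the maximum only yields a 0-1 range when the data is non-negative.

```python
def scale_by_max(values):
    """Divide every value by the largest value (assumes non-negative data)."""
    largest = max(values)
    if largest == 0:
        return [0.0 for _ in values]  # all zeros: avoid dividing by zero
    return [v / largest for v in values]


def scale_to_unit_range(values):
    """Subtract the minimum and divide by (max - min), giving values in [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant attribute: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]


readings = [10.0, 20.0, 40.0]  # hypothetical attribute values
by_max = scale_by_max(readings)
by_range = scale_to_unit_range(readings)
print(by_max)    # [0.25, 0.5, 1.0]
print(by_range)  # first value 0.0, last value 1.0
```

The range-based version always maps the smallest value to 0 and the largest to 1, whereas dividing by the maximum keeps zero fixed, which matters for sparse data where unspecified values are zero.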
Subtract the mean and divide by the standard deviation (called standardizing the variable).
Many DM algorithms require measuring “distance” between two attribute values.

Measuring Distance for Nominal Vars

Distance is 0 if the two values are the same, 1 if they are different.
Several synthetic binary attributes for each nominal attribute.
Sometimes a genuine mapping between nominal values and numeric scales is possible:
– zip codes: latitude and longitude
– student IDs: year digits
Reverse problem: nominal values may be expressed as integers (e.g. 62 as the code for CS).

Ordinal vs Nominal

Sometimes we have a choice between representing data as ordinal or nominal (e.g. age group in the contact lens data).

Missing Values

Frequently indicated by out-of-range entries (e.g. -1).
For nominal values, missing values may be coded as blanks or dashes.
Sometimes different kinds of missing value (unknown vs unrecorded vs irrelevant) are distinguished, e.g. by different negative values (-1, -2, -3).
The DM practitioner must consider the reasons why values are missing.
Sometimes the fact that a value is missing, or the reason why, is itself relevant (e.g. early plant death, refusal to answer a question).
Perhaps a decision was intentionally taken not to carry out a particular lab test based on the result of an earlier lab test. Here, if this fact is recorded (e.g. by recording missing as “not tested”), it may be relevant.

Inaccurate Values

It is important to check DM files for rogue attributes or values.
Data used for mining has almost certainly not been gathered expressly for that purpose. When originally collected, many fields probably didn’t matter and were left blank or unchecked.
When the same data is used for DM, the errors and omissions can have great significance.
E.g. banks may not really care about the age of a customer, so the DB may contain errors left uncorrected.
Typing errors.
The value of a nominal attribute may be misspelled, creating an additional possible value (Papsi).
Or there may be different names for the same thing (Pepsi-Cola versus Pepsi).
Numeric: typing or measurement errors generally cause outliers that can be spotted by graphing one variable at a time (Weka can help).
Duplicate data. Algorithms act differently when instances are duplicated.
People deliberately make minor changes in their address to track the source of junk mail.
Rigid computerized data entry: zip code.
Supermarket checkout clerk loyalty card.
Data may go stale.
Data cleaning (or data scrubbing) is time-consuming, labor-intensive, boring, and not easy to completely automate.
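Although cleaning cannot be fully automated, two of the checks above lend themselves to a quick script: listing rarely seen nominal values (candidate misspellings such as Papsi) and finding exact duplicate records. The Python sketch below is only an illustration. The function names, the threshold of one occurrence, and the sample records are all hypothetical, and a flagged value still needs a human decision.

```python
from collections import Counter


def rare_values(column, max_count=1):
    """Return nominal values seen at most max_count times, as candidates for typos."""
    counts = Counter(column)
    return sorted(value for value, n in counts.items() if n <= max_count)


def duplicate_rows(rows):
    """Return each row that occurs more than once, with its number of occurrences."""
    counts = Counter(tuple(row) for row in rows)
    return {row: n for row, n in counts.items() if n > 1}


# Hypothetical records: (brand, quantity). Made up for illustration only.
rows = [
    ("Pepsi", 2),
    ("Pepsi", 1),
    ("Papsi", 1),
    ("Coke", 3),
    ("Coke", 3),
    ("Pepsi", 2),
]

brands = [brand for brand, _ in rows]
suspects = rare_values(brands)
duplicates = duplicate_rows(rows)
print(suspects)    # ['Papsi']
print(duplicates)  # {('Pepsi', 2): 2, ('Coke', 3): 2}
```

Exact matching will not catch near-duplicates, such as the deliberately altered addresses mentioned above, and a rare value is not necessarily an error, so the output is best treated as a list of things to inspect.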
