Data Mining: Data & Data Types
Summary
This document provides an introduction to data mining concepts, focusing on different types of data (categorical and numerical) and attributes. Various statistical descriptions and visualization techniques for analyzing data characteristics are also covered. Python examples illustrate the application of essential plotting methods.
Full Transcript
Getting to Know Your Data
– Data Objects and Attribute Types
– Basic Statistical Descriptions of Data
– Measuring Data Similarity and Dissimilarity

Data-Related Issues for Successful Data Mining
Type of Data:
– Data sets differ in a number of ways.
– The type of data determines which techniques can be used to analyze the data.
Quality of Data:
– Data is often far from perfect.
– Improving data quality improves the quality of the resulting analysis.
Preprocessing Steps to Make Data More Suitable for Data Mining:
– Raw data must be processed to make it suitable for analysis: improve data quality, and modify the data so that it better fits a specified data mining technique.
Analyzing Data in Terms of its Relationships:
– Find relationships among data objects and then perform the remaining analysis using these relationships rather than the data objects themselves.
– There are many similarity or distance measures, and the proper choice depends on the type of data and the application.

What is Data?
Data sets are made up of data objects. A data object represents an entity.
– Also called sample, example, instance, data point, object, tuple, or record.
Data objects are described by attributes. An attribute is a property or characteristic of a data object.
– Examples: eye color of a person, temperature, etc.
– An attribute is also known as a variable, field, characteristic, or feature.
A collection of attributes describes an object.
Attribute values are numbers or symbols assigned to an attribute.

A Data Object
– Database rows correspond to data objects.
– Database columns correspond to attributes.

Attributes
Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object.
– E.g., customer_ID, name, address
Attribute values are numbers or symbols assigned to an attribute.
Distinction between attributes and attribute values:
– The same attribute can be mapped to different attribute values. Example: height can be measured in feet or meters.
– Different attributes can be mapped to the same set of values. Example: attribute values for ID and age are integers. However, the properties of the attribute values can differ: ID has no limit, but age has a maximum and minimum value.

Attribute Types
Four main types of attributes:
Nominal: Categorical (Qualitative)
– Categories, states, or “names of things”: hair color, marital status, occupation, ID numbers, zip codes
– An important nominal attribute is the binary attribute: a nominal attribute with only 2 states (0 and 1)
Ordinal: Categorical (Qualitative)
– Values have a meaningful order (ranking), but the magnitude between successive values is not known.
– Examples: size = {small, medium, large}, grades, army rankings
Interval: Numeric (Quantitative)
– Measured on a scale of equal-sized units
– Values have order: temperature in °C or °F, calendar dates
– No true zero-point: ratios are not meaningful
Ratio: Numeric (Quantitative)
– Inherent zero-point: ratios are meaningful
– Examples: temperature in Kelvin, length, counts, monetary quantities, age

Attribute Types: Nominal Attributes
The values of a nominal attribute are symbols or names of things.
– Each value represents some kind of category, code, or state.
Nominal attributes are also referred to as categorical attributes.
The values of nominal attributes do not have any meaningful order.
Example: The attribute marital_status can take on the values single, married, divorced, and widowed.
Nominal attribute values have no meaningful order and are not quantitative, so:
– It makes no sense to find the mean (average) value or median (middle) value for such an attribute.
– However, we can find the attribute’s most commonly occurring value (mode).

Attribute Types: Nominal Attributes
A binary attribute is a special nominal attribute with only two states: 0 or 1.
A binary attribute is symmetric if both of its states are equally valuable and carry the same weight.
– Example: the attribute gender having the states male and female.
A binary attribute is asymmetric if the outcomes of the states are not equally important.
– Example: positive and negative outcomes of a medical test for HIV.
– By convention, we code the most important outcome, which is usually the rarest one, by 1 (e.g., HIV positive) and the other by 0 (e.g., HIV negative).

Attribute Types: Ordinal Attributes
An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known.
Example: An ordinal attribute drink_size corresponds to the size of drinks available at a fast-food restaurant.
– This attribute has three possible values: small, medium, and large.
– The values have a meaningful sequence (which corresponds to increasing drink size); however, we cannot tell from the values how much bigger, say, a large is than a medium.
The central tendency of an ordinal attribute can be represented by its mode and its median (the middle value in an ordered sequence), but the mean cannot be defined.

Attribute Types: Interval Attributes
Interval attributes are measured on a scale of equal-size units.
– We can compare and quantify the difference between values of interval attributes.
Example: A temperature attribute is an interval attribute.
– We can quantify the difference between values. For example, a temperature of 20°C is five degrees higher than a temperature of 15°C.
– Temperatures in Celsius do not have a true zero-point, that is, 0°C does not indicate “no temperature.”
– Although we can compute the difference between temperature values, we cannot talk of one temperature value as being a multiple of another. Without a true zero, we cannot say, for instance, that 10°C is twice as warm as 5°C. That is, we cannot speak of the values in terms of ratios.
The central tendency of an interval attribute can be represented by its mode, its median (middle value in an ordered sequence), and its mean.

Attribute Types: Ratio Attributes
A ratio attribute is a numeric attribute with an inherent zero-point.
Example: A number_of_words attribute is a ratio attribute.
– If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value.
The central tendency of a ratio attribute can be represented by its mode, its median (middle value in an ordered sequence), and its mean.

Properties of Attribute Values
The type of an attribute depends on which of the following properties it possesses:
– Distinctness: =, ≠
– Order: <, >
– Addition: +, -
– Multiplication: *, /
Nominal attribute: distinctness
Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & addition
Ratio attribute: all 4 properties

Properties of Attribute Values
Attribute Type | Description | Examples
Nominal | The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female}
Ordinal | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers
Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -) | calendar dates, temperature in Celsius or Fahrenheit
Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length

Attribute Types: Categorical (Qualitative) and Numeric (Quantitative)
Nominal and ordinal attributes are collectively referred to as categorical or qualitative attributes.
– Qualitative attributes, such as employee ID, lack most of the properties of numbers.
– Even if they are represented by numbers, i.e., integers, they should be treated more like symbols.
– The mean of their values does not have any meaning.
Interval and ratio attributes are collectively referred to as quantitative or numeric attributes.
– Quantitative attributes are represented by numbers and have most of the properties of numbers.
– Note that quantitative attributes can be integer-valued or continuous.
– Numeric operations such as mean and standard deviation are meaningful.

Discrete vs. Continuous Attributes
Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, profession, or the set of words in a collection of documents
– Sometimes represented as integer variables
– Note: Binary attributes are a special case of discrete attributes
– Binary attributes where only non-zero values are important are called asymmetric binary attributes.
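As a small illustration of why integer-coded nominal values should be treated as symbols, the following Python sketch compares the mode and the mean of a discrete attribute. The zip codes are made-up values chosen for this example, not data from the slides.

```python
from collections import Counter
from statistics import mean

# Hypothetical zip codes stored as integers (a discrete, nominal attribute).
zip_codes = [10001, 94105, 10001, 60601, 94105, 10001]

# The mode is meaningful for a nominal attribute: it is the most frequent category.
most_common_zip, count = Counter(zip_codes).most_common(1)[0]

# The arithmetic mean can be computed because the values happen to be integers,
# but the result is not a category that occurs in the data.
numeric_mean = mean(zip_codes)

print("mode:", most_common_zip, "occurs", count, "times")
print("mean:", numeric_mean, "is an observed zip code:", numeric_mean in zip_codes)
```

The mean is computable only because of how the values are stored; it does not describe any category, which is the sense in which nominal values behave like symbols.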
Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, weight
– Practically, real values can only be measured and represented using a finite number of digits
– Continuous attributes are typically represented as floating-point variables

Types of Data Sets
Record
– Relational records
– Data matrix, e.g., numerical matrix, crosstabs
– Document data: text documents represented as term-frequency vectors
– Transaction data
Graph and network
– World Wide Web
– Social or information networks
– Molecular structures
Ordered
– Video data: sequence of images
– Temporal data: time-series
– Sequential data: transaction sequences
– Genetic sequence data
– Spatial data: maps
– Image data

Record Data
Data that consists of a collection of records, each of which consists of a fixed set of attributes.

Data Matrix
If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute.
Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute.
A data matrix is a variation of record data, but because it consists of numeric attributes, standard matrix operations can be applied to transform and manipulate the data.

Projection of x load | Projection of y load | Distance | Load | Thickness
10.23 | 5.27 | 15.22 | 2.7 | 1.2
12.65 | 6.25 | 16.22 | 2.2 | 1.1

Document (Text) Data
Each document becomes a term vector:
– each term is a component (attribute) of the vector,
– the value of each component is the number of times the corresponding term occurs in the document.
Convert text documents to record data by counting word frequencies (document-term matrix).

Transaction Data
Transaction data is a special type of record data, where
– each record (transaction) involves a set of items.
– Example: The set of products purchased by a customer constitutes a transaction, while the individual products that were purchased are the items.

TID | Items
1 | Bread, Milk
2 | Bread, Diaper, Tea, Eggs
3 | Milk, Diaper, Tea, Coke
4 | Bread, Milk, Diaper, Tea
5 | Bread, Milk, Diaper, Coke

Graph Data
[Figure: different variations of graph data]

Ordered Data
[Figure: different variations of ordered data]

Basic Statistical Descriptions of Data
Basic statistical descriptions can be used to identify properties of the data and highlight which data values should be treated as noise or outliers.
For data preprocessing tasks, we want to learn about data characteristics regarding both the central tendency and the dispersion of the data.
– Measures of central tendency include mean, median, mode, and midrange.
– Measures of data dispersion include quartiles, interquartile range (IQR), and variance.
These descriptive statistics are of great help in understanding the distribution of the data.

Measuring Central Tendency: Mean
The most common and most effective numerical measure of the “center” of a set of data is the arithmetic mean.
Arithmetic Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Sometimes, each value $x_i$ in a set may be associated with a weight $w_i$.
– The weights reflect the significance and importance attached to their respective values.
Weighted Arithmetic Mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

Measuring Central Tendency: Mean
Although the mean is the single most useful quantity for describing a data set, it is not always the best way of measuring the center of the data.
– A major problem with the mean is its sensitivity to extreme (outlier) values.
– Even a small number of extreme values can corrupt the mean.
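The arithmetic mean, the weighted arithmetic mean, and the mean's sensitivity to outliers can be sketched in a few lines of Python. The function names and the sample values below are assumptions made for this illustration; they do not come from the slides.

```python
def arithmetic_mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Hypothetical salaries (in thousands); a single extreme value is added afterwards.
salaries = [30, 36, 47, 50, 52]
salaries_with_outlier = salaries + [500]

mean_without = arithmetic_mean(salaries)
mean_with = arithmetic_mean(salaries_with_outlier)

# Hypothetical exam scores where the second score counts twice as much.
weighted = weighted_mean([80, 90, 70], [1, 2, 1])

print("mean without outlier:", mean_without)
print("mean with outlier:", round(mean_with, 2))
print("weighted mean:", weighted)
```

In this made-up sample, one extreme value moves the mean from 43 to roughly 119, well above every value except the outlier itself.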
To offset the effect caused by a small number of extreme values, we can instead use the trimmed mean, which is the mean obtained after chopping off values at the high and low extremes.

Measuring Central Tendency: Median
Another measure of the center of data is the median.
Suppose that a given data set of N distinct values is sorted in numerical order.
– If N is odd, the median is the middle value of the ordered set;
– If N is even, the median is the average of the middle two values.
In probability and statistics, the median generally applies to numeric data; however, we may extend the concept to ordinal data.
– Suppose that a given data set of N values for an attribute X is sorted in increasing order.
– If N is odd, then the median is the middle value of the ordered set.
– If N is even, then the median may not be unique. In this case, the median is the two middlemost values and any value in between.

Measuring Central Tendency: Mode
Another measure of central tendency is the mode.
The mode for a set of data is the value that occurs most frequently in the set.
– It is possible for the greatest frequency to correspond to several different values, which results in more than one mode.
– Data sets with one, two, or three modes are called unimodal, bimodal, and trimodal, respectively.
– At the other extreme, if each data value occurs only once, then there is no mode.
Central tendency measures for numerical attributes: mean, median, mode
Central tendency measures for categorical attributes: mode (and median, for ordinal attributes only)
– Nominal attributes: mode
– Ordinal attributes: mode, median

Measuring Central Tendency: Mean, Median, Mode
[Figure: median, mean, and mode of symmetric, positively skewed, and negatively skewed data]

Measuring Central Tendency: Example
What are the central tendency measures (mean, median, mode) for the following attributes?
attr1 = {2,4,4,6,8,24}
attr2 = {2,4,7,10,12}
attr3 = {xs,s,s,s,m,m,l}

Measuring Central Tendency: Example
What are the central tendency measures (mean, median, mode) for the following attributes?
attr1 = {2,4,4,6,8,24}
– mean = (2+4+4+6+8+24)/6 = 8 (average of all values)
– median = (4+6)/2 = 5 (average of the two middle values)
– mode = 4 (most frequent item)
attr2 = {2,4,7,10,12}
– mean = (2+4+7+10+12)/5 = 7 (average of all values)
– median = 7 (middle value)
– mode: none (every value occurs exactly once)
attr3 = {xs,s,s,s,m,m,l}
– mean is meaningless for categorical attributes
– median = s (middle value)
– mode = s (most frequent item)

Measuring Dispersion of Data
The degree to which numerical data tend to spread is called the dispersion, or variance, of the data.
The most common measures of data dispersion:
Range: difference between the largest and smallest values.
Interquartile Range (IQR): range of the middle 50% of the data
– quartiles: Q1 (25th percentile), Q3 (75th percentile); IQR = Q3 - Q1
– five-number summary: Minimum, Q1, Median, Q3, Maximum
Variance and Standard Deviation (sample: s, population: σ)
– variance of N observations: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2$, where $\bar{x}$ is the mean value of the observations (the sample variance $s^2$ divides by $N - 1$ instead of $N$)
– standard deviation σ (s) is the square root of the variance σ² (s²)

Measuring Dispersion of Data: Quartiles
Suppose that the set of observations for numeric attribute X is sorted in increasing order.
Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets.
– The kth q-quantile for a given data distribution is the value x such that at most k/q of the data values are less than x and at most (q-k)/q of the data values are more than x, where k is an integer such that 0