KDD Summary (2) - PDF
Summary
This document summarizes the Knowledge Discovery in Databases (KDD) process: its definition, the process steps, and data and attribute types. It also covers data cleaning, data integration, and data reduction and transformation techniques for effective data analysis, with worked example calculations.
Full Transcript
Lecture 1

KDD: A Definition
KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.

Data, Information, Knowledge
Information is data stripped of redundancy and reduced to the minimum necessary to characterize the data.
Knowledge is integrated information, including facts and their relations, which have been perceived, discovered, or learned as our "mental pictures". Knowledge can be considered data at a high level of abstraction and generalization.

Benefits of Knowledge Discovery
Data → Information → Knowledge

The KDD Process
Non-trivial process → multiple steps
Valid → justified patterns/models
Novel → previously unknown
Useful → can be put to use
Understandable → by human and machine

The Knowledge Discovery Process
1- Understand the domain and define problems
2- Collect and preprocess data
3- Data mining: extract patterns/models
◦A step in the KDD process consisting of methods that produce useful patterns or models from the data, under some acceptable computational efficiency limitations
4- Interpret and evaluate the discovered knowledge
5- Put the results to practical use
KDD is inherently interactive and iterative.

Main Contributing Areas of KDD
Statistics → infer information from data (deduction & induction, mainly numeric data)
Databases → store, access, search, and update data (deduction)
Machine learning → computer algorithms that improve automatically through experience

Potential Applications
Business information, manufacturing information, scientific information, personal information

Primary Tasks of Data Mining
Classification → finding the description of several predefined classes and classifying a data item into one of them
Regression → mapping a data item to a real-valued prediction variable
Clustering → identifying a finite set of categories or clusters to describe the data
Dependency modeling → finding a model which describes significant dependencies between variables
Deviation and change detection → discovering the most significant changes in the data
Summarization → finding a compact description for a subset of data

Lecture 2
1- Data Objects & Attribute Types
2- Basic Statistical Descriptions of Data
3- Measuring Data Similarity & Dissimilarity

1- Objects and Attributes
A data object represents an entity
◦Also called a sample, example, instance, data point, or object (in a DB: a data tuple)
An attribute is a data field representing a characteristic or feature of a data object
◦Also called a dimension, feature, or variable (DB and statistics)
Attribute (feature) vector → the set of attributes that describe an object

Attribute Types
Nominal attributes: symbols or names of things, where each value represents a category, code, or state
◦Also referred to as categorical
◦Possible to represent as numbers
◦Qualitative
Binary attributes: nominal attributes with only two values representing two states or categories: 0 or 1
◦Also Boolean (true or false)
◦Qualitative
◦Symmetric: both states are equally valuable and carry the same weight
◦Asymmetric: the states are not equally important
Ordinal attributes: values have a meaningful order or ranking, but the magnitude between successive values is not known
◦Useful for data reduction of numerical attributes
◦Qualitative
Numeric attributes (quantitative):
◦Interval-scaled: measured on a scale of equal-size units; no true zero point, so values cannot be expressed as multiples of one another
◦Ratio-scaled: have a true zero point, so a value can be expressed as a multiple of another
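As a concrete illustration of the attribute types above, here is a minimal sketch of one data object's attribute (feature) vector; the record and its field names are hypothetical, chosen only to show one attribute of each type.

```python
# Hypothetical "patient" data object illustrating the attribute types above
# in a single attribute (feature) vector.
patient = {
    "hair_color": "black",      # nominal: a name/category, no order (qualitative)
    "gender": "F",              # binary, symmetric: both states equally important
    "test_positive": 0,         # binary, asymmetric: the two states are not equally important
    "pain_level": "moderate",   # ordinal: mild < moderate < severe, but gaps are unknown
    "body_temp_c": 37.2,        # numeric, interval-scaled: equal units, no true zero
    "weight_kg": 70.5,          # numeric, ratio-scaled: true zero, multiples are meaningful
}
print(patient)
```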
Discrete vs. Continuous Attributes
Discrete attribute: has a finite or countably infinite set of values, integers or otherwise
If an attribute is not discrete, it is continuous

2- Basic Statistical Descriptions of Data

Measuring Central Tendency
Median: the middle value in a set of ordered values
◦N is odd → the median is the middle value of the ordered set
◦N is even → the median is not unique → take the average of the two middlemost values
◦Expensive to compute for a large number of observations
Mode: the value that occurs most frequently among the attribute values
◦Works for both qualitative and quantitative attributes
◦Data can be unimodal, bimodal, or trimodal – or have no mode at all
Example: salaries (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
◦Mean = 58,000
◦Median = average of the two middle values (52 and 56) = 54,000
◦Mode = 52,000 and 70,000 – bimodal

Measuring Dispersion of Data
Five-number summary:
◦Median (Q2), quartiles Q1 and Q3, and the smallest and largest individual observations – in that order
Boxplots: a visualization technique for the five-number summary
◦Whiskers terminate at the min & max, OR at the most extreme observations within 1.5 × IQR of the quartiles, with the remaining points (outliers) plotted individually
Example: Suppose that a hospital tested the age and body-fat data for 18 randomly selected adults. Calculate the mean, median, and standard deviation of age and %fat; draw the boxplots for age and %fat; calculate the correlation coefficient (are these two attributes positively or negatively correlated?); and compute their covariance.
Boxplots for age and %fat:
For age: Q1 = 39, median = 51, Q3 = 57, min = 23, max = 61
◦IQR = 57 − 39 = 18, 1.5 × IQR = 27
◦Whisker limits: newMin = 39 − 27 = 12, newMax = 57 + 27 = 84
For %fat: Q1 = 26.5, median = 30.7, Q3 = 34.1, min = 7.8, max = 42.5
◦IQR = 34.1 − 26.5 = 7.6, 1.5 × IQR = 11.4
◦Whisker limits: newMin = 26.5 − 11.4 = 15.1, newMax = 34.1 + 11.4 = 45.5

Visual Representations of Data Distributions
Histograms
Scatter plots: each pair of values is treated as a pair of coordinates and plotted as a point in the plane
◦X and Y are correlated if one attribute implies the other
◦Correlation can be positive, negative, or null (uncorrelated)
◦For more attributes, use a scatter-plot matrix

Lecture 3
3- Measuring Data Similarity & Dissimilarity
Statistical descriptions are about attributes; similarity/dissimilarity is about objects.
Similarity/dissimilarity measures the proximity of objects:
◦Similarity of i and j → 0 if totally unalike; larger means more alike
◦Dissimilarity (distance) of i and j → 0 if totally alike; larger means less alike

Data Matrix & Dissimilarity Matrix
Data matrix (object-by-attribute structure)
◦Stores n data objects that have p attributes as an n-by-p matrix
◦Two-mode matrix
Dissimilarity matrix (object-by-object structure)
◦Stores a collection of proximities for all pairs of n objects as an n-by-n matrix
◦One-mode matrix

Proximity Measures for Nominal Attributes
Proximity Measures for Binary Attributes
Dissimilarity of Numeric Data
Normalize → give all attributes an equal weight
◦If a "patient" object is described by height (in meters) and weight (in grams), what data ranges can we have for the two attributes? Which will be more dominant? (See the sketch below.)
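To make the height/weight question concrete, here is a minimal sketch with hypothetical patient values, showing how the gram-scale weight attribute dominates the Euclidean (Minkowski, p = 2) distance until both attributes are min-max normalized to the same [0, 1] range.

```python
import numpy as np

# Hypothetical patients: [height in meters, weight in grams]
patients = np.array([
    [1.60, 60000.0],
    [1.85, 61000.0],
    [1.62, 95000.0],
])

def euclidean(a, b):
    """Minkowski distance with p = 2."""
    return np.sqrt(np.sum((a - b) ** 2))

# On the raw data the weight attribute (range ~35,000 g) dominates completely:
print(euclidean(patients[0], patients[1]))  # ~1000.0 (the 0.25 m height gap is invisible)
print(euclidean(patients[0], patients[2]))  # ~35000.0

# Min-max normalization rescales every attribute to [0, 1], giving equal weight
mins, maxs = patients.min(axis=0), patients.max(axis=0)
normalized = (patients - mins) / (maxs - mins)

print(euclidean(normalized[0], normalized[1]))  # ~1.00, now driven by the height gap
print(euclidean(normalized[0], normalized[2]))  # ~1.00, driven by the weight gap
```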
Minkowski Distance

Summary
Attributes describe objects.
Attributes can be nominal, binary, ordinal, or numeric.
To summarize attribute observations, use the mean, standard deviation, median, and mode for numeric attributes, and only the median and mode for the rest.
To measure object similarity/dissimilarity, use:
◦the matching percentage for objects with nominal attributes
◦symmetric/asymmetric binary dissimilarity or the Jaccard coefficient for objects with binary attributes
◦the Minkowski distance for objects with numeric attributes

Lecture 4
Why Preprocess Data?
To satisfy the requirements of the intended use.
Factors of data quality:
◦Accuracy → lacking due to faulty instruments or errors
◦Completeness → lacking due to different design phases or optional attributes
◦Consistency → lacking due to semantics, data types, or field formats
◦Timeliness
◦Believability → how much the data are trusted by users
◦Interpretability → how easily the data are understood

Major Preprocessing Tasks
Data cleaning → fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration → include data from multiple sources in the analysis, map semantic concepts, infer attributes
Data reduction → obtain a reduced representation of the data set that is much smaller in volume while producing almost the same analytical results
Discretization → raw data values for attributes are replaced by ranges or higher conceptual levels
Data transformation → normalization

Data Cleaning
Fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
Data in the real world is dirty!
◦Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
◦Noisy: containing noise, errors, or outliers
◦Inconsistent: containing discrepancies in codes or names
◦Intentional → Jan. 1 as everyone's birthday?

Missing Values
Ignore the tuple → not very effective unless the tuple contains several attributes with missing values
Fill in the missing value manually → time consuming, not feasible for large data sets
Use a global constant → replace all missing attribute values with the same value
Use the attribute mean or median → the mean for normal data distributions, the median for skewed distributions
Use the mean or median of all samples belonging to the same class as the given tuple
Use the most probable value → determined by regression or by inference-based tools such as decision trees

Noisy Data
Noise is a random error or variance in a measured variable.
Data smoothing techniques:
1. Binning → smooth a sorted data value by consulting its "neighborhood"; the sorted values are partitioned into a number of "buckets", or bins → local smoothing
◦Equal-frequency bins → each bin has the same number of values
◦Equal-width bins → the interval range of values per bin is constant
◦Smoothing by bin means → each bin value is replaced by the bin mean
◦Smoothing by bin medians → each bin value is replaced by the bin median
◦Smoothing by bin boundaries → each bin value is replaced by the closest boundary value (the min & max of a bin are its boundaries)
◦Example (worked in the sketch below): sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
2. Regression → conform data values to a function
◦Linear regression → find the "best" line to fit two attributes so that one attribute can be used to predict the other
3. Outlier Analysis
Potter's Wheel → an automated, interactive data cleaning tool
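A minimal sketch of the binning example above: the nine sorted prices are split into three equal-frequency bins and then smoothed by bin means and by bin boundaries.

```python
# Smoothing-by-binning sketch: three equal-frequency bins over the sorted
# price data, then smoothing by bin means and by bin boundaries
# (each value is replaced by the nearer of the bin's min and max).

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted

n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]
print(bins)              # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]

# Smoothing by bin means: every value in a bin becomes the bin's mean
by_means = [[sum(b) / len(b)] * len(b) for b in bins]
print(by_means)          # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]

# Smoothing by bin boundaries: every value becomes the closer of min/max
by_boundaries = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
    for b in bins
]
print(by_boundaries)     # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```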
Data Integration
Merging data from multiple data stores; helps reduce and avoid redundancies and inconsistencies in the resulting data set.
Challenges:
◦Semantic heterogeneity → the entity identification problem
◦Structure of data → functional dependencies and referential constraints
◦Redundancy

Entity Identification Problem
Schema integration and object matching
Metadata → the name, meaning, data type, and range of permitted values, plus null rules for handling blank, zero, or null values → can help avoid errors in schema integration and data transformation

Redundancy and Correlation Analysis
A correlation matrix is simply a table that displays the correlation coefficients for different variables.

Data Reduction Strategies
Dimensionality reduction → reduce the number of attributes
◦Wavelet transforms, PCA, attribute subset selection
Numerosity reduction → replace the original data by a smaller data representation
◦Parametric → a model is used to estimate the data; only the model parameters are stored
◦Nonparametric → store reduced representations of the data
Compression → transformations applied to obtain a "compressed" representation of the original data

Attribute Subset Selection
Find a minimum set of attributes such that the resulting probability distribution of the data is as close as possible to the original distribution obtained using all attributes.
Attribute construction → e.g., an area attribute built from height and width attributes

Regression
The data is modeled to fit a straight line.
Regression line equation → y = wx + b, where w and b are the regression coefficients
Solved by the method of least squares → minimize the error between the actual data and the estimate of the line

Sampling
A large data set is represented by a smaller random data sample.
Simple random sample without replacement (SRSWOR) of size s → draw s of the N tuples (s < N)
Simple random sample with replacement (SRSWR) of size s → similar to SRSWOR, but each time a tuple is drawn it is recorded and then placed back, so it may be drawn again
Cluster sample → if the tuples are grouped into M "clusters", an SRS of s clusters can be obtained
Stratified sample → if the tuples are divided into strata, a stratified sample is generated by obtaining an SRS at each stratum
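A minimal sketch contrasting the sampling strategies above; the data set and its field layout are hypothetical, and the stratified split uses a proportional-allocation rule assumed here for illustration.

```python
import random

random.seed(42)  # for reproducible output in this sketch

# Hypothetical data set: (customer_id, age_group) tuples
data = [(i, "youth" if i % 3 == 0 else "adult") for i in range(1, 101)]
s = 10  # desired sample size

# SRSWOR: draw s tuples, each appearing at most once
srswor = random.sample(data, s)

# SRSWR: each drawn tuple is recorded and placed back, so it may repeat
srswr = [random.choice(data) for _ in range(s)]

# Stratified sample: split into strata, then take an SRS within each stratum,
# sized proportionally to the stratum's share of the data
strata = {}
for tup in data:
    strata.setdefault(tup[1], []).append(tup)
stratified = [
    t for group in strata.values()
    for t in random.sample(group, max(1, round(s * len(group) / len(data))))
]

print(len(srswor), len(srswr), len(stratified))
```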