data mining unit 1.pdf

DATA MINING: AN INTRODUCTION Dr. Ranjana Kale 1 THE EVOLUTION OF DATABASE TECHNOLOGY Data mining can viewed as a result of the natural evolution of data base technology (Fig. 1.1). The figure shows 5 stages of functionalities: - data collection and database creation - database management systems - advanced databases systems - web-based databases systems - data warehousing and data mining 2 3 EVOLUTION OF DATABASE TECHNOLOGY 1960s: Data collection, database creation, IMS and network DBMS 1970s: Relational data model, relational DBMS implementation 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.) 1990s—2000s: Data mining and data warehousing, multimedia databases, and Web databases THE EVOLUTION OF DATABASE TECHNOLOGY..CONT Databases systems provide data storage and retrieval, and transaction processing. Data warehousing and data mining provide data analysis and understanding. Data ware house is a database architecture that store many different types of databases, a repository of multiple heterogeneous data sources. They are organized under a unified schema at a single site in order to facilitate management decision making. 5 THE EVOLUTION OF DATABASE TECHNOLOGY..CONT Data warehouse technology includes: - data cleansing - data integration, and - On-Line Analytical Processing (OLAP) OLAP is the analysis technique for performing summarization, consolidation, and aggregation, as well as ability to view information from different angles. Although OLAP tools support data analysis but not in-depth- analysis such as data classification, clustering, and the characterization of data changes over time 6 WHY DATA MINING? The Explosive Growth of Data: from terabytes to petabytes Data collection and data availability Automated data collection tools, database systems, Web, computerized society Major sources of abundant data Business: Web, e-commerce, transactions, stocks, … Science: Remote sensing, bioinformatics, scientific simulation, … Society and everyone: news, digital cameras, YouTube We are drowning in data, but starving for knowledge! “Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 7 WHY DO WE NEED TO MINE THE DATA? Too much data and too little information There is a need to extract useful information from the data and to interpret the data 8 WHY DATA MINING?—POTENTIAL APPLICATIONS Data analysis and decision support Market analysis and management Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation Risk analysis and management Forecasting, customer retention, improved underwriting, quality control, competitive analysis Fraud detection and detection of unusual patterns (outliers) Other Applications Text mining (news group, email, documents) and Web mining Stream data mining Bioinformatics and bio-data analysis 9 DEFINITION OF DATA MINING The nontrivial extraction of implicit, previously unknown, and potentially useful information 10 WHAT IS DATA MINING? Data mining (knowledge discovery in databases): Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases Alternative names and their “inside stories”: Data mining: a misnomer? Knowledge discovery(mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. What is not data mining? (Deductive) query processing.  Expert systems or small machine learning/ statistical programs Process of semi-automatically analyzing large databases to find patterns that are: valid: hold on new data with some certainity novel: non-obvious to the system useful: should be possible to act on the item understandable: humans should be able to interpret the pattern also known as knowledge discovery in databases (KDD) 12 DATA MINING : 1-STEP OF KDD Knowledge Evaluation & KDD = Knowledge Discovery in Databases Presentation Patterns Data Mining Selection and Transformation Data Cleaning and Warehouse Integration Databases Flat files 13 STEPS OF A KDD PROCESS Learning the application domain: relevant prior knowledge and goals of application Creating a target data set: data selection Data cleaning and preprocessing: (may take 60% of effort!) Data reduction and transformation: Find useful features, dimensionality/variable reduction, invariant representation. Choosing functions of data mining  summarization, classification, regression, association, clustering. Choosing the mining algorithm(s) Data mining: search for patterns of interest Pattern evaluation and knowledge presentation visualization, transformation, removing redundant patterns, etc. Use of discovered knowledge DATA BASE SYSTEMS Relational Data warehouse Transactional DB Advanced DB system Kinds of DB Flat files WWW Classification Association Kinds of Knowledge Clustering Prediction … … 15 ARCHITECTURE OF A TYPICAL DATA MINING SYSTEM Graphical user interface Pattern evaluation Data mining engine Knowledge-base Database or data warehouse server Data cleaning & data integration Filtering Data Databases Warehouse KNOWLEDGE BASE is a domain knowledge that is used to guide search To evaluate the interestingness of resulting patterns Includes concept hierarchies- used to organize attributes into different levels of abstraction 17 DATA MINING ENGINE Consists of a set of functional modules Characterization, association, classification, cluster analysis, evolution and deviation analysis PATTERN EVALUATION MODULE Interacts with data mining modules to focus the search towards interesting patterns Use thresholds to filter out discovered patterns 18 GRAPHICAL USER INTERFACE Communicates between users and data mining system Allows users to interact with the system(query/task) Provides info helpful for searching Performs exploratory data mining Allows user to browse database and dw schemas or ds, evaluate mined patterns And visualize patterns in different form 19 DATA MINING TECHNIQUES: CONFLUENCE OF MULTIPLE DISCIPLINES Database Technology Statistics Machine Visualization Learning Data Mining Pattern Recognition Other Algorithm Disciplines 20 WHY CONFLUENCE OF MULTIPLE DISCIPLINES? Tremendous amount of data Algorithms must be highly scalable to handle such as tera-bytes of data High-dimensionality of data Micro-array may have tens of thousands of dimensions High complexity of data Data streams and sensor data Time-series data, temporal data, sequence data Structure data, graphs, social networks and multi-linked data Heterogeneous databases and legacy databases Spatial, spatiotemporal, multimedia, text and Web data Software programs, scientific simulations New and sophisticated applications August 6, 2024 DATA MINING: CONCEPTS AND TECHNIQUES 21 DATA MINING – TYPES OF DATA Mining can be performed on data in a variety of forms Relational Database Traditional DMBS everyone is familiar with Data is stored in a series of tables (Collection of tables) Data is extracted via queries, typically with SQL SQL: “Show me a list of items that were sold in the last quarter” “show me the total sales of the last month, grouped by branch” “How many transactions occurred in the month of December?” “which sales person had the highest amount of sales” Relational language: aggregate function such as sum, avg, count, max, min 22 DATA MINING – TYPES OF DATA Apply data mining – go further Searching for trends or data patterns Analyzed customer data to predict credit risk of new customers based on their income Detect deviation – items whose sales are far from those expected in comparison with the previous year (further investigated: change in packaging, increase in price?) Transaction Database Similar to relational database (transactions stored in a table) Each row (record) is a transaction with id & list of items in transaction Nested relation ER model Can be unfolded into a relational database or stored in flat files since nested relational structures did not supported by relational db system Which items sold well together? 23 DATA MINING – TYPES OF DATA Data Warehouse Stores historical data, potentially from multiple sources Organized around major subjects Contains summary statistics Multidimensional data Object / Object-Relational Databases Database consisting of objects Object = set of variables + associated methods Eg: Intel uses regularity extraction in automatic circuit layout Images Can mine features extracted from images, OR Can use mining techniques to extract features Content based image retrieval 24 DATA MINING – TYPES OF DATA Spatial Databases Include spatial related information Include geographic(map) databases Include GIS and CAD data Represented in Raster format – n-dimensional bit maps /pixel maps, each pixel registers the rainfall in given area Represented in Vector format – point, line, polygon- roads, bridges, lakes buildings are represented as union of basic geometric constructs Can find spatial patterns between features Describing the characteristics of houses located near a specified kind of location e.g. park Describe the climate of mountainous areas located at various altitudes Vehicle navigation- Current location of a driver 25 DATA MINING – TYPES OF DATA Text Databases and multimedia databases Contains word description for objects Long sentences, paragraphs, error or bug reports, warning messages, summary reports, notes, other docs Can be structured(library databases), or highly unstructured(web pages), semi-structured(email messages, html/xml web pages) Documentation, newspaper articles, web sites etc. Can facilitate search by linking related documents / concepts Multimedia-store image, audio, video Used in picture content-based retrieval, voice-mail systems, www, Speech recognition – recognized spoken command Security applications Integrated with standard data mining methods (storage and searching) 26 DATA MINING – TYPES OF DATA Temporal Databases / Time Series Global change databases (temperature records) Space shuttle telemetry Stock market data (stock exchange) Usually stores relational data that include time-related attributes Find the trend of changes for objects – decision making/strategy planning Stock exchange data can be mined to uncover trends that could help in planning investment strategies (when is the best time to purchase TNB stock?) 27 Heterogeneous Databases & Legacy Databases Group of heterogeneous databases (relational, OO db, network db, multimedia db etc.) Connected by intra- or inter-computer networks Information exchange is very difficult – student academic performance among different schools/universities Data mining – transforming the given data into higher, more generalized, conceptual levels World wide web Understanding user access patterns to improve system design, better marketing decisions Keyword based search 28 DATA MINING TECHNIQUES AIA2 17 DATA MINING TECHNIQUES- ASSOCIATION RULE Transaction Item t1 milk, chip, bread, salsa, coke t2 banana, chip, rice, salsa t3 salsa, coke, banana, chip t4 milk, lettuce, coke, rice, salsa, bread t5 lettuce, salsa, bread, coke, chip, milk Objective: Identify items that occur together  Support of {salsa, chip} is 80%,  Support of {bread, milk} is 60% Data is useful for shelving, merchandizing, and pricing. 30 ASSOCIATION RULES Which feature values are commonly associated with each other in individual records? Knowledge of the form IF (feature1 = value1) THEN (feature2 = value2) Sample applications  market basket analysis  recommender systems  Web browsing logs  microarray analysis? ASSOCIATION AND CORRELATION ANALYSIS (CHAPTER 5) Frequent patterns (or frequent itemsets)  What items are frequently purchased together in your Walmart? Association, correlation vs. causality  A typical association rule  Diaper → Beer [0.5%, 75%] (support, confidence)  Are strongly associated items also strongly correlated? How to mine such patterns and rules efficiently in large datasets? How to use such patterns for classification, clustering, and other applications? 32 CLASSIFICATION AND PREDICTION Classification and prediction Construct models (functions) based on some training examples Describe and distinguish classes or concepts for future prediction E.g., classify countries based on (climate), or classify cars based on (gas mileage) Predict some unknown or missing numerical values Typical methods Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, … Typical applications: Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages, … 33 CLASSIFICATION Also called supervised learning A prediction problem, similar to regression Input: an object described by features (a.k.a. variables, covariates) Output: the target class that the object belongs to Mathematically, assume there is some function f(x) = y producing the data. Given many pairs (x,y), find f.  Prediction Data Prediction is a two-step process, similar to that of data classification. we do not utilize the phrasing of “Class label attribute” because the attribute for which values are being predicted is consistently valued(ordered) instead of categorical. The attribute can be referred to as the predicted attribute. Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled object, or to assess the value or value ranges of an attribute that a given object is likely to have CLUSTERING -- MARKET SEGMENTATION AS AN EXAMPLE Each point represent the characters of a customer Objective: grouping members that have similar characteristics together Widely applied on fraud detection, business and finance, science 35 CLUSTER AND OUTLIER ANALYSIS Cluster analysis Unsupervised learning (i.e., Class label is unknown) Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns Principle: Maximizing intra-class similarity & minimizing interclass similarity Many methods and applications Outlier analysis Outlier: A data object that does not comply(fulfill) with the general behavior of the data Noise or exception? ― One person’s garbage could be another person’s treasure Methods: by product of clustering or regression analysis, … Useful in fraud detection, rare events analysis 36 CLUSTERING Also called unsupervised learning Make groups of data based on their similarities or distances  How to define similarity or distance? Need a domain expert interpret results METHOD 1: AGGLOMERATIVE CLUSTERING Start with each point in its own cluster Combine two closest clusters into one Repeat Variation: Divisive clustering METHOD 2: K-MEANS CLUSTERING Objective: minimize squared distance from all points to their assigned center (prototype) point Choose number of clusters K Initialize cluster centers Repeat  assign each point to a center  move centers to centroid of assigned points...until no changes K-MEANS EXAMPLE Display vs Brand Loyalty 2 1.5 1 Brand Loyalty 0.5 0 -1 0 1 2 3 4 5 -0.5 -1 -1.5 -2 Display OVERFITTING Fitting the model exactly to the data is usually not a good idea. The resulting model may not generalize well to unseen data. weight weight height height INTEGRATION OF DATA MINING AND DATA WAREHOUSING Data mining systems, DBMS, Data warehouse systems coupling No coupling, loose-coupling, semi-tight-coupling, tight-coupling On-line analytical mining data integration of mining and OLAP technologies Interactive mining multi-level knowledge Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc. Integration of multiple mining functions  Characterized classification, first clustering and then association 42 COUPLING DATA MINING WITH DB/DW SYSTEMS No coupling—flat file processing, not recommended Loose coupling Fetching data from DB/DW Semi-tight coupling—enhanced DM performance Provide efficient implement a few data mining primitives in a DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some stat functions Tight coupling—A uniform information processing environment DM is smoothly integrated into a DB/DW system, mining query is optimized based on mining query, indexing, query processing methods, etc. 43 MAJOR ISSUES IN DATA MINING (2) Issues relating to the diversity of data types Handling relational and complex types of data Mining information from heterogeneous databases and global information systems (WWW) Issues related to applications and social impacts Application of discovered knowledge Domain-specific data mining tools Intelligent query answering Process control and decision making Integration of the discovered knowledge with existing knowledge: A knowledge fusion problem Protection of data security, integrity, and privacy MAJOR ISSUES IN DATA MINING Mining methodology Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web Performance: efficiency, effectiveness, and scalability Pattern evaluation: the interestingness problem Incorporation of background knowledge Handling noise and incomplete data Parallel, distributed and incremental mining methods Integration of the discovered knowledge with existing one: knowledge fusion User interaction Data mining query languages and ad-hoc mining Expression and visualization of data mining results Interactive mining of knowledge at multiple levels of abstraction Applications and social impacts Domain-specific data mining & invisible data mining Protection of data security, integrity, and privacy 45 ARE ALL THE “DISCOVERED” PATTERNS INTERESTING? Data mining may generate thousands of patterns: Not all of them are interesting Suggested approach: Human-centered, query-based, focused mining Interestingness measures A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm Objective vs. subjective interestingness measures Objective: based on statistics and structures of patterns, e.g., support, confidence, etc. Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc. 46 DATA MINING APPLICATIONS Data mining is a young discipline with wide and diverse applications  There is still a nontrivial gap between general principles of data mining and domain-specific, effective data mining tools for particular applications Some application domains Biomedical and DNA data analysis Financial data analysis Retail industry Telecommunication industry BIOMEDICAL DATA MINING AND DNA ANALYSIS DNA sequences: 4 basic building blocks (nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T). Gene: a sequence of hundreds of individual nucleotides arranged in a particular order Humans have around 100,000 genes Tremendous number of ways that the nucleotides can be ordered and sequenced to form distinct genes Semantic integration of heterogeneous, distributed genome databases  Current: highly distributed, uncontrolled generation and use of a wide variety of DNA data  Data cleaning and data integration methods developed in data mining will help DNA ANALYSIS: EXAMPLES Similarity search and comparison among DNA sequences  Compare the frequently occurring patterns of each class (e.g., diseased and healthy)  Identify gene sequence patterns that play roles in various diseases Association analysis: identification of co-occurring gene sequences  Most diseases are not triggered by a single gene but by a combination of genes acting together  Association analysis may help determine the kinds of genes that are likely to co-occur together in target samples Path analysis: linking genes to different disease development stages  Different genes may become active at different stages of the disease  Develop pharmaceutical interventions that target the different stages separately Visualization tools and genetic data analysis DATA MINING FOR FINANCIAL DATA ANALYSIS Financial data collected in banks and financial institutions are often relatively complete, reliable, and of high quality Design and construction of data warehouses for multidimensional data analysis and data mining  View the debt and revenue changes by month, by region, by sector, and by other factors  Access statistical information such as max, min, total, average, trend, etc. Loan payment prediction/consumer credit policy analysis  feature selection and attribute relevance ranking  Loan payment performance  Consumer credit rating FINANCIAL DATA MINING Classification and clustering of customers for targeted marketing  multidimensional segmentation by nearest-neighbor, classification, decision trees, etc. to identify customer groups or associate a new customer to an appropriate customer group Detection of money laundering and other financial crimes  integration of from multiple DBs (e.g., bank transactions, federal/state crime history DBs)  Tools: data visualization, linkage analysis, classification, clustering tools, outlier analysis, and sequential pattern analysis tools (find unusual access sequences) DATA MINING FOR RETAIL INDUSTRY Retail industry: huge amounts of data on sales, customer shopping history, etc. Applications of retail data mining  Identify customer buying behaviors  Discover customer shopping patterns and trends  Improve the quality of customer service  Achieve better customer retention and satisfaction  Enhance goods consumption ratios  Design more effective goods transportation and distribution policies DATA MINING IN RETAIL INDUSTRY: EXAMPLES Design and construction of data warehouses based on the benefits of data mining  Multidimensional analysis of sales, customers, products, time, and region Analysis of the effectiveness of sales campaigns Customer retention: Analysis of customer loyalty  Use customer loyalty card information to register sequences of purchases of particular customers  Use sequential pattern mining to investigate changes in customer consumption or loyalty  Suggest adjustments on the pricing and variety of goods Purchase recommendation and cross-reference of items DATA MINING FOR TELECOMM. INDUSTRY (1) A rapidly expanding and highly competitive industry and a great demand for data mining  Understand the business involved  Identify telecommunication patterns  Catch fraudulent activities  Make better use of resources  Improve the quality of service Multidimensional analysis of telecommunication data  Intrinsically multidimensional: calling-time, duration, location of caller, location of callee, type of call, etc. DATA MINING FOR TELECOMM. INDUSTRY (2) Fraudulent pattern analysis and the identification of unusual patterns  Identify potentially fraudulent users and their atypical usage patterns  Detect attempts to gain fraudulent entry to customer accounts  Discover unusual patterns which may need special attention Multidimensional association and sequential pattern analysis  Find usage patterns for a set of communication services by customer group, by month, etc.  Promote the sales of specific services  Improve the availability of particular services in a region Use of visualization tools in telecommunication data analysis EXAMPLES OF DATA MINING SYSTEMS (1) IBM Intelligent Miner  A wide range of data mining algorithms  Scalable mining algorithms  Toolkits: neural network algorithms, statistical methods, data preparation, and data visualization tools  Tight integration with IBM's DB2 relational database system SAS Enterprise Miner  A variety of statistical analysis tools  Data warehouse tools and multiple data mining algorithms Mirosoft SQLServer 2000  Integrate DB and OLAP with mining  Support OLEDB for DM standard EXAMPLES OF DATA MINING SYSTEMS (2) SGI MineSet  Multiple data mining algorithms and advanced statistics  Advanced visualization tools Clementine (SPSS)  An integrated data mining development environment for end-users and developers  Multiple data mining algorithms and visualization tools DBMiner (DBMiner Technology Inc.)  Multiple data mining modules: discovery-driven OLAP analysis, association, classification, and clustering  Efficient, association and sequential-pattern mining functions, and visual classification tool  Mining both relational databases and data warehouses TYPES OF DATA SETS Record Relational records Data matrix, e.g., numerical matrix, crosstabs Document data: text documents: term- frequency vector Transaction data Graph and network World Wide Web Social or information networks Molecular Structures Ordered Video data: sequence of images Temporal data: time-series Sequential Data: transaction sequences Genetic sequence data Spatial, image and multimedia: Spatial data: maps Image data: Video data: DATA OBJECTS Data sets are made up of data objects. A data object represents an entity. Examples: sales database: customers, store items, sales medical database: patients, treatments university database: students, professors, courses Also called samples , examples, instances, data points, objects, tuples. Data objects are described by attributes. Database rows -> data objects; columns ->attributes. 59 ATTRIBUTES Attribute (or dimensions, features, variables): a data field, representing a characteristic or feature of a data object. E.g., customer _ID, name, address Types: Nominal Binary Numeric: quantitative Interval-scaled Ratio-scaled 60 ATTRIBUTE TYPES  Nominal: categories, states, or “names of things” or symbols Also called as categorical  Hair_color = {auburn, black, blond, brown, grey, red, white}  marital status, occupation, ID numbers, zip codes Not quantitative Represent names with numbers  E.g. 0 for black, 1 for brown, and so on… Makes no sense to calculate mean or median Mode is the major of central tendency 61 ATTRIBUTE TYPES Binary  Nominal attribute with only 2 states (0 and 1)  Symmetric binary: both outcomes equally important  e.g., gender  Asymmetric binary: outcomes not equally important.  e.g., medical test (positive vs. negative)  Convention: assign 1 to most important outcome (e.g., HIV positive) 62  Ordinal  Values have a meaningful order (ranking) but magnitude between successive values is not known.  Grades- A+, A, A-, B+,B,B-,…..  Size = {small, medium, large}, professional rankings  Often used in surveys for rating  How customers are satisfied  0-very dissatisfied  1- somewhat dissatisfied  2-neutral  3-satisfied  4-very satisfied Central tendency cab be represented by mode and median but not mean 63 NUMERIC ATTRIBUTE TYPES Quantity (integer or real-valued) Interval-scaled  Measured on a scale of equal-sized units  Values have order  E.g., temperature in C˚or F˚, calendar dates  No true zero-point Ratio-scaled  Inherent zero-point  We can speak of values as being an order of magnitude larger than the unit of measurement (10 K˚ is twice as high as 5 K˚).  e.g., temperature in Kelvin, length, counts, monetary quantities 64 DISCRETE VS. CONTINUOUS ATTRIBUTES Discrete Attribute  Has only a finite or countably infinite set of values  E.g., zip codes, profession, or the set of words in a collection of documents  Sometimes, represented as integer variables  Note: Binary attributes are a special case of discrete attributes Continuous Attribute  Has real numbers as attribute values  E.g., temperature, height, or weight  Practically, real values can only be measured and represented using a finite number of digits  Continuous attributes are typically represented as floating-point variables 65 BASIC STATISTICAL DESCRIPTIONS OF DATA Motivation  To better understand the data: central tendency, variation and spread Data dispersion characteristics  median, max, min, quantiles, outliers, variance, etc. Numerical dimensions correspond to sorted intervals  Data dispersion: analyzed with multiple granularities of precision  Boxplot or quantile analysis on sorted intervals Dispersion analysis on computed measures  Folding measures into numerical dimensions  Boxplot or quantile analysis on the transformed cube 66 MEASURING THE CENTRAL TENDENCY  Mean (algebraic measure) (sample vs. population): Note: n is sample size and N is population size. 1 n x =  xi =  x  Weighted arithmetic mean: n i =1 N  Trimmed mean: chopping extreme values n  Median: w x i i x = i =1  Middle value if odd number of values, or average of the middle two valuesn otherwise w i =1 i  Estimated by interpolation (for grouped data): n / 2 − ( freq)l median = L1 + ( ) width freqmedian Median interval  Mode  Value that occurs most frequently in the data  Unimodal, bimodal, trimodal  Empirical formula: mean − mode = 3  (mean − median) 67 Midrange- used to assess the central tendency of a numeric data set Average of largest and smallest values in the set Midrange=(30+110)/2=70 68 SYMMETRIC VS. SKEWED DATA 69  Median, mean and mode of symmetric, symmetric positively and negatively skewed data positively skewed negatively skewed  Data Mining: Concepts and  August 6, 2024 Techniques MEASURING THE DISPERSION OF DATA Dispersion-spread of numeric data ❑Range quantiles ❑Quartiles ❑Percentiles ❑Interquartile range ❑Variance and standard deviation 70 RANGE, QUARTILES, INTER-QUARTILE RANGE Range- difference between largest and smallest values  Quartiles: we can pick certain data points to split data distribution into equal-size consecutive sets. These points are called as quantiles.  Quantiles are points taken at regular intervals of data distribution dividing into equal size consecutive sets.  Q1 (25th percentile),Q2 (median), Q3 (75th percentile)  2-quantile- data point dividing lower and upper halves of data distribution- corresponds to median  4-quantiles –three data points that split data into 4 parts. Each part is ¼- known as quartile 100-quantiles= percentiles- split in 100 71  Gives indication of distribution’s center, spread and shape  Inter-quartile range: IQR = Q3 – Q1  Five number summary: min, Q1, median, Q3, max 72 PROPERTIES OF NORMAL DISTRIBUTION CURVE 73 The normal (distribution) curve  From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)  From μ–2σ to μ+2σ: contains about 95% of it  From μ–3σ to μ+3σ: contains about 99.7% of it GRAPHIC DISPLAYS OF BASIC STATISTICAL DESCRIPTIONS  Boxplot: graphic display of five-number summary  Histogram: x-axis are values, y-axis repres. frequencies  Quantile plot: each value xi is paired with fi indicating that approximately 100 fi % of data are  xi  Quantile-quantile (q-q) plot: graphs the quantiles of one univariant distribution against the corresponding quantiles of another  Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane 74 BOXPLOT ANALYSIS  Five-number summary of a distribution  Minimum, Q1, Median, Q3, Maximum  Boxplot  Data is represented with a box  The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR  The median is marked by a line within the box  Whiskers: two lines outside the box extended to Minimum and Maximum  Outliers: points beyond a specified outlier threshold, plotted individually 75  Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually  Outlier: usually, a value higher/lower than 1.5 x IQR Variance and standard deviation (sample: s, population: σ)  Variance: (algebraic, scalable computation)  Standard deviation s (or σ) is the square root of variance s2 (or σ2) 1 n 1 n 2 1 n 2  ( xi − x ) = n −1[ xi − ( xi ) ] n n s = 1 1   xi −  2 2 2  =2 ( x −  )2 = 2 n − 1 i =1 i =1 n i =1 N i =1 i N i =1 76 77 VISUALIZATION OF DATA DISPERSION: 3-D BOXPLOTS DATA MINING: CONCEPTS AND August 6, 2024 TECHNIQUES 78 HISTOGRAM ANALYSIS Histogram: Graph display of tabulated 40 frequencies, shown as bars 35 It shows what proportion of cases fall 30 into each of several categories 25 Differs from a bar chart in that it is the 20 area of the bar that denotes the value, 15 not the height as in bar charts, a crucial distinction when the categories are not 10 of uniform width 5 0 The categories are usually specified as 10000 30000 50000 70000 90000 non-overlapping intervals of some variable. The categories (bars) must be adjacent 79 HISTOGRAMS OFTEN TELL MORE THAN BOXPLOTS ◼ The two histograms shown in the left may have the same boxplot representation ◼ The same values for: min, Q1, median, Q3, max ◼ But they have rather different data distributions 80 QUANTILE PLOT Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences) Plots quantile information  For a data xi data sorted in increasing order, fi indicates that approximately 100 fi% of the data are below or equal to the value xi DATA MINING: CONCEPTS AND TECHNIQUES 81 QUANTILE-QUANTILE (Q-Q) PLOT  Graphs the quantiles of one univariate distribution against the corresponding quantiles of another  View: Is there is a shift in going from one distribution to another?  Example shows unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2. 82 SCATTER PLOT Provides a first look at bivariate data to see clusters of points, outliers, etc Each pair of values is treated as a pair of coordinates and plotted as points in the plane 83 POSITIVELY AND NEGATIVELY CORRELATED DATA The left half fragment is positively correlated The right half is negative correlated 84 UNCORRELATED DATA 85 DATA PREPROCESSING Data Preprocessing: An Overview  Data Quality  Major Tasks in Data Preprocessing Data Cleaning Data Integration Data Reduction Data Transformation and Data Discretization Summary DATA QUALITY: WHY PREPROCESS THE DATA? Measures for data quality: A multidimensional view Accuracy: correct or wrong, accurate or not Completeness: not recorded, unavailable, … Consistency: some modified but some not, dangling, … Timeliness: timely update? Believability: how trustable the data are correct? Interpretability: how easily the data can be understood? 87 MAJOR TASKS IN DATA PREPROCESSING Data cleaning  Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies Data integration  Integration of multiple databases, data cubes, or files Data reduction  Dimensionality reduction  Numerosity reduction  Data compression Data transformation and data discretization  Normalization  Concept hierarchy generation DATA CLEANING Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., instrument faulty, human or computer error, transmission error  incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data  e.g., Occupation=“ ” (missing data)  noisy: containing noise, errors, or outliers  e.g., Salary=“−10” (an error)  inconsistent: containing discrepancies in codes or names, e.g.,  Age=“42”, Birthday=“03/07/2010”  Was rating “1, 2, 3”, now rating “A, B, C”  discrepancy between duplicate records  Intentional (e.g., disguised missing data)  Jan. 1 as everyone’s birthday? INCOMPLETE (MISSING) DATA Data is not always available  E.g., many tuples have no recorded value for several attributes, such as customer income in sales data Missing data may be due to  equipment malfunction  inconsistent with other recorded data and thus deleted  data not entered due to misunderstanding  certain data may not be considered important at the time of entry  not register history or changes of the data 90 Missing data may need to be inferred HOW TO HANDLE MISSING DATA? Ignore the tuple: usually done when class label is missing (when doing classification)—not effective when the % of missing values per attribute varies considerably Fill in the missing value manually: tedious + infeasible? Fill in it automatically with  a global constant : e.g., “unknown”, a new class?!  the attribute mean  the attribute mean for all samples belonging to the same class: smarter  the most probable value: inference-based such as Bayesian formula or 91 decision tree NOISY DATA Noise: random error or variance in a measured variable Incorrect attribute values may be due to  faulty data collection instruments  data entry problems  data transmission problems  technology limitation  inconsistency in naming convention Other data problems which require data cleaning  duplicate records  incomplete data  inconsistent data HOW TO HANDLE NOISY DATA? Binning  first sort data and partition into (equal-frequency) bins  then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc. Regression  smooth by fitting the data into regression functions Clustering  detect and remove outliers Combined computer and human inspection  detect suspicious values and check by human (e.g., deal with possible outliers) 93 DATA CLEANING AS A PROCESS Data discrepancy detection  Use metadata (e.g., domain, range, dependency, distribution)  Check field overloading  Check uniqueness rule, consecutive rule and null rule  Use commercial tools  Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections  Data auditing: by analyzing data to discover rules and relationship to detect violators (e.g., correlation and clustering to find outliers) Data migration and integration  Data migration tools: allow transformations to be specified  ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface Integration of the two processes  Iterative and interactive (e.g., Potter’s Wheels) 94 DATA PREPROCESSING Data Preprocessing: An Overview  Data Quality  Major Tasks in Data Preprocessing Data Cleaning Data Integration Data Reduction Data Transformation and Data Discretization Summary 95 DATA INTEGRATION Data integration:  Combines data from multiple sources into a coherent store Schema integration: e.g., A.cust-id ≡ B.cust-#  Integrate metadata from different sources Entity identification problem:  Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton Detecting and resolving data value conflicts  For the same real world entity, attribute values from different sources are different  Possible reasons: different representations, different scales, e.g., metric vs. British units 96 96 HANDLING REDUNDANCY IN DATA INTEGRATION Redundant data occur often when integration of multiple databases  Object identification: The same attribute or object may have different names in different databases  Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue Redundant attributes may be able to be detected by correlation analysis and covariance analysis Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality 97 AIA1 CORRELATION ANALYSIS (NOMINAL DATA) Χ2 (chi-square) test The larger the Χ2 value, the more likely the variables are related The cells that contribute the most to the Χ2 value are those whose actual count is very different from the expected count Correlation does not imply causality  # of hospitals and # of car-theft in a city are correlated  Both are causally linked to the third variable: population CHI-SQUARE CALCULATION: AN EXAMPLE Play chess Not play chess Sum (row) Like science fiction 250(90) 200(360) 450 Not like science fiction 50(210) 1000(840) 1050 Sum(col.) 300 1200 1500 Χ2 (chi-square) calculation (numbers in parenthesis are expected counts calculated based on the data distribution in the two categories) It shows that like_science_fiction and play_chess are correlated in the group CORRELATION ANALYSIS (NUMERIC DATA) Correlation coefficient (also called Pearson’s product moment coefficient) where n is the number of tuples, and are the respective means of A and B, σA and σB are the respective standard deviation of A and B, and Σ(aibi) is the sum of the AB cross-product. If rA,B > 0, A and B are positively correlated (A’s values increase as B’s). The higher, the stronger correlation. rA,B = 0: independent; rAB < 0: negatively correlated CORRELATION (VIEWED AS LINEAR RELATIONSHIP) Correlation measures the linear relationship between objects To compute correlation, we standardize data objects, A and B, and then take their dot product 101 COVARIANCE (NUMERIC DATA) Covariance is similar to correlation Correlation coefficient: where n is the number of tuples, and are the respective mean or expected values of A and B, σA and σB are the respective standard deviation of A and B. Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their expected values. Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is likely to be smaller than its expected value. Independence: CovA,B = 0 but the converse is not true:  Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence 102 CO-VARIANCE: AN EXAMPLE It can be simplified in computation as Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14). Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?  E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4  E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6  Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4 Thus, A and B rise together since Cov(A, B) > 0. DATA PREPROCESSING Data Preprocessing: An Overview  Data Quality  Major Tasks in Data Preprocessing Data Cleaning Data Integration Data Reduction Data Transformation and Data Discretization Summary DATA REDUCTION STRATEGIES Data reduction: Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results Why data reduction? — A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set. Data reduction strategies  Dimensionality reduction, e.g., remove unimportant attributes  Wavelet transforms  Principal Components Analysis (PCA)  Feature subset selection, feature creation  Numerosity reduction (some simply call it: Data Reduction)  Regression and Log-Linear Models 105  Histograms, clustering, sampling  Data cube aggregation  Data compression DATA REDUCTION 1: DIMENSIONALITY REDUCTION Curse of dimensionality  When dimensionality increases, data becomes increasingly sparse  Density and distance between points, which is critical to clustering, outlier analysis, becomes less meaningful  The possible combinations of subspaces will grow exponentially Dimensionality reduction  Avoid the curse of dimensionality  Help eliminate irrelevant features and reduce noise  Reduce time and space required in data mining  Allow easier visualization Dimensionality reduction techniques  Wavelet transforms  Principal Component Analysis  Supervised and nonlinear techniques (e.g., feature selection) 106 ATTRIBUTE SUBSET SELECTION Another way to reduce dimensionality of data Redundant attributes  Duplicate much or all of the information contained in one or more other attributes  E.g., purchase price of a product and the amount of sales tax paid Irrelevant attributes  Contain no information that is useful for the data mining task 107 at hand  E.g., students' ID is often irrelevant to the task of predicting students' GPA DATA REDUCTION 2: NUMEROSITY REDUCTION Reduce data volume by choosing alternative, smaller forms of data representation Parametric methods (e.g., regression)  Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers)  Ex.: Log-linear models—obtain value at a point in m-D space as the product on appropriate marginal subspaces Non-parametric methods  Donot assume models 108  Major families: histograms, clustering, sampling, … PARAMETRIC DATA REDUCTION: REGRESSION AND LOG-LINEAR MODELS Linear regression  Data modeled to fit a straight line  Often uses the least-square method to fit the line Multiple regression  Allows a response variable Y to be modeled as a linear function of multidimensional feature vector Log-linear model  Approximates discrete multidimensional probability distributions 109 DATA REDUCTION 3: DATA COMPRESSION String compression  There are extensive theories and well-tuned algorithms  Typically lossless, but only limited manipulation is possible without expansion Audio/video compression  Typically lossy compression, with progressive refinement  Sometimes small fragments of signal can be reconstructed without reconstructing the whole Time sequence is not audio  Typically short and vary slowly with time Dimensionality and numerosity reduction may also be considered as forms of data compression DATA COMPRESSION Original Data Compressed Data lossless Original Data Approximated DATA PREPROCESSING Data Preprocessing: An Overview  Data Quality  Major Tasks in Data Preprocessing Data Cleaning Data Integration Data Reduction Data Transformation and Data Discretization Summary DATA TRANSFORMATION A function that maps the entire set of values of a given attribute to a new set of replacement values s.t. each old value can be identified with one of the new values Methods  Smoothing: Remove noise from data  Attribute/feature construction  New attributes constructed from the given ones  Aggregation: Summarization, data cube construction  Normalization: Scaled to fall within a smaller, specified range  min-max normalization  z-score normalization  normalization by decimal scaling  Discretization: Concept hierarchy climbing NORMALIZATION Min-max normalization: to [new_minA, new_maxA]  Ex. Let income range $12,000 to $98,000 normalized to [0.0, 1.0]. Then $73,000 is mapped to Z-score normalization (μ: mean, σ: standard deviation):  Ex. Let μ = 54,000, σ = 16,000. Then Normalization by decimal scaling Where j is the smallest integer such that Max(|ν’|) < 1 DISCRETIZATION Three types of attributes  Nominal—values from an unordered set, e.g., color, profession  Ordinal—values from an ordered set, e.g., military or academic rank  Numeric—real numbers, e.g., integer or real numbers Discretization: Divide the range of a continuous attribute into intervals  Interval labels can then be used to replace actual data values  Reduce data size by discretization  Supervised vs. unsupervised  Split (top-down) vs. merge (bottom-up)  Discretization can be performed recursively on an attribute  Prepare for further analysis, e.g., classification DATA DISCRETIZATION METHODS Typical methods: All the methods can be applied recursively Binning  Top-down split, unsupervised Histogram analysis  Top-down split, unsupervised Clustering analysis (unsupervised, top-down split or bottom-up merge) Decision-tree analysis (supervised, top-down split) Correlation (e.g., χ2) analysis (unsupervised, bottom-up merge) SIMPLE DISCRETIZATION: BINNING Equal-width (distance) partitioning  Divides the range into N intervals of equal size: uniform grid  if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B –A)/N.  The most straightforward, but outliers may dominate presentation  Skewed data is not handled well Equal-depth (frequency) partitioning  Divides the range into N intervals, each containing approximately same number of samples  Good data scaling  Managing categorical attributes can be tricky BINNING METHODS FOR DATA SMOOTHING ❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 * Partition into equal-frequency (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 * Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29 * Smoothing by bin boundaries: - Bin 1: 4, 4, 4, 15 118 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34 DATA VISUALIZATION  Why data visualization?  Gain insight into an information space by mapping data onto graphical primitives  Provide qualitative overview of large data sets  Search for patterns, trends, structure, irregularities, relationships among data  Help find interesting regions and suitable parameters for further quantitative analysis  Provide a visual proof of computer representations derived  Categorization of visualization methods:  Pixel-oriented visualization techniques  Geometric projection visualization techniques  Icon-based visualization techniques  Hierarchical visualization techniques  Visualizing complex data and relations 119 PIXEL-ORIENTED VISUALIZATION TECHNIQUES  For a data set of m dimensions, create m windows on the screen, one for each dimension  The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows  The colors of the pixels reflect the corresponding values (a) Income (b) Credit Limit (c) transaction volume (d) age 120 LAYING OUT PIXELS IN CIRCLE SEGMENTS To save space and show the connections among multiple dimensions, space filling is often done in a circle segment (a) Representing a data record (b) Laying out pixels in circle segment Representing aboutin circle 265,000 segmentData Items 50-dimensional with the ‘Circle Segments’ Technique 121 GEOMETRIC PROJECTION VISUALIZATION TECHNIQUES Visualization of geometric transformations and projections of the data Methods  Direct visualization  Scatterplot and scatterplot matrices  Landscapes  Projection pursuit technique: Help users find meaningful projections of multidimensional data  Prosection views  Hyperslice  Parallel coordinates 122 Ribbons with Twists Based on Vorticity DIRECT DATA VISUALIZATION DATA MINING: CONCEPTS AND TECHNIQUES 123 SCATTERPLOT MATRICES Used by ermission of M. Ward, Worcester Polytechnic Institute Matrix of scatterplots (x-y-diagrams) of the k-dim. data [total of (k2/2-k) scatterplots] 124 LANDSCAPES Used by permission of B. Wright, Visible Decisions Inc. news articles visualized as a landscape Visualization of the data as perspective landscape The data needs to be transformed into a (possibly artificial) 2D spatial representation which preserves the characteristics of the data 125 PARALLEL COORDINATES n equidistant axes which are parallel to one of the screen axes and correspond to the attributes The axes are scaled to the [minimum, maximum]: range of the corresponding attribute Every data item corresponds to a polygonal line which intersects each of the axes at the point which corresponds to the value for the attribute 126 PARALLEL COORDINATES OF A DATA SET 127 ICON-BASED VISUALIZATION TECHNIQUES Visualization of the data values as features of icons Typical visualization methods  Chernoff Faces  Stick Figures General techniques  Shape coding: Use shape to represent certain information encoding  Color icons: Use color icons to encode more information  Tile bars: Use small icons to represent the relevant feature vectors in document retrieval 128 CHERNOFF FACES A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc. The figure shows faces produced using 10 characteristics--head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening): Each assigned one of 10 possible values, generated using Mathematica (S. Dickson) REFERENCE: Gonick, L. and Smith, W. The Cartoon Guide to Statistics. New York: Harper Perennial, p. 212, 1993 Weisstein, Eric W. "Chernoff Face." From MathWorld--A Wolfram Web Resource. mathworld.wolfram.com/ChernoffFace.html 129 STICK FIGURE A census data figure showing age, income, gender, education, etc. A 5-piece stick figure (1 body and 4 limbs w. different angle/length) DATA MINING: CONCEPTS AND TECHNIQUES 130 HIERARCHICAL VISUALIZATION TECHNIQUES Visualization of the data using a hierarchical partitioning into subspaces Methods  Dimensional Stacking  Worlds-within-Worlds  Tree-Map  Cone Trees  InfoCube 131 DIMENSIONAL STACKING  Partitioning of the n-dimensional attribute space in 2-D subspaces, which are ‘stacked’ into each other  Partitioning of the attribute value ranges into classes. The important attributes should be used on the outer levels.  Adequate for data with ordinal attributes of low cardinality  But, difficult to display more than nine dimensions  Important to map dimensions appropriately 132 DIMENSIONAL STACKING Used by permission of M. Ward, Worcester Polytechnic Institute Visualization of oil mining data with longitude and latitude mapped to the outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes 133 WORLDS-WITHIN-WORLDS Assign the function and two most important parameters to innermost world Fix all other parameters at constant values - draw other (1 or 2 or 3 dimensional worlds choosing these as the axes) Software that uses this paradigm  N–vision: Dynamic interaction through data glove and stereo displays, including rotation, scaling (inner) and translation (inner/outer)  Auto Visual: Static interaction by means of queries 134 TREE-MAP  Screen-filling method which uses a hierarchical partitioning of the screen into regions depending on the attribute values  The x- and y-dimension of the screen are partitioned alternately according to the attribute values (classes) Schneiderman@UMD: Tree-Map of a File System Schneiderman@UMD: Tree-Map to support large data sets of a million items 135 INFOCUBE A 3-D visualization technique where hierarchical information is displayed as nested semi-transparent cubes The outermost cubes correspond to the top level data, while the subnodes or the lower level data are represented as smaller cubes inside the outermost cubes, and so on 136 THREE-D CONE TREES 3D cone tree visualization technique works well for up to a thousand nodes or so First build a 2D circle tree that arranges its nodes in concentric circles centered on the root node Cannot avoid overlaps when projected to 2D G. Robertson, J. Mackinlay, S. Card. “Cone Trees: Animated 3D Visualizations of Hierarchical Information”, ACM SIGCHI'91 Graph from Nadeau Software Consulting website: Visualize a social network data set that models the way an infection spreads from one person to the next 137 VISUALIZING COMPLEX DATA AND RELATIONS  Visualizing non-numerical data: text and social networks  Tag cloud: visualizing user-generated tags ◼ The importance of tag is represented by font size/color ◼ Besides text data, there are also methods to visualize relationships, such as visualizing social networks Newsmap: Google News Stories in 2005

data mining unit 1.pdf

Document Details

Tags

Related

Full Transcript