Week 1B PDF - Data Analysis
Summary
This document gives an overview of data analysis topics. It covers the analysis of data in different structural forms, such as graphs, networks, sequences, and data streams, and describes analysis methodologies and considerations such as evaluation and mining challenges.
Full Transcript
Structure and Network Analysis
  Graph mining
    Finding frequent subgraphs (e.g., chemical compounds), substructures (web fragments)
  Information network analysis
    Social networks: actors (objects, nodes) and relationships (edges)
      e.g., author networks in CS, terrorist networks
    Multiple networks
      A person could be in multiple information networks: friends, family, classmates, …
    Links carry a lot of semantic information: Link mining
  Web mining
    Web is a big information network: from PageRank to Google
    Analysis of Web information networks
      Web community discovery, opinion mining, usage mining, …

Time and Ordering: Sequential Pattern, Trend and Evolution Analysis
  Sequence, trend and evolution analysis
    Trend, time-series, and deviation analysis: e.g., regression and value prediction
    Sequential pattern mining
      e.g., first buy digital camera, then buy large SD memory cards
    Periodicity analysis
    Biological sequence analysis
      Approximate and consecutive motifs
    Similarity-based analysis
  Mining data streams
    Ordered, time-varying, potentially infinite, data streams
  Conversational Analysis

Evaluation
  Important to validate assumptions, prove effectiveness, make an argument, ...
  Need to evaluate the knowledge discovered, predictions made, …
  Multi-dimensional evaluation: accuracy, interestingness, representativeness, completeness, efficiency, explainability, diversity, …
  Different evaluation approaches for different areas/topics/applications/approaches …

Application Areas
  Industry: E-commerce, Social Media, Finance, Insurance, Telecommunication, Transport, Data Service providers, …
  Application: Credit Card Analysis; Claims, Fraud Analysis; Call record analysis; promotion analysis; Value added data; Power usage analysis; …

Major Challenges in Data Analysis (1)
  Mining Methodology
    Mining various and new kinds of knowledge
    Data Analysis: An interdisciplinary effort
    Boosting the power of discovery in a networked environment
    Handling noise, uncertainty, and incompleteness of data
    Pattern evaluation and pattern- or constraint-guided mining
  User Interaction
    Interactive analysis
    Incorporation of background knowledge
    Presentation and visualization of data analysis results

Major Challenges in Data Analysis (2)
  Efficiency and Scalability
    Efficiency and scalability of data analysis algorithms
    Parallel, distributed, stream, and incremental analysis methods
  Diversity of data types
    Handling complex types of data
    Analyzing dynamic, networked, and global data repositories
  Data analysis and society
    Social impacts of data analysis
    Privacy-preserving data analysis

Getting to Know Your Data
  Data Objects and Attribute Types
  Basic Statistical Descriptions of Data
  Measuring Data Similarity and Dissimilarity
  Summary

Data Objects
  Data sets are made up of data objects.
  A data object represents an entity.
  Examples:
    sales data: customers, store items, sales
    medical data: patients, treatments
    university data: students, professors, courses
  Also called samples, examples, instances, data points, objects, tuples.
  Data objects are described by attributes.
  Database rows -> data objects; columns -> attributes.

Attributes
  Attribute (or dimension, feature, variable): a data field, representing a characteristic or feature of a data object.
    E.g., customer_ID, name, address
  Types:
    Nominal
    Binary
    Ordinal
    Numeric: quantitative
      Interval-scaled
      Ratio-scaled

Attribute Types
  Nominal: categories, states, or "names of things"
    Hair_color = {auburn, black, blond, brown, grey, red, white}
    marital status, occupation, ID numbers, zip codes
  Binary
    Nominal attribute with only 2 states (0 and 1)
    Symmetric binary: both outcomes equally important
      e.g., gender
    Asymmetric binary: outcomes not equally important
      e.g., medical test (positive vs. negative)
      Convention: assign 1 to most important outcome (e.g., HIV positive)
  Ordinal
    Values have a meaningful order (ranking) but magnitude between successive values is not known.
    Size = {small, medium, large}, grades, army rankings

Numeric Attribute Types
  Quantity (integer or real-valued)
  Interval
    Measured on a scale of equal-sized units
    Values have order
      E.g., temperature in °C, calendar dates
    No true zero-point
    Can add or subtract; multiplication and division are not meaningful
  Ratio
    A ratio-scaled attribute is a numeric attribute with an inherent zero-point.
    Can perform all arithmetic operations: addition, subtraction, multiplication, and division
      E.g., $10 is twice as high as $5

Discrete vs.
Continuous Attributes
  Discrete Attribute
    Has only a finite or countably infinite set of values
      E.g., zip codes, profession, or the set of words in a collection of documents
    Sometimes, represented as integer variables
    Note: Binary attributes are a special case of discrete attributes
  Continuous Attribute
    Has real numbers as attribute values
      E.g., temperature, height, or weight
    Practically, real values can only be measured and represented using a finite number of digits
    Continuous attributes are typically represented as floating-point variables

Some special data formats
  Sequential data
    Ordered objects
    Time series: indexed in time order
    Text data: sequence of words, characters or other linguistic units
  Graph data
    Nodes (Vertices)
    Edges (Links)

Measuring the Central Tendency
  Mean (algebraic measure) (sample vs. population):
    $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
    Note: n is sample size and N is population size.
  Weighted arithmetic mean:
    $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  Trimmed mean: chopping extreme values
  Median: Middle value if odd number of values, or average of the middle two values otherwise
  Mode
    Value that occurs most frequently in the data
    Unimodal, bimodal, trimodal

Symmetric vs.
Skewed Data
  Median, mean and mode of symmetric, positively and negatively skewed data
  [Figure: three distributions, labelled symmetric, positively skewed, negatively skewed]

Measuring the Dispersion of Data
  Quartiles, outliers and boxplots
    Quartiles: Q1 (25th percentile), Q3 (75th percentile)
    Inter-quartile range: IQR = Q3 - Q1
    Five number summary: min, Q1, median, Q3, max
    Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
    Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
  Variance and standard deviation (sample: s, population: σ)
    Variance: (algebraic, scalable computation)
      $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
      $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$
    Standard deviation s (or σ) is the square root of variance $s^2$ (or $\sigma^2$)

Boxplot Analysis
  Five-number summary of a distribution
    Minimum, Q1, Median, Q3, Maximum
  Boxplot
    Data is represented with a box
    The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
    The median is marked by a line within the box
    Whiskers: two lines outside the box extended to Minimum and Maximum
    Outliers: points beyond a specified outlier threshold, plotted individually

Histogram Analysis
  Histogram: Graph display of tabulated frequencies, shown as bars
  It shows what proportion of cases fall into each of several intervals
  [Figure: example histogram]

Histograms Often Tell More than Boxplots
  The two histograms shown on the left may have the same boxplot representation
    The same values for: min, Q1, median, Q3, max
  But they have rather different data distributions

Similarity and Dissimilarity
  Similarity
    Numerical measure of how alike two data objects are
    Value is higher when objects are more alike
    Often falls in the range [0,1]
  Dissimilarity
  (e.g., distance)
    Numerical measure of how different two data objects are
    Lower when objects are more alike
    Minimum dissimilarity is often 0
    Upper limit varies
  Proximity refers to a similarity or dissimilarity

Proximity Measure for Nominal Attributes
  Method 1: Simple matching
    m: # of matches, p: total # of variables
    $d(i, j) = \frac{p - m}{p}$
  Method 2: Creating a new binary attribute for each of the M nominal states, then apply binary proximity measures

Proximity Measure for Binary Attributes
  A contingency table for binary data

                     Object j
                     1        0        sum
    Object i   1     q        r        q + r
               0     s        t        s + t
             sum     q + s    r + t    p

  Distance measure for symmetric binary variables:
    $d(i, j) = \frac{r + s}{q + r + s + t}$
  Distance measure for asymmetric binary variables:
    $d(i, j) = \frac{r + s}{q + r + s}$
  Jaccard coefficient (similarity measure for asymmetric binary variables):
    $sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$

Distance on Numeric Data: Minkowski Distance
  Minkowski distance: A popular distance measure
    $d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$
    where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and h is the order (the distance so defined is also called L-h norm)
  Properties
    d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
    d(i, j) = d(j, i) (Symmetry)
    d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
  A distance that satisfies these properties is a metric

Special Cases of Minkowski Distance
  h = 1: Manhattan (city block, L1 norm) distance
    E.g., the Hamming distance: the number of bits that are different between two binary vectors
    $d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
  h = 2: (L2 norm) Euclidean distance
    $d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
  h → ∞: "supremum" (Lmax norm, L∞ norm) distance.
    This is the maximum difference between any component (attribute) of the vectors
    $d(i, j) = \max_{f} |x_{if} - x_{jf}|$

Example: Minkowski Distance
  Data points

    point   attribute 1   attribute 2
    x1      1             2
    x2      3             5
    x3      2             0
    x4      4             5

  [Figure: scatter plot of the points x1 to x4]

  Dissimilarity Matrices

  Manhattan (L1)
    L1    x1     x2     x3     x4
    x1    0
    x2    5      0
    x3    3      6      0
    x4    6      1      7      0

  Euclidean (L2)
    L2    x1     x2     x3     x4
    x1    0
    x2    3.61   0
    x3    2.24   5.1    0
    x4    4.24   1      5.39   0

  Supremum (L∞)
    L∞    x1     x2     x3     x4
    x1    0
    x2    3      0
    x3    2      5      0
    x4    3      1      5      0

Proximity for Ordinal Variables
  Order is important, e.g., rank
  Can be treated like numeric data
    replace $x_{if}$ by their rank $r_{if} \in \{1, \ldots, M_f\}$
    map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
      $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
    compute the dissimilarity using methods for numeric variables
  Example: {small, medium, large} -> rank {1, 2, 3} -> numerical {0, 0.5, 1}

Attributes of Mixed Type
  A database may contain all attribute types
    Nominal, symmetric binary, asymmetric binary, numeric, ordinal
  One may use a weighted formula to combine their effects
    $d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
    $\delta_{ij}^{(f)} = 0$ if $x_{if} = x_{jf} = 0$ and f is asymmetric binary; otherwise $\delta_{ij}^{(f)} = 1$
    f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
    f is numeric: use the normalized distance
    f is ordinal
      Compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
      Treat $z_{if}$ as numeric

Textual Data: Cosine Similarity
  A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document.
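
As a minimal sketch of this representation, assuming simple whitespace tokenisation and a small hypothetical vocabulary (neither is specified in the slides), a term-frequency vector could be built like this:

```python
from collections import Counter


def term_frequency_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    # Assumption: lower-casing and splitting on whitespace is enough here.
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document, chosen only for illustration.
vocabulary = ["data", "mining", "graph", "stream"]
document = (
    "Data mining finds patterns in data and "
    "graph mining finds patterns in graph data"
)

vector = term_frequency_vector(document, vocabulary)
print(dict(zip(vocabulary, vector)))
```

Each position of the vector is one attribute of the document, so a realistic vocabulary leads to thousands of mostly zero entries.
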
  Cosine measure: If $d_1$ and $d_2$ are two vectors (e.g., term-frequency vectors), then
    $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$
    where $\cdot$ indicates vector dot product, $\|d\|$: the length of vector d

Summary
  Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
  Gain insight into the data by:
    Basic statistical data description: central tendency, dispersion, graphical displays
    Data visualization: map data onto graphical primitives
    Measure data similarity
  Above steps are the beginning of data preprocessing.
  Many methods have been developed, but this is still an active area of research.
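
The distance and similarity measures from the section on measuring data similarity and dissimilarity can be checked numerically. The sketch below uses plain Python with hypothetical function names (`minkowski`, `supremum`, `cosine_similarity`). It recomputes the Manhattan, Euclidean, and supremum distances for the four example points x1 to x4, and evaluates the cosine measure on two illustrative term-frequency vectors that are not taken from the slides.

```python
import math


def minkowski(a, b, h):
    """Minkowski distance of order h between two equal-length vectors."""
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1.0 / h)


def supremum(a, b):
    """Supremum (L-infinity) distance: the largest per-attribute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# The four points from the Minkowski distance example.
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

# Print the lower triangle of each dissimilarity matrix, pair by pair.
names = sorted(points)
for i, p in enumerate(names):
    for q in names[:i]:
        a, b = points[p], points[q]
        print(
            p, q,
            "L1 = %.2f" % minkowski(a, b, 1),
            "L2 = %.2f" % minkowski(a, b, 2),
            "Lmax = %.2f" % supremum(a, b),
        )

# Illustrative term-frequency vectors (an assumption, not data from the slides).
d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
print("cos(d1, d2) = %.2f" % cosine_similarity(d1, d2))
```

The printed pairwise values should agree with the Manhattan, Euclidean, and supremum dissimilarity matrices in the example. The supremum distance is written as a separate function because it is the limit of the Minkowski distance as h grows, not a value of h that can be passed in directly.
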