Data Mining Lecture2.pdf

Data Mining [CSEN 911]
GUC - Winter 2024 – Lecture 2
Data Exploration and OLAP analytics (recap)
Dr. Ayman Al-Serafi
TAs: Tameem Alghazaly* (lead), Nada Bakeer, Sarah Samir, Mariam Moustafa

Outline
1. Data Mining Overview Recap
2. Data Exploration
3. Which Data Mining technique?
4. OLAP analytics (recap)
5. Conclusion
Q&A breaks between sections. Urgent Qs only in between!

THE BIG PICTURE – DATA MINING PROCESS
[Figure: terms arranged under three stages: Data Ingestion/Wrangling/Munging, Model Building, Evaluation. Terms shown: Streams, Data Discovery, Data Preprocessing, Data Cleaning, Data Formats, Flat Files, Data Reduction, Data Extraction, Data Parsing, Databases, Data Transformation, Data Tagging, Data Integration, Data Structures, Visualization, Outlier Analysis, Feature Selection, Feature Engineering, Regression, Classification, Clustering, Association Rule Mining, Ensemble Models, Regularization, Cross Validation, Hyperparameter Tuning, Model Accuracy, Confidence, Cluster Quality, Imbalance, Overfitting, Production.]

Data Mining Process: CRISP-DM
CRoss-Industry Standard Process for Data Mining
[Figure: cycle of six numbered steps arranged around the data sources.]
Step 1: Business Understanding
Step 2: Data Understanding
Step 3: Data Preparation
Step 4: Model Building
Step 5: Testing and Evaluation
Step 6: Deployment
• Steps 1 to 3 account for ~85% of total project time (!)
• The process is highly repetitive and experimental

Phases and Tasks
• Business Understanding: Determine Business Objectives; Assess Situation; Determine Data Mining Goals; Produce Project Plan
• Data Understanding: Collect Initial Data; Describe Data; Explore Data; Verify Data Quality
• Data Preparation: Select Data; Clean Data; Construct Data; Integrate Data; Format Data
• Modeling: Select Modeling Technique; Generate Test Design; Build Model; Assess Model
• Evaluation: Evaluate Results; Review Process; Determine Next Steps
• Deployment: Plan Deployment; Plan Monitoring & Maintenance; Produce Final Report; Review Project

Outline: 2. Data Exploration

What Kinds of Data Can Be Mined?
• TRADITIONAL: database-oriented data sets and applications
  - Relational database, data warehouse, transactional database
• ADVANCED data sets and advanced applications
  - Data streams, sensor data
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Structured data, graphs, social networks and multi-linked data
  - Object-relational databases, heterogeneous databases, legacy databases
  - Spatial data and spatiotemporal data
  - Multimedia databases
  - Text databases
  - The World-Wide Web
  - Social feeds

DATA REPRESENTATIONS
• Tabular (structured / defined schema): ideal for ML!
• Semi-structured (flexible schema): XML, JSON, …
• Unstructured (unstructured?): images, text, video, …
[Example of tabular data: the first three rows of the Hubway trip table reproduced on the Hubway slide below.]

TABULAR DATA – INSTANCES AND ATTRIBUTES
[Figure: table with instances O1 to O4 as rows and attributes (features) A1 to A3 as columns. A row is a feature vector; a column holds the observations of one attribute.]
A data instance represents an entity
• Also: sample, example, object, data point
• e.g. customers, students, patients, books
An attribute is a data field, representing a characteristic or feature of a data object
• Also: dimension, variable, feature
• e.g. name, age, salary, gender, grade, …
Attribute vector (feature vector): a set of attributes that describe an object
Observations: the observed values for an attribute

Boston’s Hubway Data Challenge
trip duration, starttime, stoptime, start station id, start station name, start station latitude, start station longitude, end station id, end station name, end station latitude, end station longitude, bikeid, usertype, birth year, gender
542, 1/1/2015 0:21, 1/1/2015 0:30, 115, Porter Square Station, 42.387995, -71.119084, 96, Cambridge Main Library at Broadway / Trowbridge St, 42.373379, -71.111075, 277, Subscriber, 1984, 1
438, 1/1/2015 0:27, 1/1/2015 0:34, 80, MIT Stata Center at Vassar St / Main St, 42.3619622, -71.0920526, 95, Cambridge St - at Columbia St / Webster Ave, 42.372969, -71.094445, 648, Subscriber, 1985, 1
254, 1/1/2015 0:31, 1/1/2015 0:35, 91, One Kendall Square at Hampshire St / Portland St, 42.366277, -71.09169, 68, Central Square at Mass Ave / Essex St, 42.36507, -71.1031, 555, Subscriber, 1974, 1
432, 1/1/2015 0:53, 1/1/2015 1:00, 115, Porter Square Station, 42.387995, -71.119084, 96, Cambridge Main Library at Broadway / Trowbridge St, 42.373379, -71.111075, 1307, Subscriber, 1987, 1
735, 1/1/2015 1:07, 1/1/2015 1:19, 105, Lower Cambridgeport at Magazine St/Riverside Rd, 42.356954, -71.113687, 88, Inman Square at Vellucci Plaza / Hampshire St, 42.374035, -71.101427, 177, Customer, 1986, 2
311, 1/1/2015 1:28, 1/1/2015 1:33, 88, Inman Square at Vellucci Plaza / Hampshire St, 42.374035, -71.101427, 76, Central Sq Post Office / Cambridge City Hall at Mass Ave / Pleasant St, 42.366426, -71.105495, 685, Subscriber, 1989, 1
Half a million Hubway rides from 2011 to 2013!
‘What does the data tell us about Boston’s ride share program?’

DATA EXPLORATION/QUESTION REFINEMENT
Who? Who’s using the bikes?
• More men or more women?
• Older or younger people?
• Subscribers or one-time users?
Where? Where are bikes being checked out?
• More in Boston than Cambridge?
• More in commercial or residential areas?
• More around tourist attractions?
Source: CS109 Stanford’s Data Science Course

DATA EXPLORATION/QUESTION REFINEMENT
When? When are the bikes being checked out?
• More during the weekend than on weekdays?
• More during rush hour?
• More during the summer than the fall?
Why? For what reasons/activities are people checking out bikes?
• Are more bikes used for recreation than for commuting?
• Are more bikes used for touristic purposes?
• Are bikes used to bypass traffic?
Source: CS109 Stanford’s Data Science Course

DATA EXPLORATION/QUESTION REFINEMENT
How? Questions that investigate/model relationships between variables
• How do user demographics impact the duration the bikes are being used? Or where they are being checked out?
• How do weather or traffic conditions impact bike usage?
• How do the characteristics of the station location affect the number of bikes being checked out?
○ Do we have the data to answer these questions with reasonable certainty?
○ What data do we need to collect in order to answer these questions?
○ Sometimes the feature you want to explore doesn’t exist in the data, and must be engineered!
○ Sometimes the data is given to you in pieces and must be merged!
Source: CS109 Stanford’s Data Science Course

THINGS TO CONSIDER FOR TABULAR DATA
• Are column headers values and not variable names?
• Are variables stored in both rows and columns?
• Are multiple variables stored in one column?
• Are multiple types of experimental units stored in the same table?
In general, we want each file to correspond to a dataset, each column to represent a single variable and each row to represent a single object.
We want to tabularize the data. This makes Python and ML algorithms happy!
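The tabular layout described above (one variable per column, one object per row) can be sketched in pandas. This is a minimal illustration, assuming a hypothetical wide table of delivery counts whose column headers are days; the column names and numbers are invented for the example and are not taken from the slides.

```python
import pandas as pd

# Hypothetical wide table: the headers "Friday", "Saturday", "Sunday" are
# values of a "day" variable, not variable names (illustrative numbers only).
wide = pd.DataFrame(
    {
        "time": ["morning", "afternoon"],
        "Friday": [10, 7],
        "Saturday": [4, 9],
        "Sunday": [6, 3],
    }
)

# melt() moves the day headers into a single "day" column, so each row
# becomes one observation: (time, day, deliveries).
tidy = wide.melt(id_vars="time", var_name="day", value_name="deliveries")

print(tidy)
```

After reshaping, every variable has its own column, so grouping or plotting by day needs no special handling of the headers.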
THINGS TO CONSIDER FOR TABULAR DATA
The following is a table for the number of produce deliveries over a weekend
[Table: produce deliveries over a weekend.]
• Are column headers values and not variable names?
• Are variables stored in both rows and columns?
• Are multiple variables stored in one column?
• Are multiple types of experimental units stored in the same table?
What object or event are we measuring? What are the variables in this dataset?
• Variables should be: Time, Day, # Produce Deliveries
What's the issue? How do we fix it?
• Each column header represents a value, not a variable
• The values of the variable “# Produce Deliveries” are not recorded in a single column
Source: CS109 Stanford’s Data Science Course

THINGS TO CONSIDER FOR TABULAR DATA
Reorganize the data to make explicit the event we’re observing and the variables associated with this event
[Table: the reorganized deliveries data.]
Source: CS109 Stanford’s Data Science Course

ATTRIBUTE TYPES
• Qualitative Attributes: Categorical/Nominal, Binary, Ordinal
• Quantitative Attributes: Numeric

QUALITATIVE ATTRIBUTES
Categorical
• Each value represents a category, code, or state
• e.g. hair color, marital status, customer ID
• Can be represented as numbers (coding)
Binary
• Nominal with only two values; two states or categories: 0 or 1 (absent or present, true or false)
• e.g. gender
Ordinal
• Values have a meaningful order or ranking, but the magnitude between successive values is not known
• e.g. professional rank, grade, size, customer satisfaction
Most algorithms in Python are designed to work with numbers! Qualitative attributes may need to be encoded into numbers.

QUANTITATIVE ATTRIBUTES
Interval-scaled
• Measured on a scale of equal-size units
• e.g. temperature, year
• Do not have a true zero point
• Values cannot be expressed as multiples of one another
Ratio-scaled
• Have a true zero point
• A value can be expressed as a multiple of another
• e.g. years of experience, weight, salary
Sometimes we need to normalize quantitative data.
Sometimes we need to discretize quantitative data – back to categorical!

BASIC STATISTICAL DESCRIPTIONS OF DATA
Motivation: to better understand the data distribution
• Measuring central tendency
• Measuring dispersion of data

MEASURING CENTRAL TENDENCY
For N observations of numerical variable X: \(x_1, x_2, \dots, x_N\)
Mean: the average of the values
\[ \bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \dots + x_N}{N} \]
Weighted average: a weight \(w_i\) is associated with each value
\[ \bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} = \frac{w_1 x_1 + w_2 x_2 + \dots + w_N x_N}{\sum_{i=1}^{N} w_i} \]
Problem: sensitivity to outlier values
• e.g. mean salary, mean student score
• Trimmed mean: chop off extreme values at both ends

MEASURING CENTRAL TENDENCY
Median: middle value in a set of ordered values
• N is odd: the median is the middle value of the ordered set
• N is even: the median is not unique; take the average of the two middlemost values
• Expensive to compute for a large number of observations – why?
Mode: most frequent value in the attribute values
• Works for both qualitative and quantitative attributes

MEASURING DISPERSION OF DATA
Variance & Standard Deviation (SD): indicate how spread out a data distribution is
• Low SD: data observations tend to be very close to the mean
• High SD: data is spread out over a large range of values
\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \bar{x}^2 \]
\[ SD = \sigma = \sqrt{\sigma^2} \]

MEASURING DISPERSION OF DATA
For N observations of numerical variable X: \(x_1, x_2, \dots, x_N\)
First, we order the observations!
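The measures of central tendency and the variance defined above can be checked numerically. The following is a minimal NumPy sketch on a small invented salary sample with one extreme value; the sample, the weights and the `trimmed_mean` helper are assumptions made for illustration, not part of the lecture material.

```python
from collections import Counter

import numpy as np


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest observations."""
    ordered = np.sort(np.asarray(values, dtype=float))
    return ordered[k:len(ordered) - k].mean()


# Hypothetical salaries (in thousands); 400 is a deliberate outlier.
salaries = np.array([30, 32, 35, 35, 38, 40, 400], dtype=float)
weights = np.array([1, 1, 2, 2, 1, 1, 1], dtype=float)  # illustrative weights

mean = salaries.mean()                                    # pulled up by the outlier
weighted = np.sum(weights * salaries) / np.sum(weights)   # sum(w_i x_i) / sum(w_i)
trimmed = trimmed_mean(salaries, k=1)                     # drops 30 and 400
median = np.median(salaries)                              # middle of the ordered values
mode = Counter(salaries.tolist()).most_common(1)[0][0]    # most frequent value

# Population variance, computed with both forms of the formula above.
var_definition = np.mean((salaries - mean) ** 2)
var_shortcut = np.mean(salaries ** 2) - mean ** 2
sd = np.sqrt(var_definition)

print(f"mean={mean:.2f} weighted={weighted:.2f} trimmed={trimmed:.2f}")
print(f"median={median:.2f} mode={mode:.2f} sd={sd:.2f}")
```

In this sample the mean is far from the median and the trimmed mean, which illustrates the sensitivity to outliers noted on the slide.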
Then, we can compute …
• Range: difference between the largest and smallest values
• Quantiles: points taken at regular intervals of a data distribution, dividing it into (almost) equal-size consecutive sets
  - Most famous: percentiles, which divide the data into 100 equal-sized sets
  - Quartiles: Q1, Q2 (the median) and Q3 divide the data into 4 equal-sized sets
• Interquartile Range: IQR = Q3 - Q1

MEASURING DISPERSION OF DATA
Five-Number Summary: Min, Q1, Median (Q2), Q3, Max
Boxplots: visualization for the five-number summary
• Whiskers terminate at min & max OR at the most extreme observations within 1.5 × IQR of the quartiles, so each whisker has length at most 1.5 × IQR
  - Lower whisker: ends at Min, or at the smallest observation not below Q1 – (1.5 × IQR)
  - Upper whisker: ends at Max, or at the largest observation not above Q3 + (1.5 × IQR)
• Remaining points are plotted individually (outliers!)

BOXPLOT EXAMPLES
[Figure: comparison of a boxplot of a nearly normal distribution and a probability density function (pdf) for a normal distribution.]
Working Example: https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/box-plot-review

CAREFUL WITH ESTIMATIONS OF CENTRALITY AND DISPERSION PARAMETERS!
Outliers can distort everything!
[Figure: two panels, "The first 20 points" and "All the points in the dataset".]

VISUAL REPRESENTATIONS OF DATA – WHY?
Visualization is the transformation of data into images that effectively and accurately represent information about the data
• Explore: if nothing is known yet about the data (where we are now!)
• Analyze: if there are hypotheses or assumptions to verify or falsify
• Present: if results are known and communication of results/insights to users is needed

VISUAL REPRESENTATIONS OF DATA DISTRIBUTIONS
In addition to the usual representations we use regularly:
• Line charts
• Bar charts
• Pie charts
We also have:
• Boxplots
• Histograms
• Scatter plots
[Figure: chart types grouped by purpose: Comparison, Distribution, Relationship, Composition.]

VISUAL REPRESENTATIONS OF DATA DISTRIBUTIONS
• Comparison: evaluate and compare values between two or more data categories
• Distribution: visualize the full data spectrum
• Relationship: show the relationship, correlation, or connection of two or more variables
• Composition: show how a total value can be divided into parts, or highlight the significance of each part relative to the total value

Choice of a specific visualization method is based on:
○ Number of variables
○ Number of data points
○ Time
[Figure: functions we may use in Seaborn.]

HISTOGRAMS
An approximate representation of the distribution of numerical data
[Figure: histogram of COUNT (0 to 9) against PRICE ($), one bar per price from 1 to 30.]

HISTOGRAMS
To construct a histogram, you need to “bin”/“bucket” the range of values
• Bins are consecutive, non-overlapping intervals
• Divide the range of values into a series of intervals, then count how many values fall into each interval
To reduce further, change the width of the buckets ($10 range)
[Figure: histogram of COUNT (0 to 30) against PRICE ($) with bins 1 to 10, 11 to 20, 21 to 30.]

SCATTER PLOTS
Can visualize how multi-dimensional data are distributed across certain values
Each pair of attribute values is treated as a pair of coordinates and plotted as a point in the plane
X and Y are correlated if one attribute implies the other; the correlation can be positive, negative, or null (uncorrelated). More on that later!
For more attributes, we use a scatter plot matrix

SCATTER PLOTS – CORRELATIONS
What types of correlations can be inferred from these scatter plots?
[Figure: example scatter plots.]

SCATTER PLOTS – CORRELATIONS
What types of correlations can be inferred from these scatter plots?
[Figure: above, scatter plot of BMI and insurance charges; left, the same plot with the smoking behavior added.]
Source: Kaggle Data Visualization Micro Course

CORRELATIONS DESCRIPTIVE STATISTICS
Correlation Analysis (Pearson Correlations)
• Basic data analysis capability useful for:
  - Understanding the nature of relationships (positive, negative) among variables
  - Quantifying the strength of the relationship between pairs of variables
• Relationships are summarized into an easy-to-understand coefficient
  - A value between -1 and 1
  - It is the ratio between the covariance of two variables and the product of their standard deviations, \(r = \mathrm{cov}(X, Y) / (\sigma_X \sigma_Y)\); thus it is essentially a normalised measurement of the covariance
• Has underlying assumptions which should be kept in mind
  - Multivariate normality: the joint distribution of variable pairs is normally distributed
  - The relationship between variables is either always positive/negative or zero: no non-linear relationship exists between variables
  - Variables are numerical and continuous in nature: 0/1 indicators and variables representing ranks require a different type of correlation coefficient for accurate quantification of the relationship
• If data are linearly related, then this will provide a solid indicator of the sign of the relationship and a fairly good relative measure of its strength

CORRELATIONS DESCRIPTIVE STATISTICS
[Figure.]

HEAT MAPS
• Show the magnitude of a phenomenon as color in two dimensions
• Work well with natural semantics
• Color mapping can be problematic
• Grayscale is usually fine

A NOTE ON COLORS
Use the organization’s palette
Choose colors sensibly based on the information you want to convey
• Sequential
• Diverging
• Categorical
Use different colors for categories (5-8 at most)
Use the same color with varying luminance or saturation for ordinal data
• If the scale starts at 0, use sequential colors; if the scale diverges from 0, use a diverging scale
Source: CS109 Stanford’s Data Science Course

PRINCIPLES OF VISUALIZATIONS
Form/Vision
• Maximize data to ink ratio
• Use proper scale lines and a data rectangle
• Reduce chart junk and clutter, and make data stand out
• Use visually prominent graphical elements
Function/Understanding
• Provide explanations and draw conclusions
• Use all available space
• Align juxtaposed plots
• Use log scales when appropriate
• Bank aspect ratio of lines to 45°

Outliers Detection
• An outlier is defined as a data point which is very different from the rest of the data based on some measure.
• Previous research has found that the existence of outliers has a harmful effect on the results of different data mining techniques.
• In order to avoid this harmful effect, outliers may be deleted or treated. But deleting may lead to a loss of valuable information.
• Extreme values that could distort analysis
  - detect via initial checks
  - decide whether values are genuine and whether cases should be included or excluded
  - take appropriate actions: recode or delete
• Check and exclude groups of cases falling outside the required sample universe

Outlier Detection Techniques
• Based on the underlying dimensions:
  - Univariate
  - Multivariate
• Based on the underlying methodology:
  - Distribution based: outliers are those observations that deviate from the standard distribution.
  - Statistical based: a generalized form of the distribution based.
  - Clustering based: outliers are those objects that don’t belong to any of the identified clusters.
  - Density based: outliers are those objects that have low densities in their local neighbourhood.

Univariate versus Multivariate
[Figure: scatter plot with axes X and Y.]

Clustering Based
[Figure.]

TO SUM UP – FOR EFFECTIVE VISUALIZATIONS
• Identify important dimensions to visualize
• Identify vocabulary elements (color, shape, orientation) that will map to those dimensions
• Have graphical integrity
  - Do not distort scale
  - Include uncertainty
  - Plot all the data
• Keep it simple
• Use the right chart
• Interactive is usually good and gives control to the user
• Use color sensibly

Data Exploration
• Univariate Statistics
  - Count
  - Minimum, Maximum
  - Modes
  - Mean
  - Standard Deviation
  - …
• Values Analysis (basic data quality analysis)
  - Data Types
  - Counts
  - # NULL Values
  - # Positive Values
  - # Negative Values
  - # Zeros
  - # Blanks
  - # Unique Values
• Frequency Analyses
  - Frequency of Discrete Variables
• Histogram Analyses
  - Histograms of Continuous Variables
• Overlap Analysis
  - Index/Key Column Consistency
• Scatter Plot Analysis
  - 2-D and 3-D Plots of Continuous Variables

Outline: 3. Which Data Mining technique?

Data in Data Mining
• Data: a collection of facts usually obtained as the result of experiences, observations, or experiments
• Data may consist of numbers, words, and images
• Data: lowest level of abstraction (from which information and knowledge are derived)
- DM with different data types?
- Other data types?
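The values analysis listed on the Data Exploration slide (data types, counts, NULLs, positives, negatives, zeros, blanks, unique values) maps onto a few pandas calls. The following is a minimal sketch; the `profile` helper, the column names and the sample values are assumptions made for illustration and do not come from the lecture.

```python
import pandas as pd


def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column value analysis: type, counts, NULLs, uniques, signs, blanks."""
    rows = []
    for name in df.columns:
        col = df[name]
        non_null = col.dropna()
        numeric = pd.api.types.is_numeric_dtype(col)
        rows.append(
            {
                "column": name,
                "dtype": str(col.dtype),
                "count": int(col.notna().sum()),
                "nulls": int(col.isna().sum()),
                "unique": int(col.nunique()),
                # Sign counts only make sense for numeric columns.
                "positives": int((non_null > 0).sum()) if numeric else 0,
                "negatives": int((non_null < 0).sum()) if numeric else 0,
                "zeros": int((non_null == 0).sum()) if numeric else 0,
                # Blanks: empty or whitespace-only strings in text columns.
                "blanks": 0 if numeric
                else int((non_null.astype(str).str.strip() == "").sum()),
            }
        )
    return pd.DataFrame(rows).set_index("column")


# Hypothetical trips table with deliberately dirty values.
trips = pd.DataFrame(
    {
        "tripduration": [542, 438, 254, 0, -5],
        "usertype": ["Subscriber", "Subscriber", "", "Customer", None],
        "birth_year": [1984.0, 1985.0, float("nan"), 1986.0, 1989.0],
    }
)

report = profile(trips)
print(report)

# Frequency analysis of a discrete variable, including missing values.
print(trips["usertype"].value_counts(dropna=False))
```

A report like this is typically read before any modeling: a negative duration or a blank user type points to cleaning work rather than to a pattern in the data.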
[Figure: data types. Categorical: Nominal, Ordinal. Numerical: Interval, Ratio.]

DATA MINING STRATEGIES
[Figure: tree of data mining strategies.]
• Unsupervised Clustering
• Supervised Learning
  - Classification
  - Estimation
  - Prediction
• Market Basket Analysis
  - Association Rule Mining

Table 1.2 Data Instances with an Unknown Classification
Patient ID# | Sore Throat | Fever | Swollen Glands | Congestion | Headache | Diagnosis
11 | No | No | Yes | Yes | Yes | ?
12 | Yes | Yes | No | No | Yes | ?
13 | No | No | No | No | Yes | ?
INDEPENDENT VARIABLES: Sore Throat, Fever, Swollen Glands, Congestion, Headache
DEPENDENT VARIABLE (CLASS): Diagnosis

DM tasks
• Classification: a learning function that classifies data into one of several predefined classes.
  - E.g., Decision Trees (nominal) / Logistic Regression (binary)
  - Categorical dependent variable
• Estimation: a mathematical equation that provides predictions of the values of one variable based on the known values of one or more other variables.
  - Linear Regression (numerical) / Logistic Regression (binary, estimate between 0 and 1)
  - Regression Trees (numerical dependent variable)
• Clustering: producing classifications from initially unclassified data, using variables in an unsupervised approach.
  - Clustering and outliers (independent variables only)
• Summarization: finding a compact description for a subset of data.
  - Associations/Correlations (independent variables only)

DM Tasks Description
• Linear Regression
  - Linear regression can be used to predict or estimate the value of a continuous numeric data element based upon a linear combination of other numeric data elements present for each observation.
• Logistic Regression
  - Logistic regression can be used to predict or estimate a two-valued variable based upon other numeric data elements present for each observation.
• Factor Analysis
  - Factor analysis is a collective term for a family of techniques.
In general, factor analysis can be used to identify, quantify, and re-specify the common and unique sources of variability in a set of numeric variables. One of its many applications allows an analytical modeler to reduce the number of numeric variables needed to describe a collection of observations by creating new variables, called factors, as linear combinations of the original variables.
• Decision Trees
  - Decision trees, or rule induction, can be used to predict or estimate the value of a multivalued variable based upon other categorical and continuous numeric data elements by building decision rules and presenting them graphically in the shape of a tree, based upon splits on specific data values.
• Clustering
  - Cluster analysis can be used to form multiple groups of observations, such that each group contains observations that are very similar to one another, based upon values of multiple numeric data elements.
• Association Rules
  - Generate association rules and various measures of frequency, relationship and statistical significance associated with these rules. These rules can be general, or have a dimension of time associated with them.

Outline: 4. OLAP analytics (recap)

Types of Business Intelligence
[Figure.]

Online Analytical Processing (OLAP)
• The term OLAP refers to a variety of activities usually performed by end-users in online systems to explore data and make reports.
• OLAP is a tool that enables the user, while at a PC, to query the system, conduct an analysis, and so on. The result is generated in seconds.
• Multidimensional OLAP (MOLAP): OLAP is implemented via a specialized multidimensional database (or data store) that summarizes transactions into multidimensional views ahead of time. Data is organized in cube structures.
Queries are fast since consolidation has already been done.

Multidimensionality
• The ability to organize, present, and analyze data by several dimensions, such as sales by region (geography), by product, by salesperson (person), and by time (four dimensions).
• 3 factors are considered in multidimensionality:
  - Dimensions
  - Measures
  - Time

Multidimensional Database
• A database in which the data are organized specifically to support easy and quick multidimensional analysis. The data are transported from a data warehouse.
• A data cube is used to present data along some measure of interest: a two-dimensional, three-dimensional, or higher-dimensional object in which each dimension of the data represents a measure of interest.

Cube analysis
[Figure.]

Star Schema
• A single fact table and a single table for each dimension.
• Every fact points to one tuple in each of the dimensions and has additional attributes.
• Creates denormalized data structures.
• Fact constellation: multiple fact tables that share many dimension tables. Example: projected expense and actual expense may share dimension tables.

What, where, when, … etc.: Key Performance Indicators (KPIs)
(On the slide, measures/facts are shown in blue and dimensions in red.)
• How have sales developed compared to last year?
• In which regions are profits behind expectations?
• From which product groups will we draw the biggest profit?
• I would like to have a report on the most important key figures and their changes for the recent weeks!
• I would like to have a report on the most important market indicators for Saxony-Anhalt and Germany (purchasing power, unemployment rate, inflation rate, ...)
• What were the average sales per branch last year?

The Dimensional model
[Figure: measures in the centre, surrounded by the dimensions.]
• Measures (centre): Revenue, Cost, Sales
• How? Distribution channel
• When? Time
• What? Product
• Where? Geography / Region
• Who? Customer
Those are measures in the centre of the cube or cell, which you compute for all the dimensions in the surrounding boxes!

Sample Report from a Management Information System (MIS)
[Figure: sample report with the DIMENSIONS and the FACTS / MEASURES marked.]

Multidimensional operations
• Slicing: projections of the cube obtained by fixing one dimension to a single value
• Dicing: generating smaller cubes of different cells
• Roll up: aggregation along the different dimensions (new multidimensional cube)
• Drill down: selection of smaller cubes with higher granularity
• Rotation: different views of the aggregates of the cube

"Slicing and Dicing", Rotation
Slicing: Product = Turbo 1500. Facts: Sales Volume, by customer and month
Customer | 01/2011 | 02/2011 | 03/2011 | 04/2011 | 05/2011 | 06/2011 | Sum
Hauser OEG | 65 | 00 | 50 | 40 | 70 | 15 | 300
Huber und Söhne | 30 | 30 | 25 | 20 | 35 | 10 | 150
Maier AG | 450 | 390 | 320 | 280 | 400 | 100 | 2000
Müller GmbH | 320 | 300 | 250 | 210 | 345 | 75 | 1500
Sauber KG | 5 | 10 | 5 | 10 | 10 | | 40
Sum | 870 | 730 | 650 | 560 | 920 | 200 | 3990
Rotation: Customer = Müller GmbH. Facts: Sales Volume, by product and month
[Table: products Turbo 1500, Turbo 2000, Carry 500, Carry 1500, Serv 5000, Serv 9000 against months 01/2011 to 06/2011. Recoverable rows: Turbo 1500: 320, 300, 250, 210, 345, 75, sum 1500; Turbo 2000: 259, 204, 227, 200, 270, 60, sum 1300. The cells of the remaining rows cannot be placed reliably.]

"Drill Down" and "Roll Up"
Rolled up: Facts: Sales Volume, by customer group and product
Customer group | Turbo 1500 | Turbo 2000 | Carry 500 | Carry 1500 | Serv 5000 | Serv 9000 | Sum
Großabnehmer (large customers) | 3500 | 1450 | 170 | 70 | 5 | 3 | 5193
Kleinabnehmer (small customers) | 490 | 210 | 35 | 41 | 1 | 1 | 770
Total | 3390 | 1660 | 205 | 111 | 6 | 4 | 5963
Drilled down: Facts: Sales Volume, by customer group, customer and product
[Table: the same products, with each customer group broken down into its customers. Großabnehmer: Maier AG, Müller GmbH. Kleinabnehmer: Hauser OEG, Huber und Söhne, Sauber KG. The individual cells cannot be placed reliably.]

Outline: 5. Conclusion

CONCLUSION
• Data Types and Data Mining
• Exploring Data with Basic Statistical Descriptions
• Exploring Data with Visual Representations of Data Distributions
• Exploring Data with OLAP Analysis of Data

Course schedule: Date, Lecture (Mondays), Tutorials (in same week)
• Week 1. Lecture: Introduction to Data Mining
• Week 2. Lecture: Data Exploration & OLAP. Tutorial: Introduction and Python Crash Course Pt.1
• Week 3. Lecture: Data Preparation, Preprocessing and Cleansing (Announce Quiz 1). Tutorial: Introduction to Pandas
• Week 4. Lecture: Supervised Estimation: Linear Regression (Quiz 1). Tutorial: Data Cleaning & Preprocessing
• Week 5. Lecture: Supervised Classification I: Logistic Regression, Naïve Bayes (Announce Quiz 2). Tutorial: Correlation & Linear Regression
• Week 6. Lecture: Supervised Classification II: Decision Tree Induction, Prediction Evaluation, Performance, Results Presentation (Assignment 1 Submission + QUIZ 2). Tutorial: Problem Sets for Midterm
• 21 - 31 Oct.: Mid-Term Exam
• Week 7. Lecture: Supervised Classification III – kNN, Evaluation Methods. Tutorial: Decision Trees & Ensemble Methods
• Week 8. Lecture: Unsupervised Learning I: K-Means Clustering, Hierarchical Clustering and Density Based Clustering (DBSCAN) (Assignment 2 submission). Tutorial: Logistic Regression, Naïve Bayes, kNN
• Week 9. Lecture: Association Rule Mining. Tutorial: K-Means Clustering
• Week 10. Lecture: Text Mining and Sentiment Analysis (QUIZ 3). Tutorial: Association Rule Mining
• Week 11. Advanced Topics- Ensemble learners + Deep Learning / MLOps + Sentiment Analysis (Assignment 3 submission)
• Week 12. Recommender Systems; Unsupervised Learning II – Clustering Overview, DB-SCAN, Hierarchical Clustering; Deep Learning (last day of tutorials) + Problem Sets (Assignment 4 submission + PYTHON QUIZ)
• Week 13. Revision
• 21 Dec. – 9 Jan.: Final Exam

Course Material
CMS.guc.edu.eg

THANK YOU FOR YOUR ATTENTION
NEXT WEEK: Data preparation, preprocessing and cleansing
TUTORIAL: Introduction to fundamentals of Python
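As a recap of the multidimensional operations from the OLAP part of this lecture (slicing, dicing, roll up, drill down, rotation), the following is a minimal pandas sketch on a flat fact table. It is an illustration under assumptions, not how an OLAP server implements these operations: the column names are invented, and the six rows are a small subset of the sales volumes shown in the slicing example (Turbo 1500 and Turbo 2000, January and February 2011).

```python
import pandas as pd

# Small fact table: one row per (customer, product, month) with one measure.
sales = pd.DataFrame(
    [
        ("Großabnehmer", "Müller GmbH", "Turbo 1500", "01/2011", 320),
        ("Großabnehmer", "Müller GmbH", "Turbo 1500", "02/2011", 300),
        ("Großabnehmer", "Müller GmbH", "Turbo 2000", "01/2011", 259),
        ("Großabnehmer", "Müller GmbH", "Turbo 2000", "02/2011", 204),
        ("Kleinabnehmer", "Huber und Söhne", "Turbo 1500", "01/2011", 30),
        ("Kleinabnehmer", "Huber und Söhne", "Turbo 1500", "02/2011", 30),
    ],
    columns=["customer_group", "customer", "product", "month", "volume"],
)

# Slicing: fix one dimension to a single member.
slice_turbo1500 = sales[sales["product"] == "Turbo 1500"]

# Dicing: restrict several dimensions at once to get a smaller cube.
dice = sales[(sales["customer"] == "Müller GmbH") & (sales["month"] == "01/2011")]

# Roll up: aggregate customers into customer groups.
rollup = sales.groupby(["customer_group", "product"])["volume"].sum()

# Drill down: return to the finer customer level within each group.
drilldown = sales.groupby(["customer_group", "customer", "product"])["volume"].sum()

# Rotation: view the same slice with the axes swapped.
by_customer = slice_turbo1500.pivot_table(
    index="customer", columns="month", values="volume", aggfunc="sum"
)
rotated = by_customer.T

print(rollup)
print(rotated)
```

Roll up and drill down differ only in the grouping keys, which is why both reduce to `groupby` here; a MOLAP store would instead precompute such aggregates ahead of time.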
