GED110 Lecture 9 - Predicting Human Behaviour with Big Data - PDF
Summary
This lecture covers retrieving useful information from big data using infographics and generative AI. The lecture materials detail the week-by-week outline for the course, including quizzes and data analysis tutorials.
Full Transcript
Lecture 9: Retrieving Useful Data; Visualise Big Data with Infographics and Generative AI
Lecture 9 GED110 Predicting Human Behaviour with Big Data

Week 10 Outline
  Last Week Recap
  Modelling Optimization (3NF)
  Classwork 2
  Lecture 9: Retrieving Useful Data; Visualise Big Data with Infographics and Generative AI

Week 9 Recap
  Third Normal Form prevents data duplication and wasted space.
  The mapping steps covered in the lectures already satisfy the Second Normal Form.
  Three general standards for updating the Physical Design:
  1. Are there any fields in a table that describe attributes irrelevant to the entity?
  2. Are there any fields that are not required in the table design?
  3. Are there any fields whose data has a limited set of possible values?

Week 11 Quiz
  Open-book quiz (20%, individual)
  Starts after some announcements on the Group Project requirements
  1.5 hours (on average you can complete it in 1 hour)
  A mix of:
    Multiple-choice questions (can be found somewhere in the lecture slides)
    Practical questions on ER diagrams and queries

Retrieving Useful Data

Database Implementation (Tutorial 9a)
  From the Logical Data Model to the Physical Design, you have successfully analyzed the data needs of a system.
  In this tutorial, we will implement your Physical Design in an RDBMS (i.e. Microsoft Access), including:
    Creating tables
    Creating relationships
    Procedures to insert, update and delete data
    Creating simple queries
  Please refer to the Tutorial 9a file for the demonstration.

Advanced Queries (Tutorial 9b)
  Due to the nature of relational database design:
    Pieces of data are usually located in different tables
    It is very common to retrieve data from multiple tables in the same query
  Therefore, we will cover some
  methods to retrieve data from multiple tables and to fine-tune table designs:
    Aggregation functions
    Inner join and outer join
    Sub-queries
  Please refer to the Tutorial 9b file for the demonstration.

Visualise Big Data

Business Analytics
  There is no single way to define Business Analytics. We can think of Business Analytics as:
    Delivering the right decision support (with data) to the right people at the right time
    The scientific process of transforming data into insight for better decision making
  What is decision making?
    A process of choosing among two or more alternative actions for the purpose of attaining business goals
Lecture 8 BUS 351 Data Analytics for Marketing

Simon's Model of Decision Making
  Herbert A. Simon (1916-2001)
    Nobel Prize in Economics in 1978 "for his pioneering research into the decision-making process within economic organizations"
  Decision making is a recursive process:
    Intelligence: identifies the problem or opportunity
    Design: inventing or developing alternatives
    Choice: compare and select solution
    Implementation
    Review

From Data to Knowledge
  Wisdom: application of principles
    Traffic rules and traffic lights exist to protect all road users.
    If I don't stop at the red signal, I will cause a road accident and people will get hurt.
  Knowledge: context and patterns
    I have to stop the car at the red signal.
    The signal will turn green again in a short moment.
  Information: meaning and relationships
    The traffic light in front of me just turned red.
  Data: raw input
    Traffic Light #34286, 20:23:45 23/Mar/2022
    L1 on, L2 off, L3 off

Business and Data Analysis
  Modern organizations are usually managed by facts for performance evaluation, improvement, and decision making:
    Data availability
    Time and
    effort
    Expectation of the users
  Data analysis
    Data: key inputs to decision models
    Analysis: extracting meaning from data to support evaluation and decision making
  The academic disciplines specializing in this area are called management science, decision science, and data science.

Types of Data
  Non-numerical
    Categorical (nominal) data: data that fall into categories
      Geographical region, gender, brand, etc.
      No quantitative relationships among categories
    Ordinal data: data ordered or ranked according to some rule
      Education level
      Categories can be compared with one another
  Numerical
    Interval data: numerical data with no natural zero
      Temperature, time, survey scales that are assumed to be interval
      Ratios are meaningless (50 degrees is not twice as hot as 25 degrees); differences are meaningful
    Ratio data: numerical data that have a natural zero
      Sales, length, weight, most business and economic data

Analytical Methods
  Descriptive statistics
    Graphical and numerical procedures to summarize and process data
  Predictive statistics
    Using data to make predictions, forecasts, and estimates to assist decision making. It is often necessary to report how confident you are in the findings.
    Linear Regression
    Logit/Logistic Regression
    Time Series Analysis
    Survival Analysis
    And many more …

Business Data Analytics

Related Software for Data Analytics
  Microsoft Excel (simple data handling and graphs)
  Microsoft Access (personal database)
  Microsoft SQL Server (business-level database)
  Oracle (business-level database)
  IBM SPSS (statistical analysis packages)
  SAS Enterprise Miner (data mining)
  R (programming language, data visualization)
  Etc.
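
The interval-versus-ratio distinction from the Types of Data slide can be checked with a little arithmetic. The sketch below is only an illustration: it assumes the temperatures are in Celsius, and the helper names and the weight example are invented here, not taken from the lecture. It shows that a ratio of two interval values changes when the unit changes, while equal differences stay equal and ratios of ratio-scale values are unaffected.

```python
import math


def celsius_to_fahrenheit(celsius):
    """Convert a Celsius reading to Fahrenheit (hypothetical helper)."""
    return celsius * 9 / 5 + 32


def kilograms_to_pounds(kilograms):
    """Convert kilograms to pounds using an approximate factor."""
    return kilograms * 2.20462


# Interval data: the zero point of the Celsius scale is arbitrary.
freezing_c, cool_c, warm_c = 0.0, 25.0, 50.0
freezing_f = celsius_to_fahrenheit(freezing_c)
cool_f = celsius_to_fahrenheit(cool_c)
warm_f = celsius_to_fahrenheit(warm_c)

ratio_celsius = warm_c / cool_c        # 2.0 in Celsius
ratio_fahrenheit = warm_f / cool_f     # 122 / 77, no longer 2

# Differences are meaningful: two equal gaps stay equal after conversion.
equal_gaps_c = (warm_c - cool_c) == (cool_c - freezing_c)
equal_gaps_f = (warm_f - cool_f) == (cool_f - freezing_f)

# Ratio data: weight has a natural zero, so "twice as heavy" survives a unit change.
light_kg, heavy_kg = 5.0, 10.0
ratio_kg = heavy_kg / light_kg
ratio_lb = kilograms_to_pounds(heavy_kg) / kilograms_to_pounds(light_kg)

print("Temperature ratio in Celsius:", ratio_celsius)
print("Temperature ratio in Fahrenheit:", round(ratio_fahrenheit, 3))
print("Equal gaps preserved:", equal_gaps_c and equal_gaps_f)
print("Weight ratio unchanged:", math.isclose(ratio_kg, ratio_lb))
```

Because the temperature ratio depends on the unit chosen, statements such as "twice as hot" carry no meaning for interval data, whereas "twice as heavy" holds in any unit.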
Types of Business Problems
  Marketing
    Sales
    Segmentation
    Promotion
  Finance
    Valuation
    Stock Price
    Credit and Interest
  Risk Management
    Credit Analysis
    Portfolio
    Fraud Detection
  Human Resources
    Recruitment
    Promotion
    Turnover

Sales Analysis
  You work for a supermarket, which has an ERP system that records the company's daily transactions.
  Your boss wants to know the operating status of the company.
  You are assigned to give him a report.

What does your boss want to see in market reports?
  Contents
    Statistics
    Statistics for subcategories: pivot table
  Presentation
    Charts
    Logic

Descriptive Analytics
  The first target of descriptive analytics is to present data in a form that makes sense to people, so that we can think further about the data and extract information to support decision making.
  There is a tendency to race through "descriptive statistics".
  Descriptive statistics are at once simple, complex and important.
  Three ways to explore a data set:
    Numerical summary measures (descriptive statistics): for example counts, percentages, averages, and measures of variability.
    Tables of summary measures: for example a pivot table.
    A variety of graphs, including bar charts, pie charts, histograms, scatterplots, and time series graphs (data visualization).
  Logic: use the theories or guidelines you learned in the domain knowledge to summarize data.

Measures for Non-Numerical Variables
  Only a few possibilities exist for describing categorical variables, all based on counting.
    Count the number of categories
    Count the number of observations in each category
  Frequency distribution: frequency of observations in non-overlapping classes/cells
  Excel's COUNTIF function.
    The function takes two arguments: the data range (where to count) and the criterion (what to count).

Statistics for Numerical Variables
  Single variable
    Measures of central tendency
      Mean, Median, Mode
      Minimum, Maximum, Percentiles, and Quartiles
    Measures of dispersion/variability
      Range
      Variance
      Standard deviation
  Multiple variables
    Covariance and correlation

Measures of Central Tendency
  Arithmetic mean
    Excel function AVERAGE(range)
    Affected by unusually large or small observations (outliers)
  Median
    Middle value when data are ordered from smallest to largest
    Not affected by extremes
    Excel function MEDIAN(range)
  Mode
    Observation that occurs most frequently
    Useful when data consist of a small number of unique values

Minimum, Maximum, Percentiles, and Quartiles
  Minimum and maximum
  For any percentage p, the pth percentile is the value such that a percentage p of all values are less than it.
  The quartiles divide the data into four groups, each with (approximately) a quarter of all observations.
  The first, second and third quartiles are the percentiles corresponding to p = 25%, p = 50%, and p = 75%.
  By definition, the second quartile (p = 50%) is equal to the median.
  MIN and MAX functions
  PERCENTILE and QUARTILE functions

Measures of Dispersion: Range
  The range is the maximum value minus the minimum value.
  The interquartile range (IQR) is the third quartile minus the first quartile.
    Thus, it is the range of the middle 50% of the data.
    It is less sensitive to extreme values than the range.

Measures of Dispersion
  Variance
  Standard deviation = square root of variance
    The standard deviation has the same units of measurement as the original data, unlike the variance.
  Excel functions STDEVP (population) and STDEV (sample)

Three Rules for Interpreting the Standard Deviation
  These three rules work well only when the data are normally distributed:
    Approximately 68% of the observations are within one standard deviation of the mean, i.e. within the interval X̄ ± s.
    Approximately 95% of the observations are within two standard deviations of the mean, i.e. within the interval X̄ ± 2s.
    Approximately 99.7% of the observations are within three standard deviations of the mean, i.e. within the interval X̄ ± 3s.

Histograms
  A histogram is the most common type of chart for showing the distribution of a numerical variable.
  It is based on binning the variable, that is, dividing it up into discrete categories.
  It is a column chart of the counts in the various categories (with no gaps between the vertical bars).
  A histogram is great for showing the shape of a distribution: whether the distribution is symmetric or skewed in one direction.

Box Plots
  A box plot (or box-whisker plot) is an alternative type of chart for showing the distribution of a variable.
  [Figure: elements of a generic box plot]

Box Plots Example
  Box plots are extremely useful when you compare a variable across different categories.
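
The dispersion measures and the three standard-deviation rules above can be tried out in a few lines. The lecture itself uses Excel; the sketch below swaps in Python's standard `statistics` module as an illustration only. The `sales` figures are made up, the simulated normal data (mean 100, standard deviation 15) are an arbitrary choice, and using the "inclusive" quantile method as the closest match to Excel's QUARTILE is an assumption.

```python
import random
import statistics

# Hypothetical daily sales figures, invented for this sketch.
sales = [12, 15, 11, 14, 13, 18, 16, 15, 14, 40]

# Range: maximum minus minimum, so it is driven by the extreme value 40.
data_range = max(sales) - min(sales)

# Quartiles and interquartile range (the spread of the middle 50%).
q1, q2, q3 = statistics.quantiles(sales, n=4, method="inclusive")
iqr = q3 - q1

# Standard deviation is the square root of the variance.
sample_variance = statistics.variance(sales)
sample_sd = statistics.stdev(sales)        # sample version, like STDEV
population_sd = statistics.pstdev(sales)   # population version, like STDEVP

print("Range:", data_range, "IQR:", iqr)
print("Second quartile equals median:", q2 == statistics.median(sales))
print("Sample SD:", round(sample_sd, 2), "Population SD:", round(population_sd, 2))

# Check the three rules on simulated, normally distributed data.
random.seed(0)
normal_values = [random.gauss(100, 15) for _ in range(10_000)]
mean = statistics.fmean(normal_values)
s = statistics.stdev(normal_values)


def share_within(k):
    """Fraction of simulated values within k standard deviations of the mean."""
    inside = sum(1 for x in normal_values if abs(x - mean) <= k * s)
    return inside / len(normal_values)


for k in (1, 2, 3):
    print(f"Within {k} SD: {share_within(k):.3f}")
```

On the small `sales` list the range is pulled up by the single value 40 while the IQR is not, which is the sense in which the IQR is less sensitive to extreme values.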
Outliers
  An outlier is a value or an entire observation (row) that lies well outside the norm.
  Some statisticians define an outlier as any value more than three standard deviations from the mean, but this is only a rule of thumb.
  Even if values are not unusual by themselves, there still might be unusual combinations of values.
    For instance, 72cm is a normal value for height. But a combination of 15 years old and 72cm is unusual. Additional attention may be needed.
  When dealing with outliers, it is best to run the analyses two ways: with the outliers and without them.

Chart Types in Excel
  Pie/Doughnut
    Description: Charts a series of values as a percentage of the whole
    Use to: Illustrate the contribution of each value in the data set to a total. The number of values in the data set should be minimal (approximately fewer than 10)
  Line
    Description: Charts a series of values across a set of categories as points connected by a line
    Use to: Illustrate one or more trends over time (i.e. categories should be a unit of time such as hours, days, months, years, and so on)
  Column/Bar
    Description: Charts a series of values across a set of categories using vertical columns or horizontal bars
    Use to: Illustrate a single data set or compare values of multiple data sets across the same set of categories
  Area
    Description: Combines the properties of a line and pie chart to chart a series of values across a set of categories as a continuous area
    Use to: Illustrate a trend across a set of categories or time
  Scatter/Bubble
    Description: Charts x, y coordinate pairs
    Use to: Illustrate the dependence of one set of values (Y) on another (X)
  Radar
    Description: Charts changes in values relative to a center point
    Use to: Illustrate the differences of each value from the average value in a distribution.
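
The advice on the Outliers slide (flag values more than three standard deviations from the mean, then run the analysis both with and without them) can be sketched as follows. This is an illustration in Python rather than the Excel workflow used in the lecture, and the `daily_sales` figures are invented. The three-standard-deviation cut-off is the rule of thumb from the slide, not a universal definition.

```python
import statistics

# Hypothetical daily sales with one suspicious entry (500), invented for this sketch.
daily_sales = [98, 102, 101, 99, 100, 103, 97, 100, 101, 99,
               102, 98, 100, 101, 99, 100, 102, 98, 100, 500]

mean_all = statistics.mean(daily_sales)
sd_all = statistics.stdev(daily_sales)

# Rule of thumb: flag values more than three standard deviations from the mean.
outliers = [x for x in daily_sales if abs(x - mean_all) > 3 * sd_all]
cleaned = [x for x in daily_sales if abs(x - mean_all) <= 3 * sd_all]

# Run the analysis two ways: with the outliers and without them.
summary = {
    "with outliers": (statistics.mean(daily_sales), statistics.median(daily_sales)),
    "without outliers": (statistics.mean(cleaned), statistics.median(cleaned)),
}

print("Flagged outliers:", outliers)
for label, (mean_value, median_value) in summary.items():
    print(f"{label}: mean = {mean_value}, median = {median_value}")
```

In this made-up data the mean shifts noticeably once the flagged value is dropped while the median does not move, which matches the earlier point that the mean is affected by extremes and the median is not. Note that the rule needs a reasonably long list: with very few observations no single value can lie three sample standard deviations from the mean.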
Pie Charts
  Useful for attributes to show relative proportions
  Univariate analysis
  Data represented as an area in a circle expressed as a percentage of a whole.
  Number of categories should be kept to a minimum (