Advanced Regression Analysis Lecture 2 PDF
Liangfei Qiu
Summary
This presentation introduces advanced regression analysis, data visualization, and artificial intelligence concepts. It includes examples and covers topics such as correlation, causation, selection bias, and clustering.
Full Transcript
Lecture 2: Advanced Regression Analysis
DATA ANALY/DECIS SUPP Strategy
Liangfei Qiu

A Little Humor …

Group Formation
- Group formation due on Oct 29
- One email per team
- Your team name
  - For the love of art
  - CollegeNet
- Team members’ names (in alphabetical order by last name)
- Members’ emails

Artificial Intelligence

Examples of Bad Data Visualization
- The price per barrel of oil is represented as the height of a three-dimensional barrel
- An obvious problem? Compare 1974 and 1979: less than a 30% increase
- [Figure: "Lie with Data Visualization": price per barrel by year, 1974 to 1979, vertical axis running from 10 to 14]
- [Figure: "Truth": price per barrel by year, 1974 to 1979, vertical axis running from 0 to 14]

How to Tell a Story with Data Visualization
- Example: Who has the biggest military budget in the world?
- Example: Who has the biggest military budget as a proportion of GDP?

Key Questions in Data Analysis
- Correlation does not imply causation
- When does correlation equal causation?
  - A randomized controlled trial (experiment)
    - Online controlled experiments (A/B testing; companies, academics)
    - Offline randomized policy evaluations (government, academics)
- When correlation does not equal causation, what should we do?
  - Use different data analysis methods
  - Methods are easy to implement using software (R packages, Stata packages), so why do we need people?
  - We need to interpret our results and apply methods in appropriate contexts

Recap: What is Causal Effect?
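One way to make the randomized-trial point concrete is a small simulation. The sketch below is illustrative only: the unobserved "motivation" variable, the true effect of 2.0, and the coefficient 3.0 are all assumptions made up for the example, not values from the lecture. It compares the naive treated-minus-control difference when treatment is assigned by coin flip with the same difference when units select themselves into treatment.

```python
import random

def naive_difference(randomized, n=20000, true_effect=2.0, seed=0):
    """Mean outcome of treated units minus mean outcome of control units."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        motivation = rng.gauss(0, 1)  # hypothetical unobserved confounder
        if randomized:
            gets_treatment = rng.random() < 0.5  # coin flip, as in an A/B test
        else:
            gets_treatment = motivation > 0      # motivated units select themselves
        outcome = 3.0 * motivation + true_effect * gets_treatment + rng.gauss(0, 1)
        (treated if gets_treatment else control).append(outcome)
    return sum(treated) / len(treated) - sum(control) / len(control)

experiment_estimate = naive_difference(randomized=True)
observational_estimate = naive_difference(randomized=False)
print(f"randomized assignment: {experiment_estimate:.2f}")
print(f"self-selected:         {observational_estimate:.2f}")
```

Under these assumptions the randomized estimate should land close to the true effect, while the self-selected estimate mixes the effect of treatment with the effect of motivation and overstates it considerably.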
Selection Bias and AI
- AI can make big mistakes
- “Automated Inference on Criminality Using Face Images” (2016)
  - Uses machine learning to detect features of the human face that are associated with “criminality”
  - The authors claim that their algorithm can use simple headshots to distinguish criminals from noncriminals with high accuracy (90%)
- What is wrong with it?
- Cesare Lombroso’s 1876 book Criminal Man
- Selection bias in training data sets
  - Over 1,800 photos of Chinese men
  - About 1,100 of these were noncriminals: their photographs were taken from job-based social networking sites and staff listings from professional firms
  - Over 700 of the subjects were convicted criminals: their photos were provided by police departments and taken from their official ID
  - The two samples are not comparable
- The algorithm: criminals have smaller angles θ between the nose and the corners of the mouth, and higher curvature ρ of the upper lip
- Training data set: the criminals are frowning or scowling; the noncriminals are faintly smiling
  - As one smiles, the corners of the mouth spread out and the upper lip straightens
  - An alternative, and far more plausible, explanation for the authors’ findings: noncriminals are smiling in their professional headshots, whereas criminals are not smiling in their government ID photographs
- They have not invented a criminality detector; they have invented a smile detector.
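The smile-detector argument can be mimicked with synthetic data. This is a rough sketch, not the paper's method: the angle distributions (mean 60 when smiling, 50 otherwise), the 40% share of criminals, and the midpoint-threshold classifier are all assumptions chosen for illustration. In the simulation the mouth angle depends only on whether the person is smiling, never on criminality.

```python
import random

rng = random.Random(42)

def mouth_angle(smiling):
    # Hypothetical feature: depends on expression only, not on criminality.
    return rng.gauss(60.0 if smiling else 50.0, 3.0)

def make_sample(n, id_photos_for_everyone):
    """Return a list of (angle, is_criminal) pairs."""
    rows = []
    for _ in range(n):
        is_criminal = rng.random() < 0.4
        # Biased collection: noncriminals come from smiling professional headshots,
        # unless everyone is photographed under the same ID-photo conditions.
        smiling = (not is_criminal) and (not id_photos_for_everyone)
        rows.append((mouth_angle(smiling), is_criminal))
    return rows

def fit_threshold(rows):
    """Midpoint between the two class means of the angle."""
    criminal = [a for a, c in rows if c]
    noncriminal = [a for a, c in rows if not c]
    return (sum(criminal) / len(criminal) + sum(noncriminal) / len(noncriminal)) / 2

def accuracy(rows, threshold):
    # Predict "criminal" when the angle falls below the threshold.
    return sum((a < threshold) == c for a, c in rows) / len(rows)

threshold = fit_threshold(make_sample(1800, id_photos_for_everyone=False))
biased_accuracy = accuracy(make_sample(1800, False), threshold)
comparable_accuracy = accuracy(make_sample(1800, True), threshold)
print(f"accuracy on similarly biased photos: {biased_accuracy:.2f}")
print(f"accuracy when photos are comparable: {comparable_accuracy:.2f}")
```

The classifier looks accurate as long as the test photos share the same selection bias as the training photos. Once both groups are photographed under the same conditions, the accuracy collapses, because the only signal it ever learned was the smile.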
Goals for today
- Data and Data-Generating Process
- Clustering Analysis

Data and Data-Generating Process
- A large class of paradoxes reflects the tension between causation and association
- Key lessons:
  - Data and the data-generating process are different
  - The same data with different data-generating processes can lead to different results

A New Question: Smoking for Newborns
- A puzzling finding: the birth-weight paradox
- Yerushalmy (1959) collected data on 15,000 children in San Francisco
- The low-birth-weight babies of smoking mothers had a better survival rate than those of non-smokers?
- The mother’s smoking actually had a protective effect?
- Smoking is harmful in that it contributes to low birth weight, but other causes of low birth weight, such as genetic diseases, are much more harmful
- Two explanations for low birth weight: (i) a smoking mother and (ii) genetic diseases
  - If the mother is a smoker, smoking explains away the low weight and reduces the likelihood of genetic diseases
  - If not, the low weight increases the likelihood of genetic diseases

Berkson’s Paradox
- Berkson (1946): even if two diseases have no relation to each other in the general population, they can appear to be associated among patients in a hospital
- Neither Disease 1 nor Disease 2 is severe enough to cause hospitalization, but the combination is
- Disease 1 is highly correlated with Disease 2 in the hospitalized population
- [Diagram: Disease 1 → Hospitalization ← Disease 2]
- By performing a study on patients who are hospitalized, we are controlling for hospitalization
- It is wrong to focus only on patients who are hospitalized

Simpson’s Paradox
- Simpson (1951): BBG drug (not a randomized trial)
- Women: 5% in the control group had a heart attack; 7.5% in the treated group (took the drug) had a heart attack
- Men: 30% in the control group had a heart attack; 40% in the treated group had a heart attack
- The drug is associated with a higher risk of heart attack?
- Look at the whole sample: 22% had a heart attack in the control group, and 18% did in the treated group
- Does the drug cause or prevent heart attacks? Bad for men, bad for women, good for people?
- To understand Simpson’s paradox, we need to look beyond the data to the data-generating process
- [Diagram: Gender → Drug, Gender → Heart Attack, Drug → Heart Attack]
- In the study, women had a preference for taking the drug, and men preferred not to take it
- Gender is a confounder, so we need to control for gender
- Controlling for gender means looking at each subsample by gender, or including a gender variable in a regression
- Bad for men, bad for women, bad for people!
- In this case, the trends in the subsamples represent causality; the trends in the whole sample represent correlations
- Simpson’s paradox alerts us: at least one of the statistical trends (in the whole sample, in the subsamples, or both) cannot represent causal effects
- The essence of Simpson’s paradox is whether to control for a variable (whether to look at subsamples): correlation vs. causality. Trends in a subsample or in the whole sample may just be correlations.
- Sometimes looking at the subsamples is correct (our previous example); sometimes looking at the whole sample is correct (our next example)
- The key: what is a good control, and what is a bad control? This has important implications for regression analysis (Y, X, X1, X2, …)

Simpson’s Paradox: A Similar Example
- [Diagram: Drug → Blood Pressure → Heart Attack, Drug → Heart Attack]
- Drug B is supposed to reduce blood pressure. The researcher wanted to see if it also might reduce heart attack risk
- The same numbers: the data is the same, but the data-generating process is different!
- Bad for low blood pressure patients
- Bad for high blood pressure patients
- Good for the whole sample
- Good or bad? Look at the whole sample or the subsamples?
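The percentages quoted in both versions of the drug example can be reproduced from one small table of counts. The counts below are an assumption: the slides give only percentages, and these group sizes were picked because they yield 5%, 7.5%, 30%, and 40% within the subsamples and roughly 22% and 18% overall. The labels group_1 and group_2 are placeholders standing for women and men in the first example, or low and high blood pressure in the second.

```python
# Hypothetical counts (heart attacks, group size), chosen to match the slide percentages.
counts = {
    ("group_1", "control"): (1, 20),
    ("group_1", "treated"): (3, 40),
    ("group_2", "control"): (12, 40),
    ("group_2", "treated"): (8, 20),
}
strata = ("group_1", "group_2")

def rate(stratum, arm):
    """Heart attack rate within one subsample."""
    attacks, size = counts[(stratum, arm)]
    return attacks / size

def pooled_rate(arm):
    """Heart attack rate in the whole sample, ignoring the stratum."""
    attacks = sum(a for (s, g), (a, n) in counts.items() if g == arm)
    size = sum(n for (s, g), (a, n) in counts.items() if g == arm)
    return attacks / size

def adjusted_rate(arm):
    """Stratum rates averaged with each stratum's share of the full sample."""
    total = sum(n for (a, n) in counts.values())
    result = 0.0
    for stratum in strata:
        share = sum(n for (s, g), (a, n) in counts.items() if s == stratum) / total
        result += share * rate(stratum, arm)
    return result

for stratum in strata:
    print(stratum, f"control {rate(stratum, 'control'):.1%}, treated {rate(stratum, 'treated'):.1%}")
print("whole sample", f"control {pooled_rate('control'):.1%}, treated {pooled_rate('treated'):.1%}")
print("adjusted    ", f"control {adjusted_rate('control'):.1%}, treated {adjusted_rate('treated'):.1%}")
```

The arithmetic is identical in both examples; the code cannot say which comparison is the causal one. Controlling for the grouping variable (the adjusted rates) is appropriate when it is a confounder, while the whole-sample rates are appropriate when it lies on the causal path from the drug to the outcome.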
- Blood pressure is a mediator rather than a confounder
- If we control for blood pressure, we disable one of the causal paths
- Our conclusion is the opposite of the previous example: Drug B works, and we need to look at the whole sample instead of the subsamples
- Blood pressure is a bad control; we should not control for it

Simpson’s Paradox: Example

Two points in today’s class
1. Correlation does not imply causation
2. Sometimes it is not enough to look at the data only; we need to understand the data-generating process (Simpson’s paradox)

Clustering Analysis
- Suppose you are a marketing manager at AT&T
- From the complete set of customers, identify different sub-groups (clusters) with similar preferences
- An appropriate plan can be tailored to each cluster
- [Figure: customer clusters labeled with calling patterns: many calls after 2PM; most calls are local; very long calls (50 minutes average); long distance; most calls before 1PM; international calls; mostly local calls; short calls (less than 5 minutes average); more than 15 minutes]

Properties of a cellphone plan:
1. Valuable for customers: addresses some needs that can be discerned from the data
2. Addresses the needs of a group of customers (rather than designing a program for a handful of customers)

Motivations for Clustering Analysis
- Methods that partition a set of objects (e.g., customers) into cohesive groups (clusters), based on the (dis)similarity among objects

Clustering Analysis
- Algorithms that partition the examples into clusters
- Inter-cluster distances are maximized; intra-cluster distances are minimized
- Within-cluster similarity: customers in each cluster have similar attribute values
- Inter-cluster differences: customers in a given cluster have different attribute values than those of customers in other clusters

Whisky Types
- Different whiskies vary along a variety of aromatic features, e.g., Body (Light-Heavy), Sweetness (Dry-Sweet), Smoky (Peaty), etc.
- How to define types of whiskies?
- Whiskies of a given type should have similar aromatic properties
- Equally important when defining types of whiskies: whiskies of different types should differ with respect to their aromatic properties
- There should be many whiskies that fit each type

Whisky Classified
Classification of Single Malt Whiskies by clustering (from Whisky Classified)
- Cluster A (Full-Bodied, Medium-Sweet, Pronounced Sherry with Fruity, Spicy, Malty Notes and Nutty, Smoky Hints): Balmenach, Dailuaine, Dalmore, Glendronach, Macallan, Mortlach, Royal Lochnagar
- Cluster B (Medium-Bodied, Medium-Sweet, with Nutty, Malty, Floral, Honey and Fruity Notes): Aberfeldy, Aberlour, Ben Nevis, Benrinnes, Benromach, Blair Athol, Cragganmore, Edradour, Glenfarclas, Glenturret, Knockando, Longmorn, Scapa, Strathisla
- Cluster C (Medium-Bodied, Medium-Sweet, with Fruity, Floral, Honey, Malty Notes and Spicy Hints): Balvenie, Benriach, Dalwhinnie, Glendullan, Glen Elgin, Glenlivet, Glen Ord, Linkwood, Royal Brackla
- Cluster D (Light, Medium-Sweet, Low or No Peat, with Fruity, Floral, Malty Notes and Nutty Hints): An Cnoc, Auchentoshan, Aultmore, Cardhu, Glengoyne, Glen Grant, Mannochmore, Speyside, Tamdhu, Tobermory

Clustering Algorithms
- Two basic sorts: partitional and hierarchical algorithms

Partitional Clustering
- Partitioning examples into non-overlapping clusters: each data object (example) is in exactly one cluster
- [Figure: A Partitional Clustering]

K-means Algorithm
- The K-means algorithm is a partitional clustering approach
- The number of clusters, K, must be specified in advance by the analyst
- The basic algorithm is simple
- Step 1: Choose k and the initial cluster seeds (here k = 3)
- Step 2: The initial clusters are formed by assigning each point to the closest seed
- Step 3: In the update step, the cluster centroid is calculated as the average value of the cluster members
- Step 4: In the reassignment step, each data point is assigned to the cluster whose centroid is closest to it. The algorithm terminates when no data points are reassigned.

Demo: 3-Means Algorithm
- The examples are color-coded based on the clustering/grouping the algorithm identifies
- Initially, the algorithm randomly selects three points and assumes that each represents the center (centroid) of one of the three clusters
- Then, the algorithm iteratively follows two steps. At each iteration the algorithm refines the grouping/clustering
- [Figure: scatter plot of y against x, shown for iterations 1 to 6]

Example
- [Figure: customers plotted by frequency of visits and average amount spent]
- Arbitrarily assigning two initial seeds, A and B
- Important concept: the centroid, the “average point” of the cluster. The centroid of a given cluster is the point corresponding to the average visit frequency and the average spending amount of all the customers in the cluster
- [Figures: successive positions of centroids A and B on the frequency vs. average amount plot]

Importance of Choosing Initial Seeds
- [Figures: scatter plots of y against x for a run ending at iteration 6 and a run ending at iteration 5, then the two results side by side]

Example: Cluster Analysis
- Step 1: Describe the data
  - The graph seems to indicate that there are some distinct groups
- Step 2: Clusters (Stata syntax)
      cluster k flex speed strength, k(4) name(group) s(r)
  - k(4): number of clusters
  - s(r): random starting seeds
- Step 3: Evaluate cluster performance
      tabstat flex speed strength, by(group) stat(mean)
  - Group 1 is already doing well in flexibility and speed but will need extra strength training.
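The K-means steps described earlier in this lecture can be written out in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the routine Stata runs: the function names, the synthetic two-dimensional data (three groups of 50 points), and the two sets of starting seeds are all made up for illustration. The seeds are passed in explicitly so that the effect of the starting points can be examined.

```python
import random

def sq_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, seeds, max_iter=100):
    """Plain K-means on (x, y) tuples; returns (centroids, labels)."""
    centroids = list(seeds)  # Step 1: k = len(seeds) initial seeds
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign each point to the cluster with the closest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        if new_labels == labels:  # no point was reassigned: terminate
            break
        labels = new_labels
        # Step 3: move each centroid to the average of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return centroids, labels

def within_ss(points, centroids, labels):
    """Total squared distance from each point to its assigned centroid."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

# Synthetic data: 50 points around each of three hypothetical centers.
rng = random.Random(1)
true_centers = [(-1.5, 0.5), (0.0, 2.5), (1.5, 0.5)]
points = [(cx + rng.gauss(0, 0.15), cy + rng.gauss(0, 0.15))
          for cx, cy in true_centers for _ in range(50)]

spread_seeds = [points[0], points[50], points[100]]  # one seed per group
bunched_seeds = [points[0], points[1], points[2]]    # all seeds in one group

good_centroids, good_labels = kmeans(points, spread_seeds)
bad_centroids, bad_labels = kmeans(points, bunched_seeds)
good_ss = within_ss(points, good_centroids, good_labels)
bad_ss = within_ss(points, bad_centroids, bad_labels)
print(f"spread seeds:  within-cluster SS = {good_ss:.2f}")
print(f"bunched seeds: within-cluster SS = {bad_ss:.2f}")
```

Because K-means only finds a local optimum, different starting seeds can end in different groupings. A common safeguard is to run the algorithm from several starting points and keep the run with the smallest within-cluster sum of squares.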
Example Step 3: Evaluate cluster performance
    graph matrix flex speed strength, mlabel(group)
[Figure: scatterplot matrix of flexibility, speed, and strength, each on a 0 to 10 scale, with points labeled by cluster number (1 to 4)]

Cluster Example 2
- You have just started a women’s club. Thirty women from throughout the community have sent in their requests to join. You have them fill out a questionnaire with 35 yes–no questions relating to sports, music, reading, and hobbies.
- In planning the first meeting of the club, you want to assign seats at the five lunch tables on the basis of shared interests among the women. You really want people placed together who share the same positive interests, not who share dislikes.
- Step 1: Describe the data
- Step 2: Clusters
      cluster k bike-fish, k(5) name(group) s(r)
- Step 3: Evaluate cluster performance
      tabstat bike-foot, by(group) stat(mean)

For next class
Have a great Halloween!