Exploratory Data Analysis PDF
2024
Dhruva V. Raman
Summary
This document contains a lecture on exploratory data analysis (EDA), a crucial topic in data science. It covers fundamental concepts like covariance and correlation, as well as practical techniques for visualizing data and drawing insights.
Full Transcript
Exploratory data analysis
Introduction to Data Science 2024-2025
Dhruva V. Raman

Exam info
Learning to be a good data scientist ≠ passing the exam
Put the effort here

Comments on learning
Lecture with audio / Lecture without audio
Ambiguous! No full sentences
Text + speech > speech
Text: read a book/website!
Options: watch the lecture; watch the YouTube videos provided; use the internet

Confusing bits recap
A random variable is a function: outcomes of experiment → number

Correlation vs covariance
Covariance is an absolute quantity
cov(X, Y) = 5
cov(2X, 2Y) = 2 × 2 × 5 = 20
cov(cX, dY) = c × d × cov(X, Y)
What happens if I change axis units? (e.g. metres to cm)
[Figure: scatter plot of X against Y, split at 𝔼[X] and 𝔼[Y] into "smaller than mean" and "bigger than mean" regions; axis marks at 2m and 1.5m]

Correlation is scaled covariance
Correlation is invariant to scaling RVs
Doesn't matter what the numerical values are!
$\mathrm{corr}(X, Y) = \dfrac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} \in [-1, 1]$, i.e. between -1 and 1
corr(cX, dY) = corr(X, Y) for scale factors c, d of the same sign (any change of units)
Change axis units (e.g. metres to cm), and correlation won't change!

This week
Week 2 / Week 3 / Week 4
Exploratory data analysis / Probability theory / Statistical inference
Statistics / Visualisation / Hypothesis testing
Less time reviewing lecture content
More time… doing it!

Data is just a bunch of numbers with labels
Task of data analysis: how can we get insight from it?

Confirmatory data analysis
Already have hypothesis
Example: left-handed people have a lower life expectancy
CDA: does the data support it?
Data says yes/no!

Exploratory data analysis
No hypothesis, just data
How can I build relevant hypotheses for CDA?
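
Looking back at the covariance and correlation recap earlier in this lecture, the scaling rules can be checked numerically. The following is a minimal sketch on synthetic data; the variable names, the random seed, and the "heights in metres" framing are assumptions made for illustration and do not come from the slides.

```python
import numpy as np

# Synthetic, hypothetical data: x in metres, y loosely related to x.
rng = np.random.default_rng(0)
x = rng.normal(1.7, 0.1, size=1000)
y = 0.5 * x + rng.normal(0.0, 0.05, size=1000)


def cov(a, b):
    """Sample covariance of two 1-D arrays."""
    return np.cov(a, b)[0, 1]


def corr(a, b):
    """Pearson correlation of two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]


c, d = 100.0, 100.0  # metres -> centimetres on both axes

cov_xy = cov(x, y)
cov_scaled = cov(c * x, d * y)      # expected: c * d * cov(X, Y)
corr_xy = corr(x, y)
corr_scaled = corr(c * x, d * y)    # expected: unchanged
corr_flipped = corr(-x, y)          # opposite-sign scaling flips the sign

print(f"cov ratio after rescaling: {cov_scaled / cov_xy:.1f}")
print(f"corr before: {corr_xy:.4f}, after: {corr_scaled:.4f}")
print(f"corr(-X, Y): {corr_flipped:.4f}")
```

Covariance grows by the factor c × d, while correlation is untouched by the unit change. The last line shows why the invariance needs scale factors of the same sign: negating one variable flips the sign of the correlation.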

CDA vs EDA
CDA: tests hypotheses; settles questions; "inferential statistics"
EDA: looks for hypotheses; raises questions; "descriptive statistics"

Good EDA boils down to:
How do we extract interesting questions/hypotheses from our data?
Interesting patterns?
Interesting exceptions/anomalies/weirdnesses?
What is a pattern/weirdness?
How can we find them?

Case study: Anscombe quartet datasets
Summary: four different datasets of x/y values
Purpose/meaning? Wait on this…
Exploratory question: compare datasets?
(It's a pandas dataframe)

Dataframes can be interpreted as random variables
Each row is outcome of an experiment
Each column is a random variable
Summarise? $\mathbb{E}[X \mid \text{dataset}]$? $\mathbb{E}[Y \mid \text{dataset}]$? $\mathrm{Var}[X \mid \text{dataset}]$, $\mathrm{Var}[Y \mid \text{dataset}]$?

Numerical summaries
Expected x,y values for each dataset: $\mathbb{E}[X \mid \text{dataset}]$, $\mathbb{E}[Y \mid \text{dataset}]$
Means are identical
(Mean = expected value)

Numerical summaries
Standard deviation of x,y values for each dataset
Standard deviations are identical
(Standard deviation is square root of variance)
$\sigma_x = \sqrt{\mathrm{Var}[X \mid \text{dataset}]}$, $\sigma_y = …$

Graph of mean and standard deviation
Bar chart
Dataset is a categorical variable => good x-axis of bar plots
x,y are quantitative variables

Are x and y related?
So far, treated them as independent random variables
Correlation between x,y values for each dataset
$\mathrm{corr}(X, Y) = \dfrac{\mathrm{cov}(X, Y)}{\sigma_x \sigma_y} \in [-1, 1]$, i.e. between -1 and 1

Lines of best fit
Same slope, y-intercept!
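
The numerical summaries above can be reproduced with a short pandas sketch. The slides do not show how the lecture's dataframe is laid out, so the column names `dataset`, `x`, `y` and the labels I to IV are assumptions; the values are the standard published quartet, hardcoded so the example is self-contained. "Identical" here should be read as agreeing to roughly two decimal places.

```python
import numpy as np
import pandas as pd

# Anscombe's quartet; datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Long-format dataframe: one row per observation (assumed layout).
df = pd.concat(
    [pd.DataFrame({"dataset": name, "x": xs, "y": ys}) for name, (xs, ys) in quartet.items()],
    ignore_index=True,
)

# Mean and standard deviation of x and y, per dataset.
summary = df.groupby("dataset")[["x", "y"]].agg(["mean", "std"])

# Correlation and least-squares line of best fit, per dataset.
rows = {}
for name, group in df.groupby("dataset"):
    slope, intercept = np.polyfit(group["x"].to_numpy(), group["y"].to_numpy(), deg=1)
    rows[name] = {
        "corr": group["x"].corr(group["y"]),
        "slope": slope,
        "intercept": intercept,
    }
fits = pd.DataFrame(rows).T

print(summary.round(2))
print(fits.round(2))
```

Every row of both tables comes out the same after rounding, which is exactly why the summaries alone cannot tell the four datasets apart.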

Correlation interpretation: how well the data matches the line

Full plot: Anscombe's quartet
All of these are identical:
Expected value
Standard deviation/variance
Correlation/covariance
Line of best fit

Full plot
Humans are good at seeing visual patterns
Easy to see the different patterns
Disguised by the summaries
"The greatest value of a picture is when it forces us to notice what we never expected to see" (John Tukey, pioneer in EDA techniques)

4 R's of EDA
Revelation: seeing the data graphically!
Residuals = data - fit: …patterns are important, but so are deviations from the pattern!
Resistance: summaries should be resistant to outliers (IV)
Re-expression: coming soon…

Opinions on this graph
What's good? What's bad?

Opinions on this graph
What's changed? What's better?
Line of best fit suitable?

Opinions on this graph
What's changed? What's better?

Re-expression
Changing scales helps us see patterns
Always good to find linear pattern!
Anything wrong with presenting this?

Logarithmic scales give relative changes in data
Linear scale: distance is absolute difference in value (square = 0.2)
Logarithmic scale: distance is relative difference in value (square = multiply by 10)

Summary so far
4 Rs help us to get a 'feel' for the data
Good visualisation extracts more insight from the same data
Interpreting EDA can be dangerous

Visualisations of data
Good plot: quick insight into quantitative difference
Right > Left
Bigger difference for men

Visualisations of data …tell a biased story
Good plot: quick insight into quantitative difference …that the plotter wanted you to see

Visualisations of data …tell a biased story
Same data. More comforting?
What's different?

Visualisations of data …tell a biased story
Same data. Less comforting?
Always look at the axis scale!!

Tempting conclusion
Left-handed people live shorter lives!
Why? Lots of hypotheses!
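
Going back to the re-expression and residuals slides above, here is a minimal sketch of how a logarithmic scale can turn a curved pattern into a linear one. The data are synthetic and noise-free; the growth law `y = 3 * 10**(0.5 * x)` and all variable names are assumptions chosen for illustration, not taken from the lecture's graphs.

```python
import numpy as np

# Hypothetical exponential pattern: every 2 steps in x multiply y by 10.
x = np.arange(0, 11)
y = 3.0 * 10.0 ** (0.5 * x)

# Straight-line fit on the original (linear) scale.
lin_slope, lin_intercept = np.polyfit(x, y, deg=1)
lin_resid = y - (lin_slope * x + lin_intercept)      # residuals = data - fit

# Re-express y on a log10 scale, then fit a straight line again.
log_y = np.log10(y)
log_slope, log_intercept = np.polyfit(x, log_y, deg=1)
log_resid = log_y - (log_slope * x + log_intercept)  # residuals = data - fit

print(f"largest residual, linear scale: {np.max(np.abs(lin_resid)):.1f}")
print(f"largest residual, log scale:    {np.max(np.abs(log_resid)):.2e}")
print(f"log-scale slope: {log_slope:.3f} (log10 units per unit of x)")
```

On the linear scale the line of best fit leaves large, systematic residuals; after re-expression the same data lie on a straight line and equal distances along the y-axis correspond to equal multiplicative factors.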

Real reason
Historically, left-handers were discriminated against
LH-identifying population is younger

Interpreting EDA can be dangerous
Truth could not be inferred from data
Purely data-based conclusions can be misleading
Sometimes it's deliberate!
Need cultural/domain-specific knowledge

Another type of danger
Data suggests gender bias in admissions
CDA: why?
Paradox?

Simpson's Paradox
[Figure: plots of number of applicants and acceptance rate (%)]
Females disproportionately applied to harder courses
What's bad about plots? Solutions?

Simpson's Paradox
Neglect data that's irrelevant to your story (only after EDA!)
Challenge: convey the message in a single figure

Real life
Limited 1972 dataset not relevant to real life: don't draw real-life conclusions!
Our example: disguised lack of bias
Other real-world examples: disguised presence of bias

Concluding remarks
EDA is not a black-and-white subject
Use common sense!
4Rs are a guideline
Humility to domain experts: data isn't the whole story!

How should I approach EDA?
Creative numerical detective work!
Are there simplified ways of presenting the data? …easier to find patterns/anomalies
Are there unexpected patterns in the data? …how do I visualise them clearly
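
The Simpson's paradox slides above can be illustrated with a small sketch. The counts below are invented for illustration and are not the lecture's admissions dataset; the course labels `easy` and `hard` and the column names are likewise assumptions. The numbers are chosen so that one group applies mostly to the harder course.

```python
import pandas as pd

# Hypothetical admissions counts (NOT the dataset used in the lecture).
df = pd.DataFrame({
    "course": ["easy", "easy", "hard", "hard"],
    "gender": ["male", "female", "male", "female"],
    "applicants": [80, 20, 20, 80],
    "admitted": [48, 13, 4, 20],
})

# Acceptance rate within each course, one column per gender.
per_course = (
    df.assign(rate=df["admitted"] / df["applicants"])
      .pivot(index="course", columns="gender", values="rate")
)

# Acceptance rate after pooling the courses together.
totals = df.groupby("gender")[["applicants", "admitted"]].sum()
overall = totals["admitted"] / totals["applicants"]

print(per_course)
print(overall)
```

Within each course the female acceptance rate is higher, yet the pooled rate is lower, because most female applicants in this made-up example chose the course with the lower acceptance rate. Splitting by course is the step that exposes the reversal.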