Data Analysis - Quantitative and Qualitative Variables

Data analysis
Michele Pezzoni

Some definitions

Population: a set of individuals defined by a common characteristic.
– Voters
– French students
– European regions

Sample: a subset of a population.
– A given number of voters drawn at random
– A given number of students drawn at random
– A given number of regions drawn at random

Observation: each element of the population or of the sample.
– Each voter
– Each student
– Each region

Size: the number of individuals in the sample or in the population. It is denoted "n" for the sample and "N" for the population.
– A population of N = 50 million voters and a sample of n = 100 voters
– A population of N = 10 million students and a sample of n = 10,000 students
– A population of N = 50 regions and a sample of n = 10 regions

Variable: the feature that we want to study.
– Voting behaviour
– Students' grades
– Car accident death rates

Value: the different “ways of being” of a variable.
– Vote: Republican or Democrat
– Grade: 0, 1, 2, 3, …, 20
– Standardized mortality rate per 100,000 inhabitants: 1, 2, 3, …

Dataframe: Cars

    mtcars=data.frame(mtcars)

[Figure: the mtcars data frame, with labels Variable, Value, Observation, and Sample size n = 32]

Dataset description: The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

3 types of datasets

1. “Cross-section”: n individuals, m variables, 1 time interval t. Typical fields: microeconomics, industrial economics, sociology, regional economics.
2. “Time series”: 1 individual, m variables, n time intervals (t1…tn). Typical fields: macroeconomics, finance.
3. “Panel data”: n individuals, m variables, k time intervals (t1…tk).

Cross-section
Several individuals observed only once.
[Figure: data table with labels Variables and Observations]

Time series
An individual observed over time. Example: French GDP.

Panel data
Several individuals observed over time. Example: GDP of the OECD countries.
[Figure: data table with labels Individual, Time, Variables, Observation]

Two types of variables

Quantitative variables: variables that are expressed as numeric values and that can be sorted in a precise order.
– Age of an individual
– GDP of a country
– CO2 emissions
– Fuel consumption

Qualitative variables: variables that, with their values, identify groups of observations.
– Political preferences
– Gender of an individual
– Opinions

Quantitative variables

Continuous (quantitative) variables: variables that can take any value within a given interval of real numbers.
– The height of an individual
– The temperature

Discrete (quantitative) variables: variables that can take a limited set of values. Most of the time, these values are integers.
– The number of children in a family
– The number of TVs per house

mtcars variable description

Variable types:
– mpg is a continuous variable
– cyl is a discrete quantitative variable
– am (automatic / manual gearbox) is a qualitative variable

Car consumption

We are interested in investigating the fuel consumption level of the cars included in our sample.
– What is the average consumption of the cars in our sample?
– Is the consumption level homogeneous for all cars?

The fuel consumption is measured in miles per (US) gallon; in Europe, however, we are more familiar with the metric system. Therefore, we create a new consumption variable measured in litres per 100 km, using 1 gallon ≈ 3.8 litres and 1 mile ≈ 1.61 kilometres:

    gallon_l = 3.785411784
    mile_km = 1.60934
    conversion_factor = (1/(mile_km/gallon_l))*100
    mtcars$lau100km= conversion_factor/mtcars$mpg

What is the average consumption of the cars in our sample?
    mean(mtcars$lau100km)

Is the consumption level homogeneous for all cars? …

Histogram

Histograms are particularly useful to describe the distribution of a variable. The highest bars represent the value classes that are most frequent in the study sample.
In practice, a histogram is a set of rectangles, each corresponding to a class of values. The area of each rectangle is proportional to the frequency of its class.
[Figure: example histogram; x-axis: value classes, y-axis: frequency]

Histogram: Continuous variables

    hist(mtcars$lau100km, xlab="Consumption [l/100km]")
    hist(mtcars$lau100km, breaks = c(0,10,20,30), xlab="Consumption [l/100km]")
    hist(mtcars$lau100km, breaks = c(0,5,10,15,20,25), …

Histogram: Discrete variables

    hist(mtcars$cyl, xlab="Number of cylinders")

Different distribution shapes

Unimodal distribution (simulated): one prominent peak.

Bimodal distribution (simulated): several major peaks (bimodal / multimodal). Often it is the result of the union of two different distributions. For instance, the average speed of 2000 French trains:
– Sample of 1000 regional trains for which we recorded the average speed
– Sample of 1000 TGVs for which we recorded the average speed

Uniform distribution (simulated): no apparent peak. 36 numbers, each with the same probability of being drawn.
[Figure: "Histogram of extractions"; x-axis: extractions (0 to 35), y-axis: Frequency (0 to 600)]

Skewness: a histogram can be right-skewed, left-skewed, or symmetrical. A histogram is skewed toward the side of the long tail of the distribution.

Exercise 1: real data from the dataset quakes

Comment on the distribution of the magnitude of earthquakes near the Fiji islands.

    quakes=data.frame(quakes)
    hist(quakes$mag)

This dataset has been provided by the Harvard PRIM-H project.
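The class counts behind the consumption histogram can also be read as numbers rather than from the plot. The following is a minimal sketch, assuming the same conversion constants as in the slides; the copy named `cars` and the object `h` are illustrative names, not part of the original course material. It relies on the `plot = FALSE` argument of `hist()`, which returns the classes and counts without drawing.

```r
# Work on a copy so the built-in mtcars data stays untouched (illustrative name)
cars <- data.frame(mtcars)

# Conversion constants from the slides
gallon_l <- 3.785411784   # litres in one US gallon
mile_km  <- 1.60934       # kilometres in one mile

# litres per 100 km = conversion_factor / (miles per gallon)
conversion_factor <- (gallon_l / mile_km) * 100
cars$lau100km <- conversion_factor / cars$mpg

# plot = FALSE returns the histogram object instead of drawing it
h <- hist(cars$lau100km, breaks = c(0, 5, 10, 15, 20, 25), plot = FALSE)

# One row per value class: lower bound, upper bound, number of cars
classes <- data.frame(lower = head(h$breaks, -1),
                      upper = tail(h$breaks, -1),
                      count = h$counts)
print(classes)
```

Changing `breaks` changes the classes, which is why the same variable can look quite different from one histogram to the next.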
Exercise 2: real data from the dataset quakes

Comment on the sample distribution of the depth of the earthquakes.

    hist(quakes$depth)

Exercise 3

Which of the following variables is uniformly distributed?
(a) adult weight
(b) salaries of a random sample of individuals
(c) house prices
(d) the birthdays of my classmates

Variance and standard deviation

Is the consumption level more homogeneous for cars with manual or automatic transmission?

$$\mathrm{var}(X) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2 \qquad \mathrm{sd}(X) = \sqrt{\mathrm{var}(X)}$$

Variance = 14.92 (l/100km)^2

    var(mtcars$lau100km)

Standard deviation = 3.86 l/100km

    sd(mtcars$lau100km)

Why is presenting the standard deviation better than presenting the variance?

We create two subsamples of cars, one with manual and one with automatic transmission:

    mtcars_manual = subset(mtcars, am==1)
    mtcars_automatic = subset(mtcars, am==0)
    hist(mtcars_manual$lau100km)
    hist(mtcars_automatic$lau100km)
    sd(mtcars_manual$lau100km)       # 2.78
    sd(mtcars_automatic$lau100km)    # 3.61

Quartiles

Quartiles are the values that divide the data points into four equal parts: 25, 50, 75%.
Deciles are the values that divide the data points into ten equal parts: 10, 20, …, 90%.
Percentiles are the values that divide the data points into one hundred equal parts: 1, 2, …, 99%.

    quantile(mtcars$lau100km)
    quantile(mtcars$lau100km, probs = seq(0, 1, 0.1))
    summary(mtcars$lau100km)

Inter-quartile range (Q3 - Q1):

    IQR(mtcars$lau100km)

Any comment on the median and average values?

Exercise

We select the cars whose consumption is above the 75th percentile, that is, the 25% of cars with the highest consumption. How many cars do we expect to find in the subsample?

    subset(mtcars, mtcars$lau100km>15.25)

Comments? Country?
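Two points from the slides above can be checked numerically. The sketch below is illustrative only: the names `cars`, `pop_var`, `sample_var`, `q3` and `top` are assumptions introduced here. It first compares the variance formula that divides by n with R's built-in `var()`, which divides by n - 1, and then reads the 75th percentile from `quantile()` instead of typing 15.25 by hand.

```r
# Copy of the built-in data with the consumption variable from the slides
cars <- data.frame(mtcars)
cars$lau100km <- (3.785411784 / 1.60934) * 100 / cars$mpg

x <- cars$lau100km
n <- length(x)

# Variance as written in the formula: divide by n
pop_var <- sum((x - mean(x))^2) / n

# R's var() divides by n - 1, so it is slightly larger
sample_var <- var(x)
print(c(pop_var = pop_var, sample_var = sample_var))

# Third quartile computed from the data rather than hard-coded
q3 <- unname(quantile(x, 0.75))

# Cars strictly above the 75th percentile of consumption
top <- subset(cars, lau100km > q3)
print(nrow(top))
print(rownames(top))
```

With 32 cars, about a quarter of the sample is expected above the third quartile; ties at the cut-off value could make the count slightly smaller in other datasets.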
Boxplot

A boxplot is a graphical representation of the main characteristics of the distribution of a continuous variable. It is composed of:
1) a box spanning from the 1st to the 3rd quartile
2) a line representing the median
3) an upper and a lower bound delimiting the extreme values
4) points representing the extreme values (outliers)

    boxplot(mtcars$lau100km, horizontal = FALSE)

Boxplot: upper and lower inner fences

IQR = Q3 - Q1
h = 1.5*IQR = 1.5(Q3 - Q1)
Lower and upper inner fences:
– UIF = min(max(x), Q3 + h)
– LIF = max(min(x), Q1 - h)

E.g. calculate LIF / UIF:

    summary(mtcars$lau100km)
    h=1.5*(15.25-10.32)
    UIF=min(max(mtcars$lau100km), (15.25+h))
    LIF=max(min(mtcars$lau100km),(10.32-h))

Outliers

What if we find values outside the range from LIF to UIF? Do we have extremely powerful cars in our sample?

    boxplot(mtcars$hp)
    summary(mtcars$hp)
    h=1.5*IQR(mtcars$hp)
    UIF=min(max(mtcars$hp),(180.0+h))
    subset(mtcars, hp > UIF)

Which car is the outlier?

Qualitative variables

Qualitative variables identify the characteristics of an observation:
– Religion
– Opinion
– …

Often it is necessary to distinguish whether the observation has or does not have a given characteristic. This is done with a dummy variable:
– smoker (=1) / non-smoker (=0)
– woman (=1) / man (=0)
– red (=1) / not red (=0)
– manual (=1) / automatic (=0) gearbox

Table and bar plot

It is common to describe a qualitative variable using a table or a bar plot.

    table(mtcars$am)

0 = automatic: 19
1 = manual: 13

    barplot(table(mtcars$am), names.arg = c("Automatic", "Manual"))

[Figure: bar plot of the number of cars with Automatic and Manual gearbox]

Table and bar plot (chickwts)

Qualitative variables that are not binary variables.
An example from the dataset chickwts:

    chickwts=data.frame(chickwts)
    table(chickwts$feed)
    barplot(table(chickwts$feed))

From a quantitative variable to a qualitative variable

We can define a new qualitative variable that identifies the most powerful cars. We decide that the most powerful cars are those above the 75th percentile of horsepower.

    summary(mtcars$hp)

We define a binary variable (1/0):

    mtcars$vp=as.numeric(mtcars$hp>180)
    table(mtcars$vp)

Powerful car
0: 25
1: 7
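The inner fences and the "powerful car" dummy both rely on quartiles that the slides copy by hand from `summary()`. As a hedged sketch, the same steps can be written with the quartiles taken directly from `quantile()`. The names `cars`, `q`, `uif`, `lif` and `outliers` are illustrative, and the fence formulas are the ones given in the boxplot slides.

```r
# Copy of the built-in data (illustrative name)
cars <- data.frame(mtcars)

# First and third quartiles of horsepower
q  <- unname(quantile(cars$hp, c(0.25, 0.75)))
q1 <- q[1]
q3 <- q[2]

# Inner fences as defined in the slides
h   <- 1.5 * (q3 - q1)
uif <- min(max(cars$hp), q3 + h)
lif <- max(min(cars$hp), q1 - h)

# Observations outside the fences are flagged as outliers
outliers <- subset(cars, hp > uif | hp < lif)
print(rownames(outliers))

# Dummy variable: 1 if the car is above the 75th percentile of hp, else 0
cars$vp <- as.numeric(cars$hp > q3)
print(table(cars$vp))
```

Computing the thresholds from the data avoids retyping rounded values and keeps the code valid if the sample changes.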
