MAS 5112 Exploratory Data Analysis Lecture 1 PDF
University of Colombo
Dr. Sunethra Abeysinghe
Summary
This document is Lecture 1 of MAS 5112: Exploratory Data Analysis, by Dr. Sunethra Abeysinghe of the Department of Statistics, University of Colombo. It covers introductory concepts such as data, variables (quantitative and qualitative), and descriptive methods.
Full Transcript
MAS 5112: EXPLORATORY DATA ANALYSIS
LECTURE 1
Dr. Sunethra Abeysinghe - Department of Statistics - University of Colombo

CONTENTS
  The scope of Statistics
  Population and Samples
  Data and Variables
  Misuse of Statistics

What Is Statistics?
Statistics is about DATA.
Are data numbers? Data are numbers with a context.
E.g.:
  12 is a number (no context).
  The weight of a 2-year-old child is 12 kg: a number with a context.
  The average weight of a 2-year-old is 10.5 kg: a statistic.

DATA IS EVERYWHERE
  Weather data
  Stock market data
  Population data, etc.

STATISTICS
The science of collecting, organizing, and interpreting numerical facts.
Numerical facts → DATA

More female students enter government universities. True or False?
Women are paid less in private sector employment. True or False?

AGE AND PALM LINES
Long lines predict long life? True or False?
How are we to evaluate such claims?

TSUNAMI IN SRI LANKA
Median income in areas close to the sea has gone up post-Tsunami. True or False?

Goal of Statistics
To gain understanding from data.
Data? Numbers that have some context.
  7.6%
  7.6% of children born have low birth weight (BW)
Remember → the goal of Statistics is to gain an understanding from numbers.
A statistician? A data detective!

Some Terms To Remember
The target population is the complete collection of individuals or objects that are of interest to the study.
E.g.: If we are interested in studying the problems of Colombo University students, our population is “all the students of the Colombo University.”

A sample is a subset of the population. Large populations are difficult to study, so information is very often obtained from a sample of the population.

Data is a collection of information about some individuals (individuals are not only humans; they may be objects!).
A variable is some characteristic of an individual.
An observation is a value that a variable assumes for a single element of a population or sample.

TASK 1
Choose a publicly available dataset with the following features:
  A fairly large number of observations (large dataset)
  Both categorical and quantitative variables
Enter the details of the dataset into the Google sheet:
https://docs.google.com/spreadsheets/d/1BJAs86jNHX3L-oQu_zK_bE5NH4LyZuyW/edit?usp=sharing&ouid=111111039116403803811&rtpof=true&sd=true

EXAMPLE
Number of defective items in 50 batches of an electronic component produced in a factory:

  3 4 7 1 1 1 4 3 6 2
  4 2 2 1 1 1 3 1 15 2
  1 2 1 3 5 2 1 4 2 4
  1 3 2 5 3 2 7 2 5 8
  1 3 5 1 4 1 1 1 5 2

What do these numbers mean to you? Are there any interesting features you need to know?

It is difficult to look at each number in turn and draw conclusions.
We need to organize and summarize data.
We can use numerical and graphical methods to summarize data.

Organizing and Analyzing Data
Descriptive methods: procedures used to summarize information about samples in a convenient and understandable form, without drawing conclusions beyond the data at hand.
Inferential methods
(A mixture of the two would be ideal in most situations.)

EXAMPLE
Marks of Subject A: 63, 41, 55
Marks of Subject B: 60, 58, 59
On the basis of this information, we can report that Subject A had an average of 53 and Subject B had an average of 59. Here we have described the two data sets. That is descriptive statistics.

Types of Variables
1. A quantitative (numerical) variable is a variable whose values are numerical in nature.
  Weight of a person
  Exam marks
  Income
2. A qualitative variable is a variable having categories or classifications that are not numerical in nature.
  Gender (Male, Female): dichotomous
  Social class of a person (High, Med., Low): multinomial

A discrete variable is a variable that can take only a finite or countable number of values.
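The defect counts from the earlier example are themselves a discrete variable, so one way to organize them is to count how often each value occurs. The following is a minimal sketch in Python, not part of the lecture: the 50 values are copied from the example, while the helper name `frequency_table` and the text-based bars are illustrative choices.

```python
from collections import Counter
import statistics

# Defective items per batch, copied from the example above (50 batches).
defects = [
    3, 4, 7, 1, 1, 1, 4, 3, 6, 2,
    4, 2, 2, 1, 1, 1, 3, 1, 15, 2,
    1, 2, 1, 3, 5, 2, 1, 4, 2, 4,
    1, 3, 2, 5, 3, 2, 7, 2, 5, 8,
    1, 3, 5, 1, 4, 1, 1, 1, 5, 2,
]


def frequency_table(values):
    """Return (value, count) pairs sorted by value (hypothetical helper)."""
    return sorted(Counter(values).items())


table = frequency_table(defects)

# Graphical summary: one star per batch gives a rough text bar chart.
for value, count in table:
    print(f"{value:>2} defective: {count:>2} batches {'*' * count}")

# Numerical summaries of the same data.
print("batches:", len(defects))
print("mean:", statistics.mean(defects))
print("median:", statistics.median(defects))
print("largest:", max(defects))
```

Organized this way, two features stand out that are hard to see in the raw list: most batches have only one or two defective items, and a single batch with 15 sits far from the rest.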
Examples of discrete variables:
  1. Number of customers arriving at a supermarket.
  2. Number of children in a family.

A continuous variable is a variable that can take an uncountable number of values, that is, any real value within a range.
E.g.:
  1. Amount of rainfall.
  2. Time taken to complete a computer job.

Discrete data: gaps between possible values
Continuous data: theoretically, no gaps between possible values

SUMMARY
Types of data
  Quantitative (Numerical): Discrete, Continuous
  Qualitative (Categorical): Discrete

Examples
There are many situations where a continuous quantitative variable is divided into arbitrary categories and treated as a qualitative variable.
  Age considered as age categories
  Monthly salary is often treated as a qualitative variable by grouping it into classes.

TASK 2
Enter the following details for the dataset you have chosen:
  Categorical variables
  Numerical variables
  Discrete variables
  Continuous variables
Link to the sheet:
https://docs.google.com/spreadsheets/d/1BJAs86jNHX3L-oQu_zK_bE5NH4LyZuyW/edit?usp=sharing&ouid=111111039116403803811&rtpof=true&sd=true

SCALES OF MEASUREMENT
Scales of measurement become important when deciding which statistical methods can be used with the data. There are four types:
  Nominal
  Ordinal
  Interval
  Ratio

Nominal Scale
A qualitative grouping.
A question could be “What different types of dogs do you have?” The answer would be the types, and we could give counts for each type. We may also talk about the mode (the type with the highest count) with this measurement type.

Ordinal Scale
There is ‘order’ in the measurement values. Class rank is a typical example: student A with rank 1 performed better at the examination than student B with rank 2. We do not know how much better student A is than student B. Mode and median can be used to describe this measurement.

Interval Scale
Preserves the order and tells you how far apart each observation is.
30 degrees F is 10 degrees warmer than 20 degrees F, and 80 degrees F is 5 degrees cooler than 85 degrees F. There is no absolute zero in the Fahrenheit scale.

Ratio Scale
Preserves the order and keeps a one-unit difference the same across the scale. There is a true zero point.
4 units on a ratio scale is twice as high as 2 units.
30 K is twice as hot as 15 K, because the Kelvin scale has an absolute zero point. Celsius and Fahrenheit do not have an absolute zero, so 30 degrees C is not twice as hot as 15 degrees C.

TASK 3
List the variables according to their scale of measurement.
Link to the sheet:
https://docs.google.com/spreadsheets/d/1BJAs86jNHX3L-oQu_zK_bE5NH4LyZuyW/edit?usp=sharing&ouid=111111039116403803811&rtpof=true&sd=true

HOME WORK
For the dataset you have chosen, write down a few (at least 5) research questions of interest.
Link to the sheet:
https://docs.google.com/spreadsheets/d/1BJAs86jNHX3L-oQu_zK_bE5NH4LyZuyW/edit?usp=sharing&ouid=111111039116403803811&rtpof=true&sd=true
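The four scales of measurement described earlier differ in which summaries are meaningful. The sketch below, in Python, walks through one small case per scale. It is only an illustration under stated assumptions: the dog types and social-class responses are made-up data, and the variable and function names are hypothetical, not taken from the lecture.

```python
import statistics

# Nominal: categories have no order, so counts and the mode are meaningful.
# Hypothetical answers to "What different types of dogs do you have?"
dog_types = ["Labrador", "Poodle", "Labrador", "Beagle", "Labrador", "Poodle"]
nominal_mode = statistics.mode(dog_types)

# Ordinal: categories are ordered, so the median is also meaningful,
# but the distance between categories is unknown.
class_order = ["Low", "Med", "High"]  # order of the categories
social_class = ["Low", "Med", "High", "Med", "Low", "Med", "High"]  # hypothetical
codes = [class_order.index(label) for label in social_class]
# median_low always returns an observed value, so it maps back to a category.
ordinal_median = class_order[statistics.median_low(codes)]

# Interval: differences are meaningful (30 F is 10 degrees warmer than 20 F).
fahrenheit_gap = 30 - 20


def celsius_to_kelvin(celsius):
    """Convert degrees Celsius to kelvin, a scale with an absolute zero."""
    return celsius + 273.15


# Ratio: ratios are meaningful only when the scale has a true zero.
# Dividing Celsius readings gives 2.0, but that number has no physical meaning.
naive_ratio = 30 / 15
# On the Kelvin scale, 30 degrees C is only about 5% hotter than 15 degrees C.
kelvin_ratio = celsius_to_kelvin(30) / celsius_to_kelvin(15)

print("mode of dog types:", nominal_mode)
print("median social class:", ordinal_median)
print("difference in degrees F:", fahrenheit_gap)
print("ratio of Celsius readings (not meaningful):", naive_ratio)
print("ratio on the Kelvin scale:", round(kelvin_ratio, 3))
```

The last two lines show why the presence of a true zero matters: the same pair of temperatures gives a ratio of 2 in Celsius but roughly 1.05 in kelvin.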