Mathematics for Computer Science Engineers - Unit 1 PDF
Mamatha.H.R
Summary
This document presents course content for a Mathematics for Computer Science Engineers course, focusing on unit 1. It covers topics like probability distributions, point estimation, confidence intervals, hypothesis testing, and distribution-free tests. The document also outlines applications, tools used (Jupyter Notebook, Python, etc.), and evaluation policies.
Full Transcript
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
UE23MA242A
Unit 1: Introduction
Mamatha H R
Department of Computer Science and Engineering

Course content

Unit 1: Applications of Probability Distributions and Principles of Point Estimation
Introduction, Motivating Examples and Scope. Statistics: Introduction, Types of Statistics, Types of Data, Types of Experiments (Controlled and Observational Study). Sampling: Sampling Methods, Sampling Errors, Case Study. Chebyshev's Inequality, Normal Probability Plots, Introduction to Generation of Random Variates and its types, Acceptance-Rejection Method, Sampling Distribution, The Central Limit Theorem and Applications. Principles of Point Estimation: Mean Squared Error for Bernoulli, Binomial, Poisson, Normal; Maximum Likelihood Estimate for Bernoulli, Binomial, Poisson, Normal; Case Study. Introduction to multivariate normal distribution, MAP distribution.
Self-Learning: Generation of Random Variates, Inverse Transform Method.
16 Hours

Unit 2: Confidence Intervals and Hypothesis Testing
Confidence Intervals: Interval Estimates for Mean of Large and Small Samples, Student's t Distribution, Interval Estimates for Proportion of Large and Small Samples, Confidence Intervals for the Difference between Two Means, Interval Estimates for Paired Data. Factors affecting Margin of Error, Hypothesis Testing for Population Mean and Population Proportion of Large and Small Samples, Drawing Conclusions from the Results of Hypothesis Tests, Case Study.
Self-Learning: Confidence interval for difference between two proportions.
12 Hours

Unit 3: Distribution Free Tests and Multiple Linear Regression
Distribution Free Tests, Chi-squared Test, Fixed Level Testing, Type I and Type II Errors, Power of a Test, Factors Affecting Power of a Test. Simple Linear Regression: Introduction, Correlation, the Least Squares Line, Predictions using Regression Models, Uncertainties in Regression Coefficients, Checking Assumptions and Transforming Data, Introduction to the Multiple Regression Model, Case Study.
Self-Learning: F test for equality of variance.
14 Hours

Unit 4: Engineering Optimization
Introduction to Optimization-Based Design, Modelling Concepts, Unconstrained Optimization, Discrete Variable Optimization, Genetic and Evolutionary Optimization, Constrained Optimization.
Self-Learning: Mathematical concepts of objective function, constraints and decision variables.
14 Hours

Course content: Applications

Unit 1: Applications
1. Poisson distribution: calculating the number of calls received in a specified time duration in call centers.
2. Variance and standard deviation: identifying customer satisfaction in online shopping.
3. Central limit theorem: load balancing in distributed systems and internet traffic prediction.
4. Sample mean: estimating database query response times.

Unit 2: Applications
1. t-distribution and confidence intervals: students' performance analysis based on hours of study.
2. z-test: application form processing in a banking system.
3. Hypothesis testing: placement of randomly trained students into tier-I and tier-II companies.

Unit 3: Applications
1. Linear regression: stock market prediction.
2. Chi-square test: analyzing the association between vaccination and recovery of patients using COVID data.
3. Chi-square test of independence: analyzing the relationship between gender and preference for a product purchase.
4. Identifying Type I and Type II errors in spam mail classification.

Unit 4: Applications
1. Minimizing a loss function in neural networks using batch gradient descent (Unconstrained Optimization).
2. Lagrange multipliers to find local maxima and minima of a function subject to equality constraints (Constrained Optimization).
3. Case study on Bayesian optimization with discrete variables (Discrete Variable Optimization).
4. Using Genetic Algorithms to optimize production scheduling in a manufacturing environment, focusing on minimizing total production costs while meeting job deadlines and machine constraints, and evaluating the GA's effectiveness against traditional scheduling methods.

Tools and Textbooks
Tools / Languages / Libraries: Jupyter Notebook, Python, Pandas, Matplotlib, SciPy, Seaborn, BeautifulSoup, NumPy, scikit-learn.
Text Book(s):
1. "Statistics for Engineers and Scientists", William Navidi, McGraw Hill Education, India, 4th Edition, 2015.
2. "Optimization Methods for Engineering Design", Parkinson, A.R., Balling, R., and J.D. Hedengren, Second Edition, Brigham Young University, 2018.

Evaluation Policy
ISA Components | Conduction | Reduced to
ISA 1 | 40 | 20
ISA 2 | 40 | 20
Assignment | Coding: 5M, Datathon: 20, Total: 25 | 10

Assignment Components
1. Submission of the hands-on session code = 5 Marks
2. Datathon = 5 Marks
Total = 10 Marks

Note
1. Codes and solutions for hands-on sessions are expected to be submitted on the same day the sessions are conducted.
2. The Datathon will be conducted for 20 Marks and reduced to 5 Marks.

What is Data Science?
This is one of the simplest applications of Data Science. Such tasks would be impossible without the availability of data. Thus, in simple words, Data Science is all about using data to solve problems.
What is Data Science?
Data Science is an interdisciplinary field. It is focused on extracting knowledge and insights from data. Those insights are then applied to solve problems across a wide range of domains.
Source: theblog.adobe.com

Applications of Data Science
Source: edureka.com

Applications of Data Science: Data Science in Recommender Systems
Source: https://www.martechadvisor.com/articles/customer-experience-2/recommendation-engines-how-amazon-and-netflix-are-winning-the-personalization-battle/

Applications of Data Science: Data Science in Recommender Systems
Source: https://medium.com/swlh/recommendations-in-time-context-93b32f73d98d

Applications of Data Science: Data Science in Weather Forecasting
Source: phys.org/news

Applications of Data Science: Data Science in Sports
Source: https://arstechnica.com/information-technology/2015/10/big-data-an-it-buzzword-that-is-actually-producing-results/

Applications of Data Science: Data Science in Politics
Source: https://fivethirtyeight.com/features/todays-polls-and-final-election/

Applications of Data Science: Data Science in Healthcare & Medicine
Source: http://www.primeclasses.in/blog/2019/08/26/the-need-for-data-science-in-healthcare-industry/

Applications of Data Science: Data Science in Predicting People's Opinions
Source: Simplilearn

What is Data?
Technically, data refers to individual facts, statistics, or items of information, often numeric, that are collected through observation.
Source: https://www.twinkl.de/teaching-wiki/data

Data vs Information
➔ Data: Raw facts, usually formatted in a special way. Based on records, observations, etc. Unorganized.
➔ Information: A collection of facts organized in such a way that they have additional value beyond the value of the facts themselves. Based on analysis of data. Organized, and always depends on data.
Ex: Data: thermometer readings of temperature taken every hour (16.0, 17.0, 16.0, 18.5, 17.0, 15.5….). On transformation, Information: today's high: 18.5, today's low: 15.5.

Data vs Information
Source: https://effectualsystems.com/data-need-information/

Types of Data
Data | Represented by
Alphanumeric data | Numbers, letters, and other characters
Image data | Graphic images or pictures
Audio data | Sound, noise, tones
Video data | Moving images or pictures

Structured, Unstructured & Semi-structured Data
Source: https://towardsdatascience.com/data-extraction-from-a-pdf-table-with-semi-structured-layout-ef694f3f8ff1
Source: slidegeeks.com

Structured, Unstructured & Semi-structured Data
Structured Data: Data whose elements are addressable for effective analysis. The data is organized into a formatted repository that is typically a database. Ex: Relational data.
Semi-Structured Data: Data that doesn't reside in a relational database but has some organizational properties that make it easier to analyse. Ex: XML data.
Unstructured Data: Data that is not organized in a predefined manner or doesn't have a predefined data model, and is thus not a good fit for a mainstream relational database. Ex: Word, pdf, text, etc.
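As a small illustration of the Data vs Information slide, the thermometer example can be sketched in Python. This is only a minimal sketch: it uses just the six readings that are actually listed on the slide (the remaining readings are elided there), and the function name `summarize` is a hypothetical helper, not something from the course material.

```python
# Raw data: the six hourly thermometer readings listed on the slide.
# (The slide elides the remaining readings, so only these are used.)
readings = [16.0, 17.0, 16.0, 18.5, 17.0, 15.5]


def summarize(values):
    """Turn raw readings (data) into a daily summary (information)."""
    return {"high": max(values), "low": min(values)}


info = summarize(readings)
print(f"today's high: {info['high']}, today's low: {info['low']}")
```

The raw list carries no meaning by itself; the organized summary (high and low) is the information derived from it.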
Data Information
Science: from the Latin word Scientia, meaning knowledge.
Science is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe.
Source: guru99.com

Information Concepts
Source: https://learningforsustainability.net

Why do we need Data Science?
Source: https://static.seekingalpha.com/uploads/2020/1/14/50485001-15789998083991578_origin.png

Why do we need Data Science?
The main reason we need data science is its ability to process and interpret data. This enables users and industries to make informed decisions and helps in their growth, optimization, and performance.
We know that unstructured data is generated everywhere, every second. Unstructured data isn't well organized or easy to access, but its growth is enormous, and analyzing and drawing inferences from this type of data is crucial. Data Science provides a number of methods and techniques to deal with such data, which helps many businesses and industries improve their productivity significantly.

How is Data generated?
By 2025, it's estimated that 463 exabytes of data will be created each day globally: that's the equivalent of 212,765,957 DVDs per day!
Source: theblog.adobe.com
Slide courtesy: Dr. Uma

Data generation
Source: https://trak.in/tags/business/2014/04/15/digital-data-universe-expansion-2020/

Growth in Data generation

How much of data is put into use?
Source: IDC, 2014

How much of data is put into use?
Though a huge amount of data is generated each day, it serves no purpose if it is left unused. This can further lead to information overload, where there is an overabundance of information that is not put to work due to lack of time, resources, understanding of the information, irrelevance of the information, or other reasons. Thus, it is important to understand the data and know how to utilize it in the right manner.

But is data all we need?
The graph below shows a strong correlation between 'Age of Miss America' and 'Murders by steam, hot vapour and hot objects', yet a cause-and-effect relationship between the two is clearly implausible. Thus, we see that the presence of interesting patterns need not imply their correctness. Blindly applying various processes and techniques to data can result in incorrect inferences.
Source: https://i2.wp.com/boingboing.net/wp-content/uploads/2016/02/chart.jpg?fit=800%2C315&ssl=1

Learn how to use data
We need to learn how to utilize and handle the available data in the right manner to be able to arrive at correct results and draw meaningful inferences.
➔ Explore: identify patterns
➔ Infer: quantify what you know
➔ Predict: make informed guesses

Data Science project life cycle
The correct process of using available data is shown in this life cycle. It outlines the major stages in a data science project.
Source: https://static.javatpoint.com/tutorial/data-science/images/data-science-lifecycle.png

Data Science project life cycle
Source: https://res.cloudinary.com/practicaldev

Data Scientist
Data Scientists, in simple words, are those who make sense of all the data that is available and figure out what can be done with it.
Source: proschoolonline.com

What does a Data Scientist do?
Source: medium.com
Slide courtesy: Dr. Uma

Prerequisites for a Data Scientist
Curiosity, Common Sense, Communication skills
Sources: quickanddirtytips.com, dreamstime.com, linkedin.com
Slide courtesy: Dr. Uma

Prerequisites for a Data Scientist
Source: data-flair.training
Slide courtesy: Dr. Uma

Demand for Data Scientist
Data Science is a growing field. It is a popular and lucrative profession. Glassdoor ranked this profession at #3 in 2022 despite the occurrence of the pandemic.
Sources: Glassdoor, Forbes

How is it different from what Statisticians have been doing?
Both Statisticians and Data Scientists work closely with data. Statisticians use mathematical equations and statistical models to analyze data and arrive at conclusions. Data Scientists, however, focus on delivering actionable results and sometimes need to deploy the model to the production system.

Data Science vs Data Analysis
Data Science is primarily used to make decisions and predictions, making use of predictive causal analytics, prescriptive analytics (predictive plus decision science) and machine learning. Data Analysis includes descriptive analytics and, to a certain extent, prediction.

Data Science vs Data Analysis
Source: edureka!
Common tasks in Data Science
Source: Simplilearn

Common tasks in Data Science
Source: https://static.javatpoint.com/tutorial/data-science/images/how-to-solve-a-problem-in-data-science.png

THANK YOU
Dr.Mamatha H R
Professor, Department of Computer Science
[email protected]
+91 80 2672 1983 Extn 712

UE23MA242A
Unit 1: Population & Sampling
Mamatha.H.R
Department of Computer Science and Engineering

Topics to be covered
❖ Statistical Analysis
❖ Population
❖ Sample
❖ Sampling
❖ Types of Population

Problems to be solved
Suppose you are interested in finding:
Mean height of all male students of all the universities in India.
OR
Average marks of all female students of PES University.
OR
Relationship between the time a student spends on studying and the grades that he gets.
OR
Impact of a rise in the number of student assignments on their grades.
Slide courtesy: Dr. Uma

What is Statistical Analysis?
It's the science of collecting, exploring and presenting large amounts of data to discover underlying patterns and trends. Statistics are applied every day, in research, industry and government, to become more scientific about decisions that need to be made.
The basic idea behind all statistical methods of data analysis is to make inferences about a population by studying a relatively small sample chosen from it.
Source: media3.giphy.com

Population
A population is the entire collection of objects or outcomes about which information is sought.
As mentioned, statistical methods are based on the idea of analyzing a sample drawn from a population. For this idea to work, identifying the population and choosing the sample in an appropriate manner become important.
In research, a population doesn't always refer to people. It can mean a group containing elements of anything you want to study, such as objects, events, organizations, countries, species, organisms, etc.
Source: keydifferences.com

Sample
A sample is a subset of a population, containing the objects or outcomes that are actually observed.
Sample size: The number of items in a sample is called the sample size. The size of the sample is always less than the total size of the population.
The process of taking a predetermined number of observations from a larger population is called sampling.
Sources: i.gifer.com, keydifferences.com

Population vs Sample
Population | Sample
The population is a complete set. | The sample is a subset of the population.
Population is hard to define and observe in real life. | A sample is much easier to contact and observe.
It is time consuming and costly to study a population. | It is relatively less time consuming and low cost to study a sample.
Population contains all members of a specified group. | Sample is a subset that represents the entire population.
Reports on a population are a true representation of opinion. | Reports on a sample have a margin of error.
Population & Sample examples
Population | Sample
All countries of the world | Countries with published data available on birth rates and GDP since 2000
Songs from the Eurovision Song Contest | Winning songs from the Eurovision Song Contest that were performed in English
Undergraduate students in the Netherlands | 300 undergraduate students from three Dutch universities who volunteer for your psychology research study
Advertisements for IT jobs in the Netherlands | The top 50 search results for advertisements for IT jobs in the Netherlands on May 1, 2020

Populations & Samples
In a recent survey, 250 college students at Union College were asked if they smoked cigarettes regularly. 35 of the students said yes. Identify the population and the sample.
Population: responses of all students at Union College.
Sample: responses of the 250 students in the survey.

What is Sampling?
The process of selecting observations (a sample) in order to make an inference that can be generalized to the population.
[Figure: INFERENCE connects what you actually observe in the data with what you want to talk about.]
Source Image: aprendeconalf.es
Slide courtesy: Dr. Uma

Sampling
The methodology used to sample from a larger population depends on the type of analysis being performed.
[Figure: The population: all of the individuals of interest. The sample: the individuals selected to participate in the research study. The sample is selected from the population; the results from the sample are generalized to the population.]

Why sampling?
We know that resources such as time, money and people are limited. When the population is large in size, geographically dispersed, or difficult to contact, it's necessary to use a sample. Thus, most projects aim to gather data from a sample, rather than from the entire population.
Some reasons for sampling are:
Necessity: Sometimes it's simply not possible to study the whole population due to its size or inaccessibility.
Practicality: It's easier and more efficient to collect data from a sample.
Cost-effectiveness: There are fewer participant, laboratory, equipment, and researcher costs involved.
Manageability: Storing and running statistical analyses on smaller datasets is easier and more reliable.
Saves time: As the sample is relatively small, data collection is faster.

Characteristics of a sample
A sample must be representative of the population.
It must be appropriately sized, i.e. it must be sufficiently large to represent the population and provide statistical stability or reliability.
It must be unbiased. It should contain all types of groups/units present in the population in fair proportions.
It must be selected at random. This means that any item in the group has an equal chance of being selected and included in the sample.
It must be economical. The objectives of the survey must be achieved with as little cost and effort as possible.
It must be goal-oriented. It must be oriented to the research objectives and fitted to the survey conditions.
Slide courtesy: Dr. Uma

Is it a good sample?
Study: Survey of the job prospects of the students studying in a university.
Sample: Taking the survey from the students who are in the canteen.
Slide courtesy: Dr. Uma

Is it a good sample?
This is not an example of a good sample because:
The students in the canteen are not completely representative of the students studying in the university.
The size of the sample (i.e. the number of students in the canteen) might not be sufficient to represent the population (students studying in the university).
The sample selection is not performed at random, as each student studying in the university doesn't have an equal chance of getting selected.

Types of population
1. Tangible or concrete population
2. Conceptual population

Tangible population
Populations where the members are physical objects, such as cars, bolts, apples, etc., are called tangible or concrete populations.
Such populations are always assumed to be finite, so their members can be counted.
After an item is sampled, the population size decreases by 1. In principle, one could in some cases return the sampled item to the population, with a chance to sample it again, but this is rarely done in practice.
Source: https://www.hindivarta.com/jansankhya-ki-samasya-aur-samadhan-par-nibandh/
Slide courtesy: Dr. Uma

Conceptual population
Populations that do not consist of physical or actual objects are called conceptual populations.
Conceptual populations are mostly the result of a measurement. They involve measuring something multiple times. Ex: length of a metal rod.
A conceptual population is a group that is not well defined: not all of its elements are available at the time the sample is collected, as the population increases every day.
The size of a conceptual population is usually large. Ex: the population for a measuring scale can be all the possible outputs it can give, i.e. infinite. The measured values can be thought of as a sample from this infinite population.
Slide courtesy: Dr. Uma

Tangible & Conceptual population examples
Define the population, and state whether it is tangible or conceptual.
A shipment of bolts is received from a vendor. To check whether the shipment is acceptable with regard to shear strength, an engineer reaches into the container and selects 10 bolts, one by one, to test.
Ans: All the bolts in the shipment: Tangible population
The resistance of a certain resistor is measured 5 times with the same ohmmeter.
Ans: All measurements that could be made on that resistor with that ohmmeter: Conceptual population

Target and Study population
Target or Theoretical population refers to the entire group of individuals or objects to which researchers are interested in generalizing the conclusions. It must meet a set of criteria of interest to the researchers.
Study population or accessible population is the population to which the researchers can apply their conclusions. It is a subset of the target population. It may be limited to a region, state, city, county, or institution.
[Figure labels: TARGET POPULATION, STUDY POPULATION, SAMPLE]
Slide courtesy: Dr. Uma

Target and Study population examples
Target Population | Study Population
All institutionalized elderly with Alzheimer's | All institutionalized elderly with Alzheimer's in St. Louis county nursing homes
All people with AIDS | All people with AIDS in the metropolitan St. Louis area
All low birth weight infants | All low birth weight infants admitted to the neonatal ICUs in St. Louis city & county
All school-age children with asthma | All school-age children with asthma treated in pediatric asthma clinics in university-affiliated medical centers in the Midwest

Terminologies related to Sampling
Target or Theoretical Population: The population to which the investigator wants to generalize the results.
Sampling Frame: The list from which the potential respondents are drawn. Ex: List of Universities, List of Students, List of Airline Companies, Telephone Directory.
Sampling Unit: The smallest unit from which a sample can be selected.
Sampling Scheme: The method of selecting sampling units from the sampling frame.
Sample: All selected respondents form a sample.
Slide courtesy: Dr. Uma

Sampling Breakdown
Source: https://image.slidesharecdn.com/qrmtheory-180918191951/95/how-to-do-sampling-8-638.jpg?cb=1537298482

Sampling Breakdown
Study: Find the mean weight of all students of all universities in India.
To whom do you want to generalize the results? All universities in India ➔ Target or Theoretical population
What population can you get access to? All universities in Karnataka ➔ Study population
How can you get access to them? List of Universities in Karnataka ➔ Sampling frame
Who is in your study? Two Universities from Karnataka ➔ Sample
Slide courtesy: Dr. Uma

THANK YOU
Dr.Mamatha H R
Professor, Department of Computer Science
[email protected]
+91 80 2672 1983 Extn 834

UE23MA242A
Unit 1: Sampling Methods
Mamatha.H.R
Department of Computer Science and Engineering

What are Sampling methods?
In a statistical study, sampling methods refer to how we select members from the population to be included in the study.
The selected sample must be representative of the population.
There are many ways to select a sample, some good and some bad.
Sources: blog.masterofproject.com, analytics-magazine.org
Slide courtesy: Dr. Uma

Sampling
➔ Factors that influence sample representativeness:
Sampling procedure
Sample size
Participation (response)
➔ When might you sample the entire population?
When your population is very small
When you have extensive resources
When you don't expect a very high response
Source: thumbs.dreamstime.com

Representative & Biased Sample
[Figure: a population with two samples drawn from it. Sample 1: representative of the population. Sample 2: biased sample.]

Types of Sampling methods
Probability Samples: Simple Random, Stratified, Cluster, Systematic
Non-Probability Samples: Judgement, Snowball, Convenience, Quota

Probability Sampling
Probability sampling is a type of sampling in which every unit in the population has a chance/probability (greater than zero) of being selected in the sample, and this probability can be accurately determined.
When every element in the population has the same probability of selection, this is known as an 'equal probability of selection' (EPS) design.
Source: www.mathstopia.net
Slide courtesy: Dr. Uma

Non-Probability Sampling
Non-probability sampling is a type of sampling in which not every unit in the population has a known chance/probability (greater than zero) of being selected in the sample.
It involves the selection of elements based on assumptions regarding the population of interest, which form the criteria for selection.
The selection of elements is non-random. Thus, non-probability sampling does not allow the estimation of sampling errors.
It is more likely to produce a biased sample and restricts generalization.
Slide courtesy: Dr. Uma

Probability Sampling
Probability Samples: Simple Random, Systematic, Stratified, Cluster

Simple Random Sampling
Simple random sampling, as the name suggests, is an entirely random method of selecting the sample.
Here, each subject or unit in the population has an equal chance of being selected.
A table of random numbers or a lottery system is used to determine which units are to be selected.
Source: datasciencemadesimple.com

Simple Random Sampling
When to Use: Best to use when the population is small.
General Procedure: Assign numbers to all members of the population and select randomly.
○ For a small population: a manual lottery method can be used for selection.
○ For a larger population: system-generated numbers can be used to select elements from the population.
Slide courtesy: Dr. Uma

Simple Random Sampling Examples
At a birthday party, teams for a game are chosen by putting everyone's name into a jar, and then choosing the names at random for each team.
All students in the Computer Science department are assigned numbers, and 100 numbers are chosen at random to select the students who attend a webinar.
Sources: c8.alamy.com, wordwall.net

Simple Random Sampling Examples
Here, each of the 20 coins has an equal probability of getting selected.
Source: analyticsvidhya.com

Simple Random Sampling Examples
Calculating the probability of each coin getting selected:
Probability = (n/N) x 100
Total population size (N) = 20
Sample size (n) = 5
Probability = (5/20) x 100 = 25%
Thus, each coin has a 25% probability of getting selected.
Source: analyticsvidhya.com

Simple Random Sampling Examples
In a company consisting of 10,000 employees, 25 employees are selected to survey the average number of hours a day they are present in the office.
Population frame: List of all employees numbered from 1-10,000
Sample: 25 employees chosen using a random number table.
Probability of selection of each employee: N = 10,000; n = 25; probability = (25/10,000) x 100 = 0.25%
(Source: 5found.com)

Simple Random Sampling: Advantages & Disadvantages
➔ Advantages: This method is simple to use. Low sampling error.
➔ Disadvantages: If the sampling frame is large, this method is impracticable. This type of sampling is not well suited to populations whose units are heterogeneous in nature.
Variations:
○ Simple Random Sampling with replacement
○ Simple Random Sampling without replacement

Systematic Sampling
Systematic sampling relies on arranging the target population according to some ordering scheme and then selecting elements at regular intervals through that ordered list.
The first element is selected randomly. Selection then proceeds with every kth element, where k is the size of the selection interval:
k = (population size / sample size)
It is important that the starting point is not automatically the first in the list, but is instead randomly chosen from within the first to the kth element in the list.
A simple example would be to select every 10th name from the telephone directory (an 'every 10th' sample, also referred to as 'sampling with a skip of 10').

Systematic Sampling
(Figure; Source: questionpro.com)

Systematic Sampling
It is not 'simple random sampling' because different subsets of the same size have different selection probabilities.
Ex: the set {2,5,8,11} has a one-in-k probability of selection (one-in-three here, since k = 3 and the start is chosen at random from the first 3 elements), but the set {1,3,6,7} has zero probability of selection.
(Source: www.netquest.com)

Systematic Sampling
When to Use: When the project budget is tight and there is little time to complete the study.
General Procedure:
○ Assign numbers to each population element.
○ Arrange the population elements in an ordered sequence.
○ Find ‘k’, the size of the selection interval.
○ Select the first sample element randomly from the first k population elements.
○ Thereafter, select the sample elements at a constant interval, k, from the ordered sequence frame.
(Slide Courtesy: Dr. Uma)

Systematic Sampling Examples
From a classroom consisting of 64 students, the teacher wants to select 8 students to check their assignments.
Population size = N = 64
Sample size = n = 8
Size of selection interval = k = N/n = 64/8 = 8
Randomly select the first student (from among the first 8), then select every subsequent 8th student.

Systematic Sampling Examples
Purchase orders for the previous fiscal year are serialized 1 to 10,000. A sample of fifty purchase orders is needed for an audit.
N = 10,000; n = 50; k = 10,000/50 = 200
First select an element randomly from the first 200 purchase orders. Assume the 45th purchase order was selected.
Subsequent sample elements: 245 (45+200), 445 (245+200), 645 (445+200), ...

Systematic Sampling: Advantages
The sample is easy to select.
The sample is spread evenly over the entire reference population.

Systematic Sampling: Disadvantages
This type of sampling might lead to bias if there is an underlying pattern or periodicity in the population that coincides with the selection interval.
Not every sample of size n has an equal chance of being selected: with a random start, only k distinct samples are possible.
The elements lying between two selected kth elements can never appear in the same sample as those elements.
The size of the population is needed.

Stratified Sampling
Stratified sampling is the type of sampling in which the population is divided into 2 or more groups, called strata, based on a shared characteristic or trait.
Then a simple random sample is selected from each group. The 2 or more selected samples are combined into one.
The strata do not overlap, and together they cover the entire population.
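The stratified procedure just described (split the population into non-overlapping strata, draw a simple random sample from each, combine the results) can be sketched in Python as below. This is only an illustrative sketch: the function name stratified_sample, the coin colours, and the choice of 2 coins per stratum are assumptions made for the example, not part of the course material.

```python
import random
from collections import defaultdict


def stratified_sample(population, stratum_of, per_stratum, seed=None):
    """Draw a simple random sample of `per_stratum` units from every stratum.

    `stratum_of` maps a unit to the label of the stratum it belongs to.
    """
    rng = random.Random(seed)

    # Step 1: divide the population into non-overlapping strata.
    strata = defaultdict(list)
    for unit in population:
        strata[stratum_of(unit)].append(unit)

    # Step 2: simple random sample (without replacement) inside each stratum.
    # Step 3: combine the per-stratum samples into one sample.
    combined = []
    for label in sorted(strata):
        combined.extend(rng.sample(strata[label], per_stratum))
    return combined


# Hypothetical population: 20 numbered coins in 4 colours (5 coins per colour).
colours = ["red", "green", "blue", "yellow"]
coins = [(number, colours[(number - 1) % 4]) for number in range(1, 21)]

sample = stratified_sample(coins, stratum_of=lambda coin: coin[1],
                           per_stratum=2, seed=1)
print(sample)
```

Every colour contributes exactly 2 coins, so no stratum can be missed by chance, which is the sense in which stratification improves representativeness.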
The shared characteristic on which the division is based could be gender, educational attainment, income, age, etc.
(Source: datasciencemadesimple.com)

Stratified Sampling
(Figure; Source: questionpro.com)

Stratified Sampling
(Figure; Source: questionpro.com)

Stratified Sampling
When to Use: When population proportions must be reflected in the sample.
Key Aspect: Each stratum is homogeneous.
General Procedure:
○ Divide the population into strata (groups).
○ Criteria for division could be: Gender, Hair Color, Eye Color, Salary, Designation, Age, etc.
○ Selection of sample: Simple random sampling is used to sample units from each stratum.

Stratified Sampling examples
Given 20 coins of different colours.
The population of coins is divided into 4 strata based on colour.
Coins from each stratum are sampled using simple random sampling.
(Source: analyticsvidhya.com)

Stratified Sampling examples
A high school principal wants to conduct a survey to collect the opinions of students.
The students are grouped into 4 strata based on their grade.
Then, a simple random sample of 50 students from each grade is selected to be included in the survey.
(Source: statology.org)

Stratified Sampling: Advantages
It enhances the representativeness of the sample.
It has higher statistical efficiency.
(Slide Courtesy: Dr. Uma)

Stratified Sampling: Disadvantages
When multiple criteria are used to divide the population, the stratifying variables may be related to some of them but not to others, further complicating the design and potentially reducing the utility of the strata.
In some cases (such as designs with a large number of strata, or those with a specified minimum sample size per group), stratified sampling can potentially require a larger sample than other methods.
(Slide Courtesy: Dr. Uma)

Cluster Sampling
In cluster sampling, the population is divided into non-overlapping clusters or areas.
Each cluster is a miniature of the population: it should have characteristics similar to those of the whole population.
Instead of sampling individuals from each subgroup as in stratified sampling, in cluster sampling entire clusters are randomly selected.
A subset of the clusters is selected randomly for the sample.
(Source: dataz4s.com; Slide Courtesy: Dr. Uma)

Cluster Sampling
When to Use: When the population is already broken up into groups (clusters).
Key Aspect: Heterogeneous members in each group.
General Procedure:
○ The population is divided into non-overlapping areas (clusters).
○ Each cluster is a miniature or microcosm of the population.
○ Clusters are selected randomly.
○ Either all elements of the selected clusters are included in the sample, or elements from the selected clusters are chosen using simple random sampling.
One-stage sampling: All of the elements within the selected clusters are included in the sample.
Two-stage sampling: A subset of elements within the selected clusters is randomly selected for inclusion in the sample.
(Slide Courtesy: Dr. Uma)

Cluster Sampling examples
Given a set of 20 coins of different colours.
The population is divided into 5 clusters, each having 4 coins.
A whole cluster is randomly selected to be included in the sample.
(Source: analyticsvidhya.com)

Cluster Sampling: Advantages
It is more convenient for geographically dispersed populations.
It can reduce the travel costs to contact sample elements.
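The one-stage and two-stage variants of cluster sampling described above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function name cluster_sample, the cluster labels, and the way the 20 coins are assigned to the 5 clusters are assumptions made for the example.

```python
import random


def cluster_sample(clusters, n_clusters, per_cluster=None, seed=None):
    """Randomly select whole clusters, then take all or some of their members.

    per_cluster=None -> one-stage: every element of each chosen cluster.
    per_cluster=m    -> two-stage: a simple random sample of m elements
                        from each chosen cluster.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # pick clusters at random
    selected = []
    for label in chosen:
        members = clusters[label]
        if per_cluster is None:
            selected.extend(members)                           # one-stage
        else:
            selected.extend(rng.sample(members, per_cluster))  # two-stage
    return chosen, selected


# 20 coins in 5 clusters of 4 coins each (the assignment is hypothetical).
clusters = {f"cluster_{i + 1}": list(range(4 * i + 1, 4 * i + 5))
            for i in range(5)}

chosen_one, one_stage = cluster_sample(clusters, n_clusters=1, seed=7)
chosen_two, two_stage = cluster_sample(clusters, n_clusters=2,
                                       per_cluster=2, seed=7)
print(chosen_one, one_stage)
print(chosen_two, two_stage)
```

Unlike stratified sampling, most clusters contribute nothing to the sample, which is why the method relies on each cluster resembling the whole population.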
Cluster Sampling: Disadvantages
There is higher sampling error.
The method is prone to biases. If the clusters representing the entire population were formed under a biased opinion, the inferences about the entire population would be biased as well.

Difference between Strata and Clusters
Although strata and clusters are both non-overlapping subsets of the population, they differ in how they are selected and in their make-up: every stratum is sampled and its members are homogeneous, whereas only some clusters are selected and the members of each cluster are heterogeneous.
(Source: miro.medium.com)

Non-probability Sampling
Non-probability sampling is a type of sampling in which not every unit in the population has a known, non-zero chance of being selected in the sample.
Non-probability samples: Convenience, Judgement, Quota, Snowball

Convenience Sampling
This is a type of non-probability sampling in which the sample is drawn from the part of the population that is close to hand, that is, readily available and convenient.
A researcher using such a sample cannot scientifically generalize to the total population, because the sample would not be representative enough.
(Source: googleusercontent.com)

Convenience Sampling examples
Given a set of 20 coins of different colours.
Let’s say that the researcher likes the numbers 4, 7, 12, 15, 20.
Thus, the coins with those numbers are included in the sample.
(Source: analyticsvidhya.com)

Convenience Sampling examples
To research opinions about student support services in your university:
After each of your classes, you ask your fellow students to complete a survey on the topic.
This is a convenient way to gather data, but as you only surveyed students taking the same classes as you at the same level, the sample is not representative of all the students at your university.
(Source: assets.pearsonschool.com)

Convenience Sampling: Advantages & Disadvantages
➔ Advantages: This type of sampling is useful in a pilot study. It is an inexpensive way to gather initial data for the research. It saves time.
➔ Disadvantages: The sample may not be representative of the population, so this type of sampling can’t produce generalizable results. It is prone to sampling bias.
(Slide Courtesy: Dr. Uma)

Judgemental Sampling
Judgemental or purposive sampling is a type of non-probability sampling where the researcher chooses the sample based on who they think would be appropriate for the study.
This is used primarily when there is a limited number of people who have expertise in the area being researched.
(Source: dataz4s.com)

Judgemental Sampling examples
Given a set of 20 coins of different colours.
Suppose the experts believe that coins numbered 1, 7, 10, 15, and 19 should be considered for the sample, as they may help us to infer the population in a better way.
(Source: analyticsvidhya.com)

Judgemental Sampling: Advantages & Disadvantages
➔ Advantages: It consumes minimum time. The researcher is given an opportunity to bring their judgement and expertise into play.
➔ Disadvantages: It is prone to errors in judgement by the researcher. It has a low level of reliability and a high level of bias.
(Slide Courtesy: Dr. Uma)

Quota sampling
In this type of sampling, sample elements are selected until the quota controls are satisfied.
The population is first segmented into mutually exclusive sub-groups, just as in stratified sampling.
The population units are selected based on predetermined characteristics of the population.
It is similar to stratified sampling, but it doesn’t involve random selection.
Ex: recruiting the first 50 men and the first 50 women that meet the inclusion criteria.

Quota sampling
(Figure; Source: questionpro.com)

Quota sampling examples
Given a set of 20 coins of different colours.
Here we need to select items based on predetermined characteristics of the population.
Suppose we have to select the coins whose numbers are multiples of four for our sample.
Thus, the coins 4, 8, 12, 16, 20 are sampled.
(Source: analyticsvidhya.com)

Quota sampling: Advantages & Disadvantages
➔ Advantages: It is a cost-effective method. It is a speedy process.
➔ Disadvantages: It is impossible to determine the sampling error. It can result in sampling bias.
(Slide Courtesy: Dr. Uma)

Snowball sampling
In this type of sampling, survey subjects are selected based on referral from other survey respondents.
Existing subjects are asked to nominate further subjects known to them, so that the sample increases in size like a rolling snowball.
This method of sampling is effective when a sampling frame is difficult to identify.
It is usually applied when the subjects are difficult to trace.
Ex: it is extremely challenging to survey homeless people or illegal immigrants by other methods.
(Source: cuttingedgepr.com)

Snowball sampling
(Figure; Source: questionpro.com)

Snowball sampling examples
To select students from a class of 20 to be part of a volunteer club:
Here, we randomly chose person 1 for our sample, then he/she recommended person 6, person 6 recommended person 11, and so on.
1->6->11->14->19
(Source: analyticsvidhya.com)

Snowball sampling: Advantages & Disadvantages
➔ Advantages: The chain referral process allows the researcher to reach populations that are difficult to sample when using other sampling methods. This sampling technique needs little planning and a smaller workforce compared to other sampling techniques.
➔ Disadvantages: There is a significant risk of selection bias in snowball sampling, as the referred individuals tend to share common traits with the person who recommends them. It is usually impossible to determine the sampling error.

Sample size
The more heterogeneous a population is, the larger the sample needs to be.
For probability sampling, the larger the sample size, the better.
With non-probability samples, the results are not generalizable, whatever the sample size.
The main factors affecting the sample size are:
○ Total size of the population
○ Margin of error
○ Confidence level

Sample statistic & Population parameter
➔ Sample statistic: A sample statistic is a piece of information you get from a fraction of a population, i.e. a sample. It can also be defined as any number computed from the sample data.
Example: sample average, median, sample standard deviation, and percentiles.
➔ Population parameter: A quantity or statistical measure for a given population is called a population parameter. It can also be defined as a number that describes something about an entire population.
Example: the mean and variance of a population are population parameters.

Sample statistic & Population parameter
Decide whether the numerical value describes a population parameter or a sample statistic.
a.) A recent survey of a sample of 450 college students reported that the average weekly income for students is $325.
Ans: Because the average of $325 is based on a sample, this is a sample statistic.
b.) The average weekly income for all students is $405.
Ans: Because the average of $405 is based on a population, this is a population parameter.

Errors in sampling
➔ Sampling error or random error: occurs when the sample is not representative of the population.
➔ Non-sampling error or systematic error: occurs during data collection, causing the data to differ from the true values.
(Slide Courtesy: Dr. Uma)

Sampling error
The discrepancy between a sample statistic and its population parameter is called sampling error.
Defining and measuring sampling error is a large part of inferential statistics.
It occurs when the sample is not representative of the population.

Sampling error
[Figure] There is a difference between the population parameters and the sample statistics. This is due to sampling error.
Two samples from the same population have differing statistics. This is due to sampling variation.
It is also one reason why scientific experiments produce different results under identical conditions.

Non-sampling error
Non-sampling errors are the results of mistakes made in implementing data collection and data processing, such as
○ failure to locate and interview the correct household
○ errors in understanding of the questions by either the interviewer or the respondent
○ data entry errors
○ missing data
○ poorly conceived concepts, unclear definitions, and defective questionnaires
○ response errors occurring when people are unaware, refuse to answer, or overstate in their answers

Sampling Bias
Sampling bias occurs when a chosen sample is not representative of the larger population.
It occurs due to the sampling technique/method used to perform data collection.
It can be either selection bias or non-response bias.

Bias ex:
Q) A new chemical process is run 10 times each morning for five consecutive mornings. If the new process is put into production, it will be run 10 hours each day, from 7 A.M. until 5 P.M. Is it reasonable to consider the 50 yields to be a simple random sample?
Ans) Since the new process runs during both morning and afternoon, the population consists of all the yields that would ever be observed, including both morning and afternoon runs. The sample, however, is drawn only from the portion of the population that consists of morning runs, and thus it is not a simple random sample. It exhibits a bias and is not representative of the population intended to be studied.

Sampling variation
Simple random samples always differ from their populations in some ways, and occasionally may be substantially different.
Two different samples from the same population will differ from each other as well.
This phenomenon is known as sampling variation.
Sampling variation is one of the reasons that scientific experiments produce somewhat different results when repeated, even when the conditions appear to be identical.

Questions
(Q) A quality engineer wants to inspect rolls of wallpaper in order to obtain information on the rate at which flaws in the printing are occurring. She decides to draw a sample of 50 rolls of wallpaper from a day’s production. Each hour for 5 hours, she takes the 10 most recently produced rolls and counts the number of flaws on each. Is this a simple random sample?
Answer: No. Not every subset of 50 rolls of wallpaper is equally likely to comprise the sample. To construct a simple random sample, the engineer would need to assign a number to each roll produced during the day and then generate random numbers to determine which rolls comprise the sample.
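The procedure in the answer above (number every roll, then let random numbers decide which rolls are inspected) might look like this in Python. It is a sketch under stated assumptions: the day's production of 1,200 rolls is a made-up figure, and simple_random_sample is a hypothetical helper name.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Pick `sample_size` distinct unit numbers from 1..population_size.

    random.Random.sample draws without replacement, so every subset of the
    requested size is equally likely to be chosen.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, population_size + 1), sample_size))


rolls_produced = 1200   # hypothetical size of one day's production
sample_size = 50

selected_rolls = simple_random_sample(rolls_produced, sample_size, seed=42)
selection_probability = sample_size / rolls_produced  # n/N for each roll

print(selected_rolls[:10])
print(f"Each roll is selected with probability {selection_probability:.2%}")
```

By contrast, taking the 10 most recent rolls each hour can only ever produce samples made of consecutive rolls, so most subsets of 50 rolls have zero chance of being chosen.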
THANK YOU
Dr. Mamatha H R, Professor, Department of Computer Science
+91 80 2672 1983 Extn 712

UE23MA242A Unit 1: Types of Data & Experiments
Mamatha H R, Department of Computer Science and Engineering

Topics to be covered
❖ Types of data
❖ Variables or Attributes
❖ Types of studies

Types of data
(Figure; Source: lh5.googleusercontent.com)

Types of data
Based on their mathematical properties, data are divided into four groups (NOIR): Nominal, Ordinal, Interval, Ratio.
They are listed in order of increasing accuracy, power, and breadth of application.
(Slide Courtesy: Dr. Uma)

Qualitative data
Qualitative data are measurements that cannot be recorded on a natural numerical scale, but are recorded in categories.
It is also known as categorical data, as the information can be sorted by category, not by number.
Example: Year in school, Live on/off campus, Major, Gender, Hair color, etc.
In general, there are 2 types of qualitative data:
○ Nominal data
○ Ordinal data

Qualitative data: Nominal data
They are categories without any particular order or direction.
Nominal data is sometimes referred to as "labels".
They are the least powerful level of measurement, with no arithmetic origin or order.
Hence, nominal data is of restricted or limited use.
(Slide Courtesy: Dr. Uma)

Qualitative data: Nominal data
Nominal data can’t be manipulated using arithmetic operators, but it can be visualized using a pie chart.
(Source: dpbnri2zg3lc2.cloudfront.net, researchgate.net; Slide Courtesy: Dr. Uma)

Qualitative data: Nominal data
➔ How to analyze Nominal Data?
Use the grouping method: group the values into categories.
For each category, the frequency or percentage can be calculated.
(Slide Courtesy: Dr. Uma)

Qualitative data: Ordinal data
Ordinal data is qualitative data for which the values are ordered.
Ordinal data may indicate superiority.
However, we cannot do arithmetic operations with ordinal data because they only show the sequence.
Ordinal data allows values to be compared with inequalities, but the size of the difference between values is not defined, so more precise comparisons are not possible.
Examples:
○ Ranking of users in a competition: first, second, third, etc.
○ Rating of a product collected by a company on a scale of 1-10.
○ Economic status: low, medium, and high.
(Slide Courtesy: Dr. Uma)

Qualitative data: Ordinal data
Here, the order matters but not the difference between values.
Example: Pain Scales
○ Patients are asked to express the amount of pain they are feeling on a scale of 1 to 10.
○ A score of 7 means more pain than a score of 5, and that is more pain than a score of 3.
○ But the difference between the 7 and the 5 may not be the same as that between the 5 and the 3.
(Source: Questionpro, slideshare.net; Slide Courtesy: Dr. Uma)

Qualitative data: Ordinal data examples
(Figure; Source: Questionpro, slideshare.net, analyticsvidhya.com; Slide Courtesy: Dr. Uma)

Quantitative data
Quantitative data are measurements that are recorded on a naturally occurring numerical scale.
They lend themselves readily to statistical manipulation and can be represented by a wide variety of statistical graphs and charts, such as line charts, bar graphs, scatter plots, etc.
Example: Age, GPA, Salary, Cost of books this semester, Scores of tests and exams, Weight of a person, Temperature in a room, etc.
Quantitative data are of 2 general types, discrete and continuous, and are further classified by scale of measurement into interval and ratio data.

Quantitative data: Discrete data
A set of data is said to be discrete if the values belonging to the set are distinct and separate.
The data values cannot be divided into smaller parts.
For example, we can count only whole individuals, not fractions such as 2.5 kids.
It has a limited (countable) number of possible values, e.g. days of the month.
Examples of discrete data:
○ The number of students in a class.
○ The number of workers in a company.
○ The number of test questions you answered correctly.
(Slide Courtesy: Dr. Uma)

Quantitative data: Discrete data
Bar charts can be used to display discrete numerical data.
For example, the bar chart on this slide shows the number of CDs bought by a group of children in a given month.
(Source: slideplayer.com)

Quantitative data: Continuous data
A set of data is said to be continuous if the values belonging to the set can take on any value within a finite or infinite interval.
Continuous data can be meaningfully divided into finer levels.
Examples of continuous data:
○ The amount of time required to complete a project.
○ The height of children.
○ The speed of cars.

Quantitative data: Interval data
Interval data is measured along a scale on which each point is placed at an equal distance from the next.
These data are measurable and ordered, with equal spacing between adjacent values, but have no meaningful zero.
Some descriptive statistics we can calculate for interval data are:
○ Measures of central tendency (mean, median, mode)
○ Range (minimum, maximum)
○ Spread (percentiles, interquartile range, and standard deviation).
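The descriptive statistics listed above can be computed with Python's standard statistics module, as sketched below. The temperature readings are hypothetical values chosen only for illustration; temperature in Celsius is used because it is an interval-scale variable.

```python
import statistics

# Hypothetical room temperatures in degrees Celsius (interval data).
temperatures_c = [18, 21, 21, 23, 24, 26, 27, 30]

# Measures of central tendency
mean_c = statistics.mean(temperatures_c)
median_c = statistics.median(temperatures_c)
mode_c = statistics.mode(temperatures_c)

# Range (minimum, maximum)
value_range = max(temperatures_c) - min(temperatures_c)

# Spread: quartiles, interquartile range, sample standard deviation
q1, q2, q3 = statistics.quantiles(temperatures_c, n=4)
iqr = q3 - q1
stdev_c = statistics.stdev(temperatures_c)

print(f"mean={mean_c}, median={median_c}, mode={mode_c}")
print(f"range={value_range}, IQR={iqr}, stdev={stdev_c:.2f}")
```

Differences such as the range are meaningful on an interval scale, but ratios are not: because 0 °C is not a true zero, 30 °C is not "twice as hot" as 15 °C.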
Quantitative data: Interval data
(Figure; Source: dpbnri2zg3lc2.cloudfront.net, slideshare.net; Slide Courtesy: Dr. Uma)

Quantitative data: Ratio data
Unlike interval data, ratio data has a true zero.
This means that zero is an absolute point below which there are no meaningful values.
Speed, age, and weight are all good examples, since none can have a negative value (you cannot be -10 years old or weigh -160 pounds).
These data are also measured in ordered units with equal differences between them.
It is the most precise type of data and allows all statistical techniques to be applied.
(Slide Courtesy: Dr. Uma)

Quantitative data: Ratio data
(Figure; Source: www.chi2innovations.com; Slide Courtesy: Dr. Uma)

Identify the type of data
➔ Number of cartons of milk manufactured each day.
Quantitative data, Discrete data, Ratio data
➔ Temperatures of airplane interiors at a given airport in Celsius.
Quantitative data, Continuous data, Interval data
➔ College major of each student in a class.
Qualitative data, Nominal data
➔ Method of payment.
Qualitative data, Nominal data
➔ Incomes of college students on work study programs.
Quantitative data, Discrete data, Ratio data
➔ Weights of newborn calves.
Quantitative data, Continuous data, Ratio data
(Slide Courtesy: Dr. Uma)

Identify the type of data
➔ Gender of each employee at a company.
Qualitative data, Nominal data
➔ Number of tomatoes on each plant in a field.
Quantitative data, Discrete data, Ratio data
➔ Number of defective items in a lot.
Quantitative data, Discrete data, Ratio data
➔ Salaries of CEOs of oil companies.
Quantitative data, Discrete data, Ratio data
(Slide Courtesy: Dr. Uma)

Attribute or Variable
An attribute (or variable, feature, dimension) is a data field representing a characteristic or feature of a data object.
It is a property of a data object that is measured for each observation or record.
It can vary from one observation to another.
Example: name, age, Student-ID, address, marks, gender, etc.
Attributes or variables are classified in the same way as data types.
(Slide Courtesy: Dr. Uma)

Attribute or Variable
(Figure; Source: towardsdatascience.com)

Properties of Attributes
The type of an attribute depends on which of the following properties it possesses:
Distinctness: =, ≠
Order: <, >
Addition: +, -
Multiplication: *, /
➔ Nominal attribute: distinctness
➔ Ordinal attribute: distinctness & order
➔ Interval attribute: distinctness, order & addition
➔ Ratio attribute: all 4 properties

Properties of Attributes
(Figure; Source: dpbnri2zg3lc2.cloudfront.net)

Examples
In the table on this slide, identify which columns represent qualitative variables and which columns represent quantitative variables.
Answer:
Qualitative variables: Name, River, State
Quantitative variables: Height, Completed

Types of studies
We do studies to gather information and draw conclusions. The type of conclusion we can draw depends on the study method used:
I. Observational study: In an observational study, we measure or survey members of a sample without trying to affect them.
II. Controlled study: In a controlled experiment, we assign people or things to groups and apply some treatment to one of the groups, while the other group does not receive the treatment. It is used to establish cause and effect.
(Slide Courtesy: Dr. Uma)

Controlled Study
(Figure; Source: www.scienceabc.com; Slide Courtesy: Dr. Uma)

Control Group vs Experimental Group
(Figure; Source: thoughtco.com; Slide Courtesy: Dr. Uma)

Control Group vs Experimental Group
➔ Control Group: A control group is a group separated from the rest of the experiment such that the independent variable being tested cannot influence the results. This isolates the independent variable’s effect on the experiment and can help rule out alternative explanations of the experimental results.
➔ Experimental Group: An experimental group is a test sample, or the group that receives an experimental procedure. This group is exposed to changes in the independent variable being tested. The values of the independent variable and the impact on the dependent variable are recorded. An experiment may include multiple experimental groups at one time.

Controlled Experiment
While all experiments have an experimental group, not all experiments require a control group.
Controls are extremely useful where the experimental conditions are complex and difficult to isolate.
Experiments that use control groups are called controlled experiments.
(Source: cdn.kastatic.org)

Identify the types of study
Q1. A study took a random sample of adults and asked them about their bedtime habits. The data showed that people who drank a cup of tea before bedtime were more likely to go to sleep earlier than those who didn't drink tea.
Answer: Observational study
Q2. A study took a group of adults and randomly divided them into two groups. One group was told to drink tea every night for a week, while the other group was told not to drink tea that week. Researchers then compared when each group fell asleep.
Answer: Experimental study
(Slide Courtesy: Dr. Uma)

Identify the types of study
Q3. A study randomly assigned volunteers to one of two groups: one group was directed to use social media sites as they usually do; the other group was blocked from social media sites.
Answer: Experimental study
Q4. A study took a random sample of people and examined their social media habits. Each person was classified as either a light, moderate, or heavy social media user. The researchers looked at which groups tended to be happier.
Answer: Observational study
(Slide Courtesy: Dr. Uma)

THANK YOU
Dr. Mamatha H R, Professor, Department of Computer Science
+91 80 2672 1983 Extn 834
K.M Mitravinda

UE23MA242A Unit 1: Types of Statistics & Summary Statistics
Mamatha H R, Department of Computer Science and Engineering

Statistics
Statistics involves:
Collecting data. Ex: Survey
Presenting/describing data. Ex: Charts & Tables
Characterizing data (predicting). Ex: Average
(Slide Courtesy: Dr. Uma)

Branches of Statistics
The study of statistics has two major branches:
1) Descriptive statistics: involves organization, summarization, and display of data.
2) Inferential statistics: involves using a sample to draw conclusions about a population.

Descriptive Statistics
For example, tables or graphs are used to organize data, and descriptive values such as the average score are used to summarize data.
A descriptive value for a population is called a parameter, and a descriptive value for a sample is called a statistic.

Descriptive Statistics
➔ Summarizing Data:
Central Tendency (or Groups' "Middle Values"): Mean, Median, Mode
Variation (or Summary of Differences Within Groups): Range, Interquartile Range, Variance, Standard Deviation

Why is Descriptive Statistics used?
Figure speaks it all!
Source: luminousmen.com/post/, www.slideshare.net, slidetodoc.com

Inferential Statistics
Inferential statistics uses sample data to make estimates, decisions, predictions or other generalizations about a larger set of data. There are two main areas of inferential statistics:
1. Estimating parameters: This means taking a statistic from the sample data (for example, the sample mean) and using it to say something about a population parameter (for example, the population mean).
2. Hypothesis tests: This is where sample data are used to answer research questions. For example, one might be interested in knowing whether a new cancer drug is effective, or whether breakfast helps children perform better in school.

Why is Inferential Statistics used?
Suppose you want to know the mean income of the subscribers of Netflix. Mean (µ): a parameter of the population.
You draw a random sample of 100 subscribers and determine that their mean income is $27,500. Mean (x̅) = $27,500 (a summary statistic).
Conclusion: You conclude that the population mean income µ is likely to be close to $27,500 as well. This is an example of statistical inference.
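The subscriber-income example above can be imitated with a small simulation. This is only a sketch under stated assumptions: the population of incomes is synthetic (drawn from a normal distribution with a made-up mean and spread), and the names `population`, `sample` and the seed are hypothetical, not taken from the slides. It illustrates a sample statistic x̅ from 100 draws landing near the population parameter µ.

```python
import random
import statistics

# Assumption: a synthetic "population" of subscriber incomes (not real data).
rng = random.Random(42)
population = [rng.gauss(27_500, 8_000) for _ in range(10_000)]

# Parameter: a descriptive value for the whole population.
mu = statistics.mean(population)

# Statistic: a descriptive value for a random sample of 100 subscribers.
sample = rng.sample(population, 100)
x_bar = statistics.mean(sample)

print(f"population mean (parameter): {mu:.2f}")
print(f"sample mean (statistic):     {x_bar:.2f}")
print(f"absolute difference:         {abs(x_bar - mu):.2f}")
```

In practice µ is unknown, which is why the sample mean is used as its estimate; the simulation can print both only because the whole population was generated here.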

Descriptive Statistics vs Inferential Statistics
Descriptive Statistics: Organize; Summarize; Simplify; Presentation of data; Describing data
Inferential Statistics: Generalize from samples to population; Hypothesis testing; Relationships among variables; Make predictions

Questions
Q1. In a recent study, volunteers who had less than 6 hours of sleep were four times more likely to answer incorrectly on a science test than were participants who had at least 8 hours of sleep. Decide which part is the descriptive statistic and what conclusion might be drawn using inferential statistics.
Ans: The statement "four times more likely to answer incorrectly" is a descriptive statistic. An inference drawn from the sample is that all individuals sleeping less than 6 hours are more likely to answer science questions incorrectly than individuals who sleep at least 8 hours.

Types of Descriptive Statistics
(Figure: types of descriptive statistics, including interquartile range and variance.)
Source: geeksforgeeks

Measures of central tendency
Source: sixsigma-institute.org

Measures of central tendency: Mean
For a finite data set with measurement values x1, x2, …, xn (a set of n numbers), the sample mean is defined by the formula:
x̅ = (x1 + x2 + … + xn) / n
Source: sixsigma-institute.org

Measures of central tendency: Weighted mean
Weighted mean is an average where certain values of the data set contribute more to the mean value.
For a finite data set with measurement values x1, x2, …, xn (a set of n numbers) and the corresponding weights w1, w2, …, wn, it is defined by the formula:
x̅w = (w1x1 + w2x2 + … + wnxn) / (w1 + w2 + … + wn)

Measures of central tendency: Trimmed mean
The trimmed mean is computed by arranging the sample values in order, "trimming" an equal number of them from each end, and computing the mean of those remaining. If p% of the data are trimmed from each end, the resulting trimmed mean is called the "p% trimmed mean". The most commonly used trimmed means are the 5%, 10%, and 20% trimmed means.
If the sample size is denoted by n, and a p% trimmed mean is desired, the number of data points to be trimmed from each end is np/100.
It is used to reduce the effects of outliers on the calculated average. (An outlier is a data point that is radically "distant" or "away" from the common trend of values in a given set.) This measure is best suited for data with large, erratic deviations or extremely skewed distributions.
Source: exceluser.com, MathBitsNotebook

Measures of central tendency: Mean
➔ Advantages: It takes into account all the available information. It is an easy and quick way to represent the entire set of data values by a single number.
➔ Disadvantages: It is very sensitive to extreme values. It can only be used on interval or ratio data.
Source: slideshare.net

Measures of central tendency: Median
Median is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution.
For a data set, it is not affected by a small proportion of extremely large or small values, and therefore provides a better representation of the centre.
➔ Process of calculating the median:
➔ Arrange all the values of the data set in ascending order: X1, X2, X3, …, Xn
If n is odd, median = value of the ((n + 1)/2)th element
If n is even, median = (value of the (n/2)th element + value of the (n/2 + 1)th element) / 2
➔ Advantages: Not affected by the outliers in the data set.
➔ Disadvantages: It cannot be utilized for further algebraic treatment.

Measures of central tendency: Mode
The mode is the value that appears most often in a set of data values. The mode is not necessarily unique (a data set can be multimodal).
Empirical formula: Mode ≈ 3 × Median − 2 × Mean
➔ Advantages: Quick and easy to compute. Unaffected by extreme values. Useful to find the most "popular" or common item.
➔ Disadvantages: A given subgroup could make this measure unrepresentative of the population's centre. If there are many values that have the same count, then the mode can be meaningless.

Skewed and Symmetric distributions
Skewness is a measure of the asymmetry of a distribution about its mean. The skewness value can be positive, zero, negative, or undefined.
Symmetric Distribution: A symmetric distribution is one where the left and right hand sides of the distribution are roughly equally balanced around the mean. In symmetric unimodal distributions, the mean, median, and mode are the same.
Skewed Distribution: A skewed distribution is one where the left and right hand sides of the distribution are not balanced around the mean. In skewed data, the mean and median lie further toward the skew than the mode.
The greater the distance between the mean and the median, the greater the skewness of the distribution.
Source: www.slideshare.net
If mean = median = mode, the shape of the distribution is symmetric.
If mode < median < mean, the distribution trails to the right and is positively skewed.
If mean < median < mode, the distribution trails to the left and is negatively skewed.

When to use mean, median and mode?
TYPE OF VARIABLE: BEST MEASURE OF CENTRAL TENDENCY
Nominal: Mode
Ordinal: Median
Interval / Ratio (not skewed): Mean
Interval / Ratio (skewed): Median
Source: sixsigma-institute.org, statistics.laerd.com

Measures of Spread/Dispersion
Measures of spread help to interpret the variability of data. They help to show how homogeneous or heterogeneous the data are.
There are two main types of dispersion methods in statistics: (i) Absolute Measure of Dispersion (ii) Relative Measure of Dispersion
Source: image.slidesharecdn.com
Absolute Measure of Dispersion: It is expressed in the same unit as the original data set. It includes range, standard deviation, quartile deviation, etc.
Relative Measure of Dispersion: The relative measures of dispersion are used to compare the distribution of two or more data sets. This measure compares values without units. Common relative dispersion methods include: Coefficient of Range, Coefficient of Variation, Coefficient of Standard Deviation, Coefficient of Quartile Deviation, Coefficient of Mean Deviation.
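The ordering rule for mean, median and mode, and the idea of a unit-free relative measure, can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the data sets and the helper names `shape_from_centres` and `coefficient_of_variation` are hypothetical, and the coefficient of variation is taken here as the sample standard deviation divided by the mean.

```python
import statistics

def shape_from_centres(mean, median, mode):
    """Classify the shape using the mean/median/mode ordering rule."""
    if mean == median == mode:
        return "symmetric"
    if mode < median < mean:
        return "positively skewed"
    if mean < median < mode:
        return "negatively skewed"
    return "inconclusive"

def coefficient_of_variation(data):
    """Relative dispersion: sample standard deviation divided by the mean."""
    return statistics.stdev(data) / statistics.mean(data)

# Hypothetical data with a long right tail (illustration only).
scores = [2, 3, 3, 3, 4, 4, 5, 6, 9, 21]
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)
shape = shape_from_centres(mean, median, mode)
print(mean, median, mode, shape)

# The same hypothetical heights in two units.
heights_cm = [150, 160, 170, 180, 190]
heights_m = [h / 100 for h in heights_cm]

# Absolute measure (range) depends on the unit; relative measure (CV) does not.
range_cm = max(heights_cm) - min(heights_cm)
range_m = max(heights_m) - min(heights_m)
cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)
print(range_cm, range_m, cv_cm, cv_m)
```

The ordering rule is a rule of thumb, so the function returns "inconclusive" for orderings it does not cover rather than forcing a label.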

Measures of Spread: Range
(Figure: age data for Class A and Class B.)
Observations: Since the range of Class A is smaller than that of Class B, can we claim that the age distribution in Class A is more clustered (closely related) than in Class B? In other words, are the ages listed in Class A more uniform than in Class B?
Source: Chilimath.com
➔ Advantages: It is the simplest of the measures of dispersion. It is independent of change of origin.
➔ Disadvantages: It is based on two extreme observations and is therefore affected by fluctuations. It can be drastically affected by outliers.

Measures of Spread: Quantiles
When presenting or analysing measurements of a continuous variable it is sometimes helpful to group subjects into several equal groups. For example, to create four equal groups we need the values that split the data such that 25% of the observations are in each group.
The general term for such cut-off points is quantiles; other values likely to be encountered are quintiles, which split the data into 5 parts, deciles, which split the data into 10 parts, and centiles, which split the data into 100 parts (also called percentiles).

Measures of Spread: Percentile
A percentile is a comparison measure between a particular value and the values of the rest of the data set. It shows the percentage of values that a particular element has surpassed. For example, if you score 75 points on a test and are ranked in the 85th percentile, it means that the score 75 is higher than 85% of the scores.
The rank R (the position in the ordered data) of the Pth percentile is calculated using the formula R = (P/100) × (N + 1), where P is the desired percentile and N is the number of data points.
The pth percentile of a sample, for a number p between 0 and 100, divides the sample such that
○ p% of the sample values are less than the pth percentile
○ (100 − p)% of the sample values are greater than the pth percentile

Percentile
Steps to calculate a percentile:
1. Order the n sample values from smallest to largest.
2. Compute the quantity (P/100)(n + 1), where n is the sample size.
3. If the above quantity is an integer, the sample value in this position is the percentile.
4. Otherwise, average the two sample values at the integer positions immediately below and above the quantity obtained in step 2.

Measures of Spread: Quartiles
Quartiles are the values that divide a list of numbers into quarters. Quartiles are obtained by first putting the list of numbers in order and then cutting the list into four equal parts. The quartiles are at the "cuts" in the data.
The first quartile (Q1) is the middle number between the smallest number and the median of the data.
The second quartile (Q2) is the median of the data set.
The third quartile (Q3) is the middle number between the median and the largest number.

Measures of Spread: Quartile example
Source: mathsisfun.com

Measures of Spread: Inter-quartile Range
Interquartile range is the distance or range between the 25th percentile and the 75th percentile. That is, it quantifies the difference between the third and first quartiles.
Interquartile Range = Upper Quartile (Q3) − Lower Quartile (Q1)
IQR = Q3 − Q1
Source: mathsisfun.com

Measures of Spread: Inter-quartile Range question
For the following data set, calculate the quartiles and find the interquartile range. The following numbers represent the time in minutes that twelve employees took to get to work on a particular day.
18 34 68 22 10 92 46 52 38 29 45 37

Measures of Spread: Variance
Variance is a measure of the spread of the recorded values on a variable. It is a measure of dispersion, meaning it is a measure of how far a set of numbers is spread out from their average value.
The larger the variance, the further the individual cases are from the mean.
The smaller the variance, the closer the individual scores are to the mean.
It is the average of the squared distances of each score from the mean (squared deviation from the mean).
Steps to calculate variance:
1. Find the mean value of the given data values.
2. Subtract the mean from each data value.
3. Square each value obtained in step 2.
4. Find the sum of all values obtained in step 3.
5. Divide the result obtained in step 4 by N (for a population) or n − 1 (for a sample).
Source: standard-deviation-calculator.com

Measures of Spread: Standard Deviation
The standard deviation does not decline as the sample size increases. The estimate of the standard deviation becomes more stable as the sample size increases.
Source: exceluser.com, MathBitsNotebook
The larger the standard deviation, the greater the amount of variation around the mean.
Standard deviation = 0 only when all values are the same (only when you have a constant and not a "variable").
If you were to "rescale" a variable, the s.d. would change by the same factor.
Like the mean, the standard deviation will be inflated by an outlier case value.
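The percentile steps and the five variance steps above can be sketched in Python using the twelve commute times from the interquartile range question. This is a minimal sketch, not the course's reference solution: the helper names `percentile` and `variance` are illustrative, and the percentile helper assumes the (P/100)(n + 1) position rule stated on the slides, which differs from the default interpolation used by many libraries.

```python
import math
import statistics

def percentile(values, p):
    """p-th percentile using the (p/100)(n + 1) position rule."""
    ordered = sorted(values)                 # step 1: order the sample
    n = len(ordered)
    position = p * (n + 1) / 100             # step 2: 1-based position
    if position < 1 or position > n:
        raise ValueError("position falls outside the sample")
    below = math.floor(position)
    if position == below:                    # step 3: integer position
        return ordered[below - 1]
    # step 4: average the values at the neighbouring integer positions
    return (ordered[below - 1] + ordered[below]) / 2

def variance(values, sample=True):
    """Mean, deviations, squares, sum, then divide by n - 1 or N."""
    mean = sum(values) / len(values)
    squared_deviations = [(x - mean) ** 2 for x in values]
    divisor = len(values) - 1 if sample else len(values)
    return sum(squared_deviations) / divisor

# Commute times in minutes for the twelve employees.
times = [18, 34, 68, 22, 10, 92, 46, 52, 38, 29, 45, 37]

q1, q2, q3 = (percentile(times, p) for p in (25, 50, 75))
iqr = q3 - q1
sample_var = variance(times)
sample_sd = math.sqrt(sample_var)

print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("sample variance:", round(sample_var, 2))
print("sample standard deviation:", round(sample_sd, 2))
# Cross-check against the standard library's n - 1 variance.
print("statistics.variance:", round(statistics.variance(times), 2))
```

For these twelve values the position rule gives Q1 = 25.5, Q2 = 37.5 and Q3 = 49, so IQR = 23.5. Passing `sample=False` divides by N instead of n − 1, which corresponds to the population case in step 5.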

DATA ANALYTICS
Measures of Spread: Standard Deviation example
Calculate Standard Deviation for the following discrete data:
Items (x):      5   15   25   35
Frequency (f):  2    1    1    3
Mean x̅ = (5×2 + 15×1 + 25×1 + 35×3) / 7 = (10 + 15 + 25 + 105) / 7 = 155 / 7 ≈ 22.14

x    f    x̅       x − x̅     f(x − x̅)²
5    2    22.14   −17.14    587.76
15   1    22.14   −7.14     51.02
25   1    22.14   2.86      8.16
35   3    22.14   12.86     495.92
N = 7, ∑f(x − x̅)² = 1142.86 (products computed from the unrounded mean 155/7)
Source: https://www.tutorialspoint.com/statistics/

Measures of Spread: Standard Deviation example
Calculate Standard Deviation for the following continuous data:
Items:      0-10   10-20   20-30   30-40
Frequency:  2      1       1       3
In the case of a continuous series, the mid point of each class is computed as (lower limit + upper limit) / 2.
Source: https://www.tutorialspoint.com/statistics/

THANK YOU
Dr. Mamatha H R, Professor, Department of Computer Science
+91 80 2672 1983 Extn 834

STATISTICS FOR DATA SCIENCE
Data Visualization and Interpretation
Dr. Mamatha H R, Ms. Yousha Mahamuni
Department of Computer Science and Engineering

Data Visualization and Interpretation - Bar Charts
Dr. Mamatha H R, Department of Computer Science and Engineering

Data Visualization: Bar Charts
Often known as the "King of Charts", Bar Charts are one of the most commonly used charts in the field of Data Science.
The advantage of bar plots (or "bar charts", "column charts") over other chart types is that the human eye has evolved a refined ability to compare the length of objects, as opposed to angle or area.

Data Visualization: Bar Charts
Summarizes categorical data.
Horizontal axis represents categories, while vertical axis represents either counts ("frequencies") or percentages ("relative frequencies").
Used to illustrate the differences in percentages (or counts) between categories.
The graph represents categories on one axis and a discrete value on the other. The goal is to show the relationship between the two axes.
Bar charts can also show big changes in data over time.

Data Visualization: Types of Bar Graphs
Horizontal Bar Graphs: The classes are displayed on the y-axis, and the values (scores) of those classes are displayed on the x-axis. Useful only when comparing one set of data.
Vertical Bar Graphs: The classes are displayed on the x-axis, and the values (scores) of those classes are displayed on the y-axis. Useful only when comparing one set of data.
Stacked Bar Graphs: Each bar has multiple datasets to be compared; the values belonging to the same class in the different datasets are stacked over one another. Useful when comparing multiple datasets that have the same set of classes.
Grouped Bar Graphs: Grouped bar charts are bar charts in which multiple sets of data items are compared, with a single color used to denote a specific series across all sets. A grouped or clustered bar graph is used to represent discrete values for more than one item that share the same category.
Grouped bar charts are a way of showing information about different sub-groups of the main categories. But care needs to be taken to ensure that the chart does not contain so much information that it becomes complicated to read and interpret.

Data Visualization: Bar Charts Examples
The vehicular traffic at a busy road crossing in a particular place was recorded on a particular day from 6 am to 2 pm and the data was rounded off to the nearest ten. Construct a Bar Chart.
Time in Hours:       6-7   7-8   8-9    9-10   10-11   11-12   12-1   1-2
Number of Vehicles:  100   450   1250   1050   750     600     550    200

Data Visualization: Bar Charts Examples
Look at the graph given. Read it carefully and answer the following questions.
(i) What information does the bar graph give?
(ii) In which subject is the student very good?
(iii) In which subject is he poor?
(iv) What is the average of his marks?
Answers: (i) It shows the marks obtained by a student in five subjects (ii) Mathematics (iii) Hindi (iv) 56

Data Visualization: Bar Charts Examples
In a survey of 85 families of a colony, the number of members in each family was recorded, and the data has been represented by the following bar graph. Read the bar graph carefully and answer the following questions:
(i) What information does the bar graph give?
(ii) How many families have 3 members?
(iii) How many people live alone?
(iv) Which type of family is the most common? How many members are there in each family of this kind?
Answers: (i) It gives the number of families containing 2, 3, 4, 5 members each.
(ii) 40 (iii) none (iv) Family having 3 members, 3 members.

Data Visualization: Bar Chart
Determine the discrete range: Examine your data to find the bar with the largest value. This will help you determine the range of the vertical axis and the size of each increment.
Determine the number of bars: Examine your data to find how many bars your chart will contain. Use this number to draw and label the horizontal axis.
Determine the order of the bars: Bars may be arranged in any order. (A bar chart arranged from highest to lowest incidence is called a Pareto chart.)
Draw the bars: If you are preparing a grouped bar graph, remember to present the information in the same order in each grouping.
Source: www.slideshare.net

Data Visualization: Bar Chart
Difference between Bar and Histogram Bar Histo