Mathematics For Computer Science Engineers Unit 1
Mamatha H R
Summary
This document is a course outline for a course titled "Mathematics for Computer Science Engineers." It covers topics such as probability distributions, point estimation, confidence intervals, hypothesis testing, and linear regression. It also explores applications of data science in various fields, including call centers, online shopping, and sports.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Unit 1: Introduction
Mamatha H R
Department of Computer Science and Engineering

Course content

Unit 1: Applications of Probability Distributions and Principles of Point Estimation
Introduction, Motivating Examples and Scope. Statistics: Introduction, Types of Statistics, Types of Data, Types of Experiments – Controlled and Observational study, Sampling: Sampling Methods, Sampling Errors, Case Study. Chebyshev's inequality, Normal Probability Plots, Introduction to Generation of Random Variates and their types, Acceptance-Rejection method, Sampling Distribution, The Central Limit Theorem and Applications, Principles of Point Estimation - Mean Squared Error for Bernoulli, Binomial, Poisson, Normal, Maximum Likelihood Estimate for Bernoulli, Binomial, Poisson, Normal and Case Study. Introduction to multivariate normal distribution, MAP distribution.
Self-Learning: Generation of Random Variates - Inverse Transform Method.
16 Hours

Unit 2: Confidence Intervals and Hypothesis Testing
Confidence Intervals: Interval Estimates for Mean of Large and Small Samples, Student's t Distribution, Interval Estimates for Proportion of Large and Small Samples, Confidence Intervals for the Difference between Two Means, Interval Estimates for Paired Data. Factors affecting Margin of Error, Hypothesis Testing for Population Mean and Population Proportion of Large and Small Samples, Drawing conclusions from the results of Hypothesis tests, Case Study.
Self-Learning: Confidence interval for difference between two proportions.
12 Hours

Unit 3: Distribution Free Tests and Multiple Linear Regression
Distribution Free Tests, Chi-squared Test, Fixed Level Testing, Type I and Type II Errors, Power of a Test, Factors Affecting Power of a Test.
Simple Linear Regression: Introduction, Correlation, the Least Squares Line, Predictions using regression models - Uncertainties in Regression Coefficients, Checking Assumptions and transforming data, Introduction to the Multiple Regression Model, Case Study.
Self-Learning: F test for equality of Variance.
14 Hours

Unit 4: Engineering optimization
Introduction to Optimization-Based Design, Modelling Concepts, Unconstrained Optimization, Discrete Variable Optimization, Genetic and Evolutionary Optimization, Constrained Optimization.
Self-Learning: Mathematical concepts of objective function, Constraints and Decision variables.
14 Hours

Course content: Applications

Unit 1: Applications
1. Poisson distribution: calculating the number of calls received in a specified time duration in call centers.
2. Variance, standard deviation: identifying customer satisfaction in online shopping.
3. Central limit theorem: load balancing in distributed systems and internet traffic prediction.
4. Sample mean: estimating database query response times.

Unit 2: Applications
1. t-distribution, confidence interval: students’ performance analysis based on hours of study.
2. z-test: application form processing in a banking system.
3. Hypothesis testing: placement of randomly trained students into tier-I and tier-II companies.

Unit 3: Applications
1. Linear regression: stock market prediction.
2. Chi-Square Test: analyzing the association between vaccination and recovery of patients using COVID data.
3. Chi-Square Test and Test of Independence: analyzing the relationship between gender and preference for a product purchase.
4. Identifying Type 1 and Type 2 Errors in spam mail classification.

Unit 4: Applications
1. Minimizing a loss function in neural networks using batch gradient descent (Unconstrained Optimization).
2. Lagrange Multipliers to find local maxima and minima of a function subject to equality constraints (Constrained Optimization).
3. Case study on Bayesian Optimization with Discrete Variables (Discrete Variable Optimization).
4. Use Genetic Algorithms to optimize Production Scheduling in a manufacturing environment, focusing on minimizing total production costs while meeting job deadlines and machine constraints. Evaluate the GA’s effectiveness against traditional scheduling methods.

Tools and Textbooks
Tools / Languages / Libraries: Jupyter Notebook, Python, Pandas, Matplotlib, Scipy, Seaborn, BeautifulSoup, Numpy, Scikit learn.
Text Book(s):
1. “Statistics for Engineers and Scientists”, William Navidi, McGraw Hill Education, India, 4th Edition, 2015.
2. “Optimization Methods for Engineering Design”, Parkinson, A.R., Balling, R., and J.D. Hedengren, Second Edition, Brigham Young University, 2018.

Evaluation Policy
ISA Components | Conduction | Reduced to
ISA 1 | 40 | 20
ISA 2 | 40 | 20
Assignment | Coding-5M, Datathon-20 | 10
ESA | 100 | 50

Assignment Components
1. Submission of the hands-on session code = 5 Marks
2. Datathon = 5 Marks
Total = 10 Marks

Note
1. Code and solutions for hands-on sessions are expected to be submitted on the same day the sessions are conducted.
2. The Datathon will be conducted for 20 Marks and reduced to 5 Marks.

What is Data Science?
Have you ever wondered how YouTube recommends videos of your liking? How Google’s autocomplete works? How Gmail filters your emails into spam and non-spam categories? These are some of the simplest applications of Data Science. Such tasks would be impossible without the availability of data. Thus, in simple words, Data Science is all about using data to solve problems.
Source: https://coralogix.com/blog/elasticsearch-autocomplete-with-search-as-you-type/

Data Science is an interdisciplinary field. It is focused on extracting knowledge and insights from data. Those insights are then applied to solve problems across a wide range of domains. It incorporates skills from Statistics, Computer Science, Mathematics, Business etc.
Source: theblog.adobe.com

Applications of Data Science
Source: edureka.com

Applications of Data Science: Data Science in Airlines Industry
Data Science is used for various purposes such as route planning, revenue management, and prediction of in-flight sales and food supplies.
Sources: Simplilearn, datasciencecentral.com

Applications of Data Science: Data Science in Logistics Industries
Source: Simplilearn
Logistics is a sector where data scientists can make a significant impact in several areas such as:
➔ waste reduction
➔ optimizing delivery routes (which can translate into lower delivery costs)
➔ selecting carriers that deploy best practices in mitigating the effects of CO2 emissions
➔ ensuring that hazardous materials are handled with the utmost care
➔ forecasting the supply and demand cycles

Applications of Data Science: Data Science in Recommender systems
Source: https://www.martechadvisor.com/articles/customer-experience-2/recommendation-engines-how-amazon-and-netflix-are-winning-the-personalization-battle/
Amazon has a huge bank of data on online consumer purchasing behaviour. The data includes:
➔ purchased shopping cart
➔ items added to carts but abandoned
➔ wish lists
➔ dwell time
➔ referral sites
➔ customers’ demographic information
➔ number of times an item was viewed before final purchase
➔ click paths in session, pricing experiments online etc.
Using this data, it can find the hidden factors and patterns needed to generate the “Recommended for You” section, which helps to create a personalized shopping experience for every customer.
Source: https://medium.com/swlh/recommendations-in-time-context-93b32f73d98d
Netflix has set up 1300 recommendation clusters based on users' viewing preferences. Netflix’s personalized recommendation algorithms produce $1 billion a year in value from customer retention and account for 80% of its total views. Some of the user information that Netflix captures to help in recommendation includes:
➔ Viewer interactions with Netflix services, like viewer ratings, viewing history, etc.
➔ Information about each movie: categories, year of release, title, genres etc.
➔ Other viewers with similar watching preferences.
➔ The length of time a viewer watches a show.
➔ The device on which a viewer is watching.
➔ The time of day a viewer watches.

Applications of Data Science: Data Science in Weather Forecasting
Source: phys.org/news
Weather forecasts are made by collecting quantitative data about the current state of the atmosphere at a given place and using meteorology to project how the atmosphere will change.
So in general, weather forecasting is driven by data about the atmosphere. A wide variety of devices and technologies gather information about the weather, such as thermometers, barometers, anemometers, weather balloons, radar systems and satellites. Various weather models analyse and try to make sense of all the incoming information to accurately predict the weather.

Applications of Data Science: Data Science in Sports
Source: https://arstechnica.com/information-technology/2015/10/big-data-an-it-buzzword-that-is-actually-producing-results/
Players, team managers, coaches and fans rely on sports analytics before making decisions or developing strategies to win games. Sports data analysts spend their time collecting on-field and off-field data from a variety of sources and then analyzing and interpreting that data, looking for meaningful insights. The main objective of sports analysis is to improve team performance and enhance the chances of winning the game.
Major teams and their analytics partners: (i) Real Madrid and Microsoft (ii) Manchester United and Aon

Moneyball, an American biographical film, recounts the attempts of a baseball team’s general manager to assemble a competitive team using sports analytics. He used sabermetrics to evaluate his potential roster by performing data mining on hundreds of individual baseball players, identifying statistics that were highly predictive of how many runs a player would score.
Source: https://en.wikipedia.org/wiki/Moneyball_(film)
Source: https://fivethirtyeight.com/features/billion-dollar-billy-beane/

Applications of Data Science: Data Science in Politics
Political parties and their strategists have realized the importance of mining real-time demographic and polling data. The various data points may include voter sentiment, mass emotions, citizen concerns in different constituencies, popular outlooks in various states, etc. Political parties can use these insights to:
➔ pull voter donations
➔ convert undecided voters
➔ enroll young volunteers
➔ organize resources
➔ run social media campaigns
➔ improve effectiveness of electioneering activities etc.
Sources: https://www.datacouncil.ai/talks/how-data-is-transforming-politics, https://projects.fivethirtyeight.com/polls/generic-ballot/2024/

Political strategists and digital analysts can deploy modern software analytics to create detailed maps of voting patterns. Data analytics can help these campaigners to paint a vivid picture of political winds, party supporters, and trenchant opponents in every demographic region. This demographic data and other information can be used in campaign-spending management. It can help determine whether a voter would be most receptive to a phone call, a flyer or mailer, an in-person visit, or some other form of campaigning. By using data in this way, campaigns can avoid wasting money on ineffective or unnecessary advertising, and have a better chance of reaching someone who is receptive.
Source: Historical U.S. Presidential Elections 1789-2020 - 270toWin
Source: 270toWin - 2024 Presidential Election Interactive Map

Applications of Data Science: Data Science in Healthcare & Medicine
Source: http://www.primeclasses.in/blog/2019/08/26/the-need-for-data-science-in-healthcare-industry/
Several fields in healthcare, such as medical imaging, drug discovery, genetics and predictive diagnosis, make use of data science. Hospitals analyse medical data and patient records to predict which patients are likely to seek readmission within a few months of discharge.
Omada Health is a digital medical company that uses smart devices to create customized behavioral plans and online training to help prevent chronic health conditions, such as diabetes, high blood pressure, and high cholesterol.
On the mental health side, a new Canadian start-up, Awake Labs, tracks data on children with autism and informs parents before a meltdown.
Source: https://allofus.nih.gov

Applications of Data Science: Data Science in predicting people’s opinions
Source: Simplilearn

What is Data?
Technically, data refers to individual facts, statistics, or items of information, often numeric, that are collected through observation.
Source: https://www.twinkl.de/teaching-wiki/data

Data vs Information
➔ Data
Raw facts, usually formatted in a special way.
Based on records, observations etc.
Unorganized.
➔ Information
A collection of facts organized in such a way that they have additional value beyond the value of the facts themselves.
Based on analysis of data.
Organized and always depends on data.
Ex: Data – thermometer readings of temperature taken every hour: (16.0, 17.0, 16.0, 18.5, 17.0, 15.5, ….)
[on transformation]
Information – today’s high: 18.5, today’s low: 15.5
Source: https://effectualsystems.com/data-need-information/

Types of Data
Data | Represented by
Alphanumeric data | Numbers, letters, and other characters
Image data | Graphic images or pictures
Audio data | Sound, noise, tones
Video data | Moving images or pictures

Structured, Unstructured & Semi-structured Data
Source: https://towardsdatascience.com/data-extraction-from-a-pdf-table-with-semi-structured-layout-ef694f3f8ff1
Source: slidegeeks.com
Structured Data: Data whose elements are addressable for effective analysis. The data is organized into a formatted repository, typically a database. Ex: Relational data.
Semi-Structured Data: Data that doesn’t reside in a relational database but has some organizational properties that make it easier to analyse. Ex: XML data.
Unstructured Data: Data that is not organized in a predefined manner or doesn’t have a predefined data model, and is therefore not a good fit for a mainstream relational database. Ex: Word, pdf, text etc.
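The three categories above can be illustrated with a minimal Python sketch. The student records, field names and sentence below are hypothetical and invented purely for illustration; they do not come from the course material. The sketch only uses the standard library and shows how the same marks are retrieved by column name (structured), by tag (semi-structured), and by pattern matching (unstructured).

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical records, invented for illustration only.

# Structured: fixed columns, so every element is addressable by row and column name.
structured = "student_id,name,marks\n1,Asha,82\n2,Ravi,74\n"
rows = list(csv.DictReader(io.StringIO(structured)))
structured_marks = [int(row["marks"]) for row in rows]

# Semi-structured: no relational table, but tags give the data some organization.
# Records need not share the same fields (the second student has no <marks>).
semi_structured = (
    "<students>"
    "<student id='1'><name>Asha</name><marks>82</marks></student>"
    "<student id='2'><name>Ravi</name></student>"
    "</students>"
)
root = ET.fromstring(semi_structured)
semi_marks = [
    int(student.findtext("marks")) if student.find("marks") is not None else None
    for student in root.iter("student")
]

# Unstructured: free text with no data model; values must be extracted heuristically.
unstructured = "Asha scored 82 marks in the test, while Ravi managed 74."
text_marks = [int(number) for number in re.findall(r"\d+", unstructured)]

print(structured_marks, semi_marks, text_marks)
```

The further the data is from a fixed schema, the more assumptions the extraction step has to make: the regular expression here would also pick up any unrelated number in the sentence.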
Structured, Unstructured & Semi-structured Data
Source: https://www.slidegeeks.com/pics/dgm/l/f/Forms_Type_Of_Big_Data_Ppt_PowerPoint_Presentation_Infographic_Template_Slide_1-.jpg

Data Information
Source: guru99.com

Information Concepts
Source: https://learningforsustainability.net

Science
The word science comes from the Latin word scientia, meaning knowledge.
Science is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe.

Why do we need Data Science?
Source: https://static.seekingalpha.com/uploads/2020/1/14/50485001-15789998083991578_origin.png
The main reason we need data science is the ability to process and interpret data. This enables users and industries to make informed decisions and helps in their growth, optimization, and performance.
Unstructured data is generated everywhere, every second. It isn't well organized or easy to access, but its growth is enormous, and analyzing and drawing inferences from this type of data is crucial. Data Science provides a number of methods and techniques to deal with such data, which helps many businesses and industries to improve their productivity significantly.

How is Data generated?
Huge amounts of data are generated each day. Some of the major sources from which data is generated are: web, databases, media, IoT, cloud etc.
Insight into data generation in a day over the internet:
➔ 500 million tweets are sent
➔ 294 billion emails are sent
➔ 4 petabytes of data are created on Facebook
➔ 4 terabytes of data are created from each connected car
➔ 65 billion messages are sent on WhatsApp
➔ 5 billion searches are made
Slide courtesy: Dr. Uma

By 2025, it’s estimated that 463 exabytes of data will be created each day globally – that’s the equivalent of 212,765,957 DVDs per day!
Source: theblog.adobe.com
Slide courtesy: Dr. Uma

Data generation
In 2014, Oscars host Ellen DeGeneres’ “celeb selfie” tweet was viewed 26 million times across the Web during a 12-hour period.
More than one billion hours of TV shows and movies are streamed from Netflix per month.
Walmart handles more than 1 million customer transactions every hour, feeding databases estimated at more than 2.5 petabytes (the equivalent of 167 times the books in America's Library of Congress).
Facebook is home to 40 billion photos.
Source: https://twitter.com/theellenshow
Source: https://trak.in/tags/business/2014/04/15/digital-data-universe-expansion-2020/

Growth in Data generation
The total amount of data created, captured, copied and consumed globally has been increasing exponentially. In 2020, the amount of data created and replicated was higher than expected because of increased demand during the pandemic. By 2025, global data creation is projected to grow to more than 180 zettabytes.

How much of the data is put to use?
Source: IDC, 2014
Though a huge amount of data is generated each day, it serves no purpose if it is left unused. This can lead to information overload, where there is an overabundance of information that is not put to work because of a lack of time, resources or understanding, because the information is irrelevant, or for other reasons. Thus, it is important to understand the data and know how to use it in the right manner.

No one knows how to use it
Source: https://image.slidesharecdn.com/instroductiontodatascience-160420090623/95/introduction-to-data-science-38-638.jpg?cb=1461307670

But is data all we need?
The graph below shows a strong correlation between ‘Age of Miss America’ and ‘Murders by steam, hot vapour and hot objects’, which suggests a cause-and-effect relationship that is plainly implausible. Thus, the presence of an interesting pattern need not imply that it is meaningful. Blindly applying various processes and techniques to data can result in incorrect inferences.
Source: https://i2.wp.com/boingboing.net/wp-content/uploads/2016/02/chart.jpg?fit=800%2C315&ssl=1
The following work highlights the risk of amplifying and reinforcing biases present in the data by blindly applying machine learning to it.
Source: https://arxiv.org/abs/1607.06520

Learn how to use data
The above examples show that we need to learn how to use and handle the available data in the right manner in order to arrive at correct results and draw meaningful inferences.
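The spurious-pattern pitfall described above can be reproduced with a small Python sketch. The two series below are synthetic and hypothetical (they are not the Miss America or murder figures from the chart); each is built independently from the year index alone, so neither can cause the other. The helper `pearson` is written out by hand rather than taken from a library.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Two synthetic series (hypothetical data): each is an upward trend over 20
# "years" plus its own unrelated wobble. Neither is computed from the other.
years = range(20)
series_a = [10 + 1.0 * t + (1 if t % 2 else -1) for t in years]
series_b = [5 + 0.5 * t + (0.5 if t % 3 == 0 else -0.25) for t in years]

# Correlating the raw values: the shared trend alone yields a value close to 1.
r_raw = pearson(series_a, series_b)

# Correlating the year-to-year changes removes the shared trend, and the
# apparent relationship largely disappears.
changes_a = [b - a for a, b in zip(series_a, series_a[1:])]
changes_b = [b - a for a, b in zip(series_b, series_b[1:])]
r_changes = pearson(changes_a, changes_b)

print(f"correlation of raw values: {r_raw:.2f}")
print(f"correlation of changes:    {r_changes:.2f}")
```

Comparing changes instead of raw values is only one possible sanity check, chosen here as an assumption for illustration; the broader point is that a high correlation by itself says nothing about cause and effect.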
➔ Explore: identify patterns
➔ Predict: make informed guesses
➔ Infer: quantify what you know
Source: slidesharecdn.com

Data Science project life cycle
The correct process of using available data is shown in this life cycle. It outlines the major stages in a data science project.
Source: https://static.javatpoint.com/tutorial/data-science/images/data-science-lifecycle.png
Source: https://res.cloudinary.com/practicaldev

Data Scientist
Data Scientists, in simple words, are those who make sense of all the data that is available and figure out what can be done with it.
Source: proschoolonline.com
Source: edureka!

What does a Data Scientist do?
They are responsible for collecting, analyzing, modelling and interpreting large amounts of data. Their role combines Computer Science, Mathematics, Statistics etc.
Source: https://edvancer.in/wp-content/uploads/2015/11/76c99311fc4be19bf4353cfc3c2e94b2.png
Source: medium.com
Slide courtesy: Dr. Uma

Prerequisites for a Data Scientist
➔ Curiosity
➔ Common sense
➔ Communication skills
Sources: quickanddirtytips.com, dreamstime.com, linkedin.com
Source: data-flair.training
Slide courtesy: Dr. Uma

Demand for Data Scientist
Data Science is a growing field. It is a popular and lucrative profession. Glassdoor has ranked this profession at #3 in 2022 despite the occurrence of the pandemic.
Sources: Glassdoor, Forbes
Source: https://cdn.ttgtmedia.com/rms/onlineimages/business_analytics-data_scientist_01_mobile.png

How is it different from what Statisticians have been doing?
Both Statisticians and Data Scientists work closely with data. Statisticians use mathematical equations and statistical models to analyze data and arrive at conclusions. Data Scientists, however, focus on delivering actionable results and sometimes need to deploy the model to a production system.
Source: https://scientistcafe.com/ids/images/softskill1.png

Data Science vs Data Analysis
Data Science is primarily used to make decisions and predictions, making use of predictive causal analytics, prescriptive analytics (predictive plus decision science) and machine learning.
Data Analysis includes descriptive analytics and, to a certain extent, prediction.
Source: https://d1jnx9ba8s6j9r.cloudfront.net/blog/wp-content/uploads/2017/01/Data-Analyst-vs-Data-Science-1-422x300.png
Source: edureka!
Common tasks in Data Science
Source: Simplilearn
Source: https://static.javatpoint.com/tutorial/data-science/images/how-to-solve-a-problem-in-data-science.png

Unit 1: Population & Sampling
Mamatha H R
Department of Computer Science and Engineering

Topics to be covered
❖ Statistical Analysis
❖ Population
❖ Sample
❖ Sampling
❖ Types of Population

Problems to be solved
Suppose you are interested in finding:
➔ The mean height of all male students of all the universities in India. OR
➔ The average marks of all female students of PES University. OR
➔ The relationship between the time a student spends studying and the grades they get. OR
➔ The impact of a rise in the number of student assignments on their grades.
Slide courtesy: Dr. Uma

What is Statistical Analysis?
It’s the science of collecting, exploring and presenting large amounts of data to discover underlying patterns and trends. Statistics are applied every day, in research, industry and government, to make decisions on a more scientific basis.
The basic idea behind all statistical methods of data analysis is to make inferences about a population by studying a relatively small sample chosen from it.
Source: media3.giphy.com

Population
A population is the entire collection of objects or outcomes about which information is sought.
As mentioned, statistical methods are based on the idea of analyzing a sample drawn from a population. For this idea to work, it is important to identify the population and to choose the sample in an appropriate manner.
In research, a population doesn’t always refer to people.
It can mean a group containing elements of anything you want to study, such as objects, events, organizations, countries, species, organisms, etc.
Source: keydifferences.com

Sample
A sample is a subset of a population, containing the objects or outcomes that are actually observed.
Sample size: The number of items in a sample is called the sample size. The size of the sample is always less than the total size of the population.
The process of taking a predetermined number of observations from a larger population is called sampling.
Sources: i.gifer.com, keydifferences.com

Population vs Sample
Population | Sample
The population is a complete set. | The sample is a subset of the population.
A population is hard to define and observe in real life. | A sample is much easier to contact and observe.
It is time consuming and costly to study a population. | It is relatively less time consuming and low cost to study a sample.
A population contains all members of a specified group. | A sample is a subset that represents the entire population.
Reports on a population are a true representation of opinion. | Reports on a sample have a margin of error.

Population & Sample examples
Population | Sample
All countries of the world | Countries with published data available on birth rates and GDP since 2000
Songs from the Eurovision Song Contest | Winning songs from the Eurovision Song Contest that were performed in English
Undergraduate students in the Netherlands | 300 undergraduate students from three Dutch universities who volunteer for your psychology research study
Advertisements for IT jobs in the Netherlands | The top 50 search results for advertisements for IT jobs in the Netherlands on May 1, 2020

Populations & Samples
In a recent survey, 250 college students at Union College were asked if they smoked cigarettes regularly.
35 of the students said yes. Identify the population and the sample. Responses of all students at Union College (population) Responses of 250 students in survey (sample) MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Populations & Samples A city council member wanted to know how her constituents felt about a planned rezoning. She randomly selected 75 names from the city phone directory and conducted a phone survey. Identify the population and sample in this setting. Answer: The population is everyone listed in the city phone directory The sample is the 75 people selected to conduct a phone survey. MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS What is Sampling? The process of selecting observations(a sample) in order to make an inference that can be generalized to the population. What you What you want to actually talk observe in about the data INFERENCE Source Image : aprendeconalf.es Slide courtesy: Dr.Uma MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Sampling The methodology used to sample from a larger population depends on the type of analysis being performed. The population All of the individuals of interest The results The sample from the sample are Selected from the generalized to the population population The sample The individuals selected to participate in the research study MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Sampling Sampling Population Sample Use statistics to summarize features Use parameters to summarize features Inference on the population from the sample MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Why sampling? We know that resources such as time, money and people are limited. When the population is large in size, geographically dispersed, or difficult to contact, it’s necessary to use a sample. Thus, most projects aim to gather data from a sample, rather than from the entire population. Some reasons for sampling are: Necessity: Sometimes it’s simply not possible to study the whole population due to its size or inaccessibility. 
Practicality: It is easier and more efficient to collect data from a sample.
Cost-effectiveness: There are fewer participant, laboratory, equipment, and researcher costs involved.
Manageability: Storing and running statistical analyses on smaller datasets is easier and more reliable.
Saves time: Because the sample is relatively small, data collection is faster.

Characteristics of a sample
A sample must be representative of the population.
It must be appropriately sized, i.e. it must be sufficiently large to represent the population and provide statistical stability or reliability.
It must be unbiased. It should contain all types of groups/units present in the population in fair proportions.
It must be selected at random. This means that any item in the group has an equal chance of being selected and included in the sample.
It must be economical. The objectives of the survey must be achieved with as little cost and effort as possible.
It must be goal-oriented. It must be oriented to the research objectives and fitted to the survey conditions.
Slide courtesy: Dr. Uma

Is it a good sample?
Study: Survey of the job prospects of the students studying in a university.
Sample: Surveying the students who are in the canteen.
Slide courtesy: Dr. Uma

This is not an example of a good sample because:
The students in the canteen are not completely representative of the students studying in the university.
The size of the sample (i.e. the number of students in the canteen) might not be sufficient to represent the population (students studying in the university).
The sample is not selected at random, as each student studying in the university does not have an equal chance of being selected.

Is it a good sample?
➔ Study: To measure teenage use of illegal drugs in a city.
Sample: All high school students in the city.
This type of sampling results in a biased sample, as it does not include home-schooled students or dropouts.

➔ Study: To calculate the average number of hours a person spends exercising.
Sample: "Man on the street" interview, which selects people who walk by a certain location.
This type of sampling results in an overrepresentation of healthy individuals, who are more likely to be out of the home than individuals with a chronic illness. This may be an extreme form of biased sampling, because certain members of the population are totally excluded from the sample (that is, they have zero probability of being selected).

➔ Study: A test of the effectiveness of a newly introduced high school curriculum.
Sample: Dividing an area by school district, then choosing a school or a set number of schools at random and sampling students from each school.
This type of sampling results in an unbiased sample, as each school district in the area is represented in the sample. Also, each school has an equal chance of being chosen.

➔ Study: Conduct observations to ensure that employees are following best practices in the company.
Sample: Each employee is assigned a random number using computer software. The same software is used periodically to choose a number of employees, who are then observed.
This is a good sample, as each employee has an equal chance of being selected.

Types of population
1. Tangible or concrete population
2.
Conceptual population

Tangible population
Populations where the members are physical objects, such as cars, bolts, apples, etc., are called tangible or concrete populations.
Such populations are always finite, so studying them involves counting.
After an item is sampled, the population size decreases by 1.
In principle, one could in some cases return the sampled item to the population, with a chance to sample it again, but this is rarely done in practice.
Source: https://www.hindivarta.com/jansankhya-ki-samasya-aur-samadhan-par-nibandh/
Slide courtesy: Dr. Uma

Conceptual population
Populations that do not consist of physical or actual objects are called conceptual populations.
Conceptual populations are mostly the result of a measurement, i.e. of measuring something multiple times. Ex: the length of a metal rod.
Such a population is not a well-defined group: not all of its elements are available at the time the sample is collected, because the population keeps growing.
The size of a conceptual population is usually large. Ex: the population for a measuring scale can be all the possible readings it can give, i.e. it is infinite. The measured values can be thought of as a sample from this infinite population.
Slide courtesy: Dr. Uma

Tangible & Conceptual population examples
Define the population, and state whether it is tangible or conceptual.
A shipment of bolts is received from a vendor. To check whether the shipment is acceptable with regard to shear strength, an engineer reaches into the container and selects 10 bolts, one by one, to test.
Ans: All the bolts in the shipment: Tangible population
The resistance of a certain resistor is measured 5 times with the same ohmmeter.
Ans: All measurements that could be made on that resistor with that ohmmeter: Conceptual population

Tangible & Conceptual population examples
Define the population, and state whether it is tangible or conceptual.
A geologist weighs a rock several times on a sensitive scale.
Ans: All the readings that the scale could produce: Conceptual population
A pollster samples 1000 registered voters in a certain state and asks them which candidate they support for governor.
Ans: All registered voters in that state: Tangible population

Tangible & Conceptual population examples
Define the population, and state whether it is tangible or conceptual.
A quality engineer needs to estimate the percentage of bolts manufactured on a certain day that meet a strength specification. At 3:00 in the afternoon he samples the last 100 bolts to be manufactured.
Ans: All bolts manufactured on that day: Tangible population
In a clinical trial to test a new drug that is designed to lower cholesterol, 100 people with high cholesterol levels are recruited to try the new drug.
Ans: All people with high cholesterol levels: Tangible population

Target and Study population
Target or theoretical population refers to the entire group of individuals or objects to which researchers are interested in generalizing the conclusions. It must meet a set of criteria of interest to the researchers.
Study population or accessible population is the population to which the researchers can actually apply their conclusions. It is a subset of the target population. It may be limited to a region, state, city, county, or institution.
Figure labels: Target population, Study population, Sample.
Slide courtesy: Dr. Uma

Target and Study population examples
Target Population | Study Population
All institutionalized elderly with Alzheimer's | All institutionalized elderly with Alzheimer's in St. Louis county nursing homes
All people with AIDS | All people with AIDS in the metropolitan St. Louis area
All low birth weight infants | All low birth weight infants admitted to the neonatal ICUs in St. Louis city & county
All school-age children with asthma | All school-age children with asthma treated in pediatric asthma clinics in university-affiliated medical centers in the Midwest

Terminologies related to Sampling
Target or Theoretical Population: The population to which the investigator wants to generalize the results.
Sampling Frame: The list from which the potential respondents are drawn. Ex: list of universities, list of students, list of airline companies, telephone directory.
Sampling Unit: The smallest unit from which a sample can be selected.
Sampling Scheme: The method of selecting sampling units from the sampling frame.
Sample: All selected respondents form a sample.
Slide courtesy: Dr. Uma

Sampling Breakdown
Source: https://image.slidesharecdn.com/qrmtheory-180918191951/95/how-to-do-sampling-8-638.jpg?cb=1537298482

Sampling Breakdown
Study: Find the mean weight of all students of all universities in India.
To whom do you want to generalize the results? All universities in India ➔ Target or Theoretical population
What population can you get access to? All universities in Karnataka ➔ Study population
How can you get access to them? List of universities in Karnataka ➔ Sampling frame
Who is in your study?
Two universities from Karnataka ➔ Sample
Slide courtesy: Dr. Uma

Unit 1: Sampling Methods
Mamatha H R, Department of Computer Science and Engineering

Topics to be covered
❖ Sampling methods
❖ Sampling process
❖ Probability and Non-probability sampling
❖ Advantages and disadvantages of different sampling methods

What are Sampling methods?
In a statistical study, sampling methods refer to how we select members from the population to be included in the study.
The selected sample must be representative of the population.
If a sample isn't randomly selected, it will probably be biased in some way and the data may not be representative of the population.
There are many ways to select a sample, some good and some bad.
Sources: blog.masterofproject.com, analytics-magazine.org
Slide courtesy: Dr. Uma

Sampling process
1. Define the target population (population of concern)
2. Specify the sampling frame
3. Specify the sampling method
4. Determine the sample size
5. Implement the sampling plan
6. Sampling and data collecting
7. Reviewing the sampling process

Sampling
➔ Factors that influence sample representativeness:
Sampling procedure
Sample size
Participation (response)
➔ When might you sample the entire population?
When your population is very small
When you have extensive resources
When you don't expect a very high response
Source: thumbs.dreamstime.com

Sampling Frame
The sampling frame is the list of items or events from which the potential respondents are drawn or which are possible to measure.
Sometimes, it is possible to identify and measure every single item in the population and to include any one of them in our sample.
However, in the more general case this is not possible. There is no way to identify all rats in the set of all rats.
As a remedy, we seek a sampling frame which has the property that we can identify every single element and include any of them in our sample.
The sampling frame must be representative of the population.

Representative & Biased Sample
Figure: a population with two samples. Sample 1 is representative of the population; Sample 2 is a biased sample.

Types of Sampling methods
Probability Samples: Simple Random, Stratified, Cluster, Systematic
Non-Probability Samples: Judgement, Snowball, Convenience, Quota

Probability Sampling
Probability sampling is a type of sampling in which every unit in the population has a chance (a probability greater than zero) of being selected in the sample, and this probability can be accurately determined.
This type of sampling decreases bias and sampling error in the selection process.
When every element in the population has the same probability of selection, this is known as an 'equal probability of selection' (EPS) design. Such designs are also referred to as 'self-weighting' because all sampled units are given the same weight.
Source: www.mathstopia.net
Slide courtesy: Dr. Uma

Non-Probability Sampling
Non-probability sampling is a type of sampling in which not every unit in the population has a chance (a probability greater than zero) of being selected in the sample.
Here, some elements of the population have no chance of selection (these are sometimes referred to as 'out of coverage' or 'undercovered'), or the probability of selection can't be accurately determined.
It involves the selection of elements based on assumptions regarding the population of interest, which form the criteria for selection.
The selection of elements is non-random. Thus, non-probability sampling does not allow the estimation of sampling errors.
It is more likely to produce a biased sample and restricts generalization.
It is not an appropriate data collection method for most statistical analyses.
Slide courtesy: Dr. Uma

Probability Sampling
Subjects of the sample are chosen based on known probabilities.
Probability Samples: Simple Random, Systematic, Stratified, Cluster

Simple Random Sampling
Simple random sampling, as the name suggests, is an entirely random method of selecting the sample.
Here, each subject or unit in the population has an equal chance of being selected.
The sampling frame should include the whole population.
A table of random numbers or a lottery system is used to determine which units are to be selected.
Simple random sampling is always an EPS design, but not all EPS designs are simple random sampling.
Source: datasciencemadesimple.com

Simple Random Sampling
Figure (source: questionpro.com)

Simple Random Sampling
Purpose: It is random and thus tends to produce a representative sample.
When to Use: Best used when the population is small, as it then produces a more representative sample.
Key Aspect: Each member of the population has an equal probability of being selected.
General Procedure: Assign numbers to all members of the population and select randomly.
○ For a small population: a manual lottery method can be used for selection.
○ For a larger population: system-generated numbers can be used to select elements from the population.
Slide courtesy: Dr. Uma

Simple Random Sampling Examples
At a birthday party, teams for a game are chosen by putting everyone's name into a jar, and then choosing the names at random for each team.
A restaurant leaves a fishbowl on the counter for diners to drop their business cards. Once a month, a business card is pulled out to award one lucky diner with a free meal.
All students in the Computer Science department are assigned numbers, and the students holding 100 randomly chosen numbers are selected to attend a webinar.
Sources: c8.alamy.com, wordwall.net

Simple Random Sampling Examples
Here, each of the 20 coins has an equal probability of being selected.
Source: analyticsvidhya.com

Simple Random Sampling Examples
Calculating the probability of each coin being selected, expressed as a percentage:
Probability = (n/N) × 100%
Total population size (N) = 20
Sample size (n) = 5
Probability = (5/20) × 100% = 25%
Thus each coin has a 25% probability of being selected.
Source: analyticsvidhya.com

Simple Random Sampling Examples
In a company consisting of 10,000 employees, 25 employees are selected to survey the average number of hours a day they are present in the office.
Sampling frame: list of all employees numbered from 1 to 10,000.
Sample: 25 employees selected using a random number table.
Probability of selection of each employee: N = 10,000; n = 25
Probability = (25/10,000) × 100% = 0.25%
Source: 5found.com

Simple Random Sampling: Advantages
➔ Advantages:
This method is simple to use.
Estimates are easy to calculate.
Random samples are usually fairly representative since they don't favor certain members of the population.
Low sampling error.
It needs only minimal advance knowledge of the study population.

Simple Random Sampling: Disadvantages
➔ Disadvantages:
If the sampling frame is large, this method is impracticable.
Minority subgroups of interest in the population may not be present in the sample in sufficient numbers for study.
This type of sampling is poorly suited to populations whose units are heterogeneous in nature.
Sometimes, it is difficult to have a completely cataloged universe.
This method makes no use of available knowledge concerning the population.

Simple Random Sampling with replacement
This is a sampling procedure in which each sampling unit randomly selected from the population is measured or recorded and then returned to the population. Thus, a sampling unit may be sampled multiple times.
With a population of 10 marbles, when sampling the first marble, each marble has the same chance of 0.1 of being sampled.
When sampling the second marble and all the subsequent marbles, each marble still has a 0.1 chance of being sampled.
Each time we sample a unit, all units have the same chance of being sampled.
Source: www.spss-tutorials.com

Simple Random Sampling without replacement
This is a sampling procedure in which sampling units are selected from a population without replacement, such that every remaining unit has an equal probability of being selected at each draw. No element can be selected more than once in the same sample.
For the first marble sampled, each marble has a 0.1 chance of being sampled.
However, the first unit we sampled has a zero chance of being sampled again. Thus, the other 9 units each have a chance of 1 in 9 ≈ 0.11 of being sampled as the second unit.
Source: www.spss-tutorials.com

Systematic Sampling
Systematic sampling relies on arranging the target population according to some ordering scheme and then selecting elements at regular intervals through that ordered list.
The first element is selected randomly. Selection then proceeds with every kth element, where k is the size of the selection interval:
k = (population size / sample size)
It is important that the starting point is not automatically the first in the list, but is instead randomly chosen from within the first to the kth element in the list.
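The selection rule just described (random start within the first k elements, then every kth element) can be sketched in Python as follows. This is a minimal illustration, not part of the original slides: the function name `systematic_sample` is hypothetical, and the sketch assumes k is taken as the integer part of N/n, with any surplus elements dropped when N is not a multiple of n.

```python
import random


def systematic_sample(items, n, rng=None):
    """Return a systematic sample of n items from an ordered list.

    Hypothetical helper for illustration only.
    """
    rng = rng or random.Random()
    N = len(items)
    if not 0 < n <= N:
        raise ValueError("n must be between 1 and the population size")
    k = N // n                # selection interval, k = N / n (integer part, an assumption)
    start = rng.randrange(k)  # random 0-based start within the first k elements
    return items[start::k][:n]  # every kth element from the start, at most n of them


# Example: 20 numbered coins, sample of 5, so k = 4.
coins = list(range(1, 21))
print(systematic_sample(coins, 5))
```

Passing a seeded `random.Random` instance as `rng` makes the random starting point reproducible, which is convenient when checking results by hand.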
A simple example would be to select every 10th name from the telephone directory (an 'every 10th' sample, also referred to as 'sampling with a skip of 10').

Systematic Sampling
Figure (source: questionpro.com)

Systematic Sampling
Systematic sampling is an equal-probability sampling method, as all elements have the same probability of selection, 1/k (in the example given, with an interval of 3, one in three).
It is not 'simple random sampling', because different subsets of the same size have different selection probabilities.
Ex: the set {2,5,8,11} has a one-in-three probability of selection, but the set {1,3,6,7} has zero probability of selection.
Source: www.netquest.com

Systematic Sampling
When to Use: When the project budget is tight and there is little time to complete the study.
Key Aspect: Find the interval k and select every kth member, where k = N/n.
General Procedure:
○ Assign numbers to each population element.
○ Arrange the population elements in an ordered sequence.
○ Find k, the size of the selection interval.
○ Select the first sample element randomly from the first k population elements.
○ Thereafter, select the sample elements at a constant interval, k, from the ordered sequence.
Slide courtesy: Dr. Uma

Systematic Sampling Examples
From a classroom consisting of 64 students, the teacher wants to select 8 students to check their assignments.
Population size = N = 64
Sample size = n = 8
Size of selection interval = k = N/n = 64/8 = 8
Randomly select the first student, then select every subsequent 8th student.

Systematic Sampling Examples
Purchase orders for the previous fiscal year are serialized 1 to 10,000. A sample of fifty purchase orders is needed for an audit.
N = 10,000
n = 50
k = 10,000/50 = 200
First select an element randomly from the first 200 purchase orders.
Assume the 45th purchase order was selected.
Subsequent sample elements: 245 (45 + 200), 445 (245 + 200), 645 (445 + 200), ...

Systematic Sampling Examples
Given a set of 20 coins, 5 coins must be selected from the population.
N = 20; n = 5
k = N/n = 20/5 = 4
Randomly selecting the first element = 3 (suppose)
Subsequent coins are selected at an interval of 4 from the 3rd coin.
Sampled coins = {3, 3+4 = 7, 7+4 = 11, 11+4 = 15, 15+4 = 19}
Source: analyticsvidhya.com

Systematic Sampling: Advantages
The sample is easy to select.
A suitable sampling frame can be identified easily.
The sample spreads evenly over the entire reference population.
It is a cost-effective sampling method.
It guarantees that the entire population is evenly sampled.
Systematic sampling also carries a low risk factor because there is a low chance that the data can be contaminated.

Systematic Sampling: Disadvantages
This type of sampling might lead to bias if there is an underlying pattern or periodicity in the population which coincides with the selection interval.
Ex: If the HR database groups employees by team, and team members are listed in order of seniority, there is a risk that the interval might skip over people in junior roles, resulting in a sample that is skewed towards senior employees.
It is difficult to assess the precision of the estimate from one survey.
Not every possible sample has an equal chance of being selected: although each element is equally likely to be chosen, only the k samples fixed by the starting point can occur.
All the elements between two consecutive selected elements are ignored.
The size of the population is needed. Without knowing the specific number of participants in a population, systematic sampling does not work well.

Stratified Sampling
Stratified sampling is the type of sampling in which the population is divided into 2 or more groups called strata, based on a shared characteristic or trait.
Then simple random samples are selected from each group.
The 2 or more selected samples are combined into one.
The strata or groups don't overlap, but together they represent the entire population.
The shared characteristics on which the division is based could be gender, educational attainment, income, age, etc.
Source: datasciencemadesimple.com

Stratified Sampling
Each stratum is sampled as an independent sub-population.
Every unit in a stratum has the same chance of being selected.
Using the same sampling fraction for all strata ensures proportionate representation in the sample.
Adequate representation of minority subgroups of interest can be ensured by stratification and by varying the sampling fraction between strata as required.
Since each stratum is treated as an independent population, different sampling approaches can be applied to different strata.

Stratified Sampling
Figure (source: questionpro.com)

Stratified Sampling
Purpose: To obtain an unbiased random sample from a larger population.
When to Use: When the population proportions must be reflected in the sample.
Key Aspect: The sample proportions are the same as the population proportions; each stratum is homogeneous.
General Procedure:
○ Divide the population into strata or groups.
○ Criteria for division could be: gender, hair color, eye color, salary, designation, age, etc.
○ Selection of sample: simple random sampling is used to sample units from each stratum.

Stratified Sampling examples
Given 20 coins of different colours.
The population of coins is divided into 4 strata based on their colours.
Coins from each stratum are sampled using simple random sampling.
Source: analyticsvidhya.com

Stratified Sampling examples
To find out the most popular song among the FM radio listeners.
All listeners are stratified by age. Listeners from each age group are selected using simple random sampling and surveyed for their favourite song of the year.
Stratified by age:
20 - 30 years old (homogeneous within the stratum)
30 - 40 years old (homogeneous within the stratum)
40 - 50 years old (homogeneous within the stratum)
The strata are heterogeneous with respect to one another.

Stratified Sampling examples
A high school principal wants to conduct a survey to collect the opinions of students.
The students are grouped into 4 strata based on their grade.
Then, simple random samples of 50 students from each grade are selected to be included in the survey.
Source: statology.org

Stratified Sampling: Advantages
It enhances the representativeness of the sample.
It is easy to carry out.
It has higher statistical efficiency.
A stratified sample can provide higher precision than a simple random sample of the same size.
As it provides greater precision, this type of sampling often requires a smaller sample, which saves money.
Slide courtesy: Dr. Uma

Stratified Sampling: Disadvantages
A sampling frame has to be prepared separately for each stratum of the population.
When examining multiple criteria to divide the population, stratifying variables may be related to some but not to others, further complicating the design and potentially reducing the utility of the strata.
In some cases (such as designs with a large number of strata, or those with a specified minimum sample size per group), stratified sampling can potentially require a larger sample than other methods.
It is time consuming and expensive.
It can lead to classification errors.
Slide courtesy: Dr. Uma

Cluster Sampling
In cluster sampling, the population is divided into non-overlapping clusters or areas, similar to stratified sampling.
Each cluster is a miniature or microcosm of the population. Each cluster should have similar characteristics to the whole population.
Instead of sampling individuals from each subgroup as in stratified sampling, in cluster sampling entire clusters are randomly selected.
A subset of the clusters is selected randomly for the sample.
If the number of elements in the subset of clusters is larger than the desired value of n (the sample size), these clusters may be subdivided to form a new set of clusters and subjected to a random selection process.
Source: dataz4s.com
Slide courtesy: Dr. Uma

Cluster Sampling
Figure (source: www.netquest.com)
Slide courtesy: Dr. Uma

Cluster Sampling
When to Use: When the population is already broken up into groups (clusters).
Key Aspect: Heterogeneous members in each group.
General Procedure:
○ The population is divided into non-overlapping areas (clusters).
○ Each cluster is a miniature or microcosm of the population.
○ Clusters are selected randomly.
○ Either all elements of the selected clusters are included in the sample, or elements from the selected clusters are chosen using simple random sampling.
Slide courtesy: Dr. Uma

Cluster Sampling examples
Given a set of 20 coins of different colours.
The population is divided into 5 clusters, each having 4 coins.
A whole cluster is randomly selected to be included in the sample.
Source: analyticsvidhya.com

Cluster Sampling examples
An athletic organization wishes to find out which sports Grade 11 students are participating in across Canada.
It would be too costly and lengthy to survey every Canadian in Grade 11, or even a couple of students from every Grade 11 class in Canada.
Instead, each school with Grade 11 students is considered a cluster, and 100 schools are randomly selected from all over Canada.
These schools provide clusters of samples.
Then, every Grade 11 student in all 100 clusters is surveyed. In effect, the students in these clusters represent all Grade 11 students in Canada.
Source: s4be.cochrane.org

Cluster Sampling examples
The municipal council of a small city wants to investigate the use of health care services by residents.
The council first obtains electoral subdivision maps that identify and label each city block. From these maps, the council creates a list of all city blocks. This list will serve as the sampling frame.
Every household in that city belongs to a city block, and each city block represents a cluster of households.
The council randomly picks a number of city blocks.
Source: coronainsights.com

Cluster Sampling: Advantages
It is more convenient for geographically dispersed populations.
It can reduce the travel costs to contact sample elements.
It simplifies the administration of the survey.
It is more feasible. Dividing the entire population into groups that resemble one another increases the feasibility of the sampling. Since each cluster represents the entire population, more subjects can be included in the study.
It requires fewer resources. Since cluster sampling selects only certain groups from the entire population, the method requires fewer resources for the sampling process.

Cluster Sampling: Disadvantages
It is statistically less efficient when the elements within a cluster are similar.
The costs and the number of problems that occur can be greater than with simple random sampling.
There is higher sampling error.
The method is prone to biases. If the clusters representing the entire population were formed under a biased opinion, the inferences about the entire population would be biased as well.
It is difficult to guarantee that the sampled clusters are really representative of the whole population.
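The cluster procedure described above (divide the population into non-overlapping clusters, select some clusters at random, include every element of the selected clusters) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not code from the slides: the function name `cluster_sample` and the dictionary layout (cluster label mapped to a list of units) are hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, rng=None):
    """Pick n_clusters clusters at random and keep all of their units.

    `clusters` maps a cluster label to the list of units in that cluster
    (an assumed layout, chosen for illustration).
    """
    rng = rng or random.Random()
    # Select whole clusters, without replacement, with equal probability.
    chosen = rng.sample(sorted(clusters), n_clusters)
    # Every unit of each selected cluster enters the sample.
    sample = [unit for label in chosen for unit in clusters[label]]
    return chosen, sample


# Example in the spirit of the coin slide: 20 coins in 5 clusters of 4.
coin_clusters = {
    f"cluster_{i}": list(range(4 * i + 1, 4 * i + 5)) for i in range(5)
}
chosen, sample = cluster_sample(coin_clusters, 1)
print(chosen, sample)
```

To subsample within the selected clusters instead of keeping every unit, one option would be to apply `rng.sample` to each chosen cluster's list; that variation is left out to keep the sketch short.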

Cluster Sampling: Types
There are 2 types of cluster sampling methods.
One-stage sampling: All of the elements within the selected clusters are included in the sample.
Two-stage sampling: A subset of elements within the selected clusters is randomly selected for inclusion in the sample.

Cluster Sampling: One-stage cluster sampling
Here, the population is divided into clusters. Then, some of the clusters are randomly selected and all members from those clusters are included in the sample.
Source: statology.org

Cluster Sampling: Two-stage cluster sampling
As the name suggests, this method of sampling involves 2 stages.
Step 1: Split the population into clusters, then randomly select some of the clusters.
Step 2: Within each chosen cluster, randomly select some of the members to be included in the survey.
Source: statology.org

Difference between Strata and Clusters
Although strata and clusters are both non-overlapping subsets of the population, they differ in several ways.
All strata are represented in the sample, but only a subset of the clusters is in the sample.
With stratified sampling, the best survey results occur when elements within strata are internally homogeneous. With cluster sampling, however, the best results occur when elements within clusters are internally heterogeneous.
Source: miro.medium.com

Non-probability Sampling
Non-probability sampling is a type of sampling in which not every unit in the population has a chance (a probability greater than zero) of being selected in the sample.
Non-Probability Samples: Judgement, Snowball, Convenience, Quota

Convenience Sampling
It is sometimes also known as grab, opportunity, accidental, or haphazard sampling.
This is a type of non-probability sampling in which the sample is drawn from the part of the population that is close to hand, that is, readily available and convenient. Here, sample elements are selected for the convenience of the researcher. The researcher using such a sample cannot scientifically make generalizations about the total population from this sample because it would not be representative enough. Source: googleusercontent.com MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Convenience Sampling Source: questionpro.com MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Convenience Sampling When to Use: When the population is not clearly defined, the sampling unit is not clear, or a complete source list is not available. Key Aspect: Subjects for a study are easily available within the proximity of the researcher. General procedure: ○ It is done at the “convenience” of the researcher. ○ Selection : The individuals that are convenient and easiest to reach are selected to be included in the sample. Slide Courtesy:Dr.Uma MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Convenience Sampling examples Given a set of 20 coins of different colours. Let’s say that the researcher likes the numbers 4,7,12,15,20. Thus, the coins with the same numbers are included in the sample. Source: analyticsvidhya.com MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Convenience Sampling examples To research the opinions about student support services in your university: After each of your classes, you ask your fellow students to complete a survey on the topic. This is a convenient way to gather data, but as you only surveyed students taking the same classes as you at the same level, the sample is not representative of all the students at your university. Source: assets.pearsonschool.com MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Convenience Sampling examples To record the popular opinions of people about the current laws of the city. The researcher surveys all people that pass by his house.
Again, this is a convenient way of studying the opinions of people living in the city. But, it doesn’t reflect the opinions of all the residents of the city. Source: slideshare.com MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Convenience Sampling: Advantages & Disadvantages ➔ Advantages: This type of sampling is useful in pilot studies. It is an inexpensive way to gather initial data for the research. It saves time. It is relatively easy to get a sample. It is simple and easy to implement. ➔ Disadvantages: It is prone to significant bias as the sample may not be representative of the characteristics of the population. Since the sample may not be representative of the population, this type of sampling can’t produce generalizable results. It might lead to sampling errors. A study conducted on a convenience sample will have limited external validity. Slide Courtesy:Dr.Uma MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Judgemental Sampling Judgemental or Purposive sampling is a type of non-probability sampling where the researcher chooses the sample based on who they think would be appropriate for the study. This is used primarily when there is a limited number of people that have expertise in the area being researched. The sample depends on the judgement of the experts conducting the study. It is not a scientific method of sampling. Source: dataz4s.com MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Judgemental Sampling Source: questionpro.com MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Judgemental Sampling When to Use: This is used primarily when there is a limited number of people that have expertise in the area being researched. Also, the researcher must be confident that the chosen sample is truly representative of the entire population. Key Aspect: The researcher selects a sample based on experience or knowledge of the group to be sampled. General Procedure: ○ On the basis of the researcher’s knowledge and judgment, elements of the population are sampled.
○ Selection : Elements that own the qualities expected by the researcher. Slide Courtesy:Dr.Uma MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Judgemental Sampling examples Given a set of 20 coins of different colours. Suppose, the experts believe that coins numbered 1, 7, 10, 15, and 19 should be considered for the sample as they may help us to infer the population in a better way. Source: analyticsvidhya.com MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Judgemental Sampling examples To know more about the opinions and experiences of disabled students at your university You purposefully select a number of students with different support needs at your university in order to gather a varied range of data on their experiences with student services. Source: rm-15da4.kxcdn.com MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Judgemental Sampling examples A panel decides to understand the factors which lead a person to select ethical hacking as a profession. The researchers who understand what ethical hacking is will be able to decide who should form the sample to learn about it as a profession. Researchers can easily filter out those participants who can be eligible to be a part of the research sample. Source:statisticshowto.com MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Judgemental Sampling: Advantages & Disadvantages ➔ Advantages: It consumes minimum time. The researcher is given an opportunity to bring his judgement and expertise to play. No special knowledge of statistics is needed. Real time results can be obtained. ➔ Disadvantages: It is prone to errors in judgment by researcher. Low level of reliability and high levels of bias. Inability to generalize research findings to the entire population. It is difficult to choose the appropriate sample size. Slide Courtesy:Dr.Uma MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Quota sampling In this type of sampling, sample elements are selected until the quota controls are satisfied. 
The population is first segmented into mutually exclusive sub-groups, just as in stratified sampling. Then judgment is used to select subjects or units from each segment based on a specified proportion. The population units are selected based on predetermined characteristics of the population. It is similar to Stratified sampling but it doesn’t involve random selection. Ex: recruiting the first 50 men and first 50 women that meet inclusion criteria. MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Quota sampling Source: questionpro.com MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Quota sampling When to Use: If a study aims to investigate a trait or a characteristic of a certain subgroup, this type of sampling is the ideal technique. Key Aspect: Sample elements are selected until the quota controls are satisfied. General Procedure: ○ Divide the population into subgroups. ○ Identify proportions or weightage in which the subgroups are present in the population. ○ Select an appropriate sample size while maintaining the proportions of the subgroups. ○ Conduct the surveys according to the quotas defined. Slide Courtesy:Dr.Uma MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Quota sampling examples Given a set of 20 coins of different colours. Here we need to select items based on predetermined characteristics of the population. Suppose we have to select coins having a number in multiples of four for our sample. Thus, the coins 4,8,12,16,20 are sampled. Source: analyticsvidhya.com MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Quota sampling examples To survey individuals about what smartphone brand they prefer to use. Suppose the researcher considers a sample size of 500 respondents. Also, the researcher is only interested in surveying ten states in the US.
The researcher divides the population as follows: Gender: 250 males and 250 females; Age: 125 respondents each between the ages of 1-50, and 51+; Location: 50 responses per state. Source: ovationmr.com MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Quota sampling examples A cool drinks company wants to find out what age group prefers what brand of drinks in a particular city. The researcher applies quotas on the age groups of 11-21, 22-31, 32-41, 42-51. The researcher then samples people from each quota and surveys them to gauge the trend among the population of the city. Source: ovationmr.com MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Quota sampling: Advantages & Disadvantages ➔ Advantages: It is a cost effective method. There is convenience in execution of this sampling. It is a speedy process. The information can be deciphered once the sampling is done. It improves the representation of certain groups within the population and also ensures that they are not over-represented. ➔ Disadvantages: It is impossible to determine the sampling error, as the sample is not chosen using random selection. It can result in sampling bias if the selection of units was based on ease of access and cost considerations. It is not possible to make statistical inferences from the sample to the population, leading to problems of generalization. Slide Courtesy:Dr.Uma MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Snowball sampling In this type of sampling, survey subjects are selected based on referral from other survey respondents. Existing subjects are asked to nominate further subjects known to them so that the sample increases in size like a rolling snowball. This method of sampling is effective when a sampling frame is difficult to identify. It is usually applied when the subjects are difficult to trace. Ex: it will be extremely challenging to survey homeless people or illegal immigrants.
Source: cuttingedgepr.com MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Snowball sampling Source: questionpro.com MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Snowball sampling When to Use: When the desired sample characteristic is rare. It may be extremely difficult or cost-prohibitive to locate respondents in these situations. Key Aspect: Research starts with a key person, who introduces the next person, and so on, forming a chain. How: ○ Identify initial subjects and ask them to identify others. ○ Selection : This technique relies on referrals from initial subjects to generate additional subjects. Slide Courtesy:Dr.Uma MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Snowball sampling examples To select students from a class of 20 to be a part of a volunteer club. Here, we randomly chose person 1 for our sample, and then he/she recommended person 6, and person 6 recommended person 11, and so on. 1->6->11->14->19 Source: analyticsvidhya.com MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Snowball sampling examples To study the level of customer satisfaction among the members of an elite country club. It is extremely difficult to collect primary data unless a member of the club agrees to have a direct conversation with you and provides the contact details of the other members of the club. Thus the primary data source is randomly selected and it nominates other potential data sources that will be able to participate in the research study. Source: cdn.scribbr.com MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Snowball sampling examples To research the experiences of homelessness in your city. Since there is no list of all homeless people in the city, probability sampling isn’t possible. You meet one person who agrees to participate in the research, and she puts you in contact with other homeless people that she knows in the area.
Source: miro.medium.com MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Snowball sampling: Advantages & Disadvantages ➔ Advantages: The chain referral process allows the researcher to reach populations that are difficult to sample when using other sampling methods. The process is cheap, simple and cost-efficient. This sampling technique needs little planning and a smaller workforce compared to other sampling techniques. ➔ Disadvantages: There is a significant risk of selection bias in snowball sampling, as the referred individuals will share common traits with the person who recommends them. It is usually impossible to determine the sampling error or make inferences about populations based on the obtained sample. The researcher has little control over the sampling method. Representativeness of the sample is not guaranteed. MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Sample size The more heterogeneous a population is, the larger the sample needs to be. For probability sampling, the larger the sample size, the better. With non-probability samples, the results are not generalizable, whatever the sample size. The main factors affecting the sample size are: ○ Total size of the population ○ Margin of error ○ Confidence level MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Sample statistic & Population parameter ➔ Sample statistic: A sample statistic is a piece of information you get from a fraction of a population, i.e. a sample. It can also be defined as any number or statistic computed from the sample data. Example: sample average, median, sample standard deviation, and percentiles. ➔ Population parameter: A quantity or statistical measure for a given population is called a population parameter. It can also be defined as data that refers to something about an entire population. Example: mean and variance of a population are population parameters.
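The three factors listed on the Sample size slide (population size, margin of error, confidence level) can be combined into a rough sample-size estimate. The sketch below uses Cochran's formula for a proportion with a finite population correction, which is one commonly used rule and is an assumption here, since the slides do not give a formula. The function name `required_sample_size` and the conservative default `p = 0.5` are also assumptions.

```python
import math
from statistics import NormalDist

def required_sample_size(population_size, margin_of_error,
                         confidence_level=0.95, p=0.5):
    """Approximate sample size for estimating a proportion (Cochran's formula).

    population_size=None means the population is treated as infinite.
    p=0.5 is the most conservative guess for the unknown proportion.
    """
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    n0 = z ** 2 * p * (1 - p) / margin_of_error ** 2
    if population_size is None:
        return math.ceil(n0)
    # Finite population correction shrinks n0 when the population is small.
    return math.ceil(n0 / (1 + (n0 - 1) / population_size))

n_large = required_sample_size(None, margin_of_error=0.05)
n_university = required_sample_size(20_000, margin_of_error=0.05)
print(n_large, n_university)
```

A smaller margin of error or a higher confidence level increases the required size, while a small population reduces it.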
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Sample statistic & Population parameter Decide whether the numerical value describes a population parameter or a sample statistic. a.) A recent survey of a sample of 450 college students reported that the average weekly income for students is $325. Ans: Because the average of $325 is based on a sample, this is a sample statistic. b.) The average weekly income for all students is $405. Ans: Because the average of $405 is based on a population, this is a population parameter. MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Errors in sampling Errors in sampling are of two kinds. Sampling error or Random error occurs when the sample is not representative of the population. Non-sampling error or Systematic error occurs during data collection, causing the data to differ from the true values. Slide Courtesy:Dr.Uma MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Sampling error The discrepancy between a sample statistic and its population parameter is called sampling error. Defining and measuring sampling error is a large part of inferential statistics. It occurs when the sample is not representative of the population. The sampling error for a given sample is unknown, but when the sampling is random, for some estimates (for example, sample mean, sample proportion) theoretical methods may be used to measure the extent of the variation caused by sampling error. MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Sampling error There is a difference between population parameters and sample statistics. This is due to sampling error. Two samples from the same population have differing statistics. This is due to sampling variation. It is also the reason why scientific experiments produce different results under identical conditions.
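Sampling error and sampling variation can be seen in a small simulation. The sketch below is illustrative only: the population of 20,000 weekly incomes is synthetic (drawn from a normal distribution with mean 400 and standard deviation 80, both chosen arbitrarily), and the names `population`, `sample_mean`, and the seed are assumptions.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical population: weekly incomes of 20,000 students.
population = [rng.gauss(400, 80) for _ in range(20_000)]
mu = statistics.mean(population)          # population parameter

def sample_mean(n):
    """Mean of a simple random sample of size n (a sample statistic)."""
    return statistics.mean(rng.sample(population, n))

mean_a = sample_mean(450)
mean_b = sample_mean(450)

# Sampling error: sample statistic minus population parameter.
error_a = mean_a - mu
error_b = mean_b - mu

print(f"population mean: {mu:.2f}")
print(f"sample A mean:   {mean_a:.2f} (error {error_a:+.2f})")
print(f"sample B mean:   {mean_b:.2f} (error {error_b:+.2f})")
```

Both samples come from the same population, yet their means differ from each other (sampling variation) and from the population mean (sampling error).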
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Non-sampling error Non-sampling errors are the results of mistakes made in implementing data collection and data processing, such as ○ failure to locate and interview the correct household ○ errors in understanding of the questions by either the interviewer or the respondent ○ data entry errors ○ missing data ○ poorly conceived concepts, unclear definitions, and defective questionnaires ○ response errors occurring when people are unaware, refuse to answer, or overstate in their answers Major sources: Sampling Bias, Non-response Bias. MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Sampling Bias Sampling bias occurs when a chosen sample is not representative of the larger population. It occurs due to the sampling technique/method used to perform data collection. It can be either selection bias or non-response bias. A sampling method has a sampling bias if not all subjects in the population are equally likely to be included in a sample. That is, a sample is collected in such a way that some members of the intended population have a lower or higher sampling probability than others. MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Selection Bias & Nonresponse bias ➔ Selection bias: It is a bias in which a sample is collected in such a way that some members of the intended population have a lower or higher sampling probability than others. It results in a biased sample of a population in which not all individuals, or instances, were equally likely to have been selected. ➔ Nonresponse bias: Nonresponse bias is a type of sampling bias that occurs because of the absence of certain objects or subjects from a sample. For example, some subjects don’t respond to surveys because they refuse, cannot be contacted, or have a lack of interest in the survey content. MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Bias ex: Q) A new chemical process is run 10 times each morning for five consecutive mornings.
If the new process is put into production, it will be run 10 hours each day, from 7 A.M. until 5 P.M. Is it reasonable to consider the 50 yields to be a simple random sample? Ans) Since the new process runs during both morning and afternoon, the population consists of all the yields that would ever be observed, including both morning and afternoon runs. The sample however is drawn only from that portion of the population that consists of morning runs, and thus it is not a simple random sample. It exhibits a bias and is not representative of the population intended to be studied. MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Sampling variation Simple random samples always differ from their populations in some ways, and occasionally may be substantially different. Two different samples from the same population will differ from each other as well. This phenomenon is known as sampling variation. Sampling variation is one of the reasons that scientific experiments produce somewhat different results when repeated, even when the conditions appear to be identical. MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Independence The items in a sample are said to be independent if knowing the values of some of them does not help to predict the values of the others. With a finite, tangible population, the items in a simple random sample are not strictly independent, because as each item is drawn, the population changes. This change can be substantial when the population is small. However, when the population is very large, this change is negligible and the items can be treated as if they were independent. As a rule of thumb, the items can be treated as independent if the sample size is smaller than 5% of the population size. Since a conceptual population is infinite or very large, the items in a sample drawn from it (ex: repeated measurements of a rock) can always be treated as independent. MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Questions (Q1.) A physical education professor wants to study the physical fitness levels of students at her university.
There are 20,000 students enrolled at the university, and she wants to draw a sample of size 100 to take a physical fitness test. She obtains a list of all 20,000 students, numbered from 1 to 20,000. She uses a computer random number generator to generate 100 random integers between 1 and 20,000 and then invites the 100 students corresponding to those numbers to participate in the study. Which sampling technique is used? Answer: The simple random sampling technique is used. Note that it is analogous to a lottery in which each student has a ticket and 100 tickets are drawn. MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Questions (Q2) A quality engineer wants to inspect rolls of wallpaper in order to obtain information on the rate at which flaws in the printing are occurring. She decides to draw a sample of 50 rolls of wallpaper from a day’s production. Each hour for 5 hours, she takes the 10 most recently produced rolls and counts the number of flaws on each. Is this a simple random sample? Answer: No. Not every subset of 50 rolls of wallpaper is equally likely to comprise the sample. To construct a simple random sample, the engineer would need to assign a number to each roll produced during the day and then generate random numbers to determine which rolls comprise the sample. MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Questions (Q3) A construction engineer has just received a shipment of 1000 concrete blocks, each weighing approximately 50 pounds. The blocks have been delivered in a large pile. The engineer wishes to investigate the crushing strength of the blocks by measuring the strengths in a sample of 10 blocks. Which sampling method is suitable? Answer: To draw a simple random sample would require removing blocks from the center and bottom of the pile, which might be quite difficult. For this reason, the engineer might construct a sample simply by taking 10 blocks off the top of the pile. 
Such a sample is called a convenience sample. MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Questions (Q4) A quality inspector draws a simple random sample of 40 bolts from a large shipment and measures the length of each. He finds that 34 of them, or 85%, meet a length specification. He concludes that exactly 85% of the bolts in the shipment meet the specification. The inspector’s supervisor concludes that the proportion of good bolts is likely to be close to, but not exactly equal to, 85%. Which conclusion is appropriate? Answer: Because of sampling variation, simple random samples don’t reflect the population perfectly. However, they are often fairly close. It is therefore appropriate to infer that the proportion of good bolts in the lot is likely to be close to the sample proportion, which is 85%. It is not likely that the population proportion is exactly equal to 85%, so the supervisor’s conclusion is the appropriate one. MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Questions (Q5) Another inspector repeats the study with a different simple random sample of 40 bolts. She finds that 36 of them, or 90%, are good. The first inspector claims that she must have done something wrong, since his results showed that 85%, not 90%, of bolts are good. Is he right? Answer: No, he is not right. This is sampling variation at work. Two different samples from the same population will differ from each other and from the population. MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Questions (Q6) A geologist weighs a rock several times on a sensitive scale. Each time, the scale gives a slightly different reading. Under what conditions can these readings be thought of as a simple random sample? What is the population? Answer: If the physical characteristics of the scale remain the same for each weighing, so that the measurements are made under identical conditions, then the readings may be considered to be a simple random sample. The population is conceptual. It consists of all the readings that the scale could in principle produce.
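The bolt examples in Q4 and Q5 can be checked numerically. The sketch below uses the usual standard error of a sample proportion, sqrt(p(1 - p)/n), which the slides have not introduced at this point and is brought in here as an assumption; with n = 40 the normal approximation behind it is only rough. The function name `proportion_standard_error` is hypothetical.

```python
import math

def proportion_standard_error(successes, n):
    """Sample proportion and its estimated standard error sqrt(p(1-p)/n)."""
    p_hat = successes / n
    return p_hat, math.sqrt(p_hat * (1 - p_hat) / n)

# Q4: first inspector, 34 good bolts out of 40.
p1, se1 = proportion_standard_error(34, 40)
# Q5: second inspector, 36 good bolts out of 40.
p2, se2 = proportion_standard_error(36, 40)

gap = abs(p2 - p1)
# Standard error of the difference between two independent proportions.
se_diff = math.sqrt(se1 ** 2 + se2 ** 2)

print(f"inspector 1: {p1:.2f} +/- {se1:.3f}")
print(f"inspector 2: {p2:.2f} +/- {se2:.3f}")
print(f"gap {gap:.2f} vs. standard error of the difference {se_diff:.3f}")
```

The gap between the two inspectors is smaller than the standard error of the difference, which is consistent with the answer to Q5: the disagreement is ordinary sampling variation, not evidence of a mistake.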
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS Questions (Q7) What sampling method can be recommended? Determining proportion of undernourished five year olds in a village. Investigating nutritional status of preschool children. In estimation of immunization coverage in a province, data on seven children aged 12-23 months