Data Science Grade X Student Handbook PDF

Summary

This document is the Data Science student handbook for Grade X students, developed by the Central Board of Secondary Education (CBSE). It covers topics such as statistics, distributions, identifying patterns, data merging, and ethics in data science. The handbook aims to provide a foundation in data science concepts and to prepare students for industry readiness.

Full Transcript


DATA SCIENCE GRADE X
Student Handbook
Version 1.0

ACKNOWLEDGMENT

Patrons
Sh. Ramesh Pokhriyal 'Nishank', Minister of Human Resource Development, Government of India
Sh. Dhotre Sanjay Shamrao, Minister of State for Human Resource Development, Government of India
Ms. Anita Karwal, IAS, Secretary, Department of School Education and Literacy, Ministry of Human Resource Development, Government of India

Advisory, Editorial and Creative Inputs
Mr. Manuj Ahuja, IAS, Chairperson, Central Board of Secondary Education

Guidance and Support
Dr. Biswajit Saha, Director (Skill Education & Training), Central Board of Secondary Education
Dr. Joseph Emmanuel, Director (Academics), Central Board of Secondary Education
Sh. Navtez Bal, Executive Director, Public Sector, Microsoft Corporation India Pvt. Ltd.
Sh. Omjiwan Gupta, Director Education, Microsoft Corporation India Pvt. Ltd.
Dr. Vinnie Jauhari, Director Education Advocacy, Microsoft Corporation India Pvt. Ltd.
Ms. Navdeep Kaur Kular, Education Program Manager, Allegis Services India

Value Adder, Curator and Coordinator
Sh. Ravinder Pal Singh, Joint Secretary, Department of Skill Education, Central Board of Secondary Education

ABOUT THE HANDBOOK

In today's world, we have a surplus of data, and the demand for learning data science has never been greater. Students need to be given a solid foundation in data science and technology for them to be industry ready. The objective of this curriculum is to lay the foundation for data science: understanding how data is collected and analyzed, and how it can be used in solving problems and making decisions. It also covers ethical issues with data, including data governance, and builds a foundation for AI-based applications of data science. Therefore, CBSE is introducing 'Data Science' as a skill module of 12 hours duration in class VIII and as a skill subject in classes IX-XII.

CBSE acknowledges the initiative by Microsoft India in developing this data science handbook for class X students. This handbook introduces the concepts of distributions, identifying patterns, and data merging with practical examples. The course covers the theoretical concepts of data science followed by practical examples to develop critical thinking capabilities among students. The purpose of the book is to enable the future workforce to acquire data science skills early in their educational phase and build a solid foundation to be industry-ready.

Contents

USE OF STATISTICS IN DATA SCIENCE
1. Introduction
2. What are subsets?
3. Two-way frequency table
4. Interpreting two-way tables
5. Two-way relative frequency table
6. Meaning of mean
7. Median
8. Mean Absolute Deviation
9. What is Standard Deviation?
10. Activity
Exercises

DISTRIBUTIONS IN DATA SCIENCE
1. Introduction
2. What is distribution in data science?
3. What are different types of distributions?
4. Statistical Problem Solving Process
5. Activity – Choosing groups for school dance program
Exercises

IDENTIFYING PATTERNS
1. What is partiality, preference and prejudice?
2. How to identify the partiality, preference and prejudice?
3. Probability for Statistics
4. The Central Limit Theorem
5. Why is the Central Limit Theorem important?
Exercises

DATA MERGING
1. Overview of Data Merging
2. What is Z-Score?
3. How to calculate a Z-score?
4. How to interpret the Z-score?
5. Why is a Z-score so important?
6. Concept of Percentiles
7. Quartiles
8. Deciles
Exercises

ETHICS IN DATA SCIENCE
1. Note about data governance framework
2. Ethical guidelines around data analysis
3. Discarding the Data
References

CHAPTER: USE OF STATISTICS IN DATA SCIENCE

Studying this chapter should enable you to understand:
- What are subsets and relative frequency?
- Meaning of mean
- What is median and its usage in data science?
- What is mean absolute deviation?
- What is Standard Deviation?

1. Introduction

In the previous classes of data science, we have seen how data plays a vital role in our daily lives. We have also seen the significance of analyzing and visualizing the data. Now it is time to get into a little more detail of data analysis techniques and understand some of the statistical terminologies that are frequently used in data science. In this chapter, we will get to know some of the statistical concepts such as subsets, mean, median and relative frequency. We will also see how these are used in the context of data science.

2. What are subsets?

Many a time, we encounter situations where we have a lot of data with us. However, for analysis, we do not need the entire data for consideration. Thus, instead of working with the whole data set, we can take a certain part of the data for our analysis. This division of a small set of data from a large set of data is known as a Subset.

Subsetting the data is a useful indexing feature for accessing object elements. It can be used for selecting and filtering variables and observations. We subset the data from a data frame to retrieve the part of the data that we need for a specific purpose. This helps us to observe just the required set of data by filtering out unnecessary content. For example, if you have a table of 100 rows and 100 columns and you want to perform certain actions on the first 5 rows and the first 5 columns, you can separate them from the main table. This small table of 5 rows and 5 columns is known as a "Subset" in Data Analytics.

How do we subset the data?

Subsetting is a very significant component of data management, and there are several ways that one can subset data. Let us now understand the different ways of subsetting the data; a short code sketch follows the list below.

1. Row-based subsetting: In this method of subsetting, we take some rows from the top or bottom of the table. Consider you have a table of 6 rows and 4 columns. You take the top 3 rows from the table.

2. Column-based subsetting: Sometimes the original data set may contain a large number of columns, and all of them may not be necessary to perform the analysis. We then select specific columns from the dataset. This process of subsetting is known as column-based subsetting.

3. Data-based subsetting: To subset the data based on specific data values, we use data-based subsetting. In the figure accompanying this section, only the coloured rows are selected.
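Since the handbook works with data frames, here is a minimal pandas sketch of the three kinds of subsetting. The cars table and its column names are hypothetical, invented only for this illustration.

```python
import pandas as pd

# Hypothetical cars table, not from the handbook's figures.
cars = pd.DataFrame({
    "model":  ["A", "B", "C", "D", "E", "F"],
    "colour": ["red", "blue", "red", "green", "red", "blue"],
    "price":  [500, 650, 480, 700, 520, 610],
})

top_rows  = cars.head(3)                    # 1. row-based: top 3 rows
some_cols = cars[["model", "price"]]        # 2. column-based: chosen columns
red_cars  = cars[cars["colour"] == "red"]   # 3. data-based: rows matching a value

print(red_cars)
```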
3. Two-way frequency table

Consider you are conducting a poll asking people if they like chocolates, and you record the responses in a simple table. If you now break the data down into age categories of 5-10 years, 10-15 years, and 15-20 years, and tabulate the number of people who liked and disliked chocolates, the table would look like the one below.

This type of table is called a two-way frequency table. A two-way table is a statistical table that demonstrates the observed number, or frequency, for two variables: the rows indicate one category and the columns indicate the other category. Two-way frequency tables show how many data points fit in each category.

The row category in this example is "5-10 years", "10-15 years" and "15-20 years". The column category is the choice "Like chocolates" or "Do not like chocolates". Each cell tells us the number (or frequency) of people. There is a lot of information that we can get from this small table. For example:

How many people were questioned? Answer: 10
How many people like chocolates? Answer: 6
In which age group do people like chocolate the most? Answer: 10-15 years

4. Interpreting two-way tables

The entries in the table are counts. The table has several features:
- Categories are in the left column and top row.
- The counts are placed in the center of the table.
- The totals are at the end of each row and column.
- The sum of all counts (a grand total) is placed at the bottom right.

Example: Consider a table recording car ownership. The rows of the table tell us whether a respondent is male or female, and the columns tell us whether they own a car or not. Each cell tells us the number (or frequency) of people; for example, 40 females own a car.

Activity 1.1
Record how many of your friends like cricket and how many like football. Create a two-way relative frequency table with the data.

5. Two-way relative frequency table

A two-way relative frequency table is very similar to the two-way frequency table. The only difference is that we consider percentages instead of counts. Two-way relative frequency tables represent the percentage of data points that fit in each category. We can take the help of row relative frequencies or column relative frequencies; it depends on the context of the problem. Two-way relative frequency tables are helpful when there are different sample sizes in a dataset, because percentages make it easier to compare preferences. Let us consider a two-way table recording the preferences of boys and girls with regard to indoor and outdoor sports.

Example: A survey of eighty people (40 men and 40 women) was taken on what genre of movie they would choose to watch, and the following responses were recorded: 8 men preferred comedy movies, 12 men preferred action movies, 14 men preferred horror movies, 16 women preferred comedy movies, 12 women preferred action movies, and 6 women preferred horror movies. The information collected is used to build a two-way table. To convert it into a two-way relative frequency table, we convert the individual cells into percentages.
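As an illustration, the chocolate poll can be tabulated with pandas. The individual responses below are hypothetical, chosen only so the totals match the answers given in the text (10 people, 6 who like chocolates, with the 10-15 group liking it most).

```python
import pandas as pd

# Hypothetical responses consistent with the counts quoted in the text.
poll = pd.DataFrame({
    "age_group": ["5-10"]*3 + ["10-15"]*4 + ["15-20"]*3,
    "choice": ["Like", "Like", "Dislike",
               "Like", "Like", "Like", "Dislike",
               "Like", "Dislike", "Dislike"],
})

# Two-way frequency table, with row/column totals in the margins.
print(pd.crosstab(poll["age_group"], poll["choice"], margins=True))

# Two-way relative frequency table: normalize converts counts to proportions.
print(pd.crosstab(poll["age_group"], poll["choice"], normalize="all"))
```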
6. Meaning of mean

Mean is a measure of central tendency. In data science, the mean, also termed the simple average, is the average value of a data set. The mean is a value around which the entire data set is spread out. When the mean is calculated, all values used in calculating the average are weighted equally.

The mean of a data set is calculated by adding up all the values in the data set and then dividing by the number of values present in the data frame. Let us understand how to find a mean with the help of the example below. Consider that we have a set of 11 numbers, 10 to 20, in a data set.

Array = {10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}

The mean is calculated by adding up the 11 numbers in the data set and dividing by 11.
Sum of all the numbers = 165
Mean = 165/11 = 15

Let us now try to understand a real-life application of the mean. We will create a data set of the minimum day temperature on a particular day for five cities in India, and record the temperatures in a table. To calculate the mean, we add the temperatures of all five cities and divide by 5.

Mean = (21 + 13 + 24 + 15 + 20)/5 = 18.6° C

As represented in the accompanying graph, the mean stands for central tendency, meaning it points to the center of the data set.

Activity 1.2
What is the mean of the following heights?
Height of Ravi: 156 cm
Height of Juhi: 148 cm
Height of Shweta: 151 cm
Height of Kishan: 158 cm

7. Median

The median, like the mean, is another form of central tendency. It is the middle point of a sorted data set. To calculate the median, we must order our data set in ascending or descending order. If the data set is sorted from smallest value to biggest value, the exact middle value of the set is the median. Consider the below data set of 5 values.

Array = [12, 34, 56, 89, 32]

Now let us sort the data set.

Sorted array = [12, 32, 34, 56, 89]

The value at the 3rd position is the middle point of the sorted list. So, 34 is our median for the array.

In the previous example, we had a data set with an odd number of records, so we could easily find the middle point. But what if the data set has an even number of records? In these situations, there will be two middle points, and we need to calculate the average of the two to get the median. For example, the median of the sorted set [12, 32, 34, 56] is (32 + 34)/2 = 33.

Mean vs median

Both the mean and the median represent the central tendency of a data set. So when do we use the median over the mean? The median is a more accurate form of central tendency in scenarios where there are some irregular values, also known as outliers. For example, consider the following scenario: your father gets his blood pressure checked every week, but due to some error in the device, the recording for one week was far too high. In this scenario, the mean value deviates greatly from the regular blood pressure values due to the device error, whereas the median value still accurately represents the central point of the data set. So, under circumstances where there are outliers in the data set, the median is a more effective measure of central tendency.
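A short sketch of mean versus median using Python's statistics module. The blood-pressure readings are hypothetical, invented to mimic the faulty-device scenario above.

```python
import statistics

values = [12, 34, 56, 89, 32]
print(statistics.mean(values))     # 44.6
print(statistics.median(values))   # 34, the middle of the sorted list

# Hypothetical weekly blood-pressure readings with one faulty spike.
readings = [120, 118, 122, 119, 310]
print(statistics.mean(readings))   # 157.8 -- dragged up by the outlier
print(statistics.median(readings)) # 120   -- still near the regular values
```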
8. Mean Absolute Deviation

Mean Absolute Deviation (MAD) is the average of how far away all values in a data set are from the mean. Let us understand this with an example. Consider the below data set:

12, 16, 10, 18, 11, 19

Step 1: Calculate the mean.
Mean = (12 + 16 + 10 + 18 + 11 + 19) / 6 = 14 (rounded off)

Step 2: Calculate the distance of each data point from the mean. We need the absolute value: for example, if the distance is -2, then we ignore the negative sign, so |-2| = 2. A table is built with the distance of each data point from the mean.

Step 3: Calculate the mean of the distances.
Mean of distances = (2 + 2 + 4 + 4 + 3 + 5) / 6 = 3.33

So 3.33 is our mean absolute deviation, and the mean is 14. The value of the mean absolute deviation gives a very good understanding of the variability of the data set, or in other words, how scattered the data set is.

Activity 1.3
Calculate the mean absolute deviation for the below data set:
26, 35, 22, 28, 40, 38, 19

9. What is Standard Deviation?

The Standard Deviation is a measure of how spread out the numbers are. To be specific, standard deviation represents how much the data is spread out around the mean, or average. For example, are all the points close to the average? Or are there lots of points way above or below the average?

In order to find the standard deviation:
1. Calculate the mean by adding up all the data pieces and dividing by the number of pieces of data.
2. Subtract the mean from every value.
3. Square each of the differences.
4. Find the average of the squared numbers calculated in step 3 to find the variance.
5. Lastly, find the square root of the variance. That is the standard deviation.

For example, take the values 1, 2, 3, 5 and 8.

Step 1: Calculate the mean.
1 + 2 + 3 + 5 + 8 = 19
19/5 = 3.8 (mean)

Step 2: Subtract the mean from every value.
1 - 3.8 = -2.8
2 - 3.8 = -1.8
3 - 3.8 = -0.8
5 - 3.8 = 1.2
8 - 3.8 = 4.2

Step 3: Square each difference.
-2.8 × -2.8 = 7.84
-1.8 × -1.8 = 3.24
-0.8 × -0.8 = 0.64
1.2 × 1.2 = 1.44
4.2 × 4.2 = 17.64

Step 4: Calculate the average of the squared numbers to get the variance.
7.84 + 3.24 + 0.64 + 1.44 + 17.64 = 30.8
30.8/5 = 6.16 (variance)

Step 5: Find the square root of the variance.
The square root of 6.16 = 2.48

Thus, the standard deviation of the values 1, 2, 3, 5 and 8 is 2.48. Graphically, this standard deviation of 2.48 can be represented as in the accompanying figure. A short Python check of this section's calculations appears after the list of applications below.

Activity 1.4
Read how standard deviation is used in calculating average rainfall in your city.

A few real-life implementations of standard deviation include:
1. Grading tests – if a teacher wants to know whether students are performing at the same level, or whether there is a high standard deviation.
2. Calculating the results of a survey – if someone wants some measure of the reliability of the responses received in the survey, to predict how a bigger group of people may answer the same questions.
3. Weather forecasting – if a weather forecaster is analyzing the low temperatures forecasted for three different cities, a low standard deviation indicates a reliable weather forecast.
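The two worked examples above can be checked with a few lines of Python. This is a sketch using the standard library's statistics module; pstdev divides by n, matching step 4 above.

```python
import statistics

# Mean absolute deviation of the data set from section 8.
data = [12, 16, 10, 18, 11, 19]
mean = statistics.mean(data)
mad = sum(abs(x - mean) for x in data) / len(data)
print(round(mad, 2))    # 3.33

# Population standard deviation of the values from section 9.
values = [1, 2, 3, 5, 8]
print(round(statistics.pstdev(values), 2))   # 2.48
```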
10. Activity

What is the average family size of the households of students in your school? Consider that your school head wants to calculate the average family size of students in your school. To carry out this activity on a large scale, let us first break it down into smaller sub-parts. To start with, we will ask teachers to take down the family size of the students of each classroom. We need to answer the following questions here:

1. What is the intended population? (Households of students in each teacher's class)
2. What is the variable to be measured? (The number of people in a household)
3. Anticipating variability. (Asking about typical household sizes)

Let us now move ahead and start getting answers to these questions step by step.

Collect/consider data

Suppose the teacher decides to work with five students at a time in the classroom and asks each student, "How many people, including yourself, are in the household that you live in?". As an answer, each student represents their family size with a collection of snap cubes. The data for "family size" is represented with snap cubes in Fig 1.17.

Analyze the data

To examine the distribution of the household sizes in the collected data, students first need to arrange the stacks of snap cubes in increasing order, as shown in Fig 1.18. You must have realized that family sizes vary. The next question that we need to ask is: how many people would be in each family if all five families were the same size? When we make all the family sizes the same, family size does not vary. You may use two equivalent approaches:

1. Disconnect all the snap cubes and redistribute them one at a time to the five students until all snap cubes have been allocated. In this case, there are 15 snap cubes; redistributing them among the five students yields 5 stacks of 3 cubes each.
2. Remove one snap cube from the highest stack and place it on one of the lowest stacks, continuing until all the stacks are leveled out.

Both these approaches yield an equal family size of three, which we can consider an equal share, or a fair share.

For the second approach, you can start by removing a snap cube from the highest stack and placing it on one of the lowest stacks. This will result in a new arrangement of cubes, as shown in Fig 1.19. We continue this process until all the stacks are level, or nearly level when there is a remainder, as shown in Fig 1.20. After the final move, all five stacks are levelled with three cubes each. This represents that a family size of three is an equal share. That means, if all five family sizes were the same, the number of people in each household would have been three. This equal share is nothing but the mean of the distribution.

By now, we know how to calculate the mean by adding up all the observations and dividing by the number of observations. However, what does the mean tell us about the distribution? How are we expected to interpret the mean? How are we expected to describe the variability in a distribution in relation to its mean? We can investigate the following problem to get an answer to these questions:

Suppose two other groups of five students in the classroom found their equal share value to be six. What are some different snap cube representations that they could have constructed? To answer this, we should first realize that we need to start with 30 snap cubes. We can then create two different distributions of family size where the equal share value is 6. For example, consider the following two groups, Group 1 (shown in Fig 1.21) and Group 2 (shown in Fig 1.22), of data on five family sizes from the classroom, where the equal share family size for each group is 6.

Because the equal share value for each group is 6, the two groups cannot be distinguished based on the equal share value. An analysis question in this case may be: which group is closer to being equal? We can offer different answers to this question, including:

1. Group 2, as this group has the highest frequency of stacks of six snap cubes.
2. Group 1, as for this group we need fewer snap cubes to level out all the stacks to the equal share value of six.

The second method, of having fewer snap cubes to move, can be thought of as counting the "number of steps to equal", or how many steps we need to move the snap cubes to create the equal-sized groups. Fewer steps indicate that the distribution is closer to being equal and has less variability from the mean. We can go through the process to check that for Group 1 we need to move two cubes a total of two steps, and for Group 2 we need to move two cubes a total of two steps each. Thus, Group 1 and Group 2 have equal variability from the mean.

Interpret the results

Now it is time to interpret the results to answer the original question: what is the average family size of students in your school? Using the results from the last two groups, we can comment that if the families are of equal size, the number of people in a household will be six. This will be the equal share, or mean, value. Thus, with the help of this activity we have learnt how the mean helps us get a quick resolution to our day-to-day questions.
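The "equal share" and "steps to equal" ideas from this activity can be expressed as a tiny Python sketch. The five family sizes below are hypothetical (the handbook's Fig 1.17-1.22 are not reproduced here); only the 15-cube total matches the text.

```python
# Hypothetical family sizes for five students, 15 snap cubes in total.
family_sizes = [2, 2, 3, 4, 4]

equal_share = sum(family_sizes) / len(family_sizes)   # the mean: 3.0 cubes

# "Steps to equal": cubes that must be moved down from stacks above the mean.
steps_to_equal = sum(size - equal_share
                     for size in family_sizes if size > equal_share)

print(equal_share, steps_to_equal)   # 3.0 2.0
```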
Recap
- Subsetting is used to get a smaller chunk of data from a big data set.
- A two-way table is a statistical table that shows the observed number or frequency for two variables.
- Mean is a measure of central tendency that indicates the average value of a data set.
- Median is the middle point of a sorted data set.
- Mean Absolute Deviation is the average of how far away all values in a data set are from the mean.

Exercises

Objective Type Questions
Please choose the correct option in the questions below.

1. We want to get the cars of red colour from the given data set. Which type of subsetting should be used?
a) Column-based subsetting
b) Data-based subsetting
c) Row-based subsetting
d) None of the above

2. Which is a more accurate measure of central tendency when there are outliers in the data set?
a) Mean
b) Median

3. Mean absolute deviation is an indicator of the variability of the data set. Is this a correct statement?
a) Yes
b) No

4. The mean absolute deviation is divided by the coefficient of mean absolute deviation to calculate:
a) Variance
b) Median
c) Arithmetic Mean
d) Coefficient of Variation

5. In a manufacturing company, the number of employees in unit A is 40 with a mean wage of Rs. 6400, and the number of employees in unit B is 30 with a mean wage of Rs. 5500. The combined arithmetic mean is:
a) 9500
b) 8000
c) 7014.29
d) 6014.29

6. The mean deviation about the mean for the following data: 5, 6, 7, 8, 6, 9, 13, 12, 15 is:
a) 1.5
b) 3.2
c) 2.89
d) 5

7. The arithmetic mean of the numerical values of the deviations of items from some average value is called the:
a) Standard Deviation
b) Range
c) Quartile Deviation
d) Mean Deviation

Standard Questions
1. Explain the different ways of subsetting data.
2. When should we use the median over the mean?
3. What is Mean Absolute Deviation?
4. What is a two-way relative frequency table? How is it different from a two-way frequency table?
5. What are two-way frequency tables beneficial for?
6. What is Standard Deviation?
7. How do you calculate Standard Deviation?
8. Name five real-life applications of Standard Deviation.
9. Explain five real-life situations where subsetting data can be advantageous.

Higher Order Thinking Skills (HOTS)
1. Draw a graph to represent a standard deviation of 4.6.
2. Calculate the mean of this data set: [56, 89, 76, 58, 58, 65]
3. Calculate the median of this data set: [56, 89, 76, 58, 58, 65]

Applied Project
Calculate the average student height and weight for students in your classroom.

CHAPTER: DISTRIBUTIONS IN DATA SCIENCE

Studying this chapter should enable you to understand:
- What is distribution in data science?
- Different types of continuous distributions
- Different types of discrete distributions

1. Introduction

Now that we have understood the statistical terminology frequently used in data science in the previous chapter, it is time to learn about the distribution of data in statistics. In this chapter, we will learn about different types of data distributions and the characteristics of each distribution in detail.

2. What is distribution in data science?

Distribution in data science is a method which shows the probable values for a variable and how often they occur. While the concept of probability gives us the mathematical calculations, distributions help us actually visualize what is happening underneath.

For example, consider a coin which has two sides, head and tail. Now when you throw the coin up in the air, what is the probability of getting a head? It is 1/2, or half, right? And what is the probability of getting a tail? It is again 1/2, or half.
However, if you ask, what is the probability of getting a third side on the coin? Isn't it nil? It is impossible to get a third side on a coin which has only two sides, head and tail. Thus the probability is zero.

The distribution of an event consists of not just the values that have been observed, but all possible values. So, the distribution of the event of tossing the coin is given by the following table: the probability of getting a head is 0.5, and the probability of getting a tail is 0.5. You can be sure that you have exhausted all the values when the sum of the probabilities is equal to 1, i.e. 100%. For all values apart from these, the probability of occurrence is zero.

Every probability distribution is associated with a graph which describes the likelihood of occurrence of each event. The accompanying graph represents our example. However, a point to note here is that a distribution in statistics is defined by the underlying probabilities and not by the graph; the graph is just a visual representation. We studied the topic of data visualization in the previous grade.

Now, let us extend our problem statement to tossing two coins. What are the possibilities here? Head-Head, Head-Tail, Tail-Tail and Tail-Head. A table can list all possible combinations. Let us now understand the probability distribution for this scenario. By looking at its graph, we can understand that the probability of getting a head on both coins is 0.25. Similarly, the probability of getting a head on one coin and a tail on the other is 0.25, the probability of getting a tail on one coin and a head on the other is 0.25, and the probability of getting a tail on both coins is 0.25. This type of distribution is called a Uniform Distribution. Thus, the graph of the probability distribution in this case should look as shown in Fig 2.4. A simulation sketch of this two-coin distribution follows.
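To see the uniform distribution emerge, one can simulate the two-coin toss. This sketch is not part of the handbook; it uses only Python's standard library.

```python
import random

trials = 100_000
counts = {"HH": 0, "HT": 0, "TH": 0, "TT": 0}

for _ in range(trials):
    # One toss of two fair coins, e.g. "HT" = head on the first, tail on the second.
    outcome = random.choice("HT") + random.choice("HT")
    counts[outcome] += 1

for outcome, count in counts.items():
    print(outcome, count / trials)   # each relative frequency is close to 0.25
```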
Activity 2.1
Read about income distribution between different classes of people in India and try to classify it as a type of distribution.

3. What are different types of distributions?

The types of distributions in data science are based solely on the kind of data we encounter while dealing with problems. The data can be discrete or continuous.

Discrete data is data that takes only specified values. For example, if you take a test, you can either pass or fail; the data is discrete in this case as it has only two specified outcomes.

Continuous data is data that can take any value within a given range. This range can be either finite or infinite. Examples include the depth of an ocean, the weight of a person and the length of a road.

4. Statistical Problem Solving Process

The purpose of the Statistical Problem-Solving Process, detailed in Fig 2.5, is to collect and analyze data to answer statistical investigative questions. This investigative process involves four components, each of which involves exploring and addressing variability:

1. Formulate Statistical Investigative Questions
2. Collect/Consider the Data
3. Analyze the Data
4. Interpret the Data

Let us understand each step in detail now.

Formulate Statistical Investigative Questions

This step can also be called anticipating variability while beginning the process. Formulating statistical investigative questions that anticipate variability leads to productive investigations. For example, the following are all statistical investigative questions that anticipate variability and can lead to a rich data collection process and subsequent analysis of the data:

- How fast can my plant grow?
- Do plants exposed to more sunlight grow faster?
- How does sunlight affect the growth of a plant?

In contrast, the question "How tall is the plant?" is answered with a single height; it is therefore not a statistical investigative question. "How tall is the plant?" is a question we ask to collect data. Many other data collection questions could be asked to help collect the data needed to answer a statistical investigative question such as "Do plants exposed to more sunlight grow faster?". The fact that there will be differing heights for the different exposures to sunlight implies that we anticipate an answer based on measurements of plant heights that vary. While statistical investigative questions begin worthwhile studies, the use of questioning is prominent throughout all four components of the statistical problem-solving process. Such uses of questioning will be illustrated throughout the examples at the different levels.

In addition to anticipating variability, there are other features of a statistical investigative question that are important. The variables of interest must be clear; the group or population that the question is focused on must be clear; the intent of the question should be clear (is the question asking for a description of the data, is it comparing a variable across two or more groups, or is it looking at an association between two variables?); the question should be about the whole group (anticipating variability) and not about an individual (giving a deterministic answer); the question should be answerable through data collection (primary data) or with the data in hand (secondary data); and the question should be purposeful.

Collect/Consider the Data

This step can be called acknowledging variability while designing for differences. Data collection designs must acknowledge variability in data. A few methods are used to reduce and detect variability in data, such as Statistical Process Control and random sampling, while others are used to induce variability to test treatments, such as Design of Experiments. In the latter approach, experimental designs are chosen to acknowledge the differences between groups subjected to different treatments. Random assignment to the groups is intended to reduce differences between the groups due to factors that are not manipulated or controlled in the experiment. In all designs, the main statistical focus is to look for, account for and explain variability.

After the data is available, whether it is collected first-hand or acquired from another source, it needs to be interrogated. For example, questions about how the variables differ by type, the possible outcomes of each of the variables, and how the data was collected are necessary to clarify whether the data is useful for answering the statistical investigative question. The data collection design impacts the scope of generalizability and the possible limitations in analysis and interpretation.
When we of generalizability and the possible generalize the results and look beyond limitations in analysis and the study data collected; we must interpretation. consider these sources of variability. Analyze the Data 5. Activity – Choosing This step can also be called as accounting of variability while the groups for school distributions. dance program When we analyze the data, we try to Consider that there is an annual event understand its variability. Reasoning in your school for which you all are about distributions is key to accounting planning to shortlist a musical group for for and describing variability at all a single grade. You can do this by development levels. Graphical displays conducting a class census. Let us ask and numerical summaries are used to following statistical question to go start explore, describe, and compare with the activity: variability in distributions. What type of music do the students in our For example, the batting averages of grade like? Indian Cricket Team and the batting To start with, we can start collecting averages of Australia Cricket Team for a data for each class. We will collect data particular year can be displayed in two for entire population of the class i.e. comparative dot plots and boxplots. each student will answer this question. These graphs show the variability of Later, we will extend the collection to the each teams’ distributions of batting entire grade. averages. We can consider variability by 20 On similar grounds, we may also plan to which can foreshadow answering the collect data for the entire school. This is other statistical investigative questions. because one single class may not For example, we can pose a series of represent the preferences of all students survey questions that allow us to explore in one grade or all students at the in more depth the types of music school. students like. After collecting all the Now that we have all the data with us, data, we can look at whether an each class can compare preferences of association appears to be likely between their class with the preferences of other different types of music students like. classes of the school and explore the This information might tell us the choice following statistical question: of music for the school dance. What type of music do the students at our Question 1: Tick yes for any of the school like? following music types you like. Check no for any you don’t like Collect/consider data The statistical question, What type of music do the students at our school like? asks about the preferences of students at the school overall. In this case, a data collection plan may use a single class, for example, grade 10 English class, as a sample to make decisions for the whole school. For this situation, we can Question 2: What is your most discuss the limitations of the chosen favourite type of music? sample. Alternatively, what we can do is, we can randomly select few students from each class or select couple of classes and get all the students in those classes to complete a survey. To excel further, we can improve on survey questions used before by understanding potential pitfalls to avoid in survey design (like ambiguous Question 3: What is your second wording and misleading questions) or maybe by providing more choices in the favourite type of music? answers. Additionally, we can collect the data on multiple aspects of the topic 21 Question 4: Would you prefer a live band or a DJ at the annual dance event at school? 
Analyze the data

Many of the graphical, tabular, and numerical summaries made during data collection can be enhanced and used for more sophisticated analyses at later stages. Displays at later stages mostly represent multiple variables and/or use multiple displays to answer the statistical investigative questions. To analyze the survey data collected using a class as a sample for the school, we can graph the number of students who like each type of music. The bar graph in Fig 2.6 uses the student answers to survey question 1 above, where each music type is a variable.

The bar graph shows the frequencies of students who like and dislike each type of music. From this graph, it is evident that Rock has the highest number of students in the class responding yes, that they like it. The second highest is Rap, followed by Classical. Thus, the graph suggests that Rock, Rap and Classical are the most liked music among the students.

Responses to survey question 2 can be analyzed to see what students' favourite type of music is. Fig 2.7 shows there are an equal number of students who voted Classical and Rock as their favourite music.

Students can explore favourite and second favourite music, the answers to questions 2 and 3 on the survey, through a two-way graph as shown in Fig 2.8. In this graph, each dot represents a student in the class who responded to the survey. The two-way graph shows 36 bins (6 possible second favourite music types by 6 possible favourite music types). The bin in the top left corner has two dots in it, representing the two students in the class who answered that their favourite type of music is Classical and their second favourite type of music is Rock. Analysis of the favourite and second favourite types of music shows that nearly all the students in the class (17 students, those shaded lighter in Fig 2.8) voted Bollywood, Pop or Rock as their first or second choice. Only two students (those shaded in a darker colour) did not rank Bollywood, Pop or Rock in their top two.

Students can also look at the choice between a live band and a DJ and add this to their final conclusions about the types of music that students like (Fig 2.9). The difference in the number of students who prefer a live band versus a DJ is zero.

Interpret Results

The analysis shows that Bollywood, Pop and Rock are very similar in terms of the number of students who choose them as their favourite type of music. Most of the students picked Bollywood, Pop or Rock as their favourite and/or second favourite type of music; all but two students have these in their first and/or second choices. In addition, students might note that a live band and a DJ are equally chosen among class members.
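Here is a sketch of how the tallies behind Fig 2.7 and Fig 2.8 could be computed with pandas. The six responses below are invented for illustration and do not reproduce the class data shown in the figures.

```python
import pandas as pd

# Hypothetical favourite / second-favourite answers from a small class sample.
survey = pd.DataFrame({
    "favourite": ["Rock", "Classical", "Pop", "Rock", "Bollywood", "Classical"],
    "second":    ["Pop",  "Rock",      "Rock", "Bollywood", "Pop",  "Rock"],
})

# Counts behind a Fig 2.7-style bar graph of favourite music types.
print(survey["favourite"].value_counts())

# Counts behind a Fig 2.8-style two-way display (favourite vs second favourite).
print(pd.crosstab(survey["favourite"], survey["second"]))
```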
Recap
- Distribution in data science is a method which shows the probable values for a variable and how often they occur.
- Discrete data is data that takes only specified values.
- Continuous data is data that can take any value within a given range. This range can be either finite or infinite.

Exercises

Objective Type Questions
Please choose the correct option in the questions below.

1. If a card is chosen from a standard deck of cards, what is the probability of getting a five or a seven?
a) 4/52
b) 1/26
c) 8/52
d) 1/169

2. Which of the following is the condition for a Uniform Distribution?
a) Each value in the set of possible values has the exact same possibility of happening
b) It has a constant probability of success
c) It has only two possible outcomes
d) It must have at least 3 trials

3. The collection of one or more outcomes from an experiment is called:
a) Probability
b) Distribution
c) Event
d) Random Experiment

4. Which of the following are types of distributions?
a) Continuous
b) Discrete
c) Both of them

5. Which of the following is not an example of a discrete probability distribution?
a) The sale or purchase price of a house
b) The number of bedrooms in a house
c) The number of bathrooms in a house
d) Whether or not a home has a swimming pool in it

6. A discrete probability distribution may be represented by:
a) A table
b) A graph
c) A mathematical equation
d) All of these

7. What is the probability that a ball is drawn at random from a jar?
a) 0.1
b) 1
c) 0.5
d) 0
e) Cannot be determined from the given information

8. The statistical investigative process has which of the following components?
a) Formulate Statistical Investigative Questions
b) Collect/Consider the Data
c) Analyze the Data
d) Interpret the Data
e) All of the above

Standard Questions
1. Explain, with the help of two examples, what distribution in data science is.
2. Explain what the Statistical Problem-Solving process is.
3. Explain how distributions are broadly categorized. Support your answer with appropriate examples for each category.
4. Explain in detail how we formulate statistical investigative questions.
5. Name five instances where you have observed a uniform distribution.

Higher Order Thinking Skills (HOTS)
1. Consider that there are 60 students in your class, out of which 20 are affected by cold and flu every semester. Note down five statistical investigative questions for determining a student's immunity to catching a cold and flu.
2. Consider that you are taking part in an animal welfare campaign. One of the most recent concerns raised by people is that dogs are not able to tolerate sudden rises in temperature due to global warming. Note down five statistical investigative questions to understand how dogs react to changing weather.

Applied Project
Consider that you have a food event in your residential society. Perform a detailed analysis and interpret what the top five cuisines that most people in the society prefer for this event should be.

CHAPTER: IDENTIFYING PATTERNS

Studying this chapter should enable you to understand:
- How to identify partiality, preference and prejudice
- What is the Central Limit Theorem?

1. What is partiality, preference and prejudice?

We often come across situations where, if we have a special fondness for a particular thing, we tend to be slightly partial towards it. This, in the majority of cases, may affect the outcome, or deviate the outcome in favour of a certain thing. Naturally, this is not the right way of dealing with data on a larger scale. This partiality, preference and prejudice towards a set of data is called a Bias.

In Data Science, bias is a deviation from the expected outcome in the data. Fundamentally, you can also call bias an error in the data. However, it is observed that this error is often indistinct and goes unnoticed. So, the question to ask is: why does bias occur in the first place? Bias basically occurs because of sampling and estimation. If we knew everything about all the entities in our data and stored information on all probable entities, our data would never have any bias. However, data science is often not conducted in carefully controlled conditions. It is mostly done on "found data", i.e. data that is collected for a purpose other than modelling. That is the reason why this data is very likely to have biases.
The next question that may arise in your mind is: why does bias really matter? The answer is that predictive models often consider only the data that is used for training; they know no other reality than the data that is fed into their system. Naturally, if the data that is fed into the system is biased, model accuracy and fidelity are compromised. Biased models can also tend to discriminate against certain groups of people. Therefore, it is very important to eliminate bias to avoid these risks.

2. How to identify the partiality, preference and prejudice?

We can categorize common statistical and cognitive biases in the following ways:
1. Selection Bias
2. Linearity Bias
3. Confirmation Bias
4. Recall Bias
5. Survivor Bias

Selection Bias

This type of bias usually occurs when a model itself influences the creation of the data that is used to train it. Selection bias is said to occur when the sample data that is gathered is not representative of the true future population of cases that the model will see. This bias occurs mostly in systems that rank content, like recommendation systems, polls or personalized advertisements. This is because user responses are collected for the content that is displayed, while the user response for the content that is not displayed remains unknown.

Linearity Bias

Linearity bias assumes that a change in one quantity produces an equal and proportional change in another. Unlike selection bias, linearity bias is a cognitive bias. It is produced not through some statistical process but rather through how we mistakenly perceive the world around us.

Confirmation Bias

Confirmation Bias, or Observer Bias, is an outcome of seeing what you want to see in the data. This can occur when researchers go into a project with subjective thoughts about their study, whether conscious or unconscious. We can also encounter it when labelers allow their subjective thoughts to control their labeling habits, which results in inaccurate data.

Recall Bias

Recall Bias is a type of measurement bias. It is common at the data labeling stage of any project. This type of bias occurs when you label similar types of data inconsistently, resulting in lower accuracy. For example, let us say we have a team labeling images of damaged laptops. The laptops are tagged across the labels damaged, partially damaged, and undamaged. Now, if someone in the team labels one image as damaged and a similar image as partially damaged, your data will obviously be inconsistent.

Survivor Bias

Survivorship bias is based on the concept that we usually tend to twist data sets by focusing on successful examples and ignoring the failures. This type of bias also occurs when we are looking at competitors. For example, while starting a business we usually take the examples of businesses in a similar sector that have performed well, and often ignore the businesses which have incurred heavy losses, gone bankrupt, merged, etc.
While it is arguable that we do not want to copy the failures, we can still learn a lot by understanding a range of customer experiences. The only way to avoid survivor bias in our systems is by finding as many inputs as possible and studying the failures and average performers as well.

3. Probability for Statistics

Probability is all about quantifying randomness. It is the basis of how we make predictions in statistics. We can use probability to predict how likely or unlikely particular events may be. We can also, if needed, make informal predictions beyond the scope of the data which we have analyzed. Probability is a very essential tool in statistics. There are two problems, and the nature of their solutions, that illustrate the difference:

Problem 1: Assume a coin is "fair".
Question: If the coin is tossed 10 times, how many times will we get "tail" on the top face?

Problem 2: You pick up a coin.
Question: Is this a fair coin? That is, does each face have an equal chance of appearing?

Problem 1 is a mathematical probability problem. Problem 2 is a statistics problem that can use the mathematical probability model determined in Problem 1 as a tool to seek a solution. The answer to neither question is deterministic. Tossing a coin produces random outcomes, which suggests that the answer is probabilistic.

The solution to Problem 1 starts with the assumption that the coin is fair and proceeds to logically deduce the numerical probabilities for each possible count of "tails" resulting from 10 tosses. The possible counts are 0, 1, ..., 10.

The solution to Problem 2 starts with an unfamiliar coin; we do not know if it is fair or biased. The search for an answer is experimental: toss the coin, see what happens, and examine the resulting data to see whether it looks as if it came from a fair coin or a biased coin. One possible approach to making this judgement would be the following: toss the coin 10 times and record the number of times you get a "tail". Repeat this process 100 times, and compile the number of "tails" in each of these 100 trials. Compare these results with the frequencies produced by the mathematical model for a fair coin in Problem 1. If the frequencies from the experiment are quite dissimilar from those predicted by the mathematical model for a fair coin, and the observed frequencies are not likely to be due to chance variability, then we can conclude that the coin is not fair.

In Problem 1, we form our answer from logical deductions. In Problem 2, we form our answer by observing experimental results. A simulation of the Problem 2 procedure follows.
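The experimental procedure for Problem 2 is easy to simulate. The sketch below assumes a fair coin (p = 0.5); swapping in a biased probability shows how the observed counts drift away from the Problem 1 model.

```python
import random

def tails_in_10_tosses(p_tail=0.5):
    """Toss a coin 10 times and return the number of tails."""
    return sum(random.random() < p_tail for _ in range(10))

# Repeat the 10-toss experiment 100 times, as the text suggests.
results = [tails_in_10_tosses() for _ in range(100)]

# Observed frequency of each possible count of tails (0..10), to be
# compared with the probabilities deduced mathematically in Problem 1.
for count in range(11):
    print(count, results.count(count))
```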
This holds true calculate the average regardless of whether the source population is normal or skewed provided However, the question over here is, that the sample size is significantly what if the size of data is enormous? large. Does this way of calculating the average make sense? Of course, the Few points to note about the Central answer is no. Measuring weight of all Limit Theorem are: the people will be a very tiring and ✓ The Central Limit Theorem states lengthy process. that the distribution of sample As a workaround, we have an means nears a normal distribution alternative approach that we can take. as the sample size gets bigger. 30 1. To start with, draw groups of people Example: at random from your area. We will In India, the recorded weights of the call this a sample. We will draw male population are following a normal multiple samples in this case, each distribution. The mean and the standard consisting of 30 people deviations are 68 kgs and 10 kgs, 2. Calculate the individual mean of each sample set respectively. If a person is eager to find the record of 50 males in the population, 3. Calculate the mean of these sample then what would mean and the standard means deviation of the chosen sample? 4. To add up to this, a histogram of sample mean weights of people will Over here, resemble a normal distribution. Mean of Population – 68 kgs This is what the Central Limit Theorem is all about. Now let us move ahead and Population Standard Deviation (σ) – 10 understand what the formula for the kgs central limit theorem is. Sample size (n) – 50 Solution: And, Mean of Sample is the same as the mean of population. The mean of the population is 68 since the sample size > 30. Sample Standard Deviation is calculated Where, using below formula: μ = Population mean σx = σ/√n σ = Population standard deviation μx¯¯¯ = Sample mean Thus, Sample Standard Deviation = 10/√50 σx¯¯¯ = Sample standard deviation n = Sample size Sample Standard Deviation is 1.41. Now, that we have understood what the central limit theorem is, let us now see what its real-life applications are and what is the formula to calculate it, let us learn why the central limit theorem is so important. Let us have a look at below example of the Central Limit Theorem Formula: 31 5. Why is the Central Activity 3.1 Limit Theorem Read about how population is measured for your city and how the Central Limit important? Theorem can help in counting the large The Central Limit Theorem states that group of population. no matter what the distribution of population is, the shape of the sampling distribution will always approach normality as the sample size increases. This is helpful, as any research never knows which mean in the sampling distribution is the same as population mean, however, by selecting many random samples from population, the sample means will cluster together, allowing the researcher to make a good estimate of the population mean. Having said that, as the sample size increases, the error will always decrease. Some practical implementations of the Central Limit Theorem include: 1. Voting polls estimate the count of people who support a particular election candidate. The results of news channels that come with confidence intervals are all calculated using the Central Limit Theorem. 2. The Central Limit Theorem can also be used to calculate the mean family income for a specific region. 32 Recap In Data Science, bias is a deviation from the expected outcome in the data. 
Recap
- In Data Science, bias is a deviation from the expected outcome in the data.
- Selection bias is said to occur when the sample data that is gathered is not representative of the true future population of cases that the model will see.
- Linearity bias assumes that a change in one quantity produces an equal and proportional change in another.
- Confirmation Bias, or Observer Bias, is an outcome of seeing what you want to see in the data.
- Recall bias occurs when you label similar types of data inconsistently.
- Survivorship bias is based on the concept that we usually tend to twist data sets by focusing on successful examples and ignoring the failures.
- The Central Limit Theorem states that the distribution of sample means approaches a normal distribution as the sample size gets larger, irrespective of the shape of the population distribution.

Exercises

Objective Type Questions
Please choose the correct option in the questions below.

1. What is the Data Science term used to describe partiality, preference, and prejudice?
a) Bias
b) Favoritism
c) Influence
d) Unfairness

2. Which of the following is NOT a type of bias?
a) Selection Bias
b) Linearity Bias
c) Recall Bias
d) Trial Bias

3. Which of the following is not a correct statement about a probability?
a) It must have a value between 0 and 1
b) It can be reported as a decimal or a fraction
c) A value near 0 means that the event is not likely to occur/happen
d) It is the collection of several experiments

4. The Central Limit Theorem states that the sampling distribution of the sample mean is approximately normal if:
a) All possible samples are selected
b) The sample size is large
c) The standard error of the sampling distribution is small

5. The Central Limit Theorem says that the mean of the sampling distribution of the sample mean is:
a) Equal to the population mean divided by the square root of the sample size
b) Close to the population mean if the sample size is large
c) Exactly equal to the population mean

6. Samples of size 25 are selected from a population with mean 40 and standard deviation 7.5. The mean of the sampling distribution of the sample mean is:
a) 7.5
b) 8
c) 40

Standard Questions
1. Explain what bias is and why it occurs in data science.
2. Explain Selection Bias with the help of an example.
3. Explain Recall Bias with the help of an example.
4. Explain Linearity Bias with the help of an example.
5. Explain Confirmation Bias with the help of an example.
6. What is the Central Limit Theorem?
7. What is the formula for the Central Limit Theorem?
8. What are real-life applications of the Central Limit Theorem?
9. Why is the Central Limit Theorem important?
10. The coaches of various sports around the world use probability to better their game and create gaming strategies. Can you explain how probability is applied in this case and how it helps players?

Higher Order Thinking Skills
1. As per reports, in October 2019 researchers found that an algorithm used on more than 200 million people in US hospitals, to predict which patients would likely need extra medical care, heavily favored white patients over black patients. Can you reason about what must have caused this bias, and categorize it into the types of bias that you learnt in this chapter?
2. The recorded percentages of the population who speak English in India follow a normal distribution. The mean and the standard deviation are 62 and 5, respectively. If a person is eager to find the records of 50 people in the population, then what would be the mean and the standard deviation of the chosen sample?
Applied Project
Consider that your friend is planning to open a clothing store in your area. With the help of the Central Limit Theorem, determine what the type and collection of clothes that will sell better in your area should be.

CHAPTER: DATA MERGING

Studying this chapter should enable you to understand:
- How to merge data sets
- What is a Z-score, and how is it calculated and interpreted?

1. Overview of Data Merging

In Data Science, data merging is the process of combining two or more data sets into a single data frame. This process is necessary when we have raw data stored in multiple files or data tables that we want to analyze all in one go.

However, while merging data from different sources, there are many issues that occur that require corrections for successful data merging. Different data sources will often have different naming conventions than the main data source. They may have different ways of grouping the data, and so on. Many times, the additional data source happens to have been created at a very different time, by different people, with a different objective and use-cases. Owing to all these factors, it should not sound strange if there is a lot of difference between multiple data sources.

In this topic, we will explore various ways of simplifying the process of data merging. There are many places where these data merging techniques will help you. For example, suppose you have two different systems that operate in parallel with each other, and you have to perform some analysis of the relationship, where a legacy system with very poorly formatted data needs to be integrated with your new system. This is where data merging comes into the picture. Let us now dive deep into data merging techniques.

We can perform data merging by implementing data joins on the databases in frame. There are three categories of data joins:
1. One to One Joins
2. One to Many Joins
3. Many to Many Joins

One to One Joins

A one to one join is probably one of the simplest join techniques. In this type of join, each row in one table is linked to a single row in another table using a "key" column. For example, in a company database, each employee has only one Employee ID, and each Employee ID is assigned to only one employee. The accompanying figure shows what a one to one relationship looks like in the database.

In this example, the "key" field in each table is "Employee ID". This "key" field is designed to contain unique values. In the Employee table, the Employee ID field is the primary key; in the Contact Info table, the Employee ID field is a foreign key. The one to one relationship returns the related records when the value in the Employee ID field in the Contact Info table is the same as the Employee ID field in the Employees table. This is how a one to one join works: by merging the data tables using this primary key.

One to Many Joins

In a one to many join, one record in a table can be related to one or many records in another table. For example, each student can borrow multiple books from the school library. The accompanying figure shows what a one to many relationship looks like in the database.

In this example, the primary key field in the Students table, Student ID, is designed to contain unique values. The foreign key field in the Library table, Student ID, is designed to allow multiple instances of the same value. The one to many relationship returns the related records when the value in the Student ID field in the Library table is the same as the value in the Student ID field in the Students table. This is how a one to many join works: by merging databases using a primary key which demonstrates a one to many relationship.

Many to Many Joins

A many to many relationship is said to occur when multiple records in one table are related to multiple records in another table. For example, a many to many relationship exists between students and courses: a student can register for multiple courses, and a course can have multiple students. It is not easy to perform a join on tables which have a many to many relationship. As a workaround, to perform a join, you can break a many to many relationship into two one to many relationships by using a third table, called a join table.
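In pandas, these three join patterns map onto pandas.merge. The tables below are hypothetical miniatures of the Employee, Student/Library and Student/Course examples in the text; the many to many case goes through a join table, as the next part of this section describes.

```python
import pandas as pd

# One to one: each employee has exactly one contact record.
employees = pd.DataFrame({"employee_id": [1, 2], "name": ["Asha", "Ravi"]})
contacts  = pd.DataFrame({"employee_id": [1, 2], "phone": ["111", "222"]})
print(employees.merge(contacts, on="employee_id"))

# One to many: a student can borrow several library books.
students = pd.DataFrame({"student_id": [1, 2], "name": ["Juhi", "Kishan"]})
library  = pd.DataFrame({"student_id": [1, 1, 2],
                         "book": ["Maths", "Physics", "History"]})
print(students.merge(library, on="student_id"))

# Many to many: students and courses linked through an Enrollments join table.
courses     = pd.DataFrame({"course_id": [10, 20], "title": ["Algebra", "Biology"]})
enrollments = pd.DataFrame({"student_id": [1, 1, 2], "course_id": [10, 20, 10]})
print(students.merge(enrollments, on="student_id").merge(courses, on="course_id"))
```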
Many to Many Joins

A many to many relationship is said to occur when multiple records in one table are related to multiple records in another table. For example, a many to many relationship exists between students and courses: a student can register for multiple courses, and a course can have multiple students.

It is not easy to perform a join on tables which have a many to many relationship. As a workaround, you can break a many to many relationship into two one to many relationships by using a third table, which is called a join table. Every record in a join table contains a match field that holds the values of the primary keys of the two tables it joins. In a join table, these match fields are usually called foreign keys. The foreign keys are populated with data as records in the join table are created from either table it joins.

The example below demonstrates a Students table, which contains a record for every student, and a Courses table, which contains a record for each course. A join table called Enrollments creates two one to many relationships, one between each of the two tables.

[Figure: Students and Courses tables connected through an Enrollments join table]

The primary key Student ID is a unique identifier for every student in the Students table. The primary key Course ID is a unique identifier for every course in the Courses table. The Enrollments table carries the foreign keys Student ID and Course ID.

To set up a join table for a many to many relationship:
1. Using the above example, create a table called Enrollments. This will act as the join table.
2. In the Enrollments table, make a Student ID field and a Course ID field.
3. Make a relationship between the two Student ID fields in the tables. Then make a relationship between the two Course ID fields in the tables.

With this design, if a student registers for four courses, we can ensure that the student has only one record in the Students table and four records in the Enrollments table, one for each course the student is enrolled in. A pandas version of this design is sketched below.
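This is a minimal sketch, again assuming Python with pandas and invented table contents; it shows how two one-to-many merges through the join table recreate the many-to-many relationship:

```python
import pandas as pd

students = pd.DataFrame({"Student ID": [101, 102],
                         "Student": ["Kiran", "Divya"]})
courses = pd.DataFrame({"Course ID": ["C1", "C2"],
                        "Course": ["Maths", "Science"]})

# The Enrollments join table holds one row per (student, course) pair
# and carries both foreign keys.
enrollments = pd.DataFrame({"Student ID": [101, 101, 102, 102],
                            "Course ID": ["C1", "C2", "C1", "C2"]})

# Two one-to-many merges via the join table give the full picture.
result = (enrollments
          .merge(students, on="Student ID")
          .merge(courses, on="Course ID"))
print(result)
```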
2. What is Z-Score?

A Z-score describes the position of a point in terms of its distance from the mean, measured in standard deviation units. The z-score is positive if the value lies above the mean and negative if it lies below the mean. The z-score is also known as the standard score, as it allows comparison of scores on different types of variables by standardizing the distribution. A standard normal distribution is a normally shaped distribution with a mean of 0 and a standard deviation of 1.

3. How to calculate a Z-score?

The mathematical formula for calculating the z-score is as follows:

Z = (x - μ) / σ

Where,
x = raw score
μ = population mean
σ = population standard deviation

Thus, the z-score is the raw score minus the population mean, divided by the population standard deviation. Whenever we come across situations where the population mean and the population standard deviation are unknown, the standard score can be calculated using the sample mean x̄ and the sample standard deviation as estimates of the population values.

Now we will consider an example that illustrates the use of the z-score formula. Consider a population of kids whose weights are normally distributed. Further, suppose we know that the mean of the distribution is 10 kg and the standard deviation is 2 kg. Now consider the questions below:
1. What is the z-score for 12 kg?
2. What is the z-score for 5 kg?
3. How many kg correspond to a z-score of 1.25?

For the first question, we simply plug x = 12 into our z-score formula. The result is:

(12 - 10) / 2 = 1

This means that 12 is one standard deviation above the mean.

The second question is very similar. Simply put x = 5 into the formula. The result is:

(5 - 10) / 2 = -2.5

The interpretation of this is that 5 is 2.5 standard deviations below the mean.

For the last question, we already know our z-score. We plug z = 1.25 into the formula and use basic algebra to solve for x:

1.25 = (x - 10) / 2
Multiply both sides by 2: 2.5 = x - 10
Add 10 to both sides: 12.5 = x

Hence, we see that 12.5 kg corresponds to a z-score of 1.25.

4. How to interpret the Z-score?

The value of a z-score tells us how many standard deviations we are away from the mean. For example, if the z-score is equal to 0, the value is exactly at the mean.

A positive z-score tells us that the raw score is higher than the mean. For example, if the z-score is equal to +2, it is 2 standard deviations above the mean.

A negative z-score tells us that the raw score is below the mean. For example, if a z-score is equal to -3, it is 3 standard deviations below the mean.

5. Why is a Z-score so important?

It is very helpful to standardize the values of a normal distribution by converting them into z-scores because:
1. It gives us an opportunity to calculate the probability of a value occurring within a normal distribution.
2. Z-scores allow us to compare two values that come from different samples.

6. Concept of Percentiles

The maximum value of a distribution can be considered in an alternative way: we can represent it as the value in a set of data that has 100% of the observations at or below it. When we consider the maximum value this way, it is called the 100th percentile.

A percentile can be defined as the percentage of the total ordered observations at or below a value. Therefore, the pth percentile of a distribution is the value such that p percent of the ordered observations fall at or below it.

Consider the following data set: [10, 12, 15, 17, 13, 22, 16, 23, 20, 24]

Here, if we want to find the percentile for the element 22, we follow the steps below:
▪ Sort the dataset in ascending order. Once sorted, the dataset will look like [10, 12, 13, 15, 16, 17, 20, 22, 23, 24].
▪ The number of values at or below the element 22 is 8. The total number of elements in the dataset is 10.
▪ Thus, going by the definition, 80 percent of the values are at or below the element 22. Therefore, the element 22 is at the 80th percentile.
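The worked z-score answers and the percentile calculation above can be checked with a few lines of plain Python; this is only a verification sketch, not a method prescribed by the handbook:

```python
# Z-scores for the weights example (population mean 10 kg, std 2 kg).
mu, sigma = 10, 2
print((12 - mu) / sigma)   # 1.0  -> 12 kg is 1 standard deviation above the mean
print((5 - mu) / sigma)    # -2.5 -> 5 kg is 2.5 standard deviations below the mean
print(mu + 1.25 * sigma)   # 12.5 -> the raw score corresponding to z = 1.25

# Percentile of the element 22, using the definition given above.
data = [10, 12, 15, 17, 13, 22, 16, 23, 20, 24]
at_or_below = sum(value <= 22 for value in data)
print(100 * at_or_below / len(data))   # 80.0 -> 22 is at the 80th percentile
```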
7. Quartiles

Quartiles of a dataset partition the data into four equal parts, with one-fourth of the data values in each part. The total of 100% is divided into four equal parts: 25%, 50%, 75% and 100%.

Since the median is defined as the middlemost value in the observations, the median has 50% of the observations at or below it. Thus, the second quartile (Q2), or the 50th percentile, demarcates the median. The most frequently used percentiles other than the median are the 25th percentile and the 75th percentile: the 25th percentile defines the first quartile, the 75th percentile defines the third quartile, and the 100th percentile represents the fourth quartile. The first quartile is the median of all the values to the actual median's (Q2) left. Similarly, the third quartile is the median of all the values to the actual median's (Q2) right.

Using the values of the quartiles, we can also find the interquartile range. The interquartile range (IQR) is a measure of the middle 50% of the values when ordered from lowest to highest. It is calculated by subtracting the first quartile (Q1) from the third quartile (Q3):

IQR = Q3 - Q1

Let us consider the following 10 data points: [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

Here, as there are ten values (an even number of values), the median is halfway between the fifth and sixth data values, which gives us 55 as the median, or Q2. The first quartile, Q1, is the median of all the values to the left of Q2; thus here, 30 is the middle number of the values to the left of the actual median (Q2). The third quartile, Q3, is the median of all the values to the right of Q2; thus here, 80 is the middle number of the values to the right of the actual median (Q2). The interquartile range (IQR) can therefore be calculated as Q3 - Q1, which is 80 - 30 = 50.

An important application of quartiles is in the temperature ranges for the day as reported on a weather report. In the presence of irregularities, the range can be significantly influenced by them. Hence, it is preferred to use the IQR instead, thereby ignoring the top 25 percent and the bottom 25 percent of the data points. In the presence of irregularities, the IQR is more robust as well as a better representation of the amount of spread in the data.

8. Deciles

Just like quartiles, we have deciles. While quartiles sort the data into four quarters, deciles sort the data into ten equal parts: the 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, 90th and 100th. The higher the place in the decile ranking, the higher the overall ranking. For example, a person at the 99th percentile in a test would be placed in a decile ranking of 10. However, a person at the 5th percentile in the same test would be placed in a decile ranking of 1.

The mathematical formula to calculate the ith decile is:

Di = [i * (n + 1) / 10]th data value

where n is the number of data points in the population or sample and i denotes the ith decile, so that:
1st decile, D1 = [1 * (n + 1) / 10]th data value
2nd decile, D2 = [2 * (n + 1) / 10]th data value
and so on.

Steps to calculate a decile (a code sketch follows the worked example below):
a. Find the number of data points or variables in the sample or population. This is denoted by n.
b. Next, sort all the data or variables in the sample or population in ascending order.
c. Next, based on the decile that is required, calculate the decile using the formula Di = [i * (n + 1) / 10]th data value.
d. Lastly, based on the decile value, determine the corresponding variable from amongst the population data.

Let us look at an example to understand the concept in detail. Suppose we have been given 23 random numbers between 20 and 80, and we need to represent them as deciles. Let's say the raw numbers are: [24, 32, 27, 32, 23, 62, 45, 77, 60, 63, 36, 54, 57, 36, 72, 55, 51, 32, 56, 33, 42, 55, 30]

Following the steps mentioned above, we first determine the number of variables in the sample (n). Here n = 23. We then sort the 23 random numbers in ascending order, as shown below.

Sr. No | Digit
1 | 23
2 | 24
3 | 27

Now, D1 = [1 * (n + 1) / 10]th data value
= [1 * (23 + 1) / 10]th = 2.4th data value, i.e. a value between the 2nd and 3rd sorted numbers,
which is 24 + 0.4 * (27 - 24) = 25.2

Similarly, D2 = [2 * (n + 1) / 10]th data value
= [2 * (23 + 1) / 10]th = 4.8th data value, i.e. a value between the 4th and 5th sorted numbers
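To verify such interpolated deciles, here is a minimal Python sketch (not part of the handbook; the helper function simply implements the Di = i * (n + 1) / 10 rule described above):

```python
data = [24, 32, 27, 32, 23, 62, 45, 77, 60, 63, 36, 54,
        57, 36, 72, 55, 51, 32, 56, 33, 42, 55, 30]

def decile(values, i):
    """i-th decile using the handbook's rule: position = i * (n + 1) / 10,
    with linear interpolation between neighbouring sorted values."""
    s = sorted(values)
    n = len(s)
    pos = i * (n + 1) / 10   # 1-based position in the sorted data
    lo = int(pos)            # index of the sorted value just below the position
    frac = pos - lo          # fractional part used for interpolation
    if lo < 1:
        return s[0]
    if lo >= n:
        return s[-1]
    return s[lo - 1] + frac * (s[lo] - s[lo - 1])

print(decile(data, 1))  # 25.2, matching the worked D1 above
print(decile(data, 2))  # 31.6, interpolated between the 4th and 5th sorted values
```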