classXI_DS_Teacher_Handbook.pdf

DATA SCIENCE
Grade XI
Version 1.0, Volume 1.0
Teacher Handbook

ACKNOWLEDGMENT

Patrons
Sh. Ramesh Pokhriyal 'Nishank', Minister of Human Resource Development, Government of India
Sh. Dhotre Sanjay Shamrao, Minister of State for Human Resource Development, Government of India
Ms. Anita Karwal, IAS, Secretary, Department of School Education and Literacy, Ministry of Human Resource Development, Government of India

Advisory Editorial and Creative Inputs
Mr. Manuj Ahuja, IAS, Chairperson, Central Board of Secondary Education

Guidance and Support
Dr. Biswajit Saha, Director (Skill Education & Training), Central Board of Secondary Education
Dr. Joseph Emmanuel, Director (Academics), Central Board of Secondary Education
Sh. Navtez Bal, Executive Director, Public Sector, Microsoft Corporation India Pvt. Ltd.
Sh. Omjiwan Gupta, Director Education, Microsoft Corporation India Pvt. Ltd.
Dr. Vinnie Jauhari, Director Education Advocacy, Microsoft Corporation India Pvt. Ltd.
Ms. Navdeep Kaur Kular, Education Program Manager, Allegis Services India

Value adder, Curator and Co-Ordinator
Sh. Ravinder Pal Singh, Joint Secretary, Department of Skill Education, Central Board of Secondary Education

ABOUT THE HANDBOOK

In today's world, we have a surplus of data, and the demand for learning data science has never been greater. Students need a solid foundation in data science and technology to be industry ready. The objective of this curriculum is to lay the foundation for Data Science: understanding how data is collected and analyzed, and how it can be used in solving problems and making decisions. It also covers ethical issues with data, including data governance, and builds a foundation for AI-based applications of data science. Therefore, CBSE is introducing 'Data Science' as a skill module of 12 hours duration in class VIII and as a skill subject in classes IX-XII.
CBSE acknowledges the initiative by Microsoft India in developing this data science handbook for class XI teachers. This handbook introduces R programming in Data Science through practical examples. The course covers the theoretical concepts of data science, followed by practical examples to develop critical thinking capabilities among students. The purpose of the book is to enable the future workforce to acquire data science skills early in their educational phase and build a solid foundation to be industry ready.

Contents

ETHICS IN DATA SCIENCE
1. Lesson Structure
2. Lesson Plan
3. Introduction
4. How Data Ecosystem is evolving
5. Why do Data Scientists need to understand ethics?
6. What is data governance framework?

ASSESSING DATA
1. Lesson Structure
2. Lesson Plan
3. Introduction
4. Story vs. facts
5. Trial assessment
6. Activity

FORECASTING ON DATA
1. Lesson Structure
2. Lesson plan
3. Introduction
4. Forecasting
5. Observational study

RANDOMIZATION
1. Lesson Structure
2. Lesson Plan
3. Introduction
4. Let us do a survey
5. Sampling Bias
6. How sure are you?
7. Let us act on a sense
8. Online Data
9. Charm of XML

INTRODUCTION TO R STUDIO
1. Lesson structure
2. Lesson Plan
3. Introduction
4. Orientation with R Studio
5. Coding for Data Science using R-Studio
6. Code examples with R-Studio

REFERENCES

CHAPTER Ethics in Data Science

Studying this chapter should enable you to understand:
- Evolution of Data Ecosystem
- Need for Data scientists to understand ethics
- Concept of Data governance framework and its benefits

1. Lesson Structure
1) How Data Ecosystem is evolving?
2) Necessity to understand ethics for Data Scientists.
3) Data governance framework – What & why?

2. Lesson Plan
Subtopics                                             Method
How Data Ecosystem is evolving?                       Theory
Necessity to understand ethics for Data Scientists    Theory
Data governance framework                             Theory

2.1. Teacher's Note

Discussion: Evolution of data ecosystem
The teacher should explain to the students how data has evolved over time, with a few simple examples.

Discussion: Necessity to understand ethics for Data scientists
The teacher should explain to the students in detail why it is extremely important for data scientists to behave ethically while dealing with data. The teacher should give a few simple examples to explain the concept in a better way.

Discussion: Data governance framework
The teacher should explain to the students the data governance framework and its associated benefits.

3. Introduction
In this chapter, we will learn about ethical guidelines and data governance frameworks in data science.

4. How Data Ecosystem is evolving
In the initial stages of data science, the kind of data we all dealt with, be it for academic purposes or business needs, was small, structured, and static. This is the kind of data that was easy to put into rows and columns and display via spreadsheets. In short, we can say that this was a happy place to be in for statisticians. Here, traditional tools such as descriptive statistics, predictive modeling, and classification were used to serve the purpose.

However, as data continued to evolve, it did not remain small, structured, and basic. The data that evolved became large, unstructured, and in motion. This change in data behavior led to the need to develop skills different from those required when the data was small, structured, and basic. Here, one must learn about sensor-based data, IoT data, machine learning skills, and concepts like support vector machines. In short, the things one needed to know in this scenario, to translate large, unstructured data into information, were different from those needed when data was small, structured, and basic.

The kind of data that we see today is massive, integrated, and dynamic. Here we are talking about a system of data talking to another system of data. To work with such data, one must learn concepts of machine learning and deep networks.

The figure shown below highlights the evolution of the data ecosystem.

Here, one must remember that the issues related to data ethics are fairly simple when data is small, structured, and basic. However, the issues get complicated when questions need to be asked to retrieve information from massive, integrated, and dynamic data.

5. Why do Data Scientists need to understand ethics?
Since data scientists have access to a vast pool of data in their data analysis, it becomes essential for them to adhere to ethical guidelines.

The use of protective mechanisms and policies to discourage the mishandling and unethical use of data should be made part of best practices.

Some of the negative scenarios that may arise if ethical guidelines are disregarded include:

a) A few people can do an immense amount of harm:
In the last decade, we have seen many organizations becoming vulnerable to data breaches. Hackers worldwide are on the lookout to break through a reputed organization's firewalls and steal important data from its servers. The stolen data are then sold for a hefty sum.

To date, Yahoo holds the title for the largest data breach in the history of the internet. In 2016, the company disclosed that it had been the victim of multiple data breaches over the years, starting in 2013. The data breaches had exposed the email addresses, names, and dates of birth of around three billion people who had used Yahoo.

Another example of a data breach is the Marriott (Starwood) hotel group. In 2018, Marriott's data team confirmed that around 383 million guest accounts were compromised in the year 2016. The breach had exposed the names, addresses, contact numbers, and passport information of the guests whose accounts were hacked.

b) Lack of consent:
One of the leading social networking sites ran an experiment in which, without consent, it purposely fed users highly extreme points of view, particularly incendiary content, in their news feed. It was trying to elicit a reaction from the users and then see whether that affected what the users posted back. The news feed was curated with the deliberate intention of seeing whether it ultimately changed the way a user interacted with the rest of the network.

Was consent taken from the users before performing this experiment? The answer to this is "No".

6. What is data governance framework?
A data governance framework can be defined as a collection of practices and processes that ensure the authorized management of data in an organization. The primary purpose behind implementing data governance in any organization is to achieve better control over its data assets, including the methods, technologies, and behaviors around the proper management of data. It also deals with the security, privacy, integrity, and management of data flowing in and out of the organization.

Some of the benefits of implementing a data governance framework are:
1. Procedures around regulation and compliance activities become exact.
2. There is greater transparency within data-related activities.
3. The value of the organization's data increases.
4. Issues around current data are resolved better.

Recap
- In this chapter we have learnt that data over time has evolved both in size and in complexity.
- With the increase in the volume of data generated every day, data scientists gathering data to perform analysis need to understand the importance of ethics.
- We have also learnt about the data governance framework and its benefits.

Exercises

Objective Type Questions

1) A person runs a small business and keeps all his/her business records on an unprotected personal computer. These records include essential information about his/her customers. Since it is a small business, that person believes that he/she is unlikely to be a target for hackers. According to him/her, several years have passed, and information on his/her unprotected computer has never been compromised. Are the actions of that person ethical?
a) Yes
b) No
Answer b)

2) One comes up with an idea to improve the way patient data is collected into electronic medical records, thereby reducing errors and better integrating data entry with patient care workflow. When an experiment is run to evaluate the idea, what kind of data is expected to be used?
a) Prospective data
b) Retrospective data
Answer a)

3) A supermarket has prominently displayed boards at various places in the store saying "We videotape you for your security". Later, you find out that the supermarket analyzes videos to decide store layout and product placement. You feel that the signage is misleading since the store uses the videos not just for security but also to boost profits. Are the supermarket's actions ethical?
a) Yes
b) No
Answer a) This is done without causing any harm to anyone

4) Suppose a celebrity goes to a supermarket for shopping. The next day, images of the celebrity taken from the supermarket's video camera appear in a leading tabloid. Is the supermarket right in selling the images to the tabloid?
a) Yes
b) No
Answer b)

5) Once I have voluntarily shared some information about myself on the web, it means that this information is no longer private and can be shared freely.
a) True
b) False
Answer b)

6) Undesired analysis of previously collected personal data violates privacy.
a) True
b) False
Answer a)

7) Undesired dissemination of previously collected data violates privacy.
a) True
b) False
Answer a)

Standard Questions

1) Explain in detail how data has evolved over time. (refer to the chapter for pointers)
2) Explain with relevant examples why data scientists need to understand data and follow data ethics?
(refer to the chapter for pointers)
3) What is data governance framework? (refer to the chapter for pointers)
4) What are some of the benefits of implementing data governance framework? (refer to the chapter for pointers)

Higher Order Thinking Skills (HOTS)
Please answer the questions below in no less than 200 words.

1) What, according to you, should be the ethical principles for conducting research that involves dealing with other people's data?
(Try to think about researching 4 to 5 completely different topics and the kind of private and public data needed to complete the research. To better understand the importance of ethical principles in these scenarios, the students should be asked to think of a situation in which their own personal data has been compromised and misused.)

2) Should there be differences in expectations about what is ethical online versus offline regarding the handling of data?
(Rules regarding ethical handling of data should be the same for all forms of data, be it online or offline. If a hard copy of a classified document falls into the wrong hands, the data could be leaked very easily.)

Applied Project
Suppose you have visited a restaurant for dining. At the end of the meal, the restaurant manager gives you a form in which you need to fill in your details along with your contact number. According to the manager, the purpose of collecting this data is to inform you of exciting deals as and when they come up. Explain in detail the precautions you need to take before handing your details to the restaurant manager.
(If the data gets into the hands of unethical people, fake mobile app links and phishing page URLs can be forwarded to the customers of the restaurant. The user interfaces of these fake apps and phishing pages are so well designed that it is easy for a layman to be tricked into believing they are authentic and into providing card numbers to order home delivery of food.)

CHAPTER Assessing Data

Studying this chapter should enable you to understand:
- Difference between story and fact
- Trial assessment in detail

1. Lesson Structure
1) Difference between story and fact
2) Trial assessment in detail

2. Lesson Plan
Subtopics                      Method
Story vs. fact                 Theory
Trial assessment in detail     Theory
Exercises                      Practical

3. Introduction
In the previous chapter, we learnt about ethical guidelines and the data governance framework in data science.

In this chapter, we will learn to distinguish between story and fact. We will also explore the various aspects involved in performing a trial assessment and the insights these assessments generate.

4. Story vs. facts
A story is an account of experiences or events presented by someone. A story may give disproportionate weight either in favor of or against an idea or thing. You can have ten different people go through the same set of events, and each of them may present a completely different experience of those events.

On the other hand, a fact is something that has occurred or occurs. In other words, it is a truth that has either happened or continues to happen in this universe. In general, people inspect a fact or a series of facts to derive a conclusion.

5. Trial assessment
A trial assessment is a set of steps executed to support, reject or confirm an assumption.
Fig 2.5.1 Causation and its effect

Concept of correlation and causation
In statistics, two or more variables are related if a rise or fall in one variable's value tends to accompany a rise or fall in the other variable's value, either in the same direction or in the opposite direction.

Correlation describes the direction and strength of a relationship between two or more variables. However, we cannot assume that a change in one variable gives rise to the change in the other variables. E.g., an increase in sales of winter care products in the United States of America is correlated with an increase in sales of summer care products in Australia.

Causation, on the other hand, means that one event's occurrence originates from another event's occurrence. E.g., different human activities such as livestock farming, rising emissions, and cutting down trees in forests ultimately affect temperature.

Role of causation in a trial assessment
The idea behind performing the trial assessment is to test something. In other words, a trial assessment can be referred to as an experiment. There are usually two sets of variables in every experiment: the treatment variable and the response variable.

By treatment variable, we refer to the procedure variable. The treatment variable is generally an independent variable. On the other hand, the response variable is a dependent variable. In statistics, an experiment is defined as a supervised study in which a researcher tries to understand the cause and effect relationship.

Based on the analysis, the researcher concludes whether the treatment followed had a causal effect on the response variable.

Fig 2.5.2 Example of cause and effect on students of a grade

Let us perform a trial assessment
One needs to perform a trial assessment to understand what is meant by the cause and effect relationship.

To start, let us take all the students of a grade in an academic year and split them into two halves. The first half is subjected to the treatment of no practice exam. On the other hand, the second half is subjected to the treatment of regular practice exams. At the end of the academic year, both groups' annual exam results are compared. (Illustration is shown above)

Perception of time assessment

Fig 2.5.3 Perception of time

The perception of time assessment highlights a person's subjective experience of time duration within an ongoing event. This perceived duration can vary significantly between different individuals in different circumstances.

Imagine a person taking a launch chamber ride in a water park. In this ride, riders enter the chambers in nearly vertical positions. There is typically a countdown after which a trap door opens on the chamber floor to release the passenger. The anticipation and almost 90-degree launch make these among the most thrilling water park rides. In this case, let us consider that the average duration of the ride is around 5 seconds.

Suppose we interview anyone who has just completed the ride for the first time and ask him/her how long that ride lasted. In that case, that person will estimate around 10 – 11 seconds, which is more than the ride's actual duration.

6. Activity

The effects of different durations of light on the growth of radish seedlings
A class of biology students in a school posed a question – What are the effects of duration of light on the growth of radish seedlings?

To answer the above question, the students designed and carried out an experiment using the following steps:

Collect/consider data
Data needed to be collected to answer the above question. The radish seeds were exposed to three different treatments – 24 hours of light (light), 12 hours of light and 12 hours of darkness (mixed), and 24 hours of darkness (dark). These three treatments covered two extreme cases and one in the middle.

With the assistance of a teacher, the students agreed to use plastic bags as a growth setup. The plastic bags allowed the students to observe and measure the seeds' germination without interfering with them. Two layers of moist paper towels were put inside a clear plastic bag. The paper towels were stapled in a line about one-third from the bag's bottom to hold the paper towel in place and provide a seam to hold the radish seeds.

For the study, one hundred twenty seeds were available. One growth setup held only thirty seeds. The class agreed to use the additional seeds and create a total of four growth setups – one for the light treatment, one for the mixed treatment, and two for the dark treatment. Thirty seeds were selected randomly and placed along the stapled seam of the light treatment bag. Next, thirty seeds were randomly chosen from the remaining ninety seeds and placed inside the mixed treatment bag. Finally, thirty of the remaining sixty seeds were randomly chosen and placed inside one of the dark treatment bags. The last thirty seeds were placed inside the other dark treatment bag. Care was taken to make the four growth setups as identical as possible. With two growth setups for the same condition, their results could be compared to ensure similar handling. Three days later, the length of the radish seedlings for the germinating seeds was measured in millimeters.

The data were noted in a summary format like the one shown here. For analysis purposes, the data could also be represented in another table with each observation (seed) on a separate row and each variable in a separate column.

Fig 2.6.1 Table showing Radish Seedling length after 3 days (sorted)

The table shows the sorted values. In each treatment type, some seeds did not germinate. Such values were considered missing values and were recorded as "x". Thus, there were 114 observations (28 for light treatment, 28 for mixed treatment, 58 for dark treatment).

It would have been more engaging if the students were encouraged to discuss whether excluding the seeds that did not germinate could add bias to the conclusions. In this scenario, because the number of seeds that failed to germinate was small and roughly the same in each category, the missing values likely happened by chance. Conversely, if all the missing values had been in one category, it would have suggested that that category's conditions hindered the growth, and so the missing data were not accidental.

Fig 2.6.2 Long Format Listing of Radish Seedling Lengths

In the above table, the observed units are individual seeds. Growth bag indicates the bag (1, 2, 3, or 4) in which the seed is present, and treatment indicates the treatment (1-Light, 2-Mixed, or 3-Dark) the seeds are receiving. Both growth bag and treatment are categorical variables. On the other hand, length is a quantitative variable, measuring the length of the seedlings in millimeters.

When the mean, median, and standard deviation were calculated on the seed samples exposed to different treatments (1-Light, 2-Mixed, or 3-Dark), the results produced were as shown in the table here:

Fig 2.6.3 Treatment Summary Statistics

Based on the results produced, the students might have come up with a few questions:
a) Was there evidence that the 12 hours of light and 12 hours of dark group had a significantly higher mean length than the 24 hours of light group?
b) Was there evidence that the 24 hours of dark group had a significantly higher mean length than the 12 hours of light and 12 hours of dark group?
c) Were these differences in mean large enough to rule out chance as a possible explanation for the observed difference?

The mean length of the seedlings in the Treatment 2 – Mixed group was 6.2 mm more than the Treatment 1 – Light group's mean length. Even though there was a difference of 6.2 mm, it might not be large enough to rule out chance, and so it might be difficult to claim a treatment effect. This noticeable difference might have been due to one bag being fortunate enough to get a large count of good seeds. It could also be due to the light and water not being uniformly distributed among the treatment groups. But if a difference this large (6.2 mm) were merely the result of the randomization of seeds, then differences of such magnitude would be observed quite often if the measurements were rejumbled and a new difference in mean was calculated.

The students could have simulated this rejumbling by noting down each seedling's length on a separate card. Thus, for the 56 seedlings, there would have been 56 cards. These cards would then be shuffled and divided into two piles of 28 cards each. The first pile would represent the 1-Light treatment group and the second pile would represent the 2-Mixed treatment group. The students would then have computed the difference in mean between the two piles. The difference generated this way would have been entirely due to chance. By repeating the above process multiple times, the students would have been able to see how the difference in mean varies when the treatment has no effect on the growth of the seedlings.

Figure 2.6.4 was produced when technology was used to mix the growth measurements from the 1-Light treatment and the 2-Mixed treatment together and randomly divide the measurements into two groups of 28. The difference in mean was recorded, and the operation was repeated 200 times in total.

The observed difference of 6.2 mm was never exceeded in the simulation of 200 rejumblings. This gave strong evidence against the supposition that the difference between means for treatments 1 and 2 was due to chance alone.

Fig 2.6.4 Difference in means of radish seedlings

When a similar procedure was followed for the samples of treatment 2 and treatment 3, the observed difference of 6 mm in the mean length was never reached in the simulation. Figure 2.6.5 was produced when the procedure stated above was carried out with the samples of treatment 2 and treatment 3. Thus, here too, there was strong evidence that the observed difference in mean length between treatment 2 and treatment 3 was not due to chance alone.

Fig 2.6.5 Difference in means of radish seedlings

Recap
- In this chapter we have learnt that there exists a difference between a story and a fact.
- We have learnt about the concept of correlation and causation.
- We have learnt how causation affects the outcome of a trial assessment.
- We have learnt how a person perceives time under different circumstances.

Exercises

Objective Type Questions

1) Which of the following is incorrect about a story?
a) A story generally consists of one or more character(s)
b) A story represents the point of view of a person
c) A story has a theme associated with it
d) Things revealed in a story are always correct and exist in the real world.
Answer d)

2) What is the nature of the correlation of two variables when they move in the same direction?
a) Neutral
b) Negative
c) Positive
d) None of the above
Answer c)

3)
Height (in cm)    160    165    170    175    180    185
Weight (in kg)    65.1   67.9   70.1   72.8   75.4   77.2
The correlation between height and weight in the above chart can be described as:
a) Positive
b) Negative
c) Zero
d) None of the above
Answer a)

4) In a study of insect life near a stream, data about the number of different insect species and the distance from the stream were collected.
Distance (in feet)    2     5     8     11    14    17
Insect species        26    25    19    19    14    9
The correlation between the distance from the stream and the number of different species found is:
a) Positive
b) Negative
c) Zero
d) None of the above
Answer b)

5) Which of the following is an example of NO correlation?
a) The age of a child and his/her shoe size.
b) The age of a child and his/her height.
c) The age of a child and the number of pets owned.
d) The age of a child and vocabulary of words learned.
Answer c)

6) Which one of the below-mentioned scenarios appears most likely to be causation?
a) Reading one hour per day increases vocabulary.
b) People who are homeless are more likely to have mental health issues.
c) The weight of a person has nothing to do with the risk of heart disease.
d) In India, as car sales increase, the birth rate also increases.
Answer a)

7) A study outlines that people who run more outdoors have higher rates of skin disease than people who exercise indoors. Which one of these seems the most likely connection between running and disease?
a) Running produces a chemical that causes skin disease.
b) People who run generally drink more water and sports drinks, which might weaken the immune system's ability to attack disease germs.
c) People who run do so because they are obese and might have poor health conditions to begin with.
d) People who run outdoors spend more time in the sun. Thus, they are exposed to harmful sunlight for more extended periods.
Answer d)

8) Which of the following statements shows a causal relationship and not just a correlated one?
a) An individual's decision to work in construction and his/her diagnosis of skin disease.
b) A decrease in temperature and an increase in the presence of people at an ice skating rink.
c) As the weight of a child increases, so does his/her vocabulary.
d) The time spent exercising and the number of calories burned.
Answer d)

9) For a person having a pleasant experience while doing something, time seems to:
a) pass slowly
b) come to a standstill
c) fly quickly
d) None of the above
Answer c)

Standard Questions

1) Write down three instances of stories and justify why you think they are stories. (refer to the chapter for pointers)
2) Write down three facts and justify why you think they are facts. (refer to the chapter for pointers)
3) What is correlation? (Support your answer with two examples of correlation) (refer to the chapter for pointers)
4) What is the difference between positive and negative correlation? (refer to the chapter for pointers)
5) What is causation? (Support your answer with two examples of causation) (refer to the chapter for pointers)

Higher Order Thinking Skills (HOTS)
Please answer the questions below in no less than 200 words.

1) Is there a correlation between the speaking and writing skills of an individual?
(Here we need to consider a few scenarios. In the native language, the two skills seem highly correlated. In another language, people may think and write without grammatical mistakes, yet may not be as grammatically correct while speaking.)

2) If outdoor runners have higher skin disease occurrences due to time exposure in the sun, think of an independent variable that can be used to test the relationship between running and skin diseases. What alternatives can one look for to negate the problem arising out of running outdoors in the sun?
(We need to observe whether people in general who are exposed to long hours of sunlight tend to suffer from skin disease. We also need to observe whether people who run in different conditions, such as indoors or in the evening, tend to suffer from skin diseases. There may be others who restrict themselves to physical exercises; we need to observe whether they tend to suffer from skin ailments.)

3) Suppose you are reading a newly published work of fiction by your favorite author. The story turns out to be thoroughly engrossing, according to you. Describe your personal experience of the time you took to complete the reading. (In the description, please mention whether it felt like it was taking too long to complete or whether it was the other way around.) Once you have described your experience, kindly note down what you can infer from this.
(Generally, things which are appealing and engrossing seem to take less time.)
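Teacher's aid: the card-shuffling procedure described in the radish seedling activity of this chapter can also be tried on a computer. The following is a minimal sketch in Python, not the method used to produce Fig 2.6.4 or Fig 2.6.5. The function name `shuffled_mean_differences`, the seed value, and the seedling lengths are all assumptions made for illustration; the lengths are made-up numbers, not the class's measurements. The same steps can be written in R.

```python
import random
from statistics import mean


def shuffled_mean_differences(group_a, group_b, n_shuffles=200, seed=0):
    """Pool both groups, reshuffle, and record the difference in means each time.

    Hypothetical helper: it mimics writing every length on a card,
    shuffling the cards, and dealing them into two piles.
    """
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    pooled = list(group_a) + list(group_b)
    size_a = len(group_a)
    differences = []
    for _ in range(n_shuffles):
        rng.shuffle(pooled)            # shuffle the "cards"
        pile_one = pooled[:size_a]     # stands in for the first treatment
        pile_two = pooled[size_a:]     # stands in for the second treatment
        differences.append(mean(pile_two) - mean(pile_one))
    return differences


# Made-up seedling lengths in mm, for illustration only.
light = [5, 8, 10, 12, 9, 7, 11, 6]
mixed = [14, 17, 15, 20, 13, 18, 16, 19]

observed = mean(mixed) - mean(light)
simulated = shuffled_mean_differences(light, mixed)

# Count how often chance alone produced a difference at least as large.
as_large = sum(1 for d in simulated if d >= observed)

print(f"observed difference: {observed:.1f} mm")
print(f"shuffles with a difference at least as large: {as_large} of {len(simulated)}")
```

Every shuffled difference arises with no treatment effect by construction. If the observed difference is rarely or never matched, chance alone is an unlikely explanation. This is the same reasoning the chapter applies to the 6.2 mm difference between the mixed and light groups.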
Applied Project Consider a company manufacturing metal utensil for home cookware. The company has a factory in which 200 workers work on an everyday basis. Describe in detail how the set-up of the factory (proper equipment, safety measure and infrastructure) has a correlation and causal relationship with workers of the factory. Also, try to derive the correlation between workers of the factory, the profitability and sustenance of the company. One should also think about the end product the customers will be using and what difference it may cause to their lives. (If the working conditions and setup in the factory like good ventilation and sanitation, properly checked equipment’s are maintained, it will decrease the probability of a worker getting health issues to a very large extent. Thus, productivity should ideally increase resulting in better operations and profit. On the contrary, if the factory set up is not proper, workers may fall sick or get injured. This will lead to a depressing attitude amongst workers, thus leading to poor operations.) 19 CHAPTER Forecasting on Data 1. Lesson Structure Discuss: What is an observational study? 1) What is Forecasting? 2) What is an observational study? The teacher should explain to the students in detail the concept of an observational study. A few simple 2. Lesson plan examples of observational study should follow the concept. The teacher should also highlight to the students about the Subtopics Method advantages and disadvantages of What is forecasting? Theory What is an Theory observational studies. observational study? Exercises Practical Activity: Exercises The teacher should encourage the students to complete the exercises, 2.1 Teacher’s Note review them, and provide feedback. Discuss: What is forecasting? The teacher should introduce the students to a concept called forecasting. 3. Introduction A few simple examples of forecasting In the previous chapter, we learned should follow the concept. 
about the differences between story and fact. We also studied trial assessment in detail.

In this chapter, we are going to learn about forecasting and observational study.

Studying this chapter should enable you to understand:
Forecasting
Observational study
Need for observational study
Pros and cons of observational study

4. Forecasting
Given all the information available, including the present and the historical data, forecasting can be defined as a statistical task that predicts the future as accurately as possible.

5. Observational study
An observational study can be defined as a procedure in which the subjects are just observed, and the results are then noted. During the investigation, nobody tries to interfere with the subject to affect the outcome.

An example of an observational study would be a researcher trying to determine the effect that eating an organic diet has on overall health. The researcher finds 500 individuals, of whom 250 have eaten an organic diet in the past five years, while the remaining 250 have not. An overall health assessment is then performed on each of these 500 individuals. The resulting data from the health assessment are then analyzed, and conclusions are drawn on how an organic diet can affect one's overall health.

Why observational study when there is trial assessment?
Sometimes, it is not possible to perform trial assessments. In those scenarios, we need to rely upon observational study for data collection.
The reasons behind this are as follows:
1. In trial assessments, the subject is randomly assigned to a treatment or control group. However, in specific scenarios it is unethical to expose the subject to an arbitrary treatment. Thus, observational studies are preferred over trial assessments. For example, purposefully exposing a subject to polluted air to observe the health issues that come to the forefront is unethical.
2. Large sums of money may be required to execute some trial assessments. There may be occasions when such large sums of money cannot be arranged. In such scenarios, it is better to drop the idea of performing a trial assessment and give priority to an observational study.
3. A trial assessment cannot be performed in some scenarios because it is not feasible to assign a subject to a group randomly.

Advantages of observational study
The advantages of observational study are as follows:
1. Observation is one of the simplest and most used methods of data gathering. Everybody in this world observes many things in their lives. With little training, one can become an expert in monitoring one's surroundings.
2. Another advantage of using an observational study is that since the observations are made in a perfectly natural setting, the analysis can reveal deep and unexplored insights. Such insights are rarely revealed if we try to collect data via other means like surveys.

Disadvantages of observational study
The disadvantages of observational study are as follows:
1. Sometimes, the insights gained by an observational study do not justify the amount of time spent on it.
2. Certain events are uncertain and may not occur in the presence of an observer.
3. Sometimes, the observer may miss reporting important observational details.
4. The chances of an unfair conclusion increase significantly in cases where the study's analysis has not been performed by an expert.

Recap
In this chapter, we have learnt about forecasting and observational study.
We have understood the need to perform an observational study even though the option to perform a trial assessment exists.
We have learnt the advantages and disadvantages of observational study.

Exercises
Objective Type Questions
Please choose the correct option in the questions below.
1. In forecasting, past and present data are used to predict the future as accurately as possible.
a) The above statement is always true.
b) The above statement is never true.
c) The above statement is sometimes true.
d) None of the above.
Answer a)
2. If the actual demand for a period is 100 units but the forecast demand was 90 units, the error in the forecast is:
a) -10
b) +10
c) -5
d) +5
Answer b) (taking the forecast error as actual demand minus forecast demand: 100 - 90 = +10)
3. Observational study cannot be used in:
a) Child studies
b) Study about attitudes
c) Animal studies
d) Studies involving groups
Answer b)
4. Which of these is not true?
a) Observational study is cheap.
b) Observational study replaces interviewing.
c) Observational study is time consuming.
d) Observational study requires operational definition.
Answer b)
5. Which of these would make an observational study unethical?
a) Putting an observer at risk of harm.
b) Using multiple observers.
c) Not getting consent from those being observed.
d) Conducting the observation late at night.
Answer a)
6. Observer's reliability is improved by:
a) Training observers.
b) Using operational definitions.
c) Restricting observations to specific time points.
d) All the above.
Answer d)
7. Which of the following is not a disadvantage of observational study?
a) We need to assign the subject to random treatment and control groups.
b) Certain events may occur in the absence of the observer.
c) The observer may miss reporting critical observational details.
d) The time spent is far more than the insights gained via observation justify.
Answer a)

Standard Questions
1. What is forecasting? Give two examples of forecasting. (refer to the chapter for pointers)
2. State the reasons why an observational study is sometimes preferred over a trial assessment. (refer to the chapter for pointers)
3. What are the advantages of observational study? (refer to the chapter for pointers)
4. What are the disadvantages of observational study? (refer to the chapter for pointers)

Higher Order Thinking Skills (HOTS)
Please answer the questions given below in no less than 200 words.
1. A monthly family budget is a forecast of income and expenditure of a family in a month. Critically discuss the above statement.
(A monthly family budget is the base case financial expectation for a family for a given month. It lies between the pessimistic and optimistic case scenarios. While deemed most likely, the base case is still highly uncertain. Nevertheless, the family budget provides the family with a tool with which to navigate the uncertain future.)
2. Imagine that you have been given the task of observing the food-seeking behavior of rats. Would it be best to conduct this in the wild, or in a laboratory situation? Do you think the results will matter?
(Although rats may not be able to distinguish between indoors and outdoors, one needs to observe the kind of food they generally eat when roaming freely outdoors and compare it with the food they eat when they are indoors.)

Applied Project
You have been assigned the task of observing the behavior of people working in the parking space of a supermarket. During the observation, you should keep an eye on the maximum number of people working at a time in the parking space. How do the employees interact with the drivers of the vehicles coming in and going out of the parking space? Do they provide any extra assistance to the customers? How well are they able to manage space, so that they can accommodate the maximum number of cars at peak shopping hours? You are free to add any observation that you may find interesting.
(Need to observe how employees of the parking lot interact with the drivers of the vehicles. Do they provide any extra help to people with disabilities? How do they interact with vehicle owners of the opposite gender? How do they cooperate with each other when they are few in number?)

CHAPTER
Randomization
2.
Lesson Plan
Subtopics                              Method
Discuss survey                         Theory
Discuss sampling bias                  Theory
Discuss confidence interval            Theory
Collection of data via sensors         Theory
Collection of data via XML             Theory
Exercises                              Practical

Studying this chapter should enable you to understand:
Use of surveys to collect data
Sampling bias
Confidence interval
Data collection by sensory devices
Data from internet

1. Lesson Structure
1) What is a survey?
2) What is sampling bias?
3) What is a confidence interval?
4) Collection of data via sensors.
5) Collection of data from XML.

2.1 Teacher's Note
Discuss: What is a survey?
The teacher should introduce the students to a form of data collection called a survey. The students should be informed about how to structure a survey in order to collect data efficiently.
Discuss: What is sampling bias?
The teacher should introduce the concept of sampling bias to the students along with examples. They should make sure that the students are aware of the consequences of sampling bias.
Discuss: Confidence interval
The teacher should introduce the concept of a confidence interval to the students along with examples. While explaining the concept, the teacher should also highlight the factors affecting a confidence interval's value.
Discuss: Collection of data via sensors
The teacher should introduce the concept of collecting data via sensors to the students along with examples.
Discuss: Collection of data via XML
The teacher should explain to the students that data collection is possible from the internet. The teacher should then introduce the concept of XML and its usage in the collection of data. A few examples of XML in data collection should also be highlighted.
Assignment: Exercises
The teacher should encourage the students to complete the exercise questions provided at the end of the chapter. The teacher should then review the answers and provide feedback.

3. Introduction
In the previous chapter, we learned about how we can use the observational study to collect data. The collected data is further analyzed to deduce a conclusion.

In this chapter, we will learn about how we can collect data via mediums like surveys, sensory devices, and the internet. We will also explore how a confidence interval helps us express the accuracy of the results deduced.

4. Let us do a survey
A survey is a research method used to collect data where the subjects are generally people. In a survey, the process involves asking people for information through a questionnaire. The outcome of a survey depends heavily on the type of questions asked. The questions should be carefully worded so as not to hurt the sentiments of the people being surveyed.

Surveys can be composed of two types of questions: open-ended questions and close-ended questions. The respondents can answer open-ended questions in their own words. In close-ended questions, the choice of answers from which to select is fixed and generally provided alongside the question.

Some examples of open-ended questions are:
Comments/review
Suggest improvements

Some examples of close-ended questions are:
Multiple choice
Yes/No
A rating scale of 1 to 10
Emojis

5. Sampling Bias
Sampling bias is a type of bias in which a sample is collected so that some members of the considered population have a lower or a higher sampling chance than the others. Such a sample can be considered a non-random sample, as not everyone is equally likely to be selected. If such a scenario is not accounted for, it will generate wrong results for the phenomenon under study.

In other words, to make the statistic unbiased, sample collection should be random. Thus, all the members of the considered population should have an equal sampling chance.

6. How sure are you?
When we ask someone, "How sure are you?" we try to gauge the level of confidence with which that person is putting forward an observation.

In statistics, the term used to express the accuracy of a result is the confidence interval.

To understand confidence interval in detail, we first need to understand sampling and sampling error. To find things out about a population of interest, it is common practice to take a sample. A sample is a selection of objects or observations taken from the population of interest. For example, a population can be all mangoes in an orchard at a given time. We wish to know how heavy the mangoes are. We cannot measure all of them, so we take a sample of some of them and measure them.

Let us first understand what a population parameter is. A population parameter is a value that describes a characteristic of an entire population, such as the population mean.

Inference is when we draw conclusions about the population from the sample. Because the sample is only a selection of objects from the population, it will never be a perfect representation of the population. Separate samples of the same population will give different results, giving rise to sampling error or variation. Thus, there will always be sampling errors.

To sum up, when we estimate a population parameter, it is good practice to give it a confidence interval. A confidence interval communicates how accurate our estimate is likely to be.

Thus, if we put an investigative question like:
What is the mean weight of all the mangoes in the orchard?
For this, we take a sample of mangoes and calculate the sample mean, which is the best estimate of the population mean.

A confidence interval defines the span in which we are pretty sure the population parameter lies. In this case, the mean weight for all the mangoes in the orchard is the population parameter.

So, in this case, if we consider that the mean weight of mangoes in the orchard is 250 gm and the confidence interval is 20, we can represent it as follows:
[Figure: confidence interval around the mean weight of 250 gm]

Now that we know about confidence intervals, let us find out what affects a confidence interval's width.

The width of a confidence interval depends upon two things:
1. The first thing is variation within the population of interest. If all the population values were almost the same, then we will have little variation. Our estimate is going to be close to the actual population value. Thus, the confidence interval, in this case, will be small. But a more diverse population will lead to a more diverse sample. Different samples taken from the same population will differ more. We will be less sure that the mean of the sample will be close to the population mean. Thus, here the confidence interval will be large. So greater dissimilarity in the population leads to a wider confidence interval.
2. The width of the confidence interval is also affected by the sample size. With a small sample, we do not have much information on which to base our conclusion. Small samples will differ more from one another, leading to a wider confidence interval. On the other hand, in a larger sample, the effect of a few unusual values is evened out by the other values in the sample. Larger samples will be more like each other. The effective sampling error is reduced with larger samples. When we take larger samples, we have more information and can be surer about our estimates, which leads to a narrower confidence interval.

7. Let us act on a sense
Another method that can be used to collect data is via sensors. This method of data collection requires the least human involvement.

A sensor is a device that identifies and measures the change in input from a physical entity and converts it into signals. These generated signals can then be converted into human-readable displays.

Here is an example of a sensor:
In a mercury-based glass thermometer, the temperature is the input. Depending on the temperature change, the mercury either expands or contracts, causing the level to go up or down on the marked gauge, which is human readable.

A sensor can either collect data continuously or whenever a trigger gets activated. Data collection can also be done automatically, without any human intervention, by following a predefined set of rules.

8. Online Data
The internet can be considered an ocean of data. There is an uncountable number of websites and web articles on the internet. All of these serve as a rich pool of data. Data can be collected from the internet by web data scraping, cleaning up the data, and then analyzing it.

9. Charm of XML
Whenever we collect data for analysis, we first decide upon a subject on which the analysis needs to be performed. We then go one level deeper to understand the characteristics that need to be observed to perform the analysis. Shown below is a table highlighting upvotes for different types of pizzas.
[Table: upvotes for pizzas of different crust categories]

Thus here, we perform an analysis on pizzas. The study is based on the upvotes that pizzas of different crust categories have received.

We can also have a similar kind of table on a web page on the internet. We can store the data shown in these tables on the internet in an XML.

XML stands for Extensible Markup Language. It is a self-descriptive tool to store and transport data on the internet.

A simple XML is made up of tags, element names, and element values. A tag, either opening or closing, is used to mark an element's start or end. Tags are of two types: start tag and end tag. A start tag is created by wrapping the element name between '<' and '>'. An end tag is created by wrapping the element name between '</' and '>'. The element value is present between the start tag and the end tag.

In XML, each tag is called a node. Each node can have one or more child nodes contained within it.

Thus, if we want to represent the above table as an XML, it will be as shown below:
[Figure: the pizza upvotes table represented as XML]

The XML format makes it simple to display data on a web page. Also, converting XML to a data table format helps us visualize our data better.

Thus, we can get an XML to collect data and perform analysis on it.

Recap
In this chapter, we have learnt about the use of surveys to collect data.
We also understood how to design the questions in a survey and the different types of questions that may find a place in a survey.
Introducing bias while collecting a sample will give incorrect results.
Sometimes, results from an experiment are stated as an approximation. The span within which we are fairly sure the actual value lies is the confidence interval.
Data can be collected via different mediums from different places. We can collect temperature data via a thermometer, while data on the internet can be collected via XML.

Exercises
Objective Type Questions
1) Which of the following statements is false?
a) Yes/No is an example of a close-ended question in a survey.
b) A rating scale of 1 to 10 is an example of a close-ended question in a survey.
c) A multiple-choice option is an example of a close-ended question in a survey.
d) Suggesting improvement is an example of a close-ended question in a survey.
Answer d)
2) Which one of the options given below is associated with survey research?
a) The problem of objectivity
b) The problem of "going native"
c) The problem of omission
d) The problem of robustness
Answer c) (as a participant can opt not to answer any particular question)
3) Mr. X conducted a study of the way restaurant owners granted or refused access to a couple.
This is an example of observing behavior in terms of:
a) Individuals
b) Incidents
c) Short time periods
d) Long time periods
Answer b)
4) The statement "results are accurate within +/-4 p.p., 95% of the time" refers respectively to:
a) confidence level and confidence interval
b) confidence interval and confidence level
c) margin of error and margin of confidence
d) sample interval and confidence level
Answer b)
5) Sampling error can be reduced by:
a) correcting a faulty sample frame
b) increasing response rates
c) increasing the sample size
d) reducing incomplete surveys
Answer c)
6) Greater dissimilarity within the population of interest will:
a) Increase the confidence interval
b) Decrease the confidence interval
c) Have no impact on the confidence interval
d) None of the above
Answer a)
7) Which of the following is a method of data collection by sensing?
a) Use of speed guns to measure the speed of vehicles by traffic police.
b) Performing surveys to calculate the population of the country.
c) Performing experiments and observing the results.
d) None of the above
Answer a)
8) What is the best format in which online data should be collected to perform analysis?
a) RTF
b) DOCX
c) XML
d) CSV
Answer c)

Standard Questions
1) What is a survey? What are the things to keep in mind while creating a survey? (refer to the chapter for pointers)
2) What are the different types of questions that a survey can contain? Provide two examples for each type of question. (refer to the chapter for pointers)
3) What is sampling bias? Is it reasonable to have a sampling bias? Provide an example to support your answer. (refer to the chapter for pointers)
4) Given below are instances of biased survey questions. Point out the bias involved and try rephrasing the questions so that the subject responding to them can give a meaningful response.
a) How amazing was your experience with our customer service team?
(The question is worded in a way that assumes the customer's experience was amazing, which introduces bias. In this case, the responses will be significantly skewed. A better wording would be: On a scale of 1-10, how satisfied were you with our customer service team?)
b) What problems did you have with the launch of this new product?
(The question assumes that there was something wrong in the first place and will have the customers looking for a problem in their answer. A better wording would be: On a scale of 1-10, how satisfied were you with the use of this new product?)
c) How do we compare to our competitors?
(This question is far too broad. It assumes that the customers have used the products of competitors. Also, there is no benchmark given to compare against. One better wording would be: On a scale of 1-10, how likely are you to recommend our range of products to others?)
d) Do you always use product X for your cleaning needs?
(The problem with this question is that it can only be answered with a yes or a no. One better wording would be: What discourages you from purchasing our product? Options: High price / Bad quality / Unaware of the product / Others.)
5) What is a confidence interval? What are the different factors affecting the values of a confidence interval for a given population? (refer to the chapter for pointers)
6) Explain with an example how sensors are used widely in the field of healthcare to collect data and monitor patients' health conditions. (refer to the chapter for pointers)

Higher Order Thinking Skills (HOTS)
India is a nation where most people are incredibly fond of watching cricket as a sport. In order to better strategize against the opponent, each team analyses the strengths and weaknesses of the opponent, and a lot of data is analyzed off the field. Explain in detail how this is done.
(Highlight the use of sensors to generate pitch maps showing where batsmen play scoring shots and the areas on the pitch that fetch wickets.
Where does an opponent bowler pitch the ball to get the better of the batsman?)

Applied Project
Consider a situation where a person gets admitted to a hospital for treatment. In such situations, information is collected from the patient in various ways. Explain in detail the various ways data is collected from the patient. Once the data has been collected, represent the data first in tabular format and then as an XML.
(When a person gets admitted to hospital, previous medical history, if it exists, is collected. Tests are done based on the existing condition of the patient. All this is done keeping the previous medical history in mind. During the course of the treatment, everyday checks are performed to understand the progression of the patient's health condition.)

CHAPTER
Introduction to R Studio

Studying this chapter should enable you to understand:
Orientation with R Studio
Coding for data science using R Studio

1. Lesson structure
1) How to access R Studio?
2) How to perform basic programming in R Studio?
3) Coding for data science using R Studio.
4) Code example with R Studio.
5) Exercises

2. Lesson Plan
Subtopics                        Method
How to access R Studio           Theory + Practical
Basic programming in R           Theory + Practical
Coding for data science in R     Theory + Practical
Code example with R Studio       Theory + Practical
Exercises                        Practical

2.1 Teacher's Note
Discuss: How to access R Studio?
The teacher should instruct the students to install R Studio software on the practice computers. The students should be advised to download the software from the download link provided in the lesson. The students should then be instructed to familiarize themselves with the R Studio interface.
Discuss: Basic programming in R.
The teacher should introduce the basic concepts of programming to the students along with examples. They should make sure that the students are not confused about the concepts discussed. The teacher should encourage the students to practise the concepts hands-on and raise any doubts.
Discuss: Coding for data science in R.
The teacher should introduce the concepts specific to coding for data science in R, such as creating different visualizations like scatter plot, box plot, line chart, bar chart, histogram, pie chart, etc. For each of these visualizations, the instructor should discuss ways to customize the visualization depending upon the need. The teacher should encourage the students to practise the concepts hands-on and raise any doubts.
Discuss: Code example with R Studio
The teacher should introduce the concepts of mean, median and mode. Methods to calculate the mean or median where values in a data set are dropped should also be discussed. The teacher should encourage the students to practise the concepts hands-on and raise any doubts.
Assignment: Exercises
The teacher should encourage the students to complete the exercise questions provided at the end of the chapter. The teacher should then review the answers and provide feedback.

3. Introduction
In the previous chapter, we learned about collecting data from surveys, sensors, and the internet. We also explained in detail the concept of a confidence interval.

In this chapter, we will learn about R Studio and coding for data science using R Studio.

4. Orientation with R Studio
To download R Studio, navigate to the link given below:
https://rstudio.com/products/rstudio/download/
We need to download the R Studio installer for Windows and install it on the Windows machine.

Once R Studio has been installed successfully on the Windows machine, we need to open it to start working. When R-Studio is opened for the first time, we get an interface, as shown below:
[Screenshot: R Studio interface on first launch]

We can load a .csv file from a directory and see its contents in R-Studio, as shown below:
[Screenshots: a .csv file loaded and displayed in R Studio]

The console window in RStudio is the place where we can tell it what to do and it will show the results of a command.
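As a hedged illustration of the two ideas above (loading a .csv file and typing commands at the console), here is a minimal sketch in R. The variable names, the file contents, and the use of a temporary file are assumptions made only so that the example is self-contained; in class, the path would point to a real file on the practice computer.

```r
# Commands of the kind typed at the RStudio console; each result prints immediately.
5 + 3                      # simple arithmetic
x <- c(10, 20, 30)         # store three numbers under the hypothetical name x
mean(x)                    # call a built-in function on x

# Loading a .csv file. A tiny example file is written first so the sketch
# is self-contained; csv_path and its contents are made up for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,marks", "Asha,82", "Ravi,75"), csv_path)
students <- read.csv(csv_path)
print(students)            # shows the file contents as a table
```

In RStudio, calling View(students) should also open the same data in the spreadsheet-style viewer pane.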
We can type commands directly into the console, but the drawback is that they will be forgotten when we close the session. Some examples of simple commands executed via the console in RStudio are given below:
[Screenshot: simple commands executed in the RStudio console]

Every time RStudio is opened, it goes to a working directory. We can find the current working directory in RStudio using the command getwd() in the console.

We can change the working directory to a folder of our choice. To do so, we use the setwd() function. The path of the directory that we want to set as the working directory is passed as a string parameter to the function. The path passed as a parameter can be either a relative path or an absolute path.

Vector in R programming
In R, a sequence of elements that share the same data type is known as a vector. Vectors are the most basic data objects. There are six basic vectors: logical, integer, double, complex, character, and raw. We also call these basic vectors atomic vectors.

When a person writes just one value in R, it becomes a vector of length one and belongs to one of the above-stated vector types. Such a vector is called a single-element vector.

Just like a single-element vector, we also have multiple-element vectors. We can create a multiple-element vector with numeric data using the colon operator.

Using the seq() function, we can create a vector with elements between two numbers, where the values increase by a given increment.

Another way of creating a vector is to use the c() function. Here, the default method combines the arguments provided to form a vector. All the arguments are forced to a common type, which is the type of the returned value. The return type is determined from the highest type of the components in the hierarchy expression > list > character > complex > double > integer > logical > raw > NULL.

In order to access the elements in a vector, indexing is used. The [] brackets are used for indexing.
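The vector-creation approaches described above, and the [] brackets just introduced, can be sketched together as follows. This is a minimal illustration rather than the handbook's original screenshot; all variable names and values are hypothetical.

```r
# Creating vectors
single <- 42L                    # one value is a vector of length one
ages   <- 5:10                   # colon operator: 5 6 7 8 9 10
steps  <- seq(5, 9, by = 0.5)    # seq(): from 5 to 9 in increments of 0.5
mixed  <- c(1, "two", TRUE)      # c() forces all arguments to one common type
class(single)                    # "integer"
class(mixed)                     # "character", the highest type among the arguments

# Indexing with []
days <- c("Sun", "Mon", "Tue", "Wed")
days[1]                              # the first element
days[c(2, 4)]                        # the elements at positions 2 and 4
days[-1]                             # every element except the first
days[c(TRUE, FALSE, TRUE, FALSE)]    # the elements at the TRUE positions
```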
Indexing starts with position 1. Providing a negative value in the index drops that element from the result. We can also use a logical vector of TRUE/FALSE values for indexing, which keeps the elements at the TRUE positions. Note that a numeric index of 0 selects nothing, so 0 and 1 do not behave like FALSE and TRUE.

Arithmetic operations like addition, subtraction, multiplication and division can also be performed on vectors. For performing arithmetic operations, the two vectors should be of the same length; otherwise R recycles the values of the shorter vector. The result of the arithmetic operation is also a vector.

In the example shown below, we declare two vectors v1 and v2, and then perform arithmetic operations like addition, subtraction, multiplication, and division.
[Screenshot: arithmetic operations on the vectors v1 and v2]

List in R programming
A list in R is a type of R object which contains different types of elements, such as numbers, vectors, strings, and even another list. A list can also contain a function or a matrix as its elements.

To create a list, we use the list() function. Shown below is an example of creating a list using strings, numbers, vectors, and logical values.
[Screenshot: creating a list with list()]

Naming elements in a list
Names can be given to list elements, and the elements can then be accessed using those names. Shown below is an example of assigning names to the elements in the list.
[Screenshot: assigning names to list elements]

Accessing elements in a list
Elements in a list can be accessed using the index of the element in the list. If the list is a named list, its elements can also be accessed using the names.

To give a demonstration, let us use the list shown in the above example:
[Screenshot: accessing list elements by index and by name]

Manipulating elements in a list
In a list in R, we can add, delete or update elements. New elements are typically added at the end of the list, while any element can be updated or deleted (deletion is done by assigning NULL to the element).

As a demonstration, let us use the list shown in the above example:
[Screenshot: adding, deleting and updating list elements]

Matrices in R programming
In R, matrices are an extension of the numeric or character vectors. In other words, they are atomic vectors arranged in a two-dimensional rectangular layout.
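The vector arithmetic and list operations described above, and the idea that a matrix is an atomic vector laid out in two dimensions, can be sketched as follows. The names v1 and v2 come from the text; the list contents and all other names are hypothetical and chosen only for illustration.

```r
# Element-wise arithmetic on two vectors of the same length
v1 <- c(2, 4, 6)
v2 <- c(1, 2, 3)
v1 + v2                          # addition, element by element
v1 - v2                          # subtraction
v1 * v2                          # multiplication
v1 / v2                          # division

# A named list holding elements of different types
student <- list(name = "Asha", marks = c(82, 75), passed = TRUE)
student$name                     # access by name
student[[2]]                     # access by position: the marks vector
student$city <- "Pune"           # add a new element at the end
student$passed <- NULL           # delete an element by assigning NULL
student$marks <- c(82, 75, 90)   # update an existing element

# A matrix is an atomic vector with a dim attribute
m <- 1:6
dim(m) <- c(2, 3)                # arrange the six values as 2 rows x 3 columns
is.matrix(m)
```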
Thus, a matrix being an extension of an atomic vector, its elements must all be of the same data type.

To create a matrix in R, we use the matrix() function. The syntax for creating a matrix in R is:
matrix(data, nrow, ncol, byrow, dimnames)

The parameters used can be described as follows:
data: the input vector which becomes the data elements of the matrix.
nrow: the number of rows to be created.
ncol: the number of columns to be created.
byrow: a logical value. When set to TRUE, the elements of the input vector are arranged by row.
dimnames: the names assigned to the rows and columns.

Shown below is an example of a matrix where no data source is provided.
[Screenshot: creating a matrix without a data source]

Shown here is an example of creating a matrix taking a vector of numbers as input.
[Screenshot: creating a matrix from a vector of numbers]

How to access elements in a matrix
Elements of a matrix can be fetched by using the row and column index of the element. Shown below is a code snippet that illustrates how we can access different elements in a matrix.
[Screenshot: accessing elements of a matrix]

Arithmetic operations such as addition, subtraction, multiplication and division can also be performed on matrices. For performing arithmetic operations, the two matrices must be of the same dimensions. The result of the arithmetic operation is also a matrix.

Shown below is an example where we declare two matrices matrix1 and matrix2 and then perform arithmetic operations like addition, subtraction, multiplication, and division on them.
[Screenshot: arithmetic operations on matrix1 and matrix2]

We can also merge many lists into one list. Merging can be done by placing the lists inside the c() function. Placing them inside the list() function instead produces a list whose elements are the original lists.

Shown below is an example of two lists being combined into one.
[Screenshot: merging two lists]

Transforming list to vector
We can transform a list into a vector. By doing so, we can perform further manipulation on the elements of the vector. Once a list has been converted to a vector, we can perform all the arithmetic operations possible on vectors.
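The matrix() call, element access and matrix arithmetic described above, together with the reason for turning a list into a vector, can be sketched as follows. This is an illustrative sketch under assumed values, not the handbook's original example; it uses R's built-in unlist() function for the conversion, and every name and number is hypothetical.

```r
# Build 2 x 3 matrices from vectors, filling row by row
m1 <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
             dimnames = list(c("r1", "r2"), c("a", "b", "c")))
m2 <- matrix(6:1, nrow = 2, ncol = 3, byrow = TRUE)

m1[2, 3]          # the element in row 2, column 3
m1[1, ]           # the whole first row
m1 + m2           # element-wise addition; the result is also a matrix

# Merging two lists with c()
merged <- c(list(1, 2), list(3, 4))
length(merged)    # the merged list has four elements

# Arithmetic fails on a list directly ...
outcome <- tryCatch(merged + 1, error = function(e) "error")
outcome
# ... but works once the list is flattened into a vector
unlist(merged) + 1
```

The tryCatch() wrapper is there only so that the failing line does not stop the script; at the console one would simply see an error message.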
To convert a list into a vector, we use the unlist() function. This function takes a list as input and generates a vector as output. Shown below is an example that converts lists into vectors and performs addition on them.

Arrays in R programming

Arrays are R data objects in which we can store data in more than two dimensions. For example, if we create an array of dimensions (4,5,2), it will contain two rectangular matrices, each with four rows and five columns. An array can store values of only one data type. An array is created using the array() function, which takes vectors as input and uses the values in the dim parameter to create the array. Shown below is a simple example of an array.

In arrays, we can provide names to the rows, columns, and matrices. This is done using the dimnames parameter. Shown below is an example of an array with custom names for rows, columns, and matrices.

How to access an element in an array

Elements in an array can be retrieved using the row, column, and matrix index of the element. Shown below is a code snippet that illustrates how we can access different elements in an array.

Factors in R programming

In R, factors are data objects used to categorize data and store it as levels. Factors can store both strings and integers. They are generally used for columns that have a finite number of unique values, and they are helpful in data analysis for statistical modeling. Factors in R are created using the factor() function. The input parameter for this function is a vector. Shown below is an example of implementing factors in R:

Data frames in R programming

In R, a data frame can be defined as a table-like structure used to store data. In a data frame, each column contains the values of one variable, and each row contains one set of values, one from each column. The column names must be non-empty, and the row names should be unique. Data frames are made up of data of numeric, factor, or character type.
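The unlist(), array(), and factor() functions described in this part of the section can be explored with a small sketch like the one below. The data and all names (list1, list2, arr, sizes, size_factor) are assumptions chosen for illustration; the array uses the (4,5,2) dimensions mentioned in the text.

```r
# Illustrative sketch: all names and values are assumptions.

# Converting lists to vectors, then adding the vectors
list1 <- list(1:5)
list2 <- list(10:14)
vec1 <- unlist(list1)
vec2 <- unlist(list2)
vec_sum <- vec1 + vec2
print(vec_sum)

# An array of dimensions (4, 5, 2): two matrices of 4 rows and 5 columns,
# with custom names for rows, columns and matrices
arr <- array(1:40, dim = c(4, 5, 2),
             dimnames = list(paste0("row", 1:4),
                             paste0("col", 1:5),
                             c("matrix1", "matrix2")))

# Accessing elements: arr[row, column, matrix]
one_value     <- arr[3, 2, 1]                    # row 3, column 2, first matrix
by_name       <- arr["row1", "col1", "matrix2"]  # the same idea, using names
second_matrix <- arr[, , 2]                      # the whole second matrix

# A factor built from a character vector
sizes <- c("small", "large", "medium", "small", "large", "small")
size_factor <- factor(sizes)
print(levels(size_factor))   # the distinct categories
print(table(size_factor))    # how often each level occurs
```

By default factor() orders the levels alphabetically; a different order can be requested through its `levels` argument.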
Each column of a data frame should contain the same number of data items. To create a data frame, we use the data.frame() function. Shown below is an example of a simple data frame:

Structure of the data frame

In R, we can get the structure of a data frame using the str() function. Shown below is an example of the str() function used to get the structure of a data frame.

Retrieving the summary of data in a data frame

We can get the statistical summary and nature of the data in a data frame by applying the summary() function. Shown below is an example of the summary() function used to summarize the data in a data frame.

How to extract data from a data frame

We can extract a specific column from a data frame using the column name. Shown below is an example of extracting data from a data frame.

5. Coding for Data Science using R-Studio

An essential aspect of data science is data visualization. We can represent data visually as scatter plots, box plots, time series plots, bar charts, histograms, pie charts, etc. Although R has built-in functions for scatter plots, box plots, and time series plots, we can also draw them with a package named ggplot2. ggplot2 is a plotting package that simplifies the creation of complex plots from data in a data frame. It provides a more programmatic interface for specifying which variables to plot, how they should be displayed, and other general visual properties. Thus, only minimal changes are needed if the underlying data source changes or if we switch the visualization from a scatter plot to a bar plot.

A few of the essential functions in the ggplot2 package are the ggplot() function and the geom functions. ggplot graphics are built gradually by adding new elements, which makes plotting flexible and customizable. The basic template for generating different types of plots is:

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()

Using the ggplot() function, we bind the plot to a data frame.
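Returning to the data frame functions described earlier in this section (data.frame(), str(), summary(), and extraction by column name), a minimal sketch is given below. The students data frame, its column names, and its values are made up for illustration and do not come from the handbook.

```r
# Illustrative sketch: the data frame and its values are made up.

# Every column has the same number of items (five here)
students <- data.frame(
  roll_no = 1:5,
  name    = c("Asha", "Ravi", "Meena", "John", "Sara"),
  marks   = c(78.5, 64.0, 91.2, 55.8, 82.3),
  stringsAsFactors = FALSE
)
print(students)

# Structure: column names, types and the first few values
str(students)

# Statistical summary of every column
print(summary(students))

# Extracting data by column name
marks_column   <- students$marks                # a single column, as a vector
name_and_marks <- students[, c("name", "marks")]  # several columns, as a data frame
first_two_rows <- students[1:2, ]               # rows are selected before the comma
```

Selecting a single column with `$` returns a plain vector, while selecting columns with square brackets and a vector of names returns a smaller data frame.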
The data frame is passed through the data argument. We define an aesthetic mapping (using the aes() function) by selecting the variables to be plotted and specifying how to present them in the graph, e.g. as x/y positions or as characteristics such as size, shape, and color. GEOM_FUNCTION stands for the graphical representation of the data in the plot, in the form of points, lines, or bars. The most common geom functions are:

geom_point() [used for scatter plots, dot plots, etc.]
geom_boxplot() [used for boxplots]
geom_line() [used for trend lines, time series, etc.]

To add a geom function to the ggplot() function, we use the '+' operator.

Scatter plot in R

Let us first see how we can create a scatterplot in R without using any package. For creating a scatterplot in R, we use the plot() function. The basic syntax for creating a scatterplot is:

plot(x, y, main, xlab, ylab, xlim, ylim, axes)

Following is the description of the parameters used:

x is the data set whose values are the horizontal coordinates.
y is the data set whose values are the vertical coordinates.
main is the title of the graph.
xlab is the label of the horizontal axis.
ylab is the label of the vertical axis.
xlim is the limits of the values of x used for plotting.
ylim is the limits of the values of y used for plotting.
axes indicates whether both axes should be drawn on the plot.

Shown below is an example of a simple scatter plot drawn using the plot() function in R. In this example, we use the values of two columns, disp and hp, from the built-in mtcars data set in R.

Shown alongside is the scatterplot drawn in R for the above set of inputs. The Y-axis displays the horsepower (hp) and the X-axis displays the displacement (disp).

Now let us see how we can draw a scatter plot using the ggplot2 plotting package.
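Both approaches can be sketched on the built-in mtcars data set as follows. The titles, axis labels, and axis limits are assumptions chosen for the example, and the ggplot2 part runs only if the package happens to be installed.

```r
# Illustrative sketch: titles, labels and limits are assumptions.

# Base R scatter plot of displacement (x) against horsepower (y)
plot(x = mtcars$disp, y = mtcars$hp,
     main = "Displacement vs Horsepower",
     xlab = "Displacement",
     ylab = "Horsepower",
     xlim = c(50, 500),
     ylim = c(40, 350),
     axes = TRUE)

# The same plot with ggplot2, if the package is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Bind the plot to mtcars, map disp and hp to x and y, then add points
  p <- ggplot(data = mtcars, mapping = aes(x = disp, y = hp)) +
    geom_point()
  print(p)
}
```

Because the geom is added with `+`, replacing geom_point() with another geom function changes the type of plot without touching the data or the mapping.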
Here, we will see the use of geom_point() along with ggplot(). As stated earlier, R has a predefined dataset named mtcars.

Box plot in R

A box plot is a graphical technique for summarizing a set of data on an interval scale. Boxplots are used extensively in descriptive data analysis. Using a box plot, we can show the shape of the distribution, its central value, and its variability. A boxplot in R is created using the boxplot() function. The syntax to create a boxplot in R is:

boxplot(x, data, notch, varwidth, names, main)

Following is the description of the parameters used:

x is a vector or a formula.
data is the data frame.
notch is a logical value. Set it to TRUE to draw a notch.
varwidth is a logical value. Set it to TRUE to draw the width of each box in proportion to the sample size.
names are the group labels that will be printed under each boxplot.
main is used to give a title to the graph.

The boxplot() function can also take formulas of the form Y~X, where Y is a numeric vector grouped according to the value of X. For demonstrating boxplots in R, we will be using the airquality dataset.

Example of a boxplot where the numeric vector is grouped according to another value.

Now let us see how we can plot a boxplot using the ggplot2 plotting package.

Boxplot in R (use of geom_boxplot() with ggplot())

Line chart in R

A line chart is a chart created by connecting the data points of a data set. Line charts can be used in exploratory data analysis to check trends in the data by observing the pattern of the line. To create a line graph in R, we use the plot() function. The syntax used to create a line chart in R is: plot (v, type, x
