classXI_DS_Student_Handbook.pdf

DATA SCIENCE
CLASS XI (GRADE XI)
Student Handbook
Version 1.0, Volume 1.0

ACKNOWLEDGMENT

Patrons
Sh. Ramesh Pokhriyal 'Nishank', Minister of Human Resource Development, Government of India
Sh. Dhotre Sanjay Shamrao, Minister of State for Human Resource Development, Government of India
Ms. Anita Karwal, IAS, Secretary, Department of School Education and Literacy, Ministry of Human Resource Development, Government of India

Advisory, Editorial and Creative Inputs
Mr. Manuj Ahuja, IAS, Chairperson, Central Board of Secondary Education

Guidance and Support
Dr. Biswajit Saha, Director (Skill Education & Training), Central Board of Secondary Education
Dr. Joseph Emmanuel, Director (Academics), Central Board of Secondary Education
Sh. Navtez Bal, Executive Director, Public Sector, Microsoft Corporation India Pvt. Ltd.
Sh. Omjiwan Gupta, Director Education, Microsoft Corporation India Pvt. Ltd.
Dr. Vinnie Jauhari, Director Education Advocacy, Microsoft Corporation India Pvt. Ltd.
Ms. Navdeep Kaur Kular, Education Program Manager, Allegis Services India

Value adder, Curator and Co-Ordinator
Sh. Ravinder Pal Singh, Joint Secretary, Department of Skill Education, Central Board of Secondary Education

ABOUT THE HANDBOOK

In today's world, we have a surplus of data, and the demand for learning data science has never been greater. Students need a solid foundation in data science and technology to be industry ready. The objective of this curriculum is to lay the foundation for Data Science: understanding how data is collected and analyzed, and how it can be used in solving problems and making decisions. It will also cover ethical issues with data, including data governance, and build a foundation for AI-based applications of data science. Therefore, CBSE is introducing 'Data Science' as a skill module of 12 hours duration in class VIII and as a skill subject in classes IX-XII.
CBSE acknowledges the initiative by Microsoft India in developing this data science handbook for class XI students. This handbook introduces R programming in Data Science through practical examples. The course covers the theoretical concepts of data science followed by practical examples to develop critical thinking capabilities among students. The purpose of the book is to enable the future workforce to acquire data science skills early in their educational phase and build a solid foundation to be industry ready.

Contents

ETHICS IN DATA SCIENCE
1. Introduction
2. How Data Ecosystem is evolving
3. Why do Data Scientists need to understand ethics?
4. What is data governance framework?

ASSESSING DATA
1. Introduction
2. Story vs. facts
3. Trial assessment
4. Activity

FORECASTING ON DATA
1. Introduction
2. Forecasting
3. Observational study

RANDOMIZATION
1. Introduction
2. Let us do a survey
3. Sampling Bias
4. How sure are you?
5. Let us act on a sense
6. Online Data
7. Charm of XML

INTRODUCTION TO R STUDIO
1. Introduction
2. Orientation with R Studio
3. Coding for Data Science using R-Studio
4. Code examples with R-Studio

REFERENCES

CHAPTER Ethics in Data Science

Studying this chapter should enable you to understand:
- Evolution of Data Ecosystem
- Need for Data scientists to understand ethics
- Concept of Data governance framework and its benefits

1. Introduction

In this chapter, we will learn about ethical guidelines and data governance frameworks in data science.

2. How Data Ecosystem is evolving

In the initial stages of data science, the kind of data we all dealt with, be it for academic purposes or business needs, was small, structured, and static. This is the kind of data that was easy to put into rows and columns and display via spreadsheets. In short, we can say that this was a happy place to be in for statisticians. Here, traditional tools such as descriptive statistics, predictive modeling, and classification were used to serve the purpose.

However, as data continued to evolve, it did not remain small, structured, and basic. The data that evolved became large, unstructured, and in motion. This change in data behavior led to the need to develop skills different from those required when the data was small, structured, and basic. Here, one must learn about sensor-based data, IoT data, machine learning skills, and concepts like support vector machines. In short, the things that one needed to know in this scenario, to translate large, unstructured data into information, were different from those needed when data was small, structured, and basic.

The kind of data that we see today is massive, integrated, and dynamic. Here we are talking about a system of data talking to another system of data. To execute such behavior, one must learn concepts of machine learning and deep networks.

The figure shown below highlights the evolution of the data ecosystem.

Here, one must remember that the issues related to data ethics when data is small, structured, and basic are very simplistic. However, the issues get complicated when questions need to be asked to retrieve information from massive, integrated, and dynamic data.

3. Why do Data Scientists need to understand ethics?

Since data scientists have access to a vast pool of data in their data analysis, it becomes essential for them to adhere to ethical guidelines. The use of protective mechanisms and policies to discourage the mishandling and unethical use of data should be made part of best practices.

Some of the negative scenarios that may arise if ethical guidelines are disrespected include:

a) A few people can do an immense amount of harm:
In the last decade, we have seen many organizations becoming vulnerable to data breaches. Hackers worldwide are on the lookout to crack through a reputed organization's firewalls and steal important data from their servers. The stolen data are then sold for a hefty sum.

To date, Yahoo holds the title for the largest data breach in the history of the internet. In 2016, the company disclosed that it had been the victim of multiple data breaches over the years, starting in 2013. The data breaches had exposed the email addresses, names, and dates of birth of around three billion people who have used Yahoo.

Another example of a data breach is Marriott (Starwood) hotel. In 2018, Marriott's data team had confirmed that around 383 million accounts of the guests were compromised in the year 2016. The breach had exposed the names, addresses, contact numbers, and passport information of the guests whose accounts were hacked.

b) Lack of consent:
One of the leading social networking sites ran an experiment in which, without consent, it purposely fed users highly extreme points of view in their newsfeed, particularly incendiary content, because it was trying to elicit a reaction from the users and then see whether that affected what the users were posting back. The newsfeed was curated with the purposeful intention of seeing whether that ultimately affected the way a user interacted with the rest of the network. Was consent taken from the users before performing this experiment? The answer to this is "No".

4. What is data governance framework?

A data governance framework can be defined as a collection of practices and processes that ensure the authorized management of data in an organization. The primary purpose behind implementing data governance by any organization is to achieve better control over its data assets, including the methods, technologies, and behaviors around the proper management of data. It also deals with the security, privacy, integrity, and management of data flowing in and out of the organization.

Some of the benefits of implementing a data governance framework are:
1. Procedures around regulation and compliance activities become exact.
2. There is greater transparency within data-related activities.
3. Increase in value of organization's data.
4. Better resolution of issues around current data.

Recap
- In this chapter we have learnt that data over time has evolved both in size as well as in complexity.
- With the increase in the volume of data getting generated every day, data scientists gathering data to perform analysis need to understand the importance of ethics.
- We have also learnt about data governance framework and its benefits.

Exercises

Objective Type Questions

1) A person runs a small business and keeps all his/her business records on an unprotected personal computer. These records include essential information about his/her customers. Since it is a small business, that person believes that he/she is unlikely to be a target for hackers. According to him/her, several years have passed, and information on his/her unprotected computer has never been compromised. Are the actions of that person ethical?
a) Yes
b) No

2) One comes up with an idea to improve the way patient data is collected into electronic medical records, thereby reducing errors and better integrating data entry with patient care workflow. When an experiment is run to evaluate the idea, what kind of data is expected to be used?
a) Prospective data
b) Retrospective data

3) A supermarket has prominently displayed boards at various places in the store: "We videotape you for your security". Later, you find out that the supermarket analyzes videos to decide store layout and product placement. You feel that the signage is misleading since the store uses the videos not just for security but also to boost profits. Are the supermarket's actions ethical?
a) Yes
b) No

4) Suppose a celebrity goes to a supermarket for shopping.
The next day, images of the celebrity taken from the supermarket's video camera appear in a leading tabloid. Is the supermarket right in selling the images to the tabloid?
a) Yes
b) No

5) Once I have voluntarily shared some information about myself on the web, it means that this information is no longer private and can be shared freely.
a) True
b) False

6) Undesired analysis of previously collected personal data violates privacy.
a) True
b) False

7) Undesired dissemination of previously collected data violates privacy.
a) True
b) False

Standard Questions

1) Explain in detail how data has evolved over time.
2) Explain with relevant examples why data scientists need to understand data and follow data ethics.
3) What is data governance framework?
4) What are some of the benefits of implementing data governance framework?

Higher Order Thinking Skills (HOTS)

Please answer the questions below in no less than 200 words.

1) What, according to you, should be the ethical principles for conducting research that involves dealing with other people's data?
2) Should there be differences in expectations about what is ethical online versus offline regarding handling of data?

Applied Project

Suppose you have visited a restaurant for dining. At the end of the meal, the restaurant manager provides you with a form in which you need to fill in your details along with your contact number. According to the manager, the purpose of collecting this data is to enable the restaurant to inform you of exciting deals as and when they come up. Explain in detail the precautions you need to take before handing out your details to the restaurant manager.

CHAPTER Assessing Data

Studying this chapter should enable you to understand:
- Difference between story and fact
- Trial assessment in detail

1. Introduction

In the previous chapter, we learnt about ethical guidelines and data governance framework in data science.

In this chapter, we will learn to make a distinction between story and fact. We will also explore the various aspects involved in performing trial assessment and the insights these assessments generate.

2. Story vs. facts

A story is an account of experiences of events presented by someone. A story may contain disproportionate weightage either in favor of or against an idea or thing. You can have ten different people go through the same set of events, and each of them may present a completely different experience of that set of events.

On the other hand, a fact is something that has occurred or occurs. In other words, it is a truth that has either happened or continues to happen in this universe. In general, people inspect a fact or a series of facts to derive a conclusion.

3. Trial assessment

A trial assessment is a set of steps executed to support, reject or confirm an assumption.

Role of causation in a trial assessment

The idea behind performing the trial assessment is to test something. In other words, a trial assessment can be referred to as an experiment. There are usually two sets of variables in every experiment: the treatment variable and the response variable.

By treatment variable, we refer to the procedure variable. The treatment variable is generally an independent variable. On the other hand, the response variable is a dependent variable. In statistics, an experiment is defined as a supervised study in which a researcher tries to understand the cause and effect relationship. Based on the analysis, the researcher concludes whether the treatment followed had a causal effect on the response variable.

Concept of correlation and causation

In statistics, two or more variables are related if their values change together, so that a rise or fall in one variable's value goes with a rise or fall in the other variable's value, either in the same direction (direct) or in the opposite direction (inverse).

In statistics, correlation describes the direction and strength of a relationship between two or more variables. However, we cannot assume that a change in one variable gives rise to the change in the other variable. E.g., an increase in sales of winter care products in the United States of America is correlated with an increase in sales of summer care products in Australia.

On the other side, causation shows that one event's occurrence originates from the other event's occurrence. E.g., different human activities, such as livestock farming, rising emissions, and cutting down trees in forests, ultimately affect temperature.

Let us perform a trial assessment

One needs to perform a trial assessment to understand what is meant by the cause and effect relationship.

To start, let us take all the students of a grade in an academic year and split them into two halves. The first half is subjected to the treatment of no practice exam. On the other hand, the second half is subjected to the treatment of regular practice exams. At the end of the academic year, both the groups' annual exam results are compared. (Illustration is shown above)

Perception of time assessment

The perception of time assessment highlights a person's subjective experience of time duration within an ongoing event. This perceived duration can alter significantly between different individuals in different circumstances.

Imagine a person taking a launch chamber ride in a water park. In this ride, riders enter the chambers in nearly vertical positions. There is typically a countdown after which a trap door opens on the chamber floor to release the passenger. The anticipation and almost 90-degree launch make these among the most thrilling water park rides. In this case, let us consider that the average duration of the ride is around 5 seconds.

Suppose we interview anyone who has just completed the ride for the first time and ask him/her how long that ride lasted. In that case, that person will estimate around 10 to 11 seconds, which is more than the ride's actual duration.

4. Activity

What are the effects of different durations of light on the growth of radish seedlings?

A class of biology students in a school presented a question: What are the effects of duration of light on the growth of radish seedlings?

To answer the above question, the students designed and carried out an experiment using the following steps:

Collect/consider data

Data were required to be collected to answer the above question. The radish seeds were exposed to three different treatments: 24 hours of light (light), 12 hours of light and 12 hours of darkness (mixed), and 24 hours of darkness (dark). The above three treatments covered two extreme cases and one in the middle.

With the assistance of a teacher, the students agreed to use plastic bags as a growth setup. The plastic bags allowed the students to observe and measure the seeds' germination without interfering with them. Two layers of moist paper towels were put inside a clear plastic bag. The paper towels were stapled in a line about one-third from the bag's bottom to hold the paper towel in place and provide a seam to hold the radish seeds.

For the study, one hundred twenty seeds were available. The growth setup allowed only thirty seeds. The class agreed to use the additional seeds and create a total of four growth setups: one for the light treatment, one for the mixed treatment, and two for the dark treatment. Thirty seeds were selected randomly and placed along the stapled seam of the light treatment bag. Next, thirty seeds were again chosen from the remaining ninety seeds and were placed inside the mixed treatment bag. Finally, thirty of the remaining sixty seeds were randomly chosen and placed inside one of the dark treatment bags. The last thirty seeds were placed inside the other dark treatment bag. Care was taken to make the four growth setups as identical as possible. With two growth setups for the same condition, their results could be compared to ensure similar handling. Three days later, the length of the radish seedlings for the germinating seeds was measured in millimeters.

The data were noted in a summary format like the one shown here.

The table shows the sorted values. In each treatment type, some seeds did not germinate. Such values were considered missing values and were recorded as "x". Thus, there were 114 observations (28 for light treatment, 28 for mixed treatment, 58 for dark treatment).

It would have been more engaging if the students were encouraged to discuss whether excluding the seeds that did not germinate could add bias to the conclusions. In this scenario, because the number of seeds that did not germinate was the same in each category (2 of 30 for light, 2 of 30 for mixed, 2 of 60 for dark), the missing values likely happened by chance. Conversely, if all the missing values were in one category, it would have suggested that that category's conditions hindered the growth, and so the missing data were not accidental.

The data could be represented in another table with each observation (seed) on a separate row and each variable on a separate column for analysis purposes.

In the above table, the observed units are individual seeds. Growth bag indicates the bag (1, 2, 3, or 4) in which the seed is present, and treatment indicates the treatment (1-Light, 2-Mixed, or 3-Dark) the seeds are receiving. Both growth bag and treatment are said to be categorical variables. On the other hand, length is a quantitative variable, measuring the length of the seedlings in millimeters.

When the mean, median, and standard deviation were calculated on the seed samples exposed to different treatments (1-Light, 2-Mixed, or 3-Dark), the results produced were as shown in the table here:

Based on the results produced, the students might have come up with a few questions:

a) Was there proof that the 12 hours of light and 12 hours of dark group had a significantly higher mean length than the 24 hours of light group?
b) Was there proof that the 24 hours of dark group had a significantly higher mean length than the 12 hours of light and 12 hours of dark group?
c) Were these differences in mean large enough to rule out chance as a possible explanation for the observed difference?

The mean length of the seedlings in the treatment 2 (Mixed) group was 6.2 mm more than the mean length of the treatment 1 (Light) group. Even though there was a difference of 6.2 mm, it might not be large enough to rule out chance, and so it might become difficult to claim a treatment effect. This noticeable difference might have been due to one bag being fortunate enough to get a large count of good seeds. It could also be due to the light and water not being uniformly distributed among the treatment groups.

But if a difference this large (6.2 mm) was simply the result of the randomization of seeds, then differences of such magnitude would have been observed quite often if the measurements were rejumbled (randomly reshuffled) and a new difference in mean was calculated.

The students could have mimicked this rejumbling by noting down each seedling's length on a separate card. Thus, for the 56 seedlings, there would have been 56 cards. These cards would then be shuffled and divided into two piles of 28 cards each. The first pile would represent the 1-Light treatment group and the second pile would represent the 2-Mixed treatment group. The students would have then computed the difference in mean between the two piles. The difference generated this way would have been entirely due to chance. By repeating the above process multiple times, the students would have been able to interpret how the difference in mean varies when the treatment has no effect on the growth of the seedlings.

Figure 2.4.4 shown herewith was produced when technology was used to mix the growth measurements from 1-Light treatment and 2-Mixed treatment together and haphazardly divide the measurements into two groups of 28. Here the difference in mean was recorded, and the operation was repeated 200 times in total.

The observed difference of 6.2 mm was never exceeded in the simulation of 200 rejumblings. This gave strong proof against the supposition that the difference between means for treatments 1 and 2 was due to chance alone.

When a similar type of procedure was followed for samples of treatment 2 and treatment 3, a difference as large as the observed 6 mm in mean length never occurred.

Figure 2.4.5 shown herewith was produced when a similar procedure as stated above was done with samples of treatment 2 and treatment 3. Thus, here too, it gave strong proof that the observed difference in mean length between treatment 2 and treatment 3 was not due to chance alone.

Recap
- In this chapter we have learnt that there exists a difference between a story and a fact.
- We have learnt about the concept of correlation and causation.
- We have learnt how causation affects the outcome of a trial assessment.
- We have learnt how a person perceives time under different circumstances.

Exercises

Objective Type Questions

1) Which of the following is incorrect about a story?
a) A story generally consists of one or more character(s)
b) A story represents the point of view of a person
c) A story has a theme associated with it
d) Things revealed in a story are always correct and exist in the real world.

2) What is the nature of the correlation of two variables when they move in the same direction?
a) Neutral
b) Negative
c) Positive
d) None of the above

3)
Height (in cm)   160    165    170    175    180    185
Weight (in kg)   65.1   67.9   70.1   72.8   75.4   77.2
The correlation between height and weight in the above chart can be described as:
a) Positive
b) Negative
c) Zero
d) None of the above

4) In a study of insect life near a stream, data about the number of different insect species and the distance from the stream were collected.
Distance (in feet)   2    5    8    11   14   17
Insect species       26   25   19   19   14   9
The correlation between the distance from the stream and the number of different species found is:
a) Positive
b) Negative
c) Zero
d) None of the above

5) Which of the following is an example of NO correlation?
a) The age of a child and his/her shoe size.
b) The age of a child and his/her height.
c) The age of a child and the number of pets owned.
d) The age of a child and vocabulary of words learned.

6) Which one of the below-mentioned scenarios appears most likely to be from causation?
a) Reading one hour per day increases vocabulary.
b) People who are homeless are more likely to have mental health issues.
c) The weight of a person has nothing to do with the risk of heart disease.
d) In India, as car sales increase, the birth rate also increases.

7) A study outlines that people who run more outdoors have higher rates of skin disease than people who exercise indoors. Which one of these seems the most likely connection between running and skin disease?
a) Running produces a chemical that causes skin disease.
b) People who run generally drink more water and sports drinks, which might weaken the immune system's ability to attack disease germs.
c) People who run do so because they are obese and might have poor health conditions to begin with.
d) People who run outdoors spend more time in the sun. Thus, they are exposed to harmful sunlight for more extended periods.

8) Which of the following statements given below shows a causal relationship and not just a correlated one?
a) An individual's decision to work in construction and his/her diagnosis of skin disease.
b) A decrease in temperature and an increase in the presence of people at an ice skating rink.
c) As the weight of a child increases so does his/her vocabulary.
d) The time spent exercising and the number of calories burned.

9) For a person having a pleasant experience while doing something, time seems to:
a) pass slowly
b) come to a standstill
c) fly quickly
d) None of the above

Standard Questions

1) Write down three instances of stories and justify why you think they are stories.
2) Write down three facts and justify why you think they are facts.
3) What is correlation? (Support your answer with two examples of correlation)
4) What is the difference between positive and negative correlation?
5) What is causation? (Support your answer with two examples of causation)

Higher Order Thinking Skills (HOTS)

Please answer the questions below in no less than 200 words.

1) Is there a correlation between the speaking and writing skills of an individual?
2) If outdoor runners have higher skin disease occurrences due to time exposure in the sun, think of an independent variable that can be used to test the relationship between running and skin diseases. What alternatives can one look for to negate the problem arising out of running outdoors in the sun?
3) Suppose you are reading a newly published work of fiction by your favorite author. The story turns out to be thoroughly engrossing, according to you. Describe your personal experience of the time you took to complete the reading. (In the description, please mention whether it felt like it was taking too long to complete or whether it was the other way around.) Once you have described your experience, kindly note down what you can infer from this.

Applied Project

Consider a company manufacturing metal utensils for home cookware. The company has a factory in which 200 workers work on an everyday basis.
Describe in detail how the set-up of the factory (proper equipment, safety measures and infrastructure) has a correlation and causal relationship with the workers of the factory. Also, try to derive the correlation between the workers of the factory and the profitability and sustenance of the company. One should also think about the end product the customers will be using and what difference it may make to their lives.

CHAPTER Forecasting on Data

Studying this chapter should enable you to understand:
- Forecasting
- Observational study
- Need for observational study
- Pros and cons of observational study

1. Introduction

In the previous chapter, we learned about the differences between story and fact. We also studied trial assessment in detail.

In this chapter, we are going to learn about forecasting and observational study.

2. Forecasting

Given all the information available, including the present and the historical data, forecasting can be defined as a statistical task that predicts the future as accurately as possible.

3. Observational study

An observational study can be defined as a procedure in which the subjects are just observed, and the results are then noted. During the investigation, nobody tries to interfere with the subject to affect the outcome.

An example of an observational study would be if a researcher were trying to determine the effect that eating an organic diet has on overall health. The researcher finds 500 individuals, of whom 250 have eaten an organic diet in the past five years, while the other 250 have not had an organic diet in the past five years. An overall health assessment is then performed on each of these 500 individuals. The result data from the health assessment are then analyzed, and conclusions are drawn on how an organic diet can affect one's overall health.

Why observational study when there is trial assessment?

Sometimes, it is not possible to perform trial assessments. In those scenarios, we need to rely upon observational study for data collection. The reasons behind this are as follows:

1. In trial assessments, the subject is randomly assigned to a treatment group or a control group. However, it is unethical to expose the subject to arbitrary treatment in specific scenarios. Thus, observational studies are preferred over trial assessments. For example, purposefully exposing a subject to polluted air to observe the health issues that come to the forefront is unethical.
2. Large sums of money may be required to execute some of the trial assessments. There may be occasions when such large sums of money cannot be arranged. In such scenarios, it will be better to drop the idea of performing trial assessments and give the observational study priority.
3. A trial assessment cannot be performed in some scenarios as it becomes unfeasible to assign a subject to a group randomly.

Advantages of observational study

The advantages of observational study are as follows:

1. Observation is one of the simplest and most used methods of data gathering. Everybody in this world observes many things in their lives. With little training, one can become an expert in monitoring one's surroundings.
2. Another advantage of using an observational study is that since the observations are made in a perfectly natural setting, the observational analysis can reveal deep and unexplored insights. The revelation of such insights will be a rarity if we try to collect data via other means like surveys.

Disadvantages of observational study

The disadvantages of observational study are as follows:

1. Sometimes, the insights gained by an observational study are not justified by the amount of time spent to gain them.
2. Certain events are uncertain and may not occur in the presence of an observer.
3. Sometimes, the observer may miss reporting important observational details.
4. The chances of an unfair conclusion increase significantly in cases where an expert has not performed the study's analysis.
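Forecasting, as defined in Section 2 of this chapter, uses present and historical data to predict the future. A minimal sketch of one very simple approach is shown below. The method (a moving average of the last few observations), the monthly sales figures, and the function names `moving_average_forecast` and `forecast_error` are all assumptions chosen for illustration; the handbook does not prescribe a particular forecasting technique. The error is computed as actual minus forecast, which is one common convention.

```python
def moving_average_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(history) < window:
        raise ValueError("history must contain at least `window` observations")
    recent = history[-window:]
    return sum(recent) / window


def forecast_error(actual, forecast):
    """Forecast error under the 'actual minus forecast' convention (an assumption)."""
    return actual - forecast


# Hypothetical monthly sales (units sold), oldest first.
monthly_sales = [120, 130, 125, 140, 135, 150]

# Forecast for the next month from the last three months: (140 + 135 + 150) / 3.
forecast = moving_average_forecast(monthly_sales, window=3)

# Hypothetical value actually observed the following month.
actual_next_month = 145
error = forecast_error(actual_next_month, forecast)

print(f"Forecast for next month: {forecast:.1f} units")
print(f"Actual: {actual_next_month} units, forecast error: {error:+.1f} units")
```

A positive error here means the forecast was too low, and a negative error means it was too high. Real forecasting methods usually also account for trend and seasonality, which this sketch ignores.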
Recap In this chapter we have learnt about forecasting and observational study We have understood the need to perform observational study even though the option to perform trial assessment exists. We have learnt the advantages and disadvantages of observational study 20 Exercises Objective Type Questions Please choose the correct option in the questions below. 1. In forecasting, past and present data are used to predict the future as accurately as possible. a) The above statement is always true. b) The above statement is never true. c) The above statement is sometimes true. d) None of the above. 2. If the actual demand for a period is 100 units but forecast demand was 90 units. The error in forecast is a) -10 b) +10 c) -5 d) +5 3. Observational study cannot be used in: a) Child studies b) Study about attitudes c) Animal studies d) Studies involving groups 4. Which of these is not true? a) Observational study is cheap. b) Observational study replaces interviewing. c) Observational study is time consuming. d) Observational study requires operational definition. 5. Which of these would make an observational study unethical? a) Putting an observer at risk of harm. b) Using multiple observers. c) Not getting consent from those being observed. d) Conducting the observation late at night. 6. Observer’s reliability is improved by: a) Training observers. b) Using operational definitions 21 c) Restricting observations to specific time points d) All the above 7. Which of the following is not a disadvantage of observational study? a) We need to assign the subject to random treatment and control groups. b) Certain events may occur in the absence of the observer. c) Miss the reporting of critical observational details. d) Time spent is far more compared to the insights gained via observation. Standard Questions 1. What is forecasting? Give two examples of forecasting. 2. State the reasons why sometimes observational study is preferred over trial assessment? 3. 
What are the advantages of observational study?
4. What are the disadvantages of observational study?

Higher Order Thinking Skills (HOTS)
Please answer the questions given below in no less than 200 words.
1. A monthly family budget is a forecast of income and expenditure of a family in a month. Critically discuss the above statement.
2. Imagine that you have been given the task to observe the food seeking behavior of rats. Would it be best to conduct this in the wild, or in a laboratory situation? Do you think the results will matter?

Applied Project
You have been assigned the task to observe the behavior of people working in the parking space of a supermarket. During the observation, you should keep an eye on the maximum number of people working at a time in the parking space. How do the employees interact with the drivers of the vehicles coming in and going out of the parking space? Do they provide any extra assistance to the customers? How well are they able to manage space, so that they can accommodate parking for the maximum number of cars at peak shopping hours? You are free to add any observation that you may find interesting.

CHAPTER
Randomization

Studying this chapter should enable you to understand:
Use of surveys to collect data
Sampling bias
Confidence interval
Data collection by sensory devices
Data from internet

1. Introduction
In the previous chapter, we learned about how we can use the observational study to collect data. The collected data is further analyzed to deduce a conclusion.

In this chapter, we will learn about how we can collect data via mediums like surveys, sensory devices, and the internet. We will also explore how a confidence interval is used to express how accurate the deduced results are.

2. Let us do a survey
A survey is a research method used to collect data where the subjects are generally people. In a survey, the process involves asking people for information through a questionnaire.

The outcome of a survey depends heavily on the type of questions asked. The questions should be carefully worded so as not to hurt the sentiments of the people being surveyed.

Surveys can be composed of two types of questions: open-ended questions and close-ended questions. The respondents can answer open-ended questions in their own words. In close-ended questions, the choice of answers from which to select is fixed and generally provided alongside the question.

Some examples of open-ended questions are:
Comments/review
Suggest improvements

Some examples of close-ended questions are:
Multiple choice
Yes/No
A rating scale of 1 to 10
Emojis

3. Sampling Bias
Sampling bias is a type of bias in which a sample is collected so that some members of the considered population have a lower or a higher sampling chance than the others. Such a sample can be considered a non-random sample, as not everyone is equally likely to be selected. If such a scenario is not accounted for, it will generate wrong results for the phenomenon under study.

In other words, to make the statistic unbiased, sample collection should be random. Thus, all the members of the considered population should have an equal sampling chance.

4. How sure are you?
When we ask someone, "How sure are you?" we try to gauge the level of confidence with which that person is putting forward an observation.

In statistics, the term used to measure the accuracy of a result is called the confidence interval.

To understand confidence interval in detail, we first need to understand sampling and sampling error. To find things out about a population of interest, it is common practice to take a sample. A sample is a selection of objects or observations taken from the population of interest. For example, a population can be all mangoes in an orchard at a given time. We wish to know how heavy the mangoes are. We cannot measure all of them, so we take a sample of some of them and measure them.

Let us first understand what a population parameter is. A population parameter is a value that describes the characteristics of an entire population, such as the population mean.

Inference is when we draw a conclusion about the population from the sample. Because the sample is only a selection of objects from the population, it will never be a perfect representation of the population. Separate samples of the same population will give different results, giving rise to sampling error or variation. Thus, there will always be sampling errors.

To sum up, when we estimate a population parameter, it is good practice to give it a confidence interval. A confidence interval communicates how accurate our estimate is likely to be.

Thus, if we put an investigative question like:
What is the mean weight of all the mangoes in the orchard?
For this, we take a sample of mangoes and calculate the sample mean, which is the best estimate of the population mean.

A confidence interval defines the span in which we are pretty sure the population parameter lies. In this case, the mean weight for all the mangoes in the orchard is the population parameter.

So, in this case, if we consider that the mean weight of mangoes in the orchard is 250 g and the confidence interval is 20, we can represent it as follows:

Now that we know about confidence interval, let us find out what affects a confidence interval's width.

The width of a confidence interval depends upon two things:
1. The first thing is variation within the population of interest. If all the population values were almost the same, then we will have little variation. Our estimate is going to be close to the actual population value. Thus, the confidence interval, in this case, will be small. But a more diverse population will lead to a more diverse sample. Different samples taken from the same population will differ more. We will be less sure that the mean of the sample will be close to the population mean. Thus, here the confidence interval will be large. So greater dissimilarity in the population leads to a wider confidence interval.
2. The width of the confidence interval is also affected by the sample size. With a small sample, we do not have much information on which to base our conclusion. Small samples will differ more from one another, leading to a wider confidence interval. On the other hand, with a larger sample size, the effect of a few unusual values is evened out by the other values in the sample. Larger samples will be more like each other. The effective sampling error is reduced with larger samples. When we take larger samples, we have more information and can be surer about our estimates, which leads to a narrower confidence interval.

5. Let us act on a sense
Another method that can be used to collect data is via sensors. This method of data collection requires the least human involvement.

A sensor is a device that identifies and measures the change in input from a physical entity and converts it into signals. These generated signals can then be converted into human-readable displays.

Here is an example of a sensor:
In a mercury-based glass thermometer, the temperature is the input. Depending on the temperature change, the mercury either expands or contracts, causing the level to go up or down on the marked gauge, which is human readable.

A sensor can either collect data continuously or whenever a trigger gets activated. Data collection can also be done automatically, without any human intervention, by following a predefined set of rules.

6. Online Data
The internet can be considered as an ocean of data. There is an uncountable number of websites and web articles on the internet. All of these serve as a rich pool of data. Data can be easily collected from the internet by web data scraping, cleaning up the data, and then analyzing it.

7. Charm of XML
Whenever we collect data for analysis, we first decide upon a subject on which the analysis needs to be performed. We then go one level deeper to understand the characteristics that need to be observed to perform the analysis.

Shown below is a table highlighting upvotes for different types of pizzas.

Thus here, we perform an analysis on pizzas. The study is based on the upvotes that pizzas of different pizza crust categories have received.

We can also have a similar kind of table on a web page on the internet. We can store the data shown in these tables on the internet in an XML.

XML stands for Extensible Markup Language. It is a self-descriptive tool to store and transport data on the internet.

A simple XML is made up of tags, element names, and element values. A tag, either opening or closing, is used to mark an element's start or end. Tags are of two types: start tag and end tag. A start tag is created by wrapping the element name between '<' and '>'. An end tag is created by wrapping the element name between '</' and '>'. The element value is present between the start tag and the end tag.

In XML, each element is called a node. Each node can have one or more child nodes contained within it.

Thus, if we want to represent the above table as an XML, it will be as shown below:

The XML format makes it simple to display data on a web page. Also, converting XML to a data table format helps us visualize our data better. Thus, we can get an XML to collect data and perform analysis on it.

Recap
In this chapter, we have learnt about the use of surveys to collect data.
We also understood how to design the questions in a survey and the different types of questions that may find their place in a survey.
Introducing bias while collecting a sample will give incorrect results.
Sometimes, results from an experiment are stated as an approximation.
The range around the approximated value within which the actual value is likely to lie is the confidence interval.
Data can be collected via different mediums from different places. We can collect temperature data via a thermometer, while data on the internet can be collected in XML format.

Exercises
Objective Type Questions
1) Which of the following statements is false?
a) Yes/No is an example of close ended question in a survey.
b) A rating scale of 1 – 10 is an example of close ended question in a survey.
c) A multiple-choice option is an example of close ended question in a survey.
d) Suggesting improvement is an example of close ended question in a survey.
2) Out of the options given below, which one can be selected to be associated with survey research?
a) The problem of objectivity
b) The problem of "going native"
c) The problem of omission
d) The problem of robustness
3) Mr. X conducted a study of the way restaurant owners granted or refused access to a couple. This is an example of observing behavior in terms of:
a) Individuals
b) Incidents
c) Short time periods
d) Long time periods
4) The statement "results are accurate within +/-4 p.p., 95% of the time" refers respectively to:
a) confidence level and confidence interval
b) confidence interval and confidence level
c) margin of error and margin of confidence
d) sample interval and confidence level
5) Sampling error can be reduced by:
a) correcting a faulty sample frame
b) increasing response rates
c) increasing the sample size
d) reducing incomplete surveys
6) Greater dissimilarity within population of interest will:
a) Increase the confidence interval
b) Decrease the confidence interval
c) Will have no impact on confidence interval
d) None of the above
7) Which of the following is a method of data collection by sensing?
a) Use of speed guns to measure the speed of vehicles by traffic police.
b) Performing surveys to calculate population of the country.
c) Performing experiments and observing the results.
d) None of the above
8) What is the best format in which online data should be collected to perform analysis?
a) RTF
b) DOCX
c) XML
d) CSV

Standard Questions
1) What is a survey? What are the things to keep in mind while creating a survey?
2) What are the different types of questions that a survey can contain? Provide two examples for each type of question.
3) What is sampling bias? Is it reasonable to have a sampling bias? Provide an example to support your answer.
4) Given below are instances of biased survey questions. Point out the bias involved and try rephrasing the questions so that the subject responding to them can give a meaningful response.
a) How amazing was your experience with our customer service team?
b) What problems did you have with the launch of this new product?
c) How do we compare to our competitors?
d) Do you always use product X for your cleaning needs?
5) What is a confidence interval? What are the different factors affecting the values of a confidence interval for a given population?
6) Explain with an example how sensors are used widely in the field of healthcare to collect data and monitor patients' health conditions.

Higher Order Thinking Skills (HOTS)
1) India is a nation where most people are incredibly fond of watching cricket as a sport. In order to better strategize against the opponent, each team analyses the strengths and weaknesses of the opponent, and a lot of data is analyzed off the field. Explain in detail how this is done.

Applied Project
Consider a situation where a person gets admitted to a hospital for treatment. In such situations, information is collected from the patient in various ways. Explain in detail the various ways data is collected from the patient. Once the data has been collected, represent the data first in tabular format and then as an XML.

CHAPTER
Introduction to R Studio

Studying this chapter should enable you to understand:
Orientation with R Studio
Coding for data science using R Studio

1. Introduction
In the previous chapter, we learned about collecting data from surveys, sensors, and the internet. We also explained in detail the concept of a confidence interval.

In this chapter, we will learn about R Studio and coding for data science using R Studio.

2. Orientation with R Studio
To download R Studio, navigate to the link given below:
https://rstudio.com/products/rstudio/download/
We need to download the R Studio installer for Windows and install it on the Windows machine. Once R Studio is installed successfully on the Windows machine, we need to open it to start working.

When R-Studio is opened for the first time, we get an interface, as shown below:

We can load a .csv file from a directory and can see the contents in R-Studio like below:

The console window in RStudio is the place where we can tell it what to do and it will show the results of a command. We can type commands directly into the console, but the drawback is that they will be forgotten when we close the session.

Some examples of simple commands executed via the console in RStudio are given below:

Every time RStudio is opened, it goes to a working directory. We can find the current working directory in RStudio using the command getwd() in the console.

We can change the working directory to a folder of our choice. To do so, we use the setwd() function. The path of the directory which we want to set as the working directory is passed as a string parameter to the function. The path can be either a relative path or an absolute path.

Vector in R programming
In R, a sequence of elements that share the same data type is known as a vector. Vectors are the most basic data objects. There are six basic vectors: logical, integer, double, complex, character, and raw. We also call these basic vectors atomic vectors.
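As a minimal sketch in base R, the block below shows one value of each of the six atomic types, followed by the creation, indexing, and arithmetic operations that the rest of this section describes. All variable names and values are made up for illustration.

```r
# One value of each of the six atomic vector types (illustrative values).
flag  <- TRUE            # logical
count <- 5L              # integer (the L suffix marks an integer literal)
ratio <- 2.5             # double
z     <- 3+2i            # complex
word  <- "data"          # character
bytes <- charToRaw("A")  # raw
types <- c(typeof(flag), typeof(count), typeof(ratio),
           typeof(z), typeof(word), typeof(bytes))
print(types)

# A single value is already a vector of length one.
print(length(ratio))

# Three ways of creating a vector with several elements.
v_colon <- 5:10                  # colon operator: 5, 6, ..., 10
v_seq   <- seq(5, 9, by = 0.5)   # seq(): from 5 to 9 in steps of 0.5
v_mixed <- c("apple", 10, TRUE)  # c(): all values are coerced to character
print(typeof(v_mixed))

# Indexing starts at 1; a negative index drops that position;
# a logical index keeps the positions marked TRUE.
days <- c("Sun", "Mon", "Tue", "Wed")
print(days[c(1, 3)])
print(days[-2])
print(days[c(TRUE, FALSE, FALSE, TRUE)])

# Element-wise arithmetic on two vectors of the same length.
v1 <- c(3, 8, 4)
v2 <- c(1, 2, 2)
print(v1 + v2)
print(v1 - v2)
print(v1 * v2)
print(v1 / v2)
```

typeof() reports the storage type of a value, which is a simple way to check which atomic type a vector belongs to after c() has coerced its arguments.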
When a person writes just one value in R, it becomes a vector of length one and belongs to one of the above-stated vector types. Such a vector is called a single-element vector.

Just like a single-element vector, we also have multiple-element vectors. We can create a multiple-element vector with numeric data using the colon operator.

Using the sequence operator, seq(), we can create a vector with elements between two numbers, whose values increment by a given step.

Another way of creating a vector is to use the c() function. Here, the default method combines the arguments provided to form a vector. All the arguments are coerced to a common type, which is the type of the returned value. The return type is determined from the highest type of the components in the hierarchy expression > list > character > complex > double > integer > logical > raw > NULL.

To access the elements in a vector, indexing is used. The [] brackets are used for indexing. Indexing starts with position 1. Providing a negative value in the index drops that element from the result. We can also use the logical values TRUE/FALSE for indexing. Note that the numbers 0 and 1 are treated as positions and not as logical values, so an index of 0 selects nothing.

Arithmetic operations like addition, subtraction, multiplication, and division can also be performed on vectors. The two vectors should be of the same length; if they are not, R recycles the elements of the shorter vector. The result of the arithmetic operation is also a vector.

In the example shown below, we declare two vectors v1 & v2, and then perform arithmetic operations like addition, subtraction, multiplication, and division.

List in R programming
A list in R is a type of R object which contains different types of elements such as numbers, vectors, strings, and even another list. A list can also contain a function or a matrix as its elements.

To create a list, we use the list() function.
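A minimal sketch of working with a list in base R is given here. It creates a list, names its elements, accesses them by position and by name, and then updates, adds, and deletes elements. It also merges two lists with c() and flattens lists with unlist(). The list contents and names are made up for illustration.

```r
# Create a list from a string, a number, a vector, and a logical value.
student <- list("Asha", 16, c(82, 91, 77), TRUE)

# Name the elements so they can be accessed by name as well as by position.
names(student) <- c("name", "age", "marks", "passed")
print(student[[1]])     # access by position
print(student$marks)    # access by name

# Update one element, add one at the end, and delete one.
student$age <- 17
student$section <- "B"
student$passed <- NULL
print(names(student))

# Merge two lists into one with c().
merged <- c(list(1, 2, 3), list("a", "b"))
print(length(merged))

# Flatten lists of numbers with unlist() so that vector arithmetic applies.
total <- unlist(list(1, 2, 3)) + unlist(list(10, 20, 30))
print(total)
```

Double brackets, as in student[[1]], return the element itself, while single brackets, as in student[1], return a list containing that element.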
Shown below is an example to create a list using strings, numbers, vectors, and logical values.

Naming elements in a list
Names can be given to list elements, and the elements can be accessed using these names. Shown below is an example of assigning names to the elements in the list.

Accessing elements in a list
Elements in a list can be accessed using the index of the element in the list. If the list is a named list, its elements can also be accessed using their names. To give a demonstration, let us use the list shown in the above example:

Manipulating elements in a list
In a list in R, we can add, delete, or update elements. New elements are usually added at the end of the list, while an update or a deletion can be performed on any element in the list. As a demonstration, let us use the list shown in the above example:

Matrices in R programming
In R, matrices are an extension of the numeric or character vectors. In other words, they are atomic vectors arranged in a two-dimensional rectangular layout. Since a matrix is an extension of an atomic vector, its elements must be of the same data type.

To create a matrix in R, we use the matrix() function. The syntax for creating a matrix in R is:
matrix(data, nrow, ncol, byrow, dimnames)
The parameters used can be described as follows:
data: the input vector which becomes the data elements of the matrix.
nrow: number of rows to be created.
ncol: number of columns to be created.
byrow: a logical value. When set to TRUE, the elements of the input vector are arranged by row.
dimnames: the names assigned to the rows and columns.

Shown below is an example of a matrix where no data source is provided.

Shown here is an example of creating a matrix taking a vector of numbers as input.

How to access elements in a matrix
Elements of a matrix can be fetched by using the row and column index of the element. Shown below is a code snippet that illustrates how we can access different elements in a matrix.
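The steps above can be sketched in base R as follows. The block builds a small matrix with matrix(), reads elements by row and column index, and performs element-wise arithmetic with a second matrix of the same dimensions. The values and dimension names are made up for illustration.

```r
# Build a 2 x 3 matrix from the vector 1:6, filling it row by row.
m <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))
print(m)

# Access a single element, a whole row, and a whole column.
print(m[2, 3])   # row 2, column 3
print(m[1, ])    # first row
print(m[, 2])    # second column

# Element-wise arithmetic needs matrices of the same dimensions.
m2 <- matrix(c(10, 20, 30, 40, 50, 60), nrow = 2, ncol = 3, byrow = TRUE)
print(m + m2)
print(m * m2)
```

With byrow = FALSE, which is the default, the same input vector would fill the matrix column by column instead.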
Arithmetic operations such as addition, subtraction, multiplication, and division can also be performed on matrices. For performing arithmetic operations, the two matrices must be of the same dimensions. The result of the arithmetic operation is also a matrix.

Shown below is an example where we declare two matrices matrix1 & matrix2 and then perform arithmetic operations like addition, subtraction, multiplication, and division on them.

We can also merge many lists into one list. Merging can be done by placing the lists inside the c() function. Placing them inside the list() function instead produces a list whose elements are the original lists.

Shown below is an example of two lists being combined into one.

Transforming list to vector
We can transform a list into a vector. By doing so, we can perform further manipulation on the elements of the vector. Once a list is converted to a vector, we can perform all the arithmetic operations possible on vectors. To convert a list into a vector, we use the unlist() function. This function takes a list as input and generates a vector as output.

Shown below is an example to convert lists into vectors and perform addition on them.

Arrays in R programming
Arrays are the R data objects in which we can store data in more than two dimensions. So, if we create an array of dimensions (4,5,2), it will create two rectangular matrices, each with four rows and five columns. An array can store values of only one data type.

An array is created using the array() function. The array() function takes vectors as input and uses the values of the dim parameter to create an array.

Shown below is a simple example of an array.

In arrays, we can provide names to the rows, columns, and matrices. This is done using the dimnames parameter.

Shown below is an example of an array with custom names for rows, columns, and matrices.
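A minimal sketch of a named array in base R follows. It also reads elements using the row, column, and matrix index. The dimension names and values are made up for illustration.

```r
# Names for the rows, columns, and matrices (illustrative labels).
row_names    <- c("row1", "row2")
col_names    <- c("col1", "col2", "col3")
matrix_names <- c("matrix1", "matrix2")

# 12 values arranged as two 2 x 3 matrices; values fill column by column.
arr <- array(1:12, dim = c(2, 3, 2),
             dimnames = list(row_names, col_names, matrix_names))
print(arr)

# Access by (row, column, matrix) index, or by name.
print(arr[1, 3, 2])               # row 1, column 3 of the second matrix
print(arr[, , 1])                 # the whole first matrix
print(arr["row2", , "matrix2"])   # second row of the second matrix
```

Leaving an index position empty, as in arr[, , 1], selects everything along that dimension.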
How to access an element in an array
Elements in an array can be retrieved using the row, column, and matrix index of the element. Shown below is a code snippet that illustrates how we can access different elements in an array.

Factors in R programming
In R, factors are the data objects used to categorize the data and store it as levels. Factors can store both strings and integers. They are generally used in columns that have a finite number of unique values. Factors are helpful in data analysis for statistical modeling.

Factors in R are created using the factor() function. The input parameter for this function is a vector. Shown below is an example of implementing factors in R:

Data frames in R programming
In R, a data frame can be defined as a table-like structure used to store data. In a data frame, each column contains the values of one variable, and each row contains one set of values, one for each column.

In a data frame, the column names are non-empty, and the row names should be unique. Data frames are made up of data that are of numeric, factor, or character data type. Here, each column should contain the same number of data items.

To create a data frame, we use the data.frame() function. Shown below is an example of a simple data frame:

Structure of the data frame
In R, we can get the structure of a data frame using the str() function. Shown below is an example of str() function to get the structure of a data frame.

Retrieving the summary of data in a data frame
We can get the statistical summary and nature of the data in a data frame by applying the summary() function. Shown below is an example of a summary() function to summarize data in a data frame.

How to extract data from a data frame
We can extract a specific column from a data frame using the column name. Shown below is an example of extracting data from a data frame.

3. Coding for Data Science using R-Studio
An essential aspect of data science includes data visualization. We can represent such visualizations as scatter plots, box plots, time series plots, bar charts, histograms, pie charts, etc. Although we have functions to plot scatter plots, box plots, and time series plots in R, we can also plot them by including a package named ggplot2.

ggplot2 is a plotting package that simplifies the creation of complex plots from data in a data frame. This package provides a more programmatic interface to specify what variables to plot, how they should be displayed, and other general visual properties. Thus, one needs to make minimal changes if the underlying data source changes or if the visualization changes from a scatter plot to a bar plot.

A few of the essential functions under the ggplot2 package include the ggplot function & the geom functions. ggplot graphics are built gradually by adding new elements. This approach makes plotting flexible and customizable.

To build a ggplot, the basic template used for generating different types of plot is:
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()

Using the ggplot function, we bind the plot to a data frame. This is done using the data argument. We define an aesthetic mapping (using the aesthetic (aes) function), by selecting the variables to be plotted and specifying how to present them in the graph, e.g. as x/y positions or characteristics such as size, shape, color, etc.

GEOM_FUNCTION represents the graphical representation of the data in the plot in the form of points, lines, or bars. The most common forms of geom functions are:
geom_point() [used for scatter plots, dot plots, etc.]
geom_boxplot() [used for boxplots]
geom_line() [used for trend lines, time series, etc.]
To add a geom function to the ggplot function, we use a '+' operator.

Scatter plot in R
Let us first see how we can create a scatterplot in R without using any package. For creating a scatterplot in R, we use the plot() function.
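A minimal sketch of a base R scatter plot is given here, using the built-in mtcars data frame. Before plotting, it extracts two columns by name and inspects them with str() and summary(). The plot title and axis labels are our own choices, and the arguments passed to plot() are among those described in the syntax that follows.

```r
# mtcars is a data frame that ships with R; disp is the engine
# displacement and hp is the gross horsepower of each car.
cars <- mtcars[, c("disp", "hp")]   # extract two columns by name
str(cars)                           # structure of the data frame
print(summary(cars$hp))             # statistical summary of one column

# Scatter plot of horsepower (Y-axis) against displacement (X-axis).
plot(x = cars$disp, y = cars$hp,
     main = "Displacement vs Horsepower",
     xlab = "Displacement (cu. in.)",
     ylab = "Horsepower")
```

When run in RStudio, the plot appears in the Plots pane; when run as a script outside RStudio, R writes it to a file instead.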
The basic syntax for creating a scatterplot is:
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
Following is the description of the parameters used:
x is the data set whose values are the horizontal coordinates.
y is the data set whose values are the vertical coordinates.
main is the title of the graph.
xlab is the label in the horizontal axis.
ylab is the label in the vertical axis.
xlim is the limits of the values of x used for plotting.
ylim is the limits of the values of y used for plotting.
axes indicates whether both axes should be drawn on the plot.

Shown below is an example of a simple scatter plot drawn using plot() function in R. In this example, we are using values of two columns disp & hp from the built-in mtcars data set in R. We are using the values from these two columns to draw a scatter plot in R.

Shown alongside is the scatterplot drawn in R for the above set of inputs. The Y-axis displays the horsepower (hp) and the X-axis displays the displacement (disp).

Now let us see how we can plot a scatter plot using the ggplot2 plotting package. Here, we will see the use of geom_point() along with ggplot().

As stated earlier, in R, we have a predefined dataset named mtcars.

Box plot in R
A box plot is a graphical technique of summarizing a set of data on an interval scale. Boxplots are used extensively in descriptive data analysis. Using this, we can show the shape of the distribution, its central value, and its variability.

A boxplot in R is created using the boxplot() function. The syntax to create a boxplot in R is:
boxplot(x, data, notch, varwidth, names, main)
Following is the description of the parameters used:
x is a vector or a formula.
data is the data frame.
notch is a logical value. Set as TRUE to draw a notch.
varwidth is a logical value. Set as TRUE to draw the width of each box proportionate to the sample size.
names are the group labels which will be printed under each boxplot.
main is used to give a title to the graph.
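A minimal sketch of boxplot() in base R is given below, using the built-in airquality dataset. It draws one boxplot from a single numeric vector and one using a formula of the form Y ~ X, which draws a separate box for each group. The titles and axis labels are our own choices.

```r
# airquality ships with R: daily air quality readings, May to September 1973.
# A single boxplot of one numeric column (missing values are ignored).
boxplot(airquality$Ozone,
        main = "Ozone readings",
        ylab = "Ozone (ppb)")

# Formula form Y ~ X: one box per month, with box widths scaled by group size.
stats <- boxplot(Ozone ~ Month, data = airquality,
                 varwidth = TRUE,
                 main = "Ozone by month",
                 xlab = "Month", ylab = "Ozone (ppb)")
print(stats$names)   # the group labels drawn under the boxes
```

boxplot() also returns, invisibly, the statistics it used to draw the boxes, which is why the result can be stored in a variable and inspected.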
The boxplot() function can also take in formulas of the form Y~X, where Y is a numeric vector grouped according to the value of X.

For demonstrating boxplots in R, we will be using the airquality dataset.

Example of a boxplot where the numeric vector is grouped according to another value.

Now let us see how we can plot a boxplot using the ggplot2 plotting package.
Boxplot in R (use of geom_boxplot() with ggplot())

Line chart in R
A line chart is a form of a chart created by connecting data points of the data set. Line charts can be used for exploratory data analysis to check the data trends by observing the pattern of the line.

To create a line graph in R, we use the plot() function. The syntax used to create a line chart in R is:
plot(v, type, xlab, ylab, main, col)
Following is the description of the parameters used:
v is a vector containing the numeric values.
type takes the value "p" to draw only the points, "l" to draw only the lines and "o" to draw both points and lines.
xlab is the label for x axis.
ylab is the label for y axis.
main is the title of the chart.
col is used to give colors to both the points and lines.

Let us look at an example to draw a line chart in R using the plot function.

Now let us see how we can draw a line chart using the ggplot2 plotting package.
Line chart in R (use of geom_line() with ggplot())

Bar chart in R
The basic syntax for creating a bar chart in R is:
barplot(H, xlab, ylab, main, names.arg, col)
Following is the description of the parameters used:
H is a vector or matrix containing numeric values used in bar chart.
xlab is the label for x axis.
ylab is the label for y axis.
main is the title of the bar chart.
names.arg is a vector of names appearing under each bar.
col is used to give colors to the bars in the graph.
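A minimal sketch of barplot() in base R follows. The temperature values are made up for illustration and are not taken from any real record. The block draws a vertical bar chart, the same chart horizontally, and then stacked and grouped bars from a matrix.

```r
# Illustrative (made-up) maximum temperatures in Celsius for five months.
max_temp <- c(31, 34, 38, 41, 39)
months   <- c("Feb", "Mar", "Apr", "May", "Jun")

# Vertical bar chart; barplot() returns the midpoints of the bars.
mids <- barplot(max_temp, names.arg = months,
                xlab = "Month", ylab = "Max temperature (C)",
                main = "Maximum temperature", col = "steelblue")

# The same data drawn horizontally.
barplot(max_temp, names.arg = months, horiz = TRUE,
        xlab = "Max temperature (C)", main = "Maximum temperature")

# A matrix input gives stacked bars by default,
# or grouped bars when beside = TRUE.
temps <- matrix(c(18, 31, 21, 34, 25, 38), nrow = 2,
                dimnames = list(c("Min", "Max"), c("Feb", "Mar", "Apr")))
barplot(temps, main = "Stacked", col = c("lightblue", "tomato"))
barplot(temps, beside = TRUE, main = "Grouped", col = c("lightblue", "tomato"))
```

In the matrix form, each column becomes one bar (or one group of bars) and each row becomes one segment within it.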
Given below is an example to draw the bar graph for maximum temperature in Celsius recorded during five consecutive months of a year.

Bars can also be plotted horizontally by providing the argument horiz = TRUE.

The horizontal presentation of the bar graph for maximum temperature in Celsius recorded during five consecutive months of a year.

Group bar chart and stacked bar chart
Bar charts can also be created in R with groups of bars and stacks in each bar using a matrix as an input value.

Instead of a stacked bar we can have different bars for each element in a column juxtaposed to each other by specifying the parameter beside = TRUE in the barplot function as shown below.

Histogram in R
The basic syntax for creating a histogram in R is:
hist(v, main, xlab, xlim, ylim, breaks, col, border)
Following is the description of the parameters used:
v is a vector containing numeric values used in histogram.
main indicates the title of the chart.
col is used to set color of the bars.
border is used to set border color of each bar.
xlab is used to give description of x-axis.
xlim is used to specify the range of values on the x-axis.
ylim is used to specify the range of values on the y-axis.
breaks is used to set the break points between bars, or the number of bars, which determines the width of each bar.

Shown below is an example to plot a simple histogram using R:

Pie Chart in R
The basic syntax for creating a pie chart in R is:
pie(x, labels, radius, main, col, clockwise)
Following is the description of the parameters used:
x is a vector containing the numeric values used in the pie chart.
labels are used to give description to the slices.
radius indicates the radius of the circle of the pie chart (value between -1 and +1).
main indicates the title of the chart.
col indicates the color palette.
clockwise is a logical value indicating if the slices are drawn clockwise or anti clockwise.

Shown below is a simple pie chart created using input vector and labels.
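A minimal sketch of pie() in base R is given here. The counts and category names are made up for illustration. The block computes the percentage share of each slice, uses it in the slice labels, and adds a title, colors, and a legend.

```r
# Illustrative (made-up) number of students using each mode of transport.
counts <- c(20, 45, 10, 25)
modes  <- c("Bus", "Metro", "Cycle", "Walk")

# Share of each slice as a rounded percentage, used in the slice labels.
pct <- round(100 * counts / sum(counts))
slice_labels <- paste0(modes, " ", pct, "%")

# One color per slice, reused for the legend.
slice_cols <- rainbow(length(counts))
pie(counts, labels = slice_labels,
    main = "Mode of transport",
    col = slice_cols)
legend("topright", legend = modes, fill = slice_cols, cex = 0.8)
```

Calling legend() after pie() draws the legend on top of the existing chart, so the colors must be passed to both calls in the same order.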
We can modify the above pie chart with a custom title and colors.

The above pie chart can also be presented in the form of slice percentages and chart legend.

4. Code examples with R-Studio
In this section, we will see how to perform statistical analysis using R Studio. For this, we will be using built-in functions, most of which are part of the R base package. The measures covered here are mean, median, and mode.

Mean
We calculate the mean by taking the sum of values and dividing it by the number of values in the data series. To calculate the mean in R, we use the mean() function. The mean() function in R takes a vector as an input along with the other arguments and generates the result. The syntax for calculating mean in R is:
mean(x, trim = 0, na.rm = FALSE, ...)
Following is the description of the parameters used:
x is the input vector.
trim is used to drop some observations from both ends of the sorted vector.
na.rm is used to remove the missing values from the input vector.

Shown below is an example of calculating the mean of a given input vector.

Use of trim parameter in the mean() function:
When the trim parameter is used in the function, the vector's values get sorted. Then the required number of observations are dropped from both ends before the mean is calculated. The value of trim is the fraction of observations to drop from each end. So, for a vector of 10 values, trim = 0.2 drops 2 values from both the right end and the left end before the mean is calculated.

Shown below is an example of calculating the mean of a given input vector, having a trim parameter.

Use of na.rm parameter in the mean() function:
In some situations, one or more values may be NA. If the vector contains a missing value (populated in the vector as NA), the mean function by default returns NA. To drop the missing value in a vector, we set the na.rm parameter to TRUE (i.e., na.rm = TRUE), which means remove the NA values.
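A minimal sketch of these calculations in base R is given below, with made-up input values. It shows mean() with the trim and na.rm parameters. For comparison, it also shows median() and a small user-defined mode function; the name get_mode is our own and is not a built-in R function.

```r
# Illustrative input vector with 10 values.
x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)

print(mean(x))               # mean of all 10 values
print(mean(x, trim = 0.2))   # drops 2 values from each end of the sorted vector

# A missing value makes mean() return NA unless na.rm = TRUE.
y <- c(x, NA)
print(mean(y))
print(mean(y, na.rm = TRUE))

# median() accepts na.rm in the same way.
print(median(y, na.rm = TRUE))

# User-defined mode: the value that occurs most often.
# If several values tie, the one that appears first is returned.
get_mode <- function(v) {
  uniq <- unique(v)
  uniq[which.max(tabulate(match(v, uniq)))]
}
print(get_mode(c(2, 1, 2, 3, 1, 2)))
print(get_mode(c("it", "the", "it", "on", "it")))
```

The trimmed mean is less affected by extreme values such as 54 and -21 in this vector, which is why it differs from the plain mean.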
Shown below is an example of calculating the mean of a given input vector with missing values:

Median

The median is the middlemost value in a sorted data series. To calculate the median of a data series in R, we use the median() function.

The syntax for calculating the median in R is:

median(x, na.rm = FALSE)

Following is the description of the parameters used:
x is the input vector.
na.rm is used to remove the missing values from the input vector.

Shown below is an example of calculating the median of a numeric vector.

Use of na.rm parameter in the median() function:

In some situations, one or more values may be missing (stored in the vector as NA). In that case, the median() function returns NA by default. To drop the missing values from the calculation, we set the na.rm parameter to TRUE (i.e., na.rm = TRUE), which means remove the NA values.

Shown below is an example of calculating the median of a given input vector with a missing value.

Mode

The mode is the value that occurs most frequently in a set of data. The mode can be determined for both numeric and character data. R does not have a built-in function for calculating the statistical mode (the built-in mode() function returns the storage mode of an object instead), so here we create a user-defined function to get the mode of a data set.

Shown below is an example of a user-defined function that takes a vector as input and returns the mode value as output.

Recap

In this chapter, we have learnt about the use of R Studio for programming in R.
We have learnt about the different data objects in R.
We have learnt about methods to generate different statistical visualizations using R.
We have learnt the ways to calculate mean, median and mode using R as the programming language.

Exercises

Objective Type Questions

1) R is an ________________________ programming language?
a) Closed source b) GPL c) Open source d) Definite source
2) How many types of R objects are present in R data type?
a) 4 b) 5 c) 6 d) 7
3) In a vector, where all arguments are forced to a common type, please tick the correct order:
a) Expression
