Full Transcript

DATA SCIENCE GRADE XII
Version 1.0
Student Handbook

ACKNOWLEDGEMENTS

Patrons
Sh. Ramesh Pokhriyal 'Nishank', Minister of Human Resource Development, Government of India
Sh. Dhotre Sanjay Shamrao, Minister of State for Human Resource Development, Government of India
Ms. Anita Karwal, IAS, Secretary, Department of School Education and Literacy, Ministry of Human Resource Development, Government of India

Advisory, Editorial and Creative Inputs
Mr. Manuj Ahuja, IAS, Chairperson, Central Board of Secondary Education

Guidance and Support
Dr. Biswajit Saha, Director (Skill Education & Training), Central Board of Secondary Education
Dr. Joseph Emmanuel, Director (Academics), Central Board of Secondary Education
Sh. Navtez Bal, Executive Director, Public Sector, Microsoft Corporation India Pvt. Ltd.
Sh. Omjiwan Gupta, Director Education, Microsoft Corporation India Pvt. Ltd.
Dr. Vinnie Jauhari, Director Education Advocacy, Microsoft Corporation India Pvt. Ltd.
Ms. Navdeep Kaur Kular, Education Program Manager, Allegis Services India

Value adder, Curator and Co-Ordinator
Sh. Ravinder Pal Singh, Joint Secretary, Department of Skill Education, Central Board of Secondary Education

ABOUT THE HANDBOOK

In today’s world, we have a surplus of data, and the demand for learning data science has never been greater. Students need a solid foundation in data science and technology to be industry ready. The objective of this curriculum is to lay the foundation for Data Science: understanding how data is collected and analyzed, and how it can be used in solving problems and making decisions. It will also cover ethical issues with data, including data governance, and build a foundation for AI-based applications of data science. Therefore, CBSE is introducing ‘Data Science’ as a skill module of 12 hours duration in class VIII and as a skill subject in classes IX-XII.
CBSE acknowledges the initiative by Microsoft India in developing this data science handbook for class XII students. This handbook introduces Classification and Regression algorithms and Unsupervised Learning, with practical examples. The course covers the theoretical concepts of data science followed by practical examples to develop critical thinking capabilities among students. The purpose of the book is to enable the future workforce to acquire data science skills early in their educational phase and build a solid foundation to be industry ready.

Contents

Data Governance
   1. What is Data Governance?
   2. Ethical Guidelines
   3. Data Privacy
Exploratory Data Analysis
   1. Introduction
   2. Univariate Analysis
   3. Multivariate Analysis
   4. Data Cleaning
Classification Algorithms I
   1. Introduction
   2. Introduction to Decision Trees
   3. Applications of Decision Trees
   4. Creating a Decision Tree
Classification Algorithms II
   1. Introduction
   2. Introduction to K-Nearest Neighbors
   3. Pros and Cons of using K-NN
   4. Cross Validation
Regression Algorithms I
   1. Introduction
   2. Introduction to Linear Regression
   3. Mean Absolute Error
   4. Root Mean Square Deviation
Regression Algorithms II
   1. Introduction
   2. Multiple Linear Regression
   3. Non-linear Regression
Unsupervised Learning
   1. Introduction
   2. Introduction to Unsupervised Learning
   3. Real-world applications of Unsupervised Learning
   4. Introduction to Clustering
   5. K-Means Clustering
Final Project I
   1. Introduction
   2. Introduction to the Project
   3. Setup Visual Studio Code and Python
   4. Gather data for the meteor showers
   5. Cleanse meteor data
   6. Write the predictor function
Final Project II
   1. Introduction
   2. Introduction to the Project
References

CHAPTER
Data Governance

Studying this chapter should enable you to understand:
- What is Data Governance?
- What are the ethical guidelines for governing data?
- What is Data Privacy?

1. What is Data Governance?

Data governance can be thought of as a collection of people, technologies, processes, and policies that protect and help to manage the efficient use of data. Through data governance, we can ensure that the quality and security of the data used is maintained. Data governance also defines who can act, upon which data and using what methods.

Hence, data governance is a data management concept that ensures that high data quality exists throughout the complete lifecycle of the data, along with effective controls on how and with whom data is shared.

Data governance focuses on areas such as data integrity, data availability, usability and data consistency.

Data Governance covers the following aspects.
- Data Quality
- Data Security
- Data Architecture
- Data Integration and Interoperability
- Data Storage

2. Ethical Guidelines

Ethics can be said to be the moral principles that govern the behavior or actions of an individual or a group. These principles help us decide what is good or bad.

Software products and data are not always used for purposes that are good for society. This is why we need to adhere to a set of guidelines that can guide us on what is right and what is wrong.

To begin with, we must make sure that qualities such as integrity, honesty, objectivity and nondiscrimination are always part of the high-level principles which should be incorporated in all our processes.

Besides that, while dealing with data, we should also seek to include the following points.
- Keep the data secure.
- Create machine learning models that are impartial and robust.
- Be as open and accountable as possible.
- Use technologies and data architecture that have the minimum intrusion necessary.

3. Data Privacy

Data privacy is the right of any individual to have control over how his or her personal information is collected and used.

Data privacy covers aspects such as:
- How personal data is collected and stored by organizations.
- Whether and how personal data is shared with third parties.
- Government policies and regulatory restrictions regarding the storage and sharing of personal information.

Data privacy is not just about secure data storage. There could be cases where personally identifiable information is collected and stored securely in an encrypted format, but the users never agreed to the collection of the data itself. In such cases, there is a clear violation of data privacy rules.

One major aspect of data privacy is that the individual is considered to be the sole owner of his or her data. In other words, individuals can ask any organization to remove all the data it has collected about them at any point in time. Data privacy rules are still evolving as awareness about data privacy continues to spread.

Some of the important legislations for data privacy are discussed below.

GDPR - General Data Protection Regulation

The European Union made the General Data Protection Regulation effective on May 25th, 2018 to protect European Union consumer data. The reformed laws are designed to help consumers gain a high level of control over their data. The regulation also offers more transparency to the end-user about the data collection and use process.

GDPR is based on the following important aspects:
- Obtaining consent
- Timely breach notification
- Right to access data
- Right to be forgotten
- Privacy by design

HIPAA - Health Insurance Portability and Accountability Act

The Health Insurance Portability and Accountability Act (HIPAA) was passed in the United States to protect healthcare information from fraud and theft. It also helps to manage Personally Identifiable Information stored by healthcare and insurance companies.

HIPAA returns control of data to individuals by giving them the option to see their data at any time, ask for corrections and report any violations of privacy that they might suspect.

Some of the personal identifiers that HIPAA protects are as follows:
- Names or parts of names
- Phone numbers, email addresses
- Geographical identifiers
- Fingerprints and retinal prints
- Social security numbers
- Medical records

CCPA - California Consumer Privacy Act

California passed the CCPA on June 28, 2018, and it went into effect on January 1, 2020. The CCPA is landmark legislation designed to protect consumer data.

The CCPA gives residents of the state of California the right to request businesses:
- To disclose what personal information the businesses have about them and what they intend to do with it
- To delete their personal information
- Not to sell their personal information

COPPA - Children’s Online Privacy Protection Act

The Children's Online Privacy Protection Act, commonly known as COPPA, is a law that deals with how websites and other online companies collect data from children under the age of 13.

It was passed in the US in 1998 and came into effect on April 21, 2000. It details what must be included in a privacy policy for children. It also addresses when and how to seek consent from parents or guardians for certain services and what responsibilities a company has to protect children's privacy and safety online.

PDP - Personal Data Protection Bill

The Personal Data Protection Bill 2019 was tabled in the Indian Parliament by the Ministry of Electronics and Information Technology on 11 December 2019. As of March 2020, the Bill is being analyzed by a Joint Parliamentary Committee (JPC) in consultation with experts and stakeholders. The Bill covers mechanisms for the protection of personal data and proposes the setting up of a Data Protection Authority of India for it.

Recap
- Data governance defines who can take action, upon which data and using what methods.
- Ethical guidelines for data stem from qualities like integrity, honesty, objectivity and nondiscrimination.
- Data privacy is the right of any individual to have control over how his or her personal information is collected and used.

Exercises

Objective Type Questions
Please choose the correct option in the questions below.

1. Which of the following statements is true?
   a. Data governance helps in effective data management.
   b. Ethical guidelines must be well-defined in any organization dealing with lots of data.
   c. Data privacy is only about secure data storage.
   d. Transparency is an important ethical guideline.
2. What are some important ethical guidelines for data?
   a. Keeping data secure.
   b. Making models impartial.
   c. Being open and accountable.
   d. All of the above.
3. Some organizations store more personal information than required. This is in accordance with Data Ethics.
   a. True
   b. False
4. Which data legislation was introduced in Europe?
   a. GDPR
   b. HIPAA
   c. COPPA
5. Which are some of the areas that data governance focuses on?
   a. Data integrity
   b. Data availability
   c. Data consistency
   d. All of the above

Standard Questions
Please answer the questions below in no less than 100 words.

1. What are some of the aspects covered by data governance?
2. Write a short note on the California Consumer Privacy Act.
3. Write a short note on the General Data Protection Regulation.

Higher Order Thinking Skills (HOTS)
Please answer the questions below in no less than 200 words.

1. In 2019, Canva, which is a famous website used for design, suffered a data breach that impacted more than 100 million users. The breach caused data such as email addresses and passwords to be leaked.
Considering this situation, discuss how the website can prevent further leaks based on ethical guidelines.
2. Write a short note on how children are at higher risk of being manipulated on the internet.

Applied Project
Discuss how data governance and data best practices are followed at your school.

CHAPTER
Exploratory Data Analysis

Studying this chapter should enable you to understand:
- What is Exploratory Data Analysis?
- What is Univariate Analysis?
- What is Multivariate Analysis?
- What are the techniques to clean data?

1. Introduction

Exploratory Data Analysis is the process of carrying out an initial analysis of the available data to find out more about the data. We usually try to find patterns, try to spot anomalies, and test any hypotheses or assumptions that we may have about the data. Exploratory Data Analysis is done with the help of summary statistics and graphical representations.

Thus, exploratory data analysis can be said to be an approach for analyzing data sets to summarize their key characteristics, often using visual methods. In real life, most exploratory data analysis (EDA) techniques are graphical, and only a few statistical techniques are used. The main reason for this is that EDA is a way to explore data quickly and find patterns, and this is best done by using graphs.

There are a number of tools and methods to perform exploratory data analysis. Some of them are discussed below.

- Univariate analysis of each feature variable in the raw dataset by preparing visualizations along with summary statistics.
- Bivariate analysis of feature variables by preparing visualizations and summary statistics that allow us to determine the relationship between two variables in the dataset.
- Multivariate analysis of multiple feature variables for mapping and understanding interactions between different fields in the data.
- Graphical analysis by plotting the raw data (histograms, probability plots and lag plots) and plotting simple statistics (mean plots, standard deviation plots, and box plots).
- Using unsupervised learning techniques like clustering to identify the number of clusters in the data set. Clustering finds application in image compression and pattern recognition.

2. Univariate Analysis

Univariate analysis can be considered the easiest form of data analysis, where we analyze only one variable from the entire dataset. Since we deal with only one variable, we do not have to worry about causes or relationships. The main purpose of univariate analysis is to describe the data and find patterns that exist within it.

For univariate analysis, we pick a variable from the dataset and try to analyze it in depth. One example of a variable in univariate analysis might be "revenue". Another might be "height". For univariate analysis, we would not look at these two variables at the same time, nor would we look at the relationship between them.

Univariate analysis techniques involve both statistical and graphical methods. Some statistical methods for univariate data include looking at mean, mode, median, range, variance, maximum, minimum, quartiles, and standard deviation.

Some graphical methods involve preparing frequency distribution tables, bar charts, histograms, frequency polygons, and pie charts.

Now let’s look at some of the graphs used for univariate analysis.

The diagram below shows a scatter plot for a single variable.

Activity 2.1
Take a look at the famous ‘iris’ dataset and create a scatter plot of the petal lengths on a graph paper.

The diagram below shows the box plot for a variable. The box plot helps us to see the quantile ranges of the variable and whether any outliers are present in the data.

The diagram below shows a histogram for a variable showing the frequency distribution versus the range.

3. Multivariate Analysis

One of the ways to do multivariate analysis is bivariate analysis. It refers to the analysis of two variables in the dataset. It is usually carried out between the target variable and another feature of the dataset. The main objective is to find out if there is a relationship between the two variables.

Bivariate analysis is usually done by using graphical methods like scatter plots, line charts, and pair plots. These simple charts can give us a picture of the relationship between the variables. Bivariate analysis is also a good way to measure the correlation between the two variables. For example, in a market survey we may want to analyze the price and sales of a product to see if there is any relationship between them.

Let us take a look at a pair plot used for bivariate analysis.

Multivariate analysis is a more complex form of statistical analysis and is used to analyze more than two variables in the data set. There are several ways to do a multivariate analysis, and the right one depends on your goals. Some of these methods include Canonical Correlation Analysis, Cluster Analysis, contour plots, and Principal Component Analysis.

Activity 2.2
Take a look at the ‘iris’ dataset and try to plot the values of the petal length vs the sepal length. Do you find a positive relationship between the two?

4. Data Cleaning

Data cleaning is an essential and often overlooked step in the pre-processing of data. It refers to the process of identifying incorrect, incomplete, and inaccurate data. We then clean the dataset by either removing the incorrect data or replacing it with better data.

Data cleaning is a fundamental aspect of data science. If you have a dataset that has been cleaned well, even simple algorithms can learn and give impressive insights from the data. Different types of data might need different approaches to cleaning. However, there is a systematic approach that works well on all kinds of data.

Some of the steps of data cleaning are as mentioned below.

1. Remove duplicate observations - Duplicate observations most frequently arise during data collection, especially when we combine datasets from multiple places or scrape data from online sources. It is important to remove duplicates, otherwise they can adversely affect the models we build.

2. Remove irrelevant observations - Quite often we are presented with datasets that have lots of extra, irrelevant data that does not add any value to the problem we are trying to solve. In such cases, we should remove the columns or rows of data that are irrelevant so that the model only has to learn from good, relevant data.

3. Remove unwanted outliers - Outliers can cause problems with certain types of models. For example, linear regression models are less robust to outliers than decision tree models. In general, you should always look out for outliers in your data and remove them if they are unwanted, so that your model's performance improves. However, some outliers might be valid data, so those values should be preserved.

4. Fix data type issues - Data types are an often overlooked aspect of data. Many times, numerical or DateTime data might be saved as text. If this is not corrected in the data cleaning step, it will cause problems when we use it to build a model. Therefore, you should always correct data type issues.

5. Handle missing data - Missing data is also a common issue with datasets. Many machine learning algorithms do not work well with missing data, so this must be handled well during the cleaning stage. Two common techniques for handling missing data are to either remove that row of data or to insert a value that is quite close to the mean or mode of the variable that is missing.

For example, the height of students is univariate data.
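The cleaning steps listed above can be sketched in a few lines of Python, the language the handbook's Final Project uses. This is a minimal illustration only: the student records, the column choice and the function names are hypothetical, and outlier removal (step 3) is left out because the text does not name a specific rule for it.

```python
from statistics import mean

# Hypothetical raw records: (student name, height in cm stored as text, favourite colour).
RAW_ROWS = [
    ("Asha", "150", "blue"),
    ("Ravi", "160", "red"),
    ("Asha", "150", "blue"),   # exact duplicate of the first row
    ("Meena", "", "green"),    # missing height
    ("Kabir", "170", "blue"),
]


def remove_duplicates(rows):
    """Step 1: keep the first copy of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def drop_irrelevant(rows):
    """Step 2: keep name and height; colour is assumed irrelevant to the problem."""
    return [(name, height) for name, height, _colour in rows]


def fix_types(rows):
    """Step 4: convert height text to float; empty text becomes None (missing)."""
    return [(name, float(h) if h.strip() else None) for name, h in rows]


def fill_missing_with_mean(rows):
    """Step 5: replace missing heights with the mean of the known heights."""
    known = [h for _name, h in rows if h is not None]
    fill = mean(known)
    return [(name, fill if h is None else h) for name, h in rows]


def clean(rows):
    """Apply the cleaning steps in order and return the cleaned rows."""
    rows = remove_duplicates(rows)
    rows = drop_irrelevant(rows)
    rows = fix_types(rows)
    return fill_missing_with_mean(rows)


cleaned = clean(RAW_ROWS)
for name, height in cleaned:
    print(name, height)
```

Filling the gap with the mean keeps the row in the dataset; dropping the row instead is the other technique the text mentions, and which one is better depends on how much data is missing.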
Recap
- Exploratory Data Analysis is the process of carrying out an initial analysis of the available data.
- EDA involves graphical methods like frequency distribution tables, bar charts, histograms, frequency polygons, and pie charts.
- Univariate analysis is the simplest form of data analysis, where we analyze only one variable.
- Bivariate analysis refers to the analysis of two variables in the dataset.
- Multivariate analysis refers to the analysis of three or more variables.
- Data cleaning refers to the process of removing the incorrect data or replacing it with better data.

Exercises

Objective Type Questions
Please choose the correct option in the questions below.

1. You need to check the relationship between two variables. Which graph would you use?
   a. Histogram
   b. Pair plot
   c. Box plot
   d. None of the above
2. You need to check if a variable has outliers. Which graph would you use?
   a. Histogram
   b. Pair plot
   c. Box plot
   d. None of the above
3. You need to perform a multivariate analysis. Which graph will you use?
   a. Contour plot
   b. Scatter plot
   c. Box plot
   d. None of the above
4. You need to perform a univariate analysis. Which graph will you use?
   a. Scatter plot
   b. Histogram
   c. Contour plot
   d. Both a and b
5. What is a data cleaning step?
   a. Removing duplicates
   b. Removing outliers
   c. All of the above

Standard Questions
Please answer the questions below in no less than 100 words.

1. What are some of the differences between univariate and multivariate analysis? Give some examples.
2. What are the ways to handle missing data?
3. What are some of the methods for univariate analysis?
4. What are the steps for cleaning raw data?

Higher Order Thinking Skills (HOTS)
Please answer the questions below in no less than 200 words.

1. What problems can outliers cause?
2. Why should irrelevant observations be removed from the data?
3. How can we use unsupervised learning for EDA?
Applied Project
Using the iris dataset provided in R Studio, perform a univariate analysis by creating scatter plots of sepal length, sepal width, petal length, and petal width.

CHAPTER
Classification Algorithms I

Studying this chapter should enable you to understand:
- What is a Decision Tree?
- How are Decision Trees used in Data Science?
- How to create a Decision Tree?

1. Introduction

In the last chapter, we learned how to conduct an exploratory data analysis using various graphical techniques and how to clean the data that has been collected. In this chapter we will see how we can classify the data using an important algorithm called Decision Trees.

2. Introduction to Decision Trees

A Decision tree is a diagrammatic representation of the decision-making process and has a tree-like structure. Each internal node in the decision tree denotes a question on choosing a particular class. Every branch represents the outcome of the test, and each leaf node holds a class label.

You often use Decision Trees in your daily life without noticing them. For example, when you go to a supermarket to buy bread for your family, the question which comes to your mind is: how much bread should I buy today?

To answer the question, you subconsciously make calculations and purchase the required quantity of bread.

- Is it a weekday? On weekdays we require 1 packet of bread.
- Is it a weekend? On weekends we require 2 packets of bread.
- Are we expecting any guests today? We need to buy extra bread for each guest.

The diagram below shows a sample decision tree. Each leaf node contains a class label, and we split the entire population based on the test or criteria at each decision node.

Thus, in the end, we can classify the population based on the criteria we choose.

Decision Trees are considered to be one of the most efficient classification techniques. An even better way of using decision trees is to use the Random Forest algorithm, which makes predictions based on the outcomes of several decision trees.

Activity 3.1
Think about an everyday activity where you need to make a decision by thinking about several possibilities. Can a Decision Tree help you make an effective decision?

3. Applications of Decision Trees

Decision Trees and other tree-based learning algorithms are considered to be among the best and most widely used supervised learning methods. They are important because they are easy to visualize, understand and interpret.

Sometimes the trend in the data is not linear, so we cannot apply linear classification techniques to these problems. The linear approaches we’ve seen so far will not produce accurate results. For such cases, we need to build our models differently. Decision trees are a good tool for classifying observations when the trend is non-linear.

Decision Trees are versatile, as they can be applied to either kind of problem at hand: classification or regression. Also, unlike the linear models that we have studied earlier, decision trees map both linear and non-linear relationships quite well.

Decision tree outputs are very easy to understand, even for people from a non-analytical background. They do not require any statistical knowledge to read and interpret. Their graphical representation is very intuitive, and users can easily relate it to their hypothesis.

Another major advantage of decision trees is that they can handle both numerical and categorical variables. Therefore, they require fewer data cleaning steps compared to some other modeling techniques. They are also, to a fair degree, not much influenced by outliers and missing values.

Decision trees are used to solve both classification and regression problems. However, there are certain differences between the two. Let us take a brief look at them.

1. Regression trees are used when the dependent variable is continuous. Classification trees are used when the dependent variable is categorical.

2. In case of a regression tree, the value of the terminal nodes after training is the mean of the observations. Thus, predictions on unseen data are made using the mean.

3. In case of a classification tree, the value or class of the terminal nodes after training is the mode of the observations. Thus, predictions on unseen data are made using the mode.

The same data when plotted in a column chart will look like the below.

Activity 3.2
Try to find a real-world scenario where decision trees are used for classification.

4. Creating a Decision Tree

Do you know a real-world decision tree that helped save lives?

In the late 1970s, Lee Goldman, a U.S. Navy cardiologist, developed a decision tree to determine if a person was likely to have a heart attack. Lee spent years developing and testing a single model that would allow submarine doctors to quickly evaluate possible heart attack symptoms and determine if the submarine had to resurface and evacuate the chest pain sufferer.

This visual and simplified approach to decision making was one of the first uses of decision trees in real-world scenarios.

To create a decision tree, you can follow the steps below.

1. Think about the main objective for which you are creating the decision tree. The main decision that you are trying to make should be placed at the very top of the decision tree. Therefore, the main objective should be the “root” of the entire diagram.

2. Next, you need to draw the branches and leaf nodes. For every possible decision stemming from the root, make a branch. One root or node can have two or more branches. At the end of the branches, attach the leaf nodes. The leaf nodes should represent the results of each decision. If another decision has to be made, draw a square leaf node. If the outcome is not quite certain, you should draw a circular node.

3. Finally, you need to calculate the probability of success of each decision being made. While creating the decision tree, it is essential to do some research, so that you can predict the probability of each decision. To do this research, you may examine old data or assess previous projects. Once you calculate the expected value of each decision in a tree, put the values on the branches.

A decision tree can also be written in linear form as a set of decision rules, where the outcome of each rule is the content of the leaf node.

Activity 3.3
Create your own decision tree for a daily activity based on the steps above.

Recap
- A Decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute.
- An even better way of using decision trees is to use the Random Forest algorithm, which makes predictions based on the outcomes of several decision trees.
- Decision tree outputs are very easy to understand, even for people from a non-analytical background.
- A major advantage of decision trees is that they can handle both numerical and categorical variables.
- A visual and simplified approach to decision making was one of the first uses of decision trees in real-world scenarios.

Exercises

Objective Type Questions
Please choose the correct option in the questions below.

1. Which of the following are parts of a Decision Tree?
   a. Decision Node
   b. Leaf Node
   c. Branch
   d. All of the above
2. Which of the following statements is false?
   a. Decision Trees can contain only one branch.
   b. Decision Trees can be used for classification and regression.
   c. Random Forests algorithm uses Decision Trees.
   d. None of the above
3. Which of the following is a use case for Decision Trees?
   a. Classification
   b. Regression
   c. Both of the above
4. A decision tree can be further divided into sub-trees.
   a. True
   b. False
5. Decision Trees are easier to understand and interpret.
   a. True
   b. False

Standard Questions
Please answer the questions below in no less than 100 words.

1. Write a short note on the application of classification algorithms.
2. In your own words, write down the steps to create a decision tree.
3. Write two advantages of using a decision tree.

Higher Order Thinking Skills (HOTS)
Please answer the questions below in no less than 200 words.

1. Write two disadvantages of using a decision tree.
2. Write a short note on the Random Forest algorithm.

Applied Project
In this exercise, we will use R Studio to generate a Decision Tree model to classify the famous iris dataset. The iris dataset has 150 rows. Each row has the data of the iris plant under five attributes: sepal length, sepal width, petal length, petal width and species. There are three different kinds of iris in the dataset and each type has 50 rows of data. The three types are setosa, versicolor and virginica. The iris dataset is already present in R Studio and does not need to be loaded. The objective of the exercise is to build a Decision Tree model which can correctly classify a new iris flower into one of the three groups: setosa, versicolor, and virginica.

Launch R Studio and follow the steps below.

1. First, let us take a look at the data to see the attributes, rows, and columns. To do so, paste the code below in R Studio and click on Run.

library(rpart)
library(rpart.plot)
v
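To complement the R exercise, the idea behind a single decision node can be sketched in plain Python, the language the handbook's Final Project uses. This is only an illustration under stated assumptions: the petal lengths are made-up values rather than rows from the iris dataset, the function names are hypothetical, and Gini impurity is one common way to score a split that the chapter itself does not name. The sketch also shows the point made in the chapter that a classification leaf predicts the mode while a regression leaf predicts the mean.

```python
from collections import Counter
from statistics import mean

# Made-up (petal length in cm, species) pairs for illustration only;
# these are NOT rows copied from the iris dataset.
SAMPLES = [
    (1.4, "setosa"), (1.5, "setosa"), (1.3, "setosa"),
    (4.5, "versicolor"), (4.7, "versicolor"), (4.1, "versicolor"),
]


def gini(labels):
    """Gini impurity of a group: 0.0 when every label is the same class."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_threshold(samples):
    """Try midpoints between sorted values; return (threshold, weighted impurity)."""
    values = sorted({x for x, _ in samples})
    best = None
    for low, high in zip(values, values[1:]):
        t = (low + high) / 2
        left = [y for x, y in samples if x <= t]
        right = [y for x, y in samples if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(samples)
        if best is None or score < best[1]:
            best = (t, score)
    return best


def classification_leaf(labels):
    """A classification leaf predicts the mode (most common class)."""
    return Counter(labels).most_common(1)[0][0]


def regression_leaf(values):
    """A regression leaf predicts the mean of its observations."""
    return mean(values)


# Build a one-node tree: a root test with two leaves.
threshold, impurity = best_threshold(SAMPLES)
left_label = classification_leaf([y for x, y in SAMPLES if x <= threshold])
right_label = classification_leaf([y for x, y in SAMPLES if x > threshold])


def predict(petal_length):
    """Follow the branch chosen by the root test and return the leaf's class."""
    return left_label if petal_length <= threshold else right_label


print("split at petal length <=", threshold, "with impurity", impurity)
print(predict(1.6), predict(4.4))
```

A full decision tree repeats the same search inside each branch until the leaves are pure enough, and a Random Forest combines the predictions of many such trees.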
