Data Sciences - Grade 10 AI PDF

Summary

This document provides an overview of data science concepts and applications, particularly in the context of artificial intelligence (AI). It discusses various domains of AI, including data sciences, computer vision, and natural language processing. Furthermore, it touches upon applications such as fraud and risk detection, genetics and genomics, and website recommendations.

Full Transcript

4. Data Sciences

Each domain of AI has its own type of data which gets fed into the machine, and hence its own way of working with it. Data Science is a concept that unifies statistics, data analysis, machine learning and their related methods in order to understand and analyse actual phenomena with data. It employs techniques and theories drawn from many fields within the context of Mathematics, Statistics, Computer Science and Information Science. Data Science combines Python with mathematical concepts such as statistics, data analysis and probability. Concepts of Data Science can be used in developing applications around AI, as they give a strong base for data analysis in Python.

Applications of Data Science

Data Science is not a new field. It majorly works around analysing data, and when it comes to AI, this analysis helps in making the machine intelligent enough to perform tasks by itself. There exist various applications of Data Science in today's world:

1) Fraud and risk detection
2) Genetics and genomics
3) Internet search
4) Targeted advertising
5) Website recommendations
6) Airline route planning: predicting flight delays; deciding which class of airplanes to buy; deciding whether to fly directly to the destination or take a halt in between (for example, a flight can have a direct route from New Delhi to New York, or it can choose to halt in some country along the way); effectively driving customer loyalty programs

Petabyte: a unit of information equal to one thousand million million (10^15) or, strictly, 2^50 bytes.

FA: Mind map the applications of Data Science

Practice Task
Rock, Paper & Scissors: https://www.essentially.net/rsp/play.jsp
What was the strategy you used to play? What differences did you notice while playing the same game with a human?

Revisiting the AI Project Cycle

Scenario
Humans are social animals. We tend to organise and/or participate in various kinds of social gatherings all the time. We love eating out with friends and family, because of which we can find restaurants almost everywhere; many of these restaurants arrange buffets to offer a variety of food items to their customers. Be it small shops or big outlets, every restaurant prepares food in bulk as they expect a good crowd to come and enjoy their food. But in most cases, a lot of food is left at the end of the day, which becomes unusable for the restaurant as they do not wish to serve stale food to their customers the next day. So, every day, they prepare food in large quantities keeping in mind the probable number of customers walking into their outlet. If those expectations are not met, a good amount of food gets wasted, which becomes a loss for the restaurant as they either have to dump it or give it to hungry people for free. Taken over a whole year, this daily loss adds up to quite a big amount.

Problem Scoping
Now that we have understood the scenario well, let us take a deeper look into the problem to find out more about the various factors around it. Let us fill up the 4Ws problem canvas to find out.
Problem statement: Reference Slides 30, 31, 32

System Map
In this system map, you can see how the relationship of each element is defined with respect to the goal of our project. Recall that positive arrows indicate a direct relationship between elements, while negative arrows indicate an inverse relationship. After looking at the factors affecting our problem statement, it is time to look at the data to be acquired for the goal.
Data Acquisition
For this problem, a dataset covering all the elements mentioned above is made for each dish prepared by the restaurant over a period of 30 days. This data is collected offline in the form of a regular survey, since this is a personalised dataset created just for one restaurant's needs. Specifically, the data collected comes under the following categories: name of the dish, price of the dish, quantity of the dish produced per day, quantity of the dish left unconsumed per day, total number of customers per day, fixed customers per day, etc.

Now let us understand how these factors are related to our problem statement. For this, we can use the System Maps tool to figure out the relationship of the elements with the project's goal. Here is the system map for our problem statement. Reference: Slides 44, 45

Data Exploration

Modelling
Once the dataset is ready, we train our model on it. In this case, a regression model is chosen, in which the dataset is fed as a dataframe (a two-dimensional data structure, i.e., data aligned in a tabular fashion in rows and columns) and the model is trained accordingly. Regression is a supervised learning model which takes in continuous values of data over a period of time. Since in our case we have continuous data over 30 days, we can use a regression model to predict the next values in a similar manner. The dataset of 30 days is divided in a ratio of 2:1 for training and testing respectively: the model is first trained on the 20 days of training data and then evaluated on the remaining 10 days.

Evaluation
Once the model has been trained on the training dataset of 20 days, it is time to see whether it is working properly. Let us see how the model works and how it is tested.
Step 1: The trained model is fed data regarding the name of the dish and the quantity produced for the same.
Step 2: It is then fed data regarding the quantity of food left unconsumed for the same dish on previous occasions.
Step 3: The model then works upon the entries according to the training it received at the modelling stage.
Step 4: The model predicts the quantity of food to be prepared for the next day.
Step 5: The prediction is compared to the testing dataset value. From the testing dataset, ideally, the quantity of food to be produced for the next day's consumption should be the total quantity minus the unconsumed quantity.
Step 6: The model is tested on the 10 days of testing data kept aside while training.
Step 7: The prediction values for the testing dataset are compared to the actual values.
Step 8: If the prediction values are the same as, or close to, the actual values, the model is said to be accurate. Otherwise, either a different model is selected or the model is trained on more data for better accuracy.
Once the model achieves optimum efficiency, it is ready to be deployed in the restaurant for real-time usage.
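To make the Modelling and Evaluation stages concrete, here is a minimal Python sketch of the 2:1 train/test workflow described above. It is not the project's actual code: the column names are hypothetical, the numbers are randomly generated stand-ins for the real 30-day survey, and scikit-learn's LinearRegression is just one possible choice of regression model.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for the restaurant's 30-day survey (hypothetical columns)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "day": np.arange(1, 31),
    "quantity_produced": rng.integers(80, 120, 30),
    "quantity_unconsumed": rng.integers(5, 25, 30),
})
# Target: quantity to prepare = total produced minus unconsumed (Step 5)
df["quantity_needed"] = df["quantity_produced"] - df["quantity_unconsumed"]

# 2:1 split -> first 20 days for training, last 10 days for testing
train, test = df.iloc[:20], df.iloc[20:]

model = LinearRegression()
model.fit(train[["day", "quantity_produced"]], train["quantity_needed"])

# Evaluation: compare predictions with the actual values for the 10 test days
predictions = model.predict(test[["day", "quantity_produced"]])
error = np.abs(predictions - test["quantity_needed"]).mean()
print("Mean absolute error over the 10 test days:", round(error, 2))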
Data Collection
Data collection is nothing new in our lives; it has been part of our society for ages. Even when people did not have a fair knowledge of calculations, records were still maintained in some way or the other to keep account of relevant things. Data collection is an exercise which does not require even a tiny bit of technological knowledge. But when it comes to analysing the data, it becomes a tedious process for humans, as it is all about numbers and alpha-numerical data. That is where Data Science comes into the picture. It not only gives us a clearer idea of the dataset, but also adds value to it by providing a deeper and clearer analysis of it. And as AI gets incorporated into the process, predictions and suggestions by the machine become possible as well.

Now that we have gone through an example of a Data Science based project, we have some clarity regarding the type of data that can be used to develop such a project. For data-domain-based projects, the data used is mostly in numerical or alpha-numerical format, and such datasets are curated in the form of tables. Such databases are very commonly found in any institution for record maintenance and other purposes. While accessing data from any data source, the following points should be kept in mind:
- Only data which is available for public usage should be taken up.
- Personal datasets should only be used with the consent of the owner.
- One should never breach someone's privacy to collect data.
- Data should only be taken from reliable sources, as data collected from random sources can be wrong or unusable. Reliable sources ensure the authenticity of the data, which helps in proper training of the AI model.

Types of Data
For Data Science, the data is usually collected in the form of tables. These tabular datasets can be stored in different formats. Some of the commonly used formats are:
CSV: CSV stands for comma-separated values. It is a simple file format used to store tabular data. Each line of a CSV file is a data record, and each record consists of one or more fields separated by commas. Since the values of the records are separated by commas, these files are known as CSV files.
Spreadsheet: A spreadsheet is a piece of paper or a computer program used for accounting and recording data in rows and columns into which information can be entered. Microsoft Excel is a program which helps in creating spreadsheets.
SQL: SQL, or Structured Query Language, is a domain-specific programming language designed for managing data held in different kinds of DBMS (Database Management Systems). It is particularly useful in handling structured data.
A lot of other database formats also exist; you can explore them online!

Several Python packages help us access structured data (in tabular form) inside our code. Some of the important ones are:
- NumPy, which stands for Numerical Python, the fundamental package in Python for mathematical and logical operations on numbers
- Matplotlib, a visualisation library in Python for 2D plots of arrays
- Pandas, for data manipulation and analysis in Python
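Before turning to those packages, here is a short, plain-Python look at the CSV format described above, using the built-in csv module. The file name menu.csv and its columns are made up for illustration.

import csv

# menu.csv might contain records such as:
#   Dish,Price,Quantity_Produced,Quantity_Unconsumed
#   Dal Makhani,150,40,5
#   Paneer Tikka,200,30,8
with open("menu.csv", newline="") as f:
    for record in csv.reader(f):   # each line of the file is one record
        print(record)              # fields arrive as a list of strings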
DATA SCIENCE - PRACTICAL (not included in the theory exam)

Data Access
After collecting the data, to be able to use it for programming purposes, we should know how to access it in a Python code. To make our lives easier, there exist various Python packages which help us access structured data (in tabular form) inside the code. Let us take a look at some of these packages.

NumPy
NumPy, which stands for Numerical Python, is the fundamental package for mathematical and logical operations on arrays in Python. It is a commonly used package when it comes to working with numbers. NumPy offers a wide range of arithmetic operations on numbers, giving us an easier approach to working with them. NumPy also works with arrays, which are homogeneous collections of data: an array is a set of multiple values of the same datatype. The values can be numbers, characters, booleans, etc., but only one datatype can be stored in a given array. In NumPy, the arrays used are known as ND-arrays (N-Dimensional Arrays), as NumPy comes with a feature for creating n-dimensional arrays in Python. An array can easily be compared to a list; the sketch below shows a few ways in which they differ.
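To make this comparison concrete, here is a small sketch (the values are arbitrary) of how NumPy arrays differ from Python lists:

import numpy as np

prices_list = [100, "dal", True]         # a list may freely mix datatypes
prices_array = np.array([100, 450, 33])  # an ND-array holds one datatype only
# np.array([100, "dal", True]) would quietly convert every element to a string

# Arithmetic is element-wise on arrays but not on lists:
print(prices_array * 2)      # [200 900  66] -> each element doubled
print([100, 450, 33] * 2)    # [100, 450, 33, 100, 450, 33] -> list repeated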
Pandas
Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. The name is derived from the term "panel data", a term for data sets that include observations over multiple time periods for the same individuals. Pandas is well suited for many different kinds of data:
- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data
- Arbitrary matrix data (homogeneously or heterogeneously typed) with row and column labels
- Any other form of observational/statistical data sets; the data need not be labelled at all to be placed into a Pandas data structure

The two primary data structures of Pandas, Series (one-dimensional) and DataFrame (two-dimensional), handle the vast majority of typical use cases in finance, statistics, social science and many areas of engineering. Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment alongside many other third-party libraries. Here are just a few of the things Pandas does well:
- Easy handling of missing data (represented as NaN) in floating-point as well as non-floating-point data
- Size mutability: columns can be inserted into and deleted from DataFrames and higher-dimensional objects
- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data in computations
- Intelligent label-based slicing, fancy indexing and subsetting of large data sets
- Intuitive merging and joining of data sets
- Flexible reshaping and pivoting of data sets
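As a small illustration of the Series and DataFrame structures (every name and value below is made up):

import numpy as np
import pandas as pd

# A Series: one-dimensional labelled data
prices = pd.Series([150, 200, 120], index=["Dal", "Paneer", "Rice"])

# A DataFrame: two-dimensional tabular data with labelled rows and columns
sales = pd.DataFrame({
    "produced":   [40, 30, 50],
    "unconsumed": [5, np.nan, 8],   # NaN marks a missing value
}, index=["Dal", "Paneer", "Rice"])

print(sales["unconsumed"].isna().sum())              # count missing entries -> 1
sales["unconsumed"] = sales["unconsumed"].fillna(0)  # one way to handle NaN
print(sales)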
Matplotlib
Matplotlib is an amazing visualisation library in Python for 2D plots of arrays. It is a multi-platform data visualisation library built on NumPy arrays. One of the greatest benefits of visualisation is that it gives us visual access to huge amounts of data in easily digestible form. Matplotlib comes with a wide variety of plots, such as line plots, bar charts, scatter plots and box plots. Plots help us understand trends and patterns and to make correlations; they are typically instruments for reasoning about quantitative information.

Basic Statistics with Python
Percentile
The percentile of a data value, within a set of data values, is a statistical measure that gives the percentage of data values that fall below that value. For example, in a group of 20 children, Ben is the 4th tallest and 80% of the children are shorter than him; hence Ben is at the 80th percentile. The advantage of using Python packages is that we do not need to derive our own formula or equation to find such results. Packages like NumPy come with many pre-defined functions, which saves us the trouble: all we need to do is call the function and pass the data to it. It's that simple!

Data Visualisation
While collecting data, it is possible that the data comes with some errors. Let us first take a look at the types of issues we can face with data:
1. Erroneous data. There are two ways in which data can be erroneous:
Incorrect values: values at random places in the dataset are incorrect; for example, a decimal value in the phone-number column, or a name in the marks column. These values do not resemble the kind of data expected in that position.
Invalid or null values: in some places, the values get corrupted and become invalid. You will often find NaN values in a dataset; these are null values which do not hold any meaning and are not processable, which is why they are removed from the dataset as and when they are encountered.
2. Missing data: in some datasets, some cells simply remain empty. Missing data cannot be interpreted as an error, as the values here are not erroneous and might not be missing because of any error.
3. Outliers: data which does not fall within the range of the other values of an element is referred to as an outlier. To understand this better, consider the marks of students in a class. Suppose a student was absent for an exam and hence got 0 marks in it. If those marks are taken into account, the whole class's average goes down. To prevent this, the average is taken over the usual range of marks, keeping this particular result separate. This makes sure that the class average truly reflects the data.

Analysing collected data can be difficult, as it is all about tables and numbers. While machines work efficiently on numbers, humans need a visual aid to understand and comprehend the information. Hence, data visualisation is used to interpret the collected data and to identify patterns and trends in it. In Python, the Matplotlib package helps in visualising the data and making sense of it. As discussed before, this package lets us plot various kinds of graphs. Let us look at one of them: the box plot.

Box Plot
A box plot consists of a box with two lines, called whiskers, extending to its left and right. There are five parts to this plot:
Quartile 1 (0th to 25th percentile): data lying between the 0th and 25th percentile is plotted here. If the data in this range is close together (say it spans only a 20-30 marks range), the whisker is short; if the range is large (say 0-30 marks), the whisker gets elongated.
Quartile 2 (25th to 50th percentile): the 50th percentile is the median of the whole distribution, and since the data falling between the 25th and 75th percentiles has minimum deviation from the median, it is plotted inside the box.
Quartile 3 (50th to 75th percentile): this range is again plotted in the box, as its deviation from the median is small. Quartiles 2 and 3 together (25th to 75th percentile) constitute the Inter-Quartile Range (IQR). Just like the whiskers, the length of the box also varies depending on whether the data is less or more spread out.
Quartile 4 (75th to 100th percentile): this is the whisker for the top 25 per cent of the data.
Outliers: an advantage of box plots is that they clearly show the outliers in a data distribution. Points that do not lie in the range are plotted outside the box and whiskers as dots or circles and are termed outliers, as they do not belong to the range of the data. Since being out of range is not an error, they are still plotted on the graph for visualisation.

SAMPLE PRACTICE QUESTIONS
Rohan went shopping for various essentials for his house. To help him keep better track of his essentials and the cost incurred on those items, create a NumPy array with the cost incurred on the items. The list with the incurred prices is as follows:
price = [100, 450, 33, 280, 135, 157, 680]
Perform the following tasks on the list mentioned above:
1) Convert the list into a NumPy array and print it
2) Sort the array into ascending order
3) Multiply each element by 2
4) Create a new array where the prices of the items at odd positions are decreased by 10%, and display it

import numpy as np
price = [100, 450, 33, 280, 135, 157, 680]
# task 1: convert the list to a NumPy array and print it
p = np.array(price)
print(p)
# task 2: sort the array in ascending order
print(np.sort(p))
# task 3: multiply each element by 2
p1 = p * 2
print(p1)
# task 4: new array with the prices at odd positions reduced by 10%
# (interpreting "odd positions" as 0-based indices 1, 3, 5)
p2 = p.astype(float)          # a fresh copy that can hold decimal results
p2[1::2] = p2[1::2] * 0.9
print(p2)

Pandas
To read a data file stored in Google Drive from a Colab notebook, the drive first has to be mounted. When the mounting code is executed, a dialogue box opens asking for permission to access the drive; select "Allow". After the "Mounted at /content/drive" message appears:
- On the left-hand side, select the folder icon (the leftmost icon); you will see the drive folder.
- Click on the drive folder and select the MyDrive folder.
- Select the folder where the data file is stored.
- Select the file, click on the three dots that appear to its right, and select "Copy path".
- Paste the copied path inside the round brackets, within single or double quotes.
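The mounting code referred to above is not reproduced in this transcript. As a rough sketch, assuming a Google Colab notebook, it typically looks like the following; the file path is a hypothetical example of a copied path.

from google.colab import drive   # available inside Google Colab notebooks
import pandas as pd

drive.mount('/content/drive')    # opens the permission dialogue box

# Paste your own copied path between the quotes:
df = pd.read_csv('/content/drive/MyDrive/MyFolder/restaurant_data.csv')
print(df.head())                 # the output: the first five rows of the file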
