Data Science Theory
UNIT 4: DATA SCIENCE

LEARNING OBJECTIVES
1. Define the concept of Data Science and understand its applications in various fields.
2. Understand the basic concepts of data acquisition, visualization, and exploration.

Data Science
Data Science is a concept that unifies statistics, data analysis, machine learning, and their related methods in order to understand and analyse actual phenomena with data. It employs techniques and theories drawn from many fields within the context of Mathematics, Statistics, Computer Science, and Information Science.

Applications of Data Science
Data Science mainly works around analysing data, and when it comes to AI, this analysis helps in making the machine intelligent enough to perform tasks by itself. There exist various applications of Data Science in today's world. Some of them are:

Fraud and Risk Detection: The earliest applications of data science were in finance. Companies were fed up with the bad debts and losses they suffered every year. However, they had a lot of data that used to be collected during the initial paperwork while sanctioning loans, so they brought in data scientists to rescue them from these losses. Over the years, banking companies have learned to divide and conquer data via customer profiling, past expenditures, and other essential variables to analyse the probabilities of risk and default. This has also helped them push their banking products based on each customer's purchasing power.

Genetics & Genomics: Data Science applications also enable an advanced level of treatment personalization through research in genetics and genomics. The goal is to understand the impact of DNA on our health and to find individual biological connections between genetics, diseases, and drug response.
Data science techniques allow the integration of different kinds of data with genomic data in disease research, which provides a deeper understanding of genetic issues in reactions to particular drugs and diseases. As soon as we acquire reliable personal genome data, we will achieve a deeper understanding of human DNA. Advanced genetic risk prediction will be a major step towards more individualized care.

Internet Search: When we talk about search engines, we think 'Google'. Right? But there are many other search engines like Yahoo, Bing, Ask, AOL, and so on. All these search engines (including Google) make use of data science algorithms to deliver the best results for a searched query in a fraction of a second. Considering that Google processes more than 20 petabytes of data every day, had there been no data science, Google would not have been the 'Google' we know today.

Targeted Advertising: If you thought Search was the biggest of all data science applications, here is a challenger: the entire digital marketing spectrum. From the display banners on various websites to the digital billboards at airports, almost all of them are placed using data science algorithms. This is the reason why digital ads have been able to achieve a much higher CTR (Click-Through Rate) than traditional advertisements: they can be targeted based on a user's past behaviour.

Website Recommendations: Aren't we all used to the suggestions about similar products on Amazon? They not only help us find relevant products among the billions available but also add a lot to the user experience. Many companies have fervidly used this kind of recommendation engine to promote their products in accordance with users' interests and the relevance of information.
Internet giants like Amazon, Twitter, Google Play, Netflix, LinkedIn, IMDb and many more use this system to improve the user experience. The recommendations are made based on a user's previous search results.

Airline Route Planning: The airline industry across the world is known to bear heavy losses. Except for a few airline service providers, companies are struggling to maintain their occupancy ratios and operating profits. With the steep rise in air-fuel prices and the need to offer heavy discounts to customers, the situation has become worse. It wasn't long before airline companies started using Data Science to identify strategic areas of improvement. Using Data Science, airline companies can now:
1. Predict flight delays
2. Decide which class of airplanes to buy
3. Decide whether to fly directly to the destination or take a halt in between (for example, a flight can take a direct route from New Delhi to New York, or it can choose to halt in another country along the way)
4. Effectively drive customer loyalty programs

Revisiting the AI Project Cycle (a project on Data Science)
Humans are social animals. We tend to organise and/or participate in various kinds of social gatherings all the time. We love eating out with friends and family, which is why we can find restaurants almost everywhere, and many of these restaurants arrange buffets to offer a variety of food items to their customers. Be it a small shop or a big outlet, every restaurant prepares food in bulk as they expect a good crowd to come and enjoy their food. But in most cases, a lot of food is left at the end of the day, which becomes unusable for the restaurant, as they do not wish to serve stale food to their customers the next day. So, every day, they prepare food in large quantities keeping in mind the probable number of customers walking into their outlet.
But if the expectations are not met, a good amount of food gets wasted, which eventually becomes a loss for the restaurant, as they either have to dump it or give it away to hungry people for free. And if this daily loss is taken into account for a whole year, it adds up to quite a big amount.

Revisiting the AI Project Cycle: Problem Scoping
Now that we have understood the scenario well, let us take a deeper look into the problem to find out more about the various factors around it. Let us fill up the 4Ws problem canvas to find out. Who Canvas: who is having the problem?

Data Acquisition
In our scenario, the various factors that would affect the quantity of food to be prepared for the next day's consumption in buffets would be:
- Total number of customers
- Quantity of dish prepared per day
- Dish consumption
- Unconsumed dish quantity per day
- Price of dish
- Quantity of dish for the next day

For this, we can use the System Maps tool to figure out the relationship of these elements with the project's goal. Here is the System Map for our problem statement. Click or copy-paste the link below to view it:
https://ncase.me/loopy/v1.1/?data=[[[3,335,172,0.5,%22dishquantity%2520prepared%2520per%2520day%2520%22,2],[4,545,171,0.5,%22Total%2520Customers%2520%22,4],[5,723,543,0.5,%22unconsumed%2520dish%252Fday%22,2],[6,503,380,0.5,%22dish%2520consumption%2520%22,5],[7,216,392,0.5,%22price%2520of%2520consumptionn%22,5],[9,733,272,0.5,%22Quantity%2520of%2520dish%2520per%2520day%2520%22,2]],[[7,3,11,-1,0],[7,6,-16,-1,0],[6,5,31,-1,0],[5,9,-20,-1,0],[6,9,-17,1,0],[4,6,12,1,0],[6,3,-9,1,0],[3,9,-73,1,0]],[],9%5D

Data Exploration
After creating the database, we now need to look at the data collected and understand what is required out of it.
In this case, since the goal of our project is to predict the quantity of food to be prepared for the next day, we need to have the relevant data, such as the quantity of each dish prepared per day and the quantity left unconsumed. Thus, we extract the required information from the curated dataset and clean it up in such a way that there exist no errors or missing elements in it.

Data Modelling
Once the dataset is ready, we train our model on it. In this case, a regression model is chosen, in which the dataset is fed as a dataframe and the model is trained accordingly. Regression is a Supervised Learning model which takes in continuous values of data over a period of time. Since in our case we have continuous data for 30 days, we can use a regression model to predict the next values in a similar manner. The 30-day dataset is divided in a ratio of 2:1 for training and testing respectively: the model is first trained on the 20-day data and then evaluated on the remaining 10 days.

Data Evaluation
Once the model has been trained on the 20-day training dataset, it is time to see whether the model is working properly or not. Let us see how the model works and how it is tested.
Step 1: The trained model is fed data regarding the name of the dish and the quantity produced for it.
Step 2: It is then fed data regarding the quantity of food left unconsumed for the same dish on previous occasions.
Step 3: The model then works upon the entries according to the training it received at the modelling stage.
Step 4: The model predicts the quantity of food to be prepared for the next day.
Step 5: The prediction is compared to the testing dataset value. From the testing dataset, ideally, we can say that the quantity of food to be produced for the next day's consumption should be the total quantity prepared minus the unconsumed quantity.
Step 6: The model is tested on the 10 days of data kept aside while training.
Step 7: The prediction values for the testing dataset are compared to the actual values.
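The train-and-test procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's actual dataset or model: the 30-day numbers are synthetic, and a simple straight-line (degree-1) fit stands in for the regression model.

```python
import numpy as np

# A toy 30-day record (hypothetical numbers): quantity of a dish prepared
# each day and the quantity left unconsumed at the end of that day.
rng = np.random.default_rng(0)
days = np.arange(1, 31)
prepared = 100 + 0.5 * days + rng.normal(0, 2, 30)   # kg prepared per day
unconsumed = 10 + rng.normal(0, 1, 30)               # kg left over per day

# Quantity worth producing the next day ~ total prepared - unconsumed (Step 5).
target = prepared - unconsumed

# 2:1 split, as in the text: first 20 days for training, last 10 for testing.
train_x, test_x = days[:20], days[20:]
train_y, test_y = target[:20], target[20:]

# Fit a straight line to the training days and predict the test days.
slope, intercept = np.polyfit(train_x, train_y, 1)
predictions = slope * test_x + intercept

# Compare predictions with the actual values (Steps 6-7): mean absolute error.
mae = np.mean(np.abs(predictions - test_y))
print(f"MAE over the 10 test days: {mae:.2f} kg")
```

A low error here would correspond to the "accurate" outcome described in the evaluation steps; a high one would suggest changing the model or training on more data.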
Step 8: If the prediction values are the same as, or very close to, the actual values, the model is said to be accurate. Otherwise, either the model selection is changed, or the model is trained on more data for better accuracy.
Once the model is able to achieve optimum efficiency, it is ready to be deployed in the restaurant for real-time usage.

Data Collection
Data Science not only gives us a clearer idea of the dataset but also adds value to it by providing deeper and clearer analyses around it. As AI gets incorporated into the process, predictions and suggestions by the machine become possible on the same data. Now that we have gone through an example of a Data Science based project, we have some clarity regarding the type of data that can be used to develop such a project. For data domain-based projects, the data used is mostly in numerical or alpha-numerical format, and such datasets are curated in the form of tables. Such databases are very commonly found in institutions for record maintenance and other purposes. Some examples of datasets which you may already be aware of are:
- Databases of loans issued, account holders, locker owners, employee registrations, bank visitors, etc.
- Usage details per day, cash denomination transaction details, visitor details, etc.
- Movie details, tickets sold offline, tickets sold online, refreshment purchases, etc.

Sources of Data
There exist various sources from which we can collect the required data, and the data collection process can be categorised in two ways: offline and online. While accessing data from any of these sources, the following points should be kept in mind:
1. Only data which is available for public usage should be taken up.
2. Personal datasets should only be used with the consent of the owner.
3. One should never breach someone's privacy to collect data.
4. Data should only be taken from reliable sources, as data collected from random sources can be wrong or unusable.
5. Reliable sources of data ensure the authenticity of the data, which helps in the proper training of the AI model.

Types of Data
For Data Science, the data is usually collected in the form of tables. These tabular datasets can be stored in different formats. Some of the commonly used formats are:
1. CSV: CSV stands for Comma-Separated Values. It is a simple file format used to store tabular data. Each line of the file is a data record, and each record consists of one or more fields separated by commas. Since the values of the records are separated by commas, these files are known as CSV files.
2. Spreadsheet: A spreadsheet is a piece of paper or a computer program used for accounting and recording data using rows and columns into which information can be entered. Microsoft Excel is a program which helps in creating spreadsheets.
3. SQL: SQL, or Structured Query Language, is a domain-specific programming language designed for managing data held in different kinds of DBMS (Database Management Systems). It is particularly useful in handling structured data.

Data Access
After collecting the data, to be able to use it for programming purposes, we should know how to access it in a Python program. To make our lives easier, there exist various Python packages which help us access structured (tabular) data inside our code. Let us take a look at some of these packages.

NumPy
NumPy, which stands for Numerical Python, is the fundamental package for mathematical and logical operations on arrays in Python. It is a commonly used package when it comes to working with numbers. NumPy provides a wide range of arithmetic operations on numbers, giving us an easier approach to working with them. NumPy also works with arrays, which are homogeneous collections of data.

Pandas
Pandas is a software library written for the Python programming language for data manipulation and analysis.
In particular, it offers data structures and operations for manipulating numerical tables and time series. The name is derived from the term "panel data", an econometrics term for datasets that include observations over multiple time periods for the same individuals.

Pandas is well suited for many different kinds of data:
- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational / statistical data sets; the data actually need not be labelled at all to be placed into a pandas data structure

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other third-party libraries. Here are just a few of the things that pandas does well:
- Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
- Size mutability: columns can be inserted into and deleted from DataFrames and higher-dimensional objects
- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data in computations
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
- Intuitive merging and joining of data sets
- Flexible reshaping and pivoting of data sets

Matplotlib
Matplotlib is a visualization library in Python for 2D plots of arrays. It is a multiplatform data visualization library built on NumPy arrays.
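Before looking at plotting itself, the NumPy and pandas capabilities described above can be sketched in a few lines. The table and its values are made up for illustration; the column names are not from any real dataset.

```python
import numpy as np
import pandas as pd

# NumPy: arithmetic on a homogeneous array of numbers.
daily_sales = np.array([120, 135, 128, 150, 142])
print(daily_sales.mean())  # average of the five values

# pandas: a small hypothetical table with one missing (NaN) value.
df = pd.DataFrame({
    "dish": ["dal", "rice", "naan"],
    "prepared_kg": [40.0, 55.0, 30.0],
    "unconsumed_kg": [5.0, np.nan, 3.0],  # one missing record
})

# Missing data is represented as NaN and is easy to fill in.
df["unconsumed_kg"] = df["unconsumed_kg"].fillna(0.0)

# Column arithmetic aligns automatically, row by row.
df["consumed_kg"] = df["prepared_kg"] - df["unconsumed_kg"]
print(df)
```

This shows the two features called out above: NaN handling via `fillna`, and automatic data alignment when subtracting one column from another.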
One of the greatest benefits of visualization is that it gives us visual access to huge amounts of data in easily digestible form. Matplotlib comes with a wide variety of plots. Plots help us understand trends and patterns and make correlations; they are typically instruments for reasoning about quantitative information. Some of the types of graphs we can make with this package are line plots, bar charts, scatter plots, histograms, and pie charts. And it is not just plotting: you can also modify your plots the way you wish, stylise them, and make them more descriptive and communicable.
These packages help us access the datasets we have and also explore them to develop a better understanding of them.
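As a closing illustration, here is a minimal Matplotlib sketch that plots one such trend. The data values are made up; the off-screen "Agg" backend is used so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display window needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: quantity of a dish prepared over ten days.
days = np.arange(1, 11)
prepared_kg = np.array([40, 42, 39, 45, 47, 44, 48, 50, 46, 49])

fig, ax = plt.subplots()
ax.plot(days, prepared_kg, marker="o", label="prepared (kg)")  # line plot
ax.set_xlabel("Day")
ax.set_ylabel("Quantity (kg)")
ax.set_title("Dish prepared per day")
ax.legend()
fig.savefig("prepared_per_day.png")  # write the plot to an image file
```

Swapping `ax.plot` for `ax.bar`, `ax.scatter`, or `ax.hist` gives the other common plot types mentioned above.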