PPT CSD102 Data Science Session 1 and 2_Shefali Naik.pdf
Ahmedabad University
CSD102 Data Science
Session 1 and 2: Introduction to Key Terminology in Data Science

Course Outline
(Computer Science: 14 sessions + 2 sessions of project; Statistics: 8 sessions + 2 sessions of project; Total: 26 sessions for teaching, 4 sessions for exam, review, reflection)
1. Introduction to key terminology in data science (2 sessions, Shefali N)
2. Introduction to Geographical Information Systems (4 sessions, Shefali N)
3. Types of data, scales of measurement, and methods for collection (2)
4. Data cleaning using MS Excel (1)
5. Simple descriptive statistics (3)
6. Data visualisation using MS Excel and other software (2)
7. Introduction to Computer Programming (8 sessions, Shefali N)
8. Infographics project (2 sessions, Shefali N; 2 sessions, faculty of statistics)

What is Data?
Data is a collection of numbers and words (textual data) and of images, sound, and videos (multimedia data).
It is collected through surveys, interviews, extraction from the web, etc.

Data - Storage Perspective
Data is stored physically, written or printed on paper and other materials (e.g. sign boards), and digitally on memory devices.
To process data quickly, it needs to be "DIGITIZED". Digitization is the process of transferring data to electronic/magnetic devices for storage and processing by computation.
Data is stored in computers in the form of files. Files are arranged/categorized into folders.

Data Formats
The common formats for storing data and making it available for further processing are:
1. Textual data: .txt, .doc, .docx, .pdf, .csv, etc.
2. Alphanumeric (tabular) data: .xls, .xlsx (spreadsheet formats), .dbf, .mdb (database formats), etc.
3. Images: .jpeg, .gif, .bmp, .cdr, .png, .tiff
4. Sound: .mp3, .wav, .wma
5. Video: .mp4 (MPEG-4), .wmv
6. Data transformation formats: .csv, .json, .xml

Data Storage
Where the data resides: cloud, computing clusters, or an individual computer/server.
Storage systems:
○ DBMS, e.g. MySQL, Oracle, MS SQL Server, ... (structured data)
○ NoSQL, e.g. MongoDB, Cassandra, Couchbase, HBase, Hive, etc. (unstructured data)
○ Text indexing, e.g. Solr, Elasticsearch
○ Structured Query Language (SQL)

Data can be (will be covered later in detail)
1. Qualitative
2. Quantitative

1. Qualitative: the information is grouped by category, hence it is also known as categorical data.
   e.g. I was born in India.
   e.g. He is a fast runner.
   e.g. He has black hair.

2. Quantitative:
   a. Discrete data (takes only certain values)
      e.g. Jatin has 2 cars.
      e.g. There are 13 players on the field.
      e.g. The farmer has 200 cows.
   b. Continuous data (takes any value within a range)
      e.g. His weight is 37 kg.
      e.g. He is sick. His body temperature is 101 F.

Relationship between the four Components
Pyramid showing the transformation of Data to Information to Knowledge to Wisdom (top of the pyramid first):
○ Wisdom
○ Knowledge (Predictability)
○ Information (Patterns)
○ Data (Unfiltered)

Data in different domains
Data comes from many domains, to name a few:
○ Business (sales, finance, marketing, etc.)
○ Geographical (climate, meteorological, location, etc.)
○ Transport (sensors, cameras, navigational systems, etc.)
○ Scientific (biological, astrophysics, medical, etc.)
○ Statistical, etc.

Where is the Data?
Sources of Data
Data is: Big
○ 2.5 quintillion (2.5 x 10^18) bytes of data are generated every day
○ Everything around us collects/generates data:
   ○ Social media sites
   ○ Business transactions
   ○ Location-based data
   ○ Sensor data
   ○ Digital photos, videos
   ○ Consumer behavior (online and store transactions)
○ Cloud-based and mobile applications are widespread

Benefits of having Data
○ Recommendations (based on learned preferences, recommendation engines can refer you to movies, restaurants and books you like)
○ Classification (e.g. an email server classifying emails as "Important", "Social", "Promotions" or "Junk")
○ Pattern detection (weather patterns, financial market patterns)
○ Forecasting (sales and customer retention)
○ Recognition (facial, voice, text)
○ Anomaly detection (fraud, disease, crime)

Data and current scenario
Traditionally, the data that we had was mostly structured and small in size, and could be analyzed using simple Business Intelligence tools. Today, most data is unstructured or semi-structured.
(Figure: by 2025, more than 80% of the data will be unstructured.)

Data Science
It is the field of study that combines domain expertise, programming skills, and knowledge of statistics/maths to extract meaningful insights from data. It incorporates techniques like machine learning, cluster analysis, data mining and visualization.

Data Science is: multidisciplinary
○ Statisticians
○ Mathematicians
○ Computer scientists in
   ○ Data mining
   ○ Artificial Intelligence & Machine Learning
   ○ Systems development and integration
   ○ Database development
   ○ Analytics
○ Domain experts
   ○ Medical experts
   ○ Geneticists
   ○ Finance, business, economy experts
   ○ etc.

Data Science in various Domains - Recommendation
What if you could understand the precise requirements of your customers from existing data such as the customer's past browsing history, purchase history, age and income?
No doubt you had all this data earlier too, but with today's volume and variety of data you can train models more effectively and recommend products to your customers with more precision. That in turn brings more business to your organization.

Data Science in various Domains - Decision Making
Let's take a different scenario to understand the role of Data Science in decision making. What if your car had the intelligence to drive you home? Self-driving cars collect live data from sensors, including radars, cameras and lasers, to create a map of their surroundings. Based on this data, the car decides when to speed up, when to slow down, when to overtake and where to turn, using advanced machine learning algorithms.

Data Science in various Domains - Predictive Analysis
Let's see how Data Science can be used in predictive analytics, with weather forecasting as an example. Data from ships, aircraft, radars and satellites can be collected and analyzed to build models. These models not only forecast the weather but also help in predicting natural calamities, so that appropriate measures can be taken beforehand and many precious lives saved.

Data Analysis Overview
Data Analysis is the process of inspecting, cleaning, transforming and modeling data with the goal of discovering useful information, suggesting conclusions and supporting decision-making.

Data Analysis - Life cycle
○ Data Requirement Specifications: the analysis determines which input data is required (e.g. if the aim is weather forecasting, vegetation coverage and humidity can be considered required variables). Domain knowledge is required for this step.
○ Data Collection: the process of gathering information on the targeted variables identified as data requirements in the first step.
○ Data Processing: structuring the data as required by the relevant analysis tools (e.g. the data might have to be stored in a table for a statistical application).
○ Data Cleaning: data are often incomplete or incorrect, for example:
   ○ Typographical errors: e.g. text data in numeric fields
   ○ Missing values: some fields may not be collected for some of the examples
   ○ Impossible data combinations: e.g. Programme = BA, Major = Mechanical Engineering
   ○ Out-of-range values: e.g. age = 1000
○ Data Analysis: statistical data models such as correlation and regression analysis can be used to identify the relations among the data variables.
○ Communication: the results are presented through visualization techniques, such as tables and charts, which help in communicating the message clearly and efficiently to the users.

Categories of Data
Data can also be categorized as:
○ Structured
○ Unstructured
○ Semi-structured

Structured: the data storage format and processing are fixed (e.g. data stored in a DBMS (Database Management System)).

Enrollment No.   Name            BirthDate    Stream
145633           Shilpan Desai   12/3/1998    Science
321123           Manish Gor      05/06/1997   General

Unstructured: the data has no particular format, which makes processing it a big challenge. E.g. a combination of text files, images, videos, data in books, journals, email messages, web pages.

Semi-structured: a mixture of structured and unstructured. The data may look structured, but it is not represented/stored/processed in the form of tabular structures. E.g. XML-formatted data, JSON data, CSV data.
XML: eXtensible Markup Language
JSON: JavaScript Object Notation

XML Format
(Example record with the values 145633, Shilpan Desai, Science, myimages/sd.jpg)

Relationship between Artificial Intelligence (AI), Machine Learning (ML) and Deep Learning (DL)
AI, ML and DL are interrelated:
○ Artificial Intelligence: algorithms that exhibit human-like intelligence.
○ Machine Learning: a subset of AI in which algorithms (models) learn to perform a task from data.
○ Deep Learning: a subset of ML containing algorithms that imitate the way the human brain learns in order to solve a task.
All three are therefore algorithms that solve a task by looking at enormous amounts of data, much as we humans learn to perform a task through multiple experiences.

Two common types of learning algorithms are:
1. Supervised Learning
2. Unsupervised Learning

Types of Learning Algorithms - Supervised Learning
It requires labeled training data. Labeled data means that each input already has its output attached.
(Figure: four images, each labeled "Apple")
The algorithm is trained to look at colour, shape, edges and colour patterns. Colour, shape, edges, number of sides, angles and colour patterns are therefore known as features in terms of learning algorithms.
Ex.: classification algorithms

Types of Learning Algorithms - Unsupervised Learning
There is no labeled training data. The algorithm itself tries to find hidden patterns and insights in the given data (no teacher/no supervisor).
It is close to the learning that takes place in the human brain while learning new things.
Useful for categorizing things.
Ex.: clustering algorithms

NETFLIX CASE STUDY
Netflix Data Science Case Study to Improve its Recommendation System
Do you remember the last movie you watched on Netflix? After watching the movie, were you recommended similar movies? How does Netflix know what you'd like? The secret here is Data Science. Netflix uses Data Science to provide relevant and interesting recommendations to us.

What is a Recommendation System?
A recommendation system is a platform that provides its users with content based on their preferences and likings. It takes information about the user as input. This information can be in the form of past usage of a product or the ratings given to the product. It then processes this information to predict how highly the user would rate or prefer the product.

A recommendation system makes use of a variety of machine learning algorithms. It searches for movies that are similar to the ones you have watched or liked previously. Based on the movies watched, Netflix recommends films that share a degree of similarity.

There are two main types of recommendation systems:

1. Content-based recommendation systems
In a content-based recommendation system, knowledge of the products and customer information are taken into consideration. For example, if you have watched a film in the science-fiction genre, a content-based recommendation system will suggest similar films of the same genre.

2. Collaborative filtering recommendation systems
Collaborative filtering provides recommendations based on the similarity of its users' profiles. For example, if person A watches the crime, science-fiction and thriller genres and person B watches science-fiction, thriller and action, then A is likely to also enjoy action and B is likely to enjoy crime.

Person A watches   Person B watches
crime              action
science-fiction    science-fiction
thriller           thriller
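The collaborative filtering example from the Netflix case study (persons A and B) can be sketched in a few lines of Python. This is only an illustration, not how Netflix actually computes recommendations: the names `watched`, `jaccard` and `recommend`, the use of Jaccard similarity to compare profiles, and the 0.5 similarity threshold are all assumptions made for this sketch.

```python
# Viewing profiles from the slide's example (person -> set of genres watched).
watched = {
    "A": {"crime", "science-fiction", "thriller"},
    "B": {"science-fiction", "thriller", "action"},
}


def jaccard(a, b):
    """Similarity of two genre sets: shared genres / all genres (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, profiles, min_similarity=0.5):
    """Suggest genres that similar users watch but `user` has not watched.

    `min_similarity` is an assumed cut-off for treating two profiles
    as similar; a real system would tune or replace it.
    """
    own = profiles[user]
    suggestions = set()
    for other, genres in profiles.items():
        if other != user and jaccard(own, genres) >= min_similarity:
            # Genres the similar user watches that `user` has not seen yet.
            suggestions |= genres - own
    return sorted(suggestions)


print(recommend("A", watched))  # ['action']
print(recommend("B", watched))  # ['crime']
```

A and B share two of four distinct genres, so their similarity is 0.5 and each receives the genre that only the other watches, matching the example in the slides. Raising `min_similarity` above 0.5 would make the two profiles too dissimilar to exchange recommendations.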