DA106 Week 1 Material PDF
Document Details
Uploaded by SupremeZeal9798
Shkolla Ajet Xhindole
Summary
This document provides an introduction to data science and its relationship with machine learning. It details the process of data collection, generation and categorization, and demonstrates various applications of data science, including examples like fraud detection. This document can be useful material for university students studying data science.
Full Transcript
Learning Objectives
o Introduction to data science
o Relationship between data science and artificial intelligence
o Comprehend the process of data collection and generation
o Learn about various data categorization

What is Data Science?
Data science is the field of study that combines
o Domain expertise
o Programming skills
o Knowledge of math and statistics
to extract meaningful insights from data.

Multidisciplinary field!

Data science practitioners apply machine learning algorithms to data such as numbers, text, images, video, and audio to develop systems that perform tasks which ordinarily require human intelligence.

Data Science and Machine Learning
The amount of data is growing exponentially due to the collection and storage of digital data.

"Machine learning"
Computers automatically detect patterns and make predictions or decisions from data.
Learn from data without relying on a predetermined mathematical model.
Is a subset of Artificial Intelligence (AI).

Machine learning systems generate insights that analysts and business users translate into tangible business value.
Solving useful problems; otherwise useless.

An actual data science workflow can be complex!

Applications of Data Science
BUSINESS
Help businesses increase the value of their available data to gain a competitive advantage.
Understand customers better
Make better decisions

SOCIAL GOOD
Applications in:
○ Agriculture
○ Education
○ Disaster management
○ Environment
○ Transportation etc.

Example Applications
Credit Card Fraud Detection
Formulate a supervised model to categorize each transaction as either fraud or no fraud.
Ideally, you would have a good quantity of examples.
Features: monetary amount, frequency, place, period, transaction information, transaction class
Labels: fraud, no fraud

Customer Segmentation
Use of an unsupervised model
Find patterns about somebody who buys specific products.
Build a targeted marketing campaign for these consumers.
Features: products purchased, location, spending rate, product manufacturers, education, income, age
Labels: None

Roles in Data Science
“Data Scientist is a person who is better at statistics than any programmer and better at programming than any statistician.”
-Josh Wills, Head of Data Engineering at Slack

Recap: What is Data Science?
An overly simplified block diagram
Blocks: Problem Formulation; Data Collection & Processing; Data Analysis & Modeling; Insight/Prediction; Presentation
Skills: Domain expertise, Programming skills, Mathematical foundation
Tools: Machine learning, Statistical tools, R/Python libraries

!Red flags:
o Taking shortcuts: not spending enough time with the data and the problem, but jumping to modeling
o Mindless use of ML tools
o Ethical breach
o New learners are mostly susceptible…

Mindset:
o Insights that are significant to the problem
o Figuring out the non-obvious!
o Data driven scientific mindset
o Spend a lot of time exploring data

Data Generation
Source of data
○ Need to capture information about physical/digital activities
○ E.g., sales, customer feedback, social media posts, speech, temperature, body movement etc.
○ As you can see, “everything under the sun” and “flowing through the internet”

Data Gathering
○ Collected using sensors, manual labour, or mining data from the web
○ Digital data collected by sensors
○ Manual annotation and data entry to computer from physical documents
○ Extract data from the internet using scripts

Data Generation
The resulting data is raw data
○ With specific formats (.raw/.mp4 for video, .wav for speech, .csv for tabular data etc.)
○ Not clean and not yet suitable for analysis
○ Need to clean: an important step in the data science workflow

Data Categories
Structured versus unstructured data
Quantitative versus qualitative data

Structured and Unstructured Data
Structured (organized) data:
○ This is data that can be thought of as observations and characteristics. It is usually organized using a table method (rows and columns).
Unstructured (unorganized) data:
○ This data exists as a free entity and does not follow any standard organization hierarchy.

E.g.,
Most data that exists in text form, including server logs and Facebook posts, is unstructured.
Scientific observations, as recorded by careful scientists, are kept in a very neat and organized (structured) format.

Data scientists likely prefer structured data, but they must also be able to deal with the world's massive amounts of unstructured data.
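To make the structured/unstructured distinction concrete, here is a minimal Python sketch that holds the same information in both forms: first as a free-text server log line, then as one observation (row) with named characteristics (columns). The log format, the field names, and the helper `to_row` are assumptions invented for this illustration; real server logs vary widely.

```python
import re

# Unstructured: a free-text log line (hypothetical format, invented for illustration).
raw_log_line = '203.0.113.7 - [12/Mar/2024:10:15:32] "GET /index.html" 200'

# Pattern describing where each characteristic sits inside the text.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) - \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>[^"\s]+)" (?P<status>\d{3})'
)


def to_row(line):
    """Return one structured observation (a dict of columns), or None if the line does not fit."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["status"] = int(row["status"])  # store the status code as a number
    return row


# Structured: the same information as a row with named columns.
row = to_row(raw_log_line)
print(row)
```

Text that does not follow the assumed layout returns `None`, which hints at why unstructured data is harder to work with: nothing guarantees that every record has the same shape.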
80-90% of the world's data is unstructured.
Source: lawtomated.com

Data pre-processing is used to apply transformations to convert unstructured data into a structured counterpart.
Key characteristics:
“This Wednesday morn, are you early to rise? Then look East. The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies.”

Quantitative and Qualitative
Quantitative data: This data can be described using numbers, and basic mathematical procedures, including addition, are possible on the set.
Qualitative data: This data cannot be described using numbers and basic mathematics. This data is generally thought of as being described using natural categories and language.

Name of a coffee shop: Qualitative
Revenue: Quantitative
Zip code: Qualitative
Average monthly customers: Quantitative
Country of coffee origin: Qualitative

For a quantitative column, you may ask questions such as the following:
o What is the average value?
o Does this quantity increase or decrease over time (if time is a factor)?
o Is there a threshold where if this number became too high or too low, it would signal trouble for the company?

For a qualitative column, none of the preceding questions can be answered. However, the following questions apply only to qualitative values:
o Which value occurs the most and the least?
o How many unique values are there?
o What are these unique values?

Summary of the module
o What is data science?
o Data science vs. machine learning
o Example applications
o Roles in data science
o Data generation
o Categories of data
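The questions listed earlier for quantitative and qualitative columns can be sketched in a few lines of Python. This is only an illustration: the coffee shop records, the shop names, and the `REVENUE_FLOOR` threshold are all invented, and only the standard library is used.

```python
from collections import Counter
from statistics import mean

# Hypothetical coffee shop records; every value is invented for illustration.
shops = [
    {"name": "Bean There", "zip_code": "10001", "origin": "Ethiopia", "revenue": 12000.0},
    {"name": "Daily Grind", "zip_code": "10002", "origin": "Colombia", "revenue": 9500.0},
    {"name": "Mocha Stop", "zip_code": "10001", "origin": "Ethiopia", "revenue": 14500.0},
    {"name": "Brew Lab", "zip_code": "10003", "origin": "Brazil", "revenue": 8000.0},
]

# Quantitative column: arithmetic such as averaging is meaningful.
revenues = [shop["revenue"] for shop in shops]
average_revenue = mean(revenues)

# Quantitative column: a threshold check (the floor is a hypothetical value).
REVENUE_FLOOR = 9000.0
below_floor = [shop["name"] for shop in shops if shop["revenue"] < REVENUE_FLOOR]

# Qualitative column: count categories instead of averaging them.
origin_counts = Counter(shop["origin"] for shop in shops)
most_common_origin, most_common_count = origin_counts.most_common(1)[0]
unique_origins = sorted(origin_counts)

print("Average revenue:", average_revenue)
print("Shops below the floor:", below_floor)
print("Most common origin:", most_common_origin, "with", most_common_count, "shops")
print("Unique origins:", len(unique_origins), unique_origins)
```

The zip code is stored as a string on purpose. It looks numeric, but averaging zip codes would be meaningless, which is why the slides classify it as qualitative.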