
Introduction to Data Science and AI – Part II
Week 1 - Lecture 2
COSC 202 Data Science and AI
Menatalla Abououf
Fall 2024

The Role of Data Scientists
The role of a data scientist is to extract knowledge and meaningful insights from large amounts of data by combining principles and practices from the fields of mathematics, statistics, and computer science, often in a multidisciplinary context.

The Role of Data Scientists
How does a data scientist transform raw data into actionable wisdom?
▪ DATA (raw data): A smartwatch collects number of steps taken, heart rate, and sleep duration.
▪ INFORMATION (organize and structure the data): Daily step count, average heart rate, and hours of sleep per night.
▪ KNOWLEDGE (reveal patterns): Increased step count leads to improved sleep quality.
▪ WISDOM (make informed decisions): Exercise more to improve your health and fitness.

The Role of Data Scientists
What are the different types of analytics a data scientist uses the data for?
▪ What happened? Description of the weather over the past month.
▪ Why did it happen? More cloud caused a sudden drop in temperature.
▪ What will happen? Expecting heavy rain tomorrow.
▪ What should be done? Work from home.

Life Cycle of Data Science
▪ Problem Statement
▪ Data Collection
▪ Data Cleaning
▪ Exploratory Data Analysis
▪ Data Transformation
▪ Modeling
▪ Validation
▪ Decision Making & Deployment

Life Cycle of Data Science - Example
1. Problem: Classify different breeds of dogs, given their pictures.
2. Data Collection: Collect correctly labeled pictures of each breed in different lighting and from different angles.
3. Data Cleaning: Remove duplicates and poor-quality pictures.
4. EDA & Transformation: Distribution counts, heatmaps of the densest points. Resize and normalize the images.
5. Modeling & Validation: Start with a basic baseline model and validate on pictures the model has not been trained on.
6. Decision Making and Deployment: Good accuracy? Communicate with stakeholders about how to put this in production.
What is Computer Science?
“Computer science is the study of computers and computing, including their theoretical and algorithmic foundations, hardware and software, and their uses for processing information.” - Britannica
It encompasses areas such as:
▪ Algorithms
▪ Programming languages
▪ Software engineering
▪ Computer architecture
▪ Artificial intelligence, etc.
It drives technological innovation and impacts various aspects of everyday life.

What is Artificial Intelligence?
A term coined in 1955 by John McCarthy, later a Stanford professor.
“A branch of computer science dealing with the simulation of intelligent behavior in computers” – Merriam-Webster
It allows computer systems to perform complex tasks that traditionally required human intervention, such as reasoning, decision-making, or problem-solving.

What is Artificial Intelligence?
▪ Is a calculator considered AI?
▪ What about SIRI?
▪ What happens when you ask SIRI for help in a mathematical calculation?

What is Machine Learning?
“The study and construction of programs that are not explicitly programmed but learn patterns as they are exposed to more data over time.”
The more data there is, the better the algorithm can learn the underlying patterns.
Example: emails labelled as spam or not spam are fed to a machine learning program, which outputs “spam” or “not spam”. The more emails the program sees, the better it gets at classification.

What is AI but not ML?
Rule-based systems, such as rule-based expert systems. They use programmed rules and patterns to simulate the judgment and behavior of a human or an organization that has expertise and experience in a particular field.

What is AI but not ML?
Examples of Rule-Based Expert Systems:
▪ Medical Diagnosis: Based on a patient's symptoms, medical history, and test findings, a rule-based system in AI can make a diagnosis.
▪ Fraud Detection: Based on a transaction's value, location, and time of day, a rule-based system in AI can be used to spot fraudulent transactions.
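To make the rule-based idea concrete, here is a minimal sketch of a fraud check driven only by hand-written rules on a transaction's value, location, and time of day. The `Transaction` class, the thresholds, and the home country code are hypothetical and chosen purely for illustration; a real system would use rules supplied by domain experts.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    amount: float  # transaction value
    country: str   # where the transaction took place
    hour: int      # hour of day, 0-23


# Hypothetical thresholds, chosen only for illustration.
HOME_COUNTRY = "AE"
MAX_NORMAL_AMOUNT = 5_000.0


def is_suspicious(tx: Transaction) -> bool:
    """Flag a transaction using fixed, human-written rules (nothing is learned)."""
    large = tx.amount > MAX_NORMAL_AMOUNT
    abroad = tx.country != HOME_COUNTRY
    late_night = tx.hour < 5
    # Rule: a large transaction is suspicious if it is made abroad or late at night.
    return large and (abroad or late_night)


print(is_suspicious(Transaction(9_000.0, "FR", 14)))  # True: large and abroad
print(is_suspicious(Transaction(40.0, "AE", 3)))      # False: small amount
```

Every decision here traces back to a rule a person wrote down, which is what separates this kind of AI from machine learning.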
When the rules are unknown or too complex to write down, machine learning comes in.

What is Deep Learning?
“Machine learning that involves using very complicated models called deep neural networks.”
▪ Neural networks loosely mimic the way human brains work.
▪ The models themselves determine the best representation of the original data. In classical machine learning, humans must do this.
▪ Requires a huge amount of data.

Deep Learning Breakthroughs
▪ Computer Vision. Example: Image classification
▪ Natural Language Processing. Example: Sentiment Analysis

Some History
AI has gone through several hype cycles, with periods of significant investment and excitement followed by disappointment.
Timeline:
▪ 1950s-1960s: Early Algorithms
▪ 1970s: AI Winter
▪ 1980s: Expert Systems, Neural Networks
▪ 1980s-1990s: AI Winter
▪ 1990s-2000s: ML Success
▪ Present: Breakthroughs in DL

Birth of AI (1950s-1960s)
▪ In 1950, Alan Turing published “Computing Machinery and Intelligence”, which proposed a test of machine intelligence called the Imitation Game.
▪ Arthur Samuel developed a program to play checkers, the first to learn the game independently.
▪ John McCarthy coined the term “artificial intelligence” and created LISP (List Processing), the first programming language for AI research, which is still in use today.

Growth and Decline (1970s)
▪ This period saw significant interest in AI, but also a decline due to the lack of tangible progress in some applications.
▪ The first autonomous vehicle was built by a student at Stanford University.
▪ The U.S. government showed little interest in continuing to fund AI research.
AI Boom (1980s)
▪ Rapid growth and increased interest due to advanced discoveries in research and increased government funding to support researchers.
▪ Deep learning techniques and expert systems became more popular, as both allowed computers to learn from their mistakes and make independent decisions.

AI Winter (1980s-1990s)
▪ Private investors and the government lost interest in AI and stopped funding it because of its high costs compared with the apparently low returns.
▪ The market for specialized LISP-based hardware collapsed when cheaper and more capable competitors, such as IBM and Apple, emerged.
▪ Many specialized LISP companies failed as the technology became more readily available.

Return of Interest (1990s-2000s)
▪ This period saw Deep Blue become the first AI system to defeat a reigning world chess champion.
▪ It also brought AI into everyday life with innovations like the first Roomba and the first commercially available speech recognition software on Windows computers.

General AI (Present)
▪ We have witnessed the widespread adoption of common AI tools such as virtual assistants, search engines, and more.
▪ This period has also seen the rise of deep learning and big data.

Limitations of AI as of Today
Is AI set to take over? Not yet!
▪ Limited understanding of context
▪ Limited common sense
▪ Bias
▪ Limited emotion
▪ Limited robustness

Recommended Reading
Artificial Intelligence with Python, by Alberto Artasanchez and Prateek Joshi. Publisher: Packt Publishing Ltd, 2nd Edition, 2020. ISBN-10: 183921953X. ISBN-13: 978-1839219535. Pages 2-11.

Data Types & Structures – Part I
Week 2 - Lecture 3
COSC 202 Data Science and AI
Menatalla Abououf
Fall 2024

Outline – Week 2
Types of Data
1. Numerical Data: Continuous and Discrete Data
2. Categorical Data: Ordinal and Nominal Data
Data Structures
1. Primitive Data Structures
2. Non-Primitive Data Structures

Types of Data
▪ Quantitative
▪ Qualitative
    ▪ Binary: only two options (True/False, Male/Female)
    ▪ Non-Binary: many options (eye colour)

Categorical data
Categorical data represents groups or categories and is expressed in labels or names.
It is a type of qualitative data that can be grouped into categories instead of being measured numerically.
Numbers can sometimes represent it (for example, Female = 0 and Male = 1), but those numbers don’t mean anything mathematically.

Categorical data
It can be further divided into:
1. Ordinal Data: Categories with a natural order. Example: Education level (high school, bachelor's, master's, PhD).
2. Nominal Data: Categories without a natural order. Example: Colors (red, blue, green), gender (male, female).

Categorical data: Ordinal Data
▪ Ordinal data represents categories that have a meaningful order or ranking.
▪ The intervals between the categories are not necessarily equal.
▪ Arithmetic operations are not meaningful, but comparisons of order are.
Examples:
▪ Survey ratings (e.g., poor, fair, good, excellent)
▪ Education levels (e.g., high school, bachelor's, master's, doctorate)
▪ Military ranks (e.g., private, corporal, sergeant, lieutenant)

Categorical data: Nominal Data
▪ Nominal data represents categories without any inherent order or ranking.
▪ Categories are distinct and do not have a logical sequence.
▪ The data can be grouped but not ordered.
▪ Arithmetic operations cannot be performed on nominal data.
Examples:
▪ Colors (e.g., red, blue, green)
▪ Types of animals (e.g., dog, cat, bird)
▪ Nationalities (e.g., American, Canadian, Japanese)

Categorical data: Nominal Data
▪ Binary nominal data: represents categories with only two distinct and mutually exclusive values. Exactly two possible categories.
▪ Non-binary nominal data: represents categories with more than two distinct and mutually exclusive values. Three or more categories.

Categorical data: Nominal vs Ordinal
Comparison | Nominal | Ordinal
Category | Yes | Yes
Order | No inherent order | Has inherent order
Equal Intervals | No | No
Comparison | Cannot be compared | Can be compared
Examples | Colors, Names | Grades, Ratings

Image Data
Images can be considered both numerical and categorical.

Textual Data
Also known as text data or unstructured data. It cannot be easily categorized into traditional numerical or categorical formats due to its complex and varied nature.
Examples of Textual Data:
▪ Documents: Articles, reports, books, and research papers.
▪ Social Media Posts: Tweets, Facebook posts, Instagram captions.
▪ Reviews and Feedback: Customer reviews, survey responses, comments.

What is a data structure?
A data structure is a specialized format for organizing, managing, and storing data.
Different kinds of data structures are suited to different kinds of applications.
There are two main types of data structures:
▪ Primitive Data Structures
▪ Non-Primitive Data Structures

Types of Data Structures
▪ Primitive Data Structures: Integer, Float, Character, Boolean
▪ Non-Primitive Data Structures: Array, Set, List, Tuples, Dictionary

Primitive Data Structures
Primitive data structures are the most basic types of data structures. They are Single-Value Data Types.
Examples:
▪ Integers: Whole numbers without a fractional part.
▪ Floats: Numbers with a fractional part.
▪ Characters (or Strings): Alphabetic letters, digits, or special symbols.
▪ Booleans: Values that represent true or false.

Primitive Data Structures
Characteristics of primitive data structures:
1. They are predefined types in a programming language.
2. They have a specific range of values.
3. They are not composed of other data structures.

Non-Primitive Data Structures
Non-primitive data structures are more complex structures that can store multiple values, often of different types. They are built using primitive data structures. They are Multiple-Value Data Types.
Examples: Arrays, lists, tuples and dictionaries

Non-Primitive Data Structures
Characteristics of non-primitive data structures:
1. They are derived from primitive data structures.
2. They can store a collection of values.
3. They can be homogeneous (like arrays) or heterogeneous (like lists).
    A. Homogeneous data structures store elements of the same type.
    B. Heterogeneous data structures can store elements of different types.
4. They are used to represent more complex relationships among data.
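As a rough illustration of the distinction above, the following Python sketch creates a few single-value (primitive) objects and then two collections built from them. The helper `is_homogeneous` and all variable names are hypothetical, introduced only for this example; note that Python has no separate character type, so a one-character string stands in for it.

```python
def is_homogeneous(values):
    """Return True when every element has exactly the same type."""
    return len({type(v) for v in values}) <= 1


# Primitive (single-value) data
count = 42        # integer
price = 10.7      # float
initial = "A"     # one-character string, standing in for a character
is_valid = True   # boolean

for value in (count, price, initial, is_valid):
    print(type(value).__name__, value)

# Non-primitive (multiple-value) data, built from primitives
temperatures = [21.5, 22.0, 19.8]   # every element is a float
mixed = [1, "two", 3.0, True]       # elements of four different types

print(is_homogeneous(temperatures))  # True
print(is_homogeneous(mixed))         # False
```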
Exercise
Specify whether each of the following is a homogeneous or heterogeneous data structure:
A. int_array = [1, 2, 3, 4, 5]
B. Person class with attributes name (string), age (integer), and email (string).

Primitive vs Non-Primitive Data Types
▪ Definition
    Primitive: Predefined by the language itself.
    Non-primitive: Need to be defined by the user and are built using primitive data types.
▪ Size
    Primitive: Fixed and predefined by the language. Size depends on the data type.
    Non-primitive: Dynamic and can vary during runtime.
▪ Storage
    Primitive: Stores a single value.
    Non-primitive: Can store multiple and complex sets of data.
▪ Default values
    Primitive: Have default values (like 0 for int, false for boolean).
    Non-primitive: Usually initialize to null or require explicit initialization.
▪ NULL values
    Primitive: Can’t be NULL.
    Non-primitive: Can contain a NULL value.
▪ Direct access
    Primitive: Can be accessed directly through their variable name.
    Non-primitive: May require special methods or operations to access or manipulate.

Data Types & Structures – Part II
Week 2 - Lecture 4
COSC 202 Data Science and AI
Menatalla Abououf
Fall 2024

Non-Primitive Data Structures: Arrays
An array is a data structure that can hold a fixed number of elements of the same type.
Example: [A, K, L, A, V, F, G]

Non-Primitive Data Structures: Arrays
It has the following properties:
▪ Fixed Size: The size of an array is determined at the time of its creation and cannot be changed.
▪ Homogeneous Elements: All elements in an array are of the same data type.
▪ Index-Based Access: Elements in an array are accessed using indices, which start at 0 in most programming languages.
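The homogeneity and index-based access described above can be tried out with Python's standard-library `array` module, as in the sketch below. One caveat: unlike the fixed-size arrays described in the slides, `array.array` objects can grow and shrink, so this example only illustrates the "same type" and "index from 0" properties. The variable names are illustrative.

```python
from array import array

# "i" is the typecode for signed integers: every element must be an int.
numbers = array("i", [1, 2, 3, 4, 5])

print(numbers[0])    # 1  (indices start at 0)
print(numbers[-1])   # 5  (negative indices count from the end)

numbers[2] = 30      # replacing an element with another int is allowed

try:
    numbers[1] = "two"   # a string is not an int, so this is rejected
    rejected = False
except TypeError:
    rejected = True

print(list(numbers), rejected)
```

A plain Python list would have accepted the string, which is the practical difference between a homogeneous array and a heterogeneous list.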
Non-Primitive Data Structures: Arrays
Types of Array:
▪ One-Dimensional array: A simple linear data structure that stores elements.
    Example: numbers = [1, 2, 3, 4, 5]
▪ Two-Dimensional array (Table): a collection of data elements arranged in a grid-like structure with rows and columns.
    Example: matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
▪ Three-Dimensional arrays.

Non-Primitive Data Structures: Lists
A list is a data structure that can hold a collection of items.

Non-Primitive Data Structures: Lists
Lists have the following properties:
▪ Ordered: Elements in a list are ordered and indexed.
▪ Heterogeneous: Unlike arrays, lists are heterogeneous. This means that they can contain elements of different data types.
▪ Mutable and Dynamic in size: Lists can grow and shrink in size as needed. You can add or remove elements at any time.
▪ Contains Duplicates: Allows duplicate data.

Non-Primitive Data Structures: Lists
When creating lists in Python, we use square brackets [].
When to use lists? When we have a sequence of data points or time series data where the order is important.
Some operations on Lists:
Creation:
    my_list = [1, "two", 3.0, True]
    empty_list = []
Accessing Elements:
    first_element = my_list[0]
Modifying Elements:
    my_list[1] = "three"  # Changes "two" to "three"
Adding Elements:
    my_list.append(6)  # Adds 6 to the end
    my_list.insert(2, "inserted")  # Inserts "inserted" at index 2
Removing Elements:
    my_list.remove("three")  # Removes first occurrence of "three"
    popped_element = my_list.pop(2)  # Removes and returns the element at index 2
We can also do Slicing, Concatenation and Iterating.

Non-Primitive Data Structures: Sets
A set is a data structure that can store a collection of elements.

Non-Primitive Data Structures: Sets
Sets have the following properties:
▪ Unique Elements: A set cannot contain duplicate elements.
▪ Unordered: Sets do not maintain any particular order of elements.
▪ Heterogeneous: Unlike arrays, sets are heterogeneous.
This means that they can contain elements of different data types.
▪ Mutability: You can add or remove elements after the set is created. However, a set cannot contain mutable elements. For example, integers and strings can be elements of a set, but lists or other sets cannot.

Non-Primitive Data Structures: Sets
When creating sets in Python, we use curly brackets {}.
When to use Sets? When we have distinct values in a dataset. Example: User IDs.
Membership test is a very fast operation on Sets. For this we use in.
Operations on Sets:
Creation:
    my_set = {1, 2, 3, 4, 5}
Adding Elements:
    my_set.add(6)
    my_set.update([7, 8, 9])  # Adding multiple elements
Removing Elements:
    my_set.remove(6)  # Raises KeyError if element is not found
    my_set.discard(7)  # Does not raise an error if element is not found
    my_set.pop()  # Removes and returns an arbitrary element
Checking Membership:
    is_present = 4 in my_set  # True if 4 is in the set

Non-Primitive Data Structures: Tuples
A tuple is a data structure that can hold a collection of items.

Non-Primitive Data Structures: Tuples
Tuples have the following properties:
▪ Ordered: Elements in a tuple have a defined order, and this order will not change.
▪ Heterogeneous: Tuples can contain elements of different types (integers, strings, objects, etc.).
▪ Immutable: Unlike lists, tuples are immutable, meaning that once they are created, their elements cannot be changed, added, or removed.
▪ Contains Duplicates: Allows duplicate data.

Non-Primitive Data Structures: Tuples
When creating tuples in Python, we use parentheses ().
When to use Tuples? Tuples are slightly more memory-efficient than lists because they are immutable. They also ensure that data can't be accidentally modified. Also, functions use tuples to return multiple values:
    def get_info():
        return ("Alice", 21, "Computer Science")
Some operations on Tuples:
Creation:
    my_tuple = (1, 2, 3)
Accessing Elements:
    t = (1, 2, 3)
    print(t[0])  # Output: 1
Concatenation:
    t1 = (1, 2)
    t2 = (3, 4)
    print(t1 + t2)  # Output: (1, 2, 3, 4)
Membership:
    t = (1, 2, 3)
    print(2 in t)  # Output: True
Count:
    t = (1, 2, 2, 3)
    print(t.count(2))  # Output: 2

Non-Primitive Data Structures: Dictionaries
A dictionary is a data structure that stores data in key-value pairs.
    letters = {"a": "alpha", "o": "omega", "g": "gamma"}

Non-Primitive Data Structures: Dictionaries
A Dictionary has the following properties:
▪ Order: Dictionaries are accessed by key rather than by position. Since Python 3.7 they are guaranteed to preserve insertion order (in CPython 3.6 this was an implementation detail).
▪ Heterogeneous Values: Values in a dictionary can be of any data type and do not need to be unique. Keys must be unique.
▪ Mutable: You can add, remove, and modify elements in a dictionary.

Non-Primitive Data Structures: Dictionaries
When creating dicts in Python, we use curly brackets {}.
Some operations on Dictionaries:
Creating Dictionaries
    my_dict = {"name": "Alice", "age": 30, "city": "New York"}
    my_dict = dict(name="Alice", age=30, city="New York")
Accessing Elements
    name = my_dict["name"]  # Accessing by key --- Output: "Alice"
    age = my_dict.get("age", "Not Available")  # Accessing by get() --- Output: 30
Adding or Updating Elements
    my_dict["age"] = 31
Removing Elements
    age = my_dict.pop("age")  # Removes and returns the value associated with "age"
    del my_dict["city"]  # Removes the key-value pair with key "city"
    item = my_dict.popitem()  # Removes and returns the last inserted key-value pair
    my_dict.clear()  # Removes all key-value pairs

Revision
Exercise: Specify the data type of the following examples:
▪ "Hello World"
▪ 10
▪ 10.7
▪ ["apple", "banana", "orange"]
▪ ("apple", "banana", "orange")
▪ {"apple", "banana", "orange"}
▪ {"name": "Vishnu", "age": 27}
▪ True

Summary
Comparison | Order | Heterogeneous | Mutable | Index-based access | Fixed or dynamic size | Uniqueness of elements
Arrays | Ordered | No (Homogeneous) | Yes | Yes | Fixed (size specified at creation) | No
Sets | Unordered | Yes | Yes | No | Dynamic | Yes
Lists | Ordered | Yes | Yes | Yes | Dynamic | No
Tuples | Ordered | Yes | No | Yes | Fixed | No
Dictionaries | Insertion-ordered (Python 3.7+) | Yes | Yes | No (key-based access) | Dynamic | Keys must be unique

Data Collection
Week 3 - Lecture 5
COSC 202 Data Science and AI
Menatalla Abououf
Fall 2024

Outline – Week 3
Identifying problem statement
Data collection
1. Sources of data
2. Data structures
3. Web scraping and APIs
Data cleaning
1. Identify problems with the collected data
2. Spot Outliers
3. Deal with missing data

Life Cycle of Data Science
▪ Problem Statement: What problem are we trying to solve?
▪ Data Collection: What data do we need to solve our problem?
▪ Data Cleaning: How should we clean our data so our model can use it?
▪ Exploratory Data Analysis: What insights can we gain from the data?
▪ Data Transformation: How to prepare data so our model can use it?
▪ Modeling: Build a model to solve our problem?
▪ Validation: Did we solve the problem?
▪ Decision Making & Deployment: Communicate to stakeholders or put into production?

Problem Statement
▪ Clearly formalize the problem or question that needs to be addressed.
▪ Understand the goals and objectives of the analysis or project.
What happens when a data scientist works on a domain-specific task? (Figure: domain expert and data scientist.)

Life Cycle of Data Science
▪ Problem Statement: What problem are we trying to solve?
▪ Data Collection: What data do we need to solve our problem?
▪ Data Cleaning: How should we clean our data so our model can use it?
▪ Exploratory Data Analysis: What insights can we gain from the data?
▪ Data Transformation: How to prepare data so our model can use it?
▪ Modeling: Build a model to solve our problem?
▪ Validation: Did we solve the problem?
▪ Decision Making & Deployment: Communicate to stakeholders or put into production?

Data Collection
Identify relevant data sources and gather the necessary data for analysis.
What could be the sources of data?

Data Collection
Sources of data:
▪ Primary sources: interviews, surveys, observations, experiments
▪ Secondary sources: organization internal data (e.g. sales records, transactions, customer data); the web (e.g. open-source databases, articles, research, social media)
▪ Could be either: Internet of Things devices and sensors

Data Collection
The common forms of collected data:
▪ Structured
▪ Semi-structured
▪ Unstructured

Data Collection
▪ Structured: organized following a predefined format. Example: database tables.
    ID | Name | Age | Degree
    1 | John | 18 | B.Sc.
    2 | David | 31 | NULL
    3 | Robert | 51 | B.Sc.
    4 | Rick | 19 | M.Sc.
▪ Semi-structured: lacks a rigid format but is somewhat organized. Examples: CSV, JSON. The slide shows tagged student records, for example <Student ID="2"> with the values David and 31, …
▪ Unstructured: no predefined format and unorganized. Examples: text, PDF, JPEG, MP3, others.
    “The university has 5000 students. John’s ID is number 1, he is 18 years old and holds a B.Sc. degree. David’s ID is number 2 and he is 31 years old. Robert’s ID is 3 and he is 51 years old and holds the same degree as John.”

Data Collection - Databases
Retrieving data from databases: relational databases are highly structured and have a fixed schema.

Data Collection - CSV
Reading from CSV files: files consisting of rows of data, with values separated by commas.
(Figure: example of a CSV file.)
What if you import a CSV file into an Excel sheet?

Data Collection - CSV
What is the structure now?
Reading from CSV files: files consisting of rows of data, with values separated by commas.
(Figures: example of a CSV file; the same .csv file opened as an Excel sheet.)

Data Collection - JSON
Reading from JSON (JavaScript Object Notation) files:
▪ A standard way to store data across platforms.
▪ They are meant to store information in a semi-organized, easy-to-access manner, and they look similar to Python dictionaries.
▪ Like a dictionary, a JSON object has key-value pairs.
The example on the slide shows two rows of data; each key represents a column name, and each value is the value of that column.

Data Collection - JSON
Notice how the highlighted key-value pairs are not the same for both books.

Data Collection
Comment on the structure (form) of each of these data sources:
▪ Interviews
▪ Surveys
▪ Observations
▪ Experiments
▪ Organization internal data, e.g. sales records, transactions, customer data
▪ Web, e.g. open-source databases, articles, research, social media
▪ Internet of Things devices and sensors
How can we collect data from the web?

Data Collection – From Web
How can we collect data from the web?
▪ Web Scraping
▪ APIs
These are automatic methods to obtain large amounts of data from websites. Most of this data is unstructured and is then converted into structured data in a database so that it can be used in various applications.

Data Collection - Web Scraping
How does Web Scraping work?
▪ A web scraper is a tool that is given one or more URLs and loads the entire HTML code of each page.
▪ The scraper then extracts either all the data on the page or specific data selected by the user before the project is run.
▪ Most web scrapers output data to a CSV file, an Excel spreadsheet, or JSON.
What are the methods of Web Scraping?
▪ Available tools
▪ Self-built

Data Collection - Example of Web Scraping
We can use a web scraper on the Amazon website, selecting the product names and URLs. The web scraper will output all the data that has been collected in a format that is more useful to us.

Data Collection - API
Application Programming Interface (API): a software intermediary that allows two applications to talk to each other.
Examples of APIs:
▪ Logging in with Facebook/Twitter/Google/GitHub for authentication
▪ Pay with PayPal
▪ Google Maps

Data Collection - API
▪ A variety of data providers make data available via APIs.
▪ APIs are an accessible way to extract and share data within and across platforms.
Example: Collecting Tweets as data using the X (Twitter) API.
    curl "https://api.twitter.com/2/users/by/username/$USERNAME" -H "Authorization: Bearer $ACCESS_TOKEN"
The data can be in CSV or JSON format.

Data Collection – Web Scraping and API
Aspect | Web Scraping | API
Availability | High: can be done on any website | Limited: only if the website offers an API
Flexibility | High: users define custom data based on their needs | Limited: restricted to the data and functions provided by the provider
Technical difficulty | High: requires technical skills | Moderate: fewer technical skills needed
Speed | Time-consuming | Fast
Legality | Needs careful consideration | Governed by the terms and conditions set by the provider
Use when | The website does not offer an API, or specific data is needed | The website offers a public API with the needed data

Data Collection - Summary
▪ Structured: organized following a predefined format. Example: database tables.
    Typical sources: organization internal data (typically stored in tabular format); surveys (e.g., multiple-choice questions).
    Used in traditional AI models.
▪ Semi-structured: lacks a rigid format but is somewhat organized. Examples: CSV, JSON.
    Typical sources: social media posts and articles (metadata: tags and timestamps); IoT devices and sensors (logs and raw sensor readings); experimental data (e.g. logs).
    Commonly converted to structured data.
▪ Unstructured: no predefined format and unorganized. Examples: text, PDF, JPEG, MP3, others.
    Typical sources: interviews; observations; experimental data (e.g. lab notes); social media posts; web pages; multimedia and text files.
    Used in advanced AI models.

Data Collection – Quick Inspection
Get an initial feel for the data to check whether the dataset is usable:
▪ Does it have all the necessary features for the problem at hand?
▪ Is it in a structure suitable for the intended problem/model?
▪ Do features have the correct data types?
▪ Do numeric values fall within expected ranges?
▪ Do categorical values fall within the predefined categories?
▪ Are there only a few or no missing values?
If the answer is yes to all, proceed with the collected data to the next phase.

Data Cleaning
Week 3 - Lecture 6
COSC 202 Data Science and AI
Menatalla Abououf
Fall 2024

Data Cleaning
Why is data cleaning so important?
Decisions in analytics are increasingly driven by data and models.
Several crucial steps need to be taken after data collection to ensure the data is suitable and ready for analysis.
Key aspects of a dataset:
▪ Observations: An instance of the data (usually a point or a row in the dataset).
▪ Features: Information we have for each observation (variables).
▪ Labels: Output variable(s) being predicted.
If these key aspects are not cleaned properly, we misrepresent to our model the relationship between our features and our targets.

Data Cleaning
How can data be messy?
▪ Duplicate or unnecessary data
▪ Inconsistent format, text and typos: the model will treat these as new values within the same feature, even though they should be categorized as the same.
▪ Outliers
▪ Missing data
▪ Biased data
▪ Unbalanced data

Data Cleaning
What is wrong with this dataset? (Figure: example dataset.) Problems highlighted on the following slides: duplicates; inconsistent format; missing data, an outlier, and a typo; an unnecessary feature.

Data Cleaning – Missing Values
(Figure: cleaned dataset.)
How to handle missing data?

Data Cleaning – Missing Values
How to handle missing values:
1. Remove data: remove row(s) entirely (or column).
    Pro: Quickly cleans the dataset without guessing an appropriate replacement value.
    Con: May end up losing too much information or biasing our dataset.
2. Impute data: replace the value. Fill in the missing data with the most common value, the average, or the median.
    Pro: We do not lose an entire row (or column).
    Con: We add uncertainty to the model, as the value is based on an estimate.
3. Mask the data: create a category for missing values.
   Pro: We do not lose an entire row (or column).
   Con: We add uncertainty to the model, as it rests on the assumption that all missing values are alike.

Data Cleaning – Missing Values
Handling missing data: the missing value is replaced with the median of the observed values 28, 28, 29, 32, 45, which is 29.

Data Cleaning – Outliers
How to spot outliers, and how to deal with them?
An outlier is a data point that is distant from most observations within a feature.
Typically, outliers do not represent the phenomenon we are trying to model.
Note: Some outliers are informative and provide insight into the data.

How to spot outliers:
- Histogram: groups data into bins or intervals and shows the frequency (count) of data points within each bin.
- Box plot: a standardized way of displaying the distribution of data. It highlights the central tendency, variability, and potential outliers in the data.
- Density plot: a continuous version of a histogram. It shows the distribution of a numerical variable and is especially useful for identifying modes and the spread of the data.

How to handle outliers:
1. Remove data: remove the row(s) (or column) entirely.
   Pro: No longer need to worry about their effects.
   Con: May end up losing too much information or biasing the dataset.
2. Assign the mean or median value.
   Pro: We do not lose an entire row (or column).
   Con: We lose what may have been an important value.
3. Predict what the value should have been, using 'similar' observations or regression.
   Pro: We do not lose an entire row (or column).
   Con: Requires a lot more work than any of the other methods.
4. Keep them: the model should be resistant to outliers.
[Figure: handling an outlier]

Data Cleaning – Biased Data
What is biased data?
Data that systematically favors a certain class or certain features over others.
It happens due to errors in data collection, processing, or analysis.
It leads to a model that is not representative of real-world scenarios.

Data Cleaning – Imbalanced Data
What is imbalanced data?
Data where the classes are not represented equally.
It happens due to the rarity of certain events.
It can cause the model to be biased towards the majority class.

Data Cleaning – Biased & Imbalanced Data
Which of the following is biased and which is imbalanced data?
- Data collected to build a model that identifies fraudulent transactions
- Data collected to build a model that uses existing employees' CVs for hiring new ones
How to solve each problem?

Is this dataset balanced and unbiased?
[Figure: example dataset]

Data Exploration – Part I
Week 4 - Lecture 7
COSC 202 Data Science and AI
Menatalla Abououf
Fall 2024

Outline – Week 4
Importance of EDA
Techniques of EDA:
  1. Statistical summary
     Central tendency: mean, median, mode
     Dispersion: standard deviation, range
     Covariance and correlation
  2. Visualization
     Histograms
     Scatter plots
     Box plots
Sampling

What is Exploratory Data Analysis (EDA)?
It is an approach to analyzing datasets to summarize their main characteristics, often with visual methods.
It allows us to gain insights into the data:
- Does it make sense?
- Does it need further cleaning?
- Is more data needed?
- What are the patterns and trends?

Techniques of EDA
Statistical summary: average (or mean), median, standard deviation, minimum, maximum, range, mode, correlation
Visualizations: histograms, scatter plots, box plots

Techniques of EDA – Statistical Summary
The average (or mean) is the sum of all the values divided by the total number of values in a set:
$\text{Average } (\mu) = \frac{\text{sum of all observations}}{\text{total number of observations}}$
$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$
What is the average (mean)? It gives a central value of the dataset, providing a sense of the overall level of the data.
Example: 4, 10, 6, 18, 5, 8, 306

The median is the middle number in a sorted list of numbers. With an even number of values, it is the average of the two middle numbers.
What is the median? It splits the dataset into two equal halves, indicating the central tendency without being affected by extreme values (outliers).
Examples: 4, 10, 6, 18, 5, 8, 306 (odd number of values) and 4, 10, 6, 18, 5, 8, 20, 306 (even number of values)

The mode is the value that appears most frequently. A dataset can have one mode, more than one mode, or no mode at all (if no value repeats).
What is the mode? It is a measure of central tendency, like the mean and median, and it gives insight into the most common or popular value within the dataset.
Examples:
- 4, 10, 6, 18, 5, 8, 306 (no mode)
- 4, 5, 6, 18, 5, 8, 306 (one mode: 5)
- 4, 5, 6, 18, 5, 8, 6 (two modes: 5 and 6)

Standard deviation is the square root of the variance of the data. The variance is the average squared deviation of each data point from the mean of the dataset.
$\text{variance } (\sigma^2) = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n}$
$\text{standard deviation } (\sigma) = \sqrt{\sigma^2}$
What is the standard deviation?
Standard deviation represents the amount of variation or dispersion of a set of values by measuring the typical (root-mean-square) distance of the data points from the mean.
Example: 4, 10, 6, 18, 5, 8, 306

The range is calculated as the difference between the maximum and minimum values in the dataset:
$\text{range} = \text{maximum value} - \text{minimum value}$
What is the range? It is a measure of the data spread or dispersion.
Example: 4, 10, 6, 18, 5, 8, 306

Techniques of EDA – Statistical Summary
Worked example on the Age feature (values 28, 32, 28, 45, 29, 29):
Mean: $\bar{x} = \frac{28 + 32 + 28 + 45 + 29 + 29}{6} = 31.83$
Median: sorted values 28, 28, 29, 29, 32, 45, so the median is $\frac{29 + 29}{2} = 29$
Variance: $\sigma^2 = \frac{(28 - 31.83)^2 + (32 - 31.83)^2 + (28 - 31.83)^2 + (45 - 31.83)^2 + (29 - 31.83)^2 + (29 - 31.83)^2}{6} = 36.47$
Standard deviation: $\sigma = \sqrt{36.47} = 6.04$
Range: $45 - 28 = 17$
Mode: 28 and 29
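The Age example above can be reproduced in a few lines of Python. This is a minimal sketch, not part of the original slides: it assumes the standard-library `statistics` module, and the variable names are illustrative. It uses the population variance (`pvariance`, dividing by n) because that matches the formula on the slides. Many tools default to the sample variance (dividing by n - 1), which would give a slightly larger value.

```python
import statistics

# Age values taken from the worked example on the slides.
ages = [28, 32, 28, 45, 29, 29]

mean_age = statistics.mean(ages)           # sum of values / number of values
median_age = statistics.median(ages)       # average of the two middle values (even n)
variance_age = statistics.pvariance(ages)  # population variance: divides by n
std_age = statistics.pstdev(ages)          # square root of the population variance
range_age = max(ages) - min(ages)          # maximum value - minimum value
modes_age = statistics.multimode(ages)     # all values tied for most frequent

print(f"mean     = {mean_age:.2f}")      # mean     = 31.83
print(f"median   = {median_age}")        # median   = 29.0
print(f"variance = {variance_age:.2f}")  # variance = 36.47
print(f"std dev  = {std_age:.2f}")       # std dev  = 6.04
print(f"range    = {range_age}")         # range    = 17
print(f"mode(s)  = {modes_age}")         # mode(s)  = [28, 29]
```

`statistics.multimode` requires Python 3.8 or later. It returns every value tied for the highest count, which covers the "more than one mode" case from the slides.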
