Topic-2. Working with Data.pptx
GEN101 Introductory Artificial Intelligence
College of Engineering
Working with Data

Exploring Data

Data in Machine Learning: How is data organized?
- Data typically presents as a table.
- A single row of data is an instance, sample, record, or observation.
- A single cell in the row is an attribute, factor, or feature.
- Datasets are a collection of instances.
- Datasets are used to train and test AI algorithms.

Record/Sample/Instance No. | Feature 1 Name | Feature 2 Name
Record/Sample/Instance 1 | Value in Sample 1 | Value in Sample 1
Record/Sample/Instance 2 | Value in Sample 2 | Value in Sample 2

Data in Machine Learning: Sample Data

Patient Number (Sample ID) | Blood Pressure (Feature 1) | Glucose Level (Feature 2) | Pre-diabetic (Class/Label/Target)
Patient 1 | 120/80 | 90 | No
Patient 2 | 130/90 | 120 | Yes

Data Types
- Numerical Data: any form of measurable data, such as your height, weight, or the cost of your phone bill. You can tell a set of data is numerical if you can average the values or sort them in ascending or descending order.
- Categorical Data: sorted by defining characteristics, such as gender, social class, ethnicity, hometown, or the industry you work in. Order is not important.
- Ordinal Data: mixes numerical and categorical data. The data fall into categories, but the numbers placed on the categories have meaning. For example, rating a restaurant on a scale from 0 (lowest) to 4 (highest) stars gives ordinal data. Order is important.
- Time Series Data: data points indexed at specific points in time, more often than not collected at consistent intervals.
- Textual Data: words, sentences, or paragraphs that can provide some level of insight to your machine learning models. Text is often grouped together or analyzed using methods such as word frequency, text classification, or sentiment analysis.

Explore Your Data: Know what you're working with
You need to answer a set of basic questions about the dataset (a short sketch follows this list):
- How many observations do I have? How many features?
- What are the data types of my features? Are they numeric? Categorical?
- Do I have a target/class variable?
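A minimal exploration sketch in pandas answering the questions above. The file name "patients.csv" and the column name "Pre-diabetic" are hypothetical stand-ins for the sample table, not part of the slides:

```python
import pandas as pd

# Hypothetical file mirroring the patient table above.
df = pd.read_csv("patients.csv")

print(df.shape)    # how many observations (rows) and features (columns)?
print(df.dtypes)   # which features are numeric, which are categorical?
print(df.head())   # a first look at a few instances

# If the dataset has a target/class variable, check its distribution.
print(df["Pre-diabetic"].value_counts())
```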
Collecting Data: 1. Collect it yourself
- Manual collection: contains far fewer errors, but takes more time to collect and is more expensive in general.
- Automatic collection: cheaper and faster; gather everything you can find.

Collecting Data: 2. Someone has already collected it for you
Sources include Google's Dataset Search, Microsoft Research Open Data, Amazon Datasets, the UCI Machine Learning Repository, and Government Datasets.

Collecting Data: The Size and Quality of a Data Set
"Garbage in, garbage out": your model is only as good as your data.
- How do you measure your data set's quality and improve it?
- How much data do you need to get useful results?

Collecting Data: Why is Collecting a Good Dataset Important?
The Google Translate team has more training data than they can use. Rather than tuning their model, the team has earned bigger wins by using the best features in their data.
"...one of our most impactful quality advances since neural machine translation has been in identifying the best subset of our training data to use." - Software Engineer, Google Translate
"...most of the times when I tried to manually debug interesting-looking errors they could be traced back to issues with the training data." - Software Engineer, Google Translate
"Interesting-looking" errors are typically caused by the data. Faulty data may cause your model to learn the wrong patterns, regardless of what modeling techniques you try.

Collecting Data: The Size of a Dataset
- Your AI should train on at least an order of magnitude more examples than trainable parameters.
- Simple AI models on large data sets generally beat fancy models on small data sets.
- Google has had great success training simple linear regression models on large data sets.
- What counts as "a lot" of data? It depends on the project; datasets come in a variety of sizes.

Data set | Size (number of examples)
Iris flower data set | 150 (total set)
MovieLens (the 20M data set) | 20,000,263 (total set)
Google Gmail SmartReply | 238,000,000 (training set)
Google Books Ngram | 468,000,000,000 (total set)
Google Translate | Trillions

Collecting Data: The Quality of a Dataset
It's no use having a lot of data if it's bad data; quality matters, too. A dataset is good if it helps you accomplish its intended task. However, while collecting data, it's helpful to have a more concrete definition of quality. Certain aspects of quality tend to correspond to better-performing models: reliability, feature representation, and minimizing skew.

Explore Your Data: Why should you do that?
The purpose of exploratory analysis is to "get to know" the dataset:
- You'll gain valuable hints for data cleaning, which can make or break your models.
- You'll think of ideas for feature engineering, which can take your models from good to great.
- You'll get a "feel" for the dataset, which will help you communicate results and deliver greater impact.

Missing Data? Handle it, do NOT ignore it
Missing data is like a missing puzzle piece. If you drop the observation, you are pretending the puzzle slot isn't there. If you impute it, you are trying to squeeze in a piece from somewhere else in the puzzle.

Missing Data: Types of Missing Data
- Missing categorical data: the best way to handle missing categorical features is to simply label them as 'Missing'. You're essentially adding a new class for the feature, which tells the algorithm that the value was missing. This also gets around the technical requirement for no missing values.
- Missing numeric data: flag and fill the values. Flag the observation with an indicator variable of missingness, then fill the original missing value with 0 just to meet the technical requirement of no missing values. (Both rules are sketched in code below.)

Missing Data: Interpolation
Interpolation is a mathematical method that fits a function to your data and uses this function to estimate the missing values. The simplest type is linear interpolation, which takes the mean of the value before the missing entry and the value after it (see the second sketch below).
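A minimal sketch of the two missing-data rules above, using pandas on a made-up toy frame (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Position": ["Staff", None, "Manager"],   # categorical feature with a gap
    "Salary":   [8.0, np.nan, 20.0],          # numeric feature with a gap
})

# Categorical: label missing values as their own 'Missing' class.
df["Position"] = df["Position"].fillna("Missing")

# Numeric: flag missingness with an indicator variable, then fill with 0.
df["Salary_missing"] = df["Salary"].isna().astype(int)
df["Salary"] = df["Salary"].fillna(0)
```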
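And a sketch of linear interpolation with pandas, assuming evenly spaced toy values: each gap is filled on the straight line between its neighbours, so a gap halfway between 11 and 17 becomes their mean, 14:

```python
import numpy as np
import pandas as pd

salary = pd.Series([8, 11, np.nan, 17, 14, np.nan, 26], dtype=float)

# Linear interpolation fills each gap on the straight line between the
# nearest known values before and after it.
print(salary.interpolate(method="linear").tolist())
# [8.0, 11.0, 14.0, 17.0, 14.0, 20.0, 26.0]
```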
Data Cleaning: Better Data Beats Fancier Algorithms
Garbage in, garbage out. In fact, a simple algorithm can outperform a complex one just because it was given enough high-quality data. Quality data beats fancy algorithms. Different types of data will require different types of cleaning.

Remove Unwanted Observations: Duplicate Observations
Duplicate observations most frequently arise during data collection, such as when you:
- combine datasets from multiple places
- scrape data
- receive data from clients/other departments

Remove Unwanted Observations: Irrelevant Observations
Irrelevant observations are those that don't actually fit the specific problem you're trying to solve. For example, if you were building a model for Villas only, you wouldn't want observations for Apartments in there. Checking for irrelevant observations before engineering features can save you time and effort.

Fix Structural Errors
Structural errors arise during measurement, data transfer, or other types of "poor housekeeping." Check for:
- typos
- inconsistent capitalization
- mislabeled classes
(The slides show a before/after example of applying the fix; a code sketch follows this section.)

Filter Unwanted Outliers
An outlier is an observation that lies an abnormal distance from other values. Examine your data carefully before deciding whether to remove an outlier. You should never remove an outlier just because it's a "big number": that big number could be very informative for your model. Removing a genuine outlier, however, can help your model's performance.

One Hot Encoding: Definition
One-hot encoding is a method used in machine learning to quantify categorical data: the column containing categorical data is split into as many columns as there are categories, and each new column contains "1" if the row belongs to that category and "0" otherwise.

Fruit | Categorical Value | Price
Apple | 1 | 5
Mango | 2 | 10
Apple | 1 | 15
Orange | 3 | 20

becomes

Apple | Mango | Orange | Price
1 | 0 | 0 | 5
0 | 1 | 0 | 10
1 | 0 | 0 | 15
0 | 0 | 1 | 20

Data Augmentation: Enlarge your Dataset
How do I get more data if I don't have "more data"?

Data Augmentation: Common Augmentation Methods (each method is illustrated on its own slide against the original image)
1. Mirroring
2. Random Cropping
3. Rotation
4. Shearing
5. Color Shifting
6. Brightness
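Picking up the duplicate, structural-error, and outlier slides above, a minimal cleaning sketch in pandas; the toy data and the salary range used for filtering are assumptions, not part of the slides:

```python
import pandas as pd

df = pd.DataFrame({
    "Position": ["Staff", "staff", "Supervisr", "Supervisor"],
    "Salary":   [8, 8, 30, 33],
})

# Fix structural errors first (typos, inconsistent capitalization),
# so that duplicate observations become exact matches.
df["Position"] = (df["Position"].str.capitalize()
                                .replace({"Supervisr": "Supervisor"}))

# Remove exact duplicate observations.
df = df.drop_duplicates()

# Filter outliers only after inspecting them; here a hypothetical
# domain rule keeps salaries within a plausible range.
df = df[df["Salary"].between(0, 100)]
```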
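The fruit table above, reproduced with pandas.get_dummies; this is one common way to one-hot encode (scikit-learn's OneHotEncoder is another):

```python
import pandas as pd

df = pd.DataFrame({"Fruit": ["Apple", "Mango", "Apple", "Orange"],
                   "Price": [5, 10, 15, 20]})

# One column per category: 1 where the row belongs to it, 0 elsewhere.
print(pd.get_dummies(df, columns=["Fruit"], dtype=int))
#    Price  Fruit_Apple  Fruit_Mango  Fruit_Orange
# 0      5            1            0             0
# 1     10            0            1             0
# 2     15            1            0             0
# 3     20            0            0             1
```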
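The six augmentation methods above, sketched with torchvision transforms (assuming torchvision is available); the parameter values below are one possible configuration, not the one used on the slides:

```python
from torchvision import transforms

# One random variant of each listed method; applying this pipeline to a
# PIL image yields a new augmented training example each time.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),           # 1. mirroring
    transforms.RandomResizedCrop(size=224),           # 2. random cropping
    transforms.RandomRotation(degrees=45),            # 3. rotation
    transforms.RandomAffine(degrees=0, shear=15),     # 4. shearing
    transforms.ColorJitter(hue=0.1, brightness=0.4),  # 5./6. color shift, brightness
])
# augmented = augment(original_image)  # original_image: a PIL image
```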
Data Transformation: Data Distribution/Histogram
(The slide shows example distributions/histograms for categorical, numerical, and ordinal features.)

Data Normalization: Definition
Transform data features to be on a similar scale, which improves the performance and training stability of the machine learning model. Normalization is useful when your data has varying scales and the algorithm you are using does not make assumptions about the distribution of your data.

Data Normalization: Techniques
The slides cover four techniques: log scaling, feature clipping, z-score, and linear scaling, followed by a comparison of linear scaling vs. z-score. Worked examples follow.

Data Normalization: Linear Scaling
Age | Income | #Credit Cards | Buy Insurance
35 | 15,000 | 1 | No
45 | 32,000 | 3 | No
23 | 7,000 | 4 | No
55 | 45,000 | 3 | Yes
65 | 12,000 | 0 | Yes
27 | 20,000 | 2 | No
33 | 25,000 | 2 | Yes
Age range = 23 to 65 (42 years); Income range = 7k to 45k (AED 38,000); #Credit Cards range = 0-4 (4 credit cards). Normalization needed!

Linear scaling maps each value to: Age' = (Age - Age_min) / (Age_max - Age_min).

Age: Age_min = 23, Age_max = 65, Age_max - Age_min = 42.
Age | Age - Age_min | (Age - Age_min) / 42 | Age'
35 | 12 | 12/42 = 0.29 | 0.29
45 | 22 | 22/42 = 0.52 | 0.52
23 | 0 | 0/42 = 0.00 | 0.00
55 | 32 | 32/42 = 0.76 | 0.76
65 | 42 | 42/42 = 1.00 | 1.00
27 | 4 | 4/42 = 0.10 | 0.10

Income: Income_min = 7k, Income_max = 45k, Income_max - Income_min = 38k.
Income | Income - Income_min | (Income - Income_min) / 38,000 | Income'
15,000 | 8,000 | 8,000/38,000 | 0.21
32,000 | 25,000 | 25,000/38,000 | 0.66
7,000 | 0 | 0/38,000 | 0.00
45,000 | 38,000 | 38,000/38,000 | 1.00
12,000 | 5,000 | 5,000/38,000 | 0.13
20,000 | 13,000 | 13,000/38,000 | 0.34

#Credit Cards: min = 0, max = 4, max - min = 4.
#Credit Cards | value - min | (value - min) / 4 | #Credit Cards'
1 | 1 | 1/4 | 0.25
3 | 3 | 3/4 | 0.75
4 | 4 | 4/4 | 1.00
3 | 3 | 3/4 | 0.75
0 | 0 | 0/4 | 0.00
2 | 2 | 2/4 | 0.50

Result:
Age' | Income' | #Credit Cards' | Buy Insurance
0.29 | 0.21 | 0.25 | No
0.52 | 0.66 | 0.75 | No
0.00 | 0.00 | 1.00 | No
0.76 | 1.00 | 0.75 | Yes
1.00 | 0.13 | 0.00 | Yes
0.10 | 0.34 | 0.50 | No
0.24 | 0.47 | 0.50 | Yes
Normalization completed!

Data Normalization: Clipping
Room Temperature (C) | People Inside | Humidity (%) | Cooling Needed
23.2 | 30 | 100 | High
24.8 | 150 | 65 | Medium
22.1 | 20 | 67 | Low
23.7 | 30 | 70 | Medium
44.3 | 40 | 80 | High
22.5 | 20 | 69 | Low
14 | -10 | 73 | Low
Valid ranges: Room Temperature = 16-27 degrees; People Inside = 0-50; Humidity = 60-85. Normalization needed!

Clipping caps each value to its valid range: 44.3 -> 27.0 and 14.0 -> 16.0 (temperature), 150 -> 50 and -10 -> 0 (people), 100 -> 85 (humidity).

Room Temperature' (C) | People Inside' | Humidity' (%) | Cooling Needed
23.2 | 30 | 85 | High
24.8 | 50 | 65 | Medium
22.1 | 20 | 67 | Low
23.7 | 30 | 70 | Medium
27.0 | 40 | 80 | High
22.5 | 20 | 69 | Low
16.0 | 0 | 73 | Low
Normalization completed!

Data Normalization: Z-Score
Starting from the same Age/Income/#Credit Cards table (normalization needed!), the z-score maps each value to Age' = (Age - Age_mean) / Age_std_dev, so that afterwards Age'_mean = 0 and Age'_std_dev = 1.

Age: Age_mean = 40.43, Age_std_dev = 15.31.
Age | Age - Age_mean | (Age - Age_mean) / 15.31 | Age'
35 | -5.43 | -0.35 | -0.35
45 | 4.57 | 0.30 | 0.30
23 | -17.43 | -1.14 | -1.14
55 | 14.57 | 0.95 | 0.95
65 | 24.57 | 1.61 | 1.61
27 | -13.43 | -0.88 | -0.88
33 | -7.43 | -0.49 | -0.49
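A sketch reproducing these three techniques with pandas on the slide data; note the slide's standard deviation (15.31) is the sample standard deviation (ddof=1), which is what pandas uses by default:

```python
import pandas as pd

age = pd.Series([35, 45, 23, 55, 65, 27, 33])

# Linear scaling (min-max): (x - min) / (max - min), results in [0, 1].
age_scaled = (age - age.min()) / (age.max() - age.min())
# -> 0.29, 0.52, 0.00, 0.76, 1.00, 0.10, 0.24 (rounded), as in the table

# Feature clipping: cap values to their known valid range.
temp = pd.Series([23.2, 24.8, 22.1, 23.7, 44.3, 22.5, 14.0])
temp_clipped = temp.clip(lower=16, upper=27)   # 44.3 -> 27.0, 14.0 -> 16.0

# Z-score: (x - mean) / std, giving mean 0 and standard deviation 1.
age_z = (age - age.mean()) / age.std()   # std() uses ddof=1, i.e. 15.31
# -> -0.35, 0.30, -1.14, 0.95, 1.61, -0.88, -0.49 (rounded), as in the table
```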
Breakout Session: Clean your Data!
Activity:
1. Fill the missing data using interpolation
2. Remove duplicate observations
3. Fix structural errors
4. Remove outliers
5. Apply one-hot encoding

Years of Experience | Position | Salary (k AED)
1 | Staff | 8
2 | staff | 11
3 | Staff | _
4 | Staff | 17
3 | Staff | 14
6 | Staff | _
7 | Staff | 26
7 | Manager | 20
8 | Supervisr | 30
9 | Supervisor | 33

Solution (missing salaries interpolated to 14 and 23; structural errors "Supervisr" and "staff" fixed; Position one-hot encoded; a code sketch of the pipeline follows this section):

Years of Experience | Staff | Supervisor | Manager | Salary (k AED)
1 | 1 | 0 | 0 | 8
2 | 1 | 0 | 0 | 11
3 | 1 | 0 | 0 | 14
4 | 1 | 0 | 0 | 17
3 | 1 | 0 | 0 | 14
6 | 1 | 0 | 0 | 23
7 | 1 | 0 | 0 | 26
7 | 0 | 0 | 1 | 20
8 | 0 | 1 | 0 | 30
9 | 0 | 1 | 0 | 33

Breakout Session: Data Augmentation Class Activity
For each configuration, produce the required augmented images (the solution slide shows the resulting images):
Configuration | Data Augmentation Required
Translate | right; left
Scale | smaller; bigger
Rotate | 45 degrees clockwise; 45 degrees counter-clockwise
Crop | from top; from bottom; from side

Breakout Session: Normalize the Rest of the Data
Using linear scaling as above, normalize the remaining columns:
Age | Income | #Credit Cards | Buy Insurance
35 | 15,000 | 1 | No
45 | 32,000 | 3 | No
23 | 7,000 | 4 | No
55 | 45,000 | 3 | Yes
65 | 12,000 | 0 | Yes
27 | 20,000 | 2 | No
33 | 25,000 | 2 | Yes
Age range = 23 to 65 (42 years); Income range = 7k to 45k (AED 38,000); #Credit Cards range = 0-4. Normalization needed!
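A sketch of the cleaning breakout in pandas, under the assumption that the missing salaries are interpolated over Years of Experience (which reproduces the 14 and 23 in the worked solution). The slide's solution table still lists the duplicate Years=3 row; the sketch removes it per step 2:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Years":    [1, 2, 3, 4, 3, 6, 7, 7, 8, 9],
    "Position": ["Staff", "staff", "Staff", "Staff", "Staff",
                 "Staff", "Staff", "Manager", "Supervisr", "Supervisor"],
    "Salary":   [8, 11, np.nan, 17, 14, np.nan, 26, 20, 30, 33],
})

# 1. Interpolate missing salaries over Years of Experience: sorting by
#    Years lets interpolate(method="index") use Years as the x-axis,
#    giving 14 at Years=3 and 17 + (26-17)*(6-4)/(7-4) = 23 at Years=6.
df = df.sort_values("Years").reset_index(drop=True)
df["Salary"] = (df.set_index("Years")["Salary"]
                  .interpolate(method="index")
                  .to_numpy())

# 3. Fix structural errors first so duplicates become exact matches.
df["Position"] = (df["Position"].str.capitalize()
                                .replace({"Supervisr": "Supervisor"}))

# 2. Remove the duplicate (Years=3, Staff, 14) observation.
df = df.drop_duplicates()

# 4. No observation lies an abnormal distance from the rest here, so the
#    outlier step filters nothing.

# 5. One-hot encode Position into Staff/Supervisor/Manager columns.
df = pd.get_dummies(df, columns=["Position"], dtype=int)
print(df)
```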