Full Transcript

Introduction to Machine Learning

Table of contents
01 Course Structure
02 What is Machine Learning?
03 What is NOT Machine Learning?
04 Sequence of ML process
05 ML applications in Civil engineering
06 Benefits and Challenges

01 Course Structure
- Understanding the science of Machine Learning (ML)
- ML applications in different fields in civil engineering

02 What is Machine Learning?
[Figure: Traditional Programming vs. Machine Learning]
“Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.” (Arthur Samuel)

03 What is NOT Machine Learning?

04 Sequence of ML process

05 ML applications in Civil engineering
- Construction: ML optimizes project scheduling, resource allocation and risk management
- Material: Enhances material properties and predicts performance under various conditions
- Transportation: Improves traffic management and public transportation systems through predictive analysis
- Geotechnical: Predicting soil behavior, classifying soil types, and modeling complex underground interactions
- Structure Health monitoring: Sensors and ML algorithms to monitor infrastructure health in real time

06 Benefits and Challenges
Benefits
- Increased accuracy: Enhances precision in predictions and designs
- Cost reduction: Lower operational and maintenance costs
- Improved efficiency: Accelerates project timelines
Challenges
- Data quality and availability: Requires high quality data sets which may not always be available
- Need for domain expertise: Successful ML implementation demands deep knowledge of both ML and civil engineering
- Computational resources: Processing large datasets may require significant computational power

Thank you!

Artificial Intelligence in Civil Engineering CIS908
Lecture 1: Introduction to Artificial Intelligence
Course Instructor: Associate Prof. Dr.-Ing.
Maggie Mashaly [email protected] C4.211

Lecture Outline
- Evolution of AI
- AI Use Cases
- Artificial Intelligence vs. Machine Learning vs. Deep Learning
- Machine Learning
- Types of Machine Learning
  1. Supervised Learning
  2. Unsupervised Learning
  3. Reinforcement Learning

AI for Civil Engineering CIS908 Dr.- Ing. Maggie Mashaly

Artificial Intelligence: We use it every day!
- Facebook’s Face Recognition
- Google Search
- Spam Filtering in e-mails
- And in many other scenarios:
  - Data Mining programs detecting fraudulent credit card transactions
  - Learning users’ preferences
  - Self-driving vehicles
  - and many more…

AI for Civil Engineering
[Figures]
Reference: https://www.sciencedirect.com/science/article/abs/pii/B9780443131912000092

Growth of Artificial Intelligence
Artificial Intelligence is the preferred approach for:
- Speech recognition
- Natural language processing
- Computer vision
- Medical outcomes analysis
- Robot control
- Computational biology
- Sensor networks

Growth of Artificial Intelligence
[Figure]
AI/ML/DL
[Figure]

Artificial Intelligence in Civil Engineering CIS908
Lecture 2: Machine Learning

What is Machine Learning?
It is the science of getting computers to learn without being explicitly programmed, through constructing computer programs that automatically improve with experience.

Machine Learning: Study of algorithms that
- improve their performance P
- at some task T
- with experience E
So a well-defined learning task is represented as {P, T, E}.

Machine Learning Examples
Example: Spam Filtering process
- T: Classifying e-mails as spam or not
- E: Watching you label e-mails as spam or not spam
- P: Number of e-mails correctly classified as spam/not spam
Example: Image Recognition
- T: Detecting faces in images
- E: Given example training images
- P: Number of correctly recognized faces

Types of Machine Learning
Supervised Learning
- Also known as predictive learning
- Given: Training set D of N input-output pairs, $D = \{(x_i, y_i)\},\ i = 1, \dots, N$
- Goal: Learn a mapping from inputs x to outputs y
- Has two types:
  1. Regression: when y is a continuous value output
  2. Classification: when y is a discrete value output
Unsupervised Learning
- Also known as descriptive learning
- Given: Training set D of N inputs, $D = \{x_i\},\ i = 1, \dots, N$
- Goal: Finding interesting patterns in the data
- Has two types:
  1. Clustering: grouping data into cohesive groups
  2. Non-Clustering: finding structure in a chaotic environment
Other types: Reinforcement Learning

Supervised vs. Unsupervised Learning
- Supervised Learning, training set: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), (x^{(3)}, y^{(3)}), \dots, (x^{(m)}, y^{(m)})\}$
- Unsupervised Learning, training set: $\{x^{(1)}, x^{(2)}, x^{(3)}, \dots, x^{(m)}\}$

Supervised Learning
Given: Training set D of N input-output pairs, $D = \{(x_i, y_i)\},\ i = 1, \dots, N$, where
- $x_i$ is the input variable (also called input features)
- $y_i$ is the output/target value
- $(x_i, y_i)$ is a training example
Goal: Learn a function $h: x \to y$ such that h(x) is a good predictor for the corresponding value of y
[Figure: the Training Set feeds a Learning Algorithm, which produces h; h maps an input x to a predicted y]

Supervised Learning Example: Predicting house prices
- Assume we are given a training set of living areas and prices for n houses
- Perform Regression to estimate the hypothesis function to predict continuous valued output

Supervised Learning Example: Shape Recognition
- Assume we are given a training set of different shapes corresponding to two classes C=Yes/No
- Perform Classification to estimate the hypothesis function to predict discrete valued output
- If C=2 output classes: ✓ Binary Classification
- If C>2: ✓ Multiclass Classification

Unsupervised Learning
Given: Only input data is given without any outputs
Goal: Deriving structure from data where we don’t necessarily know the effect of the variable

Unsupervised Learning Example: Clustering according to height and weight
- Assume we are given the heights and weights of a group
- Perform Clustering to divide the data into groups

Unsupervised Learning: More examples
- Organizing computer clusters: servers that work together are placed together for better data center performance
- Social Network Analysis: Detecting cohesive groups and suggesting friends
- Market Segmentation: Grouping customers into segments for better marketing
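The height-and-weight clustering example above can be made concrete with a small sketch. The slides do not name a clustering algorithm, so k-means is an assumption here, and the measurements and starting centroids are invented for illustration.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate an assignment step and a centroid update step."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid (Euclidean distance).
            nearest = min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[k]
            for k, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical (height in cm, weight in kg) measurements; no labels are given.
people = [(150, 50), (152, 53), (155, 55), (180, 82), (183, 85), (185, 90)]

# Assumption: start from the first and last person as initial centroids.
centroids, clusters = kmeans(people, centroids=[people[0], people[-1]])
for centroid, members in zip(centroids, clusters):
    print(f"centroid={centroid}, members={members}")
```

Note that only inputs are supplied: the two groups emerge from the data alone, which is what distinguishes this from the supervised examples.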
Unsupervised Learning Example: Cocktail Party Problem
- Assume a cocktail party where everyone is speaking at the same time
- Trying to recognize what everyone is saying
- Perform a non-clustering algorithm to find structure in a chaotic environment

Artificial Intelligence in Civil Engineering CIS908
Lecture 3: Data Analytics I

Lecture Outline
- Data Workflows & Pipelines
- Big Data
- Types of Data: Structured, Unstructured, Semi-structured
- Data Warehouses & Data Lakes
- Data Processing

The world’s most valuable resource is no longer oil, but Data

Big Data has become our new norm
$1\ \text{ZB} = 10^{21}\ \text{B}$
[Figures]

Data Workflow
How does data flow in an organization? 4 general steps:
1. Data Collection & Storage
   - Data is collected from traffic, surveys or media traces
   - Data has different types, e.g.: images, videos, text files, etc.
   - Data is stored in Raw Format
2. Data Preparation
   - Cleans data, e.g.: by removing missing/duplicate values
   - Converts data into a more organized format
3. Exploration & Visualization
   - Usually done by building Dashboards that help us visualize the data and keep track of how it changes
   - Allows us to compare different data sets, visualize trends and relations in data, etc.
4. Experimentation & Prediction
   - Involves gaining insights about the data to be able to draw conclusions and make decisions
   - Building prediction models could also be done using Machine Learning techniques

Handling Data Workflow
Handling Data Workflow is done by Data Engineers
Tasks of a Data Engineer: Deliver Correct Data, in the Right Form, to the right people, as efficiently as possible
- Ingest data from different sources
- Optimize databases for analysis
- Remove corrupted data
- Develop, construct, test and maintain data architectures

What is Big Data?
- Data with huge volumes that makes traditional data handling methods inefficient
- Characterized by five V’s
[Figure]

So how do we manage Big Data?
By following a few (not so simple) steps:
1. Ingest data from different sources
2. Process the data
3. Store the data
To do that efficiently, we need Data Pipelines, which:
- Enable the flow of data from sources to data warehouses, then to analytics databases
- Automate flow between stations
- Provide up-to-date, accurate, relevant data

Data Pipeline
Ensures an efficient flow of the data by:
Automating
✓ Data Extraction
✓ Data Transformation
✓ Combining Data
✓ Validating Data
✓ Loading Data
Reducing
✓ Human Intervention
✓ Errors
✓ Data Flow time
Usually following the ETL Framework, “Extract, Transform, Load”:
1. Extract Data
2. Transform extracted data
3. Load transformed data to another database

Data Types
So what types of data are we dealing with?
- Structured Data
- Semi-Structured Data
- Un-Structured Data
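Returning to the Data Pipeline slide, the three ETL steps can be sketched in a few lines. This is a minimal illustration rather than a production pipeline: the CSV content, the `readings` table and the function names are all hypothetical, and an in-memory SQLite database stands in for the target database.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from files, APIs or streams.
RAW_CSV = """sensor_id,reading
s1,10.5
s2,
s1,10.5
s3,7.25
"""

def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop missing and duplicate rows, convert readings to numbers."""
    seen, clean = set(), []
    for row in rows:
        key = (row["sensor_id"], row["reading"])
        if not row["reading"] or key in seen:
            continue  # missing value or duplicate record
        seen.add(key)
        clean.append((row["sensor_id"], float(row["reading"])))
    return clean

def load(rows, connection):
    """Load: write the transformed rows into another database."""
    connection.execute("CREATE TABLE IF NOT EXISTS readings (sensor_id TEXT, reading REAL)")
    connection.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    connection.commit()

connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
stored = connection.execute(
    "SELECT sensor_id, reading FROM readings ORDER BY sensor_id"
).fetchall()
print(stored)
```

Chaining the three functions is what removes the human intervention: once the source changes, rerunning the same call refreshes the target table.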
Data Types: Structured Data
Data having a defined data model, format & structure
Characteristics:
- Has a consistent model of rows and columns
- Easy to search & organize
- Often managed using Structured Query Language ‘SQL’
- Can be grouped to form relations through relational databases
Example: Students’ list in a course

Index  Name            ID       Faculty  GPA
0      John Doe        46-1234  IET      1.2
1      Jane Doe        37-5678  MET      2.3
2      Mohamed Ahmed   43-9010  BI       3.4
3      Ahmed Mohamed   40-1112  MGMT     2.5

Data Types: Structured Data
Example: Relational Databases
Structured data tables can easily be related to each other. In this example: connect on the ID column.

Index  Name            ID       Faculty  GPA
0      John Doe        46-1234  IET      1.2
1      Jane Doe        37-5678  MET      2.3
2      Mohamed Ahmed   43-9010  BI       3.4
3      Ahmed Mohamed   40-1112  MGMT     2.5

ID       Major
46-1234  Networks
37-5678  DMET
43-9010  BI
40-1112  Marketing

Result of connecting the two tables on ID:

Index  Name            ID       Faculty  Major      GPA
0      John Doe        46-1234  IET      Networks   1.2
1      Jane Doe        37-5678  MET      DMET       2.3
2      Mohamed Ahmed   43-9010  BI       BI         3.4
3      Ahmed Mohamed   40-1112  MGMT     Marketing  2.5

Data Types: Semi-Structured Data
Data having some consistent characteristics but not conforming to a structure as rigid as a database
Characteristics:
- Consistent model, less rigid implementation
- Relatively easy to search & organize
- Managed using ‘NoSQL’ databases and formats such as JSON, XML, YAML
- Can be grouped, but needs more work
Example:
- Email Messages
  ▪ Unstructured content
  ▪ Structured sender/recipient names and addresses, time sent, …
- Digital Images
  ▪ Unstructured image content
  ▪ Structured date/time stamp, location, device, …

Data Types: Semi-Structured Data
Example: JSON file of a favorite artists list
▪ Consistent model, all entries have the same set of features
▪ Number of favorite artists may differ (a characteristic not allowed in structured databases)

Data Types: Unstructured Data
Data that doesn’t follow a model and can’t be contained in a row-column database
Characteristics:
- Challenging to search, manage and analyze
- Usually stored in data lakes, but can appear in data warehouses or databases
- Most of our data nowadays is unstructured
- Extremely valuable, can be analyzed using AI & ML techniques
Example:
- Facebook feed consisting of Status updates, Sound Clips, Pictures & Videos
- Presentations & PDFs

Data Lakes & Data Warehouses
Where is our data stored?
Data Lake
- Stores all the raw data
- Stores all data structures
- Cost-effective
- Difficult to analyze
- Highly agile, configure & reconfigure as needed
- Used by data scientists for big data/real time analytics
Data Warehouse
- Stores specific data for specific use
- Stores mainly structured data; a warehouse is a type of database
- More costly to update
- Optimized for data analysis
- Less agile, fixed configuration
- Used by data & business analysts

Data Lakes Practices: Data Catalog
What is a Data Catalog?
“An organized inventory of data assets in the organization”
- Is a source of truth that compensates for the lack of structure in a data lake
- Uses Metadata to help organizations manage their data
Identifies:
- What is the source of the data?
- Where is this data used?
- Who is the owner of the data?
- How often is the data updated?

Data Lakes Practices: Data Catalog
- A Data Catalog is a good practice for any data storage solution
- Ensures reliability, autonomy, scalability & speed
- Prevents a Data Lake from turning into a Data Swamp
[Figure: Data Lake vs. Data Swamp]

Data Processing
“The process of converting Raw Data into Meaningful Information”

Data Processing: Why do we need to process data?
1. Remove unwanted data
2. Optimize memory, storage & network costs
   - Storing & transferring data comes at a high cost
   - Uncompressed data can be 10 times bigger than compressed data
3. Convert data from one type to another
   - E.g.: changing file type to reduce its size
4. Organize data
   - From data lake to data warehouse to ease the analysis step
5. Increase productivity

Data Processing: How often do we process data?
1. Batch Processing
   - Group records at intervals
   - Often cheaper (can be scheduled when resources are not consumed by other processes)
2. Stream Processing
   - Send individual records right away
   - E.g.: updating the database with newly registered users so they can use their accounts

Data Processing
There are hundreds of software tools for this task (FYI)
[Figure]

Artificial Intelligence in Civil Engineering CIS908
Lecture 4: Data Analytics II

Lecture Outline
- Types of Variables
- Data Transformation
- Handling Data Challenges:
  1. Missing Data
  2. Rare Labels
  3. Outliers
  4. Linear Models Assumption
  5. Variable Distribution
  6.
Variable Magnitude

Poll: What do data scientists mostly spend their time on?
- Cleaning & Organizing Data
- Building Training Sets
- Mining data for patterns
- Collecting datasets
- Refining Algorithms
- Other
[Figure: poll results]
Reference: https://www.forbes.com/sites/gilpress

Variables
Variables are data values that can be counted/measured, e.g.: Name, Age, Gender, Date, etc.
Types of Variables:
[Figure]

Numerical Variables
Discrete Variables
- Variables whose values are whole numbers/counts
- Examples:
  - Number of family members
  - Number of items bought from store
  - Number of students enrolled in course
Continuous Variables
- Variables that can take any value within a certain range
- Examples:
  - Mobile phone bill
  - House price
  - Time spent on social media
[Figures: one chart over Data Engineering, Cloud Computing and Local Area Networks; one chart over Thursday, Friday, Saturday and Sunday]

Categorical Variables
Variables that can take on one of a limited, and usually fixed, number of possible values
Example:
- Mobile Network Provider (Etisalat, Vodafone, …)
- Gender (Male/Female)
- Marital Status (Married, Divorced, Single, …)

Categorical Variables
Ordinal Variables
- Categorical variables in which categories can be meaningfully ordered
- Examples:
  - Students’ exam grades (A=0.7, B=1.3, …, F=5)
  - Week Days (Saturday=1, …, Friday=7)
Nominal Variables
- Categorical variables that cannot be meaningfully ordered
- Examples:
  - Place of birth
  - Postal codes
  - Gender
Special cases of Categorical Variables:
- Categories can be encoded as numbers (Male=0, Female=1)
- ID variables: numbers that uniquely identify an observation

Categorical Variables: Cardinality
The number of different categories that a variable can take is referred to as Cardinality
Example:

Gender  Language
Male    Chinese
Male    English
Male    English
Male    French
Male    Dutch
Male    French
Female  Italian
Female  Arabic
Female  German
Female  German
Female  Arabic
Female  Spanish

- Cardinality of variable ‘Gender’ = 2
- Cardinality of variable ‘Language’ = 8
Variables with a high number of categories, such as postal codes, are referred to as being Highly Cardinal
Problems with Highly Cardinal variables:
1. Scikit-Learn does not support strings as inputs
2. Categories must be encoded into numbers
3. Uneven distribution between Train/Test sets for Machine Learning models
   - Some categories might appear in the train set only: Over-fitting
     Why?
     - Variables with too many categories tend to dominate those with fewer categories
     - A large number of categories introduces more noise and less information
     - Thus, reducing cardinality may improve performance
   - Some categories might appear in the test set only: Prediction failure
     Why?
     - Models cannot perform predictions for categories they haven’t been trained on
[Figure: Train and Test sets]

Date & Time Variables
Date & Time variables require special consideration to extract useful information from them
Has three types:
- Date only, e.g.: Date of Birth (‘27/3/1989’, ‘Dec-2019’)
- Time only, e.g.: Time of accident (‘13:04:05’)
- Date & Time, e.g.: Payment date (‘2/11/2020 10:00:07’)

Mixed Variables
Variables containing both numbers and categories in their values
Has two types:
Numbers or labels in different observations
- Examples:
  - Students’ grade record (251, C, 3)
  - Number of pending payments (1-5, D/C)
Numbers and labels in the same observation
- Examples:
  - Seat Number (A5)
  - Car plates (ABC123)

Artificial Intelligence in Civil Engineering CIS908
Lecture 5: Data Challenges

Data Challenges
A few of the problems we deal with while handling variables include:
1. Missing Data
2. Rare Labels
3. Outliers
4. Linear Models Assumption
5. Variable Distribution
6. Variable Magnitude

Missing Data
- Missing data/variables occur when no data is stored for a certain observation in a variable
- A common occurrence in most datasets
- Has a significant effect on conclusions drawn from data
Why might that happen?
1. Lost Data: Data missing due to being forgotten, lost, or not properly stored. E.g.: missing fields in forms/surveys
2. Non-existing: e.g.: NaN resulting from dividing by zero
3. Not found/identified: results when attempting to match against wrong or non-existing data

Missing Data
Solution: Data Imputation
The process of replacing missing data with substituted values
But first, we need to understand the mechanisms by which missing data is introduced:
1. MCAR: Missing Data Completely at Random
2. MAR: Missing Data at Random
3.
MNAR: Missing Data Not at Random

Missing Data: MCAR
Missing Data Completely at Random (MCAR)
- Occurs when the probability of being missing is the same for all the observations
- No relationship exists between missing data and any other values (also missing or observed) within the dataset
- Causes of the missing data are unrelated to the data
- Thus: Disregarding those missing values will not cause bias
Examples:
- Device running out of batteries, thus missing some readings
- Repeated members in data samples

Missing Data: MAR
Missing Data at Random (MAR)
- Occurs when the probability of an observation being missing depends on available information
- i.e.: there is a systematic relationship between the behavior of missing values and the observed data
Example:

Gender  Weight    Gender  Weight
Male    60 kg     Female  NA
Male    NA        Female  NA
Male    NA        Female  60 kg
Male    77 kg     Female  55 kg
Male    80 kg     Female  NA
Male    62 kg     Female  58 kg

- 2 NA / 6 Men = 33%
- 3 NA / 6 Women = 50%
- More missing values for women than men
- Gender should be considered while compensating for missing data

Missing Data: MNAR
Missing Data Not at Random (MNAR)
- Occurs when there exists a reason/mechanism why missing values are introduced in the dataset
Example:

Depression  No. of Clinical Visits  No. of Weekly Sport Classes  Depression  No. of Clinical Visits  No. of Weekly Sport Classes
Yes         1                       NA                           No          0                       0
Yes         NA                      NA                           No          NA                      5
Yes         NA                      0                            No          1                       2
Yes         4                       2                            No          1                       1
Yes         NA                      1                            No          2                       1
Yes         3                       NA                           No          NA                      2

- More NA overall for depressed patients
- Fewer NA for non-depressed patients

Rare Labels
Rare labels are those appearing only in a tiny proportion of the observations in a dataset
Example: for the variable ‘City’ where a person lives
- Cairo would be a frequent category
- Farafra would be a rare category
Problems with Rare Labels: Same as with High Cardinality
1. Scikit-Learn does not support strings as inputs
2. Categories must be encoded into numbers
3. Uneven distribution between Train/Test sets
   - Some categories might appear in the train set only: Over-fitting
   - Some categories might appear in the test set only: Prediction failure

Rare Labels
Example: Predicting house prices given the material of their exteriors
[Figure: rare labels, only present in less than 5% of the data; price varies tremendously among these rare labels]

Rare Labels
Interpretation of the huge variation in the average price values:
- Rare labels could be very predictive, or
- They could be introducing noise rather than information
Can we know for sure which of them is the case? Unfortunately not. Why?
- Because these labels are under-represented (only present in a very small percentage of the data)
- Thus, it is hard to derive reliable information from these labels

Rare Labels
One possible solution: Group all rare labels together
[Figure]

Outliers
“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism” [D. Hawkins, Identification of Outliers, Chapman and Hall, 1980]
How to deal with outliers:
- Give them special attention to find out what caused them? Or
- Ignore and remove all outlier data?
Depending on the context:
[Figures: Revenue Forecasting; Credit Card Transaction]

Outliers: How to detect outliers
Based on the distribution of data. But first, a review of quantiles & quartiles:
- Quartiles: dividing the distribution into 4
- Quantiles (percentiles): dividing the distribution into 100
- 1st Quartile = 25th Quantile, 3rd Quartile = 75th Quantile
- 2nd Quartile = 50th Quantile = Median
- Interquartile Range (IQR): where 50% of the observations are found
  IQR = 75th Quantile - 25th Quantile = 3rd Quartile - 1st Quartile

Outliers: Normal Distribution
- Approximately 99.7% of the observations of a normally distributed variable lie within mean ± 3 * standard deviation ($\sigma$)
- So, values outside mean ± 3 * standard deviation are considered outliers

Outliers: Skewed Distribution
- Calculate the quantiles
- Interquartile Range (IQR) = 75th Quantile - 25th Quantile
- Upper limit = 75th Quantile + IQR * 1.5
- Lower limit = 25th Quantile - IQR * 1.5
* For detecting extreme outliers, multiply the IQR by 3 instead of 1.5

Outliers: Visualizing outliers
[Figures: Box Plots]

Linear Models
- Using a linear function to estimate the relation between input variables and labels
- Follows the linear equation: $Y \approx \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$
- A linear relationship can be determined using Scatter Plots
[Figure: scatter plots of a linear, a semi-linear and a non-linear relationship]
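The two outlier rules from the Outliers slides (mean ± 3 standard deviations for normal data, quartiles ± 1.5 * IQR for skewed data) can be sketched as follows. The revenue values are invented for illustration. Quartile conventions also differ between tools, so the exact limits depend on the method; this sketch uses the default of Python's `statistics.quantiles`.

```python
import statistics

def iqr_limits(values, factor=1.5):
    """Lower/upper limits: quartiles minus/plus factor * IQR (use factor=3 for extreme outliers)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # 25th, 50th, 75th quantiles
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

def iqr_outliers(values, factor=1.5):
    """Values outside the IQR-based limits (suited to skewed distributions)."""
    lower, upper = iqr_limits(values, factor)
    return [v for v in values if v < lower or v > upper]

def sigma_outliers(values, k=3):
    """Values further than k standard deviations from the mean (assumes normality)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > k * std]

# Hypothetical daily revenue figures with one suspicious entry.
revenue = [52, 48, 50, 51, 49, 47, 53, 50, 52, 49, 48, 51, 200]

print("IQR limits:", iqr_limits(revenue))
print("IQR outliers:", iqr_outliers(revenue))
print("3-sigma outliers:", sigma_outliers(revenue))
```

Whether a flagged value should be removed or investigated still depends on the context, as the slides point out: in fraud detection the outlier is the interesting observation.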
Linear Models
- Residual plots can also be used; they express the difference between the predictions and the real values
- They show residuals on the vertical axis and the independent variables on the horizontal axis
- If the relationship between X & Y is linear, residuals should be normally distributed and centered around 0
[Figure: residual plots for a linear and a semi-linear relationship]

Linear Models
Linear Models make the following (misleading) assumptions:
- Linear relationship between variables and labels
- Multivariate normality: all independent variables x are normally distributed
- No/little collinearity: variables have zero or little correlation between them
- Homoscedasticity: the random disturbance has the same variance across all values of the independent variables

Linear Models - Normality
Variables are considered Normal when they follow the Gaussian (Normal) distribution
Why is non-normality a problem?
- Most linear models are based on the normality assumption
- Normally distributed variables are easier to deal with; there is no need to handle skewness
Can be statistically tested (e.g.: Kolmogorov-Smirnov test)
If variables are not normal? Do a non-linear transformation (e.g.: logarithm transformation)

Linear Models - Normality
Evaluating Normality: Histograms
Gaussian distributions adopt a ‘bell shape’
[Figure: histograms for three panels titled Linear Relationship, Semi-Linear Relationship and Non-Linear Relationship]

Linear Models - Normality
Evaluating Normality: Q-Q Plots
- Q-Q plots present the variable quantiles on the y-axis and the expected quantiles of the normal distribution on the x-axis
- For normally distributed variables, the points should fall on a 45 degree line
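A Q-Q plot is built from pairs of expected and observed quantiles, so the idea can be sketched without any plotting library. This is an illustrative sketch only. The samples are invented, the plotting positions (i - 0.5) / n are one common convention assumed here, and `qq_gap` is a hypothetical helper that measures how far the points sit from the 45 degree line.

```python
import math
import statistics

def qq_points(sample):
    """Return (expected normal quantile, observed quantile) pairs for a Q-Q plot."""
    ordered = sorted(sample)
    n = len(ordered)
    # Reference normal distribution fitted to the sample's mean and standard deviation.
    reference = statistics.NormalDist(statistics.mean(ordered), statistics.stdev(ordered))
    return [(reference.inv_cdf((i - 0.5) / n), value)
            for i, value in enumerate(ordered, start=1)]

def qq_gap(sample):
    """Mean distance from the 45 degree line, in units of the sample standard deviation."""
    points = qq_points(sample)
    spread = statistics.stdev(sample)
    return sum(abs(expected - observed) for expected, observed in points) / (len(points) * spread)

# Hypothetical samples: one roughly symmetric, one strongly right-skewed.
symmetric = [44, 46, 47, 48, 49, 50, 50, 51, 52, 53, 54, 56]
skewed = [1, 1, 2, 2, 3, 3, 4, 5, 7, 10, 20, 60]
log_skewed = [math.log(v) for v in skewed]  # logarithm transformation

for name, sample in [("symmetric", symmetric), ("skewed", skewed), ("log(skewed)", log_skewed)]:
    print(f"{name:12s} gap from 45 degree line: {qq_gap(sample):.2f}")
```

On these samples the logarithm transformation brings the skewed variable closer to the line, which is the effect the Normality slide describes for non-linear transformations.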
[Figure: Q-Q plots for three panels titled Linear Relationship, Semi-Linear Relationship and Non-Linear Relationship]

Linear Models - Collinearity
Multicollinearity exists whenever an independent variable is highly correlated with one or more of the other independent variables
Why is multicollinearity a problem?
- Undermines the statistical significance of an independent variable
- Makes it challenging to detect the effect of each variable on the label
Can be assessed using a correlation matrix or the Variance Inflation Factor (VIF)
How to resolve it?
- Removing redundant variables might solve the issue
- Principal Component Analysis

Linear Models - Collinearity
Evaluating Multicollinearity: Heat Maps of the Correlation Matrix
[Figure]

Linear Models - Homoscedasticity
- Assumption that the independent variables have the same finite variance
- Also known as homogeneity of variance
- Describes a situation when the random disturbance in the relationship between the independent variables (x) and the dependent variable (y) is the same across all values of the independent variables
Why is it a problem?
- Most linear models are based on it
Can be assessed using tests/plots: residuals plot, Levene’s test, Bartlett’s test, etc.
How to resolve it?
- Transforming skewed variables (e.g.: logarithm transformation)

Feature Magnitude
Recall the linear models’ equation: $Y \approx \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$
- Coefficients $\theta_i$ indicate the change in $Y$ per unit change in $x_i$
- Thus if the scale of $x_i$ changes, $\theta_i$ will change
  - e.g.: if $x_i$ represents distance, then expressing it in centimeters or kilometers will affect the value of $\theta_i$
- Coefficients’ values highly depend on the magnitude of features

Feature Magnitude
Problems with features of high/low magnitudes:
1.
Features with bigger magnitudes dominate over features with smaller magnitudes
2. Will cause some machine learning models to converge slower (e.g.: Gradient descent for Regression/Neural Networks)
3. Inaccuracy while calculating Euclidean distance between features:
   $D(x, y)^2 = \sum_{i=1}^{d} (x_i - y_i)^2$
Example:
[Figure]

Artificial Intelligence in Civil Engineering CIS908
Lecture 6: Applying Machine Learning

Contents
- Machine Learning Diagnostics
- Model Selection
- Diagnosing Bias vs. Variance

Machine Learning Diagnostic
Suppose you have implemented linear regression to predict housing prices.
$J(\theta_0, \theta_1, \dots, \theta_n) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
However, when you test your hypothesis on a new set of houses, you find that it makes unacceptably large errors in its predictions. What should you try next?
- Get more training examples
- Try smaller sets of features
- Try getting additional features
- Try adding polynomial features ($x_1^2, x_2^2, x_1 x_2$)
- Try a different hypothesis

Choosing your Hypothesis
Which one is the best hypothesis?
[Figures: three candidate fits of Price vs. Area]
- This one will present the lowest cost function (lowest error)
- But this may not present a general learning
- It represents an Over Fitting of the data

Over Fitting & Under Fitting
Under Fitting
› Poor fitting of the learning data
› Called high bias
› Caused by:
  – Simple hypothesis
  – Too few features
Over Fitting
› Excellent fitting of the training data
› Poor ability to predict unlearned data
› Called high variance
› Caused by:
  – Small number of training data
  – Very complex hypothesis
– Too many features

Model Selection
Consider your given dataset:

Size    Price   Set
2104    400     Training Set (size = m)
1600    330     Training Set
2400    369     Training Set
1416    232     Training Set
3000    540     Training Set
1985    300     Training Set
1534    315     Training Set
1427    199     Test Set (size = m_test)
1380    212     Test Set
1494    243     Test Set

Split the data into a Training Set (70%) and a Test Set (30%).
For Linear Regression:
- Learn the parameters $\theta$ from the training data by minimizing the training error $J(\theta)$
- Compute the test set error:
$J_{test}(\theta) = \frac{1}{2 m_{test}} \sum_{i=1}^{m_{test}} \left( h_\theta(x_{test}^{(i)}) - y_{test}^{(i)} \right)^2$

Model Selection
Back to our dataset:

Size    Price   Set
2104    400     Training Set (size = m)
1600    330     Training Set
2400    369     Training Set
1416    232     Training Set
3000    540     Training Set
1985    300     Training Set
1534    315     Cross Validation Set (size = m_cv)
1427    199     Cross Validation Set
1380    212     Test Set (size = m_test)
1494    243     Test Set

Split the data into a Training Set (60%), a Cross Validation (CV) Set (20%) and a Test Set (20%).
- Optimize the parameters $\theta$ using the training set for each polynomial degree, minimizing $J_{train}(\theta)$
- Find the polynomial degree d with the least error on the cross validation set:
$J_{cv}(\theta) = \frac{1}{2 m_{cv}} \sum_{i=1}^{m_{cv}} \left( h_\theta(x_{cv}^{(i)}) - y_{cv}^{(i)} \right)^2$
- Estimate the generalization error using the test set:
$J_{test}(\theta) = \frac{1}{2 m_{test}} \sum_{i=1}^{m_{test}} \left( h_\theta(x_{test}^{(i)}) - y_{test}^{(i)} \right)^2$

Model Selection
However, there are still limitations to using a single training/CV/test split:
- The original data might not be enough to form sufficiently large training/CV/test sets
- A single training set doesn't tell us how sensitive accuracy is to a particular training sample
Possible solution: Random Resampling

Model Selection
Done systematically, random resampling is called K-fold Sampling:
- Partition the data into K subsamples (folds)
- Iteratively leave one subsample out as the test set and train on the rest
Suppose we have 100 instances:
[Figure: fold-by-fold results]
Accuracy = 73/100 = 73%
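The K-fold procedure described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the course's implementation: the function names (`k_fold_indices`, `fit_class_means`, `predict`, `k_fold_accuracy`), the toy nearest-class-mean classifier, and the synthetic 100-instance dataset are all hypothetical. The accuracy is pooled over all held-out folds, in the same way as the "correct predictions / 100 instances" calculation on the slide.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n_samples-1 and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_class_means(xs, ys):
    """Toy 1-D classifier (an assumption for this sketch): mean of x per class."""
    means = {}
    for label in set(ys):
        values = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(values) / len(values)
    return means

def predict(means, x):
    """Predict the class whose stored mean is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))

def k_fold_accuracy(xs, ys, k=10, seed=0):
    """Hold out each fold once, train on the rest, pool the correct predictions."""
    folds = k_fold_indices(len(xs), k, seed)
    correct = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        means = fit_class_means([xs[i] for i in train], [ys[i] for i in train])
        correct += sum(predict(means, xs[i]) == ys[i] for i in held_out)
    return correct / len(xs)

# Synthetic data (hypothetical): 100 instances from two overlapping classes
rng = random.Random(42)
xs = [rng.gauss(0.0, 1.0) for _ in range(50)] + [rng.gauss(1.5, 1.0) for _ in range(50)]
ys = [0] * 50 + [1] * 50

folds = k_fold_indices(len(xs), 10)
accuracy = k_fold_accuracy(xs, ys, k=10)
print(f"Pooled 10-fold accuracy: {accuracy:.0%}")
```

Every instance is used for testing exactly once and for training K-1 times, which is why the pooled accuracy is less sensitive to one particular split than a single train/test partition.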
A common value for K is 10; smaller values are also used when learning is time-consuming.

Model Selection
Other possible solution: Stratified Sampling
Ensures that class proportions are maintained in each selected set.
How it is done:
1. Stratify instances by class
2. Randomly select instances from each class proportionally

Model Evaluation
How to understand the types of mistakes your model is making?
Construct a Confusion Matrix
[Figure: confusion matrix for Emotion Recognition from Video; axes: Actual Class vs. Predicted Class]

Model Evaluation - Confusion Matrix
Is the simple accuracy calculation sufficient? Not always…
$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$
Accuracy may not be a useful measure in cases where:
- There is a large class skew. Is 98% accuracy good if 97% of the instances are negative?
- There are differential misclassification costs; say, getting a positive wrong costs more than getting a negative wrong. Consider a medical domain in which a false positive results in an extraneous test but a false negative results in a failure to treat a disease.
- We are most interested in a subset of high-confidence predictions

Model Evaluation - Confusion Matrix
Other accuracy metrics: TP-rate & FP-rate
ROC Curves: the "Receiver Operating Characteristic" curve plots the TP-rate (sensitivity / probability of detection) vs. the FP-rate (probability of false alarm) at various threshold settings.

Model Evaluation - Diagnosing Bias vs. Variance
Recall over/under-fitting hypotheses
[Figure: three Price vs. Area plots: High Bias (Underfit); "Just Right" (d=2); High Variance (Overfit)]
Polynomial degrees of the underfit and overfit hypotheses: d=1 (high bias) and d=4 (high variance).
Training Error:
$J_{train}(\theta) = \frac{1}{2 m_{train}} \sum_{i=1}^{m_{train}} \left( h_\theta(x_{train}^{(i)}) - y_{train}^{(i)} \right)^2$
Cross Validation Error:
$J_{cv}(\theta) = \frac{1}{2 m_{cv}} \sum_{i=1}^{m_{cv}} \left( h_\theta(x_{cv}^{(i)}) - y_{cv}^{(i)} \right)^2$

Model Evaluation - Diagnosing Bias vs. Variance
How to diagnose if your learning algorithm is suffering from a bias/variance problem?
Bias (Underfit):
- $J_{train}(\theta)$ will be high
- $J_{cv}(\theta) \approx J_{train}(\theta)$
Variance (Overfit):
- $J_{train}(\theta)$ will be low
- $J_{cv}(\theta) \gg J_{train}(\theta)$
[Figure: cost function $J(\theta)$ vs. degree of the polynomial (0 to 6), with regions labeled Under Fitting, Best Hypothesis, and Over Fitting]
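The diagnosis above can be tried on the ten-row Size/Price dataset from the Model Selection slides. The sketch below is only an illustration under several assumptions: the helper names (`solve_linear_system`, `fit_polynomial`, `cost`) are hypothetical, the polynomial is fitted with the normal equations rather than gradient descent, the sizes are rescaled by a hand-picked shift and divisor to keep the powers well behaved, and the 60/20/20 split simply follows the row order of the slide instead of being random. With only two cross-validation rows the numbers are noisy, so the chosen degree should not be read as a real result.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_polynomial(xs, ys, degree):
    """Least-squares fit of theta_0..theta_d via the normal equations."""
    n = degree + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    return solve_linear_system(A, b)

def cost(theta, xs, ys):
    """J(theta) = 1/(2m) * sum over the set of (h_theta(x) - y)^2."""
    m = len(xs)
    return sum((sum(t * x ** p for p, t in enumerate(theta)) - y) ** 2
               for x, y in zip(xs, ys)) / (2 * m)

sizes = [2104, 1600, 2400, 1416, 3000, 1985, 1534, 1427, 1380, 1494]
prices = [400, 330, 369, 232, 540, 300, 315, 199, 212, 243]
scaled = [(s - 2000) / 1000 for s in sizes]  # assumed feature scaling

# 60% / 20% / 20% split in slide order (a simplification; normally shuffled)
x_train, y_train = scaled[:6], prices[:6]
x_cv, y_cv = scaled[6:8], prices[6:8]
x_test, y_test = scaled[8:], prices[8:]

results = {}
for d in range(1, 5):
    theta = fit_polynomial(x_train, y_train, d)
    results[d] = (cost(theta, x_train, y_train), cost(theta, x_cv, y_cv))
    print(f"d={d}: J_train={results[d][0]:.1f}, J_cv={results[d][1]:.1f}")

# Pick the degree with the lowest cross-validation error, then report J_test
best_degree = min(results, key=lambda d: results[d][1])
j_test = cost(fit_polynomial(x_train, y_train, best_degree), x_test, y_test)
print(f"Chosen degree: {best_degree}, J_test={j_test:.1f}")
```

Reading the printed table follows the rules above: a degree where both errors are high and close together points to bias, while a low training error with a much larger cross-validation error points to variance. The training error can only stay equal or decrease as the degree grows, which is why it cannot be used on its own to choose d.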
