AI for Software Engineers Lecture 2
Dr. Hager Hussein
Summary
This document is a lecture on artificial intelligence (AI) and machine learning (ML). It discusses the kinds of predictions an intelligent system can make: classifications, probability estimates, regressions, and rankings. It also covers how machine learning algorithms work, including feature engineering, model evaluation, and overfitting/underfitting, and introduces neural networks and deep learning.
Full Transcript
AI for Software Engineers Lecture 2
Dr. Hager Hussein

Things Intelligence Can Predict
⚫ Classifications of the context into a small set of possibilities or outcomes.
⚫ Estimations of probabilities about the context or future outcomes.
⚫ Regressions that predict numbers from the context.
⚫ Rankings which indicate which entities are most relevant to the context.
⚫ Hybrids and combinations of these.

Classification
⚫ A classification is a statement from a small set of possibilities. It could be a statement about the context directly, or it could be a prediction of an outcome that will occur based on the context.
⚫ Classifications are problematic when:
  ⚫ There are many possible choices: when you have hundreds or thousands of possibilities. In these situations you might need to break up the problem into multiple sub-problems or change the question the intelligence is trying to answer.
  ⚫ You need to know how certain the prediction is: for example, if you want to take an action only when the intelligence is really certain. In this case, consider probability estimates instead of classifications.

Classification Algorithms
⚫ Decision Trees
⚫ Logistic Regression
⚫ Naïve Bayes
⚫ K-Nearest Neighbors
⚫ Support Vector Machines

Probability Estimates
⚫ Probability estimations predict the probability that the context is of a certain type or that there will be a particular outcome.
⚫ Probability estimations are problematic when:
  ⚫ There are many possible outcomes: as with classifications, probabilities don’t work well in this case.
  ⚫ You need to react to small changes: slight changes in the context can cause probabilities to jitter.

Regressions 1/2
⚫ Regressions are numerical estimates about a context, for example:
  ⚫ The picture contains 6 cows.
  ⚫ The manufacturing process will have 11 errors this week.
  ⚫ The house will sell for 743 dollars per square foot.
⚫ Regressions allow you to have more detail in the answers you get from your intelligence. For example, consider an intelligence for an auto-pilot for a boat.
⚫ A classification might say, “The correct direction is right.”
⚫ A probability might say, “The probability you should turn right is 75%.”
⚫ A regression might say, “You need to turn 130 degrees right.”

Regressions 2/2
⚫ Regressions are problematic when:
  ⚫ You need to react to small changes: slight changes in the context can cause regressions to jitter.
  ⚫ You need to get training data from users: it is much easier to know “in this context, the user turned right” than to know “in this context the user is going to turn 114 degrees right.”
⚫ Classifications can be used to simulate regressions. For example, you could try to predict classifications with the following possibilities:
  ⚫ “Turn 0 - 10 degrees right.”
  ⚫ “Turn 11 - 45 degrees right.”
  ⚫ “Turn 46 - 90 degrees right.”
  ⚫ And so on.

Rankings
⚫ Rankings are used to find the items most relevant to the current context:
  ⚫ Which songs will the user want to listen to next?
  ⚫ Which web pages are most relevant to the current one?
  ⚫ Which pictures will the user want to include in the digital scrap-book they are making?

Hybrids and Combinations
⚫ Most intelligences produce classifications, probability estimations, regressions, or rankings. But combinations and composite answers are possible.
⚫ For example, you might need to know where the face is in an image. You could have one regression that predicts the X location of the face and another that predicts the Y location, but these outputs are highly correlated: the right Y answer depends on which X you select, and vice versa. It might be better to have a single regression with two simultaneous outputs, the X location of the face and the Y location.

Machine Learning
⚫ A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
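
The suggestion on the Regressions 2/2 slide, that classifications can be used to simulate regressions, can be sketched in a few lines of Python. This is only an illustration: the names `angle_to_class` and `class_to_angle` are hypothetical, the fourth bucket (91 to 180 degrees) is an assumed stand-in for the slide’s “and so on”, and returning the bucket midpoint as the numeric answer is one possible choice among several.

```python
from bisect import bisect_left

# Upper bound (in degrees) of each bucket. The first three come from the slide;
# the 180-degree bound is an ASSUMPTION standing in for "and so on".
UPPER_BOUNDS = [10, 45, 90, 180]
LABELS = [
    "Turn 0 - 10 degrees right",
    "Turn 11 - 45 degrees right",
    "Turn 46 - 90 degrees right",
    "Turn 91 - 180 degrees right",  # assumed bucket, not from the lecture
]


def angle_to_class(angle: float) -> int:
    """Map a turn angle (a regression target) to the index of its bucket.

    Fractional angles between two buckets (e.g. 10.5) go to the higher bucket.
    """
    if not 0 <= angle <= UPPER_BOUNDS[-1]:
        raise ValueError("angle outside the supported range")
    return bisect_left(UPPER_BOUNDS, angle)


def class_to_angle(index: int) -> float:
    """Turn a predicted class back into a number: the midpoint of its bucket."""
    low = 0 if index == 0 else UPPER_BOUNDS[index - 1] + 1
    return (low + UPPER_BOUNDS[index]) / 2


if __name__ == "__main__":
    observed = 114  # "the user is going to turn 114 degrees right"
    label_index = angle_to_class(observed)
    print(LABELS[label_index], "->", class_to_angle(label_index))
```

A classifier would be trained on the bucket indices instead of the raw angles. Wider buckets are easier to learn and to label, but they give coarser numeric answers.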

How Machine Learning Works
⚫ A machine learning algorithm is essentially a search procedure that looks for accurate models, using training data to evaluate the accuracy. Generally, machine learning algorithms do the following:
  ⚫ Start with a simple model.
  ⚫ Try slightly refined versions of the model (usually informed by the training data).
  ⚫ Check to see if the refined versions are better (using the training data).
  ⚫ Iterate (roughly) until the search procedure can’t find better models.

Example 1/3
⚫ For example, recall that a decision tree represents intelligence with a tree. Each node contains an if condition, with one child when the if-test is true and one child when the if-test is false. Here is a sample decision tree for predicting how much money a movie will make:
[Figure: sample decision tree]

Example 2/3
⚫ Machine learning for decision trees produces increasingly complex trees by adding if-tests until no further additions improve the model (or until the model reaches some complexity threshold). For example, a refinement of the sample decision tree for movie-success prediction might be this:
[Figure: refined decision tree]

Example 3/3
⚫ This model is a bit more complex, and possibly a bit more accurate. A human could carry out the same process by hand, but machine learning algorithms automate the process and can consider millions of contexts and produce hundreds of thousands of small refinements in the time it would take a human to type “hello world.”

Important Factors to Consider
Factors an intelligence creator must control when using machine learning include these:
⚫ Feature engineering: How the context is converted into features that the machine learning algorithm can add to its model. The features you select should be relevant to the target concept, and they should contain enough information to make good predictions.
⚫ Model structure complexity: How big the model becomes. For example, the number of tests in the decision tree or the number of features in the linear model.
⚫ Model search complexity: How many things the machine learning algorithm tries in its search. This is separate from (but related to) structure complexity. The more things the search tries, the more chance it has to find something that looks good by chance but doesn’t generalize well.
⚫ Data size: How much training data you have. The more good, diverse data available to guide the machine learning search, the more complex and accurate your models can become.

Running Example: House Price Analysis
⚫ Given data about a house and its neighborhood, what is the likely sales price for this house?
f(size, rooms, tax, neighborhood, ...) → price

Training Data for House Price Analysis
⚫ Collect data from past sales

Learning with Decision Trees
⚫ We are using decision trees as an example of a simple and easy-to-understand learning algorithm.

Decision Trees

Building Decision Trees
⚫ Identify all possible decisions
⚫ Select the decision that best splits the dataset into distinct outcomes (typically via entropy or a similar measure)
⚫ Repeatedly split the resulting subsets further, until a stopping criterion is reached

Example
⚫ In this example “Energy” will be our root node and we’ll do the same for sub-nodes. Here we can see that when the energy is “high” the entropy is low, and hence we can say a person will definitely go to the gym if they have high energy. But what if the energy is low? We will again split the node based on a new feature, which is “Motivation”.

Overfitting with Decision Trees
⚫ The tree perfectly fits the data, except when there is not enough data to distinguish outcomes.
⚫ Not obvious that this tree will generalize well.
⚫ In decision trees, overfitting occurs when the tree is built so as to perfectly fit all samples in the training data set. It then ends up with branches that encode strict rules based on sparse data, which affects the accuracy when predicting samples that are not part of the training set.
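
The split-selection step from the Building Decision Trees slide (“select the decision that best splits the dataset”, typically via entropy) can be sketched as follows. The seven-row gym table is made up for illustration, since the lecture’s actual data is not part of this transcript, and the helper names `entropy`, `information_gain`, and `best_split` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature, target):
    """Entropy reduction obtained by splitting `rows` on one feature."""
    before = entropy([r[target] for r in rows])
    after = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after


def best_split(rows, features, target):
    """Pick the feature whose split yields the largest information gain."""
    return max(features, key=lambda f: information_gain(rows, f, target))


# ASSUMED toy data: will a person go to the gym?
ROWS = [
    {"energy": "high", "motivation": "high", "gym": "yes"},
    {"energy": "high", "motivation": "low", "gym": "yes"},
    {"energy": "high", "motivation": "low", "gym": "yes"},
    {"energy": "high", "motivation": "low", "gym": "yes"},
    {"energy": "low", "motivation": "high", "gym": "yes"},
    {"energy": "low", "motivation": "low", "gym": "no"},
    {"energy": "low", "motivation": "low", "gym": "no"},
]

if __name__ == "__main__":
    features = ["energy", "motivation"]
    root = best_split(ROWS, features, "gym")
    print("root split:", root)
    low_energy = [r for r in ROWS if r["energy"] == "low"]
    print("next split for low energy:", best_split(low_energy, features, "gym"))
```

On this toy data, splitting on energy gives the larger information gain, so it becomes the root. Within the low-energy rows, motivation separates the outcomes completely, which mirrors the Energy-then-Motivation tree described in the example. Repeating the split until every leaf is pure is exactly how such a tree can end up overfitting sparse data.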

Identify overfitting
⚫ If you start overfitting you’ll need:
  ⚫ Features that better match the problem.
  ⚫ A model structure that better matches the problem.
  ⚫ Less search to produce models.
  ⚫ Or more data!
⚫ More data is the best way to avoid overfitting, allowing you to create more complex and accurate models.

Underfitting with Decision Trees
⚫ If the model can only learn a single decision, it picks the best fit, but does not have enough freedom to make good predictions.
⚫ When a model fails to capture important distinctions and patterns in the data, so that it performs poorly even on the training data, that is called underfitting.

Overfitting/Underfitting
⚫ Overfitting: the model is fitted exactly to the input data but does not generalize to unseen data (e.g., exact memorization).
⚫ Underfitting: the model makes very general observations but fits the data poorly (e.g., brightness in picture).
⚫ Typically, adjust the degrees of freedom during model learning to balance between overfitting and underfitting: with more freedom (more complex models) the model can better learn the training data, but with too much freedom it will memorize details of the training data rather than generalizing.

On Terminology
⚫ The decisions in a model are called model parameters (constants in the resulting function, weights, coefficients); their values are usually learned from the data.
⚫ The parameters to the learning algorithm that are not the data are called model hyperparameters.
⚫ Degrees of freedom ~ number of model parameters

Improvements
⚫ Averaging across multiple trees to avoid overfitting
⚫ Building different trees on different subsets of the training data or basing decisions on different subsets of features
⚫ Different decision selection criteria and heuristics: Gini impurity, information gain, statistical tests, etc.
⚫ Better handling of numeric data
⚫ Extensions for graphs

NO SPECIFICATIONS
⚫ No specification given for f(outlook, temperature, humidity, windy).
⚫ Learning from data!
⚫ We do not expect perfect predictions; no possible model could always predict all training data correctly.
⚫ We are looking for models that generalize well.

Machine Learning Pipeline

Pipeline Steps 1/2
⚫ Data collection: identify training data, often from many sources.
⚫ Data cleaning: remove wrong data and outliers, merge data from multiple sources.
⚫ Data labeling: identify labels (Y) on training data.
⚫ Feature engineering: convert raw data into a form suitable for learning by identifying features, encoding, and normalizing.

Pipeline Steps 2/2
⚫ Model training: build the model, tune hyperparameters.
⚫ Model evaluation: determine fitness for purpose.
⚫ Data science education focuses on feature engineering and model training.
⚫ Data science practitioners spend substantial time collecting and cleaning data.
⚫ Requirements, deployment, and monitoring are rarely a focus in data science education.

Normalizing
⚫ Sometimes numerical features have very different ranges. For example, age (which can be between 0 and about a hundred) and income (which can be between 0 and about a hundred million). Normalization is the process of changing numerical features so they are more comparable. Instead of saying a person is 45 with $75,000 income, you would say a person is 5% above the average age with an income 70% above the average.

Exposing Hidden Information
⚫ Some features aren’t useful in isolation. They need to be combined with other features to be helpful. Or (more commonly) some features are useful in isolation but become much more useful when combined with other features. For example, if you are trying to build a model for the shipping cost of boxes you might have a feature for the height, the width, and the depth of the box. These are all useful. But an even more useful feature would be the total volume of the box. Some machine learning algorithms are able to discover these types of relationships on their own. Some aren’t.
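
The two feature-engineering ideas above, normalizing and exposing hidden information, can be illustrated with a small Python sketch. It follows the slide’s description of normalization as “percent above the average”; other schemes, such as scaling to a fixed range, are also common. The function names and sample values are made up for illustration.

```python
from statistics import mean


def relative_to_mean(values):
    """Express each value as a fraction above (+) or below (-) the column mean."""
    avg = mean(values)
    return [(v - avg) / avg for v in values]


def add_volume(box):
    """Return a copy of a box record with a derived 'volume' feature."""
    return {**box, "volume": box["height"] * box["width"] * box["depth"]}


if __name__ == "__main__":
    ages = [30, 40, 50]                # hypothetical sample
    incomes = [20_000, 40_000, 60_000]  # hypothetical sample
    # After normalization both columns are on a comparable scale.
    print(relative_to_mean(ages))
    print(relative_to_mean(incomes))
    print(add_volume({"height": 2, "width": 3, "depth": 4}))
```

Adding `volume` by hand matters most for learners that cannot form products of features themselves; a learner that can discover such interactions may not need it.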

Eliminating Misleading Data
⚫ Another approach is to delete features from your data.
⚫ But you may have created some really poor features. These can add complexity to the learning process without adding any value. For example, using a person’s eye color to predict their age. Sure, every person has an eye color. Sure, it is in the context. But is it relevant to predicting how old the person is? Not really.

Modeling
⚫ Modeling is the process of using machine learning algorithms to search for effective models.
⚫ There are many ways an intelligence creator can assist this process:
  ⚫ Deciding which features to use.
  ⚫ Deciding which machine learning algorithms and model representations to use.
  ⚫ Deciding what data to use as input for training.
  ⚫ Controlling the model creation process.
⚫ But in all of these, the goal is to get the model that generalizes the best, has the best mistake profile, and will create the most value for your customers.

⚫ Artificial intelligence: computers acting humanly / thinking humanly / thinking rationally / acting rationally
⚫ Machine learning: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
⚫ Deep learning: specific learning technique based on neural networks.

Artificial Intelligence
⚫ Acting humanly: Turing test approach, requires natural language processing, knowledge representation, automated reasoning, machine learning, maybe vision and robotics.
⚫ Thinking humanly: mirroring human thinking, cognitive science.
⚫ Acting rationally: rational agents interacting with the environment.
⚫ Thinking rationally: laws of thought, logic, patterns and structures.

Learning Paradigms
⚫ Supervised learning: labeled training data provided.
⚫ Unsupervised learning: training data without labels.
⚫ Reinforcement learning: agents learning from interacting with an environment.

Artificial Neural Networks (ANN)
⚫ Simulating biological neural networks of neurons (nodes) and synapses (connections), popularized in the 1960s and 1970s.
⚫ Basic building blocks: artificial neurons, with n inputs and one output; the output is activated if at least m inputs are active.

Single Layer Feedforward Neural Networks

Multi Layer Feedforward Neural Networks

Example 1 1/2
⚫ The input has 3 neurons X1, X2 and X3, and a single output Y.
⚫ The weights associated with the inputs are: {0.2, 0.1, -0.3}
⚫ Inputs = {0.3, 0.5, 0.6}
⚫ Net input = x1*w1 + x2*w2 + x3*w3
⚫ Net input = (0.3*0.2) + (0.5*0.1) + (0.6*-0.3)
⚫ Net input = -0.07

Example 1 2/2
⚫ X is -0.07

Example 2 1/3
⚫ Suppose we input the values 10, 30, 20 into the three input units, from top to bottom. Then the weighted sum coming into H1 will be:
⚫ S_H1 = (0.2 * 10) + (-0.1 * 30) + (0.4 * 20) = 2 - 3 + 8 = 7
⚫ Then the σ function is applied to S_H1 to give: σ(S_H1) = 1/(1 + e^(-7)) = 1/(1 + 0.000912) = 0.999

Example 2 2/3
⚫ Similarly, the weighted sum coming into H2 will be: S_H2 = (0.7 * 10) + (-1.2 * 30) + (1.2 * 20) = 7 - 36 + 24 = -5
⚫ and σ applied to S_H2 gives: σ(S_H2) = 1/(1 + e^(5)) = 1/(1 + 148.4) = 0.0067
⚫ From this, we can see that H1 has fired, but H2 has not. We can now calculate that the weighted sum going into output unit O1 will be: S_O1 = (1.1 * 0.999) + (0.1 * 0.0067) = 1.0996

Example 2 3/3
⚫ and the weighted sum going into output unit O2 will be: S_O2 = (3.1 * 0.999) + (1.17 * 0.0067) = 3.1047
⚫ The output sigmoid unit in O1 will now calculate the output value from the network for O1: σ(S_O1) = 1/(1 + e^(-1.0996)) = 1/(1 + 0.333) = 0.750
⚫ and the output from the network for O2: σ(S_O2) = 1/(1 + e^(-3.1047)) = 1/(1 + 0.045) = 0.957
⚫ Therefore, if this network represented the learned rules for a categorization problem, the input triple (10, 30, 20) would be categorized into the category associated with O2, because this has the larger output.
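
The arithmetic in Example 2 can be checked with a short forward-pass sketch in Python. The weights and inputs are the ones quoted in the example. The network figure is not part of this transcript, so the layout (three inputs, two hidden units H1 and H2, two outputs O1 and O2, sigmoid units throughout) is inferred from the calculations, and the function names are hypothetical.

```python
from math import exp


def sigmoid(s):
    """Logistic activation: 1 / (1 + e^(-s))."""
    return 1.0 / (1.0 + exp(-s))


def layer(inputs, weight_rows):
    """Apply one fully connected sigmoid layer; one weight row per unit."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weight_rows]


# Weights taken from the worked example (layout inferred, see lead-in).
HIDDEN_WEIGHTS = [
    [0.2, -0.1, 0.4],   # into H1
    [0.7, -1.2, 1.2],   # into H2
]
OUTPUT_WEIGHTS = [
    [1.1, 0.1],         # into O1
    [3.1, 1.17],        # into O2
]


def forward(inputs):
    """Return (hidden activations, output activations) for one input vector."""
    hidden = layer(inputs, HIDDEN_WEIGHTS)
    outputs = layer(hidden, OUTPUT_WEIGHTS)
    return hidden, outputs


if __name__ == "__main__":
    hidden, outputs = forward([10, 30, 20])
    winner = max(range(len(outputs)), key=lambda i: outputs[i])
    print("hidden:", [round(h, 4) for h in hidden])
    print("outputs:", [round(o, 3) for o in outputs])
    print("category: O%d" % (winner + 1))
```

Running this gives outputs of roughly 0.750 for O1 and 0.957 for O2, in line with the hand calculation, so the input (10, 30, 20) is assigned to the category for O2.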

Deep Learning
⚫ More layers
⚫ Layers with different numbers of neurons
⚫ Different kinds of connections:
  ⚫ Fully connected (feed forward)
  ⚫ Not fully connected (e.g., convolutional networks)
  ⚫ Keeping state (e.g., recurrent neural networks)
  ⚫ Skipping layers
  ⚫ ...

Deep Learning cont.
⚫ Can approximate arbitrary functions
⚫ Able to handle many input values (e.g., millions of pixels)
⚫ Internal layers may automatically recognize higher-level structures
⚫ Often used without explicit feature engineering
⚫ Often a huge number of parameters, expensive inference and training
⚫ Often large training sets needed
⚫ Too large and complex to understand what is learned, why, or how decisions are made

On Terminology
⚫ Deep learning: neural networks with many internal layers
⚫ Deep Neural Network (DNN) architecture: network structure, how many layers, what connections, which ϕ (hyperparameters)
⚫ Model parameters: weights associated with each input in each neuron

Thank You