AI Int 2 Merged.pdf
TY.BSC.CS SEM V ARTIFICIAL INTELLIGENCE

Introduction to AI

Topics covered: What is Intelligence? Components of Intelligence, What is AI?, Objective of AI, Applications of AI.

What is intelligence? Intelligence is a property of mind that encompasses many related mental abilities, such as the capabilities to reason, plan, solve problems, think abstractly, comprehend ideas and language, and learn.

What is Intelligence Composed of? Intelligence is intangible. It is composed of:
- Reasoning
- Learning
- Problem Solving
- Natural Language Processing (NLP)
- Perception

Reasoning - The set of processes that enables us to provide a basis for judgement, decision making, and prediction.
Learning - The study of computer algorithms that improve automatically through experience. This concept is known as Machine Learning.
Problem Solving - The process in which one perceives a present situation and tries to arrive at a desired solution by taking some path, which is blocked by known or unknown hurdles.
Natural Language Processing - Enables a machine to read and understand human language by processing human language into machine language.
Perception - The ability of a machine to use input from sensors, microphones, wireless signals, etc. to understand different aspects of the world.

WHAT IS AI?

Definition of AI: According to John McCarthy, who is known as the father of AI, "AI is the science and engineering of making intelligent machines, especially computer programs."

Artificial Intelligence is a branch of science which deals with helping machines find solutions to complex problems in a more human-like fashion. This generally involves borrowing characteristics from human intelligence and applying them as algorithms in a computer-friendly way. A more or less flexible or efficient approach can be taken depending on the requirements established, which influences how artificial the intelligent behavior appears.

AI is generally associated with Computer Science, but it has many important links with other fields such as Maths, Psychology, Cognition, Biology and Philosophy, among many others. Our ability to combine knowledge from all these fields will ultimately benefit our progress in the quest of creating an intelligent artificial being.

AI is a branch of computer science concerned with the study and creation of computer systems that exhibit some form of intelligence, or those characteristics which we associate with intelligence in human behavior.

THE STATE OF THE ART

The highest level of development, as of a device, technique, or scientific field, achieved at a particular time. Examples:
- A program that plays chess.
- A program that automatically books tickets online via voice and suggests the best booking rates by day.
- A self-driving car that lets the driver relax and drives on the highway without the driver touching the steering wheel.
- A full diagnostic system that checks symptoms and gives a proper solution according to test cases.

Applications of AI

There are many applications of AI; some of them are the following: Gaming, Natural Language Processing, Social Media Feeds, Speech Recognition, Ecommerce.

1. Gaming - AI plays a crucial role in strategic games such as chess, poker, tic-tac-toe, etc., where the machine can think of a large number of possible positions based on heuristic knowledge.
2. Natural Language Processing - It is possible to interact with a computer that understands natural language spoken by humans, e.g. search results and predictive text.
3. Social Media - If you are using social media, most of your decisions are being impacted by artificial intelligence. From the feeds that you see in your timeline to the notifications that you receive from these apps, everything is curated by AI.
4. Speech Recognition - Some intelligent systems are capable of hearing and comprehending language in terms of sentences and their meanings while a human talks to them. They can handle different accents, slang words, noise in the background, changes in a human's voice due to a cold, etc.
5. Ecommerce - Artificial intelligence is perfectly suited to recommending products to ecommerce customers.

Chap 2: Intelligent Agents

Agents and Environment

Artificial intelligence is defined as the study of rational agents. A rational agent could be anything which makes decisions: a person, firm, machine, or software. It carries out the action with the best outcome after considering past and current percepts (the agent's perceptual inputs at a given instance). An AI system is composed of an agent and its environment. The agents act in their environment. The environment may contain other agents.

Agent = Architecture + Program

An agent is anything that can be viewed as:
1. perceiving its environment through sensors,
2. using those observations to take decisions, and
3. acting upon that environment through actuators.
Any action taken by an AI agent must be a rational action.

Sensors: A sensor is a device which detects a change in the environment and sends the information to other electronic devices. An agent observes its environment through sensors.
Actuators: Actuators are the components of machines that convert energy into motion. Actuators are responsible for moving and controlling a system. An actuator can be an electric motor, gears, rails, etc.
Effectors: Effectors are the devices which affect the environment. Effectors can be legs, wheels, arms, fingers, wings, fins, and a display screen.

Examples of agents:
1. A software agent has keystrokes, file contents, and received network packets as sensors, and displays on the screen, files, and sent network packets as actuators.
2. A human agent has eyes, ears, and other organs as sensors, and hands, legs, mouth, and other body parts as actuators.
3. A robotic agent has cameras and infrared range finders as sensors, and various motors as actuators.

The Nature of Environments

Well-behaved agents

Rational Agent: "For each possible percept sequence, a rational agent should select an action that is expected to maximize its performance measure, given the evidence provided by the percept sequence and whatever built-in knowledge the agent has."

An ideal rational agent is one which is capable of doing the expected actions to maximize its performance measure, on the basis of:
1. its percept sequence, and
2. its built-in knowledge base.

Rationality of an agent depends on the following:
1. The performance measures, which determine the degree of success.
2. The agent's percept sequence so far.
3. The agent's prior knowledge about the environment.
4. The actions that the agent can carry out.

The problem the agent solves is characterized by Performance Measure, Environment, Actuators, and Sensors (PEAS), or alternatively PAGE - Percept, Action, Goal, Environment (as given in the reference book).

PEAS

When we define a rational agent, we group these properties under PEAS, the problem specification for the task environment. The rational agent we want to design for this task environment is the solution.
PEAS stands for: Performance, Environment, Actuators, Sensors.

What is PEAS for a self-driving car?
Performance: safety, time, legal drive, comfort.
Environment: roads, other cars, road signs.
Actuators: steering, accelerator, brake, signal, horn.
Sensors: camera, GPS, speedometer, odometer, accelerometer, engine sensors, keyboard.

How about a vacuum cleaner?

Features of the Environment (Task Environment)
1) Fully observable (accessible), e.g. chess, vs. partially observable (inaccessible), e.g. self-driving cars, poker.
2) Deterministic (strategic), e.g. chess, vs. non-deterministic (stochastic), e.g. self-driving cars.
3) Competitive, e.g. chess, vs. collaborative, e.g. self-driving cars.
4) Dynamic, e.g. self-driving cars, vs. static, e.g. chess without a clock, vs. semi-dynamic, e.g. chess with a clock.
5) Discrete (finite), e.g. chess, vs. continuous, e.g. self-driving cars.
6) Episodic (current percept only), e.g. a part-picking robot, vs. sequential, e.g. chess.
7) Single-agent vs. multi-agent.

Environment Examples

Environment               | Observable | Deterministic | Episodic   | Static  | Discrete   | Agents
Chess with a clock        | Fully      | Strategic     | Sequential | Semi    | Discrete   | Multi
Chess without a clock     | Fully      | Strategic     | Sequential | Static  | Discrete   | Multi
Poker                     | Partial    | Stochastic    | Sequential | Static  | Discrete   | Multi
Backgammon                | Fully      | Stochastic    | Sequential | Static  | Discrete   | Multi
Taxi driving              | Partial    | Stochastic    | Sequential | Dynamic | Continuous | Multi
Medical diagnosis         | Partial    | Stochastic    | Episodic   | Static  | Continuous | Single
Image analysis            | Fully      | Deterministic | Episodic   | Semi    | Discrete   | Single
Robot part picking        | Fully      | Deterministic | Episodic   | Semi    | Discrete   | Single
Interactive English tutor | Partial    | Stochastic    | Sequential | Dynamic | Discrete   | Multi

The dimensions compared are: fully observable vs. partially observable, deterministic vs. stochastic, episodic vs. sequential, static vs. dynamic, discrete vs. continuous, and single-agent vs. multi-agent.

Structure of Intelligent Agents

Types of Agents

Agents can be grouped into five classes based on their degree of perceived intelligence and capability:
1. Simple Reflex Agents, e.g. a vacuum cleaner
2. Model-Based Reflex Agents, e.g. Waymo
3. Goal-Based Agents
4. Utility-Based Agents
5. Learning Agents
1. Simple Reflex Agents (reflex = immediate action)

Simple reflex agents ignore the rest of the percept history and act only on the basis of the current percept. Percept history is the history of all that an agent has perceived to date. The agent function is based on the condition-action rule. A condition-action rule is a rule that maps a state, i.e. a condition, to an action. If the condition is true, then the action is taken, else not. This agent function only succeeds when the environment is fully observable.

Q. Simple reflex agents have _______ type of environment. A) Partially observable B) Fully observable C) Dynamic D) Continuous

Reflex Vacuum Agent (a runnable sketch follows at the end of this section):
If status = Dirty then return Clean
else if location = A then return Right
else if location = B then return Left

2. Model-Based Reflex Agents

A model-based reflex agent works by finding a rule whose condition matches the current situation. A model-based agent can handle partially observable environments by using a model of the world. The agent has to keep track of an internal state which is adjusted by each percept and depends on the percept history. The current state is stored inside the agent, which maintains some kind of structure describing the part of the world which cannot be seen. Updating the state requires information about:
1. how the world evolves independently of the agent, and
2. how the agent's actions affect the world.

3. Goal-Based Agents

These kinds of agents take decisions based on how far they currently are from their goal (a description of desirable situations). Their every action is intended to reduce the distance from the goal. This gives the agent a way to choose among multiple possibilities, selecting the one which reaches a goal state. The knowledge that supports its decisions is represented explicitly and can be modified, which makes these agents more flexible. They usually require search and planning. The goal-based agent's behavior can easily be changed.

4. Utility-Based Agents

The agents which are developed with their end uses as building blocks are called utility-based agents. When there are multiple possible alternatives, utility-based agents are used to decide which one is best. They choose actions based on a preference (utility) for each state. Sometimes achieving the desired goal is not enough: we may look for a quicker, safer, cheaper trip to reach a destination. Agent happiness should be taken into consideration; utility describes how "happy" the agent is. Because of the uncertainty in the world, a utility agent chooses the action that maximizes the expected utility. A utility function maps a state onto a real number which describes the associated degree of happiness.

5. Learning Agents

A learning agent in AI is the type of agent which can learn from its past experiences, i.e. it has learning capabilities. It starts to act with basic knowledge and is then able to act and adapt automatically through learning. A learning agent has four main conceptual components:
1. Learning element: responsible for making improvements by learning from the environment.
2. Critic: the learning element takes feedback from the critic, which describes how well the agent is doing with respect to a fixed performance standard.
3. Performance element: responsible for selecting the best action.
4. Problem generator: responsible for suggesting actions that will lead to new and informative experiences.
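Returning to the reflex vacuum agent above: its condition-action rules can be written as one short function. This is a minimal sketch; the (location, status) percept format is assumed from the two-location vacuum world.

```python
def reflex_vacuum_agent(percept):
    """Condition-action rules for the two-location vacuum world.

    percept is a (location, status) pair, e.g. ('A', 'Dirty').
    Returns the action dictated by the first matching rule.
    """
    location, status = percept
    if status == 'Dirty':
        return 'Clean'
    elif location == 'A':
        return 'Right'
    elif location == 'B':
        return 'Left'

# Example percept sequence and the actions the agent selects:
for percept in [('A', 'Dirty'), ('A', 'Clean'), ('B', 'Dirty'), ('B', 'Clean')]:
    print(percept, '->', reflex_vacuum_agent(percept))
```

Note that the agent consults only the current percept, never the percept history, which is exactly why it needs a fully observable environment.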
Training Dataset: the sample of data used to fit the model. Test set: a subset used to test the trained model. So, we use the training data to fit the model and the testing data to test it.

Example: suppose the training data teaches the model (the program) that 2 + 3 = 5, the actual value. At test time the model is given 3 + 2 and predicts 6; the error is 6 - 5 = 1, a bias error.

In supervised learning we have both X and Y; in unsupervised learning we have X but no Y.

What are X and Y in machine learning? Example training data set:

X1 (study hours) | X2 (attendance) | X3 (class participation) | Y (marks)
6                | 90              | 80                       | 70

The model takes the X values as input and produces a prediction (output) for Y.

X: input data or variable / independent variable / predictor / observations / features / fields / data points.
Y: output data or variable / dependent variable / target / labelled data.
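The train/test workflow above is usually done with scikit-learn's train_test_split, which these notes mention again in the cross-validation section. A minimal sketch; the numbers are invented in the spirit of the study-hours table above.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Hypothetical data: X columns are study hours and attendance, y is marks.
X = [[6, 90], [5, 80], [4, 70], [8, 95], [3, 60], [7, 85]]
y = [70, 65, 55, 85, 45, 75]

# Hold out a third of the rows for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=2)

model = LinearRegression().fit(X_train, y_train)  # fit on training data only
print(model.predict(X_test))                      # predictions on unseen data
print(model.score(X_test, y_test))                # R^2 score on the test set
```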
What is bias? Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the model. It always leads to high error on training and test data.

What is variance? Variance indicates how much the estimate of the target function would change if different training data were used. In other words, variance describes how much a random variable differs from its expected value. Variance is the variability of model prediction for a given data point, or a value which tells us the spread of our data. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data.

Unsupervised Learning

Supervised learning: the system is presented with example inputs and their desired outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs.
Unsupervised learning: no labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).

K-Means Clustering

In some pattern recognition problems, the training data consists of a set of input vectors x without any corresponding target values. The goal in such unsupervised learning problems may be to discover groups of similar examples within the data; this is called clustering.

What is clustering? Clustering can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding a structure in a collection of unlabeled data. A loose definition of clustering could be "the process of organizing objects into groups whose members are similar in some way". A cluster is therefore a collection of objects which are "similar" to each other and "dissimilar" to the objects belonging to other clusters.

K-means is one of the simplest unsupervised learning algorithms that solves the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centres, one for each cluster. These centroids should be placed in a smart way, because different locations cause different results; the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point we need to re-calculate k new centroids as barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has been generated; as a result of this loop, the k centroids change their location step by step until no more changes occur. In other words, the centroids no longer move.

Distance example: take X = (1, 2, 3, 4, 5, 6, 7, 8, 9) with k = 2. The distance between two 2-D points is dist = sqrt((x1 - y1)^2 + (x2 - y2)^2); in one dimension this reduces to dist = sqrt((xi - xc)^2) = |xi - xc|. For instance, the distance from point 5 to a centre at 1 is sqrt((5 - 1)^2) = sqrt(16) = 4, and from 6 to 1 it is sqrt((6 - 1)^2) = sqrt(25) = 5.
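A minimal sketch of this K-means loop in one dimension; the formal algorithm steps follow below. The data and k come from the example above, while the initial centre positions are an arbitrary assumption for illustration.

```python
def kmeans_1d(points, centres, max_iter=100):
    """Plain 1-D K-means: assign each point to its nearest centre,
    then move each centre to the mean of its cluster, until stable."""
    for _ in range(max_iter):
        # Assignment step: nearest centre by absolute distance.
        clusters = [[] for _ in centres]
        for x in points:
            nearest = min(range(len(centres)), key=lambda i: abs(x - centres[i]))
            clusters[nearest].append(x)
        # Update step: recompute each centre as its cluster mean.
        new_centres = [sum(c) / len(c) if c else centres[i]
                       for i, c in enumerate(clusters)]
        if new_centres == centres:   # centroids stopped moving: done
            break
        centres = new_centres
    return centres, clusters

# Data from the example above, k = 2, initial centres chosen arbitrarily.
centres, clusters = kmeans_1d([1, 2, 3, 4, 5, 6, 7, 8, 9], [1.0, 9.0])
print(centres, clusters)   # two groups, around the low and high values
```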
K-Means Clustering Algorithm
1. Let X = {x1, x2, x3, ..., xn} be the set of data points and V = {v1, v2, ..., vc} be the set of centres.
2. Randomly select c cluster centres.
3. Calculate the distance between each data point and each cluster centre.
4. Assign each data point to the cluster centre whose distance from it is the minimum over all cluster centres.
5. Recalculate the new cluster centres by calculating the mean of each cluster.
6. Recalculate the distance between each data point and the newly obtained cluster centres.
7. If no data point was reassigned then stop; otherwise repeat from step 3.

Exercise: X = {2, 6, 1, 3, 7, 4, 8}; assume k = 2.

Hierarchical Clustering

Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters. The algorithm starts with all the data points assigned to a cluster of their own. Then the two nearest clusters are merged into the same cluster. In the end, the algorithm terminates when there is only a single cluster left. It overcomes the problem of having to pre-define the number of clusters. There are two types of hierarchical clustering:
1. Agglomerative
2. Divisive

1. Agglomerative:
1. In the agglomerative or bottom-up clustering method, we assign each observation to its own cluster.
2. Then, compute the similarity (e.g., distance) between each pair of clusters.
3. Then join the two most similar clusters.
4. Finally, repeat steps 2 and 3 until there is only a single cluster left.

Agglomerative hierarchical clustering technique: in this technique, initially each data point is considered an individual cluster. At each iteration, the similar clusters merge with other clusters until one cluster (or K clusters) are formed. The basic agglomerative algorithm is straightforward:
1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. Repeat: merge the two closest clusters and update the proximity matrix, until only a single cluster remains.
4. The key operation is the computation of the proximity of two clusters.

The hierarchical clustering technique can be visualized using a dendrogram. A dendrogram is a tree-like diagram that records the sequences of merges or splits.

2. Divisive: in the divisive or top-down clustering method, we assign all of the observations to a single cluster and then partition the cluster into the two least similar clusters. Finally, we proceed recursively on each cluster until there is one cluster for each observation.

Before any clustering is performed, it is required to determine the proximity matrix containing the distance between each pair of points, using a distance function. Then, the matrix is updated to display the distance between each pair of clusters. The following methods differ in how the distance between two clusters is measured:
1. Single Linkage
2. Complete Linkage
3. Average Linkage
4. Centroid Linkage (in the notes' sketch, the points (1, 2, ..., 8) have centres 3 and 7, and distances are taken from a centre to the data points: 3 to 1, 3 to 2, and so on)

1. Single Linkage: in single-linkage hierarchical clustering, the distance between two clusters is defined as the shortest distance between two points, one in each cluster. For example, the distance between clusters "r" and "s" is the length of the arrow between their two closest points.
2. Complete Linkage: in complete-linkage hierarchical clustering, the distance between two clusters is defined as the longest distance between two points, one in each cluster. For example, the distance between clusters "r" and "s" is the length of the arrow between their two furthest points.
3. Average Linkage: in average-linkage hierarchical clustering, the distance between two clusters is defined as the average distance between each point in one cluster and every point in the other cluster. For example, the distance between clusters "r" and "s" is the average length of the arrows connecting the points of one cluster to the points of the other.

Example of Complete Linkage Clustering

Clustering starts by computing a distance between every pair of units that you want to cluster. A distance matrix will be symmetric (because the distance between x and y is the same as the distance between y and x) and will have zeroes on the diagonal (because every item is at distance zero from itself). The table below is an example of a distance matrix. Only the lower triangle is shown, because the upper triangle can be filled in by reflection.

Distance Matrix

Now let's start clustering. The smallest distance is between 3 and 5, so they get linked up or merged first into the cluster "35". To obtain the new distance matrix, we remove the 3 and 5 entries and replace them by an entry "35". Since we are using complete-linkage clustering, the distance between "35" and every other item is the maximum of the distance between that item and 3 and between that item and 5. For example, d(1,3) = 3 and d(1,5) = 11, so d(1,"35") = 11. This gives us the new distance matrix. The items with the smallest distance get clustered next; this will be 2 and 4. Continuing in this way, everything is clustered.

On the resulting plot, the y-axis shows the distance between the objects at the time they were clustered. This is called the cluster height. Different visualizations use different measures of cluster height. The single-linkage dendrogram for the same distance matrix also starts with cluster "35", but the distance between "35" and each item x is now the minimum of d(x,3) and d(x,5); so d(1,"35") = min(3, 11) = 3.

A small 1-D example:

Id | Marks
1  | 40
2  | 50
3  | 60
4  | 70

1-D Euclidean distance = |x2 - x1|, which gives the distance matrix:

   |  1 |  2 |  3 |  4
 1 |  0 | 10 | 20 | 30
 2 | 10 |  0 | 10 | 20
 3 | 20 | 10 |  0 | 10
 4 | 30 | 20 | 10 |  0

Cross-Validation

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter called k that refers to the number of groups a given data sample is to be split into; as such, the procedure is often called k-fold cross-validation. Cross-validation is a technique in which we train our model using a subset of the data set and then evaluate it using the complementary subset. Data is divided into train and test sets ratio-wise; this can be achieved using the train_test_split function of sklearn (e.g. with random_state = 2). Depending on how well the model generalizes, the result is underfitting, overfitting, or a best fit / well-generalized model.

Methods available for performing cross-validation:

1. Leave-One-Out Cross-Validation (LOOCV)

Key points: low bias but high variance (only one data point for testing), long execution time, and it is the extreme case k = n. LOOCV leaves one data point out. Leave-one-out cross-validation is a special case of cross-validation where the number of folds equals the number of instances in the data set. Thus, the learning algorithm is applied once for each instance, using all other instances as a training set and the selected instance as a single-item test set. It has advantages as well as disadvantages. An advantage of this method is that we make use of all data points, and hence it has low bias. The major drawback is that it leads to higher variation in the testing model, as we are testing against a single data point.
If the data point is an outlier, it can lead to higher variation. Another drawback is that it takes a lot of execution time, as it iterates as many times as there are data points. The extreme case is k = n, also known as leave-one-out cross-validation or LOOCV. (Quiz: In CV, the extreme k = n is known as _____________.)

The general workflow is: 1. Select a model. 2. Train (train_test_split, CV). 3. Test.

Example: a data set of 20 records means 20 partitions, each holding one record. This leads to 20 iterations of training and evaluating. In the 1st iteration, the model is tested on the 1st block (fold) and trained on the remaining 19 folds, giving one accuracy. In the next iteration, another block holding the next record is the test set while the rest are the training set, giving another accuracy. LOOCV thus tests and trains on all 20 records over 20 iterations, hence the expensive computational cost. The average over all iterations is then computed and reported as the final accuracy, aka the k-fold score. (k = number of groups/iterations/rounds; here k = n = 20.)

2. K-Fold Cross-Validation

In this method, we split the data set into k subsets (known as folds), then we train on k - 1 of the subsets and leave one subset out for the evaluation of the trained model. We iterate k times, with a different subset reserved for testing each time. Popular values for k are 5 and 10: enough to give an estimate that is statistically likely to be accurate, at a cost of 5 to 10 times longer computation time. Note: a higher value of k tends towards the LOOCV method.

Example: with 25 instances and k = 5, in the first iteration we use the first 20 percent of the data for evaluation and the remaining 80 percent for training (instances [1-5] testing, [6-25] training), while in the second iteration we use the second subset of 20 percent for evaluation and the remaining four subsets for training ([6-10] testing, [1-5] and [11-25] training), and so on.

When n is not divisible by k, the folds differ slightly in size. For n = 25 and k = 10: n/k = 25/10 = 2.5, so trunc(25/10) = 2 data points per fold, with 25 mod 10 = 5 points left over. In these notes' scheme the leftover points are added to the first iteration: the first fold tests on 2 + 5 = 7 points and the remaining nine folds test on 2 points each, giving 7 + 9 * 2 = 25. (The notes then list, for each iteration, the training-set and testing-set observation indices: iteration 1 testing on the first 7 points, each later iteration on 2 points.)

This runs much faster than leave-one-out cross-validation, because the train/test split is repeated only k times rather than n times, and it is simpler to examine the detailed results of the testing process.

Advantages of cross-validation: a more accurate estimate of out-of-sample accuracy, and more "efficient" use of data, as every observation is used for both training and testing.

Worked example: N = 101 data points, K = 10. N/K = 101/10 = 10.1, so trunc(N/K) = 10 data points are selected for testing in each iteration. R = N mod K = 101 mod 10 = 1 is the number of iterations that take an extra data point, so one fold tests on trunc(N/K) + 1 = 10 + 1 = 11 data points. Fold 1 tests on records 1 to 11 (11 points); folds 2 through 10 test on 10 points each (12-21, 22-31, ..., 92-101), giving 11 + 9 * 10 = 101.
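A minimal sketch of this splitting with scikit-learn's KFold, which distributes the remainder the same way (the first n mod k folds each get one extra test point). The data here is just the 101 indices from the worked example.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(101).reshape(-1, 1)   # 101 data points, as in the example
kf = KFold(n_splits=10)

for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"fold {i}: {len(train_idx)} train / {len(test_idx)} test")
# fold 1 gets 11 test points, folds 2-10 get 10 each: 11 + 9*10 = 101

loo = LeaveOneOut()                 # the extreme case k = n, i.e. LOOCV
print(loo.get_n_splits(X))          # 101 single-item test folds
```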
Another example: N = 76 data points, K = 10 iterations. trunc(N/K) = trunc(76/10) = 7 data points per fold, and R = N mod K = 76 mod 10 = 6, so 6 iterations take trunc(N/K) + 1 = 8 data points for testing and the other 4 take 7 each: 6 * 8 + 4 * 7 = 48 + 28 = 76.

For N = 25 and K = 5: N/K = 25/5 = 5 data points per fold with no remainder, so each of the 5 iterations tests on exactly 5 points (1-5, 6-10, 11-15, 16-20, 21-25).

3. Stratified Cross-Validation

This form of CV ensures that each iteration/fold is a good representative of the whole data set, for example by preserving the class proportions in every fold.

CLASSIFICATION TREE

1. Entropy: entropy in a decision tree stands for homogeneity. If the data is completely homogeneous, the entropy is 0 (pure); if the data is split 50-50%, the entropy is 1 (impure). For a set S with class proportions p_i, Entropy(S) = -sum_i p_i * log2(p_i).
2. Information Gain: information gain is the decrease in entropy when a node is split: Gain = Entropy(parent) - weighted average Entropy(children).

Steps:
1. Compute the entropy for the data set.
2. For every attribute/feature:
   i. calculate the entropy for all categorical values,
   ii. take the average information entropy for the current attribute,
   iii. calculate the gain for the current attribute.
3. Pick the highest-gain attribute.
4. Repeat until we get the tree we desire.

NAÏVE BAYES ALGORITHM

Bayes' rule: P(c|x) = P(x|c) * P(c) / P(x), where:
- P(c|x) is the posterior probability of the class (target) given the predictor (attribute),
- P(c) is the prior probability of the class,
- P(x|c) is the likelihood, the probability of the predictor given the class, and
- P(x) is the prior probability of the predictor.

The posterior probability is calculated as follows (a sketch follows this list):
1. First, construct a frequency table for each attribute against the target.
2. Then, transform the frequency tables into likelihood tables.
3. Finally, use the naive Bayes equation to calculate the posterior probability for each class.
4. The class with the highest posterior probability is the outcome of the prediction.
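A minimal sketch of the frequency-table to likelihood-table to posterior procedure above. The tiny (outlook, play) data set is invented for illustration; any small categorical data set works the same way.

```python
from collections import Counter

# Hypothetical toy data: (outlook, play) pairs.
data = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
        ("rainy", "yes"), ("rainy", "yes"), ("rainy", "no"),
        ("overcast", "yes"), ("sunny", "yes")]

def posterior(x, c, data):
    """P(c|x) = P(x|c) * P(c) / P(x), computed from frequency counts."""
    n = len(data)
    class_count = Counter(cls for _, cls in data)
    # Likelihood P(x|c): frequency of attribute value x within class c.
    x_in_c = sum(1 for feat, cls in data if feat == x and cls == c)
    p_x_given_c = x_in_c / class_count[c]
    p_c = class_count[c] / n                           # prior P(c)
    p_x = sum(1 for feat, _ in data if feat == x) / n  # evidence P(x)
    return p_x_given_c * p_c / p_x

# Prediction: the class with the highest posterior for outlook = sunny.
for c in ("yes", "no"):
    print(c, posterior("sunny", c, data))   # "no" wins here, 2/3 vs 1/3
```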
Search algorithm properties: let b be the maximum branching factor of the tree, d the depth of the least-cost solution, and m the maximum depth of the state space (possibly infinite). "Optimal" describes whether the algorithm is guaranteed to return a best solution.

1. Depth First Search (DFS): Complete: No. Time complexity: O(b^m). Space complexity: O(bm). Optimal: No.
2. Breadth-First Search (BFS): Complete: Yes (assuming b is finite). Time complexity: O(b^d). Space complexity: O(b^d). Optimal: Yes, if all step costs are equal.
3. Uniform Cost Search (UCS): Complete: Yes (if b is finite and every step cost is at least some ε > 0). Time complexity: O(b^(C*/ε)), where C* is the cost of the optimal solution and ε is the minimum arc cost. Space complexity: O(b^(C*/ε)). Optimal: Yes (even when step costs differ).
4. Depth Limited Search (DLS): Complete: Yes, if the solution lies within the depth limit l. Time complexity: O(b^l). Space complexity: O(bl). Optimal: No in general, even when l > d.
5. Iterative Deepening Depth First Search (IDDFS): Complete: Yes (by iteratively increasing the depth limit). Time complexity: O(b^d). Space complexity: O(bd). Optimal: Yes (if step cost is uniform). It is not systematic.
6. Bidirectional Search (BS): Complete: Yes. Time complexity: O(b^(d/2)). Space complexity: O(b^(d/2)). Optimal: Yes (if step cost is uniform in both the forward and backward directions).
7. Greedy Best First Search: Complete: No. Time complexity: O(b^m), where b is the maximum branching factor and m is the maximum depth of the search tree. Space complexity: O(b^m). Optimal: No.
8. A* Search: Complete: Yes. Time complexity: O(b^d) in the worst case. Space complexity: O(b^d), since it keeps all generated nodes in memory. Optimal: Yes (with an admissible heuristic).

UNIT 1, Chap 3: Problem Solving by Searching

PROBLEM-SOLVING AGENTS

Intelligent agents are supposed to act in such a way that the environment goes through a sequence of states that maximizes the performance measure.

What is a problem-solving agent? It is a kind of goal-based agent. There are 4 general steps in problem solving:
1. Goal formulation
2. Problem formulation
3. Search
4. Execute

E.g. driving from Arad to Bucharest: a map is given with different cities connected, and their distance values are also mentioned. The agent starts in one city and must reach another.

(Outline: subclass of goal-based agents; goal formulation; problem formulation; example problems, both toy problems and real-world problems; searching for a solution.)

Goal Formulation

Goal formulation, based on the current situation, is the first step in problem solving. As well as formulating a goal, the agent may wish to decide on some other factors that affect the desirability of different ways of achieving the goal. For now, let us assume that the agent will consider actions at the level of driving from one major town to another. The states it will consider therefore correspond to being in a particular town.

Declaring the goal: goal information is given to the agent, i.e. start from Arad and reach Bucharest.
Problem information: the agent decides its action when it has some added knowledge about the map, i.e. the map of Romania is given to the agent.
Ignoring some actions: the agent has to ignore actions that will not lead it to the desired goal. There are three roads out of Arad: one toward Sibiu, one to Timisoara, and one to Zerind. None of these achieves the goal, so unless the agent is very familiar with the geography of Romania, it will not know which road to follow. In other words, the agent will not know which of its possible actions is best, because it does not know enough about the state that results from taking each action.
Achieving the goal: the agent can use this information to consider subsequent stages of a hypothetical journey through each of the three towns, to try to find a journey that eventually gets to Bucharest. Once it has found a path on the map from Arad to Bucharest, it can achieve its goal by carrying out the driving actions.

Problem Formulation

Problem formulation is the process of deciding what actions and states to consider, given a goal. The process of looking for an action sequence (the series of actions that the agent carries out to reach the goal) is called search. A search algorithm takes a problem as input and returns a solution in the form of an action sequence. Once a solution is found, the actions it recommends can be carried out; this is called the execution phase. Thus we have a simple "formulate, search, execute" design for the agent.

Well-defined problems and solutions. A problem is defined by four items:
1. Initial state: the state the agent starts in, e.g. the initial state for our agent in Romania might be described as In(Arad).
2. Successor function S(x): a description of the possible actions available to the agent. The most common formulation uses a successor function: given a particular state x, SUCCESSOR-FN(x) returns a set of (action, successor) ordered pairs, where each action is one of the legal actions in state x and each successor is a state that can be reached from x by applying the action.
3. Goal test: determines whether a given state is a goal state.
4. Path cost (additive): a function that assigns a numeric cost to each path, e.g. the sum of distances or the number of actions executed. Usually given as c(x, a, y), the step cost of going from x to y by action a, assumed to be ≥ 0.

"A solution is a sequence of actions leading from the initial state to a goal state."

There are two types of searching strategies used in path finding:
1) Uninformed search strategies
2) Informed search strategies

1) Uninformed Search Methods

Uninformed search means that the strategies have no additional information about states beyond that provided in the problem definition. All they can do is generate successors and distinguish a goal state from a non-goal state. Uninformed strategies use only the information available in the problem definition:
a. Breadth-first search
b. Depth-first search
c. Depth-limited search
d. Iterative deepening search
e. Uniform cost search
f. Bidirectional search

Breadth First Search (BFS)

Breadth First Search (BFS) searches breadth-wise in the problem space. Breadth-first search is like traversing a tree where each node is a state which may be a potential candidate for a solution. Breadth-first search expands nodes from the root of the tree and then generates one level of the tree at a time until a solution is found. It is very easily implemented by maintaining a queue of nodes. Initially the queue contains just the root. In each iteration, the node at the head of the queue is removed and then expanded. The generated child nodes are then added to the tail of the queue.

Algorithm (a runnable sketch follows below):
1. Place the starting node in the queue.
2. If the queue is empty, return failure and stop.
3. If the first element in the queue is a goal node, return success and stop; otherwise,
4. Remove and expand the first element of the queue and place all its children at the end of the queue in any order.
5. Go back to step 2.
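A minimal sketch of this queue-based BFS over a graph given as an adjacency dict. The graph below is chosen to mirror the worked example that follows, so the search recovers the path A → B → C → E.

```python
from collections import deque

def bfs(graph, start, goal):
    """Breadth-first search; returns the path found, or None."""
    queue = deque([[start]])          # a queue of paths; head = oldest
    visited = {start}
    while queue:
        path = queue.popleft()        # remove the head of the queue
        node = path[-1]
        if node == goal:
            return path               # goal test when dequeued
        for child in graph.get(node, []):     # expand: generate children
            if child not in visited:
                visited.add(child)
                queue.append(path + [child])  # add to the tail
    return None

# Adjacency dict matching the worked example below:
graph = {'A': ['B', 'D'], 'B': ['C', 'F'], 'D': ['F'],
         'C': ['E', 'G'], 'F': [], 'E': [], 'G': []}
print(bfs(graph, 'A', 'E'))   # ['A', 'B', 'C', 'E']
```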
BFS uses a queue: an abstract data structure that follows the First-In-First-Out methodology (data inserted first will be accessed first). It is open at both ends, where one end is always used to insert data (enqueue) and the other to remove data (dequeue).

Worked example: the minimum path P from node A to node E can be found by applying the breadth-first search algorithm, beginning at node A and ending at E. The algorithm uses two queues, QUEUE1 and QUEUE2: QUEUE1 holds all the nodes that are to be processed, while QUEUE2 holds all the nodes that have been processed and deleted from QUEUE1.

1. Add A to QUEUE1 and NULL to QUEUE2. QUEUE1 = {A}, QUEUE2 = {NULL}
2. Delete node A from QUEUE1 and insert all its neighbours. Insert node A into QUEUE2. QUEUE1 = {B, D}, QUEUE2 = {A}
3. Delete node B from QUEUE1 and insert all its neighbours. Insert node B into QUEUE2. QUEUE1 = {D, C, F}, QUEUE2 = {A, B}
4. Delete node D from QUEUE1 and insert all its neighbours. Since F is its only neighbour and has already been inserted, we do not insert it again. Insert node D into QUEUE2. QUEUE1 = {C, F}, QUEUE2 = {A, B, D}
5. Delete node C from QUEUE1 and insert all its neighbours. Add node C to QUEUE2. QUEUE1 = {F, E, G}, QUEUE2 = {A, B, D, C}
6. Remove F from QUEUE1 and add all its neighbours. Since all of its neighbours have already been added, we do not add them again. Add node F to QUEUE2. QUEUE1 = {E, G}, QUEUE2 = {A, B, D, C, F}
7. Remove E from QUEUE1. All of E's neighbours have already been added to QUEUE1, so we do not add them again. All the nodes are visited, and the target node E is placed into QUEUE2. QUEUE1 = {G}, QUEUE2 = {A, B, D, C, F, E}

Now backtrack from E to A, using the nodes available in QUEUE2. The minimum path is A → B → C → E.

Depth First Search (DFS)

Depth-first search (DFS) is an algorithm for traversing or searching a tree, tree structure, or graph. One starts at the root (selecting some node as the root in the graph case) and explores as far as possible along each branch before backtracking.

Algorithm:
1. Push the root node onto a stack.
2. Pop a node from the stack and examine it.
3. If the element sought is found in this node, quit the search and return a result.
4. Otherwise push all its successors (child nodes) that have not yet been discovered onto the stack.
5. If the stack is empty, every node in the tree has been examined; quit the search and return "not found".
6. If the stack is not empty, repeat from step 2.

DFS worked example: consider the graph G along with its adjacency list. Calculate the order in which all the nodes of the graph are printed, starting from node H, using the depth-first search (DFS) algorithm.

1) Push H onto the stack. STACK: H
2) Pop the top element of the stack, i.e. H, print it, and push all the neighbours of H that are in the ready state onto the stack. Print H. STACK: A
3) Pop the top element of the stack, i.e. A, print it, and push all the neighbours of A that are in the ready state onto the stack. Print A. STACK: B, D
4) Pop the top element of the stack, i.e. D, print it, and push all the neighbours of D that are in the ready state onto the stack. Print D. STACK: B, F
5) Pop the top element of the stack, i.e. F, print it, and push all the neighbours of F that are in the ready state onto the stack. Print F. STACK: B
6) Pop the top element of the stack, i.e. B, print it, and push all the neighbours of B that are in the ready state onto the stack. Print B. STACK: C
7) Pop the top element of the stack, i.e.
C, print it, and push all the neighbours of C that are in the ready state onto the stack. Print C. STACK: E, G
8) Pop the top element of the stack, i.e. G, print it, and push all the neighbours of G that are in the ready state onto the stack. Print G. STACK: E
9) Pop the top element of the stack, i.e. E, and print it. Print E.

The printing order is: H A D F B C G E.

Problem-solving agents, uninformed strategies continued: Depth First Search (DFS), Depth Limited Search (DLS), Iterative Deepening DFS (IDDFS), Uniform Cost Search (UCS), Bidirectional Search.

Depth Limited Search (DLS)

The unbounded-tree problem that appears in DFS can be fixed by imposing a limit on the depth that DFS can reach; this limit we will call the depth limit l. This solves the infinite-path problem. Depth-limited search (DLS) is a modification of depth-first search that bounds the depth to which the search algorithm may go. In addition to starting with a root and goal node, a depth is provided below which the algorithm will not descend. Any nodes below that depth are omitted from the search. This modification keeps the algorithm from cycling indefinitely by halting the search at the pre-imposed depth. Depth-limited search can terminate under two conditions:
1. the solution is found, or
2. there is no solution within the given depth limit.

Process: if the depth is fixed at 2, DLS carries out depth-first search down to the second level of the search tree.

Algorithm:
1. Determine the start node and the search depth.
2. Check if the current node is the goal node. If not, do nothing; if yes, return.
3. Check if the current node is within the specified search depth. If not, do nothing; if yes, expand the node and save all of its successors in a stack.
4. Call DLS recursively for all nodes of the stack and go back to step 2.

(The limit is then raised step by step: with l = 1 we explore levels 0 and 1, then set l = 1 + 1 = 2; with l = 2 we explore levels 0, 1, 2, then set l = 2 + 1 = 3; and so on.)

Iterative Deepening Search (IDS)

Iterative Deepening Search (IDS) is a derivative of DLS and combines the features of depth-first search with those of breadth-first search. IDS operates by performing DLS searches with increasing depths until the goal is found. The depth begins at one and increases until the goal is found or no further nodes can be enumerated. By limiting the depth of the search, we force the algorithm to also search the breadth of the graph. If the goal is not found, the depth that the algorithm is permitted to search is increased and the algorithm is started again.

Iterative Deepening Search (IDS) algorithm (a runnable sketch follows below):
1. Determine the start node and the search depth.
2. Check if the current node is the goal node. If not, do nothing; if yes, return.
3. Check if the current node is within the specified search depth. If not, increment the current limit by 1 and go back to step 2; if yes, expand the node and save all of its successors in a stack.
4. Call DLS recursively for all nodes of the stack and go back to step 2.
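A minimal recursive sketch of depth-limited search with the iterative-deepening loop around it, reusing the adjacency-dict graph format assumed in the BFS sketch earlier.

```python
def dls(graph, node, goal, limit):
    """Depth-limited DFS: explore below `node` no deeper than `limit`."""
    if node == goal:
        return [node]
    if limit == 0:                      # depth limit reached: cut off
        return None
    for child in graph.get(node, []):
        path = dls(graph, child, goal, limit - 1)
        if path is not None:
            return [node] + path
    return None

def iddfs(graph, start, goal, max_depth=20):
    """Run DLS with limits 0, 1, 2, ... until the goal is found."""
    for limit in range(max_depth + 1):
        path = dls(graph, start, goal, limit)
        if path is not None:
            return path
    return None

graph = {'A': ['B', 'D'], 'B': ['C', 'F'], 'D': ['F'],
         'C': ['E', 'G'], 'F': [], 'E': [], 'G': []}
print(iddfs(graph, 'A', 'E'))   # ['A', 'B', 'C', 'E'], found at limit 3
```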
Uniform Cost Search (UCS)

This algorithm is mainly used when the step costs are not the same but we need the optimal solution to the goal state. In such cases, we use uniform cost search to find the goal and the path, including the cumulative cost of expanding each node from the root node to the goal node. It goes neither depth-first nor breadth-first; it always searches for the next node with the lowest path cost, and in the case of equal path costs, let's consider lexicographical order in our case.

Key points:
1. Full form: Uniform Cost Search.
2. It is an uninformed search strategy.
3. Highest priority is given to the minimum path cost.
4. If every step cost is 1, it behaves like the breadth-first search strategy.

Algorithm for uniform cost search (a runnable sketch follows later in this section):
1) Insert the root node into the priority queue.
2) Repeat while the queue is not empty:
   - Remove the element with the highest priority (i.e. the minimum cumulative path cost).
   - If the removed node is the destination, print the total cost and stop the algorithm.
   - Else, enqueue all the children of the current node into the priority queue, with their cumulative cost from the root as the priority.

NOTE: minimum path cost = highest priority. Path found (in the notes' example): S-A-D-G.

Bidirectional Search

The bidirectional search algorithm runs two simultaneous searches, one from the initial state (called the forward search) and the other from the goal node (called the backward search), to find the goal node. Bidirectional search replaces one single search graph with two smaller subgraphs, in which one starts the search from the initial vertex and the other starts from the goal vertex. The search stops when these two graphs intersect each other.

1) There is an initial point and a goal point.
2) Two searches run: i) from the initial node (forward search), and ii) from the goal node (backward search).
3) The algorithm ends at the intersection point.
4) It returns a single path (forward search path + backward search path).

Complexity: with branching factor b and maximum depth d, the forward search costs O(b^(d/2)) and the backward search costs O(b^(d/2)); together that is b^(d/2) + b^(d/2) = 2 * b^(d/2), far smaller than the b^d of a single full-depth search.

Example: the start node is 5 and the end node is 4. Aim: to find the shortest path from 5 to 4 using bidirectional search, doing BFS from both directions.
1. Start moving forward from the start node (green) and backwards from the end node (orange).
2. As in BFS, at every step explore the next level of nodes until you find an intersecting node.
3. Stop on finding the intersecting node.
4. Trace back to find the path. Path: 5-0-1-3-4 or 5-7-1-3-4.

Algorithm (pseudocode):
startq = queue for BFS from the start node
endq = queue for BFS from the end node
parent = array where parent[i] is the parent of node i
visited = array where visited[i] = True if node i has been encountered
while startq is not empty and endq is not empty:
    perform the next iteration of BFS for startq (saving each child's parent in the parent array)
    perform the next iteration of BFS for endq
    if we have encountered the intersection node:
        save the intersection node
        break
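Before moving on to informed search, here is a minimal sketch of the uniform cost search algorithm above, using Python's heapq as the priority queue. The weighted graph is a made-up example; ties on cost break lexicographically because the node name is part of each heap entry, matching the convention stated above.

```python
import heapq

def ucs(graph, start, goal):
    """Uniform cost search; graph maps node -> list of (child, step_cost)."""
    frontier = [(0, start, [start])]   # (cumulative cost, node, path)
    explored = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)  # lowest cost first
        if node == goal:
            return cost, path
        if node in explored:
            continue
        explored.add(node)
        for child, step in graph.get(node, []):
            heapq.heappush(frontier, (cost + step, child, path + [child]))
    return None

# Hypothetical weighted graph; UCS returns the cheapest path, not the
# one with the fewest edges.
graph = {'S': [('A', 1), ('B', 4)], 'A': [('D', 3), ('B', 2)],
         'B': [('G', 5)], 'D': [('G', 2)]}
print(ucs(graph, 'S', 'G'))   # (6, ['S', 'A', 'D', 'G'])
```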
Informed Search Algorithms
Here, the algorithms have information on the goal state, which helps in more efficient searching. This information is obtained by something called a heuristic. Informed search uses two main algorithms, given below:
1. Greedy Search OR Best First Search algorithm
2. A* algorithm
Search Heuristics: in an informed search, a heuristic is a function that estimates how close a state is to the goal state. Examples are the Manhattan distance, the Euclidean distance, etc. (The lesser the distance, the closer the goal.) Different heuristics are used in the different informed algorithms discussed below.

ADMISSIBILITY OF A HEURISTIC:
A heuristic function is said to be admissible if it never overestimates the cost of reaching the goal: the cost it estimates to reach the goal should never be greater than the lowest possible cost from the current point, i.e. h(n) <= h*(n), where h*(n) is the true lowest cost from n to the goal.
Q. Source: Arad, Goal: Bucharest. Check whether the heuristic value of the source is admissible or not; if yes, then show the path using Greedy Best First Search, else show the actual path cost and the heuristic value of the source node.
OR
Q. Implement Recursive Best First Search for the Romanian map problem.

Greedy Search OR Best First Search algorithm:
Step 1: Place the starting node into the OPEN list.
Step 2: If the OPEN list is empty, stop and return failure.
Step 3: Remove the node n from the OPEN list which has the lowest value of h(n), and place it in the CLOSED list.
Step 4: Expand the node n, and generate the successors of node n.
Step 5: Check each successor of node n, and find whether any node is a goal node or not. If any successor node is a goal node, then return success and terminate the search; else proceed to Step 6.
Step 6: For each successor node, the algorithm checks the evaluation function f(n), and then checks whether the node is already in the OPEN or CLOSED list. If the node is in neither list, add it to the OPEN list.
Step 7: Return to Step 2.
Example 2: Consider the search problem below, which we will traverse using greedy best-first search. At each iteration, each node is expanded using the evaluation function f(n) = h(n), which is given in the table below. In this search example, we use two lists: OPEN and CLOSED. The iterations for traversing the example are as follows.
Expand the nodes of S and put them in the CLOSED list.
Initialization: Open [S]
Open [A, B], Closed [S]; checking heuristic values h[A] = 12, h[B] = 4; taking the minimum, h(n) = h[B].
Iteration 1: Open [A], Closed [S, B]; checking heuristic values h[E] = 8, h[F] = 2; taking the minimum, h(n) = h[F].
Iteration 2: Open [E, F, A], Closed [S, B]; then Open [E, A], Closed [S, B, F].
Iteration 3: Open [I, G, E, A], Closed [S, B, F]; then Open [I, E, A], Closed [S, B, F, G].
Hence the final solution path will be: S ---> B ---> F ---> G

2. A* Search Algorithm:
A* search is the most commonly known form of best-first search. It uses the heuristic function h(n) and the cost to reach node n from the start state, g(n). It combines the features of UCS and greedy best-first search, which lets it solve problems efficiently. The A* search algorithm finds the shortest path through the search space using the heuristic function; it expands a smaller search tree and provides the optimal result faster. A* is similar to UCS except that it uses g(n) + h(n) instead of g(n).
Algorithm of A* search:
Step 1: Place the starting node in the OPEN list.
Step 2: Check whether the OPEN list is empty or not; if the list is empty then return failure and stop.
Step 3: Select the node from the OPEN list which has the smallest value of the evaluation function (g + h). If node n is the goal node then return success and stop; otherwise:
Step 4: Expand node n and generate all of its successors, and put n into the CLOSED list. For each successor n', check whether n' is already in the OPEN or CLOSED list; if not, compute the evaluation function for n' and place it into the OPEN list.
Step 5: Else, if node n' is already in OPEN or CLOSED, its back pointer should be updated to reflect the lowest g(n') value.
Step 6: Return to Step 2.
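A compact Python sketch of the A* procedure above, using a priority queue ordered by f(n) = g(n) + h(n); the graph and heuristic table are hypothetical values chosen to be consistent with the worked example that follows:

```python
import heapq

def a_star(graph, h, start, goal):
    # graph[node] = list of (neighbor, step_cost); h[node] = heuristic estimate.
    open_list = [(h[start], 0, start, [start])]  # (f, g, node, path)
    closed = set()
    while open_list:
        f, g, node, path = heapq.heappop(open_list)
        if node == goal:
            return path, g
        if node in closed:
            continue
        closed.add(node)
        for child, cost in graph.get(node, []):
            if child not in closed:
                g2 = g + cost
                heapq.heappush(open_list, (g2 + h[child], g2, child, path + [child]))
    return None, float('inf')

# Hypothetical graph consistent with the worked example below:
graph = {'S': [('A', 1), ('G', 10)],
         'A': [('B', 2), ('C', 1)],
         'C': [('D', 3), ('G', 4)]}
h = {'S': 5, 'A': 3, 'B': 4, 'C': 2, 'D': 6, 'G': 0}
print(a_star(graph, h, 'S', 'G'))  # (['S', 'A', 'C', 'G'], 6)
```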
Example: In this example, we will traverse the given graph using the A* algorithm. The heuristic value of all states is given in the table, so we calculate f(n) for each state using the formula f(n) = g(n) + h(n), where g(n) is the cost to reach the node from the start state. Here we use the OPEN and CLOSED lists.
Path cost + heuristic value: S->A: f = 1 + 3 = 4; S->A->B: f = (1 + 2) + 4 = 7.
Solution:
Initialization: (S, 5)
Iteration 1: expand S, giving {(S->A, f = 1 + 3 = 4), (S->G, f = 10 + 0 = 10)}. A is chosen.
Iteration 2: expand A, whose successors are B and C. g(S->A->B) = 1 + 2 = 3, h(B) = 4, so f(B) = 7. g(S->A->C) = 1 + 1 = 2, h(C) = 2, so f(C) = 4. C is chosen.
Iteration 3: expand C, whose successors are D and G. g(S->A->C->D) = 5, h(D) = 6, so f(D) = 11. g(S->A->C->G) = 6, h(G) = 0, so f(G) = 6. G is chosen.
Iteration 4 gives the final result: S ---> A ---> C ---> G, which is the optimal path, with cost 6.
NOTE: S expands to (A, G). For A, the heuristic value is h(n) = 3 and the path cost is g(n) = 1, so f(n) = h(n) + g(n) = 3 + 1 = 4. For G, h(n) = 0 and g(n) = 10, so f(n) = 0 + 10 = 10. As we have to consider the smallest evaluation value, which is 4, we take A as the next node to expand.
Exercise: solve Greedy and A* for this graph.

UNIT 2: Learning from Examples
LEARNING FROM EXAMPLES
An agent is learning if it improves its performance on future tasks after making observations about the world.

Supervised learning
Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher. Basically, supervised learning is learning in which we teach or train the machine using data which is well labeled, meaning the data is already tagged with the correct answer. After that, the machine is provided with a new set of examples (data) so that the supervised learning algorithm analyses the training data (the set of training examples) and produces a correct outcome from labeled data.
1) BUILD MODEL
2) TRAIN DATA – training data is labeled data (input -> correct output).
X1, X2 = inputs / independent variables / predictors
Y = output / dependent variable / response

Training data:
HEIGHT (X1)   WEIGHT (X2)   T-SHIRT SIZE (Y)
5.0           45            M
4.5           40            S

3) TEST DATA (unseen data):
4.3           45            ?
5.0           45            M (predicted correctly: 100% accuracy)

As a practical example, say we are trying to build a model which can detect fraudulent credit card transactions. What we can do is gather as much data as we can from past incidents. Say we have 500 records of fraudulent transactions and 500 records of normal transactions. Now we can use any supervised learning algorithm, which will try to find patterns in the data for fraudulent transactions as well as for normal ones. So now our model knows that when a certain pattern exists in the data, there is some probability that the transaction is fraudulent.

Types of Supervised Learning
Supervised learning is mainly divided into two parts, which are as follows.
Regression
Regression is the type of supervised learning in which labeled data is used to make predictions in a continuous form. The predicted output is continuous, and the fitted graph is linear. Regression is a form of predictive modeling technique which investigates the relationship between a dependent variable [output] and independent variables [inputs]. Ex: one example of the regression technique is house price prediction, where the price of a house is predicted from inputs such as the number of rooms, locality, ease of transport, age of the house, and area of the house.

Linear Regression
There are two types of linear regression:
1. Simple Linear Regression
2. Multiple Linear Regression
1. Simple Linear Regression: simple linear regression is useful for finding the relationship between two continuous variables.
One is the predictor or independent variable and the other is the response or dependent variable. The core idea is to obtain a line that best fits the data. The best fit line is the one for which the total prediction error (over all data points) is as small as possible. Error is the distance between a point and the regression line.
Example: We have a dataset which contains information about the relationship between 'number of hours studied' and 'marks obtained'. Many students have been observed and their hours of study and grades recorded. This will be our training data. The goal is to design a model that can predict the marks given the number of hours studied. Using the training data, a regression line is obtained which gives minimum error. This linear equation is then used for any new data: if we give the number of hours studied by a student as an input, our model should predict their marks with minimum error.
We can model a simple linear regression using the following formula:
Y = c + m*X + e    OR    Y = W0 + W1*X1 + e    OR    Y = B0 + B1*X1 + e
Where,
Y = dependent variable (output variable)
X = independent variable (input variable)
c = constant term, a.k.a. intercept
m = coefficient of the relationship between X and Y
e = random error / residual
For example, if the actual output of the model is 50 and the predicted output is 40, then e = 50 - 40 = 10.
The intercept (sometimes called the "constant") in a regression model represents the mean value of the response variable when all of the predictor variables in the model are equal to zero. Suppose we'd like to fit a simple linear regression model using hours studied as a predictor variable and exam score as the response variable. We collect this data for 50 students in a certain college course and fit the following regression model:
Exam score = 65.4 + 2.67(hours)
The value of the intercept term in this model is 65.4. This means the average exam score is 65.4 when the number of hours studied is zero.
The regression line always passes through the mean of the independent variable (X) as well as the mean of the dependent variable (Y). So first, we have to calculate the mean of X and the mean of Y using the following formulas:
X̄ = (Σ xi) / n
Ȳ = (Σ yi) / n
For a model with one predictor, calculate the constant c (the Y-intercept of the line):
c = Ȳ - m·X̄
Calculate m (the coefficient of the relationship between X and Y) using the following formula:
m = Σ (xi - X̄)(yi - Ȳ) / Σ (xi - X̄)^2
Exploring m:
If m > 0, then X (predictor) and Y (target) have a positive relationship: an increase in X will increase Y.
If m < 0, then X (predictor) and Y (target) have a negative relationship: an increase in X will decrease Y.
The obtained regression line is then drawn through the data (see figure).

Multiple Regression
We use the multiple linear regression model when we're dealing with a dependent variable that is affected by more than one factor. For example, a person's salary can be affected by their years of experience, years of education, daily working hours, etc. In this case we would use multiple linear regression. The multiple regression equation takes the following form:
Y = m1*X1 + m2*X2 + m3*X3 + … + c + e
Where,
Y = dependent variable (output variable)
X1, X2, X3, … = independent variables (input variables)
c = constant term, a.k.a. intercept
m1, m2, m3, … = coefficients of the relationship between each X and Y
Example: Y = m·X + c + e, i.e. Sales = slope·Radio + intercept + random error.
Slope: the coefficient for the Radio independent variable; in machine learning we call coefficients weights.
Radio: the independent variable; in machine learning we call these variables features.
Intercept: where our line intercepts the y-axis; in machine learning this is the bias.
Our algorithm will try to learn the correct values for the weight and bias. By the end of our training, our equation will approximate the line of best fit.
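As a minimal sketch, the slope and intercept formulas above can be computed directly in Python (the hours/marks numbers are illustrative assumptions):

```python
def fit_simple_linear_regression(x, y):
    # m = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)^2,  c = ȳ - m·x̄
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    m = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / \
        sum((xi - x_mean) ** 2 for xi in x)
    c = y_mean - m * x_mean
    return m, c

# Hypothetical data: hours studied vs. marks obtained.
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 64, 68]
m, c = fit_simple_linear_regression(hours, marks)
print(f"marks = {c:.2f} + {m:.2f} * hours")
print("predicted marks for 6 hours:", c + m * 6)
```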
Classification
Classification is a technique to categorize our data into a desired and distinct number of classes, where we can assign a label to each class. Classifiers can be:
1. Binary classifiers: classification with only 2 distinct classes or 2 possible outcomes. Examples:
1. YES or NO
2. Classification of spam email and non-spam email
3. Classification of the author of a book
4. Positive and negative sentiment
2. Multi-class classifiers: classification with more than two distinct classes. Examples:
1. Classification of types of soil
2. Classification of types of crops
3. Classification of music

Classification Tree
Given data of attributes together with its classes, a decision tree produces a sequence of rules that can be used to classify the data. A decision tree, as its name says, makes decisions with a tree-like model. It splits the sample into two or more homogeneous sets (leaves) based on the most significant differentiators in your input variables. To choose a differentiator (predictor), the algorithm considers all features and does a binary split on them. It then chooses the one with the least cost (i.e. highest accuracy), and repeats recursively until it successfully splits the data in all leaves (or reaches the maximum depth). A decision tree typically starts with a single node, which branches into possible outcomes. Each of those outcomes leads to additional nodes, which branch off into other possibilities. This gives it a tree-like shape.
Classification Rules: classification rules are the cases in which all the scenarios are taken into consideration and a class variable is assigned to each.
We will now see 'the greedy approach' to create a perfect decision tree.
The Greedy Approach: "The greedy approach is based on the concept of heuristic problem solving: making an optimal local choice at each node. By making these local optimal choices, we reach the approximate optimal solution globally."
When we start to implement the algorithm, the first question is: 'How do we pick the starting test condition?' The answer to this question lies in the values of 'Entropy' and 'Information Gain'.
1. Entropy: entropy in a decision tree stands for homogeneity. If the data is completely homogeneous, the entropy is 0 (pure); if the data is divided 50-50%, the entropy is 1 (impure).
2. Information Gain: information gain is the decrease in the entropy value when the node is split. An attribute should have the highest information gain to be selected for splitting. Based on the computed values of entropy and information gain, we choose the best attribute at any particular step.
Let us consider the following data. Many different decision trees can be formulated from this set of attributes.
Tree Creation Trial 1: here we take up the attribute 'Student' as the initial test condition.
Tree Creation Trial 2: similarly, why choose 'Student'? We could equally choose 'Income' as the test condition.
Creating the Perfect Decision Tree with the Greedy Approach
Let us follow the greedy approach and construct the optimal decision tree. There are two classes involved: 'Yes', i.e. the person buys a computer, and 'No', i.e. he does not. To calculate entropy and information gain, we compute the value of the probability for each of these 2 classes.
Positive: for 'buys_computer = yes' the probability comes out to be P = 9/14.
Negative: for 'buys_computer = no' the probability comes out to be N = 5/14.
In a given data set, assume that there are two classes, P and N (Positive (yes) and Negative (no)). We calculate the entropy by putting the probability values into the formula:
Entropy = -P log2(P) - N log2(N)
We have already classified the values of entropy, which are:
Entropy = 0: data is completely homogeneous (pure)
Entropy = 1: data is divided 50-50% (impure)
Our value of entropy is 0.940, which means our set is almost impure.
Next, we find the most suitable attribute by calculating the information gain. What is the information gain if we split on 'Age'?
This data represents how many people in each specific age bracket buy or do not buy the product. For example, for people with age 30 or less, 2 people buy (Yes) and 3 people do not buy (No) the product. Info(D) is calculated for these 3 categories of people and is represented in the last column; Info(D) for the age attribute is the total over these 3 ranges of age values.
Now, the question is: what is the information gain if we split on the 'Age' attribute? The difference between the total entropy value (0.940) and the entropy computed for the age attribute (0.694) gives the information gain. (We have to calculate the entropy for 'Age' first.)
Information Gain(X) = Entropy(T) - Entropy(X), so Information Gain(Age) = 0.940 - 0.694 = 0.246.
This is the deciding factor for whether we should split at 'Age' or any other attribute. Similarly, we calculate the information gain for the rest of the attributes:
Information Gain (Age) = 0.246
Information Gain (Income) = 0.029
Information Gain (Student) = 0.151
Information Gain (credit_rating) = 0.048
On comparing these values of gain for all the attributes, we find that the information gain for 'Age' is the highest. Thus, splitting at 'Age' is a good decision. Similarly, at each split, we compare the information gains to find out whether an attribute should be chosen for the split or not. The optimal tree created thus looks as shown.
The classification rules for this tree can be jotted down as:
If a person's age is less than 30 and he is not a student, he will not buy the product.
Age(>40) ^ credit_rating(fair) => Yes
Thus, we achieve the perfect decision tree!
Example: how do we build a tree, and which attribute do we need to pick first? ID3 (Iterative Dichotomizer 3) performs a top-down, greedy search through the space of possible decision trees, and chooses the best attribute at each step as the attribute with the highest information gain. We apply these metrics to our dataset to split the data.
Steps:
1. Compute the entropy for the data set.
2. For every attribute/feature:
i. calculate the entropy for all categorical values;
ii. take the average information entropy for the current attribute;
iii. calculate the gain for the current attribute.
3. Pick the highest-gain attribute.
4. Repeat until we get the tree we desire.
Compute the entropy for the weather data set, then for every feature calculate the entropy and information gain. Similarly we can calculate for the other two attributes (Humidity and Temp). Pick the highest-gain attribute.
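A minimal Python sketch of the entropy and information-gain computations used by ID3 (the function names are assumptions; the final check reuses the 9 'yes' / 5 'no' split from the buys_computer example):

```python
from collections import Counter
from math import log2

def entropy(labels):
    # Entropy = -Σ p_i * log2(p_i) over the class proportions.
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    # Gain = Entropy(parent) - Σ (|subset|/|parent|) * Entropy(subset).
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute_index], []).append(label)
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Sanity check against the worked example: 9 'yes' and 5 'no' labels.
labels = ['yes'] * 9 + ['no'] * 5
print(round(entropy(labels), 3))  # 0.94
```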
So our root node is Outlook. Repeat the same procedure for the sub-trees until we get the final tree.

Nonparametric Models
A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. A nonparametric model is one that cannot be characterized by a bounded set of parameters.

K-NN (K Nearest Neighbour)
The k-Nearest Neighbours (k-NN) method of classification is one of the simplest methods in machine learning: it stores all available cases and classifies new cases based on a similarity measure. K-NN is a non-parametric supervised learning technique in which we try to classify a data point into a given category with the help of the training set. In simple words, it captures information from all training cases and classifies new cases based on similarity. An object is classified by a majority vote of its neighbours, the object being assigned to the class most common among its k nearest neighbours (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its single nearest neighbour.
How does the K-NN algorithm work? Suppose we have the height, weight and T-shirt size of some customers, and we need to predict the T-shirt size of a new customer given only the height and weight information we have. Data including height, weight and T-shirt size information is shown below.
Step 1: Calculate similarity based on a distance function.
There are many distance functions, but Euclidean distance is the most commonly used measure; it is mainly used when data is continuous. The idea of using a distance measure is to find the distance (similarity) between the new sample and the training cases, and then find the k closest customers to the new customer in terms of height and weight.
A new customer named 'Clay' has height 161 cm and weight 61 kg. The Euclidean distance between the first observation and the new observation (Clay) is computed as follows.
New observation: height (x1) = 161 cm, weight (y1) = 61 kg; T-shirt size = ?
First observation: height (x2) = 158 cm, weight (y2) = 58 kg; T-shirt size = M
Distance = sqrt((x1 - x2)^2 + (y1 - y2)^2)
Distance1 = sqrt((161 - 158)^2 + (61 - 58)^2) = sqrt(18) ≈ 4.24
Similarly, we calculate the distance of all the training cases from the new case and rank them in terms of distance. The smallest distance value is ranked 1 and considered the nearest neighbour.
Step 2: Find the K nearest neighbours.
Let k be 5. The algorithm then searches for the 5 customers closest to Clay, i.e. most similar to Clay in terms of attributes, and sees which categories those 5 customers were in. If 4 of them had 'Medium' T-shirt sizes and 1 had 'Large', then the best guess for Clay is 'Medium' (see the calculated distances in the snapshot).
In the graph, the binary dependent variable (T-shirt size) is displayed in blue and orange: 'Medium T-shirt size' is in blue and 'Large T-shirt size' in orange. The new customer's information is shown as a yellow circle. Four blue highlighted data points and one orange highlighted data point are close to the yellow circle, so the prediction for the new case is 'Medium T-shirt size'.
A common rule of thumb for the optimal value of K is K = sqrt(n), where n is the total number of data points.
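A minimal Python sketch of the two k-NN steps just described, distance ranking followed by a majority vote (the training rows are illustrative assumptions, not the exact table from the slides):

```python
from collections import Counter
from math import sqrt

def knn_predict(train_points, train_labels, new_point, k):
    # Step 1: rank training points by Euclidean distance to the new point.
    distances = sorted(
        (sqrt(sum((a - b) ** 2 for a, b in zip(p, new_point))), label)
        for p, label in zip(train_points, train_labels)
    )
    # Step 2: majority vote among the k nearest neighbours.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (height, weight) -> T-shirt size data; predict Clay's size.
train = [(158, 58), (160, 59), (163, 61), (165, 63), (168, 66)]
sizes = ['M', 'M', 'M', 'L', 'L']
print(knn_predict(train, sizes, (161, 61), k=3))  # 'M'
```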
Exercise: use KNN to predict the class (YES or NO). Training data: X1 = {9, 9, 5, 5}; X2 = {9, 6, 6, 7}; Y = {YES, YES, NO, NO}. Given K = 3, what is the Y value for a new record with X1 = 5, X2 = 9?
STEP 1: CALCULATING DISTANCES
Obs   Formula                         Dist   Rank   Y (within K = 3)
1st   sqrt((9 - 5)^2 + (9 - 9)^2)     4      3      YES
2nd   sqrt((9 - 5)^2 + (6 - 9)^2)     5      4      (not among the 3 nearest)
3rd   sqrt((5 - 5)^2 + (6 - 9)^2)     3      2      NO
4th   sqrt((5 - 5)^2 + (7 - 9)^2)     2      1      NO
Majority vote among the 3 nearest neighbours: NO. So for X1 = 5, X2 = 9, Y = NO.
Q. Calculate the Euclidean distance between the two data points A(5, 4) and B(2, 3): sqrt((5 - 2)^2 + (4 - 3)^2) = sqrt(10).
Q. Below are the 7 actual values of the target variable from the training data: [NO, NO, NO, YES, YES, YES, YES]. What will be the entropy of the target variable?
Entropy = -P log2(P) - N log2(N) = -(4/7) log2(4/7) - (3/7) log2(3/7)

SUPERVISED LEARNING (recap)
1) Linear Regression
i. Simple linear regression
ii. Multiple linear regression
2) Classification
i. Classification Tree
ii. Logistic Regression
iii. KNN (K Nearest Neighbour)
iv. SVM (Support Vector Machine)
v. Naïve Bayes Classifier
KNN recap example: Y takes values class1 or class2; the dataset has records 1 to 14 (training) plus a new test record, and K = 5. Compute the distance from the new record to each training record (x1, x2, x3, x4, ...) and rank them:
New record to x1: distance 3 (rank 2) -> class1
New record to x2: distance 4 (rank 3) -> class2
New record to x3: distance 2 (rank 1) -> class1
New record to x4: distance 5 (rank 4) -> class1
New record to x5: distance 6 (rank 5) -> class2
New record to x6: distance 8 (outside the 5 nearest)
The majority among the 5 nearest neighbours is class1.

Logistic Regression
Logistic regression is the most famous machine learning algorithm after linear regression. In a lot of ways, linear regression and logistic regression are similar, but the biggest difference lies in what they are used for. Linear regression algorithms are used to predict/forecast values, but logistic regression is used for classification tasks: for example, classifying whether an email is spam or not, whether a tumour is malignant or benign, or whether a website is fraudulent or not.
Binary logistic regression is used when the dependent variable is binary (zero or one, presence or absence, death or survival). The independent variables can be categorical or continuous.
X: independent variable / input variable / predictor / feature / attribute
Y: dependent variable / output variable / response variable / target variable
The Odds Ratio: it is equal to the probability of success divided by the probability of failure:
Odds(p) = p / (1 - p)
The log of the odds function is often called the "logistic function": y = log(x / (1 - x)). The inverse of the logistic function is called the sigmoid function.
The logistic regression algorithm also uses a linear equation with independent predictors to predict a value. The predicted value can be anywhere between negative infinity and positive infinity, but we need the output of the algorithm to be a class variable, i.e. 0 = no, 1 = yes. To squash the predicted value between 0 and 1, we use the sigmoid function:
σ(x) = 1 / (1 + e^(-x))
Sigmoid functions are useful when you are working with probabilities or trying to classify data; the term "sigmoid function" refers to a class of functions with S-shaped curves.
The logistic regression equation can be obtained from the linear regression equation. We know the equation of a straight line can be written as:
y = c + m1x1, or more generally y = c + m1x1 + m2x2 + … + mnxn
In logistic regression, y can only be between 0 and 1, so we model the log-odds with the linear term x:
log(y / (1 - y)) = x
e^x = y / (1 - y)
y = e^x (1 - y)
y = e^x - y·e^x
y + y·e^x = e^x
y(1 + e^x) = e^x
y = e^x / (1 + e^x)
Dividing the numerator and denominator by e^x:
y = 1 / (e^(-x) + 1) = 1 / (1 + e^(-x))
The resulting sigmoid is an S-shaped curve between 0 and 1.
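A short Python sketch of the sigmoid squashing step described above (the weights and bias here are hypothetical, as if already learned):

```python
from math import exp

def sigmoid(x):
    # Squashes any real value into the (0, 1) interval.
    return 1 / (1 + exp(-x))

def predict_class(features, weights, bias, threshold=0.5):
    # Linear combination of the predictors, passed through the sigmoid,
    # gives P(y = 1); thresholding turns it into a class label.
    z = sum(w * f for w, f in zip(weights, features)) + bias
    probability = sigmoid(z)
    return (1 if probability >= threshold else 0), probability

# Hypothetical learned weights for a two-feature spam classifier.
weights, bias = [1.2, -0.7], -0.5
print(predict_class([2.0, 1.0], weights, bias))  # (1, ~0.77)
```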
Support Vector Machine
Support vector machines, also known as SVMs, are well-known supervised classification algorithms that separate different categories of data. Data points are classified by optimizing the separating line so that the closest point in each of the groups is as far away from it as possible. The goal of a support vector machine is to find the optimal separating hyperplane, which maximizes the margin of the training data. It is a classification method where we plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate.
The first thing we can see from this definition is that an SVM needs training data, which means it is a supervised learning algorithm. It is also important to know that SVM is a classification algorithm, which means we use it to predict whether something belongs to a particular class.
Basic Terminologies
1. Hyperplane
Definition: a hyperplane in SVM is the decision boundary that separates different classes of data points. In a 2D space, this boundary is a line; in a 3D space, it is a plane; in higher dimensions, it is referred to as a hyperplane.
Example: imagine a dataset with two features (2D space). The hyperplane would be a line that best separates the data points into different classes.
2. Support Vectors
Definition: support vectors are the data points that are closest to the hyperplane. These points are crucial because they determine the position and orientation of the hyperplane. The SVM algorithm focuses on these support vectors rather than on the entire dataset.
Example: in the previous example, the points nearest to the decision boundary (hyperplane) from each class are the support vectors. These points are critical: if they were removed, the hyperplane would change.
3. Margin
Definition: the margin is the distance between the hyperplane and the nearest data points from each class (the support vectors). The objective of SVM is to maximize this margin; a larger margin provides better generalization for the model, meaning it can classify unseen data more effectively.
Example: the margin is the distance from the hyperplane to the closest points (support vectors) of each class.
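Assuming scikit-learn is available, a minimal sketch of training a linear SVM on toy data looks like this (the data points are illustrative assumptions):

```python
from sklearn import svm

# Toy 2-D data: two linearly separable classes.
X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

# A linear kernel searches for the maximum-margin separating hyperplane.
clf = svm.SVC(kernel='linear')
clf.fit(X, y)

print("support vectors:", clf.support_vectors_)  # the points that fix the hyperplane
print("prediction for [3, 2]:", clf.predict([[3, 2]]))
```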