AI Midterm Reviewer
Silliman University
Fabio S. Hascoet
Summary
This document provides an introduction to search algorithms in artificial intelligence. It covers uninformed search methods, like breadth-first and depth-first search; informed search strategies, such as greedy best-first search and A*; and adversarial search, including minimax and techniques for optimizing it. It also reviews logical knowledge representation and inference, probabilistic reasoning with Bayesian networks and Markov models, and optimization techniques including local search, linear programming, and constraint satisfaction.
Fabio S. Hascoet
Introduction to Artificial Intelligence
CS37

SEARCH

Agent - An entity that perceives its environment and acts upon that environment.
State - A configuration of an agent in its environment.
- Initial State - The state from which the search algorithm starts.
Actions - Choices that can be made in a state.
Goal Test - The condition that determines whether a given state is a goal state.
Path Cost - A numerical cost associated with a given path.
Solution - A sequence of actions that leads from the initial state to the goal state.
- Optimal Solution - A solution that has the lowest path cost among all solutions.
Node - A data structure that keeps track of: State, Parent, Action, Path Cost.

Algorithms that fall under uninformed search:
- Breadth-First Search
- Depth-First Search

Depth-First Search (DFS) exhausts one direction before trying another. Here the frontier is managed as a stack data structure; the catchphrase to remember is "last-in first-out." After nodes are added to the frontier, the first node to remove and consider is the last one to be added. This results in a search algorithm that goes as deep as possible in the first direction that gets in its way, leaving all other directions for later.

Pros of DFS:
- At best, this algorithm is the fastest. If it "lucks out" and always chooses the right path to the solution (by chance), then depth-first search takes the least possible time to get to a solution.

Cons of DFS:
- It is possible that the found solution is not optimal.
- At worst, this algorithm will explore every possible path before finding the solution, thus taking the longest possible time before reaching the solution.

Breadth-First Search (BFS) follows multiple directions at the same time, taking one step in each possible direction before taking a second step in any one direction. Here the frontier is managed as a queue data structure; the catchphrase to remember is "first-in first-out." All the new nodes add up in line, and nodes are considered based on which one was added first (first come, first served!).

Pros of BFS:
- This algorithm is guaranteed to find the optimal solution.

Cons of BFS:
- This algorithm is almost guaranteed to take longer than the minimal time to run.
- At worst, this algorithm takes the longest possible time to run.

Greedy best-first search expands the node that is closest to the goal, as determined by a heuristic function h(n). Its efficiency depends on how good the heuristic function is. For example, in a maze, an algorithm can use a heuristic that relies on the Manhattan distance between the possible nodes and the end of the maze. The Manhattan distance ignores walls and counts how many steps up, down, or to the sides it would take to get from one location to the goal location.

A* search expands the node with the lowest value of g(n) + h(n), where g(n) is the cost accumulated so far to reach the node and h(n) is the estimated cost to the goal. For A* search to be optimal, the heuristic function h(n) should be:
- Admissible: it never overestimates the true cost.
- Consistent: the estimated path cost to the goal of a new node, plus the cost of transitioning to it from the previous node, is greater than or equal to the estimated path cost to the goal of the previous node. In equation form, h(n) is consistent if for every node n and successor node n' with step cost c, h(n) ≤ h(n') + c.
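To make g(n) and h(n) concrete, below is a minimal runnable sketch of A* on a small grid maze, using the Manhattan distance heuristic described above. The grid encoding and the names a_star and manhattan are illustrative assumptions, not part of the notes.

    # A minimal A* sketch on a grid maze (illustrative; the grid
    # encoding and function names are assumptions, not from the notes).
    import heapq

    def manhattan(cell, goal):
        # Heuristic h(n): steps up/down/sideways, ignoring walls.
        (r1, c1), (r2, c2) = cell, goal
        return abs(r1 - r2) + abs(c1 - c2)

    def a_star(grid, start, goal):
        # The frontier is a priority queue ordered by g(n) + h(n).
        frontier = [(manhattan(start, goal), 0, start, [start])]
        explored = set()
        while frontier:
            _, g, cell, path = heapq.heappop(frontier)
            if cell == goal:
                return path                  # solution found
            if cell in explored:
                continue
            explored.add(cell)
            r, c = cell
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                        and grid[nr][nc] == 0 and (nr, nc) not in explored):
                    ng = g + 1               # each step costs 1
                    f = ng + manhattan((nr, nc), goal)
                    heapq.heappush(frontier, (f, ng, (nr, nc), path + [(nr, nc)]))
        return None                          # no solution

    # 0 = open cell, 1 = wall
    maze = [[0, 1, 0],
            [0, 1, 0],
            [0, 0, 0]]
    print(a_star(maze, (0, 0), (0, 2)))
    # [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2)]

Swapping the priority g(n) + h(n) for h(n) alone turns the same loop into greedy best-first search, while replacing the priority queue with a plain stack or queue yields DFS or BFS, respectively.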
In adversarial search, the algorithm faces an opponent that tries to achieve the opposite goal. AI that uses adversarial search is often encountered in games, such as tic-tac-toe.

Minimax - A type of algorithm in adversarial search. Minimax represents winning conditions as (-1) for one side and (+1) for the other side. Further actions will be driven by these conditions, with the minimizing side trying to get the lowest score and the maximizing side trying to get the highest score.

Pseudocode for the function Max-Value(state) in the Minimax algorithm:

    function Max-Value(state):
        if Terminal(state):
            return Utility(state)
        v = -∞
        for action in Actions(state):
            v = Max(v, Min-Value(Result(state, action)))
        return v

Pseudocode for the function Min-Value(state) in the Minimax algorithm:

    function Min-Value(state):
        if Terminal(state):
            return Utility(state)
        v = ∞
        for action in Actions(state):
            v = Min(v, Max-Value(Result(state, action)))
        return v

Way to optimize the minimax algorithm:
- Alpha-Beta pruning skips some recursive computations that are decidedly unfavorable. After establishing the value of one action, if there is initial evidence that the following action can bring the opponent to a better score than the already established action, there is no need to investigate this action further, because it will decidedly be less favorable than the previously established one.
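As a sketch of how the pruning could be added to the Minimax pseudocode above (the game-interface functions terminal, utility, actions, and result are assumed placeholders mirroring the pseudocode, not a real API):

    # Minimax with alpha-beta pruning (a sketch; terminal, utility,
    # actions, and result are assumed placeholders mirroring the
    # pseudocode above).
    import math

    def max_value(state, alpha, beta):
        if terminal(state):
            return utility(state)
        v = -math.inf
        for action in actions(state):
            v = max(v, min_value(result(state, action), alpha, beta))
            if v >= beta:        # the minimizer already has a better option:
                return v         # prune the remaining actions
            alpha = max(alpha, v)
        return v

    def min_value(state, alpha, beta):
        if terminal(state):
            return utility(state)
        v = math.inf
        for action in actions(state):
            v = min(v, max_value(result(state, action), alpha, beta))
            if v <= alpha:       # the maximizer already has a better option:
                return v         # prune the remaining actions
            beta = min(beta, v)
        return v

    # Initial call: max_value(initial_state, -math.inf, math.inf)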
KNOWLEDGE

Transition Model - A description of what state results from performing any applicable action in any state.
State Space - The set of all states reachable from the initial state by any sequence of actions.
Knowledge-Based Agents - Agents that reason by operating on internal representations of knowledge (used to store knowledge and understand situations in order to make smart decisions).
Logic - Taking sentences and reasoning with them to derive other sentences.
Sentence - An assertion about the world in a knowledge representation language (a statement or fact; a way of expressing information).
Propositional Logic - Statements about the world that can be either true or false, but not both.
- And (∧) - both have to be true.
- Or (∨) - at least one has to be true.
- Implication (→) - if A is true, then B is also true.
- Biconditional (↔) - both have to be true, or both have to be false.
Model - An assignment of a truth value to every propositional symbol (a "possible world").
    P: It is raining. Q: It is a Tuesday.
    {P = true, Q = false}
Knowledge Base - A set of sentences known by a knowledge-based agent (the collection of information or facts the agent uses).
Entailment - In every model in which sentence α is true, sentence β is also true. Written α ⊨ β.
Inference - The process of deriving new sentences from old ones (taking existing sentences and using them to create new sentences that logically follow).
    P: It is a Tuesday. Q: It is raining. R: Harry will go for a run.
    KB: (P ∧ ¬Q) → R, P, ¬Q
    Inference: R
Inference Algorithms - Methods used in logic and artificial intelligence to derive new knowledge from existing information or to check the validity of statements.
Model Checking - A formal verification technique used in computer science and artificial intelligence to automatically verify whether a model of a system meets a given specification.
Knowledge Engineering - The field of artificial intelligence that focuses on the creation, representation, and management of knowledge in a way that enables computers to utilize it effectively for reasoning, problem-solving, and decision-making.

Logic Puzzles - Puzzles encoded with more complex sets of propositional symbols, each either true or false.

Inference Rules - Fundamental principles used in logic to derive new conclusions from given premises or statements.
- Modus Ponens - If you have a conditional statement and you know that the antecedent (the "if" part) is true, you can conclude that the consequent (the "then" part) is also true.
    From α → β and α, infer β.
- And Elimination - If you have a statement that says two things are true (a conjunction), you can infer either part.
    From α ∧ β, infer α.
- Double Negation Elimination - If you have a statement that is negated twice, you can conclude the original statement.
    From ¬(¬α), infer α.
- Implication Elimination - A conditional statement can be rewritten as a disjunction.
    From α → β, infer ¬α ∨ β.
- Biconditional Elimination - A biconditional statement can be rewritten as a conjunction of two implications.
    From α ↔ β, infer (α → β) ∧ (β → α).
- De Morgan's Laws - Two important rules describing how negation interacts with conjunctions (AND) and disjunctions (OR).
    From ¬(α ∧ β), infer ¬α ∨ ¬β.
    From ¬(α ∨ β), infer ¬α ∧ ¬β.
- Distributive Property - Describes how conjunction distributes over disjunction and vice versa.
    From α ∧ (β ∨ γ), infer (α ∧ β) ∨ (α ∧ γ).
    From α ∨ (β ∧ γ), infer (α ∨ β) ∧ (α ∨ γ).

Search Problems have:
- initial state
- actions
- transition model
- goal test
- path cost function

Theorem Proving as a search problem:
- initial state: starting knowledge base
- actions: inference rules
- transition model: new knowledge base after inference
- goal test: check the statement we're trying to prove
- path cost function: number of steps in the proof

Resolution - A powerful rule of inference used in logic and automated reasoning, particularly in propositional logic and first-order logic. It provides a systematic way to derive conclusions from a set of premises.
Clause - A disjunction (OR) of literals, e.g. P ∨ Q ∨ R.
Conjunctive Normal Form (CNF) - A logical sentence that is a conjunction of clauses, e.g. (A ∨ B ∨ C) ∧ (D ∨ ¬E) ∧ (F ∨ G).

Conversion to CNF:
    (P ∨ Q) → R
    ¬(P ∨ Q) ∨ R            eliminate implication
    (¬P ∧ ¬Q) ∨ R           De Morgan's Law
    (¬P ∨ R) ∧ (¬Q ∨ R)     distributive law

Inference by Resolution - A fundamental rule of inference used in propositional and first-order logic. Two clauses containing complementary literals resolve into a new clause:
    From P ∨ Q and ¬P ∨ R, infer Q ∨ R.
    From P ∨ Q ∨ S and ¬P ∨ R ∨ S, infer Q ∨ S ∨ R ∨ S, which simplifies to Q ∨ R ∨ S.
    From P and ¬P, infer the empty clause ( ).

To determine if KB ⊨ α:
- Convert (KB ∧ ¬α) to Conjunctive Normal Form.
- Keep checking to see if we can use resolution to produce a new clause.
- If we ever produce the empty clause (equivalent to False), we have a contradiction, and KB ⊨ α.
- Otherwise, if we can't add new clauses, there is no entailment.

Does (A ∨ B) ∧ (¬B ∨ C) ∧ (¬C) entail A?
    Negate the query and add it: (A ∨ B) ∧ (¬B ∨ C) ∧ (¬C) ∧ (¬A)
    Clauses: (A ∨ B), (¬B ∨ C), (¬C), (¬A)
    Resolve (¬B ∨ C) with (¬C): (¬B)
    Resolve (A ∨ B) with (¬B): (A)
    Resolve (A) with (¬A): ( ), the empty clause, so the entailment holds.
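These mechanical steps can be checked with a propositional-logic library. A small sketch using SymPy (choosing SymPy is an assumption; the notes do not prescribe a tool), verifying both the CNF conversion and the entailment example above:

    # Checking the CNF conversion and the entailment example with SymPy
    # (using SymPy here is an assumption; any propositional-logic tool works).
    from sympy import symbols
    from sympy.logic.boolalg import Implies, And, Or, Not, to_cnf
    from sympy.logic.inference import satisfiable

    P, Q, R = symbols('P Q R')
    print(to_cnf(Implies(P | Q, R)))     # (R | ~P) & (R | ~Q)

    # KB ⊨ α exactly when KB ∧ ¬α is unsatisfiable.
    A, B, C = symbols('A B C')
    kb = And(Or(A, B), Or(Not(B), C), Not(C))
    print(satisfiable(And(kb, Not(A))))  # False: the entailment holds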
First-Order Logic (FOL) - Also known as Predicate Logic or First-Order Predicate Logic, an extension of propositional logic that allows for more complex statements about objects and their relationships.

Universal Quantification - A concept in first-order logic that allows us to make statements about all elements in a particular domain. It is represented by the symbol ∀, which means "for all" or "for every."
    ∀x. BelongsTo(x, Gryffindor) → ¬BelongsTo(x, Hufflepuff)
    For all objects x, if x belongs to Gryffindor, then x does not belong to Hufflepuff. Anyone in Gryffindor is not in Hufflepuff.

Existential Quantification - A key concept in first-order logic that allows us to assert the existence of at least one element within a certain domain that satisfies a given property or predicate. It is represented by the symbol ∃, which means "there exists" or "there is at least one."
    ∃x. House(x) ∧ BelongsTo(Minerva, x)
    There exists an object x such that x is a house and Minerva belongs to x. Minerva belongs to a house.

    ∀x. Person(x) → (∃y. House(y) ∧ BelongsTo(x, y))
    For all objects x, if x is a person, then there exists an object y such that y is a house and x belongs to y. Every person belongs to a house.

UNCERTAINTY

Probability - Uncertainty can be represented as a number of events and the likelihood, or probability, of each of them happening.

Possible Worlds, P(ω) - Every possible situation can be thought of as a world, represented by the lowercase Greek letter omega, ω. For example, rolling a die can result in six possible worlds: a world where the die yields a 1, a world where the die yields a 2, and so on. To represent the probability of a certain world, we write P(ω). The probabilities of every possible world, summed together, equal 1:
    Σ P(ω) = 1   (summing over all possible worlds ω)

The probability of rolling a number R with a standard die can be represented as P(R). In our case, P(R) = 1/6, because there are six possible worlds (rolling any number from 1 through 6) and each is equally likely to happen.

Now, consider the event of rolling two dice. There are 36 possible worlds, which are, again, equally likely. To get the probability of an event, we divide the number of worlds in which it occurs by the number of total possible worlds. For example, only in one of the 36 worlds, when both dice yield a 6, do we get the sum of 12. Thus, P(12) = 1/36; in words, the probability of rolling two dice and getting two numbers whose sum is 12 is 1/36. What is P(7)? We count and see that the sum 7 occurs in 6 worlds. Thus, P(7) = 6/36 = 1/6.
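The two-dice counts are easy to verify by brute-force enumeration of the 36 worlds; a throwaway Python sketch (not part of the notes):

    # Enumerate the 36 equally likely worlds for two dice and count sums
    # (a throwaway verification sketch, not from the notes).
    from fractions import Fraction
    from itertools import product

    worlds = list(product(range(1, 7), repeat=2))   # all 36 (die1, die2) pairs

    def p(total):
        favorable = sum(1 for a, b in worlds if a + b == total)
        return Fraction(favorable, len(worlds))

    print(p(12))  # 1/36
    print(p(7))   # 1/6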
Unconditional Probability - The degree of belief in a proposition in the absence of any other evidence.
Conditional Probability - The degree of belief in a proposition given some evidence that has already been revealed. P(a | b) reads "the probability of a given b."
Random Variable - A variable in probability theory with a domain of possible values that it can take on.
Independence - The knowledge that the occurrence of one event does not affect the probability of the other event: P(a ∧ b) = P(a)P(b).
Bayes' Rule - Commonly used in probability theory to compute conditional probability:
    P(b | a) = P(a | b) P(b) / P(a)
Joint Probability - The likelihood of multiple events all occurring.

Probability Rules:
- Negation: P(¬a) = 1 - P(a). This stems from the fact that the sum of the probabilities of all the possible worlds is 1, and the complementary literals a and ¬a cover all the possible worlds.
- Inclusion-Exclusion: P(a ∨ b) = P(a) + P(b) - P(a ∧ b). This can be interpreted in the following way: the worlds in which a or b is true are all the worlds where a is true, plus the worlds where b is true. However, some worlds are counted twice (the worlds where both a and b are true). To get rid of this overlap, we subtract the worlds where both a and b are true once (since they were counted twice).
- Marginalization: P(a) = P(a, b) + P(a, ¬b). The idea here is that b and ¬b are disjoint: the probability of b and ¬b occurring at the same time is 0, and their probabilities sum to 1. Thus, when a happens, b can either happen or not. When we take the probability of both a and b happening, in addition to the probability of a and ¬b, we end up with simply the probability of a. Marginalization can be expressed for random variables the following way:
    P(X = xᵢ) = Σⱼ P(X = xᵢ, Y = yⱼ)
- Conditioning: P(a) = P(a | b)P(b) + P(a | ¬b)P(¬b). This is a similar idea to marginalization: the probability of event a occurring is equal to the probability of a given b times the probability of b, plus the probability of a given ¬b times the probability of ¬b.

A Bayesian network is a data structure that represents the dependencies among random variables. Bayesian networks have the following properties:
- They are directed graphs.
- Each node on the graph represents a random variable.
- An arrow from X to Y represents that X is a parent of Y. That is, the probability distribution of Y depends on the value of X.
- Each node X has probability distribution P(X | Parents(X)).

Inference (in Uncertainty) - While inference under uncertainty does not let us know new information for certain, it allows us to figure out the probability distributions for some values. Inference has the following components:
- Query X: the variable for which we want to compute the probability distribution.
- Evidence variables E: one or more variables that have been observed for event e. For example, we might have observed that there is light rain, and this observation helps us compute the probability that the train is delayed.
- Hidden variables Y: variables that aren't the query and also haven't been observed. For example, standing at the train station, we can observe whether there is rain, but we can't know if there is maintenance on the track further down the road. Thus, Maintenance would be a hidden variable in this situation.
- The goal: calculate P(X | e). For example, compute the probability distribution of the Train variable (the query) based on the evidence e that we know there is light rain.

Inference by Enumeration - A process of finding the probability distribution of variable X given observed evidence e and some hidden variables Y.

Sampling - One technique of approximate inference. In sampling, each variable is sampled for a value according to its probability distribution.

Likelihood Weighting:
- Start by fixing the values for the evidence variables.
- Sample the non-evidence variables using the conditional probabilities in the Bayesian network.
- Weight each sample by its likelihood: the probability of all the evidence occurring.
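A minimal sketch of approximate inference by rejection sampling on a toy Rain → Train network (the structure and all probabilities are invented for illustration; likelihood weighting would instead fix the evidence and weight every sample rather than discard mismatches):

    # Rejection sampling on a toy two-variable network Rain -> Train
    # (the structure and all probabilities are invented for illustration).
    import random

    def sample_rain():
        return 'rain' if random.random() < 0.3 else 'none'

    def sample_train(rain):
        # The train is more likely delayed when it rains (assumed numbers).
        p_delayed = 0.6 if rain == 'rain' else 0.1
        return 'delayed' if random.random() < p_delayed else 'on time'

    # Estimate P(Rain = rain | Train = delayed): discard samples that
    # disagree with the evidence, then count among the rest.
    accepted, rainy = 0, 0
    for _ in range(100_000):
        rain = sample_rain()
        if sample_train(rain) == 'delayed':   # keep only matching samples
            accepted += 1
            rainy += rain == 'rain'
    print(rainy / accepted)                   # ≈ 0.18 / 0.25 = 0.72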
To represent the variable of time, we create a new variable X and change it based on the event of interest, such that Xₜ is the current event, Xₜ₊₁ is the next event, and so on. To predict events in the future, we use Markov models.

The Markov assumption is the assumption that the current state depends on only a finite, fixed number of previous states. This is important to us: think of the task of predicting weather. In theory, we could use all the data from the past year to predict tomorrow's weather, but the Markov assumption makes it feasible to condition only on the recent past.

A Markov chain is a sequence of random variables where the distribution of each variable follows the Markov assumption. That is, each event in the chain occurs based on the probability of the event before it.

A hidden Markov model is a type of Markov model for a system with hidden states that generate some observed event. This means that sometimes the AI has some measurement of the world but no access to the precise state of the world.

Sensor Markov Assumption - The assumption that the evidence variable depends only on the corresponding state.

A hidden Markov model can be represented as a Markov chain with two layers. The top layer, variable X, stands for the hidden state. The bottom layer, variable E, stands for the evidence, the observations that we have.

Based on hidden Markov models, multiple tasks can be achieved:
- Filtering: given observations from start until now, calculate the probability distribution for the current state. For example, given information on when people bring umbrellas from the start of time until today, we generate a probability distribution for whether it is raining today or not.
- Prediction: given observations from start until now, calculate the probability distribution for a future state.
- Smoothing: given observations from start until now, calculate the probability distribution for a past state. For example, calculating the probability of rain yesterday given that people brought umbrellas today.
- Most likely explanation: given observations from start until now, calculate the most likely sequence of events.

OPTIMIZATION

Optimization - Choosing the best option from a set of possible options.

Local Search - A search algorithm that maintains a single node and searches by moving to a neighboring node.

An Objective Function is a function that we use to maximize the value of the solution. A Cost Function is a function that we use to minimize the cost of the solution (this is the function we would use in an example with houses and hospitals, where we want to minimize the distance from houses to hospitals). A Current State is the state that is currently being considered by the function. A Neighbor State is a state that the current state can transition to. In a one-dimensional state-space landscape, a neighbor state is the state to either side of the current state; in the hospitals example, a neighbor state could be the state resulting from moving one of the hospitals in any direction by one step. Neighbor states are usually similar to the current state, and, therefore, their values are close to the value of the current state.

Hill climbing is one type of local search algorithm. In this algorithm, the neighbor states are compared to the current state, and if any of them is better, we change the current node from the current state to that neighbor state. What qualifies as better is defined by whether we use an objective function, preferring a higher value, or a cost function, preferring a lower value.

Pseudocode:

    function Hill-Climb(problem):
        current = initial state of problem
        repeat:
            neighbor = best valued neighbor of current
            if neighbor not better than current:
                return current
            current = neighbor
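A runnable sketch of the same loop on a toy one-dimensional landscape (the objective function and neighbor definition are invented for illustration):

    # Steepest-ascent hill climbing on a toy 1-D landscape
    # (the objective and neighbor definition are invented for illustration).
    def f(x):
        return -(x - 3) ** 2          # objective: a single peak at x = 3

    def hill_climb(x):
        while True:
            # Neighbors of an integer state are one step to either side.
            neighbor = max((x - 1, x + 1), key=f)
            if f(neighbor) <= f(x):   # no neighbor is better: a maximum
                return x
            x = neighbor

    print(hill_climb(10))             # 3

On a landscape with several peaks, the same loop can get stuck at a local maximum, which motivates the variants below.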
Local and Global Minima and Maxima:
- A local maximum (plural: maxima) is a state that has a higher value than its neighboring states.
- A global maximum is a state that has the highest value of all states in the state-space.

The problem with hill climbing algorithms is that they may end up in local minima and maxima. The variants below are phrased such that a higher value is better, but they also apply to cost functions, where the goal is to minimize cost:
- Steepest-ascent: choose the highest-valued neighbor. This is the standard variation discussed above.
- Stochastic: choose randomly from among the higher-valued neighbors. Doing this, we move in any direction that improves our value. This makes sense if, for example, the highest-valued neighbor leads to a local maximum while another neighbor leads to the global maximum.
- First-choice: choose the first higher-valued neighbor.
- Random-restart: conduct hill climbing multiple times, each time starting from a random state. Compare the maxima from every trial and choose the highest among them.
- Local Beam Search: choose the k highest-valued neighbors. This is unlike most local search algorithms in that it uses multiple nodes for the search, not just one.

Simulated Annealing allows the algorithm to "dislodge" itself if it gets stuck in a local maximum. This mechanism allows the algorithm to change its state to a neighbor that is worse than the current state, which is how it can escape from local maxima.

Pseudocode:

    function Simulated-Annealing(problem, max):
        current = initial state of problem
        for t = 1 to max:
            T = Temperature(t)
            neighbor = random neighbor of current
            ΔE = how much better neighbor is than current
            if ΔE > 0:
                current = neighbor
            else, with probability e^(ΔE/T):
                current = neighbor
        return current

Traveling Salesman Problem - The task is to connect all points while choosing the shortest possible distance.

Linear programming is a family of problems that optimize a linear equation (an equation of the form y = ax₁ + bx₂ + …). A linear program has the following components:
- A cost function that we want to minimize: c₁x₁ + c₂x₂ + … + cₙxₙ. Here, each xᵢ is a variable associated with some cost cᵢ.
- Constraints represented as a sum of variables that is either less than or equal to a value (a₁x₁ + a₂x₂ + … + aₙxₙ ≤ b) or precisely equal to that value (a₁x₁ + a₂x₂ + … + aₙxₙ = b). Here, each xᵢ is a variable, aᵢ is some resource associated with it, and b is how much of the resource we can dedicate to this problem.
- Individual bounds on variables (for example, that a variable can't be negative) of the form lᵢ ≤ xᵢ ≤ uᵢ.
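A small sketch using scipy.optimize.linprog (choosing SciPy is an assumption; the cost and constraint numbers are invented for illustration):

    # Linear programming with scipy.optimize.linprog (using SciPy is an
    # assumption; the cost and constraint numbers are invented).
    # Minimize 50*x1 + 80*x2
    #   subject to  5*x1 +  2*x2 <= 20   (a resource limit)
    #          and 10*x1 + 12*x2 >= 90   (a production requirement)
    from scipy.optimize import linprog

    result = linprog(
        c=[50, 80],                    # cost function coefficients
        A_ub=[[5, 2], [-10, -12]],     # ">=" rewritten as "<=" by negating
        b_ub=[20, -90],
        bounds=[(0, None), (0, None)]  # x1 and x2 cannot be negative
    )
    print(result.x)    # x1 = 1.5, x2 = 6.25
    print(result.fun)  # 575.0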
Constraint Satisfaction Problems are a class of problems where variables need to be assigned values while satisfying some conditions. Constraint satisfaction problems have the following properties:
- A set of variables (x₁, x₂, …, xₙ)
- A set of domains for each variable {D₁, D₂, …, Dₙ}
- A set of constraints C

A Hard Constraint is a constraint that must be satisfied in a correct solution. A Soft Constraint is a constraint that expresses which solutions are preferred over others. A Unary Constraint is a constraint that involves only one variable; in an exam-scheduling example, a unary constraint would be saying that course A can't have an exam on Monday: {A ≠ Monday}. A Binary Constraint is a constraint that involves two variables; for example, saying that two courses can't have exams on the same day: {A ≠ B}.

Node consistency is when all the values in a variable's domain satisfy the variable's unary constraints. Arc consistency is when all the values in a variable's domain satisfy the variable's binary constraints (note that we are now using "arc" to refer to what we previously referred to as an "edge"). Making one variable X arc-consistent with respect to another variable Y uses the Revise algorithm:

    function Revise(csp, X, Y):
        revised = false
        for x in X.domain:
            if no y in Y.domain satisfies constraint for (X, Y):
                delete x from X.domain
                revised = true
        return revised

Often we are interested in making the whole problem arc-consistent, and not just one variable with respect to another. In this case, we use an algorithm called AC-3, which uses Revise:

    function AC-3(csp):
        queue = all arcs in csp
        while queue non-empty:
            (X, Y) = Dequeue(queue)
            if Revise(csp, X, Y):
                if size of X.domain == 0:
                    return false
                for each Z in X.neighbors - {Y}:
                    Enqueue(queue, (Z, X))
        return true

A constraint satisfaction problem can be seen as a search problem:
- Initial state: empty assignment (no variables have values assigned to them).
- Actions: add a {variable = value} to the assignment; that is, give some variable a value.
- Transition model: shows how adding an assignment changes the assignment. There is not much depth to this: the transition model returns the state that includes the assignment following the latest action.
- Goal test: check if all variables are assigned a value and all constraints are satisfied.
- Path cost function: all paths have the same cost. As mentioned earlier, as opposed to typical search problems, optimization problems care about the solution and not the route to the solution.

Backtracking search is a type of search algorithm that takes into account the structure of a constraint satisfaction search problem. In general, it is a recursive function that attempts to continue assigning values as long as they satisfy the constraints (see the runnable sketch at the end of this section):

    function Backtrack(assignment, csp):
        if assignment complete:
            return assignment
        var = Select-Unassigned-Var(assignment, csp)
        for value in Domain-Values(var, assignment, csp):
            if value consistent with assignment:
                add {var = value} to assignment
                result = Backtrack(assignment, csp)
                if result ≠ failure:
                    return result
                remove {var = value} from assignment
        return failure

By interleaving backtracking search with inference (enforcing arc consistency), we can arrive at a more efficient algorithm, called the Maintaining Arc-Consistency algorithm. This algorithm enforces arc-consistency after every new assignment of the backtracking search:

    function Backtrack(assignment, csp):
        if assignment complete:
            return assignment
        var = Select-Unassigned-Var(assignment, csp)
        for value in Domain-Values(var, assignment, csp):
            if value consistent with assignment:
                add {var = value} to assignment
                inferences = Inference(assignment, csp)
                if inferences ≠ failure:
                    add inferences to assignment
                result = Backtrack(assignment, csp)
                if result ≠ failure:
                    return result
                remove {var = value} and inferences from assignment
        return failure

There are additional heuristics that make the algorithm more efficient:
- Minimum Remaining Values (MRV): select the unassigned variable with the fewest remaining values in its domain.
- Degree heuristic: rely on the degrees of variables, where a degree is how many arcs connect a variable to other variables, and select the variable with the highest degree.
- Least Constraining Values: select the value that will constrain the fewest other variables.

To summarize, optimization problems can be formulated in multiple ways. Today we considered local search, linear programming, and constraint satisfaction.
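The runnable sketch referenced above: a minimal backtracking solver for a toy three-course exam-scheduling CSP (the instance, variable names, and constraints are invented for illustration):

    # Minimal backtracking search for a toy exam-scheduling CSP
    # (the instance, variable names, and constraints are invented).
    DOMAINS = {"A": ["Mon", "Tue", "Wed"],
               "B": ["Mon", "Tue", "Wed"],
               "C": ["Mon", "Tue", "Wed"]}
    # Binary constraints: these pairs of courses share students,
    # so they cannot have exams on the same day.
    CONSTRAINTS = [("A", "B"), ("B", "C"), ("A", "C")]

    def consistent(var, value, assignment):
        # A value is consistent if no constrained neighbor already has it.
        for x, y in CONSTRAINTS:
            if var == x and assignment.get(y) == value:
                return False
            if var == y and assignment.get(x) == value:
                return False
        return True

    def backtrack(assignment):
        if len(assignment) == len(DOMAINS):
            return assignment                 # every variable is assigned
        var = next(v for v in DOMAINS if v not in assignment)
        for value in DOMAINS[var]:
            if consistent(var, value, assignment):
                assignment[var] = value
                result = backtrack(assignment)
                if result is not None:
                    return result
                del assignment[var]           # undo and try the next value
        return None                           # failure

    print(backtrack({}))  # {'A': 'Mon', 'B': 'Tue', 'C': 'Wed'}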