AI Exam Answers PDF
Document Details
Uploaded by Deleted User
2020
Tags
Summary
This document contains answers to possible questions from an AI exam taken on June 1, 2020. It covers various topics related to Artificial Intelligence, including historical figures, concepts, and tasks. The document also presents different algorithms employed in the field.
Full Transcript
Answers to Possible Questions for the AI Exam June 1, 2020 AI in general 1. What did John McCarthy NOT do? John McCarthy served on the board of the ALGOL 60 committee, proposed if/then/else and recursion (Page 10) John McCarthy, father of A...
Answers to Possible Questions for the AI Exam June 1, 2020 AI in general 1. What did John McCarthy NOT do? John McCarthy served on the board of the ALGOL 60 committee, proposed if/then/else and recursion (Page 10) John McCarthy, father of AI, developer of Lisp (Page 15) He coined the name for the field: “Artificial Intelligence” (Page 14) 2. Which of these statements is true about the Turing test and the Chinese room argument? 1950: Computing Machinery and Intelligence - Turing test (Page 13) Kasparov suggested that human chess players intervened → Deep Blue passed the “chess Turing test” (Page 29) John Searle: Chinese room argument (Page 9) Nobody supposes that the computational model of rainstorms in London will leave us all wet. 3.Which of these tasks was NOT solved much better by Deep Learning than previous algorithms? Tasks solved better by Deep learning are: Image and object recognition, voice generation and recognition. It doesn’t work so well with small data Deep Learning in practice is hard and expensive Deep networks are not easily interpreted. https://towardsdatascience.com/three-reasons-that-you-should-not-use-deep-learning- 15bec517b622. (Page 25) 4. What is NOT crucial for deep learning algorithms? One of the main goals is self-driving cars (Page 25) Deep learning achieves recognition accuracy at higher levels than ever before. This helps consumer electronics meet user expectations, and it is crucial for safety- critical applications. Deep learning requires large amounts of labeled data. Deep learning requires substantial computing power. 5. Which of these advances in AI that are used extensively in software technology today were NOT invented by John McCarthy’s lab? Ray Solomonoff: the inventor of algorithmic probability and founder of algorithmic information theory Claude Shannon: the father of information theory W. S. McCulloch: first mathematical model of neural networks with Pitts Nat Rochester: wrote the first assemble David Sayre: partly developed FORTRAN (Page 16) 6. Finish the sentence: Nobody supposes that the computational model of rainstorms in London Nobody supposes that the computational model of rainstorms in London ——will leave us all wet. (Page 9) 7. Which theory says that our minds are in fact computer programs? Computational theory of mind: our minds are computer programs (Page 9) 8. Who was NOT present at the Dartmouth Summer Research Project on Artificial Intelligence? Organized by John McCarthy. Attendees: Ray Solomonoff Marvin Minsky Claude Shannon John Nash W. S. McCulloch Arthur Samuel Nat Rochester David Sayre Herbert Simon. (Page 16) 9. What was the name of the world’s first chatterbot? 1966 – ELIZA: the first chatterbot, could fool some people into believing it was human (Page 19) 10. What was NOT one of the problems with AI identified in the Lighthill report? 1973: Lighthill report: utter failure to achieve the grandiose objectives, dismantling of AI research in the UK (Page 21) Problems: Not enough computational power: in some NLP applications, 20 words would fit into the memory Commonsense knowledge and reasoning, the knowledge acquisition bottleneck Moravec’s paradox: “it is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility” 11. What is Moravec’s paradox? Moravec’s paradox: “it is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility” 12. Who was not awarded the Turing prize despite being a significant contributor to deep learning? Yoshua Bengio, Geoffrey Hinton and Yann LeCun are awarded the Turing prize (Page 25) Herbert Simon: Nobel and Turing Prize in economics and Turing award (Page 16) Search 1. Which of these is not a search algorithm? Search algorithms: Backtrack, Local search, Graph search (BFS, DFS, A*), Adversarial search (min-max, alpha-beta), Evolutionary algorithms. Also: resolution, rule-based reasoning. 2. How can we NOT reduce the complexity of a state space? Reducing complexity of state space: https://www.hackerearth.com/practice/algorithms/dynamic-programming/state-space- reduction/tutorial/ Concept of state-space model ❑State-space: set of states, where one state is a collection of values belonging to the data (objects) that are needed to describe the problem − the state-space can be defined as a subset of a base-set by a so-called invariant statement. ❑Operators: map from the state-space to the state-space − step from a state to another state − defined with its precondition and effect ❑ Initial state(s) or its description (initial condition) ❑Final state(s) or its description (goal condition) Graph-representation of state-space model ❑State-space model (state, effect of an operator on a state, cost of an operator, initial state, final state) State-graph node(node, directed arc, cost of arc, start node, goal node) ❑Graph- representation: state-graph, start node, goal nodes ▪ sequence of operators directed path ▪ solution directed path from start to goal State-space vs. problem space ❑The elements of the problem space can be symbolized with the paths driving from the start node in the state-graph. ❑There is a very close relationship between the state-space and the problem space, but the state-space is not identical to the problem space. In many cases (just in the Hanoi tower problem) the elements of the problem space are not the states (nodes) but the sequences of operators (paths driving from the start node) and some of them are the solutions. Sometimes the solution might be only one state but a sequence of the operators (path) is needed to achieve it. Reduce the problem space ❑A problem may have several models: the best model = the smallest problem space o In the previous representation the size of the problem space is huge. o Expand the state space with the states containing less queens than n, and use a new operator: put a new queen on the board (the initial state is the empty board). The state-graph can be further reduced with limiting the precondition of the new operator (decreasing the branching factor): - Put the queens on the board row by row. - A new queen is never put on the board if it would be under attack. 3.What does the complexity of a representation graph NOT depend on? 4. Which of these is NOT true of a state space graph? Number of paths driving from the start depends on: number of nodes and arcs, branching factor(average number of outgoing arcs), frequency of the cycles and diversity of their length. 5. In which of these problems is the problem space NOT the same as paths of the representation graph starting from the start node? ❑The elements of the problem space can be symbolized with the paths driving from the start node in the state-graph. ❑There is a very close relationship between the state-space and the problem space, but the state-space is not identical to the problem space. In many cases (just in the Hanoi tower problem) the elements of the problem space are not the states (nodes) but the sequences of operators (paths driving from the start node) and some of them are the solutions. Sometimes the solution might be only one state but a sequence of the operators (path) is needed to achieve it. 6.Which of these is NOT true of a delta-graph? Delta-graph: Using an appropriate model, the problem space of a pathfinding problem can be treated as an arc-weighted, directed graph, where the items of the problem space are represented by either nodes or paths. This graph might be infinite but the number of the outgoing arcs of each node is always finite, and there is a positive constant lower bound (δ) on the cost of the edges (δ-graph). ❑In order to solve these problems either a special goal node or a path driving from a start node to any goal node must be found. (Sometimes the optimal path is needed.) 7. Which of these algorithms use a tentative control strategy? Tentative control strategy: backtracking, graph-search, rule-based reasoning. 8. Which of these algorthms use an irrevocable control strategy? Irrevocable control strategy: local search, evolutionary algorithm, resolution. 9. Which of these is a general control strategy? General control strategy: irrevocable and tentative. 10. Can we think of the hill climbing method as a special case of tabu search? Disadvantages: o It can rarely find the goal without a strong heuristics because after a wrong decision it can lose itself or even stick in a dead end several current nodes -> local beam search several attempts -> random-restart search give up the greedy strategy - > simulated annealing o It can lose track around a local optimum or on an equidistant surface of the evaluation function (where neighboring nodes have identical values) if there are cycles in the representation graph (that cannot be recognized). recognize smaller cycles -> tabu search 11. In how many places does simulated annealing use randomness? Instead of selecting the best child of the current node, the new node is picked randomly from among the children of the current node. ❑If the value of this new node is not worse than the value of the current node (f(new) ≤ f(current)), then the new node is accepted as the newer current one. ❑Otherwise (f(new) > f(current)), the probability of the acceptance of the new node is inversely proportional to the difference of the values of the new and the current node 12. Which of these is a drawback of the tabu search? Drawbacks of tabu search: the size of the tabu set can be set only a posteriori, without a strong heuristics it can rarely find the goal, after wrong decisions it can lose itself or even stick in a dead end. 13. Which of these is FALSE for local search algorithms? Local search algorithm The global workspace of a local search contains only one (current) node of the representation graph with its small environment. Initially this current node is the start node. The search stops if the current node is a goal node or the search could not take the next step. ❑In each step the current node is exchanged for its better child by a searching rule. The control strategy uses an evaluation (objective, fitness, heuristic) function to select a better child node. This function tries to estimate to what extent a node promises the achievement of the goal. This function involves some heuristics. 14. Which of these is NOT a drawback of the hill climbing algorithm? 15. Which of these algorithms was NOT invented to avoid hill climbing getting stuck in a dead end? Tabu search 16. What does the global workspace of backtracking search contain? Contains one path from the start node to the current node with all untested outgoing arcs from the nodes of this path initially this path contains only the start node the search terminates: either the current node is the goal, or the outgoing arcs of the start node are completely tested 17. What are the search rules of backtracking search? append a new untested outgoing arc driving from the current node to the end of the current path remove the last arc of the current path (backtrack) 18. What is the control strategy of backtracking search? Control strategy: applying the backtracking in last case 19. Which of these is NOT true about the first version of the backtracking search (BT1)? The first version of the backtracking algorithm (BT1) observes only the first two conditions of the backtracking: “dead end” and “checked crossroads”. In a finite acyclic directed graph, the BT1 always terminates, and if there exists a solution path, then it finds one. It can be implemented with a recursive procedure – Starting: solution: = BT1(start) 20. Which of these statements is NOT true about the second version of the backtracking search (BT2)? The second version of backtracking (BT2) implements all conditions of the backtracking step. In δ- graphs the BT2 always terminates, and if there exists a solution path shorter than the depth bound, then it finds a solution path. It can be implemented with a recursive procedure − Starting: solution := BT2() 21. Which of these statements is NOT true about the second version of the backtracking (BT2)? Same as above 22. Which of these is an advantage of backtracking search? always terminates, and finds solution (inside the depth bound) implementation is simple small memory 23. What does the global workspace of graph search contain? Global workspace: stores the discovered paths (the beginning part of all paths driving from the start node: this is the search graph) and separately records the last nodes of all discovered paths (they are called open nodes) initialvalue:startnode termination condition: a goal node must be expanded or there is no open node 24. What is the search rule of graph search? Searching rules: expand open nodes 25. What is the control strategy of graph search? – control strategy: selects an open node to be expanded based on an evaluation function (132) 26. What kind of nodes are the open nodes? the last nodes of all discovered paths (they are called open nodes) – set of open nodes (OPEN): the nodes that are waiting for their expansions because their successors are not known or not well-known (132) (133) 27. How do we call the subgraph we store in the global workspace of graph search? – search graph (G) : the subgraph of the representation graph that has been discovered the discovered paths (the beginning part of all paths driving from the start node: this is the search graph) (133) 28. What kind of nodes are the closed nodes? Already expanded (discovered) nodes, nodes of search graph(G) 29. What does the parent pointer function (pi) point to? : N → N parent pointer function start – (m) = one parent of m in G, (start) = nil o determines a spanning tree in G and helps to take the solution path out from G after successful termination o If only the (m) always showed an optimal path start→m in G when the node m is generated (135) 30. When is an evaluation function decreasing? An evaluation function f:OPEN→ℝis decreasing if for all nodes n (n∊N) the f(n) never increases but always decreases when a cheaper path has been found to the node n. For example the function g has got this property. (141) 31. When is a node of a search graph correct? The node m is correct if g(m) and (m) are consistent and optimal, i.e. g(m) = c (start, m) and c (start,m) = min {start⟶ m} G c (start, m). G is correct if its nodes are correct. (136) 32. Which of these statements is NOT true about the general graph search algorithm? See 33. (140) 33. Which of these statements is true about the general graph search? Each node is expanded only finite times in a -graph. The general graph-search always terminates in a finite -graph. The general graph-search finds a solution in a finite -graph if there exists a solution. (140) 34. Can we use order heuristic as a secondary control strategy in an uninformed graph search? Yes. The tie-breaking rules (secondary evaluation functions) may contain heuristics even in non- informed graph-search. (144) 35. Which of these is depth first search (f is the evaluation function, g is the cost function, c is the cost of an edge)? f = −g, c(n,m) = 1 no special property in infinite -graphs a depth bound is needed (147) 36.Which of these is breadth first search (f is the evaluation function, g is the cost function, c is the cost of an edge)? f = g, c(n,m) = 1 finds the shortest (not the cheapest) solution if there exists one even in infinite -graph each node is expanded at most once (147) 37. Which of these is uniform cost search (f is the evaluation function, g is the cost function, c is the cost of an edge) ? ( page 146 ) f=g 38.What does admissibility mean for a graph search? ( page 149 ) *Remarks 8 -puzzle : W, P are non-negative, admissible and monotone. Zero function is non-negative, admissible and monotone. If h is monotone and gives zero on goal, then it is admissible. 39.Which statement is NOT true about the constant 0 function? ( page 148-149 ) 0 (zero function) ~ fake heuristic function Zero function is non-negative, admissible and monotone 40.Which of these is the look-forward graph search (f is the evaluation function,g is the cost function, h is the heuristic, h-star is the optimal cost, c is the cost of an edge)? ( page 116-118, page 150 ) f=h 41.Which of these is the A algorithm (f is the evaluation function, g is the cost function, h is the heuristic, h-star is the optimal cost, c is the cost of an edge)? ( page 144, page 150 ) f=g+h , h≧0 42.Which of these is the A-star algorithm (f is the evaluation function, g is the cost function, h is the heuristic, h-star is the optimal cost, c is the cost of an edge) ? ( page 144, page 150 ) f=g+h, h≧0, h≦h* 43.Which of these is the A-c (consistent) algorithm (f is the evaluation function, g is the cost function, h is the heuristic, h-star is the optimal cost, c is the cost of an edge)? ( page 144, page 150 ) f=g+h, h≧0, h≦h*, h(n)-h(m) ≦c(n,m) 44Which of these is a property of the A algorithm? ( page 144, page 150 ) finds solution if there exists one (even in infinite δ -graph) 45Which of these is NOT true about the A-c (consistent) algorithm? ( page 144, page 150 ) Algorithm A-c properties: finds optimal solution if there exists one (even in infinite δ -graph) expands a node at most once 46. When do we say that a heuristic function is monotone? ( page 148, page 149 ) 47.Which of these statements is NOT true about breadth-first search? ( page 144,145,146,147 ) Breadth-first graph-search properties: finds the shortest (not the cheapest) solution if there exists one even in infinite δ -graph each node is expanded at most once 48. Which of these is true about uniform cost search? ( page 144,145,146,147 ) Uniform-cost graph-search properties: finds optimal (the cheapest) solution if there exists one even in infinite δ -graph each node is expanded at most once 49. Which of these was NOT true about the two-player games we have been examining in the course? The statements below ARE TRUE Two players take turns according to given rules until the game is over The game is in a fully observable environment, i.e., the players know completely what both players have done and can do Either the number of the possible steps in a current state or the length of the plays of the game are finite. Each step is unequivocal, its effect is predictable. The plays of the game do not depend on chance at all. The sum of the payoff values of the players at the end of the game is always zero. ( In special case player s Sometimes a draw is also possible.) 50. What does the state of a two-player games represent? state = configuration + player next to move 51. What is the winning strategy in a two-player game? The winning strategy shows how a player could win no matter what the opposite player does. A losing non-losing strategy (that guarantees at least a tie game) would be useful if a draw is possible 52. When do we cut in the alpha-beta algorithm? Cutting rule: if there are alpha and beta values on the currentpath so that alpha is greater than or equal to beta Alpha (lower limit on max value) Beta (higher limit on min value) 53. What is the stationary test for minimax search? The evaluation value of a node may be misleading if it significantly differs from the value of its parent node |f(parent node) - f(node)|>K is the stationary test 54. Which of these statements is NOT true about the game tree? The statements below ARE TRUE Both players have got their own AND/OR tree. The winning (or nonlosing) strategy of one player is a hyper path of his / her AND/OR tree that is driving from the root to winning goal nodes. The search of a winning strategy is a hyper path finding problem in an AND/OR tree. 55. Which of these is a step in the minimax algorithm? - 1.Construct the complete game tree 2. Evaluate scores for leaves using the evaluation function 3. Back-up scores from leaves to root, considering the player type: -For max player, select the child with the maximum score -For min player, select the child with the minimum score 4.At the root node, choose the node with max value and perform the corresponding move 56. What is the game tree? Game tree is a recursive search function that examines all possible moves of a strategy game, and their results, in an attempt to ascertain the optimal move. It is represented as a directed graph whose nodes are positions in a game and whose edges are moves 57. What is the general control strategy of evolutionary algorithms? The evolution is controlled by an irrevocable strategy where every time the population is updated with the mutated offspring of the individuals selected for reproduction the changes made will be irreversible. 58. What does the evolutionary algorithm store in its global workspace? The global workspace stores the discovered paths and separately records the last nodes of all the discovered paths 59. Which of these is NOT an evolutionary operator? Evolutionary operators: 1. Selection 2. Recombination 3. Mutation 4. Replacement 5. Termination condition 60. How do we code an individual? An individual is represented by a code(chromosome) that is most commonly a sequence of signals. One signal or one section of signals with its position in the code (gene) can describe one property of the individual. If one signal is changed, then one property of the individual is changed a bit, too. Frequent coding methods: − Array: fix length sequence of real numbers or integers − Binary code: fix length sequence of bits − Permutation of the elements of a finite set 61. How many steps does the evolutionary cycle consist of? Consists of 4 steps: Selection, Recombination(crossover), mutation and replacement. 62. Where can we incorporate randomness into the evolutionary algorithm? First an initial population is selected mostly at random. Selection -Tournament: the selected individuals are the best individuals of randomly selected groups of the population Culling: all individuals below a given threshold are discarded and then the individuals are selected randomly from the remaining individuals Recombination − Crossover: signals of the parent codes are exchanged at the positions chosen randomly Mutation - Each position of the code is subject to random change with a small independent probability (p). 63. Where do we use selection in the evolutionary algorithm? Selection: better individuals are selected for reproduction 64. What is a good selection algorithm in evolutionary algorithms? The better individuals must be selected but the worse ones must be given a chance to be chosen. (stochastic method) 65. What is the connection between crossover and recombination? Pairs of the selected individuals (parents) are bred in order to create their offspring 66. When does the evolutionary algorithm terminate? either a goal individual appears in the population or the overall fitness value of the population is not being changed 67. Which of these is not a strategy parameter of evolutionary algorithms? Settings of the strategy parameters size of the population, probability of mutation, rate of the offspring, rate of the replacement Answers to Possible Questions for the AI Exam June 1, 2020 Machine Learning 1. What does it mean for learning to be supervised? Supervised learning is the machine learning task where input variables X and an output variable y are given, and an algorithm needs to be used to learn the mapping function from the input to the output, i.e. y = f (X). The goal is to approximate the mapping function, such that when new input data is given, the output variables are predicted. 2. What does it mean for learning to be unsupervised? Unsupervised learning is the machine learning task where only input data X and no corre- sponding output variable y are given. The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about it. 3. What is an epoch? Epoch is a hyperparameter of the machine learning model that defines the number of times that the learning algorithm goes through the entire training dataset. 4. What is a mini-batch? Mini-batch is a portion of samples (data) which is used to train the model in a single iteration of a machine learning algorithm. 5. Why do we use separate training and test sets? Training dataset consists of the data used to fit the model, i.e. it is the actual dataset used to train the model (the model sees and learns from this data). But, in order to determine whether the model performs good, the test dataset is used (data on which model was not trained, previously unseen data). 6. Why do we use a validation set in addition to the training and test sets? Validation set is used to evaluate a given model, but compared to test dataset, it is used for frequent evaluation. It is used to fine-tune and update the model hyperparameters. Hence the model occasionally sees this data, but never does it learn from it. 7. What is a classification problem? Classification is a type of supervised machine learning where the output variable is a category (class, finite and discrete value). Classification model (classifier) learns how to assign a class label in order to predict a class for an input variable. 8. What are the hyperparameters of a learning algorithm? Hyperparameters are parameters of machine learning model whose values are set before the learning process begins (while the values of other parameters are derived in training time). 9. When do we use the sigmoid activation function? Sigmoid (logistic) activation function (φ(x) = 1+e1−x ) has a range of (0, 1) and is hence used for models where it is needed to predict the probability as an output, mostly in binary classification. 10. When do we use the softmax activation function? Softmax activation function is a generalized sigmoid activation function used for multi-class classification. 11. What is the definition of the ReLU activation function? ReLU (rectified linear unit) activation function is defined as R(x) = max(0, x). 12. When do we use the ReLu activation function? ReLU is used in artificial neural networks (deep learning) in order to escape the vanishing gradient problem. 13. When do we use the binary cross-entropy loss function? Binary cross-entropy is used in binary classification. 14. When do we use the categorical cross-entropy loss function? Categorical cross-entropy is used in multi-class classification. 15. What would be the activation function and loss for a binary classification problem? Sigmoid activation function and binary cross-entropy would be used for binary classification problem. 16. What would be the activation function and loss for a multi-class classification problem? Softmax activation function and categorical cross-entropy would be used for multi-class classification problem. 17. Which of these are stop-words? Stop-words are commonly used words which are filtered out before or after processing of natural language data (text). Typical examples include ’the’, ’a’, ’an’, ’in’, ’is’, ’are’, etc. 18. Which of these words were stemmed? Stemming is a process where words are reduced to a root by removing inflection through dropping unnecessary characters, usually a suffix. Typical examples: occurred → occur, cooking → cook, etc. 19. What does a language model do? Language model is a model which assigns probability to a sequences of words (sentence). Formally: Let V be a finite vocabulary and V ∗ be the set of strings in P the language defined by V. The language model is a probability distribution p such that x∈V ∗ p(x) = 1 and ∀x ∈ V : p(x) ≥ 0. 20. What is the bag of words model? The bag-of-words model is a representation used in NLP, where a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity (counts how many times a word appears). 21. What is the difference between bag of words and TFIDF? Bag-of-words model just creates a set of vectors containing the count of word occurrences in the sentence/document, while the TF-IDF model contains information on the more im- portant words, and the less important words (measures relevance instead of frequency). 22. What kind of hyperplane is the Support Vector Machine (SVM) learning? SVMs choose hyperplane which represents the largest separation, or margin, between the two classes (used in binary classification). Hyperplane is chosen such that the distance from it to the nearest data point on each side is maximized (maximum-margin hyperplane). 23. What do we use the confusion matrix for? Confusion matrix is a summary of prediction results on a classification problem (supervised leaning), where the number of correct and incorrect predictions are summarized with count values and broken down by each class. 24. What is grid search? Why do we use it? Grid-search is used to find the optimal hyperparameters of a model which results in the most accurate predictions. It builds a model on each parameter combination possible (it iterates through every parameter combination and stores a model for each combination). 25. When would you use random search instead of grid search? Random search is used to find the optimal hyperparameters of a model by choosing random combinations of the hyperparameters (in contrast to grid-search, which chooses every com- bination). It is used with when datasets are large and data is high-dimensional (when grid search would be too expensive). 26. What would be the one-hot encoding of [1, 3, 0]? 1 0 1 0 0 one − hot − encoding( 3 ) = 0 0 0 1 0 1 0 0 0 27. What does a word embedding do? Word embedding is set of language modeling and feature learning techniques in NLP where words/phrases from the vocabulary are mapped to vectors of real numbers (additionally, words that have the same meaning have a similar representation). 28. What is an example of clustering? Clustering (unsupervised ML) is the task of dividing the data into a number of groups such that data points in the same groups are more similar to other data points in the same group, and dissimilar to the data points in other groups. Typical examples include use in recommendation engines, market segmentation, anomaly detection, etc. 29. What is the difference between hard and soft clustering? (slide 222) In hard clustering each data point is assigned to exactly one cluster (hard assignment), while in soft clustering each data point is assigned to all clusters, but with different weights/ probabilities (soft assignment). 30. What is NOT true of the k-means problem? (slide 227) k-means is a clustering algorithm which tries to partition the dataset into k predefined clusters (each cluster is represented by its centroid) by minimizing the squared distances of the points from the centroid of their cluster (equivalent to minimizing pairwise distances within clusters). It is an NP problem where only local optimum can be found (different initialization can lead to different clusters). 31. What are the two steps of the k-means algorithm? (slide 228) The two steps of the k-means algorithm are: assigning each data point to the nearest centroid, and computing new centroids. 32. What is NOT an issue with the k-means algorithm? (slides 236-239) Possible issues with k-means algorithm are: too large or too small k, bad initialization, the real clusters are not centroid based. 33. What does Latent Semantic Analysis do? (slides 242, 243) Latent semantic analysis (LSA) is a soft clustering technique used in NLP to analyze re- lationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. It uses SVD to construct a matrix containing word counts per document. 34. What is NOT a reason to use dimensionality reduction? (slide 247) Possible reasons for using dimensionality reduction are: data is low dimensional in a higher dimensional space, data visualization, noise reduction, to decrease the complexity of the learning problem, etc. 35. What is a principal component in Principal Components Analysis (PCA)? (slide 258) Principal components in PCA form the orthogonal basis (axes) to which data is transformed. 36. What is the relationship between Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)? (slide 262) PCA can be computed using SVD of a design matrix X. 37. What does an autoencoder do? (slides 266. 267) Autoencoder is a type of neural network used to learn efficient data codings in an unsuper- vised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal noise. 38. Why doesnt the autoencoder just do an identity transformation? Potential of autoencoders resides in their non-linearity, allowing the model to learn more powerful generalizations compared to PCA. Autoencoder with linear mapping (such as iden- tity transformation) is nearly equivalent to PCA. 39. How many matrices does Latent Semantic Analysis produce from its input matrix? (slide 243) 3, as a result of SVD: words-topics, topics-topics and topics-documents.