DS 340 Final Study Guide PDF
Document Details
Uploaded by Deleted User
Tags
Summary
This document is a study guide for a final exam in a computer science course, specifically covering the fundamental concepts of Artificial Intelligence (AI). It includes topics on AI history, key algorithms like machine learning and search, and probabilistic reasoning.
Full Transcript
Lecture 1: Intro to AI ---------------------- **What is AI?** **Machine learning** is the study of algorithms that adapt to data - originally a subfield of AI Most AI can be categorized as either **machine learning** or **search** (or both) - if it worlds in the real world, it probably needs ML...
Lecture 1: Intro to AI ---------------------- **What is AI?** **Machine learning** is the study of algorithms that adapt to data - originally a subfield of AI Most AI can be categorized as either **machine learning** or **search** (or both) - if it worlds in the real world, it probably needs ML Some modern applications of AI: - - - - - - - - Some common ideas run through applications, but there are also specific algorithms for specific problems **AI Prehistory** 60 AD: Heron of Alexandria - primitive "robots" 1843: Ada Lovelace - first computer scientist, suggests automatic music composition - 1941: First Turing-complete computer 1943: McColluch & Pitt's first artificial neurons **Birth of AI** 1948: Turing's "Turochamp" plays chess using minimax search... on paper 1950: Turing argues machines will pass for human 1954: Death of Turing 1956: Dartmouth Conference coins term "Artificial Intelligence" **Logistic Rules** 1957-1973 *Assorted systems based on logical reasoning* 1968: A\* for Shakey 1969: Book *Perceptrons* kills interest in neural networks 1972: SHRDLU **Blooming in AI Winter** 1973-1995 1973: DARPA funding cuts signal, first "AI Winter" 1986: Neural networks revised, needed just one more layer 1988: Probabilistic reasoning becomes popular 1995: Vapnik heralds the invasion of CS theory in ML **Multiple Approaches Flourish** 1996-2005 1997: IBM's Deep Blue beats famous chess player (No machine learning) 1998: LeNet demonstrates \> 1 hidden NN layer useful 2005: Thrun wins DARPA Grand Challenge using probabilistic reasoning **Big Money and Big Data** 2006-2012 2006: Google unveils google translate, uses probabilistic reasoning (in 2016 switched to NN) 2011: IBM Watson wins Jeopardy with "kitchen sink" AI approach 2012: AlexNet crushes ImageNet competition with neural network **Neural Networks Ascendant** 2013-Present 2013: DeepMind shows NN can learn to play several Atari games 2016: Deep NN beats Go champ Sedol, gets even better through self-play 2017-present: "Transformer" NN architecture enables high quality gen of text **Outline of Major AI Topics** If you want an AI to learn to do something, you need a **supervised machine learning algorithm** **Deep neural networks** are the single most powerful way to do that If you want the AI to act continuously over time, like piloting a video game or robot, you want **reinforcement learning** If you want AI to think about a sequence of actions, use **search** Lecture 2: Probabilistic Reasoning ---------------------------------- **What is a probability?** A number between \[0,1\] that represents a **degree of belief** that a proposition is true or a random variable has a particular value Neural networks used for classification produce outputs that can be interpreted as probabilities Probabilities appear in generative AI - the goal is to create a model that represents a distribution that we can sample to create something new - **Axioms of probability** - - - - **Random Variables** - - - **Joint distributions** are the combination of random variable outcomes In ML, use the join distribution of dataset features **Expectations** of random variables tell us what to expect if we were to sample many times and average **Variance** measures how much of a random variable deviates from expectation **Continuous random variables** cannot use a join distribution, so we use a probability density functions **Conditional Probability** - - Machine learning is interested in calculating P(T \| F~1~ ៱ F~2~ ៱... F~n~) **Conditional Independence** If P(B ៱ C \| A) = P(B \| A) P(C \| A), once we know the value of A, we can treat B and C as independent events **Bayes' Rule** We may know rules that tell us how often causes have effects - We want to reason backwards from effects to causes - P(flu \| cough) P(hypothesis \| evidence) ∝ P(evidence \| hypothesis) P(hypothesis) P(A \| B) = P(B \| A) P(A) / P(B) **Naive Bayes** - - - - - - - Downsides to Bayes: - - **Summary:** - - - - - Lecture 3: Local Search and Gradient Descent -------------------------------------------- Review Session: How does hill climbing work? - What does the temperature control in simulated annealing? - Why do very few ML people like genetic algorithms anymore? - Why is gradient descent faster than hill climbing in high dimensional spaces? - What's an advantage and a disadvantage of doing gradient descent in a big batch? - **Search:** Trying to find a good sequence of moves to solve a problem **[Local Search]** A search where we don't need to remember the path - we just need to find a goal or just the best state possible Often, a state consists of a set of parameter setting, and a move consists of changing to a similar state (neighbor) **Utility/Fitness -** the score you obtain when you want to maximize something **Loss -** when you are trying to minimize something **Loss surfaces -** The landscape of the loss function - - **[Local Search Methods]** **Hill Climbing** - - - - - - - - - - **Random Restart** - - - **Simulated Annealing** - - - - - - **Genetic Algorithms** - - - - - - - - - **Local Beam Search** - - - - - - - - - **[Gradient Descent]** - - - - WIthin Neural Networks: - - - **Batches** Types of gradient descent: - - - - - - Advantages/Disadvantages: - - - - - - - - - **Convergence** When should we stop updating? - - - Gradient Descent is optimal when the function is **convex** **Momentum** - - - - **Other Key Optimizers** - - - - - - - - **Summary:** - - - - - Lecture 4: Machine Learning / Decision Trees -------------------------------------------- Review Session: What are precision and recall? - How can I get a really high recall at the expense of precision? - Suppose someone claimed decision tree learning produces the optimal tree given the training data (optimal = best test performance). Is that plausible? - What's the entropy of many flips of a fair coin? - Of many rolls of a four-sided die? - **Supervised machine learning** - - - **Unsupervised machine learning** - - **Reinforcement learning** - **Self-supervised learning** - **Semi-supervised learning** - - **Model Training** - - - - **Precision** - the % of positive classifications that were actually positive **Recall** - the % of positive labels that were actually found There is tradeoff between precision and recall, so we use the F-score to capture the balance 2 (precision \* recall) / (precision + recall) This value is higher when the two are close together **Area Under the Curve** is another way of rewarding good balance **[Decision Trees]** - - **Entropy -∑***p~i~* log~2~ *p~i~* where *p~i~* is the probability of a symbol Ideal entropy = 0 - - **Expected Entropy** - - **Decision Stumps** - - **Decision Tree Process** - - - - - - - **Overfitting** - - To combat overfitting... - - Lecture 5: Ensemble Methods --------------------------- Review Session: Describe how bagging works in a nutshell - What are some differences between random forests and AdaBoost? - - How does gradient boosting roughly work (for regression with squared error loss?) - What's stacking? - **K-Nearest Neighbors** - - - **[Ensemble Methods]** - - - **Bagging** - - - - - - - **Random Forests** - - - - **Feature Importance** - - - **Boosting** - - - - - - **Gradient Boosting** - - - - - - **Stacking** - - - - **Summary:** - - - - - - - Lecture 6: Support Vector Machines ---------------------------------- Review Session: What do support vectors do in support vector machines? - What\'s the name for the kind of optimization problem an SVM solves? - What kind of boundary do you get if you don't employ the kernel trick? - Name a popular kernel function for SVMs - **[Support Vector Machines]** - - - - Visual representation of classification ![](media/image15.png) - - - - - The function to classify examples is ![](media/image20.png) - - - Width of the margin = - What if the points are not linearly separable? **Soft Margins** - - - - - **The Kernel Trick** - - - **Polynomial Kernels** **Radial Basis Function Kernels** - - - - **How to choose your kernel:** - - - **Advantages:** - - - - **Summary:** - - - Lecture 7: Principles in Machine Learning ----------------------------------------- Review Session: Occam's razor says that if two models describe the data equally well, we should prefer the one that is... - High bias implies that a model\'s predictions are often far from what? - High variance implies that a model's predictions are often far from what? - What method have we seen that specifically sets out to reduce variance by averaging the output of multiple models? - What\'s a term for a low-dimensional space that is embedded within a higher dimensional space - thought to describe natural data well? - **Occam's Razor** - - - - Applied to Bayes' Rule: - - - MDL Version - - - **Regularization** - - L~1~ regularization: - - L~2~ regularization: - - - L~1~ regularization tends to **zero parameters more** than L~2~ - - **Bias/Variance** - - - A simple model that **underfits** data may have **high bias** but **low variance** - - A complex model can capture the right model but risks **overfitting** - **low bias** and **high variance** - Examples: **Averaging:** Say in a regression model, we just return the average of the targets. The bias would be very high since our model is not close to the optimal solution, but the variance would be very low since we return the average every time. This model **underfits** **High Variance:** Say there is a model that fits a polynomial to a very high degree. Varying the training data slightly will result in very different predictions, resulting in **high variance**. But, it might not be based if averaging over all models tends to produce the correct predictions. This model **overfits** **The Manifold Hypothesis** - - - - **What makes Neural Networks powerful?** - - - **Summary:** - - - - Lecture 8: Neural Networks Building Blocks ------------------------------------------ Review Session: What's an example of a classification problem that a single perception can't learn to solve? - If a perceptron lacks a bias weight, what can't it do? - If I say "backpropagation is basically gradient descent", what does that even mean? - - How many matrices are needed to describe a single hidden-layer neural network? - **The Perceptron:** the basic neuron, a linear sum compared to a threshold - How to make it learn: - - - - - - - - - - **Logistic Regression** - - ![](media/image14.png) - - **Multilayer Perceptron** - - - - - - - - **Backpropagation** - - - - Lecture 9&10: Deep Neural Networks ---------------------------------- Review Session: What's a ReLU function, and what problem does it try to solve? - What's data augmentation? - Is a 2D convolution a linear transformation? - How is cross-entropy different from entropy? - **Overview:** - - **The Vanishing Gradient** - - ReLU - - - - Residual Connections - - - **Invariance** Data Augmentation - - - - - - - - Convolutional Layers - 2D Convolutions: replaces every pixel with a weighted sum of itself and nearby pixels - - - - - Tensors: - - - Pooling layers - - - **Loss** Squared Error: - - Cross-Entropy: - - - - - Binary Cross-Entropy: - - - Categorical Cross-Entropy - **Regularization** - Weight Decay - - - - Dropout Layers - - - - - - - - **Borrowing** Transfer Learning - - - - - **How to Create a NN** Chollet's suggested steps 1. 2. 3. a. i. b. c. 4. d. 5. 6. **Tricks with Backpropagation** Deep Dreaming - - - Style Transfer - - - **Summary:** - - - - - - Lecture 11: RNNs and LSTMs -------------------------- Review Session: What kind of problem is an RNN or LSTM supposed to solve? - What are some weaknesses of a plain RNN that an LSTM corrects? - What are the three gates in an LSTM, and what do they do? - - - **Applications** - - **[RNNS]** **RNN Structure** - - - - **RNN Weight Sharing** - - ![](media/image13.png) **Backpropagation** **Through Time** - - **Vanishing and Exploding Gradient** - - - - **Training RNNs** - - - ![](media/image5.png) - - **[LSTMs]** **Benefits** - - - - **Applications** - - - - - **Memory Cells** - - - - - **LSTM Gates** - - - - - - - - - **LSTM Diagram** - **Vanishing Gradient** - - - - - **Predicting more than one timestep** - - - - - - - - **Summary:** - - - - - Midterm Review -------------- **Neural Network Architecture:** Input layer: - - Hidden layer: - - - - - - - - - - - - - - - - - - - - - - Output layer: - - - - - **Loss Functions** Regression: - - Binary classification: - Multi-class classification - - SVM - **Optimizers** Gradient descent - Stochastic gradient descent - Momentum - RMSProp - Adam - **Backpropagation** The loss is propagated backward through the network. During the process, the gradients **Parameter Update** Weights and biases are updated by gradients during backpropagation. The learning rate adjusts how much they are updated To figure out weights for a filter:\ Number of channels \* (x\*y) + 1 , where x\*y is the filter dimensions Lecture 12: Deep NLP -------------------- - - - - - **Embeddings** - - - - ![](media/image19.png) **Word2vec (2013)** - - - - **Sequence to Sequence Architecture** - - **Attention:** - - - - - **Transformer Architecture** - - - - - - - - - Encoder: - Decoder: - - BERT - - - - - Training: - - - - - **GPT Transformers** - - **Summary** - - - - Lecture 13-15: Markov Decision Processes and Reinforcement Learning ------------------------------------------------------------------- MDPs provide a framework for sequential decision making and reinforcement learning **Utility** - - - **Time and Temporal Discounting** - - - **[Markov Decision Processes]** A **policy** is a function from state to action (what to do in every scenario) An **MDP** is a problem that we want to find a policy for - - - - - **Bellman Equation** - - - - - - - - - **Value Iteration** - - - **Policy Iteration** - - - - - **POMDPs** - - - - - **[Reinforcement Learning]** **Overview** - - - - - - - **Approach 1: Adaptive Dynamic Programming** - - - - - - - **Approach 2: Temporal Difference Learning** - - - - - - - How do we decide where to move? - - - - ε-Greedy Exploration - - - - Using an exploration function - - - TD-Learning needs a world model to make decisions - - - - **Approach 3: Q-Learning** - - - - **SARSA** - - - - **Deep Q-Learning** - - - - - Catastrophic forgetting: - - - - - - - - **Approach 4: Policy Gradients** - - - **Summary** - - - - - - - - - - Lecture 16: Good Old-Fashioned AI --------------------------------- What is intelligence? - - - - - - - - - **Search and Planning** - - - - - - Core Search Problem: Solving a Maze - - - - - - - Uninformed Search: What's the information in a state? What are the neighbors? What is the upper bound on neighbors (branching factor)? - - - - - - - - - - **Logical Reasoning as Search** - - - - - ![](media/image1.png) Better Logical Reasoning can also be search - - - - **Predicates** - - - - - - - - Logical Proof Types - - - - - - - - - Classical Planning - - - - - - - - Meaning of Language - - - - - - **Downside to GOFAI** - - - - - - - - **A\* Search** - - - - - - - - - - - - - - - - - - - - - - - - - - - Why A\*? - - - - - **Summary** - - - - - - - Lecture 17: Adversarial Search: Minimax and MCTS ------------------------------------------------ **Zero Sum Games** - - **Minimax** MAX and MIN - - - - - - - - - - Evaluation Functions - - - - - - - - - - - Alpha-Beta Pruning - - - - - **Monte Carlo Tree Search** - - - - - - - - - - - - - - - - - **Summary** - - - - - Lecture 18: Bayesian Networks ----------------------------- Motivating Domains: - - - - - - - - **Recall: Naive Bayes is Naive** - - - - **Bayesian Networks** A tool for probabilistic reasoning that goes beyond Naive Bayes - - - - - ![](media/image23.png) **Inference by Enumeration** - - Marginalization - - Conditioning - - Explaining Away - - - Naive Bayes as Bayesian Network Special Case - - - **Sampling Methods** - - - - - - **Markov Blankets** - - - - - ![](media/image22.png) **Markov Chain Monte Carlo (approximate inference)** - - - - - - - - **Continuous Values** - - - Good priors for unknowns: Gaussian and Uniform - - **Conjugate Priors** - - - **Use Cases** - - - - - - - - - **The Catch** - - - **Strengths** - - - - - - Lecture 19: Ethics ------------------ Unclear Questions: - - - - - **Utilitarianism** - - - - **Humanism** - **Categorical Imperatives (Kant)** - **Contractualism** - **Virtue Ethics** - **Divine Command Theory** - **Consequentialism** - - - - - - - - - - - - **Fairness** - - - - - - - - - - - - **Conclusion** - - - - - Lecture 20: Recommender Systems ------------------------------- Common use cases: - - Efficacy - - - **Content-Based Filtering** - - - Items - - - - Users - - Tradeoffs - - - - - **Collaborative Filtering** - - - Embeddings - - Matrix - - - - - - Hidden Features - - - Two approaches to collaborative filtering: Matrix Factorization and Deep Neural Networks **Matrix Factorization** To get hidden features: - - - - To find embeddings: - - How to Solve for U and V - - - - - Weighted Matrix Factorization - - - Pros: - - Cons: - - Overall, Matrix Factorization allows us to take user history and make predictions for missing entries from that. But, there is additional information that might be relevant **Deep Neural Networks** - - - - - - - - - - - - - - - - - - Negative Sampling: - Special constraints: - - - - Softmax Pros and Cons: - - - - - - - - Dealing with Scale: - - Features for Candidate Generation - - - - \*More layers are good, more features are better Ranking: - - - - Hiding Information - - - - - - - **Summary** - - - - - - - - - - - - Lecture 21: Variational Autoencoders and GANs --------------------------------------------- Goal: Realistic faces & other images - - - - **Variational Autoencoders** Have two parts: An encoder and a decoder Encoder network - - Decoder network - - Ex) Digits - - - - Loss function - - - Overview - - - - - - Loss Function - KL Divergence - - - - - Ex) CelebFaces Attributes Dataset - - - - - - - - - - - - - - - - - - - - **GANs** Overview - - - - - - - - Training - - - - - - - Challenges - - - - - BigGAN (DeepMind, 2018) - - - - - **Summary** - - - - - Lecture 22: Diffusion and Generative Art ----------------------------------------