Cognitive Science Quiz PDF
Summary
This document is a cognitive science quiz covering linguistics, cognitive modules, and related ideas. Its multiple-choice questions test understanding of key concepts, including large language models and language processing.
Full Transcript
Question 1
In Large Language Models (LLMs), what do numerical vectors usually represent?
a. Individual words within a language
b. Complete documents or books
c. Units of meaning like sentences or paragraphs
d. Specific letters or sounds
The correct answer is: Individual words within a language

Question 2
What is the significance of Broca’s and Wernicke’s aphasia in cognitive science?
a. They suggest that language is processed in multiple areas of the brain simultaneously.
b. They show that language is processed by the same areas as reasoning and memory.
c. They provide evidence that language processing can be selectively impaired while other abilities remain intact.
d. They prove that language development is entirely dependent on environmental factors.
The correct answer is: They provide evidence that language processing can be selectively impaired while other abilities remain intact.

Question 3
Jerry Fodor’s concept of cognitive modules includes several features. Which of these is not one of them?
a. Constant interaction with all other modules
b. Operating independently from other processes
c. Ability to quickly process data
d. Specialization in specific types of information
The correct answer is: Constant interaction with all other modules

Question 4
According to Chomsky, what is unique about the human language faculty?
a. It is closely tied to general reasoning abilities.
b. It is a specialized system within the mind that functions separately from general intelligence.
c. It emerges purely from social interactions.
d. It develops automatically without any biological basis.
The correct answer is: It is a specialized system within the mind that functions separately from general intelligence.

Question 5
Why is the modularity hypothesis useful in cognitive science?
a. It demonstrates that all cognitive tasks are performed by a general-purpose processor.
b. It supports the idea that the mind functions as a single, unified system.
c. It explains how specialized modules can interfere with each other.
d. It helps break down complex cognitive tasks into specialized components, facilitating scientific study.
The correct answer is: It helps break down complex cognitive tasks into specialized components, facilitating scientific study.

Question 6
How many parameters are used in ChatGPT-4 to process language?
a. Over several hundred trillion
b. Less than a million
c. Approximately 86 billion
d. Hundreds of billions of parameters
The correct answer is: Hundreds of billions of parameters

Question 7
Cognitive Science is primarily focused on:
a. Studying the connections between human brains and computers
b. Mapping out the physical structures of the human brain
c. Exploring the role of neurons in decision-making
d. Understanding the software-like processes that represent human thinking
The correct answer is: Understanding the software-like processes that represent human thinking

Question 8
What do Chomsky’s “poverty of the stimulus” examples suggest?
a. Children need a wealth of linguistic input to learn a language.
b. Language development happens rapidly despite limited and imperfect input, suggesting that humans might have an innate capacity for language.
c. There is no need for a specific language faculty in human cognition.
d. Children can only learn language through direct teaching.
The correct answer is: Language development happens rapidly despite limited and imperfect input, suggesting that humans might have an innate capacity for language.

Question 9
What is a key feature of meaning holism, as championed by W.V.O. Quine?
a. Terms gain their meaning from the theories they are embedded in.
b. Every word in a language has a distinct, isolated meaning.
c. All sentences share the same meaning across different contexts.
d. Words have fixed meanings independent of context.
The correct answer is: Terms gain their meaning from the theories they are embedded in.

Question 10
The concept of recursive embedding in language allows for:
a. A fixed limit on how many sentences someone can understand.
b. The potential for mental rules of grammar to create an infinite number of sentences.
c. More efficient communication using shorter sentences.
d. A reduction in sentence structure for easier processing.
The correct answer is: The potential for mental rules of grammar to create an infinite number of sentences.

Question 11
In David Marr’s framework, which level investigates how a cognitive task is physically carried out by the brain or a system?
a. The functional level
b. The algorithmic level
c. The computational level
d. The implementation level
The correct answer is: The implementation level

Question 12
Why are LLMs considered to have "ungrounded" meanings?
a. They generate meaning without considering definitions or dependencies between words.
b. They learn about words and learn to use words without direct reference to any real-world objects.
c. They can build a rich vocabulary of abstract concepts, as well as a vocabulary that is connected to real world objects.
d. They rely entirely on interaction with the world through cameras and other perceptual mechanisms.
The correct answer is: They learn about words and learn to use words without direct reference to any real-world objects.

Question 1
What is the primary purpose of the Imitation Game in Turing’s argument?
a. To challenge machines to outperform humans in logical reasoning tasks and demonstrate superior intelligence.
b. To test if machines can accurately simulate human emotions and creativity in a variety of conversational settings.
c. To determine if machines can think in the same way humans do by comparing their brain functions.
d. To show that if a machine can imitate human responses well enough, we should consider it capable of “thinking” without requiring a formal definition of thinking.
The correct answer is: To show that if a machine can imitate human responses well enough, we should consider it capable of “thinking” without requiring a formal definition of thinking.

Question 2
In the Imitation Game, what is the role of the human participant who is not the interrogator, and what is their primary objective?
a. The human participant’s role is to compete with the machine in solving complex problems, and their objective is to demonstrate superior reasoning abilities.
b. The human must answer the interrogator’s questions honestly and naturally, with the goal of helping the interrogator recognize them as human, not the machine.
c. The human’s role is to mislead the interrogator by imitating the machine’s behavior, and their objective is to make the machine seem more human-like.
d. The human participant is required to collaborate with the machine in answering questions, and their objective is to confuse the interrogator about who is the human.
The correct answer is: The human must answer the interrogator’s questions honestly and naturally, with the goal of helping the interrogator recognize them as human, not the machine.

Question 3
In the Imitation Game, how is the machine’s role specifically defined, and what strategies should it employ to achieve success in the game?
a. The machine’s role is to replicate the aspects of human reasoning that are factual and logical, avoiding emotional or abstract responses.
b. The machine’s objective is to convince the interrogator it is human by providing intelligent and contextually appropriate answers.
c. The machine’s role is to appear as human as possible by mimicking human emotions and employing random responses to confuse the interrogator.
d. The machine’s objective is to outperform the human in logical reasoning tasks. Its success depends on proving its superior cognitive abilities.
The correct answer is: The machine’s objective is to convince the interrogator it is human by providing intelligent and contextually appropriate answers.

Question 4
The Mathematical Objection to machine intelligence argues that:
a. Machines cannot prove certain mathematical truths that humans can due to limitations in formal systems.
b. Machines cannot understand abstract mathematical concepts.
c. Machines can only solve problems they are explicitly programmed to solve.
d. Machines are only able to do mathematical calculations.
The correct answer is: Machines cannot prove certain mathematical truths that humans can due to limitations in formal systems.

Question 5
Turing suggests that machines could be programmed to learn from experience. Which of the following best describes his view on this?
a. Machines might eventually learn in a way that resembles human learning.
b. Machines will surpass human intelligence immediately upon learning.
c. Machines can only learn from direct human programming.
d. Machines will never be able to learn as well as humans can.
The correct answer is: Machines might eventually learn in a way that resembles human learning.

Question 6
Lady Lovelace’s Objection claims that machines:
a. Will eventually surpass human intelligence in terms of performing calculations.
b. Can generate new ideas independently of human input.
c. Will become self-aware and eventually threaten humankind.
d. Can only perform tasks that they have been explicitly programmed to do.
The correct answer is: Can only perform tasks that they have been explicitly programmed to do.

Question 7
What is the key thought experiment Searle uses in his article “Minds, Brains, and Programs”?
a. The Symbol Manipulation Test
b. The Chinese Room
c. The Turing Test
d. The Imitation Game
The correct answer is: The Chinese Room

Question 8
In the Chinese Room experiment, Searle argues that:
a. Computers will soon be able to have conscious experiences.
b. A computer can understand Chinese if programmed correctly.
c. A person inside the room can eventually learn Chinese by following the instructions.
d. A computer can only simulate understanding but does not actually understand.
The correct answer is: A computer can only simulate understanding but does not actually understand.

Question 9
According to Searle, what is the main difference between a human brain and a computer program?
a. Programs can understand meaning, but the brain is purely mechanical.
b. The human brain intentionally links symbols to meaning, while programs manipulate symbols without any type of understanding.
c. Programs can adapt, but the human brain is fixed.
d. The brain follows strict rules, while programs are more flexible.
The correct answer is: The human brain intentionally links symbols to meaning, while programs manipulate symbols without any type of understanding.

Question 10
Searle introduces the term “strong AI.” What does he mean by this?
a. Computers are not capable of true thought but can perform tasks that imitate thinking, making it appear as though they understand when they do not.
b. Computers can be programmed in such a way that they don’t just simulate thinking: they actually possess minds, consciousness, and understanding.
c. Computers will one day advance to the point where they control human intelligence and decision-making processes, leading to the development of minds similar to humans.
d. Computers are designed to solve problems efficiently, but this does not mean they have the capacity for actual thought or understanding.
The correct answer is: Computers can be programmed in such a way that they don’t just simulate thinking: they actually possess minds, consciousness, and understanding.

Question 11
What does Searle conclude about the possibility of computers having a mind?
a. Computers already have minds, but we do not yet fully comprehend the nature of their thinking processes, which could be unlike human thought.
b. Computers have minds only when they perform exceptionally well in tasks like the Imitation Game, where they convincingly imitate human thinking patterns.
c. Computers, as purely syntactical devices, can never have minds because they lack the biological processes necessary for genuine understanding.
d. Computers will eventually develop minds when they reach a higher level of complexity, simulating biological processes that enable understanding.
The correct answer is: Computers, as purely syntactical devices, can never have minds because they lack the biological processes necessary for genuine understanding.

Question 1
What is the primary function of the hidden layer in a neural network?
a. It performs intermediate calculations and extracts features.
b. It receives the input data and processes it.
c. It outputs the final predictions of the network.
d. It provides feedback to the output layer.
The correct answer is: It performs intermediate calculations and extracts features.

Question 2
How does backpropagation contribute to learning in neural networks?
a. By adjusting the weights based on the gradient of the loss function (e.g., the amount of error produced by the network).
b. By propagating inputs forward through the network.
c. By making predictions based on the current weights.
d. By generating random weights for the hidden layers.
The correct answer is: By adjusting the weights based on the gradient of the loss function (e.g., the amount of error produced by the network).

Question 3
Which of the following best describes the role of an activation function in a neural network?
a. It multiplies the input and weights.
b. It calculates the loss function.
c. It optimizes the weights using gradient descent.
d. It introduces non-linearity to the model.
The correct answer is: It introduces non-linearity to the model.

Question 4
What is the primary difference between biological and artificial neurons in terms of signal transmission?
a. Biological neurons transmit signals slower than artificial neurons.
b. Biological neurons use only chemical signals, while artificial neurons use electrical signals.
c. Biological neurons transmit signals via synapses, while artificial neurons process signals as numerical values.
d. Artificial neurons have a higher number of connections than biological neurons.
The correct answer is: Biological neurons transmit signals via synapses, while artificial neurons process signals as numerical values.

Question 5
What is the function of the softmax activation function?
a. To normalize logits into probabilities.
b. To compute the weighted sum of inputs.
c. To apply a linear transformation to the inputs.
d. To filter out negative values in the input.
The correct answer is: To normalize logits into probabilities.

Question 6
What does the gradient in gradient descent represent in terms of training a neural network?
a. The time it takes to train the network.
b. The distance between the input and output layers.
c. The steepness of the error surface.
d. The number of neurons in the hidden layer.
The correct answer is: The steepness of the error surface.

Question 7
Which of the following best describes why weights are updated in a neural network during training?
a. To ensure the model learns from every example in the dataset equally.
b. To minimize the error in the model’s predictions.
c. To prevent the model from overfitting.
d. To increase the complexity of the network’s architecture.
The correct answer is: To minimize the error in the model’s predictions.

Question 8
Why is the rectified linear unit (ReLU) activation function widely used in neural networks?
a. It introduces non-linearity and is computationally efficient.
b. It normalizes the output to be between 0 and 1.
c. It prevents overfitting.
d. It reduces the dimensionality of the data.
The correct answer is: It introduces non-linearity and is computationally efficient.

Question 9
In the context of deep learning as discussed in Chollet, what does the idea of learning representations refer to?
a. The process of hand-picking features from raw data.
b. The process of compressing input data to reduce training time.
c. The transformation of raw data into lower dimensions.
d. The ability of the model to automatically discover useful ways to encode the data or useful features of the data.
The correct answer is: The ability of the model to automatically discover useful ways to encode the data or useful features of the data.

Question 10
In the context of deep learning, as discussed in Chapter 1 of Chollet, what is the role of layers in a neural network?
a. Layers help divide the input data into multiple parts for separate processing.
b. Layers act as stages in a network where progressively more complex representations of the input data are learned.
c. Layers are used to store the final output of a neural network.
d. Layers are used only to reduce the dimensionality of the input data to speed up computation.
The correct answer is: Layers act as stages in a network where progressively more complex representations of the input data are learned.

Question 11
What is the primary advantage of deep learning over traditional machine learning techniques, according to Chollet?
a. Deep learning models do not require large datasets for training.
b. Deep learning models can learn complex hierarchical representations from raw data.
c. Deep learning models require less computational power.
d. Deep learning models are less prone to overfitting than traditional models.
The correct answer is: Deep learning models can learn complex hierarchical representations from raw data.

Question 12
According to Chollet, what makes deep learning “deep”?
a. The ability to learn from small datasets with minimal labeled data.
b. The use of multiple hidden layers to automatically learn complex features from data.
c. The use of highly specialized hardware, such as GPUs.
d. The use of advanced optimization algorithms.
The correct answer is: The use of multiple hidden layers to automatically learn complex features from data.

Question 13
According to Chollet, why is there skepticism about AI achieving human-level general intelligence in the near future?
a. The short-term hype often exceeds current technological capabilities.
b. There are too few researchers working on AI.
c. There is no funding available for AI research.
d. AI is fundamentally different from human intelligence and can’t be improved further.
The correct answer is: The short-term hype often exceeds current technological capabilities.

Question 14
Which of the following is NOT one of the major breakthroughs attributed to deep learning in Chapter 1?
a. Improved decision-making in board games like Go
b. Quantum computing simulations
c. Autonomous driving
d. Near-human-level speech transcription
The correct answer is: Quantum computing simulations

Question 15
Chollet mentions a significant risk in the development of AI technologies. What is this risk?
a. That too many AI systems will rely on outdated hardware.
b. AI might become too powerful and uncontrollable.
c. Lack of computational resources for scaling AI systems.
d. That high expectations for short-term success will lead to a funding winter when they aren’t met.
The correct answer is: That high expectations for short-term success will lead to a funding winter when they aren’t met.

Question 16
What is one of the key reasons behind deep learning’s prominence since the 2010s, as explained by Chollet?
a. The availability of large datasets, such as ImageNet, and more powerful hardware like GPUs.
b. The invention of new types of neural networks.
c. The sudden increase in funding from governments.
d. The discovery of entirely new algorithms for processing data.
The correct answer is: The availability of large datasets, such as ImageNet, and more powerful hardware like GPUs.

Question 1
In a simple language model, what does the first element in the input sequence typically represent?
a. A random symbol
b. The first word
c. Punctuation
d. A start symbol
The correct answer is: A start symbol

Question 2
What is the primary improvement when adding context to a language model?
a. It generates random outputs.
b. It predicts outputs based on a single previous word.
c. It reduces the number of words in the sequence.
d. It considers multiple previous outputs to inform predictions.
The correct answer is: It considers multiple previous outputs to inform predictions.

Question 3
In a language model, what is the purpose of the logits generated by the model?
a. They generate the next sentence.
b. They are converted into probabilities using the softmax function.
c. They calculate error values.
d. They determine the sequence length.
The correct answer is: They are converted into probabilities using the softmax function.

Question 4
Which method selects the word with the highest probability as the next word in a language model?
a. Top-k Sampling
b. Argmax (Greedy Decoding)
c. Top-p Sampling
d. Beam Search
The correct answer is: Argmax (Greedy Decoding)

Question 5
What is the key feature of top-k sampling in language models?
a. It restricts the possible next words to the top k most probable words and selects one randomly.
b. It chooses the most frequent word.
c. It generates words based on their exact probabilities.
d. It limits predictions to the top 3 most probable words.
The correct answer is: It restricts the possible next words to the top k most probable words and selects one randomly.

Question 6
How does top-p (nucleus) sampling select the next word in a sequence?
a. It selects only the most probable word.
b. It chooses the smallest set of words whose cumulative probability exceeds a threshold.
c. It generates words based on the highest frequency.
d. It eliminates all words below a fixed probability.
The correct answer is: It chooses the smallest set of words whose cumulative probability exceeds a threshold.

Question 7
What effect does adjusting the temperature parameter in a language model have?
a. It increases the model's memory.
b. It decreases the number of possible outputs.
c. It controls how conservative or adventurous the output is by adjusting the logits.
d. It limits the length of the sentence.
The correct answer is: It controls how conservative or adventurous the output is by adjusting the logits.
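
The decoding ideas in the questions above (logits turned into probabilities by softmax, argmax, top-k, top-p, and temperature) can be illustrated with a minimal sketch. The vocabulary, the logit values, and all function names below are hypothetical and chosen only for illustration; they do not come from any particular model or library.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores (logits) into probabilities that sum to 1.

    Dividing by a temperature below 1 sharpens the distribution (more
    conservative); a temperature above 1 flattens it (more adventurous).
    """
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    """Argmax (greedy) decoding: index of the single most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def top_k_candidates(probs, k):
    """Indices of the k most probable tokens, most probable first."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k]

def top_p_candidates(probs, p):
    """Smallest set of top-ranked tokens whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in ranked:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return chosen

def sample_from(probs, candidates, rng):
    """Pick one candidate at random, weighted by its probability."""
    weights = [probs[i] for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

vocab = ["cat", "dog", "car", "tree"]  # hypothetical vocabulary
logits = [2.0, 1.0, 0.5, -1.0]         # hypothetical model scores

probs = softmax(logits)
rng = random.Random(0)  # seeded so repeated runs behave the same way

print("greedy:", vocab[greedy(probs)])
print("top-k (k=2):", [vocab[i] for i in top_k_candidates(probs, 2)])
print("top-p (p=0.9):", [vocab[i] for i in top_p_candidates(probs, 0.9)])
print("sampled from top-k:", vocab[sample_from(probs, top_k_candidates(probs, 2), rng)])
print("p(cat) at T=0.5, 1.0, 2.0:",
      [round(softmax(logits, t)[0], 3) for t in (0.5, 1.0, 2.0)])
```

In this sketch, top-k and top-p only decide which tokens remain eligible; the random draw among them is a separate step, which is one common way to arrange it rather than the only one.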

Question 8
Why is the logarithmic function used in error calculations for language models?
a. It reduces the number of predictions.
b. It simplifies the error calculation.
c. It adds extra penalties as predictions deviate further from the target.
d. It decreases the influence of wrong predictions.
The correct answer is: It adds extra penalties as predictions deviate further from the target.

Question 9
What is a key limitation of representing words as integers in language models?
a. It prevents words from being stored in memory.
b. It does not capture semantic relationships between words.
c. It increases computational complexity.
d. It leads to repetitive outputs.
The correct answer is: It does not capture semantic relationships between words.

Question 10
Why are dense vectors preferable to one-hot encodings in language models?
a. Dense vectors increase the output length.
b. Dense vectors eliminate randomness in the model.
c. Dense vectors reduce the number of layers in the model.
d. Dense vectors represent information in a way that allows them to capture relationships between words.
The correct answer is: Dense vectors represent information in a way that allows them to capture relationships between words.

Question 11
According to Kevin Henner, what is a text embedding in the context of language models?
a. A sentence converted into a number.
b. A paragraph compressed into a smaller string.
c. A piece of text projected into a high-dimensional space as a vector.
d. A word translated into a different language.
The correct answer is: A piece of text projected into a high-dimensional space as a vector.
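
The point of Question 8 above, that a logarithmic error penalizes predictions more heavily the further they fall from the target, can be seen in a small sketch. It assumes the error is the negative natural log of the probability the model gave to the correct word (the usual cross-entropy form); the function name `log_loss` is hypothetical.

```python
import math

def log_loss(prob_of_target):
    """Negative log of the probability assigned to the correct word.

    Assumes 0 < prob_of_target <= 1. The loss is 0 when the model is
    certain and correct, and grows without bound as the probability
    approaches 0.
    """
    return -math.log(prob_of_target)

# The penalty rises faster and faster as the model's confidence
# in the correct word drops.
for p in (0.9, 0.5, 0.1, 0.01):
    print(f"p(target) = {p:<4}  loss = {log_loss(p):.3f}")
```

Halving the probability of the correct word always adds the same amount to the loss (about 0.693), so a drop from 0.02 to 0.01 costs as much as a drop from 1.0 to 0.5.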

Question 12
In Word2Vec, what famous example is often used to illustrate vector arithmetic?
a. man + dog = friendship
b. cat - mouse + cheese = meal
c. king - man + woman = queen
d. dog + bone = happiness
The correct answer is: king - man + woman = queen

Question 13
According to Kevin Henner, what is the role of neural networks in generating text embeddings?
a. They eliminate noise from the text.
b. They calculate sentence lengths to improve coherence.
c. They adjust the values of vectors during training to create useful embeddings.
d. They store word frequencies for faster retrieval.
The correct answer is: They adjust the values of vectors during training to create useful embeddings.

Question 14
According to Kevin Henner, what key feature does an embedding space preserve in text embeddings?
a. It preserves distance, with similar items placed closer together than less similar ones.
b. It groups similar words by their length.
c. It categorizes words based on frequency.
d. It arranges words in alphabetical order.
The correct answer is: It preserves distance, with similar items placed closer together than less similar ones.

Question 15
According to Kevin Henner, what is the key advantage of the Transformer architecture in text embeddings?
a. It uses an attention mechanism to weigh the influence of each token in the sequence.
b. It ignores the position of words in a sentence.
c. It uses random vectors for predictions.
d. It eliminates the need for training data.
The correct answer is: It uses an attention mechanism to weigh the influence of each token in the sequence.
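
The contrast between one-hot and dense vectors, and the king - man + woman = queen example from the questions above, can be sketched with toy numbers. The four-dimensional vectors below are hand-picked for illustration (roughly "royalty", "male", "female", "fruit") and are an assumption, not values from Word2Vec or any trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 for the same direction, 0.0 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def one_hot(index, size):
    """A vector of zeros with a single 1 at the word's integer index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

# Hypothetical dense embeddings (hand-made, not learned).
embeddings = {
    "king":  [0.9, 0.9, 0.1, 0.0],
    "queen": [0.9, 0.1, 0.9, 0.0],
    "man":   [0.1, 0.9, 0.1, 0.0],
    "woman": [0.1, 0.1, 0.9, 0.0],
    "apple": [0.0, 0.1, 0.1, 0.9],
}

def nearest(vector, exclude):
    """Word whose embedding is most similar to `vector`, skipping `exclude`."""
    candidates = {w: v for w, v in embeddings.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(vector, candidates[w]))

# Any two different one-hot vectors are orthogonal, so they say nothing
# about how related two words are.
print("one-hot similarity:", cosine(one_hot(0, 5), one_hot(1, 5)))

# Dense vectors place related words closer together.
print("king~queen:", round(cosine(embeddings["king"], embeddings["queen"]), 3))
print("king~apple:", round(cosine(embeddings["king"], embeddings["apple"]), 3))

# Vector arithmetic: king - man + woman
target = [k - m + w for k, m, w in
          zip(embeddings["king"], embeddings["man"], embeddings["woman"])]
print("king - man + woman is nearest to:",
      nearest(target, exclude={"king", "man", "woman"}))
```

Excluding the three query words from the search is a common convention when the analogy is demonstrated; with real embeddings the result is approximate rather than exact.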

Question 16
According to Kevin Henner, how are embeddings used in multimodal models like GPT-Vision?
a. They map different types of data (text, audiovisual, robotics) into a shared embedding space.
b. They filter out irrelevant text from the input.
c. They separate data based on its format.
d. They store large datasets for faster retrieval.
The correct answer is: They map different types of data (text, audiovisual, robotics) into a shared embedding space.

Question 1
According to Alammar's blog, what is a key innovation of the Transformer model compared to previous architectures like RNNs?
a. It processes input data in parallel.
b. It uses convolutional layers.
c. It relies entirely on positional encoding.
d. It focuses on recurrent connections.
The correct answer is: It processes input data in parallel.

Question 2
According to Alammar's blog, what role does the self-attention mechanism play in the Transformer model?
a. It helps the model focus on different parts of the input.
b. It restricts the model's attention to adjacent words.
c. It replaces the need for positional encoding.
d. It enables the model to ignore irrelevant tokens.
The correct answer is: It helps the model focus on different parts of the input.

Question 3
According to Alammar's blog, what is the purpose of the encoder in the Transformer model?
a. To compress input sequences into a fixed-length vector.
b. To generate probabilities for word prediction.
c. To encode input sequences into higher-dimensional representations.
d. To calculate token frequency.
The correct answer is: To encode input sequences into higher-dimensional representations.
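
How self-attention lets each token "focus on different parts of the input" can be sketched as scaled dot-product attention. This is a simplified illustration under stated assumptions: the token vectors are made up, and they are used directly as queries, keys, and values, whereas a real Transformer first multiplies the embeddings by learned projection matrices. The `causal` flag shows the masked variant used in a decoder, where a token may not look at later positions.

```python
import math

def softmax(xs):
    """Turn scores into weights that sum to 1 (masked scores of -inf become 0)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def self_attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over a short sequence.

    For each position i: score every position j by dot(query_i, key_j),
    scale by sqrt(d_k), softmax the scores into weights, then return the
    weighted average of the value vectors.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        if causal:
            # Hide future positions so no information leaks backwards.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        output = [sum(w * v[c] for w, v in zip(weights, values))
                  for c in range(len(values[0]))]
        all_weights.append(weights)
        outputs.append(output)
    return outputs, all_weights

# Hypothetical 2-dimensional vectors for a three-token sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

outputs, weights = self_attention(tokens, tokens, tokens)
for i, row in enumerate(weights):
    print(f"token {i} attends with weights", [round(w, 3) for w in row])

_, masked = self_attention(tokens, tokens, tokens, causal=True)
print("causal weights for token 0:", masked[0])
```

Every position's weights are computed independently of the others, which is why the whole sequence can be processed in parallel rather than one step at a time as in a recurrent network.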
Question 4
According to Alammar's blog, why is positional encoding necessary in the Transformer model?
a. It allows the model to process input sequences backwards.
b. It speeds up the training process.
c. It reduces overfitting in the model.
d. The model does not have access to information about the order of tokens otherwise.
The correct answer is: The model does not have access to information about the order of tokens otherwise.

Question 5
According to Alammar's blog, what is the function of the multi-head attention mechanism?
a. It eliminates the need for a feed-forward network.
b. It performs a single attention calculation across the entire sequence.
c. It focuses attention only on the first token.
d. It allows the model to attend to different parts of the input sequence simultaneously.
The correct answer is: It allows the model to attend to different parts of the input sequence simultaneously.

Question 6
According to Alammar's blog, what is the primary difference between the encoder and decoder in the Transformer architecture?
a. The encoder processes tokens one by one, while the decoder processes them in parallel.
b. The decoder includes a masked self-attention mechanism to prevent future token information from leaking.
c. The encoder does not use positional encoding.
d. The decoder lacks self-attention layers.
The correct answer is: The decoder includes a masked self-attention mechanism to prevent future token information from leaking.

Question 7
According to Alammar's blog, how does self-attention compute relevance between tokens in a sequence?
a. By multiplying the token embeddings with learned positional encodings.
b. By computing dot products between query, key, and value vectors derived from token embeddings.
c. By averaging the token embeddings across the entire sequence.
d. By calculating a similarity score between each token and a random vector.
The correct answer is: By computing dot products between query, key, and value vectors derived from token embeddings.

Question 8
According to Alammar's blog, what is the role of the feed-forward network in the Transformer model?
a. It computes positional encodings.
b. It reduces the dimensionality of token embeddings.
c. It is applied to each token separately after the self-attention layer.
d. It calculates the final output probabilities.
The correct answer is: It is applied to each token separately after the self-attention layer.

Question 9
According to Alammar's blog, why does the Transformer model use layer normalization?
a. To eliminate the need for attention mechanisms.
b. To stabilize and accelerate training.
c. To add noise to the input sequences.
d. To speed up the self-attention process.
The correct answer is: To stabilize and accelerate training.

Question 10
According to Alammar's blog, how does the model's output differ from traditional sequence-to-sequence models?
a. It generates text by predicting entire sentences at once.
b. It depends on convolutional layers to generate outputs.
c. It processes inputs sequentially rather than in parallel.
d. It relies on attention mechanisms instead of recurrence.
The correct answer is: It relies on attention mechanisms instead of recurrence.

Question 11
According to Alammar's blog, why is a masked self-attention mechanism used in the decoder?
a. To force the model to focus only on the first word of the sequence.
b. To prevent the model from accessing information about future tokens.
c. To prevent the model from attending to irrelevant parts of the input.
d. To limit the model’s ability to generate multiple outputs simultaneously.
The correct answer is: To prevent the model from accessing information about future tokens.

Question 12
According to Alammar's blog, what key advantage does parallelization provide in the Transformer model?
a. It enables the model to skip irrelevant tokens.
b. It decreases the need for training data.
c. It allows for faster processing of large sequences compared to recurrent networks.
d. It reduces the number of layers needed in the network.
The correct answer is: It allows for faster processing of large sequences compared to recurrent networks.

Question 13
According to the class slides, what is the primary role of the transformer decoder?
a. To calculate token frequency in text data.
b. To create new word embeddings from scratch.
c. To decode input sequences into their most probable outputs.
d. To encode input sequences into dense vectors.
The correct answer is: To decode input sequences into their most probable outputs.

Question 14
According to the class slides, what key feature do dense vectors provide in the embedding layer?
a. They represent words with unique integers.
b. They encode the output probability.
c. They capture the relationships between words.
d. They calculate word frequency in the sequence.
The correct answer is: They capture the relationships between words.
Question 15
According to the class slides, what information does positional encoding add to word embeddings?
a. The frequency of the word in a text.
b. The word's grammatical category.
c. The contextual meaning of the word.
d. The position of the word in the sequence.
The correct answer is: The position of the word in the sequence.

Question 16
According to the class slides, why can't transformer decoders automatically understand word order?
a. They ignore positional information by design.
b. They use convolutional layers for order representation.
c. They lack information about word positions.
d. They process all tokens sequentially.
The correct answer is: They lack information about word positions.

Question 17
According to the class slides, how does frequency relate to positional encoding?
a. It determines how often attention heads update.
b. It reduces the computational complexity of decoding.
c. It ensures only high-frequency words are encoded.
d. It helps distinguish local and long-range dependencies.
The correct answer is: It helps distinguish local and long-range dependencies.

Question 18
According to the class slides, what is the effect of combining high and low-frequency components in positional encoding?
a. It ensures that words have fixed, non-relative positions.
b. It focuses the model’s attention on the most relevant words.
c. It allows the model to understand grammatical rules.
d. It captures both local and global relationships in a sequence.
The correct answer is: It captures both local and global relationships in a sequence.
Question 19
According to the class slides, how do learned positional embeddings differ from sinusoidal encodings?
a. They increase the size of the input sequence.
b. They generalize better to unseen data.
c. They allow the model to recognize syntactic structures.
d. They are learned during training rather than being fixed.
The correct answer is: They are learned during training rather than being fixed.

Question 20
According to the class slides, what does a transformer decoder primarily rely on to generate the next word in a sequence?
a. The multi-head attention mechanism
b. The positional encoding function
c. The output of convolutional layers
d. The softmax output probability
The correct answer is: The multi-head attention mechanism

Humans have experience of being in the world, whereas an AI has no physical experience of the outside world (it only has a description, a set of instructions).
We also learn language this way: the first words you say are for things that you see, and meaning is given by pointing at what a word refers to.
My visual processes are connected to some kind of reality outside my brain. That causes me to perceive a certain thing, and that perception is connected to my word "hat", so I can refer to it and talk about it in that way.
Ungrounded, therefore: any brain in a vat that has no perception system interacting in some way with the world outside the vat is not grounded.
We ground our language in the real world, but that does not mean grounding is essential to language; it is just what we do. LLMs don't, and it is informative to see something that is not grounded and still seems, in some sense, to understand.
The alternative view: meanings aren't actually grounded. We get the illusion that they are, but meaning really has everything to do with associations with other words and concepts, and nothing to do with our perceptions.
Not every symbol has a meaning independent of other symbols; you have to consider the symbols and how they interact with other symbols before you get a meaning.

Meaning Holism
Concept: The meaning of a term is influenced by its relationships with other terms within a theory. For example, the meaning of "light" can change based on its interactions with gravity and space.
Meaning: Meaning is just about words depending on one another.
LLMs (their theory): they define every word in terms of its relationship with every other word, and also in terms of its position in the sentences that came before it (this is the interdependency idea: correlation between words without any grounding).
What are they? LLMs are advanced AI systems that generate human-like text based on patterns learned from vast amounts of data.
How They Work:
○ Numerical Vectors: Words are represented as sequences of numbers (vectors). For example, the word "cat" might be represented as [0.1, 0.5, 0.3].
○ Neural Networks: These models use complex neural networks to generate sentences based on statistical distributions from the training data.
Everyday interactions and conversations are just talking and relating concepts to one another.
------------------------------------------------------------------------------------------------------------------------------
What is cognitive science?: Cognitive Science is the study of the mental algorithms and representations involved in human cognition. It seeks to discover the “software” that runs on the brain (biological hardware). It also examines how this software is acquired and how it evolved.
Marr's Levels of Analysis: David Marr proposed a framework for understanding complex information processing systems.
Particularly focused on vision, but applicable to other cognitive processes Each level offers unique insights into how the brain processes information Computational Level: What is the problem being solved? (e.g., understanding language) Algorithmic Level: How is the problem being solved? (e.g., using neural networks) Representation affects the choice of algorithm. Implementational Level: How is the solution physically realized? (e.g., in a computer or brain) The physical limits of the “machine” (e.g., brain or computer) affect the efficiency and feasibility of the algorithm. Summary: Marr’s Levels in Addition Computational Level: What is addition? Computing sums. Two numerical representations as inputs. One numerical representation as output. No additional information needed. Algorithmic Level: What procedures are done to implement the computation of a sum. How addition is performed depends on representation (decimal, binary, Roman). Implementational Level: How is addition physically realized in a human brain or a computer? Knowing an algorithm can help us understand the physical system. Knowing the physical system can put constraints on the algorithm. Modularity: Fodor's Theory of Modularity Overview: Jerry Fodor proposed that the mind is composed of domain-specific modules. Each module is responsible for processing specific types of information, such as language or vision. Key Characteristics of Fodor’s Modules: Domain-Specific: Each module processes a particular type of information independently (e.g., language, visual perception). Informationally Encapsulated: Modules operate without influence from other cognitive processes. For example, visual perception can occur without conscious thought. Fast Processing: Modules are designed for quick, efficient processing of information. Mandatory Operation: Modules function automatically and cannot be "turned off." Fixed Neural Architecture: Each module is associated with specific brain regions dedicated to its function. 
Empirical Motivations: Evidence for Fodor's theory includes:
Visual Illusions: These demonstrate that perceptual modules can operate independently of reasoning.
Brain Lesion Studies: Different cognitive abilities can be damaged independently, supporting the idea of modularity.
Speed of Cognitive Tasks: The speed of tasks such as speech perception suggests specialized, fast-acting systems.

Chomsky's Theory of Language Faculty
Overview: Noam Chomsky's theory posits that language is a modular system, separate from general cognition. He introduced the concept of a "Language Faculty" that is domain-specific and responsible for processing linguistic information. If your brain doesn't allow you to do something, you will never produce it. So you can't produce something that violates your grammar, because your grammar determines what you are going to produce.
Key Aspects of Chomsky’s Theory:
Autonomy: Language acquisition and processing are independent of other cognitive systems. This means that the ability to learn language does not rely on general cognitive abilities.
Species-Specific: Chomsky argues that only humans possess this specialized module for language, which is not found in other species.
Empirical Motivations:
Poverty of the Stimulus: Children acquire complex grammar despite limited linguistic input, suggesting an innate capacity for language.
Specific Language Impairment (SLI): Individuals with normal intelligence but impaired language abilities support the idea of a distinct language module.
Critical Period Hypothesis: There is a limited time frame during which language acquisition occurs most easily, indicating a specialized mechanism for language learning.

Summary
Fodor's Theory emphasizes the modularity of the mind, suggesting that different cognitive functions are handled by specialized, independent systems.
Chomsky's Theory focuses specifically on language, proposing that it is a unique cognitive faculty that operates separately from other cognitive processes.

Why is Linguistics a cognitive science?:
1. Focus on Mental Representations: Linguistics studies the mental representations and algorithms that individuals use to produce and understand speech. This aligns with cognitive science's interest in how the mind processes information.
2. Subfields of Linguistics:
○ Phonology: Examines the mental representation of sounds in language.
○ Syntax: Investigates the structure of sentences and how words combine to form phrases and sentences.
○ Semantics: Studies the meaning of words, phrases, and sentences, focusing on how meaning is represented in the mind.

Alan Turing: Computing Machinery and (Artificial) Intelligence
One of the reasons Turing is so famous is not just that he actually built a computer, but that he also designed a theoretical computer, called a Turing machine. It is a really dumb, very simple computer in many ways. The interesting thing is that he proved mathematically that any program that is programmable at all can be run on this machine, and that the only differences between this machine and any machine built in the future are time and memory. Machines might get faster and might get more memory, but anything that is implementable on any machine is implementable on a Turing machine. That was a really huge discovery.
The important point is that such a machine can carry out any computation, as long as it is represented as a program: a series of really specific instructions. That's all that's required. Such a machine is not tied to any specific physical instantiation.
STORAGE: You cannot have infinite memory in any type of machine.
Memory could be limited by the space available, or by the thing you implement the machine on, but Turing set that aside: for his machine, just assume you have as much memory as you want.

Notes in ATLAS
Turing's machine is a discrete state machine, meaning it occupies one state and then moves to another, with no intermediate states.
The Ada Lovelace objection is introduced, suggesting that machines can only do what they are explicitly programmed to do, questioning their true creativity and independent decision-making.
The conversation also references historical figures like Ada Lovelace, who is credited with being the first computer programmer.
The discussion touches on the idea that while machines do not think exactly like humans, they may still exhibit behaviors that resemble thought processes.

What evidence supports the modularity of language in the brain?
Evidence Supporting the Modularity of Language in the Brain
○ Brain Lesion Studies: Research has shown that specific areas of the brain, such as Broca's and Wernicke's areas, are associated with distinct language functions. Damage to these areas can lead to specific language impairments while leaving other cognitive abilities relatively intact.
○ Specific Language Impairment (SLI): Individuals with SLI exhibit normal intelligence but have significant difficulties with language acquisition, suggesting that language processing is a distinct cognitive module.
○ Poverty of the Stimulus: Children acquire complex grammatical structures with minimal linguistic input, indicating that language learning is not solely dependent on environmental exposure but involves innate cognitive mechanisms.

How does language acquisition differ from other cognitive skills?
Differences in Language Acquisition from Other Cognitive Skills
○ Rapid Acquisition: Children typically acquire language skills quickly and reach adult-like competence by around age four, often without explicit instruction.
○ Critical Periods: There are sensitive periods in early childhood during which language acquisition occurs most easily, suggesting a biological basis for language learning that differs from other cognitive skills. ○ Unconscious Learning: Children learn language patterns unconsciously, often without being aware of the grammatical rules they are following, unlike many other cognitive skills that may require conscious effort and instruction. What role did Noam Chomsky play in the development of cognitive science? Noam Chomsky's Role in the Development of Cognitive Science ○ Theory of Universal Grammar: Chomsky proposed that all human languages share a common underlying structure, which he termed "universal grammar." This idea revolutionized the understanding of language and its cognitive underpinnings. ○ Modularity of Mind: He argued that language is a distinct cognitive module, separate from other cognitive processes, which has influenced the study of cognitive science by emphasizing the specialized nature of language processing. ○ Scientific Methodology: Chomsky applied rigorous scientific methodologies to the study of language, focusing on the mental capacities that enable communication, thus laying the groundwork for linguistics as a subfield of cognitive science. 
The power of learning comes from the "weights".
Feed-forward propagation (has to happen first, so that backpropagation can fix the error)
○ The most basic form of this type of propagation
Backpropagation: fixes the output
Neural networks: always lower the error; the way a network learns is by lowering the error
Softmax: calculates probabilities, which add up to 1
Usually used at the output level
Assigns probabilities to the outputs
Applied to all nodes
Takes the raw values into consideration
The highest raw value gets the highest probability (and likewise the lowest raw value gets the lowest)
Backpropagation: it is not a heuristic (not a "life hack"); you don't tinker, it is a universal, global method
The higher you are on the hill, the higher the error; the lower you are, the lower the error (the network wants to reach the lowest error possible)
At every point between each of the layers, there's a hill
------------------------------------------------------------------------------------------------------------------------------------------
Chollet 2021 Chapter 1 Deep Learning with Python:
AI can be described as the effort to automate intellectual tasks normally performed by humans. As such, AI is a general field that encompasses machine learning and deep learning, but that also includes many more approaches that may not involve any learning.
○ Artificial Intelligence (AI): A broad field aimed at automating intellectual tasks typically performed by humans.
We just stated that machine learning discovers rules for executing a data processing task, given examples of what’s expected. So, to do machine learning, we need three things:
○ Input data points: for instance, if the task is speech recognition, these data points could be sound files of people speaking. If the task is image tagging, they could be pictures.
○ Examples of the expected output: in a speech-recognition task, these could be human-generated transcripts of sound files. In an image task, expected outputs could be tags such as “dog,” “cat,” and so on.
○ A way to measure whether the algorithm is doing a good job—This is necessary in order to determine the distance between the algorithm’s current output and its expected output. The measurement is used as a feedback signal to adjust the way the algorithm works. This adjustment step is what we call learning. A machine learning model transforms its input data into meaningful outputs, a process that is “learned” from exposure to known examples of inputs and outputs. Therefore, the central problem in machine learning and deep learning is to meaningfully transform data: in other words, to learn useful representations of the input data at hand—representations that get us closer to the expected output. What’s a representation? At its core, it’s a different way to look at data—to represent or encode data. ○ Machine learning models are all about finding appropriate representations for their input data—transformations of the data that make it more amenable to the task at hand. ○ We would then be doing machine learning. Learning, in the context of machine learning, describes an automatic search process for data transformations that produce useful representations of some data, guided by some feedback signal—representations that are amenable to simpler rules solving the task at hand. ○ Machine learning algorithms aren’t usually creative in finding these transformations; they’re merely searching through a predefined set of operations, called a hypothesis space So that’s what machine learning is, concisely: searching for useful representations and rules over some input data, within a predefined space of possibilities, using guidance from a feedback signal. This simple idea allows for solving a remarkably broad range of intellectual tasks, from speech recognition to autonomous driving. ○ Machine Learning: A subset of AI that uses algorithms to learn from data rather than being explicitly programmed. 
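The description of learning as a search through a hypothesis space, guided by a feedback signal, can be made concrete with a very small sketch. Everything in it is an illustrative assumption rather than anything from Chollet's text: the six data points, the idea of a "rule" being a single threshold, and the names `hypothesis_space` and `error_count` are all made up for this example.

```python
# Toy data: one number per example, with the label we expect back.
# (Values are invented for illustration.)
inputs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
targets = [0, 0, 0, 1, 1, 1]

def predict(threshold, x):
    """One candidate rule: answer 1 when x is above the threshold."""
    return 1 if x > threshold else 0

def error_count(threshold):
    """Feedback signal: how many examples the rule gets wrong."""
    return sum(predict(threshold, x) != y for x, y in zip(inputs, targets))

# Hypothesis space: the predefined set of rules the search may consider.
hypothesis_space = [0.5, 1.5, 2.5, 4.5, 6.5, 7.5]

# "Learning": keep the candidate rule with the lowest error.
best_threshold = min(hypothesis_space, key=error_count)
print(best_threshold, error_count(best_threshold))
```

The three ingredients from the list above all appear: input data points (`inputs`), examples of the expected output (`targets`), and a measure of how well the rule is doing (`error_count`). Real models search far larger spaces, and they do it with gradients rather than by trying every candidate.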
The "deep" in deep learning:
Deep learning is a specific subfield of machine learning: a new take on learning representations from data that puts an emphasis on learning successive layers of increasingly meaningful representations. The “deep” in “deep learning” isn’t a reference to any kind of deeper understanding achieved by the approach; rather, it stands for this idea of successive layers of representations.
○ How many layers contribute to a model of the data is called the depth of the model.
○ Modern deep learning often involves tens or even hundreds of successive layers of representations, and they’re all learned automatically from exposure to training data.
In deep learning, these layered representations are learned via models called neural networks, structured in literal layers stacked on top of each other. The term “neural network” refers to neurobiology, but although some of the central concepts in deep learning were developed in part by drawing inspiration from our understanding of the brain (in particular, the visual cortex), deep learning models are not models of the brain. Deep learning is a mathematical framework for learning representations from data.
The network transforms the digit image into representations that are increasingly different from the original image and increasingly informative about the final result. You can think of a deep network as a multistage information distillation process, where information goes through successive filters and comes out increasingly purified.
○ So that’s what deep learning is, technically: a multistage way to learn data representations.
SUMMARY: At this point, you know that machine learning is about mapping inputs (such as images) to targets (such as the label “cat”), which is done by observing many examples of input and targets.
You also know that deep neural networks do this input-to-target mapping via a deep sequence of simple data transformations (layers) and that these data transformations are learned by exposure to examples. Now let’s look at how this learning happens, concretely.

Input Layer
Serves as the network’s interface with the external environment.
Receives raw data as numbers which represent various types of input (e.g., pixels of an image, words in a sentence, etc.).
Passes this data either directly to an output layer, or into an intermediate layer for further processing.
Hidden Layer
Performs intermediate computations and transformations.
Extracts and combines features from input data which aid in pattern recognition.
More hidden layers allow for more complex representations, more complex transformations, and more complex pattern recognition.
Output Layer
Produces the final predictions or classifications based on the processed input.
Translates the network’s internal processing into a usable representation (e.g., identifying an image, generating the next word in a sequence, etc.).
Bias: A special type of weight added to each neuron to shift the activation function. Helps the network learn more complex data patterns.
Note: To eliminate clutter, these slides and future slides will often omit the contributions of biases; however, this should not be taken as a sign that their contributions are not significant.

How does this learning happen?
The specification of what a layer does to its input data is stored in the layer’s weights, which in essence are a bunch of numbers (numerical values assigned to connections between nodes). In technical terms, we’d say that the transformation implemented by a layer is parameterized by its weights. The weights also determine the importance of each input to a neuron. (Weights are also sometimes called the parameters of a layer.)
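To make the input, hidden and output layers described above concrete, here is a minimal sketch of a single forward pass in plain Python. The layer sizes, every weight and bias value, and the choice of ReLU as the hidden activation are assumptions made only for illustration; the softmax at the output follows the notes' remark that it is usually applied at the output level.

```python
import math

def relu(values):
    """Assumed hidden-layer activation: negative values become 0."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Turn raw output values into probabilities that add up to 1."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """One layer: each neuron computes a weighted sum of its inputs plus its bias."""
    return [sum(w * x for w, x in zip(neuron_weights, inputs)) + b
            for neuron_weights, b in zip(weights, biases)]

# Input layer: raw data as numbers (3 invented values).
x = [0.5, -1.0, 2.0]

# Hidden layer: 2 neurons, each with one weight per input and one bias.
hidden_w = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]]
hidden_b = [0.1, -0.2]

# Output layer: 3 neurons (e.g. 3 possible classes), one weight per hidden neuron.
output_w = [[1.0, -1.0], [-0.5, 0.8], [0.3, 0.3]]
output_b = [0.0, 0.1, -0.1]

hidden = relu(dense(x, hidden_w, hidden_b))
probabilities = softmax(dense(hidden, output_w, output_b))
print(probabilities)
```

Changing any number in `hidden_w` or `output_w` changes what the layer does to its input, which is the sense in which a layer is "parameterized by its weights".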
In this context, learning means finding a set of values for the weights of all layers in a network, such that the network will correctly map example inputs to their associated targets. ○ Weights are updated during training to reduce errors in the network’s predictions. ○ The adjustment process is guided by an algorithm that adjusts the weights by computing the slope (gradient) of a loss function, aiming to optimize the network’s performance. Gradient descent is used to optimize weights systematically, driven by a loss function. Error Calculation (1): ○ The difference between the predicted output and the actual output is computed using a loss function. ○ Common loss functions include Mean Squared Error (MSE) and Cross-Entropy. ○ The goal is to minimize the loss function over time. Gradient Descent (2): ○ Weights are updated in the direction that reduces the error, guided by the gradient of the loss function. ○ The gradient indicates the steepness of the error surface. Learning Rate (3): ○ A critical hyperparameter that needs tuning. ○ Basically determines the “size” of the “steps” down the slope (i.e., how big of an adjustment to the weights). ○ Too large can cause overshooting, too small can cause slow convergence. ○ Adaptive learning rates like Adam optimize this automatically. To control the output of a neural network, you need to be able to measure how far this output is from what you expected. This is the job of the loss function of the network, also sometimes called the objective function or cost function. The loss function takes the predictions of the network and the true target (what you wanted the network to output) and computes a distance score, capturing how well the network has done on this specific example (see figure 1.8). But here’s the thing: a deep neural network can contain tens of millions of parameters. 
Finding the correct values for all of them may seem like a daunting task, especially given that modifying the value of one parameter (weight) will affect the behavior of all the others!

Inefficient Tinkering
1. Individual Weight Adjustments: Imagine adjusting each weight between nodes one at a time. Either raise it a little, or lower it a little. Adjustment Strategy: If the adjustment results in less error, keep it. Otherwise reject it. Repeat, Repeat, Repeat...
2. Benefits and Problems:
Benefit: Will reduce the error over time.
Problem 1: Will take too much time, especially if there are billions of weights.
Problem 2: Involves a lot of guesswork, which is inefficient.

The fundamental trick in deep learning is to use this score as a feedback signal to adjust the value of the weights a little, in a direction that will lower the loss score for the current example (see figure 1.9). This adjustment is the job of the optimizer, which implements what’s called the Backpropagation algorithm: the central algorithm in deep learning.
1. Activation Functions: Introduce non-linearity, allowing the network to learn complex patterns.
2. Training: Involves repeatedly adjusting weights using gradient descent to minimize a loss function.
3. Learning Process
   1. Forward Propagation: Input data is passed through the network, layer by layer, and predictions are made.
   2. Backpropagation: The process of adjusting weights to minimize error by propagating the error backward through the network.
○ Automated Weight Adjustment: A systematic method to update weights based on errors. Ensures a more consistent and efficient learning process.
○ Error Backpropagation: The error is propagated backward through the network from the output layer to the input layer. Each layer’s contribution to the error is computed. Weights are adjusted in each layer to minimize this contribution, reducing overall error.
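The contrast between tinkering and a gradient-guided update can be sketched with the smallest possible case: one weight, a Mean Squared Error loss, and plain gradient descent. This is a simplified illustration, not the full backpropagation algorithm; in a multi-layer network, backpropagation is what supplies this gradient for every weight. The data, the one-weight model, the three learning rates and the step counts are all assumptions chosen for the example.

```python
# Toy data generated by the rule y = 2 * x, so the ideal weight is 2.0.
# (The data and the one-weight model are illustrative assumptions.)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def mse_loss(w):
    """Mean squared error of the one-weight model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(w):
    """Slope of the loss with respect to w: mean of 2 * (w*x - y) * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def train(learning_rate, steps, w=0.0):
    """Repeatedly step the weight against the gradient of the loss."""
    for _ in range(steps):
        w = w - learning_rate * mse_gradient(w)
    return w

good = train(learning_rate=0.05, steps=50)        # settles close to 2.0
too_small = train(learning_rate=0.001, steps=50)  # moves toward 2.0, but slowly
too_large = train(learning_rate=0.2, steps=20)    # overshoots further on every step

for name, w in [("good", good), ("too small", too_small), ("too large", too_large)]:
    print(f"{name}: w = {w:.4f}, loss = {mse_loss(w):.4f}")
```

Unlike tinkering, no guess is made about whether to raise or lower the weight: the sign and size of the gradient give the direction and the learning rate sets the step size, which is why a rate that is too large overshoots and one that is too small converges slowly.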
SUMMARY: Initially, the weights of the network are assigned random values, so the network merely implements a series of random transformations. Naturally, its output is far from what it should ideally be, and the loss score is accordingly very high. But with every example the network processes, the weights are adjusted a little in the correct direction, and the loss score decreases. This is the training loop, which, repeated a sufficient number of times (typically tens of iterations over thousands of examples), yields weight values that minimize the loss function. A network with a minimal loss is one for which the outputs are as close as they can be to the targets: a trained network.
○ Weight Updates: Weights are updated iteratively (i.e., the network continuously “steps” down) until the network converges to a minimum error. Convergence is achieved when subsequent weight updates (i.e., subsequent “steps”) result in minimal changes to the loss function. Regularization techniques like “dropout” help prevent overfitting during this process.
Remember: training does not continue until a perfect point is reached, only until the result is "close enough".
--------------------------------------------------------------------------------------------------------------------------
Backpropagation in Artificial Neural Networks
1. Process:
○ Requires global knowledge of the network’s state, which is mathematically computed and applied uniformly across the network.
2. Effectiveness:
○ Works well in artificial systems due to controlled environments and precise mathematical operations.
3. Memory (I can remember all the past conversations = building the context of the conversation), speed.
4. Behavior of human beings (what machines want to imitate).
5. Machines only understand numbers (0001: the position of the 1 marks the word it is going to say). Remember Softmax (it produces the probability distribution).
6. A word is chosen based on the probabilities (each logit corresponds to a word), but the machine does not always choose the one with the highest probability.
7. Problem: memorization.
8. We use logarithms to weight the penalty: the loss is based on the logarithm of the probability given to the correct word (as in Cross-Entropy), so bigger mistakes get a bigger penalty, and training aims for the lowest number.
9. Words are represented as lists of numbers when programming an LLM.
10. By words, I mean any token, so punctuation too, and commas, and the start symbol and the end symbol. All of these I'll call tokens, but let's just say words.
--------------------------------------------------------------------------------------------------------------------------
What we do when talking is just produce the next word, produce the next word, until it reaches an end. That's what the machines want to imitate. A pretty good idea, in order to build context, is of course just to add a little bit of memory, just not real memory. That's kind of like a fake-ish kind of memory, but still memory nonetheless, which is: I don't just remember what was last said. I can remember all the things that were last said. And so as I'm producing new words, I'm kind of building up what the context of the conversation is. So still the same idea. You have a begin sentence symbol, you have an end sentence symbol, and you produce words, one at a time. The big difference here is what gets included in the input to the machine. The idea is that it's a series of words or series of symbols, starting off with one, the start symbol. So this is the key part of what the engineering problem is now. The engineering problem says: okay, we need to produce sentences. We'll do this one word at a time. It's a little bit different than a word. We're going to switch to talking about one token at a time, because sometimes what it produces is a little sub-part of the word. Sometimes it's an entire word. Sometimes it's punctuation.
○ how do we make the machine