IB Computer Science Paper 3 Guide: The Perfect Chatbot

Table of Contents
1. Practice Paper #1
2. Practice Paper #1 Sample Answers
3. Practice Paper #2
4. Practice Paper #2 Sample Answers
5. Vocabulary Bank
6. Study Guide
7. Useful Links for Further Research

For any questions, join the CS Classroom Discord server.

Practice Paper 1

1. a. Identify two hyperparameters that could be adjusted to improve the performance of a recurrent neural network.
b. Define long short-term memory.

2. a. Currently, RAKT's technical team is proposing that the company use a cluster of CPUs, taken from consumer-grade computers in a company warehouse, to train the LLM that will produce their new chatbot's responses.
i. Outline one advantage of using TPUs to train the LLM.
ii. Outline one disadvantage of using TPUs to train the LLM.
b. After some research, you have decided to propose that RAKT utilize a Natural Language Understanding (NLU) pipeline in conjunction with an LLM-based neural network. Outline two advantages of using an NLU pipeline in this context.

3. Training RNNs using BPTT (backpropagation through time) can cause the vanishing gradient problem (Line 91). Explain why this is the case.

4. While RAKT is excited to deploy an LLM-based chatbot, the management team is extremely concerned about how the chatbot is able to deal with the emotional aspect of communicating with their customers.
Customers often make queries with the intention of filing claims after traumatic or even life-threatening events, and this requires an appropriate, empathetic response, while still communicating any relevant information.
Discuss what steps the technical team would have to take to make sure that the chatbot consistently delivers such a response.

Practice Paper 1: Sample Answers

1. a. Identify two hyperparameters that could be adjusted to improve the performance of a recurrent neural network.

Two hyperparameters that could be adjusted include the number of layers and the learning rate of the network.

Other possible answers include the batch size (the number of samples processed before the model is updated), the number of neurons per layer, the type of activation functions used, the type of loss function, and the initial values of the hidden state.

b. Define long short-term memory.

Long short-term memory is a type of recurrent neural network that utilizes layers of LSTM cells, which have their own state and sets of "gates" that help control the flow of information through their respective cell. These networks help mitigate the vanishing gradient problem that often arises in recurrent neural networks.
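To make these answers concrete, here is a minimal sketch, assuming PyTorch (the guide does not prescribe a framework), of an LSTM-based network whose depth and learning rate are exposed as the kinds of tunable hyperparameters Q1a asks about:

```python
# Minimal sketch, assuming PyTorch; sizes and values are illustrative only.
import torch
import torch.nn as nn

class TinyLSTM(nn.Module):
    def __init__(self, vocab_size=1000, hidden_size=64, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        # Each LSTM layer is built from cells with their own cell state and gates.
        self.lstm = nn.LSTM(hidden_size, hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)
        out, (hidden_state, cell_state) = self.lstm(x)
        return self.head(out)

model = TinyLSTM(num_layers=2)                             # hyperparameter: number of layers
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # hyperparameter: learning rate
```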
2. a. Currently, RAKT's technical team is proposing that the company use a cluster of CPUs, taken from consumer-grade computers in a company warehouse, to train the LLM that will produce their new chatbot's responses.

i. Outline one advantage of using TPUs to train the LLM.

One advantage of TPUs is that their architecture is specifically designed and optimized for matrix-based computations, which are extremely common in neural networks. This means that TPUs could perform calculations much more quickly than the equivalent number of CPUs.

ii. Outline one disadvantage of using TPUs to train the LLM.

While miniature versions of Google's TPU have recently gone on sale, TPUs are only available at the scale required to train and deploy an LLM through the Google Cloud platform. This means that the company is entirely dependent on a third-party company for the operation of its chatbot. Moreover, any training data or data from queries would have to flow through Google's computer systems, which could raise privacy concerns, especially if information pertaining to insurance claims is medical in nature.

b. After some research, you have decided to propose that RAKT utilize a Natural Language Understanding (NLU) pipeline in conjunction with an LLM-based neural network. Outline two advantages of using an NLU pipeline in this context.

The first advantage is that the NLU pipeline can provide a sophisticated, contextual representation of textual input before it even reaches the input layer of the neural network, which can improve the neural network's ability to correctly process text and produce an appropriate output.

The second advantage is that the NLU pipeline is a modular, highly customizable system. Different specialized modules, each representing a different method of analyzing the target text, can be swapped in and out to produce exactly the type of input that is required by the neural network to produce optimal output.

3. Training RNNs using BPTT (backpropagation through time) can cause the vanishing gradient problem (Line 91). Explain why this is the case.

The vanishing gradient problem occurs when gradients, which represent the extent to which loss changes relative to a weight, bias, or hidden state in an RNN, become extremely small. This is a problem because gradients determine how much these parameters will change during the training process, and an extremely small gradient means a lack of change in the parameters of the neural network and therefore a lack of training.

This occurs in an RNN during BPTT because loss, and by extension gradients, are calculated at the end of every time step for every layer in the RNN. Moreover, as we progress backwards through the network, each gradient is multiplied by the gradient in the previous layer. As we progress through time steps, the gradient is obtained for each layer by multiplying by the gradient for the same layer from the previous time step.

Since the derivatives of common activation functions (like sigmoid or tanh) are often less than 1, multiplying many such small values results in an exponentially smaller gradient. This leads to the gradients vanishing over many time steps, making it difficult for the network to learn long-term dependencies because the updates to the weights become negligibly small.

To summarize, the vanishing gradient problem in RNNs during BPTT is due to the repeated multiplication of small derivatives over many time steps, causing the gradients to diminish exponentially and resulting in minimal parameter updates and inadequate training.
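A toy numerical sketch of this effect (values are illustrative, not from the case study): the derivative of the sigmoid function never exceeds 0.25, so multiplying in one such factor per time step drives the gradient toward zero.

```python
# Toy demonstration of the vanishing gradient: one sigmoid derivative per time step.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

grad = 1.0
for t in range(50):                          # 50 time steps of BPTT
    z = 0.5                                  # illustrative pre-activation value
    grad *= sigmoid(z) * (1 - sigmoid(z))    # sigmoid'(z) <= 0.25 everywhere
print(f"gradient after 50 time steps: {grad:.1e}")   # on the order of 1e-32
```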
4. While RAKT is excited to deploy an LLM-based chatbot, the management team is extremely concerned about how the chatbot is able to deal with the emotional aspect of communicating with their customers.
Customers often make queries with the intention of filing claims after traumatic or even life-threatening events, and this requires an appropriate, empathetic response, while still communicating any relevant information.
Discuss what steps the technical team would have to take to make sure that the chatbot consistently delivers such a response.

There are a number of steps, with regard to the architecture of the machine learning network, the training dataset, and human oversight of the chatbot, that can be taken to ensure the "appropriateness" of the chatbot's responses to its human users, taking their emotional state into consideration.

Firstly, from the standpoint of the neural network's architecture, it is extremely important for the neural network to be able to analyze the language of the user's textual input and responses and not only understand the full scope of the situation, but also the sentiment of the user. Thus, the team needs to utilize tools that will allow for the most precise understanding of this text. There are a couple of ways to accomplish this.

First, the team could establish an NLU (Natural Language Understanding) pipeline. This would allow them to essentially preprocess any input, but to a much greater extent than a simple "bag-of-words" algorithm. By the end of an NLU pipeline, text could be represented in vector form or by another data structure, but with values that signify linguistic aspects of the text that will allow the neural network to subsequently better understand the overall meaning and context of the text.

Second, we would want to utilize a neural network such as a TNN that understands the text contextually, taking into account the relationships of all the words in the sentence with each other, before generating a response. The reason for a TNN is the fact that it makes use of a self-attention mechanism (multiple, in fact) that does just that, allowing us to best understand the text and, accordingly, produce the best possible response. Ultimately, its ability to engage in pragmatic analysis will be key.

Moving beyond the architecture, the training dataset is an equally important part of this task. We would want a real dataset that is representative of a diverse set of conversations with an emotional dimension to them to train our neural network. Ideally, this training dataset should not come from previous exchanges between the chatbot and users, but rather from text exchanges between humans, where human operators are showing the kind of empathy and sensitivity to emotions in tragic circumstances that we would like our chatbot to embody. Ultimately, our chatbot can only achieve this if it is trained on previous examples of optimal behavior.

Additionally, once the neural network has been trained on such a dataset, the company should have employees routinely audit the chatbot to make sure that it is producing optimal output. This could be done simply by asking the chatbot a diverse array of questions and assessing the resulting responses. Based on these responses, the chatbot can be further trained to achieve the desired behavior.

Overall, correctly engaging with humans in difficult situations is difficult even for other humans, let alone for chatbots. However, with a combination of correct architectural choices, diverse but targeted training, and continual human assessment, this is not an impossible task for RAKT's technical team to accomplish.
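As a rough illustration of the NLU-pipeline idea in this answer, here is a hedged sketch; the stages, word lists, and function name are invented for illustration, and a real pipeline would use NLP libraries and trained models rather than keyword matching:

```python
# Illustrative-only NLU-style preprocessing: lexical, syntactic, and sentiment cues.
def nlu_pipeline(text: str) -> dict:
    tokens = text.lower().split()                                # lexical analysis
    negated = any(t in {"not", "never", "no"} for t in tokens)   # crude syntactic cue
    distress_words = {"accident", "injured", "emergency", "trauma"}
    sentiment = "distressed" if distress_words & set(tokens) else "neutral"
    # The structured result, not raw text, is what the neural network would receive.
    return {"tokens": tokens, "negated": negated, "sentiment": sentiment}

print(nlu_pipeline("I was injured in an accident and need to file a claim"))
```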
Practice Paper 2

1. a. Define synthetic data.
b. Outline the use of the self-attention mechanism in a transformer neural network.

2. a. As a summer intern, you are helping RAKT's technical team plan and prototype the machine learning network that will be used to train and deploy their new LLM-based chatbot.
Explain how you could employ the "critical path" algorithm to help them minimize the latency in their machine learning network.
b. The company has purchased a large quantity of graphical processing units (GPUs) that they will connect to form a cluster. They intend to use this cluster to train the LLM that will power their new chatbot.
Outline two possible bottlenecks that could prevent full utilization of the cluster's capabilities.

3. Long short-term memory (LSTM) is a type of recurrent neural network that minimizes the vanishing gradient problem.
With reference to key components in this type of neural network, explain how it is able to do this.

4. Transformer neural networks (TNNs) are widely considered to be superior to recurrent neural networks (RNNs) for a variety of tasks related to Natural Language Processing (NLP).
With reference to specific technologies, to what extent is this true?

Practice Paper 2: Sample Answers

1. a. Define synthetic data.

Synthetic data is data that is produced through a simulation or algorithm and is meant to reflect a real-life scenario. Such data is often used to train neural networks, particularly when it is either difficult or costly to obtain sufficient real data.

b. Outline the use of the self-attention mechanism in a transformer neural network.

The self-attention mechanism in transformer neural networks calculates attention weights that capture the linguistic and contextual relationships between every word in a sentence or phrase. These weights are used to adjust the vector representation of each word, enhancing the overall understanding of the input text. This mechanism allows the model to consider the entire context when processing each word, leading to more accurate and meaningful representations and improved performance in NLP tasks.

2. a. As a summer intern, you are helping RAKT's technical team plan and prototype the machine learning network that will be used to train and deploy their new LLM-based chatbot.
Explain how you could employ the "critical path" algorithm to help them minimize the latency in their machine learning network.

To minimize latency in the machine learning network for training and deploying RAKT's chatbot, you will first need to break down the process into discrete tasks such as data preprocessing, model training, hyperparameter tuning, deployment setup, etc.

Then, determine the dependencies between these tasks, like model training depending on data preprocessing, and deployment setup depending on model evaluation.

Once you have done this, you can apply the critical path algorithm to identify the "longest" path from start to end, representing the minimum time needed for project completion.

Focus on optimizing tasks on this critical path by allocating more computational resources or more efficient algorithms, such as powerful GPUs for model training or efficient data preprocessing techniques. Ensure tasks not on the critical path and without dependencies are executed in parallel to avoid unnecessary delays.
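A minimal sketch of the critical-path computation on a hypothetical task graph (the task names, durations, and dependencies are invented for illustration):

```python
# Longest-path ("critical path") length on a small, invented task graph.
from functools import lru_cache

durations = {"preprocess": 4, "train": 10, "tune": 3, "evaluate": 2, "deploy": 1}
deps = {"train": ["preprocess"], "tune": ["train"],
        "evaluate": ["tune"], "deploy": ["evaluate"]}

@lru_cache(maxsize=None)
def earliest_finish(task):
    # A task can start only after all of its dependencies have finished.
    start = max((earliest_finish(d) for d in deps.get(task, [])), default=0)
    return start + durations[task]

# The critical path length is the largest earliest-finish time across all tasks.
print(max(earliest_finish(t) for t in durations))   # 20 time units
```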
b. The company has purchased a large quantity of graphical processing units (GPUs) that they will connect to form a cluster. They intend to use this cluster to train the LLM that will power their new chatbot.
Outline two possible bottlenecks that could prevent full utilization of the cluster's capabilities.

The first bottleneck would be the network that is used to connect these GPUs. In order to work together collectively to complete a task, GPUs in a cluster need to be able to exchange information with each other and synchronize their operations over a network. Even if each GPU is individually very efficient, a slow network will inhibit their ability to communicate and therefore slow down the entire cluster's progress in completing the given task, like training an LLM.

The second bottleneck would be the availability of cooling equipment. GPUs generate a lot of heat when performing mathematical computations. GPUs generally come with a mechanism that throttles their performance, and therefore reduces the speed at which they are able to perform computations, if they get too hot. The reason behind this is to prevent the GPU from getting damaged. The only way to prevent this throttling and maximize the GPUs', and therefore the overall GPU cluster's, performance is to use air conditioning or some other method to keep the cluster as cool as possible.

3. Long short-term memory (LSTM) is a type of recurrent neural network that minimizes the vanishing gradient problem.
With reference to key components in this type of neural network, explain how it is able to do this.

In LSTMs, the cell state is essential for calculating the hidden state, which is then used to generate the final output. This output is used to calculate the loss, which in turn is used to compute the gradients for the weights and biases of each layer at each time step. The input, forget, and output gates in LSTMs control the flow of information by selectively adding relevant new information and removing irrelevant or outdated information from the cell state. This targeted updating of the cell state ensures that it remains stable and less prone to rapid changes, which helps preserve the magnitude of the gradient during training, thereby mitigating the vanishing gradient problem.

4. Transformer neural networks (TNNs) are widely considered to be superior to recurrent neural networks (RNNs) for a variety of tasks related to Natural Language Processing (NLP).
With reference to specific technologies, to what extent is this true?

I would say that TNNs are indeed superior to RNNs in accomplishing NLP-based tasks, for a number of reasons.

Firstly, TNNs have the ability to simultaneously and holistically analyze an entire sequence of text, such as a sentence, rather than processing it word by word as RNNs do. This is achieved through the self-attention mechanism, which allows TNNs to attend to all words in a sequence at once and determine the relevance of each word to every other word. This simultaneous analysis enables TNNs to capture complex dependencies and contextual relationships more effectively than RNNs, which are limited by their sequential nature.

Secondly, TNNs inherently address the vanishing gradient problem, which is a significant issue in RNNs. In RNNs, the gradient diminishes exponentially as it backpropagates through many time steps, making it difficult to learn long-term dependencies. TNNs, on the other hand, do not rely on sequential processing and instead use positional encodings to maintain the order of words. This design, combined with the self-attention mechanism, allows TNNs to propagate gradients more effectively, thus mitigating the vanishing gradient problem.

Thirdly, TNNs can more effectively address long-term dependencies due to their self-attention mechanism. In RNNs, long-term dependencies are challenging to capture because the information needs to pass through multiple steps, often leading to information loss.
TNNs, however, can directly connect distant words in a sequence through self-attention, which evaluates the relevance of each word to all other words regardless of their distance in the sequence. This ability to capture long-range dependencies makes TNNs particularly powerful for tasks such as machine translation and text summarization.

Lastly, TNN operations can be more effectively distributed across multiple processing units, allowing them to complete tasks more efficiently. The parallelizable nature of the self-attention mechanism means that TNNs can take advantage of modern hardware, such as GPUs and TPUs, to perform computations concurrently. This contrasts with the inherently sequential processing of RNNs, which limits their ability to leverage parallel computation fully. Consequently, TNNs can process large datasets and train much faster, which is crucial for scaling up NLP applications.

In conclusion, TNNs are indeed superior to RNNs in accomplishing NLP-based tasks due to their holistic text analysis, inherent mitigation of the vanishing gradient problem, effective handling of long-term dependencies, and efficient parallel computation capabilities. These advantages have made TNNs the preferred choice for a wide range of NLP applications, from translation and summarization to question answering and sentiment analysis.

Vocabulary Bank

Backpropagation through time (BPTT) - gradient-based technique for training RNNs by unfolding them in time and applying backpropagation to change all the parameters in the RNN

Batch size - the number of training examples utilized in one forward/backward pass through the network, before the loss and subsequently the gradients are calculated

Bag-of-words - a text representation method in NLP where a document is represented as a vector of word frequencies, ignoring grammar and word order

Biases - systematic errors in a dataset that can lead to unfair outcomes in a model
  Confirmation - bias where data is collected or interpreted to confirm pre-existing beliefs
  Historical - bias that reflects past prejudices or inequalities present in historical data
  Labeling - bias introduced by subjective or inconsistent annotation of training data
  Linguistic - bias due to the unequal representation or usage of language variations (dialects, register, etc.) in the dataset
  Sampling - bias arising from non-representative samples that do not reflect the true diversity of the population
  Selection - bias introduced when certain data points are preferentially chosen over others, affecting the model's fairness and accuracy

Dataset - a collection of data used for training or evaluating machine learning models

Deep learning - a subset of machine learning involving neural networks with many layers that can learn representations of data

Graphical processing unit (GPU) - a specialized hardware component designed to handle and accelerate parallel processing tasks, particularly effective for rendering graphics and training deep learning models by performing simultaneous computations across multiple cores

Hyperparameter tuning - the process of optimizing the parameters that govern the training process of machine learning models to improve performance

Large language model (LLM) - a type of AI model trained on vast amounts of text data to understand and generate human-like text

Latency - the delay between the input to a system and the corresponding output
Learning rate - controls the size of the steps the model takes when updating its parameters during training; if the learning rate is increased, the weights and biases of the network are updated more significantly in each iteration

Long short-term memory (LSTM) - a type of RNN designed to remember information for long periods and mitigate the vanishing gradient problem

Long-term dependency - refers to the challenge in sequence models, like recurrent neural networks (RNNs), of capturing and utilizing information from earlier in the input sequence to make accurate predictions at later time steps

Loss function - a function that measures the difference between the predicted output and the actual output, guiding model training

Memory cell state - in LSTM networks, the cell state carries long-term memory through the network, allowing it to retain information across time steps

Natural language processing (NLP) - the field of AI focused on the interaction between computers and human language
  Discourse integration - understanding and maintaining coherence across multiple sentences or turns in conversation
  Lexical analysis - the process of examining the structure of words
  Pragmatic analysis - understanding language in context, including the intended meaning and implications
  Semantic analysis - the process of understanding the meaning of words and sentences
  Syntactical analysis (parsing) - analyzing the grammatical structure of sentences

Natural language understanding (NLU) - a modular set of systems that sequentially process text input to better represent its meaning before it is input into a neural network such as a transformer NN or LSTM

Pre-processing - the process of cleaning and preparing raw data for analysis or model training

Recurrent neural network (RNN) - a type of neural network designed to handle sequential data by maintaining a hidden state that captures information from previous time steps

Self-attention mechanism - a technique in neural networks where each element of the input sequence considers or focuses on every other element, determining their relevance or importance, which improves the model's ability to capture dependencies and relationships within the sequence

Synthetic data - data that is artificially generated rather than obtained by direct measurement

Tensor processing unit (TPU) - a type of hardware accelerator specifically designed by Google to speed up machine learning workloads

Transformer neural network (transformer NN) - a type of neural network architecture that relies on self-attention mechanisms to process input data in parallel, rather than sequentially like RNNs

Vanishing gradient - a problem in training deep neural networks where gradients diminish exponentially as they are backpropagated through the network, impeding learning

Weights - the parameters in a neural network that are adjusted during training to minimize the loss function

Study Guide

- The Scenario
  - Insurance company (RAKT) uses a chatbot to handle customer queries
  - Customer feedback indicates poor chatbot performance
  - You are a student intern hired to recommend improvements to the chatbot based on six areas of concern
- Problems to Be Addressed
  1. Latency - The chatbot's response time is slow and detracts from the customer experience.
  2. Linguistic nuances - The chatbot's language model is struggling to respond appropriately to ambiguous statements.
  3. Architecture - The chatbot's architecture is too simplistic and unable to handle complex language.
  4. Dataset - The chatbot's training dataset is not diverse enough, leading to poor accuracy in understanding and responding to customer queries.
  5. Processing power - The system's computational capability is a limiting factor.
  6. Ethical challenges - The chatbot does not always give appropriate advice and is prone to revealing personal information from its training dataset.

Intro to Machine Learning and Neural Networks
- What is machine learning?
  - Machine learning is when we combine data and algorithms to make predictions about future behavior.
  - We programmatically analyze past data to find patterns that are indicative of what will happen in the future.
  - Examples
    - Image Recognition (FaceID)
    - Speech/Audio Recognition (Siri/Shazam)
    - Natural Language Processing (Google Translate)
    - Recommendation Systems (Netflix)
    - Pattern Detection/Classification (Fraud Detection/Customer Segmentation)
  - To do these things, we just need data, a computer, and a programming language.
- The Machine Learning Process
  - Select a machine learning model.
    - Example: Linear Regression, Decision Tree, k-Nearest Neighbor algorithms, Neural Networks, etc.
  - Train the model with input data and the result of each input.
    - Example: Give the algorithm images of animals along with the type of animal, so the algorithm can associate images with specific characteristics with specific animals.
  - Put your input data into the model.
    - Example: Provide a set of images of animals.
  - Receive the predicted output.
  - Depending on the model, use any incorrect predictions to improve the model.
- Parts of a Neural Network
  - Input Layers - the input layer accepts data either for training purposes or to make a prediction
  - Hidden Layers (Memory) - the hidden layers are responsible for actually deciding what the output is for a given input; this is also where the "training" occurs
  - Output Layers - output the final prediction
- Standard (Feedforward) Neural Network Training Process (a toy training loop follows this list)
  1. Feeding Data - The network starts by taking in data through the input layer.
     - In our example, each criterion for an individual student is input into an individual neuron corresponding to that criterion.
  2. Making Predictions - Data flows from the input through any hidden layers to the output layer, where the network makes its initial prediction, such as whether or not the student will graduate.
  3. Calculating Errors - After making a prediction, the network checks it against the correct answer (known from the training data).
     - The difference between the prediction and the correct answer is calculated using a loss function.
     - The loss function measures how wrong the network's predictions are; the goal is to make this error as small as possible.
  4. Learning From Mistakes (Backpropagation) - Backpropagation is like the network reflecting on its errors and figuring out how to adjust its neurons' calculations to make better predictions next time.
     - It updates the settings (weights) inside the network that determine how much influence one neuron has on another.
  5. Repeating the Process - This whole process (inputting data, making predictions, calculating errors, and adjusting using backpropagation) is repeated with many examples.
     - Each cycle through the data is called an "epoch," and with each epoch, the network gets better at its task.
  6. Evaluating Performance - After several epochs, the network's performance is evaluated to see if it is improving and making accurate predictions on new examples.
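The toy training loop promised above, assuming PyTorch (the guide does not name a framework; the data is random and purely illustrative):

```python
# Toy feedforward training cycle: predict, compute loss, backpropagate, update.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X, y = torch.randn(32, 3), torch.randn(32, 1)   # illustrative training data
for epoch in range(10):                         # each full pass is one epoch
    prediction = model(X)                       # steps 1-2: feed data, predict
    loss = loss_fn(prediction, y)               # step 3: calculate the error
    optimizer.zero_grad()
    loss.backward()                             # step 4: backpropagation
    optimizer.step()                            # update the weights and biases
```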
- Backpropagation
  1. Input training data.
  2. For each set of inputs, calculate the loss.
  3. For each set of inputs, calculate the gradient.
  4. Input the gradient into the gradient descent function to update the weights and biases in the neural network.
- Hidden Layer
  - Weights - parameters that adjust the strength of input signals between neurons in different layers of a neural network. They are critical for learning, as they change during training to improve the network's predictions.
  - Bias - a parameter added to the weighted input that shifts the activation function, allowing the neural network to better fit complex patterns. It acts like an intercept in a linear equation, providing flexibility in the neuron's output.
  - Activation Function - a function applied to the weighted input (plus bias) to produce the neuron's output, introducing the non-linearity that lets the network fit complex patterns.
    - Sigmoid: outputs between 0 and 1, useful for probabilities.
    - ReLU: passes only positive values, enhancing computational efficiency.
- Gradient
  - We calculate the loss in order to understand how much to adjust the weights and biases in the network.
  - We must calculate the gradient based on the loss function for each weight and bias (i.e., 15 gradients in the example network).
  - The gradient is a measure of how sensitive the amount of loss is to changes in a weight or bias.
  - It involves calculating the derivative of the loss function.
- Gradient Descent Function (a toy numeric example follows below)
  - Once we have calculated the gradient, we input it into a gradient descent function.
  - This is an algorithm that allows us to minimize the loss function, so that we can find the weights and biases for every connection that will give us the result we want.
  - The algorithm automatically updates these parameters in the NN.
- Gradients + Complications with More Layers
  - Gradient calculations become much more complicated when multiple layers are involved.
  - Because layers work together to produce output, gradients from a previous layer are taken into account when calculating the gradient of the next layer.
  - Mathematically, they are multiplied by each other, due to a mathematical principle called the chain rule, but the math is much more complicated.
- Vanishing Gradient Problem
  - Happens when the gradients become very small
  - Makes updates too small, stopping training
  - Causes
    - Use of the sigmoid function (for activation) - taking the derivative of sigmoid functions to calculate the gradient can lead to very small gradients
    - Small initial weights - very small initial weights lead to proportionally smaller losses, which lead to smaller gradients
    - Many layers - each new gradient is calculated by multiplying in the gradient from the previous layer, using a mathematical principle called the chain rule. This means that small gradients (think decimals) can lead to even smaller gradients as we pass through more and more layers.
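The toy example promised above: gradient descent on a one-parameter loss function (the loss and values are illustrative; a real network repeats this update for every weight and bias):

```python
# Gradient descent on loss(w) = (w - 3)^2, whose minimum is at w = 3.
w, learning_rate = 0.0, 0.1
for step in range(50):
    gradient = 2 * (w - 3)         # derivative of the loss with respect to w
    w -= learning_rate * gradient  # step against the gradient to reduce loss
print(round(w, 3))                 # ~3.0: the weight converges to the minimum
```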
- Datasets
  - Training - used to train a neural network to produce the desired output
    - Contains input data with known output
  - Validation - used for hyperparameter tuning
    - Used to detect adjustments that will improve the performance of the NN
  - Testing - used to evaluate the performance of the NN
    - Also contains known output
    - Testing data must not overlap with the training dataset
- Hyperparameters
  - Aspects of the architecture of a NN that can be changed to affect performance
  1. Number of layers - the number of hidden layers in a neural network
     ➔ More layers can lead to more precision
     ➔ More layers can lead to the vanishing gradient problem
     ➔ More layers require more memory and processing power
  2. Learning rate - how dramatically weights are changed in response to calculated gradients
     ➔ A faster learning rate can lead to the NN learning to produce the correct response more quickly
     ➔ A faster learning rate can also lead the NN to stop learning too soon, ultimately giving a suboptimal solution
- Hyperparameter Tuning
  - Selection of Hyperparameters - involves choosing which aspects of the neural network to adjust.
    - Examples: learning rate, the number of layers, the number of neurons in each layer, the type of activation functions, and the batch size.
  - Trial and Error Process - involves experimenting with different combinations of hyperparameters. This can be a time-consuming process of trial and error, as the optimal settings usually depend heavily on the specific data and task.
  - Goal of Tuning - increased accuracy, efficiency, and generalization to new data.
  - Evaluation - success is measured using a validation set of data, or through cross-validation techniques, separate from the training and test datasets.

Recurrent Neural Networks
- Why RNNs?
  - A feedforward neural network cannot remember what it learns
    - Forgets information between iterations
    - Does not allow for output based on previous input and results
    - Such capabilities are crucial for text generation
- RNNs: The Process (see the sketch after the BPTT list below)
  - Designed to process sequences of data by maintaining a hidden state that captures information from previous time steps
    - Time step - corresponds to the point at which the RNN reads one element (one word) of the input sequence (sentence), updates its hidden state, and produces an output
  - The Process
    - Input Sequence - the RNN processes one element of the input sequence at a time.
    - Hidden State Update - at each time step, the RNN updates its hidden state based on the current input and the previous hidden state.
    - Output Generation - the updated hidden state is used to generate the output for the current time step.
    - Propagation Through Time - this process repeats for each element in the input sequence, allowing information to be passed through time.
- RNNs: Use Cases
  - Autocomplete - can predict the next word in a sentence based on the context of the previous words
  - Machine translation - used in neural machine translation systems to convert text from one language to another by learning the sequence of words and their meanings in both languages
  - Chatbots - used in conversational agents to generate human-like responses based on the context of the conversation history
- Hidden State - a vector that is updated after each time step using the input and the previous hidden state
- Backpropagation Through Time (BPTT)
  1. Forward Pass - process the sequence and store hidden states and outputs.
  2. Unroll the Network - visualize the RNN as multiple layers across time steps.
  3. Compute the Loss - calculate the loss at each time step and sum them up.
     - Loss - refers to the difference between expected and actual output for one time step
     - Loss function - used to calculate loss based on expected and actual output
  4. Backward Pass
     a) Compute gradients of the loss with respect to outputs, hidden states, weights, and biases.
     b) Accumulate (sum) these gradients over the time steps.
  5. Update Weights - adjust the weights and biases by inputting the accumulated gradients into a gradient descent function.
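The sketch promised above: one RNN time step in NumPy (sizes and values are illustrative), showing how the hidden state is updated from the current input and the previous hidden state:

```python
# Minimal RNN time step: h_t = tanh(W_xh @ x_t + W_hh @ h_prev + b).
import numpy as np

rng = np.random.default_rng(1)
hidden_size, input_size = 4, 3
W_xh = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden weights
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # The hidden state carries information forward from previous time steps.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):        # a 5-element input sequence
    h = rnn_step(x_t, h)                            # one update per time step
print(h)
```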
- RNNs vs. Standard Neural Network (Feedforward Neural Network) [comparison table in the original document]
- BPTT (Backpropagation Through Time) vs. Standard Backpropagation [comparison table in the original document]
- Vanishing Gradient Problem + RNNs
  - Gradient Calculation - due to the chain rule mentioned earlier, we end up multiplying any gradient by the gradient at the same layer in the previous time step
  - Hidden State - calculating the gradient involves taking the derivative of loss with respect to the hidden state, which requires taking a derivative of the activation function
  - Time Steps vs. Layers - a fixed number of layers for gradient calculations in FFNNs, but layers * time_steps in RNNs
- RNNs: Pros and Cons
  - Pros
    - Sequence Handling - designed to handle sequential data
    - Memory - capability to retain information from previous inputs due to their internal state
    - Flexibility - can be applied to various types of sequential data, including text, audio, video, and time series data
  - Cons
    - Vanishing and Exploding Gradients
    - Training Time - can be computationally intensive and time-consuming, particularly for long sequences or large datasets, due to the sequential nature of the data processing
    - Difficulty Capturing Long-term Dependencies
    - Complexity in Parallelization - RNNs process data sequentially, which makes it challenging to parallelize the training process and leads to slower training times

Long Short-Term Memory (LSTMs)
- Description
  - Type of RNN; addresses the vanishing gradient problem
  - Contains a typical input and output layer, but instead of the hidden layer neurons found in RNNs and FFNNs, contains "LSTM layers"
  - Each LSTM layer is made of "LSTM cells"
  - Each cell contains a series of "gates", which are really additional mathematical functions in the hidden layer neurons used to appropriately process information
- LSTM Cells
  - Cell State - acts as the long-term memory of the LSTM cell. It carries relevant information throughout the sequence of data and is modified by gates to add or remove information
  - Input Gate - decides which values are updated in the cell state
  - Forget Gate - decides what information is discarded from the cell state
  - Output Gate - decides what information from the cell state is used to generate the output
- How do LSTM cells work? (see the sketch after this list)
  1. Inputs are the hidden state from the same layer in the previous time step, the cell state from the same cell in the previous time step, and the input from the previous layer (the current hidden state multiplied by the weights of connections with cells in the previous layers).
  2. Forget gate - the hidden state and input are combined in a single vector that passes through the sigmoid function to produce a value between 0 and 1; the result is multiplied by the previous cell state.
  3. Input gate - the combined hidden state and input vector passes through both a tanh and a sigmoid function, and the results of both of these functions are multiplied. This product is added to the cell state.
  4. Output gate - the combined hidden state and input vector passes through a sigmoid function, and the result is multiplied by the result of the cell state passing through a tanh function. This product is output to the hidden state.
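The sketch promised above: one LSTM cell step in NumPy, following the four steps literally (the weights are random and illustrative; real implementations use learned weights and separate bias terms):

```python
# One LSTM cell step: forget, input, and output gates updating the cell state.
import numpy as np

rng = np.random.default_rng(0)
H = 4                                        # hidden size (illustrative)
Wf, Wi, Wc, Wo = (rng.normal(size=(H, 2 * H)) for _ in range(4))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])          # step 1: combine hidden state + input
    f = sigmoid(Wf @ z)                      # step 2: forget gate (values in 0..1)
    i = sigmoid(Wi @ z) * np.tanh(Wc @ z)    # step 3: input gate contribution
    c = f * c_prev + i                       # targeted cell-state update
    o = sigmoid(Wo @ z)                      # step 4: output gate
    h = o * np.tanh(c)                       # new hidden state
    return h, c

h, c = lstm_step(rng.normal(size=H), np.zeros(H), np.zeros(H))
```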
- LSTM + Cell State (Example)
  - Early in the Sentence - the cell state might store information about the subject (e.g., "The cat"). This helps maintain agreement between subject and verb later in the sentence.
  - Mid-Sentence - the cell state might track the structure of the sentence (e.g., "The cat sat on"). This helps predict the next word in the context of the ongoing phrase.
  - Later in the Sentence - the cell state retains relevant details and context that have accumulated (e.g., "The cat sat on the"). This helps in predicting that the next word could be "mat" given the context.
- Vanishing Gradient Problem
  - The cell state is used to calculate the hidden state, which is used to generate the final output, which is used to calculate loss.
  - Loss is used to calculate the gradient for the weights and biases of each layer at each time step.
  - The gates (input, forget, and output gates) in LSTMs control the information flow by selectively adding relevant new information and removing irrelevant or outdated information from the cell state.
  - Because the cell state is updated in a targeted manner, it remains more stable and less prone to rapid changes.
  - This helps preserve the magnitude of the gradient during training.

Transformer Neural Networks (TNNs)
- Generative Pre-trained Transformers (GPTs)
  - Original TNNs introduced in the 2017 Google paper "Attention Is All You Need" for translating text
  - Emphasized use of the self-attention mechanism, which led to better performance and parallelization (relative to RNNs and LSTMs)
  - GPT-1 had 117 million parameters; GPT-3 has 175 billion parameters
- Transformer Neural Networks (TNNs): Key Aspects
  - Processes all words in a sentence simultaneously
  - Positional encodings - mathematical values generated through the use of mathematical functions to indicate the position of a word in a sentence
  - Self-attention mechanism - adds "weight" (a mathematical multiplier) to each word in a sentence based on importance
  - Multi-head attention - applies the self-attention mechanism to different parts of the sentence simultaneously during processing, resulting in different perspectives on word relationships and interactions, which are later combined
  - Consists of multiple layers, including feed-forward networks, which are the target of training
- TNNs: Architecture
  - Example: convert a sentence from English to French (occurs word by word)
  1. Input Embeddings - each word in a sentence is turned into a vector that captures its meaning
  2. Positional Encoding - incorporate information about the position of each word into the same vector
  3. Encoder Layers (~6 layers) - each layer includes a self-attention mechanism, a feed-forward layer, and output normalization; focuses on accurately representing the text
  4. Decoder Layers (~6 layers) - each layer includes a self-attention mechanism, a feed-forward layer, and output normalization; only focuses on producing the next token
  5. Output Layer - generates the next word in the output sentence
- Self-Attention Mechanism
  - Adds "weight" (a mathematical multiplier) to each word in a sentence based on importance
  - Allows the NN to discern the importance of words relative to each other
  - Also provides insight into relationships between words
  - Allows the different vectors for words to interact with each other
- Obtaining Attention Weights (see the sketch below)
  - We calculate attention weights based on the relationship of each word with every other word (grammatical, contextual, etc.).
  - We use vectors representing different grammatical patterns and relationships, as well as the words themselves, to make these calculations.
  - We modify each vector based on its attention weights, which are usually generated in a table.
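A toy sketch of that attention-weight calculation (NumPy, one head, tiny dimensions; real transformers additionally use learned query/key/value projections, which this omits):

```python
# Toy self-attention: a table of attention weights re-weights each word vector.
import numpy as np

def self_attention(X):
    # X holds one embedding vector per word (rows = words).
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)    # relevance of every word to every other word
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax -> attention table
    return weights @ X               # context-aware vector for each word

X = np.random.default_rng(2).normal(size=(4, 8))    # 4 words, 8-dim embeddings
print(self_attention(X).shape)       # (4, 8): one adjusted vector per word
```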
- Attention Weights (Example) [worked example shown as a table in the original document]
- Residual Connections (see the one-line sketch below)
  - A shortcut path that skips one or more layers in the network and adds the input of the skipped layer directly to its output
  - Used extensively within both encoder and decoder layers
  - Provide a shortcut path for gradients that bypasses one or more layers
  - The gradient can flow directly through these connections without being diminished by the transformations (activations, weight multiplications) in the intermediate layers
  - The choice of layers to skip is dictated by experimentation, somewhat like hyperparameter tuning
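A one-function sketch of the residual idea (illustrative; `layer` stands in for any transformation such as an attention or feed-forward block):

```python
# Residual connection: the block's input is added directly to its output,
# giving gradients a shortcut path around the intermediate transformation.
def residual_block(x, layer):
    return x + layer(x)

print(residual_block(10.0, lambda v: 0.1 * v))   # 10 + layer(10) = 11.0
```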
- TNN Advantages over RNNs
  - Parallelization
    - Why True: RNNs process input data sequentially, where each step depends on the output of the previous step, making it impossible to parallelize effectively. In contrast, TNNs, particularly Transformer models, use self-attention mechanisms that allow each token to be processed independently of the others. This independence enables parallel processing of the entire sequence.
    - Benefit: This significantly speeds up training and inference compared to RNNs.
  - Long-term Dependencies
    - Why True: RNNs struggle with long-term dependencies due to the vanishing gradient problem, where gradients diminish as they are backpropagated through many layers. TNNs, with their self-attention mechanisms, can directly connect distant tokens in the sequence, making it easier to capture long-range dependencies.
    - Benefit: This improves the model's ability to learn relationships in data that span long distances.
  - Reduced Training Times
    - Why True: Due to parallelization and efficient handling of dependencies, TNNs can process multiple tokens simultaneously, reducing the time needed for training. RNNs' sequential nature inherently limits their training speed.
    - Benefit: This efficiency is crucial for training large models on large datasets.
  - Scalability
    - Why True: TNNs can scale more effectively because their architecture allows for more straightforward parallelization and optimization. The independent computation of attention scores across tokens and layers means that the workload can be distributed across multiple GPUs or TPUs.
    - Benefit: This scalability enables TNNs to tackle large datasets and complex tasks more effectively than RNNs.

Processing Power
- What is processing power?
  - Computational Capacity - the ability of the hardware (CPU, GPU, TPU) to perform a large number of complex calculations quickly, measured in terms of FLOPS (floating-point operations per second)
  - Memory Resources - the availability of sufficient RAM and VRAM to handle large models and data efficiently, ensuring smooth processing and quick access to necessary information
  - Efficiency and Speed - the capability to manage high throughput and low latency, allowing for rapid data processing and real-time response generation while optimizing energy consumption
- What are LLMs?
  - Massive Neural Network: An LLM is a neural network with billions of parameters designed to understand and generate human-like text from vast amounts of data.
  - Natural Language Processing (NLP): LLMs are essential for NLP tasks like text completion, translation, summarization, sentiment analysis, and answering questions by leveraging learned patterns and structures.
  - Contextual Understanding: These models generate contextually relevant responses, maintaining coherent conversations and producing human-like text based on prompts.
- Neural Networks: Main Tasks
  - Preprocessing - preparing raw data for training the LLM by cleaning, transforming, and organizing it into a suitable format
    - Cleaning
      - Description: Removing noise and irrelevant information from the dataset.
      - Example: Eliminating missing values, correcting inconsistencies, and removing duplicate entries to ensure data quality.
    - Selection
      - Description: Choosing relevant data and features for analysis and model training.
      - Example: Filtering out unimportant features and selecting a subset of the data that is representative and relevant to the problem being solved.
    - Transformation
      - Description: Converting data into a suitable format for analysis and model training.
      - Example: Normalizing numerical values, encoding categorical variables, and applying feature engineering techniques to create new features.
    - Reduction of Data
      - Description: Decreasing the volume of data while retaining important information.
      - Example: Selecting a smaller subset of data samples to speed up processing and reduce computational costs.
  - Training the Model - teaching the LLM to understand and generate human-like text by optimizing its parameters on a large dataset
  - Deploying the Model - making the trained LLM available for use in real-world applications
- Bag-of-Words Algorithms (see the sketch after this section)
  - Tokenization - text is split into individual words (tokens), often removing punctuation and common stop words like "and" and "the" to focus on meaningful words
  - Vocabulary Creation - a collection of all unique words in the corpus (text) is created, with each word assigned a unique index
  - Vectorization - each document is represented as a vector of word counts, where the vector length equals the vocabulary size, and each element corresponds to the count of a specific word in the document
  - Advantages
    - Straightforward: The BoW algorithm is simple to understand and easy to implement. It involves basic operations such as tokenization and counting word occurrences.
    - Minimal Preprocessing: Requires minimal preprocessing of text data, making it accessible and quick to deploy in various applications.
    - No Need for Grammar Knowledge: Does not require knowledge of grammar or language structure, which simplifies its application across different languages.
    - Low Computational Complexity: Involves simple counting operations and vector representations, making it computationally efficient.
    - Handles Large Datasets: Efficiently handles large datasets due to its simplicity and use of sparse matrix representations.
    - Parallel Processing: Can easily be parallelized, with different parts of the text processed simultaneously to enhance speed.
  - Disadvantages
    - No Order Information: BoW ignores the order of words in the text, leading to a loss of syntactic and semantic information. For example, "dog bites man" and "man bites dog" would have the same representation.
    - No Contextual Understanding: The algorithm fails to capture the context in which words appear, which can be critical for understanding meaning in natural language.
    - Resource Intensive: For large corpora, the vocabulary can become extremely large, leading to high-dimensional vectors. This can make the model computationally expensive and memory-intensive.
    - Sensitivity to Irrelevant Words: High-frequency words that are not meaningful (e.g., "the", "and") can dominate the vector representation unless explicitly removed.
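The sketch promised above: a minimal bag-of-words implementation in plain Python, which also demonstrates the no-order-information disadvantage (both example sentences vectorize identically):

```python
# Minimal bag-of-words: tokenize, build a vocabulary, vectorize by word counts.
docs = ["the dog bites the man", "the man bites the dog"]
stop_words = {"the"}                                  # drop common stop words

vocab = sorted({w for d in docs for w in d.split() if w not in stop_words})

def vectorize(doc):
    words = [w for w in doc.split() if w not in stop_words]
    return [words.count(v) for v in vocab]            # counts only; order is lost

print(vocab)                          # ['bites', 'dog', 'man']
print([vectorize(d) for d in docs])   # [[1, 1, 1], [1, 1, 1]]: identical vectors
```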
- Graphical Processing Units (GPUs)
  - Multiple Cores - GPUs have thousands of specialized cores for handling many tasks simultaneously, excelling in parallel processing compared to CPUs with fewer, more powerful cores for sequential tasks.
  - Fast Data Transfer - high memory bandwidth allows rapid data transfer between the GPU and its memory, essential for large datasets and complex computations in deep learning and simulations.
  - Large VRAM - GPUs feature large VRAM for storing and quickly accessing data, reducing latency and enhancing performance.
  - Programmability - frameworks like NVIDIA's CUDA and OpenCL enable custom coding to leverage GPU parallel processing for various applications beyond graphics.
- Tensor Processing Units (TPUs)
  - Custom-designed application-specific integrated circuits (ASICs) developed by Google specifically to accelerate machine learning workloads, particularly deep learning tasks
  - Each TPU unit has 8 cores
  - Each core has between 8 and 32 GB of RAM associated with it
  - Optimized Architecture: TPUs have a unique architecture tailored to efficiently perform large matrix multiplications and other operations common in deep learning.
  - Parallelism: TPUs can handle massive amounts of parallel computations, which significantly speeds up the training and inference of large machine learning models.
  - High Bandwidth Memory (HBM): TPUs use high-speed memory to store large amounts of data close to the processing units, reducing latency and increasing throughput.
  - Power Consumption: TPUs are designed to deliver high performance with lower energy consumption, making them more power-efficient for intensive machine learning workloads.
  - Thermal Design: Their specialized design often leads to better thermal efficiency, allowing them to perform heavy computations with less heat generation.
  - Distributed Processing: TPUs are designed to work in large-scale clusters ("pods"), allowing for the distribution of training tasks across many TPUs. This scalability supports the training of extremely large models on massive datasets.
  - Usage
    - Large-Scale Model Training - TPUs are used to train very large neural networks, such as those in natural language processing (NLP) and computer vision, much faster than would be possible with GPUs or CPUs.
    - Real-Time Inference - TPUs provide low-latency inference for deployed machine learning models, making them suitable for applications that require real-time decision-making, such as autonomous driving and live video analysis.
    - Research and Development - researchers use TPUs to experiment with new model architectures and training techniques, taking advantage of their high computational power to iterate quickly.
- GPUs vs. TPUs [comparison table in the original document]
- Clustering + LLMs
  - Advantages
    - Increased Computational Power - clustering multiple GPUs or TPUs provides substantial computational power, enabling the training of very large language models that would be infeasible on a single unit
    - Scalability - clusters can be scaled up or down based on workload requirements, allowing for flexible resource management and efficient handling of varying demands
    - Reduced Training Time - distributing the training process across multiple units significantly reduces the time required to train large models by parallelizing computations
    - High Throughput - clusters can handle large volumes of data simultaneously, improving throughput for both training and inference tasks
    - Fault Tolerance - clusters can provide redundancy, where the failure of a single unit does not halt the entire training process, thus improving reliability and uptime
  - Disadvantages
    - Complexity in Setup and Management - setting up and managing a cluster of GPUs or TPUs involves significant complexity, including configuring networking, synchronization, and software environments.
    - High Cost - clustering multiple high-end GPUs or TPUs can be very expensive, both in terms of initial hardware investment and ongoing operational costs, such as power and cooling.
    - Communication Overhead - distributing tasks across multiple units introduces communication overhead, which can limit the efficiency gains from parallel processing, especially if the network bandwidth is insufficient.
    - Software and Framework Compatibility - ensuring compatibility and optimizing performance across all units in the cluster can be challenging, requiring specialized knowledge and effort to tune software and frameworks.
    - Energy Consumption - running large clusters consumes a significant amount of power, contributing to higher operational costs and potential environmental impact.
- Training + Processing Power
  - Model Complexity and Size
    - Number of Parameters: Larger models with more parameters (e.g., billions in LLMs) require significantly more computational resources.
    - Architectural Complexity: Advanced architectures with more layers and sophisticated components, such as transformers, increase processing power requirements.
  - Dataset Characteristics
    - Size: Larger datasets necessitate more processing power to handle and process the increased volume of data.
    - Quality: High-quality datasets that need extensive preprocessing, cleaning, and augmentation can add to computational demands.
  - Hardware Utilization
    - GPU/TPU Availability: The number and quality of GPUs/TPUs available directly affect the processing power and training speed.
    - Efficiency: Utilizing hardware-specific optimizations and accelerators can significantly reduce processing power requirements.
  - Model Architecture
    - Transformer Variants: Different architectures (e.g., BERT, GPT, T5) have varying computational requirements. The design of attention mechanisms, feedforward layers, and other components impacts processing power.
    - Custom Layers and Operations: Inclusion of specialized layers or operations can add to the computational burden.
- Deployment + Processing Power
  - Inference Latency and Throughput
    - Latency: The time required to produce a result after receiving an input. Low-latency requirements demand more processing power for real-time responses.
    - Throughput: The number of inferences the model can handle per second. High-throughput applications require significant computational resources to maintain performance.
  - Model Size and Complexity
    - Parameters: Larger models with more parameters require more processing power and memory for inference.
    - Architecture: More complex architectures may involve additional computations, increasing the processing power needed.
  - Hardware Utilization
    - GPUs/TPUs: Effective use of specialized hardware can significantly reduce inference time and processing power needs.
    - Accelerators: Utilizing hardware accelerators designed for specific tasks can improve efficiency and performance.
  - Batch Size - the number of inputs processed simultaneously affects computational load
    - Larger batch sizes can improve throughput but also increase the memory and processing power needed

Datasets
- Real Data vs. Synthetic Data
  - Real Data
    - Description: Collected from real-world events, transactions, or observations.
    - Example: Customer transaction records, sensor readings, user interactions on a website.
  - Synthetic Data (a toy generator follows this section)
    - Description: Generated artificially using algorithms or simulations, designed to mimic the statistical properties of real data.
    - Example: Simulated user behavior on a website, generated medical records for training purposes.
- Real Data: Advantages and Disadvantages
  - Advantages
    - Authenticity and Relevance - real data accurately reflects real-world scenarios, providing genuine insights for analysis and model training.
    - Diverse and Complex - captures natural variability and complexity, including rare events and edge cases, which are crucial for robust model performance.
    - Credibility and Trust - higher confidence in results and insights derived from real data, as it is based on actual observations and experiences.
  - Disadvantages
    - Collection Challenges - gathering real data can be expensive and time-consuming, requiring significant resources for data collection, storage, and management.
    - Quality Issues - real data can contain inaccuracies, inconsistencies, and noise, requiring extensive cleaning and preprocessing to ensure quality.
    - Privacy and Legal Concerns - access to real data may be restricted due to privacy concerns, legal regulations, or proprietary restrictions, limiting its availability.
- Synthetic Data: Advantages and Disadvantages
  - Advantages
    - Cost-Effective - generating synthetic data is often less expensive than collecting and labeling real-world data, allowing for budget-friendly scalability and rapid data production.
    - Privacy-Safe - synthetic data does not represent real individuals, eliminating privacy concerns and enabling easier data sharing and compliance with data protection regulations.
    - Customizable and Balanced - it can be tailored to specific needs, ensuring balanced datasets and inclusion of rare or extreme cases, which helps in building more robust machine learning models.
  - Disadvantages
    - Lack of Realism - synthetic data may not fully capture the complexity and nuance of real-world scenarios, leading to models that might not generalize well to real-world applications.
    - Complex Generation Process - creating high-quality synthetic data requires sophisticated algorithms and domain expertise, which can be technically challenging and resource-intensive.
    - Skepticism and Regulatory Hurdles - stakeholders may be skeptical of models trained on synthetic data, and regulatory bodies may not accept synthetic data for compliance purposes in certain industries like healthcare and finance.
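The toy generator promised above: template-based synthetic insurance queries (the templates and word lists are invented for illustration; real synthetic-data pipelines are far more sophisticated):

```python
# Toy synthetic-data generator: fabricated insurance queries from templates.
import random

templates = ["I want to file a {kind} claim after {event}.",
             "How do I update my {kind} policy?"]
kinds = ["auto", "home", "health"]
events = ["an accident", "a storm", "a theft"]

def make_query():
    template = random.choice(templates)
    return template.format(kind=random.choice(kinds),
                           event=random.choice(events))

synthetic_dataset = [make_query() for _ in range(5)]   # no real customers involved
print(synthetic_dataset)
```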
- Biases
  - Confirmation Bias
    - Description: Confirmation bias occurs when the dataset favors a particular viewpoint or hypothesis, leading to skewed model predictions.
    - Example: A customer service chatbot is trained only on queries related to a specific type of insurance policy, leading it to poorly handle queries about other policies.
    - Solution: Ensure the training data is diverse and representative of all possible viewpoints or scenarios. Incorporate data augmentation techniques and perform regular audits to identify and mitigate biases.
  - Historical Bias
    - Description: Historical bias arises when the training data reflect outdated information, failing to account for recent changes or trends.
    - Example: An NLP model trained on customer service queries from five years ago may not understand or accurately respond to current slang or new types of customer inquiries.
    - Solution: Regularly update the training data to include recent information and trends. Use techniques such as transfer learning to adapt models to new data efficiently.
  - Labeling Bias
    - Description: Labeling bias occurs when the labels applied to data are subjective, inaccurate, or incomplete, affecting the model's performance.
    - Example: Customer queries labeled too generically (e.g., "general inquiry") prevent the model from learning specific intents, leading to poor prediction accuracy.
    - Solution: Implement a detailed and consistent labeling process, involving multiple annotators to cross-validate labels. Use tools to detect and correct labeling inconsistencies.
  - Linguistic Bias
    - Description: Linguistic bias happens when the dataset is biased toward specific linguistic features, such as formal language, neglecting variations in dialects or vocabulary.
    - Example: A dataset composed mainly of formal written language may cause a model to struggle with interpreting informal speech or regional dialects.
    - Solution: Include diverse linguistic styles and dialects in the training data. Utilize techniques like data augmentation to simulate informal language and dialects.
  - Sampling Bias
    - Description: Sampling bias occurs when the training dataset is not representative of the entire population, leading to biased model outcomes.
    - Example: Training data that only include queries from young adults may cause a model to perform poorly with queries from older adults.
    - Solution: Ensure the training dataset is representative of the entire target population. Use stratified sampling to maintain diversity across various demographic groups.