IBCS - Paper 3 - Case Study (2025) - Cram Guide

Full Transcript


IB Computer Science Paper 3 Guide: The Perfect Chatbot

Table of Contents
[Click below to immediately navigate to desired topic]
1. Practice Paper #1
2. Practice Paper #1 Sample Answers
3. Practice Paper #2
4. Practice Paper #2 Sample Answers
5. Vocabulary Bank
6. Study Guide
7. Useful Links for Further Research

For any questions, join the CS Classroom Discord server here.

Practice Paper 1

1.
a. Identify two hyperparameters that could be adjusted to improve the performance of a recurrent neural network.
b. Define long short-term memory.

2.
a. Currently, RAKT's technical team is proposing that the company use a cluster of CPUs, taken from consumer-grade computers in a company warehouse, to train the LLM that will produce their new chatbot's responses.
i. Outline one advantage of using TPUs to train the LLM.
ii. Outline one disadvantage of using TPUs to train the LLM.
b. After some research, you have decided to propose that RAKT utilize a Natural Language Understanding (NLU) pipeline in conjunction with an LLM-based neural network.
Outline two advantages of using an NLU pipeline in this context.

3. Training RNNs using BPTT (backpropagation through time) can cause the vanishing gradient problem (Line 91).
Explain why this is the case.

4. While RAKT is excited to deploy an LLM-based chatbot, the management team is extremely concerned about how the chatbot is able to deal with the emotional aspect of communicating with their customers.
Customers often make queries with the intention of filing claims after traumatic or even life-threatening events, and this requires an appropriate, empathetic response, while still communicating any relevant information.
Discuss what steps the technical team would have to take to make sure that the chatbot consistently delivers such a response.

Practice Paper 1: Sample Answers

1.
a. Identify two hyperparameters that could be adjusted to improve the performance of a recurrent neural network.

Two hyperparameters that could be adjusted include the number of layers and the learning rate of the network.

Other possible answers include the batch size (the number of samples processed before the model is updated), the number of neurons per layer, the type of activation functions used, the type of loss function, and the initial values of the hidden state.

b. Define long short-term memory.

Long short-term memory is a type of recurrent neural network that utilizes layers of LSTM cells, which have their own state and sets of "gates" that help control the flow of information through their respective cell. These networks help mitigate the vanishing gradient problem that often arises in recurrent neural networks.

2.
a. Currently, RAKT's technical team is proposing that the company use a cluster of CPUs, taken from consumer-grade computers in a company warehouse, to train the LLM that will produce their new chatbot's responses.
i. Outline one advantage of using a cluster of TPUs to train the LLM.

One advantage of TPUs is that their architecture is specifically designed and optimized for matrix-based computations, which are extremely common in neural networks. This means that TPUs could perform calculations much more quickly than the equivalent number of CPUs.

ii. Outline one disadvantage of using TPUs to train the LLM.

While miniature versions of Google's TPU have recently gone on sale, TPUs are only available at the scale required to train and deploy an LLM through the Google Cloud platform. This means that the company is entirely dependent on a third-party company for the operation of its chatbot. Moreover, any training data or data from queries would have to flow through Google's computer systems, which could raise privacy concerns, especially if information pertaining to insurance claims is medical in nature.

b. After some research, you have decided to propose that RAKT utilize a Natural Language Understanding (NLU) pipeline in conjunction with an LLM-based neural network.
Outline two advantages of using an NLU pipeline in this context.

The first advantage is that the NLU pipeline can provide a sophisticated, contextual representation of textual input before it even reaches the input layer of the neural network, which can improve the neural network's ability to correctly process text and produce an appropriate output.

The second advantage is that the NLU pipeline is a modular, highly customizable system. Different specialized modules, which each represent a different method of analyzing the target text, can be swapped in and out to produce exactly the type of input that is required by the neural network to produce optimal output.

3. Training RNNs using BPTT (backpropagation through time) can cause the vanishing gradient problem (Line 91).
Explain why this is the case.

The vanishing gradient problem occurs when gradients, which represent the extent to which the loss changes relative to a weight, bias, or hidden state in an RNN, become extremely small. This is a problem because the gradient determines how much these parameters will change during the training process, and an extremely small gradient means a lack of change in the parameters of the neural network and therefore a lack of training.

This occurs in an RNN during BPTT because loss, and by extension gradients, are calculated at the end of every time step for every layer in the RNN. Moreover, as we progress backwards through the network, each gradient is multiplied by the gradient in the previous layer. As we progress through time steps, the gradient is obtained for each layer by multiplying by the same gradient for the layer from the previous time step.

Since the derivatives of common activation functions (like sigmoid or tanh) are often less than 1, multiplying many such small values results in an exponentially smaller gradient.
This leads to the gradients vanishing over many time steps, making it difficult for the network to learn long-term dependencies because the updates to the weights become negligibly small.

To summarize, the vanishing gradient problem in RNNs during BPTT is due to the repeated multiplication of small derivatives over many time steps, causing the gradients to diminish exponentially and resulting in minimal parameter updates and inadequate training.
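To make that repeated multiplication concrete, here is a minimal NumPy sketch (my addition, not part of the guide) that multiplies one sigmoid-derivative factor and one recurrent-weight factor per time step; the toy pre-activations and the 0.9 weight are invented purely for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # never exceeds 0.25

# Pretend each of 30 time steps contributes one sigmoid-derivative factor
# (times a recurrent weight) to the gradient flowing back through the RNN.
rng = np.random.default_rng(0)
pre_activations = rng.normal(size=30)   # toy pre-activation values
recurrent_weight = 0.9                  # toy recurrent weight

gradient = 1.0
for t, z in enumerate(pre_activations, start=1):
    gradient *= sigmoid_derivative(z) * recurrent_weight
    if t % 10 == 0:
        print(f"after {t:2d} time steps, gradient factor = {gradient:.2e}")
```

Because the sigmoid derivative can never exceed 0.25, the product collapses toward zero within a few dozen steps, which is exactly the effect described above.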
4. While RAKT is excited to deploy an LLM-based chatbot, the management team is extremely concerned about how the chatbot is able to deal with the emotional aspect of communicating with their customers.
Customers often make queries with the intention of filing claims after traumatic or even life-threatening events, and this requires an appropriate, empathetic response, while still communicating any relevant information.
Discuss what steps the technical team would have to take to make sure that the chatbot consistently delivers such a response.

There are a number of steps, with regard to the architecture of the machine learning network, the training dataset, and human oversight of the chatbot, that can be taken to ensure the "appropriateness" of the chatbot's responses to its human users, taking their emotional state into consideration.

Firstly, from the standpoint of the neural network's architecture, it is extremely important for the neural network to be able to analyze the language of the user's textual input and responses and not only understand the full scope of the situation, but also the sentiment of the user. Thus, the team needs to utilize tools that will allow for the most precise understanding of this text. There are a couple of ways to accomplish this.

First, the team could establish an NLU (Natural Language Understanding) pipeline. This would allow them to essentially preprocess any input, but to a much greater extent than a simple "bag-of-words" algorithm. By the end of an NLU pipeline, text could be represented in vector form or by another data structure, but with values that signify linguistic aspects of the text that will allow the neural network to subsequently better understand the overall meaning and context of the text.

Second, we would want to utilize a neural network such as a TNN that allows us to understand the text contextually, taking into account the relationships of all the words in a sentence with each other, before generating a response. The reason for a TNN is that it makes use of a self-attention mechanism (multiple, in fact) that does just that and allows us to best understand the text and, accordingly, produce the best possible response. Ultimately, its ability to engage in pragmatic analysis will be key.

Moving beyond the architecture, the training dataset is an equally important part of this task. We would want a real dataset that is representative of a diverse set of conversations with an emotional dimension to them to train our neural network. Ideally, this training dataset should not come from previous exchanges between the chatbot and users, but rather from text exchanges between humans, where human operators are showing the kind of empathy and sensitivity to emotions in tragic circumstances that we would like our chatbot to embody. Ultimately, our chatbot can only achieve this if it is trained on previous examples of optimal behavior.

Additionally, once the neural network has been trained on such a dataset, the company should have employees routinely audit the chatbot to make sure that it is producing optimal output. This could simply be done by asking the chatbot a diverse array of questions and assessing the resulting responses. Based on these responses, the chatbot can be further trained to achieve the desired behavior.

Overall, correctly engaging with humans in difficult situations is difficult for even other humans, let alone for chatbots. However, with a combination of correct architectural choices, diverse but targeted training, and continual human assessment, this is not an impossible task for RAKT's technical team to accomplish.

Practice Paper 2

1.
a. Define synthetic data.
b. Outline the use of the self-attention mechanism in a transformer neural network.

2.
a. As a summer intern, you are helping RAKT's technical team plan and prototype the machine learning network that will be used to train and deploy their new LLM-based chatbot.
Explain how you could employ the "critical path" algorithm to help them minimize the latency in their machine learning network.
b. The company has purchased a large quantity of graphical processing units (GPUs) that they will connect to form a cluster. They intend to use this cluster to train the LLM that will power their new chatbot.
Outline two possible bottlenecks that could prevent full utilization of the cluster's capabilities.

3. Long short-term memory (LSTM) is a type of recurrent neural network that minimizes the vanishing gradient problem.
With reference to key components in this type of neural network, explain how it is able to do this.

4. Transformer neural networks (TNNs) are widely considered to be superior to recurrent neural networks (RNNs) for a variety of tasks related to Natural Language Processing (NLP).
With reference to specific technologies, to what extent is this true?

Practice Paper 2: Sample Answers

1.
a. Define synthetic data.

Synthetic data is data that is produced through a simulation or algorithm and is meant to reflect a real-life scenario. Such data is often used to train neural networks, particularly when it is either difficult or costly to obtain sufficient real data.

b. Outline the use of the self-attention mechanism in a transformer neural network.

The self-attention mechanism in transformer neural networks calculates attention weights that capture the linguistic and contextual relationships between every word in a sentence or phrase. These weights are used to adjust the vector representation of each word, enhancing the overall understanding of the input text. This mechanism allows the model to consider the entire context when processing each word, leading to more accurate and meaningful representations and improved performance in NLP tasks.
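As a supplement to this answer (my addition, not from the guide), the sketch below shows single-head scaled dot-product self-attention in NumPy; the sequence length, embedding size, random projection matrices, and toy embeddings are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # relevance of each word to every other word
    weights = softmax(scores, axis=-1)        # attention weights, one row per word
    return weights @ V, weights               # context-adjusted word vectors

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                       # e.g. a 5-word sentence, 8-dim embeddings
X = rng.normal(size=(seq_len, d_model))       # toy word embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
context, attn = self_attention(X, Wq, Wk, Wv)
print(attn.round(2))                          # each row sums to 1
```

Each row of `attn` holds one word's attention weights over every word in the sequence, and the weighted sum of value vectors is the context-adjusted representation described in the answer.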
2.
a. As a summer intern, you are helping RAKT's technical team plan and prototype the machine learning network that will be used to train and deploy their new LLM-based chatbot.
Explain how you could employ the "critical path" algorithm to help them minimize the latency in their machine learning network.

To minimize latency in the machine learning network for training and deploying RAKT's chatbot, you will first need to break down the process into discrete tasks such as data preprocessing, model training, hyperparameter tuning, deployment setup, etc.

Then, determine the dependencies between these tasks, such as model training depending on data preprocessing, and deployment setup depending on model evaluation.

Once you have done this, you can apply the critical path algorithm to identify the "longest" path from start to end, representing the minimum time needed for project completion.

Focus on optimizing the tasks on this critical path by allocating more computational resources or more efficient algorithms, such as powerful GPUs for model training or efficient data preprocessing techniques. Ensure that tasks not on the critical path and without dependencies are executed in parallel to avoid unnecessary delays.
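A minimal sketch of that computation follows (my addition, not from the guide); the task names, durations, and dependencies are hypothetical, and the "longest path" is found over a topological ordering from Python's standard graphlib module.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks (hours) and dependencies for the pipeline described above.
duration = {"preprocess": 4, "train": 20, "tune": 8, "evaluate": 2,
            "setup_infra": 6, "deploy": 1}
depends_on = {"train": {"preprocess"}, "tune": {"train"},
              "evaluate": {"tune"}, "deploy": {"evaluate", "setup_infra"}}

earliest_finish, critical_parent = {}, {}
for task in TopologicalSorter(depends_on).static_order():
    preds = depends_on.get(task, set())
    start = max((earliest_finish[p] for p in preds), default=0)
    critical_parent[task] = max(preds, key=lambda p: earliest_finish[p]) if preds else None
    earliest_finish[task] = start + duration[task]

# Walk back from the task that finishes last to recover the critical path.
end = max(earliest_finish, key=earliest_finish.get)
path, node = [], end
while node is not None:
    path.append(node)
    node = critical_parent[node]
print("critical path:", " -> ".join(reversed(path)), "| total:", earliest_finish[end], "hours")
```

Here the off-path setup_infra task could run in parallel without delaying the project, which is exactly the point made above about non-critical tasks.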
b. The company has purchased a large quantity of graphical processing units (GPUs) that they will connect to form a cluster. They intend to use this cluster to train the LLM that will power their new chatbot.
Outline two possible bottlenecks that could prevent full utilization of the cluster's capabilities.

The first bottleneck would be the network that is used to connect these GPUs. In order to work together collectively to complete a task, GPUs in a cluster need to be able to exchange information with each other and synchronize their operations over a network. Even if each GPU is individually very efficient, a slow network will inhibit their ability to communicate and therefore slow down the entire cluster's progress in completing the given task, like training an LLM.

The second bottleneck would be the availability of cooling equipment. GPUs generate a lot of heat when performing mathematical computations. GPUs generally come with a mechanism that throttles their performance, and therefore reduces the speed at which they are able to produce computations, if they get too hot. The reason behind this is to prevent the GPU from getting damaged. The only way to prevent this throttling and maximize the GPUs', and therefore the overall GPU cluster's, performance is to use air conditioning or some other method to keep the cluster as cool as possible.

3. Long short-term memory (LSTM) is a type of recurrent neural network that minimizes the vanishing gradient problem.
With reference to key components in this type of neural network, explain how it is able to do this.

In LSTMs, the cell state is essential for calculating the hidden state, which is then used to generate the final output. This output is used to calculate the loss, which in turn is used to compute the gradients for the weights and biases of each layer at each time step. The input, forget, and output gates in LSTMs control the flow of information by selectively adding relevant new information and removing irrelevant or outdated information from the cell state. This targeted updating of the cell state ensures that it remains stable and less prone to rapid changes, which helps preserve the magnitude of the gradient during training, thereby mitigating the vanishing gradient problem.

4. Transformer neural networks (TNNs) are widely considered to be superior to recurrent neural networks (RNNs) for a variety of tasks related to Natural Language Processing (NLP).
With reference to specific technologies, to what extent is this true?

I would say that TNNs are indeed superior to RNNs in accomplishing NLP-based tasks for a number of reasons.

Firstly, TNNs have the ability to simultaneously and holistically analyze an entire sequence of text, such as a sentence, rather than processing it word by word as RNNs do. This is achieved through the self-attention mechanism, which allows TNNs to attend to all words in a sequence at once and determine the relevance of each word to every other word. This simultaneous analysis enables TNNs to capture complex dependencies and contextual relationships more effectively than RNNs, which are limited by their sequential nature.

Secondly, TNNs inherently address the vanishing gradient problem, which is a significant issue in RNNs. In RNNs, the gradient diminishes exponentially as it backpropagates through many time steps, making it difficult to learn long-term dependencies. TNNs, on the other hand, do not rely on sequential processing and instead use positional encodings to maintain the order of words. This design, combined with the self-attention mechanism, allows TNNs to propagate gradients more effectively, thus mitigating the vanishing gradient problem.

Thirdly, TNNs can more effectively address long-term dependencies due to their self-attention mechanism. In RNNs, long-term dependencies are challenging to capture because the information needs to pass through multiple steps, often leading to information loss. TNNs, however, can directly connect distant words in a sequence through self-attention, which evaluates the relevance of each word to all other words regardless of their distance in the sequence. This ability to capture long-range dependencies makes TNNs particularly powerful for tasks such as machine translation and text summarization.

Lastly, TNN operations can be more effectively distributed across multiple processing units, allowing them to complete tasks more efficiently. The parallelizable nature of the self-attention mechanism means that TNNs can take advantage of modern hardware, such as GPUs and TPUs, to perform computations concurrently. This contrasts with the inherently sequential processing of RNNs, which limits their ability to fully leverage parallel computation. Consequently, TNNs can process large datasets and train much faster, which is crucial for scaling up NLP applications.

In conclusion, TNNs are indeed superior to RNNs in accomplishing NLP-based tasks due to their holistic text analysis, inherent mitigation of the vanishing gradient problem, effective handling of long-term dependencies, and efficient parallel computation capabilities.
These advantages have made TNNs the preferred choice for a wide range of NLP applications, from translation and summarization to question answering and sentiment analysis.

Vocabulary Bank

Backpropagation through time (BPTT) - gradient-based technique for training RNNs by unfolding them in time and applying backpropagation to change all the parameters in the RNN

Batch size - the number of training examples utilized in one forward/backward pass through the network, before the loss and subsequently the gradients are calculated

Bag-of-words - A text representation method in NLP where a document is represented as a vector of word frequencies, ignoring grammar and word order.

Biases - Systematic errors in a dataset that can lead to unfair outcomes in a model
- Confirmation - Bias where data is collected or interpreted to confirm pre-existing beliefs
- Historical - Bias that reflects past prejudices or inequalities present in historical data
- Labeling - Bias introduced by subjective or inconsistent annotation of training data
- Linguistic - Bias due to the unequal representation or usage of language variations (dialects, register, etc.) in the dataset
- Sampling - Bias arising from non-representative samples that do not reflect the true diversity of the population
- Selection - Bias introduced when certain data points are preferentially chosen over others, affecting the model's fairness and accuracy

Dataset - A collection of data used for training or evaluating machine learning models

Deep learning - A subset of machine learning involving neural networks with many layers that can learn representations of data

Graphical processing unit (GPU) - A specialized hardware component designed to handle and accelerate parallel processing tasks, particularly effective for rendering graphics and training deep learning models by performing simultaneous computations across multiple cores

Hyperparameter tuning - The process of optimizing the parameters that govern the training process of machine learning models to improve performance

Large language model (LLM) - A type of AI model trained on vast amounts of text data to understand and generate human-like text

Latency - The delay between the input to a system and the corresponding output.

Learning rate - controls the size of the steps the model takes when updating its parameters during training; if the learning rate is increased, the weights and biases of the network are updated more significantly in each iteration

Long short-term memory (LSTM) - A type of RNN designed to remember information for long periods and mitigate the vanishing gradient problem

Long-term dependency - refers to the challenge in sequence models, like Recurrent Neural Networks (RNNs), of capturing and utilizing information from earlier in the input sequence to make accurate predictions at later time steps

Loss function - A function that measures the difference between the predicted output and the actual output, guiding model training.

Memory cell state - In LSTM networks, the cell state carries long-term memory through the network, allowing it to retain information across time steps

Natural language processing - The field of AI focused on the interaction between computers and human language
- Discourse integration - Understanding and maintaining coherence across multiple sentences or turns in conversation
- Lexical analysis - The process of examining the structure of words
- Pragmatic analysis - Understanding language in context, including the intended meaning and implications
- Semantic analysis - The process of understanding the meaning of words and sentences
- Syntactical analysis (parsing) - Analyzing the grammatical structure of sentences

Natural language understanding (NLU) - A modular set of systems that sequentially process text input to better represent its meaning before it is input into a neural network such as a transformer NN or LSTM

Pre-processing - The process of cleaning and preparing raw data for analysis or model training

Recurrent neural network (RNN) - A type of neural network designed to handle sequential data by maintaining a hidden state that captures information from previous time steps

Self-attention mechanism - A technique in neural networks where each element of the input sequence considers or focuses on every other element, determining their relevance or importance, which improves the model's ability to capture dependencies and relationships within the sequence

Synthetic data - Data that is artificially generated rather than obtained by direct measurement.

Tensor processing unit (TPU) - A type of hardware accelerator specifically designed by Google to speed up machine learning workloads

Transformer neural network (transformer NN) - A type of neural network architecture that relies on self-attention mechanisms to process input data in parallel, rather than sequentially like RNNs

Vanishing gradient - A problem in training deep neural networks where gradients diminish exponentially as they are backpropagated through the network, impeding learning

Weights - The parameters in a neural network that are adjusted during training to minimize the loss function.

Study Guide

- The Scenario
  - Insurance company (RAKT) uses a chatbot to handle customer queries
  - Customer feedback indicates poor chatbot performance
  - You are a student intern hired to recommend improvements to the chatbot based on six areas of concern
- Problems to Be Addressed
  1. Latency - The chatbot's response time is slow and detracts from the customer experience.
  2. Linguistic nuances - The chatbot's language model is struggling to respond appropriately to ambiguous statements.
  3. Architecture - The chatbot's architecture is too simplistic and unable to handle complex language.
  4. Dataset - The chatbot's training dataset is not diverse enough, leading to poor accuracy in understanding and responding to customer queries.
  5. Processing power - The system's computational capability is a limiting factor.
  6. Ethical challenges - The chatbot does not always give appropriate advice and is prone to revealing personal information from its training dataset.

Intro to Machine Learning and Neural Networks

- What is machine learning?
  - Machine learning is when we combine data and algorithms to make predictions about future behavior.
  - We programmatically analyze past data to find patterns that are indicative of what will happen in the future.
  - Examples
    - Image Recognition (FaceID)
    - Speech/Audio Recognition (Siri/Shazam)
    - Natural Language Processing (Google Translate)
    - Recommendation Systems (Netflix)
    - Pattern Detection/Classification (Fraud Detection/Customer Segmentation)
  - To do these things, we just need data, a computer, and a programming language.
- The Machine Learning Process
  - Select a machine learning model.
    - Example: Linear Regression, Decision Tree, k-Nearest Neighbor algorithms, Neural Networks, etc.
  - Train the model with input data and the result of each input.
    - Example: Give the algorithm images of animals along with the type of animal, so the algorithm can associate images with specific characteristics with specific animals.
  - Put your input data into the model.
    - Example: Provide a set of images of animals.
  - Receive the predicted output.
  - Depending on the model, use any incorrect predictions to improve the model.
- Parts of a Neural Network
  - Input Layers - the input layer accepts data either for training purposes or to make a prediction
  - Hidden Layers (Memory) - the hidden layers are responsible for actually deciding what the output is for a given input; this is also where the "training" occurs
  - Output Layers - outputs the final prediction
- Standard (Feedforward) Neural Network Training Process
  1. Feeding Data - The network starts by taking in data through the input layer.
    - In our example, each criterion for an individual student will be input into an individual neuron corresponding to the given criterion.
  2. Making Predictions - Data flows from the input through any hidden layers to the output layer, where the network makes its initial prediction, such as whether or not the student will graduate.
  3. Calculating Errors - After making a prediction, the network checks it against the correct answer (known from the training data).
    - The difference between the prediction and the correct answer is calculated using a loss function.
    - The loss function measures how wrong the network's predictions are for a single epoch; the goal is to make this error as small as possible.
  4. Learning From Mistakes (Backpropagation) - Backpropagation is like the network reflecting on its errors and figuring out how to adjust its neurons' calculations to make better predictions next time.
    - It updates the settings (weights) inside the network that determine how much influence one neuron has on another.
  5. Repeating the Process - This whole process (inputting data, making predictions, calculating errors, and adjusting using backpropagation) is repeated with many examples (images, in our case).
    - Each cycle through the data is called an "epoch," and with each epoch, the network gets better at its task.
  6. Evaluating Performance - After several epochs, the network's performance is evaluated to see if it is improving and accurately recognizing cats in different images.
- Backpropagation
  1. Input training data.
  2. For each set of inputs, calculate the loss.
  3. For each set of inputs, calculate the gradient.
  4. Input the gradient into the gradient descent function to update the weights and biases in the neural network (see the sketch below).
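The sketch below (my addition, not from the guide) runs that four-step loop for the simplest possible "network": a single weight fitting y = 3x. The data, the learning rate of 0.1, and the 20 epochs are arbitrary choices for illustration.

```python
import numpy as np

# Toy data: y = 3x plus noise. The "network" is a single weight w (no bias),
# so the maths stays readable; real networks repeat the same loop per parameter.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)

w = 0.0                 # initial weight
learning_rate = 0.1     # hyperparameter: how big each update step is

for epoch in range(20):
    prediction = w * x
    loss = np.mean((prediction - y) ** 2)          # step 2: loss function
    gradient = np.mean(2 * (prediction - y) * x)   # step 3: d(loss)/dw
    w -= learning_rate * gradient                  # step 4: gradient descent update
print(f"learned w = {w:.3f}, final loss = {loss:.4f}")
```

Each pass computes the loss, differentiates it with respect to the one weight, and nudges the weight against the gradient, which is all a gradient descent function does at larger scale.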
- Hidden Layer
  - Weights - parameters that adjust the strength of input signals between neurons in different layers of a neural network. They are critical for learning, as they change during training to improve the network's predictions.
  - Bias - a parameter added to the weighted input that shifts the activation function, allowing the neural network to better fit complex patterns. It acts like an intercept in a linear equation, providing flexibility in the neuron's output.
  - Activation Function - a function applied to a neuron's weighted input (plus bias) to produce its output, introducing non-linearity so the network can fit complex patterns.
    - Sigmoid: Outputs between 0 and 1, useful for probabilities.
    - ReLU: Passes only positive values, enhancing computational efficiency.
- Gradient
  - We calculate the loss in order to understand how much to adjust the weights and biases in the network.
  - We must calculate the gradient based on the loss function for each weight and bias (i.e. 15 gradients).
  - The gradient is a measure of how sensitive the amount of loss is to changes in weight and bias.
  - It involves calculating the derivative of the loss function.
- Gradient Descent Function
  - Once we have calculated the gradient, we input this into a gradient descent function.
  - This is an algorithm that allows us to minimize the loss function, so that we can find the weights and biases for every connection that will give us the result we want.
  - The algorithm automatically updates these parameters in the NN.
- Gradients + Complications with more Layers
  - Gradient calculations become much more complicated when multiple layers are involved.
  - Because layers work together to produce output, gradients from a previous layer are taken into account when calculating the gradient of the next layer.
  - Mathematically, they are multiplied by each other due to a mathematical principle called the chain rule, but the math is much more complicated.
- Vanishing Gradient Problem
  - Happens when the gradients become very small
  - Makes updates too small, stopping training
  - Causes
    - Use of the sigmoid function (for activation) - taking the derivative of sigmoid functions to calculate the gradient can lead to very small gradients (see the sketch after this list)
    - Small initial weights - very small initial weights lead to proportionally smaller losses, which lead to smaller gradients
    - Many layers - Each new gradient is calculated by multiplying by the gradient from the previous layer, using a mathematical principle called the chain rule. This means that small gradients (think decimals) can lead to even smaller gradients as we pass through more and more layers.
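As a small illustration of why the choice of activation function matters for the first cause above (my addition, not from the guide), the snippet below implements sigmoid and ReLU and evaluates their derivatives at a few sample points; the inputs are arbitrary.

```python
import numpy as np

def sigmoid(z):
    """Squashes any input into (0, 1); useful for probabilities."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Passes positive values through unchanged and zeroes out negatives."""
    return np.maximum(0.0, z)

z = np.linspace(-4, 4, 9)
sig_grad = sigmoid(z) * (1.0 - sigmoid(z))   # derivative of sigmoid, at most 0.25
relu_grad = (z > 0).astype(float)            # derivative of ReLU: 0 or 1

print("sigmoid'(z):", sig_grad.round(3))     # small everywhere, so gradients shrink
print("relu'(z):   ", relu_grad)             # exactly 1 for positive inputs
```

The sigmoid derivative never exceeds 0.25, so every sigmoid layer shrinks the backpropagated gradient, whereas ReLU passes a gradient of exactly 1 for positive inputs.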
- Datasets
  - Training - Used to train a neural network to produce the desired output
    - Contains input data with known output
  - Validation - Used for hyperparameter tuning - used to detect adjustments that will improve the performance of the NN
  - Testing - Used to evaluate the performance of the NN
    - Also contains known output
    - Testing data must not overlap with the training dataset
- Hyperparameters
  - Aspects of the architecture of a NN that can be changed to affect performance
  1. Number of layers - the number of hidden layers in a neural network
    ➔ More layers can lead to more precision
    ➔ More layers can lead to the vanishing gradient problem
    ➔ More layers require more memory and processing power
  2. Learning rate - how dramatically weights are changed in response to calculated gradients
    ➔ A faster learning rate can lead to the NN learning to produce the correct response more quickly
    ➔ A faster learning rate can also lead the NN to stop learning too soon, ultimately giving a suboptimal solution
- Hyperparameter Tuning
  - Selection of Hyperparameters - Involves choosing which aspects of the neural network to adjust.
    - Examples: learning rate, the number of layers, the number of neurons in each layer, the type of activation functions, and the batch size.
  - Trial and Error Process - Involves experimenting with different combinations of hyperparameters. This can be a time-consuming process of trial and error, as the optimal settings usually depend heavily on the specific data and task.
  - Goal of Tuning - Increased accuracy, efficiency, and generalization to new data.
  - Evaluation - Success is measured using a validation set of data, or through cross-validation techniques, separate from the training and test datasets.

Recurrent Neural Networks

- Why RNNs?
  - Feed-forward neural networks cannot remember what they learn
    - They forget information between iterations
    - They do not allow for output based on previous inputs and results
  - Such capabilities are crucial for text generation
- RNNs: The Process
  - Designed to process sequences of data by maintaining a hidden state that captures information from previous time steps
  - time step - corresponds to the point at which the RNN reads one element (one word) of the input sequence (sentence), updates its hidden state, and produces an output
  - The Process
    - Input Sequence - The RNN processes one element of the input sequence at a time.
    - Hidden State Update - At each time step, the RNN updates its hidden state based on the current input and the previous hidden state.
    - Output Generation - The updated hidden state is used to generate the output for the current time step.
    - Propagation Through Time - This process repeats for each element in the input sequence, allowing information to be passed through time.
- RNNs: Use Cases
  - Autocomplete - can predict the next word in a sentence based on the context of the previous words
  - Machine translation - used in neural machine translation systems to convert text from one language to another by learning the sequence of words and their meanings in both languages
  - Chatbots - used in conversational agents to generate human-like responses based on the context of the conversation history
- Hidden State - a vector that is updated after each time step using the input and the previous hidden state (see the sketch below)
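A minimal NumPy sketch of that update loop follows (my addition, not from the guide); the tanh activation, the dimensions, and the random weights are illustrative assumptions.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a plain RNN over a sequence, returning the hidden state at each time step."""
    h = np.zeros(W_hh.shape[0])                   # initial hidden state
    hidden_states = []
    for x_t in inputs:                            # one element of the sequence per time step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)  # update uses current input AND previous h
        hidden_states.append(h)
    return hidden_states

rng = np.random.default_rng(0)
input_size, hidden_size, steps = 4, 3, 5          # e.g. 5 word vectors of length 4
inputs = rng.normal(size=(steps, input_size))
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

for t, h in enumerate(rnn_forward(inputs, W_xh, W_hh, b_h), start=1):
    print(f"hidden state after time step {t}:", h.round(3))
```

Each iteration mixes the current input with the previous hidden state, which is how information is carried forward through the sequence.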
- Backpropagation Through Time (BPTT)
  1. Forward Pass - Process the sequence and store the hidden states and outputs.
  2. Unroll the Network - Visualize the RNN as multiple layers across time steps.
  3. Compute the Loss - Calculate the loss at each time step and sum them up.
    - Loss - refers to the difference between the expected and actual output for one time step
    - Loss function - used to calculate loss based on the expected and actual output
  4. Backward Pass
    a) Compute the gradients of the loss with respect to the outputs, hidden states, weights, and biases.
    b) Accumulate (sum) these gradients over the time steps.
  5. Update Weights - Adjust the weights and biases by inputting the accumulated gradients into a gradient descent function (see the sketch below).
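To make steps 1-4 concrete, here is a toy sketch (my addition, not from the guide) that runs the forward pass, sums the per-time-step losses, and then estimates the gradient of that summed loss with respect to one shared recurrent weight by finite differences rather than the analytic backward pass; all dimensions and values are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
steps, hidden = 6, 3
inputs = rng.normal(size=(steps, hidden))
targets = rng.normal(size=(steps, hidden))
W_hh = rng.normal(scale=0.5, size=(hidden, hidden))   # shared weights, reused at every time step

def total_loss(W):
    """Steps 1 and 3: forward pass, then sum the loss from every time step."""
    h, loss = np.zeros(hidden), 0.0
    for x_t, y_t in zip(inputs, targets):
        h = np.tanh(x_t + W @ h)            # hidden state update
        loss += np.mean((h - y_t) ** 2)     # loss for this time step
    return loss

# Step 4, approximated numerically: nudge one shared weight and watch the summed
# loss respond. Every time step contributes to this single gradient value,
# because the same W_hh is used at each step of the unrolled network.
eps = 1e-6
W_nudged = W_hh.copy()
W_nudged[0, 0] += eps
grad_00 = (total_loss(W_nudged) - total_loss(W_hh)) / eps
print("summed loss:", round(total_loss(W_hh), 4))
print("approx. d(loss)/dW_hh[0,0]:", round(grad_00, 6))
```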
- RNNs vs. Standard Neural Networks (Feedforward Neural Networks)
- BPTT (Backpropagation Through Time) vs. Standard Backpropagation
- Vanishing Gradient Problem + RNNs
  - Gradient Calculation - Due to the chain rule mentioned earlier, we end up multiplying any gradient by the gradient at the same layer in the previous time step
  - Hidden State - Calculating the gradient involves taking the derivative of the loss with respect to the hidden state, which requires taking a derivative of the activation function
  - Time Steps vs. Layers - A fixed number of layers for gradient calculations in FFNNs, layers * time_steps in RNNs
- RNNs: Pros and Cons
  - Pros
    - Sequence Handling - designed to handle sequential data
    - Memory - capability to retain information from previous inputs due to their internal state
    - Flexibility - can be applied to various types of sequential data, including text, audio, video, and time series data
  - Cons
    - Vanishing and Exploding Gradients
    - Training Time - can be computationally intensive and time-consuming, particularly for long sequences or large datasets, due to the sequential nature of the data processing
    - Difficulty Capturing Long-term Dependencies
    - Complexity in Parallelization - RNNs process data sequentially, which makes it challenging to parallelize the training process and leads to slower training times

Long Short-term Memory (LSTMs)

- Description
  - Type of RNN that addresses the vanishing gradient problem
  - Contains a typical input and output layer, but instead of the hidden layer neurons found in RNNs and FFNNs, contains "LSTM layers"
  - Each LSTM layer is made of "LSTM cells"
  - Each cell contains a series of "gates", which are really additional mathematical functions in the hidden layer neurons used to appropriately process information
- LSTM Cells
  - Cell State - acts as the long-term memory of the LSTM cell. It carries relevant information throughout the sequence of data and is modified by gates to add or remove information
  - Input Gate - Decides which values are updated in the cell state
  - Forget Gate - Decides what information is discarded from the cell state
  - Output Gate - Decides what information from the cell state is used to generate the output
- How do LSTM cells work? (see the sketch below)
  1. The inputs are the hidden state from the same layer in the previous time step, the cell state from the same cell in the previous time step, and the input from the previous layer (the current hidden state multiplied by the weights of the connections with cells in the previous layer).
  2. Forget gate - The hidden state and input are combined into a single vector that passes through the sigmoid function to produce a value between 0 and 1; the result is multiplied by the previous cell state.
  3. Input gate - The combined hidden state and input vector passes through both a tanh and a sigmoid function, and the results of both of these functions are multiplied. This product is added to the cell state.
  4. Output gate - The combined hidden state and input vector passes through a sigmoid function, and the result is multiplied by the result of the cell state passing through a tanh function. This product is output to the hidden state.
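A compact NumPy sketch of those four steps follows (my addition, not from the guide); separate weight matrices for each gate are assumed, and bias terms are omitted to mirror the description above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W_f, W_i, W_g, W_o):
    """One LSTM cell step, following the four stages described above (biases omitted)."""
    combined = np.concatenate([h_prev, x_t])                # step 1: combine hidden state and input
    f = sigmoid(W_f @ combined)                             # step 2: forget gate, values in (0, 1)
    c = f * c_prev                                          #         scale down the old cell state
    i = sigmoid(W_i @ combined) * np.tanh(W_g @ combined)   # step 3: input gate contribution
    c = c + i                                               #         add new information to the cell state
    o = sigmoid(W_o @ combined) * np.tanh(c)                # step 4: output gate -> new hidden state
    return o, c

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
W_f, W_i, W_g, W_o = (rng.normal(scale=0.5, size=(hidden_size, hidden_size + input_size))
                      for _ in range(4))
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):                # a 5-step toy sequence
    h, c = lstm_cell(x_t, h, c, W_f, W_i, W_g, W_o)
print("final hidden state:", h.round(3))
print("final cell state:  ", c.round(3))
```

Running the cell over a short toy sequence shows the cell state being scaled by the forget gate and topped up by the input gate at every step.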
- LSTM + Cell State (Example)
  - Early in the Sentence
    - The cell state might store information about the subject (e.g., "The cat").
    - This helps maintain agreement between subject and verb later in the sentence.
  - Mid-Sentence
    - The cell state might track the structure of the sentence (e.g., "The cat sat on").
    - This helps predict the next word in the context of the ongoing phrase.
  - Later in the Sentence
    - The cell state retains relevant details and context that have accumulated (e.g., "The cat sat on the").
    - This helps in predicting that the next word could be "mat" given the context.
- Vanishing Gradient Problem
  - The cell state is used to calculate the hidden state, which is used to generate the final output, which is used to calculate the loss.
  - The loss is used to calculate the gradient for the weights and biases of each layer at each time step.
  - The gates (input, forget, and output gates) in LSTMs control the information flow by selectively adding relevant new information and removing irrelevant or outdated information from the cell state.
  - Because the cell state is updated in a targeted manner, it remains more stable and less prone to rapid changes.
  - This helps preserve the magnitude of the gradient during training.

Transformer Neural Networks (TNNs)

- Generative Pre-trained Transformers (GPTs)
  - The original TNN was introduced in the 2017 Google paper "Attention Is All You Need" for translating text
  - Emphasized the use of the self-attention mechanism, which led to better performance and parallelization (relative to RNNs and LSTMs)
  - GPT-1 had 117 million parameters; GPT-3 has 175 billion parameters
- Transformer Neural Networks (TNNs): Key Aspects
  - Processes all words in a sentence simultaneously
  - positional encodings - mathematical values generated through the use of mathematical functions to indicate the position of each word in a sentence
  - self-attention mechanism - adds "weight" (a mathematical multiplier) to each word in a sentence based on importance
  - multi-head attention - applies the self-attention mechanism to different parts of the sentence simultaneously during processing, resulting in different perspectives on word relationships and interactions, which are later combined
  - Consists of multiple layers, including feed-forward networks, which are the target of training
- TNNs: Architecture
  - Converting a sentence from English to French (occurs word by word)
  1. Input Embeddings - Each word in a sentence is turned into a vector that captures its meaning
  2. Positional Encoding - Incorporate information about the position of each word into the same vector (see the sketch after this list)
  3. Encoder Layers (~6 layers) - Each layer includes a self-attention mechanism, a feed-forward layer, and output normalization - focuses on accurately representing the text
  4. Decoder Layers (~6 layers) - Each layer includes a self-attention mechanism, a feed-forward layer, and output normalization - focuses only on producing the next token
  5. Output Layer - generates the next word in the output sentence
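Step 2 can be made concrete with the sinusoidal scheme from the "Attention Is All You Need" paper cited above; the sketch below (my addition, not from the guide) computes those encodings and adds them to a toy embedding matrix, with the sequence length and embedding size chosen arbitrarily.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings as in the original transformer paper."""
    positions = np.arange(seq_len)[:, None]                  # word positions 0..seq_len-1
    dims = np.arange(0, d_model, 2)[None, :]                 # even embedding dimensions
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                             # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)                             # cosine on odd dimensions
    return pe

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8                                      # a 6-word sentence, 8-dim embeddings
embeddings = rng.normal(size=(seq_len, d_model))             # toy input embeddings
with_position = embeddings + positional_encoding(seq_len, d_model)  # position info folded in
print(with_position.round(2))
```

Because every position gets a distinct pattern of sines and cosines, the model can recover word order even though it processes all words simultaneously.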
- Self-Attention Mechanism
  - Adds "weight" (a mathematical multiplier) to each word in a sentence based on importance
  - Allows the NN to discern the importance of words relative to each other
  - Also provides insight into relationships between words
  - Allows the different vectors for words to interact with each other
  - Obtaining Attention Weights
    - We calculate attention weights based on the relationship of each word with every other word (grammatical, contextual, etc.)
    - We use vectors representing different grammatical patterns and relationships, as well as the words themselves, to make these calculations.
    - We modify each vector based on this attention weight, all of which are usually generated in a table.
- Attention Weights (Example)
- Residual Connections
  - a shortcut path that skips one or more layers in the network and adds the input of the skipped layer directly to its output
  - used extensively within both encoder and decoder layers
  - provide a shortcut path for gradients that bypasses one or more layers
  - the gradient can flow directly through these connections without being diminished by the transformations (activations, weight multiplications) in the intermediate layers
  - The choice of layers to skip is dictated by experimentation, sort of like hyperparameter tuning
- TNN Advantages over RNNs
  - Parallelization:
    - Why True: RNNs process input data sequentially, where each step depends on the output of the previous step, making it impossible to parallelize effectively. In contrast, TNNs, particularly Transformer models, use self-attention mechanisms that allow each token to be processed independently of the others. This independence enables parallel processing of the entire sequence.
    - Benefit: This significantly speeds up training and inference compared to RNNs.
  - Long-term Dependencies:
    - Why True: RNNs struggle with long-term dependencies due to the vanishing gradient problem, where gradients diminish as they are backpropagated through many layers. TNNs, with their self-attention mechanisms, can directly connect distant tokens in the sequence, making it easier to capture long-range dependencies.
    - Benefit: This improves the model's ability to learn relationships in data that span long distances.
  - Reduced Training Times:
    - Why True: Due to parallelization and efficient handling of dependencies, TNNs can process multiple tokens simultaneously, reducing the time needed for training. RNNs' sequential nature inherently limits their training speed.
    - Benefit: This efficiency is crucial for training large models on large datasets.
  - Scalability:
    - Why True: TNNs can scale more effectively because their architecture allows for more straightforward parallelization and optimization.
      - The independent computation of attention scores across tokens and layers means that the workload can be distributed across multiple GPUs or TPUs.
    - Benefit: This scalability enables TNNs to tackle large datasets and complex tasks more effectively than RNNs.

Processing Power

- What is processing power?
  - Computational Capacity - The ability of the hardware (CPU, GPU, TPU) to perform a large number of complex calculations quickly, measured in terms of FLOPS (floating-point operations per second)
  - Memory Resources - The availability of sufficient RAM and VRAM to handle large models and data efficiently, ensuring smooth processing and quick access to necessary information
  - Efficiency and Speed - The capability to manage high throughput and low latency, allowing for rapid data processing and real-time response generation while optimizing energy consumption
- What are LLMs?
  - Massive Neural Network: An LLM is a neural network with billions of parameters designed to understand and generate human-like text from vast amounts of data.
  - Natural Language Processing (NLP): LLMs are essential for NLP tasks like text completion, translation, summarization, sentiment analysis, and answering questions by leveraging learned patterns and structures.
  - Contextual Understanding: These models generate contextually relevant responses, maintaining coherent conversations and producing human-like text based on prompts.
- Neural Networks: Main Tasks
  - Preprocessing - Preparing raw data for training the LLM by cleaning, transforming, and organizing it into a suitable format.
    - Cleaning
      - Description: Removing noise and irrelevant information from the dataset.
      - Example: Eliminating missing values, correcting inconsistencies, and removing duplicate entries to ensure data quality.
    - Selection
      - Description: Choosing relevant data and features for analysis and model training.
      - Example: Filtering out unimportant features and selecting a subset of the data that is representative and relevant to the problem being solved.
    - Transformation
      - Description: Converting data into a suitable format for analysis and model training.
      - Example: Normalizing numerical values, encoding categorical variables, and applying feature engineering techniques to create new features.
    - Reduction of Data
      - Description: Decreasing the volume of data while retaining important information.
      - Example: Selecting a smaller subset of data samples to speed up processing and reduce computational costs.
  - Training the Model - Teaching the LLM to understand and generate human-like text by optimizing its parameters on a large dataset
  - Deploying the Model - Making the trained LLM available for use in real-world applications.
- Bag-of-Words Algorithms
  - Tokenization - text is split into individual words (tokens), often removing punctuation and common stop words like "and" and "the" to focus on meaningful words
  - Vocabulary Creation - A collection of all unique words in the corpus (text) is created, with each word assigned a unique index
  - Vectorization - Each document is represented as a vector of word counts, where the vector length equals the vocabulary size and each element corresponds to the count of a specific word in the document (see the sketch below)
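Here is a small sketch of those three stages (my addition, not from the guide); the example queries and the stop-word list are invented.

```python
from collections import Counter

# Toy corpus of customer queries; stop-word and punctuation handling kept minimal.
documents = [
    "I want to file a claim",
    "How do I file an insurance claim",
    "My claim was denied",
]
stop_words = {"i", "a", "an", "to", "do", "how", "my", "was"}

# Tokenization + vocabulary creation: every unique remaining word gets an index.
tokenized = [[w for w in doc.lower().split() if w not in stop_words] for doc in documents]
vocabulary = {word: idx for idx, word in enumerate(sorted({w for doc in tokenized for w in doc}))}

# Vectorization: each document becomes a vector of word counts over the vocabulary.
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for doc, vec in zip(documents, vectors):
    print(vec, "<-", doc)
```

Notice that the vectors record only how often each vocabulary word appears, so word order and context are discarded, which is exactly the trade-off listed in the disadvantages below.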
  - Advantages
    - Straightforward: The BoW algorithm is simple to understand and easy to implement. It involves basic operations such as tokenization and counting word occurrences.
    - Minimal Preprocessing: Requires minimal preprocessing of text data, making it accessible and quick to deploy in various applications.
    - No Need for Grammar Knowledge: Does not require knowledge of grammar or language structure, which simplifies its application across different languages.
    - Low Computational Complexity: Involves simple counting operations and vector representations, making it computationally efficient.
    - Handles Large Datasets: Efficiently handles large datasets due to its simplicity and use of sparse matrix representations.
    - Parallel Processing: Can easily be parallelized, with different parts of the text processed simultaneously to enhance speed.
  - Disadvantages
    - No Order Information: BoW ignores the order of words in the text, leading to a loss of syntactic and semantic information. For example, "dog bites man" and "man bites dog" would have the same representation.
    - No Contextual Understanding: The algorithm fails to capture the context in which words appear, which can be critical for understanding meaning in natural language.
    - Resource Intensive: For large corpora, the vocabulary can become extremely large, leading to high-dimensional vectors. This can make the model computationally expensive and memory-intensive.
    - Sensitivity to Irrelevant Words: High-frequency words that are not meaningful (e.g., "the", "and") can dominate the vector representation unless explicitly removed.
- Graphical Processing Units (GPUs)
  - Multiple Cores - GPUs have thousands of specialized cores for handling many tasks simultaneously, excelling in parallel processing compared to CPUs with fewer, more powerful cores for sequential tasks.
  - Fast Data Transfer - High memory bandwidth allows rapid data transfer between the GPU and its memory, essential for large datasets and complex computations in deep learning and simulations.
  - Large VRAM - GPUs feature large VRAM for storing and quickly accessing data, reducing latency and enhancing performance.
  - Programmability - Frameworks like NVIDIA's CUDA and OpenCL enable custom coding to leverage GPU parallel processing for various applications beyond graphics.
- Tensor Processing Units (TPUs)
  - Custom-designed application-specific integrated circuits (ASICs) developed by Google specifically to accelerate machine learning workloads, particularly deep learning tasks
    - Each TPU unit has 8 cores
    - Each core has between 8 and 32 GB of RAM associated with it
  - Optimized Architecture: TPUs have a unique architecture tailored to efficiently perform large matrix multiplications and other operations common in deep learning.
  - Parallelism: TPUs can handle massive amounts of parallel computations, which significantly speeds up the training and inference of large machine learning models.
  - High Bandwidth Memory (HBM): TPUs use high-speed memory to store large amounts of data close to the processing units, reducing latency and increasing throughput.
  - Power Consumption: TPUs are designed to deliver high performance with lower energy consumption, making them more power-efficient for intensive machine learning workloads.
  - Thermal Design: Their specialized design often leads to better thermal efficiency, allowing them to perform heavy computations with less heat generation.
  - Distributed Processing: TPUs are designed to work in large-scale clusters ("pods"), allowing for the distribution of training tasks across many TPUs. This scalability supports the training of extremely large models on massive datasets.
  - Usage
    - Large-Scale Model Training - TPUs are used to train very large neural networks, such as those in natural language processing (NLP) and computer vision, much faster than would be possible with GPUs or CPUs.
    - Real-Time Inference - TPUs provide low-latency inference for deployed machine learning models, making them suitable for applications that require real-time decision-making, such as autonomous driving and live video analysis.
    - Research and Development - Researchers use TPUs to experiment with new model architectures and training techniques, taking advantage of their high computational power to iterate quickly.
- GPUs vs. TPUs
- Clustering + LLMs
  - Advantages
    - Increased Computational Power - Clustering multiple GPUs or TPUs provides substantial computational power, enabling the training of very large language models that would be infeasible on a single unit
    - Scalability - Clusters can be scaled up or down based on workload requirements, allowing for flexible resource management and efficient handling of varying demands
    - Reduced Training Time - Distributing the training process across multiple units significantly reduces the time required to train large models by parallelizing computations
    - High Throughput - Clusters can handle large volumes of data simultaneously, improving throughput for both training and inference tasks
    - Fault Tolerance - Clusters can provide redundancy, where the failure of a single unit does not halt the entire training process, thus improving reliability and uptime
  - Disadvantages
    - Complexity in Setup and Management - Setting up and managing a cluster of GPUs or TPUs involves significant complexity, including configuring networking, synchronization, and software environments.
    - High Cost - Clustering multiple high-end GPUs or TPUs can be very expensive, both in terms of initial hardware investment and ongoing operational costs, such as power and cooling.
    - Communication Overhead - Distributing tasks across multiple units introduces communication overhead, which can limit the efficiency gains from parallel processing, especially if the network bandwidth is insufficient.
    - Software and Framework Compatibility - Ensuring compatibility and optimizing performance across all units in the cluster can be challenging, requiring specialized knowledge and effort to tune software and frameworks.
    - Energy Consumption - Running large clusters consumes a significant amount of power, contributing to higher operational costs and potential environmental impact.
- Training + Processing Power
  - Model Complexity and Size
    - Number of Parameters: Larger models with more parameters (e.g., billions in LLMs) require significantly more computational resources.
    - Architectural Complexity: Advanced architectures with more layers and sophisticated components, such as transformers, increase processing power requirements.
  - Dataset Characteristics:
    - Size: Larger datasets necessitate more processing power to handle and process the increased volume of data.
    - Quality: High-quality datasets that need extensive preprocessing, cleaning, and augmentation can add to computational demands.
  - Hardware Utilization
    - GPU/TPU Availability: The number and quality of GPUs/TPUs available directly affect the processing power and training speed.
    - Efficiency: Utilizing hardware-specific optimizations and accelerators can significantly reduce processing power requirements.
  - Model Architecture
    - Transformer Variants: Different architectures (e.g., BERT, GPT, T5) have varying computational requirements. The design of attention mechanisms, feedforward layers, and other components impacts processing power.
    - Custom Layers and Operations: The inclusion of specialized layers or operations can add to the computational burden.
- Deployment + Processing Power
  - Inference Latency and Throughput (see the sketch after this list)
    - Latency: The time required to produce a result after receiving an input. Low-latency requirements demand more processing power for real-time responses.
    - Throughput: The number of inferences the model can handle per second. High-throughput applications require significant computational resources to maintain performance.
  - Model Size and Complexity:
    - Parameters: Larger models with more parameters require more processing power and memory for inference.
    - Architecture: More complex architectures may involve additional computations, increasing the processing power needed.
  - Hardware Utilization
    - GPUs/TPUs: Effective use of specialized hardware can significantly reduce inference time and processing power needs.
    - Accelerators: Utilizing hardware accelerators designed for specific tasks can improve efficiency and performance.
  - Batch Size - The number of inputs processed simultaneously affects the computational load
    - Larger batch sizes can improve throughput but also increase the memory and processing power needed
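To illustrate how latency, throughput, and batch size interact (my addition, not from the guide), the sketch below times a dummy "model" (a single matrix multiplication) at several batch sizes; the model, sizes, and iteration count are arbitrary, and the numbers will vary by machine.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(512, 512))          # stand-in for a deployed model's parameters

def model(batch):
    """Dummy 'inference': one matrix multiplication per batch of requests."""
    return np.tanh(batch @ weights)

for batch_size in (1, 8, 64):
    batch = rng.normal(size=(batch_size, 512))
    start = time.perf_counter()
    for _ in range(100):                       # 100 batches of requests
        model(batch)
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / 100 * 1000          # time per batch (latency)
    throughput = (100 * batch_size) / elapsed  # inferences handled per second
    print(f"batch={batch_size:3d}  latency ~ {latency_ms:6.2f} ms/batch  "
          f"throughput ~ {throughput:8.0f} inferences/s")
```

Larger batches typically raise throughput while also raising per-batch latency and memory use, which is the trade-off noted in the Batch Size item above.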
Datasets
- Real Data vs. Synthetic Data
  - Real Data
    - Description: Collected from real-world events, transactions, or observations.
    - Example: Customer transaction records, sensor readings, user interactions on a website.
  - Synthetic Data
    - Description: Generated artificially using algorithms or simulations, designed to mimic the statistical properties of real data.
    - Example: Simulated user behavior on a website, generated medical records for training purposes.
- Real Data: Advantages and Disadvantages
  - Advantages
    - Authenticity and Relevance - Real data accurately reflects real-world scenarios, providing genuine insights for analysis and model training.
    - Diverse and Complex - Captures natural variability and complexity, including rare events and edge cases, which are crucial for robust model performance.
    - Credibility and Trust - Higher confidence in results and insights derived from real data, as it is based on actual observations and experiences.
  - Disadvantages
    - Collection Challenges - Gathering real data can be expensive and time-consuming, requiring significant resources for data collection, storage, and management.
    - Quality Issues - Real data can contain inaccuracies, inconsistencies, and noise, requiring extensive cleaning and preprocessing to ensure quality.
    - Privacy and Legal Concerns - Access to real data may be restricted due to privacy concerns, legal regulations, or proprietary restrictions, limiting its availability.
- Synthetic Data: Advantages and Disadvantages
  - Advantages
    - Cost-Effective - Generating synthetic data is often less expensive than collecting and labeling real-world data, allowing for budget-friendly scalability and rapid data production (a generation sketch follows this list).
    - Privacy-Safe - Synthetic data does not represent real individuals, eliminating privacy concerns and enabling easier data sharing and compliance with data protection regulations.
    - Customizable and Balanced - It can be tailored to specific needs, ensuring balanced datasets and inclusion of rare or extreme cases, which helps in building more robust machine learning models.
  - Disadvantages
    - Lack of Realism - Synthetic data may not fully capture the complexity and nuance of real-world scenarios, leading to models that might not generalize well to real-world applications.
    - Complex Generation Process - Creating high-quality synthetic data requires sophisticated algorithms and domain expertise, which can be technically challenging and resource-intensive.
    - Skepticism and Regulatory Hurdles - Stakeholders may be skeptical of models trained on synthetic data, and regulatory bodies may not accept synthetic data for compliance purposes in certain industries like healthcare and finance.
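To make the idea of mimicking the "statistical properties of real data" concrete, here is a minimal sketch, assuming only NumPy is available. The field names, policy types, and distribution parameters are hypothetical illustrations, not figures from the case study; in a real project they would be estimated from the company's actual records.

```python
# Minimal synthetic-data sketch (assumption: NumPy installed).
# Field names, policy types, and distribution parameters are hypothetical;
# in practice they would be estimated from the company's real records.
import numpy as np

rng = np.random.default_rng(seed=42)

claim_mean, claim_std = 1200.0, 400.0        # assumed claim-amount statistics
policy_types = ["auto", "home", "travel"]    # assumed policy categories
policy_probs = [0.5, 0.3, 0.2]               # assumed observed mix of policies


def synthesize_records(n: int) -> list[dict]:
    """Draw n synthetic claim records from the assumed distributions."""
    amounts = rng.normal(claim_mean, claim_std, size=n).clip(min=0)
    policies = rng.choice(policy_types, size=n, p=policy_probs)
    return [
        {"policy": str(p), "claim_amount": round(float(a), 2)}
        for p, a in zip(policies, amounts)
    ]


for record in synthesize_records(5):
    print(record)
```

Because the records are drawn from distributions rather than copied from customers, they can be shared for development and testing without the privacy concerns attached to real claims data, though they contain none of the rare edge cases that only real data captures.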
- Biases
  - Confirmation Bias
    - Description: Confirmation bias occurs when the dataset favors a particular viewpoint or hypothesis, leading to skewed model predictions.
    - Example: A customer service chatbot is trained only on queries related to a specific type of insurance policy, leading it to poorly handle queries about other policies.
    - Solution: Ensure the training data is diverse and representative of all possible viewpoints or scenarios. Incorporate data augmentation techniques and perform regular audits to identify and mitigate biases.
  - Historical Bias
    - Description: Historical bias arises when the training data reflects outdated information, failing to account for recent changes or trends.
    - Example: An NLP model trained on customer service queries from five years ago may not understand or accurately respond to current slang or new types of customer inquiries.
    - Solution: Regularly update the training data to include recent information and trends. Use techniques such as transfer learning to adapt models to new data efficiently.
  - Labeling Bias
    - Description: Labeling bias occurs when the labels applied to data are subjective, inaccurate, or incomplete, affecting the model's performance.
    - Example: Customer queries labeled too generically (e.g., "general inquiry") prevent the model from learning specific intents, leading to poor prediction accuracy.
    - Solution: Implement a detailed and consistent labeling process, involving multiple annotators to cross-validate labels. Use tools to detect and correct labeling inconsistencies.
  - Linguistic Bias
    - Description: Linguistic bias happens when the dataset is biased toward specific linguistic features, such as formal language, neglecting variations in dialects or vocabulary.
    - Example: A dataset composed mainly of formal written language may cause a model to struggle with interpreting informal speech or regional dialects.
    - Solution: Include diverse linguistic styles and dialects in the training data. Utilize techniques like data augmentation to simulate informal language and dialects.
  - Sampling Bias
    - Description: Sampling bias occurs when the training dataset is not representative of the entire population, leading to biased model outcomes.
    - Example: Training data that only includes queries from young adults may cause a model to perform poorly with queries from older adults.
    - Solution: Ensure the training dataset is representative of the entire target population. Use stratified sampling to maintain diversity across various demographic groups (a stratified sampling sketch follows this list).
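As a concrete illustration of the stratified sampling solution above, the following minimal sketch assumes scikit-learn is available; the queries, age-group labels, and 60/30/10 mix are hypothetical placeholders rather than data from the case study.

```python
# Minimal stratified-sampling sketch (assumption: scikit-learn installed).
# The queries, age groups, and 60/30/10 mix are hypothetical placeholders.
from collections import Counter
from sklearn.model_selection import train_test_split

queries = [f"query {i}" for i in range(100)]
age_groups = ["18-30"] * 60 + ["31-55"] * 30 + ["56+"] * 10

train_q, test_q, train_g, test_g = train_test_split(
    queries,
    age_groups,
    test_size=0.2,
    stratify=age_groups,   # keep the 60/30/10 mix in both splits
    random_state=0,
)

print("training mix:", Counter(train_g))
print("test mix:    ", Counter(test_g))
```

The `stratify` argument forces both splits to keep the original group proportions, so the model is neither trained nor evaluated disproportionately on the majority group.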
