the-fast-path-nvidia.txt
Document Details
Tags
Related
- Large Language Models for Software Engineering PDF
- Large Language Models PDF
- Chapter 3 Introduction to AI, Machine Learning, Deep Learning, and Large Language Models (LLMs).pdf
- A Survey of Large Language Models PDF
Full Transcript
Welcome to the Fast Path to Developing with Large Language Models. I'm David Toppenheim, a Senior Solutions Engineer at NVIDIA Developer Programs. Let's begin. Here's today's agenda. We're going to start off by looking at large language models in a wider context to give you some familiarity if you're not already familiar. We'll also do a demonstration of an app that takes advantage of the capabilities of large language models to help with a problem we all have. We'll then look at using large language model APIs, with an example. We'll then move on to prompt engineering and show you the considerations for building prompts for your application. We'll move ahead to using large language model workflow frameworks and do a bit of an analysis of a few different ones for you. Then finally, we'll talk about how we can use one of those frameworks to combine large language models with your data, so that your data is used as part of the prompt. Historically, language models have been trained for specific tasks: things like text classification, entity extraction, where you're trying to find the names of people, places, or things, or question answering. But then in 2017, the large language model revolution began in earnest, and this was powered by transformer models. That's a particular type of deep learning architecture that specializes in processing sequences of data points, or tokens, where tokens are numbers representing words or parts of words. These transformer architectures use self-attention to figure out which parts of the sequence can help interpret the other parts of the sequence. This came about in a paper written by Google and University of Toronto researchers called Attention Is All You Need. It's a famous paper in large language model evolution. Now, we have these much larger models trained on extraordinary quantities of data, billions and trillions of tokens. Looking over at the tree on the right, we can see that there's been an explosion of models. We just want to point out that the ones you might have heard of, like GPT-4 or Llama, aren't the only ones there. There might be some that are specifically good for what you want. We'll talk about how to pick some of those in a little while. I also wanted to point out the branches in this evolution tree. Lots of the attention has gone to the GPT, decoder-only branch. We're going to be showing uses of the encoder-only branch as well. There are even cases where the encoder-decoder branch is useful, like human language translation. But the complexity of the models isn't really reflected in the complexity of the APIs, which have begun to converge substantially. Don't worry, we're going to show you, the audience, how to access this field. Transformer models are built with unsupervised learning, and they've proven to be very effective at next-token prediction. Sure enough, looking over to the left, you can see what I mean. "The sky is" is the phrase that's going into the large language model, with log probabilities for candidate next tokens like "blue", "clear", "usually", and "the" coming out. Of course, "blue" has the highest probability, its log probability being the least negative number. That is the most likely next word, and this goes on and on, and that's how we can observe applications like ChatGPT that are predicting the next word, but doing so with such correctness and fluidity that it sounds like a knowledgeable person is telling you the answer that you queried about.
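To make that log-probability idea a bit more concrete, here is a tiny Python sketch; the candidate tokens and the numbers are invented for illustration and are not from a real model.

    import math

    # Hypothetical log probabilities for the token that follows "The sky is".
    next_token_logprobs = {
        "blue": -0.4,      # least negative, so most likely
        "clear": -1.8,
        "usually": -3.1,
        "the": -4.5,
    }

    # The most likely next token is the one with the highest (least negative) log probability.
    best_token = max(next_token_logprobs, key=next_token_logprobs.get)
    probability = math.exp(next_token_logprobs[best_token])  # convert the log prob back to a probability
    print(best_token, round(probability, 2))                 # -> blue 0.67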
Models pre-trained like this on unlabeled datasets are called foundation models, and they can be tuned later for a bunch of specialized applications. Some of those might be what we talked about earlier, the traditional natural language processing tasks, or the model can learn general or domain-specific knowledge, or it can perform new tasks with few or no examples. In that case, the large language model is a scaled-up architecture that can perform a lot of varied language tasks like summarizing, translating, even composing new content. That's what gives it its generative name. Let's hop into an example where we're going to use a large language model to help us triage our company e-mail. Imagine the case of the fictitious company Melodious, which manufactures musical instruments and audio equipment. Their issue is that they have hundreds of e-mails that come in every day from customers with various needs, from urgent needs to non-urgent needs, from repairs to compliments. This is a lot for somebody to go through, and somebody would also have to assign which customer service representative each e-mail should be handled by. But what if we redo this and think about it in a more modern way that takes advantage of large language models' abilities? Now, our inbox looks quite a bit different. We have a description of the problem rather than just the subject of the e-mail, and when we click into one of those e-mails, we see that a few characteristics have been sussed out of the e-mail by the generative AI, by the large language model: what product the e-mail is about, the representative who should handle it, the tone, a summary of the issue, and then an assignment of priority. In this case, it's the most urgent response. We see that there are several e-mails here that are urgent, others that are not urgent, and in some cases e-mails that don't need a response right now, so let's not take time working on them if we're very busy. We can also look at which customer service representatives have which e-mails. Chris looks like he's going to be plenty busy. We can also choose products, so we can go down the list just the same way that we did with the customer service representatives and sort by those particular instruments. Now, let's say that there's an issue that needs to be further researched by the customer service representative. They can click on Research Issue, and what pops up is a summary of the issue and then a summary of the search results across our assets at Melodious. We'll talk about how this works a little bit later, but what you saw happen here was that the summary was created on the fly based on the customer's e-mail, along with the sources that were found to be very similar in semantic value to the question that came in from the customer. These resources are intended to help the customer service representative address the issue the customer is raising. How did that demonstration work? How were we able to use a large language model to parse through those e-mails and triage them for us? Well, what we had going in was a semi-structured text input just containing an e-mail body. That e-mail body is then added to a prompt that gives the large language model a task to do with the e-mail body. We call that large language model through an API. Then finally, the large language model's output was requested to be in the JSON format, and sure enough, that's our output.
Next, let's consider how the large language model was functioning and think step-by-step through what we need to do this, starting with calling the API. Let's go through an example with OpenAI's ChatGPT. The first thing that we do is import packages, and looking over toward the right, we see that yes, we did import the OpenAI package. We're making use of load_dotenv to help us find and then utilize the information inside our .env file that contains our OpenAI API key. Let's choose GPT-3.5 Turbo with a temperature of 0.9, where temperature runs from 0 to 1, 1 being more random and more creative and maybe beyond what you need, and 0 being less so. Some other key parameters might be top-k and top-p or repetition penalty; those depend on which model you're using. Next, we write the prompt. The prompt is "write a haiku about large language models." We use that prompt as part of a greater message that goes to the large language model in our framework. Finally, we call the API with the create method. The response lands in the response variable, and if we print out its contents, we see "endless worlds unfold, giant minds, vast text arrays, wisdom from the void." It gives me chills. What are some of the factors that you need to consider when you're selecting a large language model? One of the important ones is to take a look at benchmark scores on a relevant benchmark. On the left, you'll see a table with different task types like reasoning, reading comprehension/question answering, math, coding, and so on. If your application is using a large language model for reasoning, you might, for example, use HellaSwag. For coding, you might use HumanEval and MBPP. You also will need a lot of data, especially if you're doing your own pre-training; later, you can fine-tune with far less. We also need to think about the kinds of evaluation or validation tests that we'll run on the model when it becomes part of our system. Latency is very important too: latency is the amount of time it takes for us to start getting our answer back out of the language model. For something conversational, we might need something shorter than a third of a second. The cost of deployment, or the usage price per token, is eventually going to add up, so you want to think through the costs of typical prompts going into the language model and typical responses coming out. How much context can the language model consider at a time? That's called the context size. Then licensing terms, which normally is something that you might gloss over, but I recommend not doing that here, because in some cases these models are not for commercial use but for research use only. There are models available for commercial use, or you can train your own. We also have to think about domain specificity. When we train models that are very domain-specific, we find that we can get as good a performance from a smaller model trained on that narrower field as from a much larger model trained more generally, which is, of course, going to cost more, not only in terms of the hardware resources to instantiate it, but also in the use of energy and the cost of deployment. Looking at the HellaSwag benchmark column, we can see how the Falcon 180B model performs compared to the Falcon 40B model. They're both highlighted in yellow.
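Before we move on, here is a minimal Python sketch of the kind of call just described, using the OpenAI client and a .env file for the API key; client syntax varies between library versions, so treat this as illustrative rather than the demo's exact code.

    import os
    from dotenv import load_dotenv
    from openai import OpenAI

    load_dotenv()  # reads OPENAI_API_KEY from the .env file into the environment
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

    prompt = "Write a haiku about large language models."

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.9,  # closer to 0 is more deterministic, higher is more random and creative
        messages=[{"role": "user", "content": prompt}],
    )

    # Print the text content of the model's reply.
    print(response.choices[0].message.content)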
Returning to the benchmark table: in the HellaSwag column, highlighted in green, we see that the 180B model scores 89 points, and the 40B, 40-billion-parameter model, scores 85 points. For four and a half times more size, you gain three points of performance. That's something to consider: are those three points worth a model nearly four or five times the size, which will impact the cost? That's an important consideration for you when you're designing a system. Let's shift gears a little bit and talk about prompt engineering. Having a good prompt is very important if you expect good results from a large language model. Let's talk about a few of the different methodologies. One of them is called zero-shot. In this case, we're asking the foundation model to perform a task with no in-prompt examples. "What is the capital of France?" That question goes into the large language model, and if all goes well, the answer, Paris, pops out, following the format: we have Q and then we have A. What's nice about zero-shot prompting is that it gives you a lower token count. Remember, we were talking about token costs and the limited context memory; with a lower token count we're being efficient, which gives us more space for context. But in some cases, a zero-shot prompt isn't enough for a model to give you what you want. With a few-shot prompt, on the other hand, we provide examples as context for the foundation model that are relevant to the task. In this case, we would give it some examples of capitals and answers: What is the capital of Spain? Answer, Madrid. Italy? Answer, Rome. We notice also that we are asking for the answer in a particular format that looks like a Python dictionary. So when it comes time for us to ask what the capital of France is, we get a response back in the correct format: the answer is Paris, and it's in the dictionary format. So the responses are better aligned, especially in terms of formatting, and in general they have higher accuracy. Few-shot prompts give higher accuracy on complex questions. On some models, including the larger and newer ones, few-shot prompting may not be necessary; zero-shot prompting may get you the result that you need. So I would suggest trying both and figuring out which one gives you the answer you need more reliably. We can also use a prompt to generate synthetic test data. Remember, in the demo that you just saw, we had about 100 e-mails that came in, complaining or complimentary, about a whole range of products and across a whole bunch of names. Here's the prompt that was used to generate that. We see that it contains variables, customer, product, and feedback, along with some instructions. For example: when you write e-mails, you get right to the point and avoid pleasantries like "I hope this e-mail finds you well" or "I hope you are having a great day"; start with a subject line; and so on. So we want it to be concise. If we take a look at what's produced, and I'm just showing the body here, we have a pretty good e-mail that catches the things that we asked for. The product is the CG Series Grand Piano; the customer, that's Zhiyong at the very bottom; and then the feedback: exceptional quality of sound, exceeded my expectations, and then a thank you. It's important when you use a large language model for synthetic data generation (SDG) to check the model's license. In some cases at a company, you may not be permitted to use the model at all, and that includes using it to create synthetic data.
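Going back to the zero-shot versus few-shot distinction for a moment, here is a small sketch of how the two prompt styles might be assembled as plain strings; the wording is illustrative, not the exact prompts from the slides.

    # Zero-shot: no examples, just the question and the expected answer format.
    zero_shot_prompt = (
        "Q: What is the capital of France?\n"
        "A:"
    )

    # Few-shot: a couple of worked examples that also demonstrate the desired output format.
    few_shot_prompt = (
        "Answer in the format of a Python dictionary.\n"
        "Q: What is the capital of Spain?\n"
        "A: {'answer': 'Madrid'}\n"
        "Q: What is the capital of Italy?\n"
        "A: {'answer': 'Rome'}\n"
        "Q: What is the capital of France?\n"
        "A:"
    )
    # With the few-shot prompt, the model is far more likely to reply in the same
    # dictionary format, e.g. {'answer': 'Paris'}.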
On the subject of licensing, there might also be prohibitions on using the output of one language model to train another language model. Taking a look at the prompt on the right, you'll notice that it is asking the large language model to think logically, step-by-step, to help the customer service representative. There are five steps there, including a specific output format for the answer, and at the very bottom, in blue, we see that the body of the e-mail is going to be appended to these instructions. So the instructions plus the body of the e-mail comprise what we call our triage prompt. This is called a chain-of-thought prompt, in the sense that we're asking a large language model to reason through a process step-by-step. Adding something like "let's think step-by-step" or "let's think about this logically" to your prompt has been shown to improve the results from some large language models, so it's definitely worth trying. You can supply those specific steps if there's a consistent process that you need your large language model to run through. In our case, indeed, it was sorting through a big pile of e-mail. The prompt on the left, along with the e-mail body that is appended to it, produces the result on the right, and we can see that the large language model is reasoning through each of the steps. Step 1 determines that it's about a specific product, that piano. Then what the issue is; in this case, it's praise for the quality. The tone of the e-mail is positive. Then we don't need an urgent response, because the customer isn't expressing a problem or something we need to correct quickly. Then finally, step 5 outputs the customer name, the product, the category, the summary, the tone, and then the response urgency, like we asked for. When it's your turn to design a prompt, you have to go through the process of deciding if it's going to be a zero-shot, a few-shot, or a chain-of-thought prompt, or something like what's shown on the right, which is a long zero-shot prompt. You'll notice that the more aligned or sophisticated a model is, like GPT-3.5 or GPT-4, the fewer explicit cues it will typically need. In this case, we can see that even though it's a fairly lengthy prompt, it's still a zero-shot prompt, because we're giving no examples even though we're giving a lot of instructions. Now, some of the prompt elements that you should consider including: a role, which dictates a job along with a descriptive adjective or two, and here that's "an efficient administrative assistant." Instructions, that's step-by-step what you want done; you can use action verbs to make this better: determine, classify, write a one-sentence summary, organize your answers. The context is the relevant background info for the prompt; for example, in this case, we're at a musical instrument company and we receive e-mail. Then the output format can be almost anything you can dream up, including something custom; in this case, we're asking for a JSON object with the following keys: name, product, category, summary, tone, and urgency. Our prompt also needs to be exacting in what we want; we don't have to worry about keeping it brief. The more exacting we are, the more exacting the result will be. Of course, you can imagine the kinds of things you'll need to avoid: vagueness, unfounded assumptions, or topics that are just too broad. The output format of the large language model can be almost anything that you can imagine.
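Here is an illustrative stand-in for a triage prompt along those lines; the demo's exact wording isn't reproduced here, so the step text and the helper function are assumptions.

    # Illustrative stand-in for the triage prompt described above (not the demo's exact wording).
    TRIAGE_INSTRUCTIONS = (
        "You are an efficient administrative assistant at Melodious, a maker of musical "
        "instruments and audio equipment. Let's think about this logically, step by step:\n"
        "1. Determine which product the e-mail is about.\n"
        "2. Classify the category of the issue (repair, question, praise, ...).\n"
        "3. Write a one-sentence summary of the issue.\n"
        "4. Determine the tone of the e-mail.\n"
        "5. Organize your answers into a JSON object with the keys: "
        "name, product, category, summary, tone, urgency.\n"
        "E-mail body:\n"
    )

    def build_triage_prompt(email_body: str) -> str:
        # The instructions plus the e-mail body comprise the triage prompt.
        return TRIAGE_INSTRUCTIONS + email_body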
Your prompt can specify the output as JSON, CSV, HTML, Markdown, and even code, and that list is always growing. If you take a look over on the right, this was one of the instructions from the previous example: to organize our answers into a JSON object. We see that the JSON was formulated correctly, but indeed it is just text; it's not an actual object, because the large language model outputs text. We perform a simple conversion step for a structured format like that. Now, even high-end LLMs can sometimes produce an imperfect format. You can try tuning, but I also strongly urge that you add some error checking in your code to make sure the output is in the format that you expect, and if it's not, that it's somehow corrected or perhaps re-requested in a slightly different way. Tools are available to help keep our LLM-based application in its lane, something to add boundaries and ensure that the large language model is not performing any undesirable behavior. These are called guardrails and toxicity checks. If we have an enterprise application like the one that we were just working on for the Melodious music store, we might have a user who asks something in an e-mail, and if we take a look at the green path through the checkboxes in the diagram, that user's request makes it all the way through the guardrails of NeMo Guardrails, NVIDIA's toolkit for adding programmable boundaries to LLM applications. Then it reaches the app toolkit; LangChain is what we used for some of the examples we're going to be showing you in a bit, and it makes the call to the LLM. Possibly we have to access a third-party app to get a result back into LangChain. Then finally, the response that comes back may be entirely unacceptable, shown in red, and can't go back to the user as it is, or may be mostly right and need just some modifications in order to make it through, and that's shown in blue. NeMo Guardrails uses patterns written in its Colang language and established in a configuration file, but other systems would configure their boundaries and behaviors a little bit differently. The upshot is that these guardrails address topical safety, keeping interactions focused on a specific domain, like something in a music store, not a grocery store, not a political opinion, not the weather; safety, preventing hallucinations, and if the LLM is producing undesirable results or something toxic, performing a toxicity check on the input and output; and finally security, because we don't necessarily want our user to be able to access everything about a third-party app simply through an e-mail that goes into a prompt. You might be wondering how you can evaluate how your large language model is doing in its application. I'll say that the evaluation type you're going to perform depends on the data. For example, if we have structured data, we could use a large language model to help bolster our existing structured data by generating more of it. We could generate a test dataset including inputs and known outputs, run the large language model on those inputs, then compare the large language model's outputs to our test outputs and perform scoring. It's a little bit different with unstructured data, though. That might be text like those e-mails that we had; we synthetically generated the e-mails in the first demo from all of the customers. Text generation, auto-completion, summarization: these are things that have many possible good answers. There's no one perfect, exact answer; there are many good answers.
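Circling back to the advice about error checking on JSON output, here is a minimal sketch of what that could look like; the call_llm callable and the retry wording are hypothetical placeholders.

    import json

    REQUIRED_KEYS = {"name", "product", "category", "summary", "tone", "urgency"}

    def parse_triage_output(raw_text: str) -> dict:
        # Convert the LLM's text output into a real Python object, with basic checks.
        data = json.loads(raw_text)            # raises ValueError if the text isn't valid JSON
        missing = REQUIRED_KEYS - data.keys()
        if missing:
            raise ValueError(f"LLM output is missing keys: {missing}")
        return data

    def triage_with_retry(prompt: str, call_llm, max_attempts: int = 2) -> dict:
        # call_llm is a hypothetical callable: it takes a prompt string and returns the model's text.
        for _ in range(max_attempts):
            try:
                return parse_triage_output(call_llm(prompt))
            except ValueError:
                # Re-request in a slightly different way: remind the model to return JSON only.
                prompt = prompt + "\nRespond with a valid JSON object only, no extra text."
        raise RuntimeError("Could not get well-formed JSON from the LLM")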
For unstructured data, you would simply take those inputs, run them through your large language model, and then have either a human or an AI apply a rubric to judge how good the output is. Similarly, one thing that we do internally on my team is to supply users with A/B testing, so that two different model candidates or two different output candidates make it to the user's eyes, and then they decide which one is better and vote on it. That kind of A/B testing has allowed us to implement internal systems that perform really well on unstructured data. Next, with API calls, prompting, guardrails, and evaluation all behind us, let's take a look at another function that large language models can help us with in our e-mail app demo, and that is researching based on the customer's e-mail and on up-to-the-minute company content. Let's go back to the e-mail application and see how we can help one of our customers. Noah has a clarinet whose keys are sticking, and that makes it difficult for him to play smoothly. We see that's the summary as inferred by the generative AI LLM, and that Lee would be the customer service representative who would handle the call. Let's go ahead and pretend we're Lee and research that issue. We can see the summary of our search results as it just filled in. We're taking that summary and using another large language model, albeit much smaller than GPT, to determine what are called the embeddings of it, the semantic embeddings: what does this sentence boil down to semantically, meaning-wise? Once we have that embedding vector, we can then compare it to the embedding vectors for each of the assets shown on the right, blog posts, press releases, and so on. We recall the original issue, find the passages that are relevant to a solution, and those can go into the prompt of the large language model, the large language model's input once more, asking for a summary of the search results, and sure enough, it's here. It says that the clarinet commonly does experience some issues with dry cork joints and sticking, and it suggests some powdered graphite and so on. What if we had an e-mail, but it wasn't in our inbox, just something we wanted to quickly triage? Let's try it. Let's go to Research, and we're going to paste in an e-mail that I asked ChatGPT to write for me, based on having a problem with my MP stage keyboard. I'm about to have a performance, but my wah-wah joystick is not working right with MIDI. What can we do? Let's go ahead and search that. What just happened there, almost instantly, was that my e-mail was boiled down to a summary. The summary was vectorized into its embeddings. The embeddings were compared to the embeddings of all of our solution documents, and the most similar ones were passed to the ChatGPT large language model to produce this summary and possible solutions. So how did that work again? Well, this involved a couple of steps. The first step was data preparation. We had our documents to be input and processed for later searching and retrieval. Don't forget that our documents are all synthetic; the company and those products don't really exist. But in any case, these are the input documents that are going to be used as sources of technical information or problem resolution. Then we process those documents by breaking them down into smaller pieces; we'll be discussing this in some detail, with code, in a bit. We convert those smaller passages of text into vectors.
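Before we store anything, here is a small sketch of that comparison step, assuming we already have embedding vectors in hand; the vectors and document names are placeholders, not real embeddings.

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        # Higher values mean the two embeddings are semantically more similar.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Placeholder embeddings: in the real app these come from an embedding model,
    # one vector for the e-mail summary and one per document passage.
    query_embedding = np.array([0.1, 0.9, 0.2])
    passage_embeddings = {
        "blog: fixing sticky clarinet keys": np.array([0.2, 0.8, 0.1]),
        "press release: new grand piano":    np.array([0.9, 0.1, 0.4]),
    }

    # Rank passages by similarity to the query; the most similar rise to the top.
    ranked = sorted(
        passage_embeddings.items(),
        key=lambda kv: cosine_similarity(query_embedding, kv[1]),
        reverse=True,
    )
    print(ranked[0][0])   # -> "blog: fixing sticky clarinet keys"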
Then we store those vectors as embeddings in a vector store. When it comes time for somebody to use our application, their actions will result in a prompt that triggers document retrieval: we find the most similar documents and then retrieve their text. That text gets passed into the large language model with an API call asking for a technical summary of the problem, a solution to the problem, and a description of the problem as well. Then that text comes out, and that's what you saw scrolling past. Let's switch gears and look at the code behind how the demo app works, in terms of accessing the large language model and preparing the data, and let's also compare a few different frameworks. Developing these kinds of systems can become complex pretty quickly, and so we look for ways to simplify our development. One way is to take advantage of the modularity and flexibility of certain frameworks. One framework that we used in the demonstration you saw earlier, and that we use in our own development, is LangChain. Taking a look at the code, you can see that in Python, LangChain allows us to open up an LLM object and specify which model it is. Then we can inject the parameters into standard chat messages using a definition like the one shown here; we're going to be writing a poem on a given topic, in a given language, and with a given large language model. One of the benefits of using this kind of framework is that it comes with components. LangChain comes with components that help you not only build a chain and work with different chains of prompts, but also instantiate agents, which is something that's talked about later, use memory, vector DB storage, and indices, and load in documents. We haven't talked too much about that yet, but we're about to. The example on the left is a simple standalone use case, something that you probably wouldn't need LangChain for, actually. But when it comes to more complex, graph-like chains, then we want to think about using LangChain's facilities. So on the right, you'll see that we do put in a swappable large language model, just like we did in the simple standalone case. But on the right, we developed a system prompt and a human prompt. The system prompt is what our large language model system is supposed to do, and the human template normally contains the request. Putting those two things together, the system prompt and the human prompt, makes a complete, full prompt. The next step is to connect the chain, which you see in the bottom-right code box. We can also flow more parameters into the chain so that they can be used by potentially multiple prompts. For example, when we run this chain, we can use it again and again, changing the topic and the language. Of course, it's possible to take the output of this chain and feed it into other chains, thus creating a complex flow. You've seen LangChain in action in the demonstration that we gave earlier. It is an excellent framework: it's open source, it has a large community, lots of integrations, and even enterprise tools to help you. But it's certainly not the only game in town. Two more frameworks that are available are Haystack and Griptape, each with their own target purposes and advantages. Haystack, for example, developed by deepset, is open source, and it has a lot of resources to help you with scaled search and retrieval, as well as the evaluation of pipelines, so you can tell how your whole system is performing.
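Going back to the LangChain chain just described, here is a sketch written against the classic LangChain API; module paths and preferred patterns vary between LangChain versions, so take it as illustrative rather than the demo's exact code.

    from langchain.chat_models import ChatOpenAI
    from langchain.prompts import (
        ChatPromptTemplate,
        SystemMessagePromptTemplate,
        HumanMessagePromptTemplate,
    )
    from langchain.chains import LLMChain

    # Swappable LLM object: specify which model and its parameters.
    llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.9)

    # System prompt: what the system is supposed to do. Human prompt: the request itself.
    system_prompt = SystemMessagePromptTemplate.from_template(
        "You are a poet who writes short poems in {language}."
    )
    human_prompt = HumanMessagePromptTemplate.from_template(
        "Write a four-line poem about {topic}."
    )
    chat_prompt = ChatPromptTemplate.from_messages([system_prompt, human_prompt])

    # Connect the chain; the same chain can be run again and again with different parameters.
    chain = LLMChain(llm=llm, prompt=chat_prompt)
    print(chain.run(topic="large language models", language="French"))
    print(chain.run(topic="the ocean", language="Spanish"))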
Remember, we talked about evaluation earlier. Haystack is also deployable as a REST API. Griptape can be deployed open source or in a managed way with commercial support. It is also optimized for scalability and cloud deployments, containing resources for encryption, access control, and security. In the example shown here, we've created a small toy example that asks the LLM to write a four-line poem about a topic in French. All three of these examples are doing this. The first task is to create the LLM object. The three different frameworks offer different functions that are, in actuality, quite similar. You just want to select the right model fitting your need and then pass in its API key. It's pretty easy then to create a function, pass in some arguments, in our case the topic of a poem, use the LLM object that we had previously instantiated, define the correct prompts, and then run it to get back the output; in this case, the poem in French. In fact, in Haystack, we give a further example where we take the output in French and then translate it using another large language model. In Griptape, we also added the ability for context to be read into our prompts: in this case, we load a PDF that contains some information the large language model is going to use as it composes the poem, and then the pipeline is run. LangChain has components to help you build a vector database as well. A vector database is going to be handy, and in fact necessary, for you to perform similarity search. Similarity search looks through a set of documents or a body of text to try to find passages that are similar to the query. You can see that this is different from keyword search, which is looking for an exact character match. The process here is to input our document, and in the code shown here, we're using LangChain's web base loader, which allows us to point directly at Wikipedia's poetry page and take the text into a loader variable. The next step is to process the text into chunks; sometimes this is also referred to as splitting. There are a few different ways to split our long text from that web page or any other source into these chunks, but the one that we're showing here is the recursive character text splitter. It's going to achieve the 300-character chunk size by naturally looking for markers that may suggest a change in semantics or meaning, for example, a couple of newlines (\n\n) to signify a new paragraph or the start of a new section in the text. The chunks are then put into the variable that you see here and passed in for conversion to vectors, where we use an embedding model specifically. This is distinct from a GPT model that lets us chat or query; instead, it produces a set of 768 embedding values, numbers that represent the data as determined by the embedding model. We call this a vector, and we can have a whole database of vectors based on parts of a larger text body or on individual documents or sentences. One way to do this is to use FAISS, Facebook AI Similarity Search, where we pass in our chunks along with the reference to our embedding model, and what comes out is our vector database. Later, we'll need to retrieve, so we have to make sure to set up a retriever object that looks at our vector database, and we can pass in a text query.
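Pulling those steps together, here is a sketch of the whole pipeline using classic LangChain components; the Wikipedia URL and the choice of embedding model are assumptions for illustration, and module paths vary between LangChain versions.

    from langchain.document_loaders import WebBaseLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.vectorstores import FAISS

    # 1. Load the source text (here, the Wikipedia poetry page).
    docs = WebBaseLoader("https://en.wikipedia.org/wiki/Poetry").load()

    # 2. Split into ~300-character chunks at natural boundaries such as "\n\n".
    splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=30)
    chunks = splitter.split_documents(docs)

    # 3. Convert chunks to vectors with an embedding model (not a chat model).
    embeddings = HuggingFaceEmbeddings()  # defaults to a 768-dimensional sentence-transformers model

    # 4. Build the FAISS vector database from the chunks and the embedding model.
    vector_db = FAISS.from_documents(chunks, embeddings)

    # 5. Set up a retriever and run a similarity search with a text query.
    retriever = vector_db.as_retriever()
    results = retriever.get_relevant_documents(
        "Can you help me understand the big picture of tetrameter?"
    )
    print(results[0].page_content)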
For example, we could ask: can you help me understand the big picture of the tetrameter metric? What happens here is that the query itself is embedded and compared to every vector that was embedded from our larger document, and the most similar ones rise to the top and become the sorted results of our search. At this point, your gears may already be turning, thinking of ways that you could take advantage of having a large vector database full of information that's ready to use with your large language model. That's what we're going to talk about next. What is retrieval-augmented generation, and why would we combine a large language model's encapsulated knowledge with your data? Well, one of the first reasons is that while large language models have been trained on large amounts of data, they may have been trained without your data or on topics that don't really fit your application. For example, getting a summary of some confidential enterprise info or private medical data comes to mind. Retraining a model with newer data can be long and costly, and the new data may become irrelevant over time anyway. Besides, if you expect to add private data into training, you would have to ensure that the model does not let it out into the public. Also, adding new specific data in real time, on the fly, does not remove the burden of a large language model's limited context window. The RAG concept is simply to ingest data from an existing database, from web pages, or from a specific document, like the latest document on a topic, created too recently to have been part of any training data. That way, you can use the retrieved, relevant information in the workflow of your application, either providing some of the context or, when necessary, acquiring factual information to generate an answer. Summing up, a few benefits of RAG over a standard language model alone: access to external knowledge, because you can pull relevant information from a fixed dataset during the retrieval phase; answer diversity, because by retrieving different passages, RAG can produce a variety of answers based on the external data interacting with it; structured responses, because RAG can pull information from structured databases, which can sometimes give you more concise, pinpointed answers; and traceability of responses, because answers derived from database entries can be traced back to their sources, and that gives a level of transparency. Think about the workflow that we followed in the research app demo; that was the one where we clicked on the Research button, and then the LLM synthesized a technical support summary for us from technical information that was available in documentation and blogs. Now, let's consider a schematic of a RAG workflow. When a query arrives, a retrieval process goes to fetch the relevant data through the framework. Once again, the data could already be part of a database, or it could be loaded at that moment. Then the retrieved info is combined into the prompt and sent to the large language model, which in turn produces a response that considers the info contained in both. Note that this RAG workflow also incorporates guardrails for both the prompt and the response. Now, one of the first steps is to vectorize, to embed, your input. Put simply, embedding is a method to convert an input, in this case text, but it can be images, videos, whatever, into numbers that comprise a numerical vector.
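With the pieces we have built up, a RAG chain along the lines of that schematic might be wired together like this sketch; it assumes the vector_db from the earlier example, and the query is made up for illustration.

    from langchain.chat_models import ChatOpenAI
    from langchain.chains import RetrievalQA

    # vector_db is the FAISS store built in the earlier sketch.
    retriever = vector_db.as_retriever(search_kwargs={"k": 3})   # fetch the 3 most similar chunks
    llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

    rag_chain = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=retriever,
        chain_type="stuff",   # "stuff" concatenates the retrieved chunks into the prompt
    )

    # The retrieved passages are combined into the prompt before the LLM is called.
    print(rag_chain.run("Why do clarinet keys stick, and how can the player fix them?"))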
What's important here is that the embedding is performed taking into account the context and the meaning, which means that the same word in two different sentences could be encoded differently based on its semantic use. Think about the example here, in the gray box. We have a query: who will lead the construction team? One chunk says the construction team found lead in the paint, and the other chunk says that Ozzie has been picked to lead the group. We can see that if we want to find the more similar chunk, it would have to do with leadership and teams or groups, and less to do with the chemical substance. So chunk two would be more akin to the original query. All right, going back then: once finished, we take these vectors and we store them in a vector database, with the idea that the closer two vectors are, the closer their similarity and also their meaning. But to be fair, we have to note that similarity doesn't imply relevance, and that's why, in some cases, keyword search approaches can offer better results than vector DB ones. You've got to know your application. In any case, it's becoming common to use embeddings and similarity search with vector databases and semantic retrieval for different use cases like classification or topic discovery. The clusters on the left are an example. On my team, we were interested in better understanding feedback we got about GTC, which is free-form and unstructured. Looking at the image, you'll notice that we have these clusters of points. They arose from plotting the embedding vectors and then projecting down into two dimensions, where each point represents a feedback message's text. These clusters turn out to be thematic, sharing a common topic like scientific computing or speech AI. So this is a good way to visualize semantic distance. Let's now think step-by-step through bringing new data into our application for the LLM to process. One good question is: what can the LLM ingest, and how much? As a reminder, today's LLMs can really only ingest a limited number of tokens, and today that's in the tens to low hundreds of thousands. So to ensure that an LLM can ingest our data, we need our data to be split into chunks that fit the context window's size limit. Turning to the code, we see that PyPDFLoader grabs the text from an unencrypted PDF, though I should point out that other loaders could have handled things like JSON or CSV or many other types of data. Either way, a list of document objects is returned, and for this example, let's imagine it has a length of 20 pages. Note that we can have different vectors representing these various segments. Then afterwards, when we retrieve the information, the retrieval function gives us back pieces of the file's text, and that's more useful than simply being returned the entirety of the PDF file's 20 pages of text. After loading, we need to break up the text into smaller pieces to capture the semantics of the document and to improve the potential relevance of a text passage during our search. That will also allow for the limited context window that we have. The initial splitter shown on the left uses 500 characters as the chunk size; that's a large chunk size that could help us find, say, an idea more holistically inside one of our files. But on the right, we have a chunk size of 30 characters, and that smaller chunk size allows for more fine-grained searches.
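Here is a short sketch of that loading and splitting step with the two chunk sizes side by side; the PDF file name is a placeholder.

    from langchain.document_loaders import PyPDFLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # Load an unencrypted PDF; one document object is returned per page.
    # "product_manual.pdf" is a placeholder file name.
    pages = PyPDFLoader("product_manual.pdf").load()
    print(len(pages))   # e.g. 20

    # A large chunk size keeps more of an idea together in one chunk...
    coarse = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    # ...while a small chunk size allows more fine-grained searches.
    fine = RecursiveCharacterTextSplitter(chunk_size=30, chunk_overlap=5)

    coarse_chunks = coarse.split_documents(pages)
    fine_chunks = fine.split_documents(pages)
    print(len(coarse_chunks), len(fine_chunks))   # the fine splitter produces many more chunks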
We also note that there's a chunk overlap to make sure that we don't miss any semantic concepts as we slide our window along the text. The chunking process is, somewhat deceptively, not such a straightforward task. To highlight the complexity, let's consider two different kinds of data: a French novel and an English tech document. As you may know, some languages are more verbose than others. So when we're dealing with content in different languages, we may have to consider the alphabet or the recurrence of words, or sounds, in that language. Besides, some languages are more direct than others, being more efficient in terms of meaning per character. And finally, a tech document is less verbose and more straightforward than a long poetic description in French literature. All this goes to say that chunking by splitting on character count alone may not be enough to extract a meaningful piece of text. The same is true if you have a technical document, let's say a code example, where going to the next line, using the newline (\n) information, may be sufficient to indicate a switch of topic. An approximate conclusion here would be that depending on the kind of data (text, PDF, Markdown, etc.), the field the document relates to (like tech specs, a historical summary, or a business report), the language used, and so on, you will have to consider different chunk sizes, even though it would be ideal to find some elusive standard one. But one piece of advice: do experiment, and do consider the parameters we just mentioned, to determine the right chunk size. Try it out. Just to highlight it once more: given the same kind of input, by modifying the chunk size parameters we end up with either a more or a less meaningful vector. Understanding how to pre-process your data so it can subsequently be embedded is of utmost importance. Let's take a look at some other means to make the ingestion of data more efficient. One is to use RAG sub-query chaining and cascading. Many inputs have multiple topics, and using a single embedding for such an input really dilutes the ability to retrieve the topic-relevant chunks. We can use an LLM to generate retrieval queries, identify relevant info and return chunks, and then combine those into the final prompt. We may be able to further optimize by parallelizing part of the RAG process. Referring to the block diagram on the left, after a complex query is decomposed into sub-queries, both sub-queries are processed simultaneously and subsequently combined to produce the output. We do love to parallelize processing. Re-ordering or re-ranking: here's an example. You can retrieve more results than you ultimately desire via an efficient search, like embedding distances, then use something like a pairwise evaluation to select from that list. Or, taking a completely different approach, apply maximum marginal relevance (MMR), scoring on both relevance to the query and diversity of the results to reduce redundancy. Then finally, another technique is to embed data along with its associated metadata; that adds context awareness. Let's illustrate that by looking at the code example on the top right. When looking through the chunks of text to embed, adding information from headers, like in a Markdown document, will emphasize the meaning and the context of a sentence. The markdown header text splitter function gives us the ability to combine the text with header information when looking at the sentence, in this case a sentence about customizing a model using parameter-efficient customization.
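Here is a small sketch of that metadata-aware splitting, with a made-up Markdown snippet standing in for the document shown on the slide.

    from langchain.text_splitter import MarkdownHeaderTextSplitter

    # Hypothetical Markdown snippet; the headers carry context for the sentence beneath them.
    markdown_text = (
        "# Customizing a model\n"
        "## Parameter-efficient customization\n"
        "P-tuning lets you adapt a model without updating all of its weights.\n"
    )

    headers_to_split_on = [("#", "Header 1"), ("##", "Header 2")]
    splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

    docs = splitter.split_text(markdown_text)
    # Each chunk keeps the header hierarchy as metadata, so the embedding step
    # (and later the LLM) sees the sentence together with its context.
    print(docs[0].page_content)   # "P-tuning lets you adapt a model ..."
    print(docs[0].metadata)       # {'Header 1': 'Customizing a model', 'Header 2': 'Parameter-efficient customization'}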
The added metadata will help the large language model understand the context of the sentence and possibly its relevance, for instance, adding the document date so that the large language model can determine which chunk to prioritize if two conflict. Earlier, we described a simple but realistic case with our demo about incoming e-mail at our fictitious company, and we've offered several ideas on how it could be extended or improved via various techniques. Though simple, we wanted it to highlight a real, enterprise-grade level of UI design and back-end API endpoints, leveraging some of our internal tools as well as others from the open-source community. Finally, the code shown is from the research functionality of our demo, and it's surprisingly brief considering all that it's doing. Of course, that's largely thanks to the framework and the API. The workflow is to retrieve the documents, then filter to the top results, and then feed those into the large language model to summarize. The LLM chain here contains the stuff-documents chain, which concatenates documents together to feed into the LLM as context within the overall prompt. Then the large language model takes in all of that information and forms a natural language response. Before we wrap up, I just wanted to invite you to come explore the NVIDIA AI Foundation Models, including Nemotron-3, Code Llama, NeVA, Stable Diffusion XL, Llama 2, and CLIP. Here, I used the NeMo LLM service on the left to generate a story about an Egyptian goddess who's a cat. Then on the right, I asked NeVA, the NeMo Vision and Language Assistant, to analyze a synthetic image that I made of that cat. Sure enough, it understands that this cat is sitting on a couch in what appears to be an ancient Egyptian setting. Very well done. Let's review the information that we covered today. We first discussed the core concepts of large language model architecture and foundation models, before moving on to the factors for selecting between and evaluating large language model APIs. We then moved on to prompt engineering basics and covered a few workflow frameworks for LLMs, before taking a look at retrieval-augmented generation, or RAG. We also presented two demonstrations showing how you could use these principles for an e-mail triage application. Before we go, I'd like to thank my colleagues, Ben Byart, Chris Milroy, and Chris Pang, for their many contributions to the session. Thank you.