Introduction to AI Lecture Notes AIN2001 PDF
Summary
These lecture notes provide an introduction to artificial intelligence, focusing on models like ChatGPT. They discuss performance metrics, datasets, and the role of computation, providing a foundational overview of the field.
Full Transcript
Introduction to AI

What is ChatGPT? ChatGPT is essentially a next-word prediction engine. It is a large language model, and the way it works is that it just predicts the next word.

0R: the way we predicted accuracy was to look at the accuracy of always predicting the most common outcome. If the most common outcome is that you are not going to win the lottery, and the model predicts that you are not going to win the lottery, its accuracy is probably 99.99%. 1R: maybe it takes one input.

How can we know what the performance of a model should be? We put these kinds of models through tests that humans take. In 2023, AI surpassed human performance in image classification and visual reasoning (how many wheels are there?). What that means is that AI consistently scores 95%. Question: what is the best that humans do on exams that are trivial to us? What AI can't do is competition-level math, reasoning, and planning.

So these are different benchmarks. ImageNet, the blue one, is the image recognition benchmark on which deep neural networks first did really well, in 2012. So for the last 12 years we have had something that can classify images really well. But in the last couple of years we have had models that really surpass the human baseline (95%). The way we figure out how well ChatGPT does is that we put it through tests. Dataset accuracy: J48, 72%; 0R, missed.

Machine learning is a little bit art and a little bit science. The art part comes in how you define the models. The science part comes in measuring the performance. And when you measure the performance, you should not try to hit 100%. I don't have a mathematical or scientific explanation for this, but I have an experiential one: the model will fail. But if your model is around 90 to 95%, it may be doing very well. Google Gemini reaches 59% accuracy and the human expert reaches 82%.
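The 0R idea from the lottery example can be sketched in a few lines of Python. This is only an illustration: the function name `zero_r_accuracy` and the lottery data (1 winner among 10,000 players) are made up for the example.

```python
from collections import Counter


def zero_r_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)


# Hypothetical lottery data: 9,999 players lose and 1 player wins.
labels = ["lose"] * 9999 + ["win"]

prediction, accuracy = zero_r_accuracy(labels)
print(prediction, f"{accuracy:.2%}")  # prints: lose 99.99%
```

The model has learned nothing about lotteries, yet it scores 99.99%. That is why a new model's accuracy should be compared against this baseline and not read on its own.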
0R, 1R and linear regression: the way we measure the performance of any of these models is a mathematical formula. But with LLMs we cannot measure it this way, because the output is just language. What we need to do is ask them a question, look at the answer, and have some human grader, a teacher, say whether this answer is correct or not.

So in order to figure out how well an LLM is doing, for example what ChatGPT does for programming, we create a test set of coding problems. There are three of them. HumanEval has about 164 coding problems. These are entry-level coding problems that someone would ask you if you wanted to get an internship at a random company. SWE-bench has difficult problems, about 2,200. When you look at the performance of these models, you can see AgentCoder, a modified ChatGPT 4 model. AgentCoder can solve 96%. HumanEval is a dataset from the last three years.

Why do we create datasets? As with final exams, we use the same test for all these models. We ask the same questions to two different models: a quiz of 164 coding problems. With the 1,000 problems of MBPP, it took a little bit longer. By January 23, there was already about 80% accuracy. And then by January 23 here we are at 70% accuracy with this dataset. These are entry-level questions.

There is another dataset, SWE-bench, which is the benchmark they created. Here we have 2,200 complex problems collected from GitHub that other people have asked about. So for the hard problems it seems that LLMs are not doing well, but for the entry-level problems they work really well. The best one solves 12%. Why? The first thing is, obviously, there are a lot more simple problems in the world than hard problems, so it's hard to train the models on hard problems. Second, we don't know whether this intelligence (some call it an alien intelligence) works the way human intelligence does. It's a statistical model. We really don't know.

In 2011, there were only 845 AI projects.
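Grading a model on a set of coding problems can be sketched as below. This is a minimal illustration and not how HumanEval or SWE-bench are actually run: the four "problems", the lambda "answers", and the helper `passes` are all hypothetical.

```python
def passes(candidate, test_cases):
    """Return True if the candidate gives the expected output on every test case."""
    try:
        return all(candidate(*args) == expected for args, expected in test_cases)
    except Exception:
        return False  # a crash counts as a wrong answer


# Hypothetical mini benchmark: (model's answer, test cases) for each problem.
problems = [
    (lambda a, b: a + b, [((1, 2), 3), ((0, 0), 0)]),  # correct: add two numbers
    (lambda s: s[::-1], [(("abc",), "cba")]),          # correct: reverse a string
    (lambda n: n * 2, [((3,), 9)]),                    # wrong: should square n
    (lambda xs: xs[0], [(([],), None)]),               # crashes on an empty list
]

solved = sum(passes(candidate, tests) for candidate, tests in problems)
accuracy = solved / len(problems)
print(f"solved {solved} of {len(problems)}: {accuracy:.0%}")  # prints: solved 2 of 4: 50%
```

Because every model is run against the same problems and the same test cases, the resulting percentages can be compared directly, which is the point of having a shared dataset.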
When we create an LLM, it has been shown by scientific measurement that if we use more data and more computational power, the next model will do much better in terms of performance. How do we measure the performance of a language model? We put it through a test. What you need to understand is that the performance of a model is well predicted by the number of parameters we put into it. The more computational power we have, the more parameters we can have. If we want the best model in the world, we need the most parameters, meaning the most computation and the most data. Governments can collect all the data in the world. Who's going to win?

ChatGPT used an estimated 78 million dollars' worth of computation power. These millions come from the petaflops it uses. FLOP means floating-point operation. In order to calculate, we need to do a lot of linear algebra. AlexNet, the model, used 470 petaflops for the whole system, and Google used far more to train its system. The big players will be the US, China, Saudi Arabia, and Russia.

The other problem is: are we going to run out of data? It is predicted that about 100 trillion words exist on the internet at the moment, and the largest model that we have is trained on 9 trillion words. OpenAI went onto the internet and copied every single thing that they could find. So far, they have only found 9 trillion. But on the whole internet there are 100 trillion. We're going to run out of the data that is available on the internet, and the projection is somewhere between 2032 and 2040.

Low-quality language stock: the internet.
High-quality language stock: books.

What do we train them on? The petaflops measure how much resources we use for training, and that creates carbon emissions. Foundation models are the ones trained on public data. Specialized models are the ones where you take a foundation model and train it on a specialized dataset to get a specialized model.
All the coding models were created this way: they used a foundation model first, and then they added the extra data. Question: why is Google's Gemini model not better than ChatGPT if it cost more money?

On the publicly available data, 149 foundation models were released in 2023. In 2022, there were only 67. In 2022, only 44% of them were open source; in 2023, 65%. We are going more open source.

One of the biggest problems with this process is deepfakes. In one capstone project, they use a crawler to crawl the internet, mostly news websites. They collect these articles and parse them. They figure out what each article is saying, and then they use AI to create a fake article that responds. That's a political war: article against counter-article.

Tesla makes the decision. Is this the right thing to do or the wrong thing to do? QUIZ. Amir's question: an autonomous vehicle gets hacked (very good question). Who is responsible for it?

Tokenization: in order to create an LLM (ChatGPT is a large language model), before we do anything we have to featurize the whole vector space. A featurizer takes whatever is put in front of us into a feature space. So we were presented with some photos, and for each photo we said, oh, it's either a car or a motorcycle, right? That's our goal: trying to figure out what to do with the dataset. In order to run all these algorithms, I have to go from photos to an Excel sheet, because these algorithms are very mathematical. So we need to take the raw data and put it into something that makes sense for everyone. In an LLM, the featurizer takes a text and turns it into tokens. Translating from raw text to tokens is called encoding. We essentially take a big chunk of text and encode it into small tokens.
Tiktokenizer: In a tokenizer, which is a featurizer, when we think about the features of a sentence, we can think about small word sections, maybe two or three letters, or maybe four or five letters, and all of those become small features. Then we can give them an index. For example, A could be 1, AA could be 2. That creates the tokens. These tokens are essential to how these models work.

Assume that our tokenizer just took the words, and think, for example, of the feature space of a book. In the index of the book, we can see which words are repeated on which page. So a featurizer that featurizes a text using only words will have an index space of all the words in the world.

What we're trying to do is build a ChatGPT. How many algorithms have we seen so far? This is the fourth one: how to build a large language model. Tokenizers essentially take the text and turn it into an Excel sheet, right? We do not want to deal with the actual data. We want to take the data, parse it, and turn it into a form that the machine learning algorithms we choose can use. The first step is tokenization, because these algorithms cannot work with text. They only work with numbers. These are all statistical machines. So we have to take the text and turn it into some number space, meaning that when I look at the text, it has to be just numbers.

So how do I take something written by Shakespeare and turn it into numbers? One idea is: why don't we call every word a number? AI systems are mathematical. We have to give them numbers, and we get numbers out, and then we can map those numbers back into our language space. When you take your photo and put it into a 0R, you cannot just give it the photo. You have to translate the photo into an Excel sheet. It has to be numbers, always in the same form.
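The idea of calling every word a number can be sketched as a tiny word-level tokenizer. This is a toy under stated assumptions: the helper names `build_vocab`, `encode`, and `decode` are made up here, words are split on whitespace only, and real tokenizers such as the one behind ChatGPT work on sub-word pieces instead.

```python
def build_vocab(text):
    """Assign each distinct word the next free integer, in order of first appearance."""
    vocab = {}
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def encode(text, vocab):
    """Text in, numbers out."""
    return [vocab[word] for word in text.lower().split()]


def decode(ids, vocab):
    """Numbers in, text out: map the integers back into language space."""
    id_to_word = {index: word for word, index in vocab.items()}
    return " ".join(id_to_word[i] for i in ids)


text = "to be or not to be"
vocab = build_vocab(text)
ids = encode(text, vocab)

print(vocab)               # {'to': 0, 'be': 1, 'or': 2, 'not': 3}
print(ids)                 # [0, 1, 2, 3, 0, 1]
print(decode(ids, vocab))  # to be or not to be
```

Repeated words get the same number, which is exactly what the index of a book does. The weakness is also visible: `encode` fails on any word that was not in the text used to build the vocabulary, so a word-only tokenizer needs an index of every word in the world.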
You don't know what it is, but you can send OpenAI requests and get responses back. So suppose you were to use the featurizer of GPT-3.5 and wrote the sentence "You are a helpful assistant". Don't worry about the system and the user so much, because even Sam Altman says they don't know what it means. But the system and the user are two inputs; you can think about it that way. Then the featurizer will divide this into smaller sections. The tokenizer is essentially a featurizer that divides the text into features.

What does it mean to have a feature? A feature is just a function of the input, but a really dumb one. A featurizer takes the input and puts it into a number space. Tokenizers are your real-time feature space creators. You feed them data, they create tokens out of it, and these tokens are indexed.

0R, 1R, linear regression, large language model. Tokenization involves breaking down the text into manageable pieces that the model can interpret. Manageable pieces means numbers. It's really important that we map these tokens into an integer space.

Think about your brain. It has to be dense, meaning that you can't remember every single conversation you've ever had, so what you remember has to be the type of conversation you had. When you meet someone who's got a great memory, don't you respect them? You respect that they have a bigger tokenization space, meaning that they map the conversation into some kind of symbolic thing in their head, and if it is mapped the right way, they can remember it the right way. Second, there is the mapping of what is being discussed into some kind of abstract thing in your head. You have to think about it: these are very unstructured data, right? One conversation is about the subway, and another is about something else. So how can we make sure that the machine understands? It's a really, really difficult problem. Tokenization is not easy.
Tongue twister: What I told you just now, in a language setting, is kind of irrelevant to a lot of the things we've been discussing. In fact, the Turkish students I just talked to couldn't even translate what I said, but what I said was: oh, the other people in the town bought a cow that's black and white, so why can we not buy the same black-and-white cow so that we can have a cow that's black and white?

In order to make it 70 million, we have to represent our data in a dense, low-dimensional, continuous space. That means that for these chips to work on our data, we first have to turn everything into the Excel sheet. The Excel sheet should be dense, meaning that there should not be too few filled entries. A dense Excel sheet is not one with 65,000 entries; it has only five entries, and three of them have data. Low-dimensional means squeezing the data as much as you can. Continuous means it has to be floating point. If it is integer, it doesn't work very well.

Embeddings are things that are tokenized the same way. So it's kind of giving pre-information to the model. In a way, what we're trying to do is get to a representation by ID and find the correlation of this ID with other IDs. ChatGPT-like models are only trying to guess the next word. So our goal is to guess the next word, and doing it word by word is not that smart; it's much better to do it with a tokenizing strategy.

The difference between embeddings and tokenization is that tokenization is just a straightforward process. We say, oh, every three letters is an ID, or every three characters is an ID. Or we can say, I'm going to get rid of the spaces between the words, and then every three characters is a token. Or I can say a token is something that has two vowels and one consonant. You can say anything you want. It doesn't matter; as I said for the featurizer space, the featurizer can be done any way you like.
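The "every three characters is an ID" rule, with and without the spaces, can be sketched like this. It is a toy for illustration only: the names `chunk` and `tokenize` and the sample text are made up, and production tokenizers choose their pieces very differently.

```python
def chunk(text, size=3):
    """Cut the text into consecutive pieces of `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def tokenize(text, size=3):
    """Give each distinct piece an integer ID and return the IDs plus the index table."""
    index_table = {}
    ids = []
    for piece in chunk(text, size):
        if piece not in index_table:
            index_table[piece] = len(index_table)
        ids.append(index_table[piece])
    return ids, index_table


# Rule 1: every three characters is a token, spaces included.
ids_with_spaces, table = tokenize("bonbon bonbon")
print(chunk("bonbon bonbon"))  # ['bon', 'bon', ' bo', 'nbo', 'n']
print(ids_with_spaces)         # [0, 0, 1, 2, 3]

# Rule 2: remove the spaces first, then every three characters is a token.
ids_no_spaces, _ = tokenize("bonbon bonbon".replace(" ", ""))
print(ids_no_spaces)           # [0, 0, 0, 0]
```

Both rules are valid featurizers, but they produce different tokens for the same sentence: with the space kept, the second "bonbon" is cut in different places and gets new IDs. Which rule works better is an empirical question.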
We're trying to get to the raw data. Tokenizers have one input and one output. Embeddings take some text and create a correlation space over it. So embeddings are essentially how these numbers are correlated with each other. The tokenizer takes a text space into a number space, but then we have to know how these numbers are correlated. The way we can find out is kind of like making bread: we put our hands into the dough, mix it and match it, and so on, and then we pull one piece out, and it oozes. That shows how it's correlated with everything else. So we need to create some kind of relational space between different tokens. I said words, but words are a very weak strategy. Usually people use three or four characters, depending on the situation. And we need to find out how often they occur together. The embedding stores all the information that the retrieval algorithm needs. We start to figure out the relationships between these small pieces.

An LLM is like bread. With a really well-baked bread, when you cut it you say, oh my God, it's so nice, because there are so many relationships in it. But a bad LLM crumbles. If you have ever had cornbread, it crumbles: you take the bread, you try to take a little bite out of it, and it's gone. It all crumbles, unlike a really nicely baked bread.

What would you think is the main challenge in creating a large language model?

4th November Quiz: What are these numbers? What we do is take a word, chop it up into three or four characters, whatever, and put the pieces into a table. It's called an index table. Every token gets an index, a number assigned to it, so 2 here is equivalent to "today". Let's look at, for example, ChatGPT's: 18 tokens. Do you see any numbers that are the same in here?
So all we are trying to do is take a bunch of characters and turn them into tokens, and all these algorithms are different from each other. Tokenization algorithms are different. Do we know which one is the best? No, it's just trial and error. This one is ChatGPT, I think ChatGPT 3.5. ChatGPT uses a lot more tokens than Meta's, right? ChatGPT made 18 tokens out of a five-word sentence, and then there is Meta's Llama itself. That model is a lot simpler, and it's not precise, so maybe ChatGPT is more precise. Another thing is that the fewer tokens we have, the less completion time we need.

Now that we've turned everything into tokens, taking a sentence and turning it into a bunch of integers, we need to create an embedding. Embeddings are really important, because that's what defines the relationship between tokens. What I'm looking for is how often words appear in the same sentence. We're trying to guess, given a word, what the next word could be. So what should we learn from the sentence structure, from whatever text is given to us? How often two words appear together. That's really important.

There are two ways of doing this. One of them is syntactic, and one of them is semantic. Semantic means the logical relationship between words, and syntactic means the grammatical relationship between words. When I say "I am", "I" and "am" have a syntactic relationship. "We are": there's a syntactic relationship between "we" and "are". I don't say "we am"; there's no syntactic relationship. When I say "Istanbul is a city", "Istanbul" and "city" have a semantic relationship, a logical relationship.

So embeddings are vectors. They're the vector representation of these relationships. The token vectors are not dense; they're actually really large. Embeddings are low-dimensional and continuous, meaning that they're floating point.
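The contrast between large, sparse token vectors and small, dense, floating-point embeddings can be sketched as follows. Everything here is an assumption for illustration: the three-word vocabulary, the two-dimensional embedding values (made up, not learned), and the use of cosine similarity as the measure of relatedness.

```python
import math


def one_hot(index, vocab_size):
    """A sparse token vector: all zeros except a single 1 at the token's index."""
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]


def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vocab = ["istanbul", "athens", "summer"]

# Token vectors grow with the vocabulary and carry no relationship information:
# any two different tokens have similarity 0.
print(cosine(one_hot(0, len(vocab)), one_hot(1, len(vocab))))

# Made-up 2-dimensional embeddings: dense, low-dimensional, floating point.
embeddings = {
    "istanbul": [0.9, 0.1],
    "athens": [0.8, 0.2],
    "summer": [0.1, 0.9],
}
print(cosine(embeddings["istanbul"], embeddings["athens"]))  # close to 1: both are cities
print(cosine(embeddings["istanbul"], embeddings["summer"]))  # much smaller
```

With a real vocabulary of tens of thousands of tokens, each one-hot vector would have tens of thousands of entries, while an embedding keeps a fixed, much smaller number of floating-point values per token.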
Embedding algorithms are different from the machine learning algorithms, even though they can use machine learning algorithms. What I'm trying to say is that the large language model uses a transformer architecture, and the embeddings happen before the transformer architecture. What we are talking about here is RAG. RAG is retrieval-augmented generation. Creating an embedding is kind of like creating a dictionary for the transformer to look into, because the embedding itself holds the relationships between words. The retrieval algorithm of the transformer will look into the embedding, try to find the relationships, and use them in its calculation of the next word.

Embeddings are vectors, and the dimension is defined by the embedding algorithm. We can actually change those parameters. We can say, okay, we want the dimension to be like this or that. And since it's a vector, there's the origin, there's a direction, there's the magnitude. (That's not "agnitude", that's "magnitude".) This determines the relationship between two words. It's semantic, right? Because it's meaningful to have Athens and Istanbul in the same paragraph if you're talking about cities.

So what we would do is probably create a table of the percentage of times these words appear in the same piece of writing, say in the same paragraph. Let's say we downloaded all the text on the internet, and we look at the percentage of times two words appear together in the same paragraph. I made up these numbers, but that's one algorithm. One algorithm could say that Istanbul and cats occur in the same paragraph 99% of the time on the whole internet. That makes sense, right? Cats in Istanbul, it's like a thing. But maybe summer occurs only 5% of the time. So these numbers are just the percentages of how often the words occur together.
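The table of co-occurrence percentages can be sketched as below. This is one possible reading of "how often two words appear in the same paragraph", namely the share of all paragraphs that contain both words; the function name `cooccurrence_percentages` and the four paragraphs are made up for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_percentages(paragraphs, words):
    """For each pair of words, the percentage of paragraphs that contain both."""
    pair_counts = Counter()
    for paragraph in paragraphs:
        tokens = set(paragraph.lower().split())
        present = sorted(word for word in words if word in tokens)
        for pair in combinations(present, 2):
            pair_counts[pair] += 1
    return {pair: 100 * count / len(paragraphs) for pair, count in pair_counts.items()}


# Made-up "internet" of four paragraphs.
paragraphs = [
    "istanbul is full of cats",
    "cats sleep on ferries in istanbul",
    "summer in istanbul is hot",
    "athens and istanbul are old cities",
]

table = cooccurrence_percentages(paragraphs, ["istanbul", "cats", "summer"])
print(table)  # {('cats', 'istanbul'): 50.0, ('istanbul', 'summer'): 25.0}
```

In this toy corpus "cats" and "istanbul" share a paragraph twice as often as "istanbul" and "summer". A table like this is a raw statistic, not yet an embedding: an embedding algorithm would compress such relationships into short floating-point vectors.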