Revolutionizing AI: Advancements in Large Language Models and Quantization

It’s very exciting. There are multiple advancements that are each like 100x improvements over the current status quo, which is already just amazing, right? And we have several of these hundred-x improvements coming at the same time. So it’s very exciting. I saved a bunch of really good graphs to talk about this. Let me find them.

Okay, so there are two really big things to know about how large language models are getting a lot smaller, a lot more efficient, and better at the same time. So these red lines show performance on different benchmarks. The x-axis is model size, and the y-axis is how good the model is at different stuff, like math, coding, etc. And you can see there’s a trend in the red line, right? Let’s talk about the red line. This is the single-expert type of model, things like Grok, GPT-3, Llama; all of these older-style AI models are a single dense network. And there’s an asymptote: you can see the lines all start to flatten out. There’s a natural ceiling for that type of architecture, so if you want it to be really good, it has to be really big, and you need a lot of resources to run it. Okay? And that’s a problem, obviously, because resources are limited. So if we could 10x that, or 100x it, that would be great. And in fact, we can. You can see the orange line. This is mixture of experts: there’s more than one sub-network inside the model, and a small router decides which experts handle each piece of the input, so only a fraction of the parameters actually run on any given token. At the same size, these models consistently do much better across the board. All of the new stuff is like that, this orange line, and you can see it’s a very different trend in terms of the resource cost of doing things. That’s why these new ones are so much better.
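To make the routing idea concrete, here’s a minimal sketch of a top-k mixture-of-experts layer in plain NumPy. The expert count, dimensions, and top-k value are made-up illustrations, not any specific model’s configuration, and real systems use a learned router rather than random weights:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 16, 8, 2  # illustrative sizes

# Each expert is a simple feed-forward weight matrix.
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
# The router is a small linear layer that scores experts per token.
router = rng.standard_normal((d_model, n_experts))

def moe_layer(x):
    """Route one token vector to its top-k experts and mix the results."""
    scores = x @ router                  # one score per expert
    top = np.argsort(scores)[-top_k:]    # indices of the k best experts
    # Softmax over just the selected experts' scores.
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()
    # Only top_k of n_experts expert matrices are touched per token,
    # so per-token compute scales with top_k, not total parameter count.
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)  # (16,)
```

The point of the sketch is the resource argument from the chart: the model can hold many experts’ worth of parameters, but each answer only pays for the few experts the router picks.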

So on this chart, almost all of these are single expert, okay? The ones that are mixture of experts are GPT-4 and then these ones here, DBRX and Mixtral. And they are significantly better than everything else. So it’s like, why is anyone still doing it the old way? Like, why would Elon do that? Because he can’t get anyone competent to work for him, so this is all he can do. He doesn’t have the technical capacity to produce these very high quality models, because smart, capable people don’t wanna work for him. Same with Zuck: Llama is not really getting any better; it’s basically flat versus the last one. So this single-expert architecture is just out, and moving to the new one is so much better. The other big thing is this very exciting development happening with quantization.

So this is more technical, but I’m gonna explain it in a very simple way. Okay. The reason you need GPUs is that you have to do matrix multiplication, which basically just means doing a whole spreadsheet of math all at the same time, right? So you have all these numbers with something happening to all of them, and it takes a long time for a computer to do that. In this case, the numbers are the weights: when you see that GPT-3 is 175 billion parameters, or weights, each one of these is a weight. So the model has a lot of numbers, and it has to multiply them very fast to give you your answer. And it’s a two-step process, because it has to first multiply and then add up the answers. So there are two steps for every single number, for all those billions of numbers. Okay? And because these are floating-point numbers, decimals, it’s even more complicated: the hardware has to keep track of where the decimal point goes, how many bits of precision it needs, whether each one is an 8-bit number, a 16-bit number, or a 32-bit number. It’s complicated for a lot of reasons. So the new idea is to constrain the weights to only be ones and zeros, because then you don’t have to do the multi-step math at all: multiplying by one just keeps the number, and multiplying by zero skips it, so instead of multiplication then addition, you’re just adding. It’s a step-change improvement in efficiency for inference, for running the AI to give you an answer.
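Here’s a tiny sketch of that difference. The layer sizes are made up; the point is just that with floating-point weights every output needs a multiply and an add per weight, while with weights restricted to zeros and ones the multiplies disappear entirely:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 512, 256  # illustrative layer sizes

x = rng.standard_normal(n_in)

# Standard layer: floating-point weights -> multiply-accumulate per weight.
w_float = rng.standard_normal((n_out, n_in))
y_float = w_float @ x  # n_out * n_in multiplies AND adds

# 1-bit layer: weights are only 0 or 1 -> no multiplies needed.
w_bits = rng.integers(0, 2, size=(n_out, n_in))
# Multiplying by 1 keeps a value and by 0 drops it, so each output
# is just a sum over the inputs its weight row selects.
y_bits = np.array([x[row == 1].sum() for row in w_bits])

# Same result as the matmul, computed with additions only.
assert np.allclose(y_bits, w_bits @ x)
```

On real hardware this is what lets the accumulate step run without the floating-point multiplier circuitry at all.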

So my first thought when I saw this concept was: well, what about all the significance in those digits? That matters, because if you just take an existing model and start making its numbers shorter, the model isn’t as good. Okay? But if you train it this way from the start, it actually is as good. All of the experimental evidence shows that, in fact, when you quantize models this way, you can still teach the model just as much, you can still make it just as good, and then it takes something like 100 times less compute to get the answers you want out of it. You’re training the model to be way better at inference, at responding quickly to questions, whereas the old way is just how we’ve always done it, and it’s not the best way; this is a way better way. And none of the large companies are doing this yet. This is a brand new paper, right? But they’re going to. So this, combined with the mixture-of-experts architecture, is going to totally revolutionize how fast and small these models can be, in a way that’s really exciting. And then we’re also gonna have to rethink how charts like this work, because here you can see number of parameters. The reason we use that number to compare model sizes is that the amount of compute it takes to run a model is basically the number of parameters times the quantization. So if you have 8-bit parameters and 100 billion of them, you need 100 gigabytes of memory and a lot of CPU or GPU cores to run the model. But if you have single-bit quantization, and you have mixture of experts so only some percentage of the parameters are even running, you get a totally different number, right? So we’re gonna have to think of a whole new way to compare these things. Because the reality is, if they combine these things in the way I’m talking about, the models would be way over here, and something like that will easily run on your Apple Watch and give you GPT-4-level answers, right?
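To show that arithmetic, here’s a back-of-the-envelope calculator following the parameters-times-quantization framing above. The model size and the active-expert fraction are illustrative placeholders, not published figures for any real model:

```python
def weights_gb(n_params, bits_per_param, active_fraction=1.0):
    """Rough size of the weights actually touched per query:
    parameters x bits, scaled by the share of experts that run."""
    active = n_params * active_fraction
    return active * bits_per_param / 8 / 1e9  # bits -> bytes -> GB

# Dense 100B model at 8 bits: every parameter runs on every query.
print(weights_gb(100e9, 8))                        # 100.0 GB

# Same parameter count at 1 bit with, say, a quarter of experts active.
print(weights_gb(100e9, 1, active_fraction=0.25))  # ~3.1 GB
```

That two-orders-of-magnitude gap between the two printed numbers is the whole argument: parameter count alone stops being a fair way to compare models once bit-width and active fraction both vary.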