Large Language Models

This is the third post in a series that goes from the basics of machine learning to state-of-the-art large language models (ChatGPT, Bard and friends). Here are the links to the entire series:

  1. The basics of Artificial Intelligence and Machine Learning
  2. Deep Learning and Neural Networks
  3. Large Language Models (this post)

The Transformer Architecture

As explained in my previous post, neural networks are an ML model designed after the blueprint of our brain, capable of representing complex relationships and hence deep knowledge. The structure of such a neural network - how the artificial neurons are connected, or in mathematical terms the layout of the network graph - is what we call its architecture.

Over the last decade or so, ML researchers have found better and better architectures for a number of different tasks, such as computer vision or language understanding. The real-life analogy is how the different parts of our own brain are wired to perform specific tasks like vision or memory (except that for artificial intelligence, we need researchers to substitute for natural evolution).

The breakthrough in language understanding came with the transformer architecture, which - guess what - can transform one thing into another, like a German sentence into an English one. Transformers consist of a number of encoder layers, which transform the input data step by step, and a number of decoder layers, which transform the target (the thing we want to predict). The trick is that the decoders aren't only fed the target but also the results of the encoders - this is how the model learns the relationship between input and target.


But the greatest breakthrough was attention. This mechanism allows the model to figure out which other words are related to the current word, and take them into account while making its prediction. This lets the model understand context much better and pick up on nuances of speech, like what the word "it" really refers to in a sentence. And that's exactly what the transformer architecture enables.
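To make the idea of attention a bit more concrete, here is a minimal sketch of scaled dot-product attention in plain Python. The three words and their 2-dimensional vectors are made up for illustration, and the vectors are used directly as queries, keys and values. Real transformers learn separate projections for each of these and work with much larger vectors, so treat this as a toy, not as how any particular model is implemented.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    # How similar is the query to each key?
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The output is a weighted mix of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical word vectors (assumption: invented for this example).
words = ["animal", "street", "it"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

# Which words does "it" pay attention to?
weights, output = attention(vectors[2], vectors, vectors)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")
```

Because the made-up vector for "it" points in almost the same direction as the one for "animal", "animal" receives a higher attention weight than "street".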

What do LLMs actually predict?

Like I mentioned in the first article of this series, supervised learning is all about predicting an appropriate label for a set of features. How does that work for language models? It's surprisingly simple: The features are simply some text, and the label is the next word in that text. The next word? That sounds like all this fuss is about simply doing auto-complete. And that's exactly what it is: A super super super smart auto-complete. 

For example, if you pose a question to the model, the next best word will be the first word of the answer. Then we run the model again (with the first word of the answer as additional input) and get the second word of the answer. And so on. This approach makes these models very simple to train: All you need is vast amounts of English text (or text in other languages), like a good chunk of the text on the internet, or all of Wikipedia. Then you take part of the text as the features and cut it at an arbitrary word. That word then becomes the label of your training example, and the model learns that this is what comes next. This process can be highly automated for vast amounts of text.
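Both steps, cutting text into training examples and generating an answer word by word, can be sketched in a few lines of Python. The "model" here is just a table that counts which word follows which, a stand-in I'm assuming purely for illustration. A real LLM replaces it with a huge neural network and looks at the whole preceding text, not just the last word. The corpus and function names are made up.

```python
from collections import Counter, defaultdict

def make_examples(text):
    """Cut the text at every word: everything before the cut is the
    features, the word at the cut is the label."""
    words = text.split()
    return [(words[:i], words[i]) for i in range(1, len(words))]

def train(examples):
    """Toy stand-in for training: count which label follows the last
    word of the features."""
    counts = defaultdict(Counter)
    for features, label in examples:
        counts[features[-1]][label] += 1
    return counts

def generate(model, prompt, max_words=5):
    """Predict the next word, append it to the input, and repeat."""
    words = prompt.split()
    for _ in range(max_words):
        candidates = model.get(words[-1])
        if not candidates:
            break  # the toy model has never seen this word followed by anything
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

corpus = "the cat sat on the mat and the cat sat on the sofa"
examples = make_examples(corpus)
model = train(examples)
completion = generate(model, "the cat", max_words=3)

print(examples[1])  # features and label of the second training example
print(completion)
```

Note that no human had to label anything: the labels come for free from the text itself, which is why this scales to such large amounts of data.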

How are LLMs created?

As explained in the last article, the problem is that these models have to be huge in order to be so good at auto-complete (imagine the myriad of different inputs that all lead to the next word "the"). This has a few practical implications:
  • You need enormous amounts of data to train them. While there's a lot of data on the web, it can be tricky to find vast amounts of high-quality stuff. And who has that? Right, big tech companies with lots of users - Google, Facebook, Microsoft and the like.
  • The training data needs to be high quality. Imagine what happens if you use the nastiest part of Reddit or other forums to train your model. You've essentially created an AI troll - nobody would use such a product. The problem can also be much more subtle, like slightly favouring input data from a certain racial or economic group (say rich white males from California). And zap, you have political bias in your model. The science of making those things fair is what we call Responsible AI.
  • Training takes massive amounts of resources. Think several months on tens of thousands of machines in parallel. And that's super expensive. It's estimated that it cost OpenAI $100 million to train GPT-4. This again means only a handful of rich tech companies (yep, again Google, Facebook etc.) can afford to do that.
So that's the recipe? Lots of high-quality data, money and that's it? Well, almost. Of course, you need a small army of data scientists to understand all this in detail. And oh yeah, those human raters...

Humans in the Loop

The latest twist on how to squeeze better accuracy out of these models is - I know, ironic - humans. Reinforcement Learning from Human Feedback (RLHF) or Human In The Loop (HITL) are techniques that let humans assess the predictions from the model and provide feedback that the model then uses as input to learn more. Say you ask the model 1000 questions, record its answers and give them to a pool of human raters (recruited through Mechanical Turk or a similar platform). Their job is to rate whether each was a good answer or a bad one - and in the latter case, provide a better alternative for what the model should've said. These 1000 modified examples can then be used as another set of training data to improve the model further.
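The rating workflow described above could look roughly like this in code. This is only a sketch of the simplified flow from this post: the `Rating` class, its fields and the sample questions are all hypothetical. Production RLHF setups typically go further and train a separate reward model from the ratings, which isn't shown here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rating:
    """One rater's verdict on one model answer (hypothetical structure)."""
    question: str
    model_answer: str
    is_good: bool
    better_answer: Optional[str] = None  # filled in by the rater for bad answers

def build_finetuning_examples(ratings):
    """Keep good answers as they are; swap bad ones for the rater's alternative."""
    examples = []
    for r in ratings:
        if r.is_good:
            examples.append((r.question, r.model_answer))
        elif r.better_answer:
            examples.append((r.question, r.better_answer))
        # Bad answers without a correction are skipped: nothing to learn from.
    return examples

# Made-up ratings for illustration.
ratings = [
    Rating("What is the capital of France?", "Paris.", True),
    Rating("What is 2 + 2?", "5", False, "4"),
    Rating("Who wrote Hamlet?", "Charles Dickens", False),
]

finetuning_examples = build_finetuning_examples(ratings)
for question, answer in finetuning_examples:
    print(f"{question} -> {answer}")
```

The resulting question/answer pairs have the same shape as any other training example, so they can be fed back into training as an extra, small but high-quality dataset.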

In practice, this is mostly used in fine-tuning since the amount of data created this way is relatively low, but the quality is very high - provided the raters give good answers and don't offload their work to another model, creating an infinite feedback loop where models train models. Don't laugh: with more and more content on the web created by these LLMs that's nearly indistinguishable from human writing, this is a real problem.

Conclusion

Since the emergence of ChatGPT, LLMs have been all the hype. Yes, there's a whole bunch of science, data, and human feedback involved, and yes, they feel pretty human-like when you actually talk to them. But under the hood, they're based on the same kind of neural networks that have been around since the 90s (just much larger "brains" with clever structures), and essentially just do auto-complete (albeit a very very very smart auto-complete). Is this what Artificial General Intelligence looks like? Probably not. But these models are insanely good at sounding like it. If we get this far with auto-complete, then woah, what can we achieve with the next stage of AI research - exciting times ahead!

