Deep Learning and Neural Networks

This is the second post in a series that goes from the basics of machine learning to state-of-the-art large language models (ChatGPT, Bard and friends). Here are the links to the entire series:

  1. The basics of Artificial Intelligence and Machine Learning
  2. Deep Learning and Neural Networks (this post)
  3. Large Language Models

Fundamentals of Neural Networks

Artificial neural networks are a type of machine learning model that can be trained to encapsulate knowledge and use it to predict attributes of data (other types of models include rule sets and decision trees, as discussed in the last post). They are modelled after the neuronal structure of the human brain (or any biological brain for that matter) - maybe that's why they work so well.

Like in our brain, artificial neural networks consist of an interconnected web of (simulated) neurons. Each neuron receives signals from a number of input neurons and, if the signals accumulate to a certain threshold, becomes activated. If activated, it sends a signal to a number of output neurons.



Each connection - incoming or outgoing - has a certain weight (w1, w2, ...), which determines the signal strength of that connection. If the sum of the weights of all active input neurons is larger than the threshold (t1), our neuron itself becomes active and sends its own signal to all output neurons according to the weights of its outgoing connections.
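To make the rule above concrete, here is a minimal sketch of a single threshold neuron in Python. The function name and the example weights and threshold are made up for illustration; real networks usually use smoother activation functions than this hard on/off rule.

```python
def neuron_is_active(inputs_active, weights, threshold):
    """Return True if the summed weights of all active inputs exceed the threshold."""
    total = sum(w for active, w in zip(inputs_active, weights) if active)
    return total > threshold

# Hypothetical example: three input neurons with weights w1..w3 and threshold t1.
weights = [0.5, 0.3, 0.9]
t1 = 1.0

print(neuron_is_active([True, True, False], weights, t1))  # 0.5 + 0.3 is below t1 -> False
print(neuron_is_active([True, False, True], weights, t1))  # 0.5 + 0.9 is above t1 -> True
```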

The web of neurons is usually organized in layers (mathematically a directed, weighted graph). There are three basic types:
  • A special input layer where neurons take their value from the training data
  • A special output layer where the value of each neuron represents a predicted value
  • A number of layers in between (usually called hidden layers) where the propagation between input and output happens
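The three layer types can be sketched as a tiny network in plain Python. Everything here is an illustrative assumption: the function names, the hand-picked weights and thresholds, and the choice of a network with 2 input neurons, 2 hidden neurons and 1 output neuron that fires when exactly one of its inputs is active.

```python
def layer_output(inputs, weights, thresholds):
    """Compute which neurons of one layer fire.

    inputs:     list of 0/1 activations from the previous layer
    weights:    weights[j][i] is the weight from input i to neuron j
    thresholds: thresholds[j] is the threshold of neuron j
    """
    outputs = []
    for neuron_weights, t in zip(weights, thresholds):
        total = sum(w * x for w, x in zip(neuron_weights, inputs))
        outputs.append(1 if total > t else 0)
    return outputs

def forward(inputs, layers):
    """Propagate the input layer's values through every following layer."""
    activations = inputs
    for weights, thresholds in layers:
        activations = layer_output(activations, weights, thresholds)
    return activations

# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
hidden = ([[1.0, 1.0], [1.0, 1.0]], [0.5, 1.5])  # "at least one active" / "both active"
output = ([[1.0, -1.0]], [0.5])                   # "at least one, but not both"
network = [hidden, output]

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, forward(x, network))
```

The output neuron only gets this right because the hidden layer first computes two intermediate signals; that is the "propagation between input and output" the hidden layers are there for.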

So how is data used to train such a neural network? Imagine what the perfect neural network would look like: Given a specific structure (the number and shape of hidden layers - defining this is usually the job of data scientists), it simply has the perfect weights (w1, ... above) and thresholds (t1, ...) assigned, so that it always predicts the right output values for all input values. Training is simply the process of finding these perfect weights and thresholds - or at least ones that are as good as possible. And since these networks can become very, very big (see below), it's not really feasible to just try out all possible combinations - in fact, even with powerful computers this would in most cases take longer than the lifetime of our universe. So the key is to do this very efficiently. I'll save you the maths behind it.
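Not the maths of training itself, but a back-of-the-envelope sketch of why trying every combination is hopeless. The assumptions are mine and purely illustrative: each parameter may take only 10 candidate values, and a machine can test 10^18 combinations per second.

```python
# Illustrative assumptions, not figures from any real training setup.
VALUES_PER_PARAMETER = 10
TESTS_PER_SECOND = 10**18
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
AGE_OF_UNIVERSE_YEARS = 13_800_000_000  # roughly 13.8 billion years

def brute_force_years(num_parameters):
    """Whole years needed to try every combination of parameter values."""
    combinations = VALUES_PER_PARAMETER ** num_parameters
    return combinations // (TESTS_PER_SECOND * SECONDS_PER_YEAR)

for n in (10, 30, 50, 100):
    years = brute_force_years(n)
    verdict = "longer" if years > AGE_OF_UNIVERSE_YEARS else "shorter"
    print(f"{n} parameters: {years} years ({verdict} than the age of the universe)")
```

Even under these generous assumptions, a network with only 50 parameters is already out of reach, and real models have millions to trillions of them.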


Sizes of neural networks

As you can imagine, our brain is a little more complex than the simple example above. The same is true for state of the art models like GPT-4 - they're called large language models for a reason. And the more complex these networks are, the better they perform and the more intelligent they appear - at least to an extent.

We measure the size of a model by the number of its parameters: the number of neurons (thresholds) and connections (weights). The following graph shows a handful of famous models that emerged over the last few years.
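As a rough sketch of how such a count works, the following helper adds up thresholds and weights for a simple layered network. It assumes that every neuron is connected to every neuron in the next layer and that input neurons have no threshold of their own; the function name and the example layer sizes are hypothetical.

```python
def count_parameters(layer_sizes):
    """Count thresholds (one per non-input neuron) plus weights (one per connection).

    layer_sizes lists the number of neurons per layer, input layer first.
    """
    thresholds = sum(layer_sizes[1:])
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    return thresholds + weights

print(count_parameters([2, 2, 1]))       # 3 thresholds + 6 weights = 9
print(count_parameters([784, 128, 10]))  # 138 thresholds + 101632 weights = 101770
```

Note how the weights dominate: the number of connections grows with the product of neighbouring layer sizes, which is why parameter counts explode so quickly.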



The exact number of parameters of GPT-4 is a secret, but it is estimated at over 1 trillion. Note that the y-axis is logarithmic - within about 5 years, sizes have increased by four orders of magnitude, or a factor of 10,000. Some call this the Moore's Law of LLMs.
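To put that growth into perspective, here is a small calculation of what a factor of 10,000 in 5 years means per year. It assumes the growth is evenly spread over the period, which is a simplification.

```python
import math

GROWTH_FACTOR = 10_000  # from the text: four orders of magnitude
YEARS = 5               # from the text: "within about 5 years"

# Assumes smooth exponential growth over the whole period.
yearly_factor = GROWTH_FACTOR ** (1 / YEARS)
doubling_time_months = 12 * math.log(2) / math.log(yearly_factor)

print(f"growth per year: about {yearly_factor:.1f}x")              # about 6.3x
print(f"doubling time: about {doubling_time_months:.1f} months")   # about 4.5 months
```

For comparison, the original Moore's Law describes a doubling roughly every two years.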

One thing to note is that it's extremely expensive to train these models. First, you need immense amounts of data: think the complete set of articles of Wikipedia, a huge number of freely available books, plus a snapshot of the most visited pages of the internet - these are just a few examples of datasets used for today's language models. Then you need to push this data through the training algorithm, which for today's models can take several months on computing infrastructure made up of tens of thousands of GPUs. Need help wrapping your head around that? Here's a number: more than $100 million is reportedly how much it cost OpenAI to train GPT-4.

How does this all compare to real brains in the real world? Since neural networks are modelled after biological brains, we can directly compare their sizes. Wikipedia lists the number of neurons and synapses (connections) for a variety of animals, and the sum of both is equivalent to the number of parameters in artificial neural networks: A sea squirt has about 200 neurons and 8,000 synapses (sea squirts don't really do much, so it's easy to see how even the dumbest ML model is smarter than that). A honey bee's brain has ~1 million neurons and ~1 billion synapses. 

Let's stop here for a second. BERT was considered a breakthrough and by all accounts a very smart model in late 2018, and its largest incarnation had 340 million parameters, a mere third of the honey bee's brain. Either we don't fully grasp the genius of bees or BERT was much more stupid than we thought. Here's the thing: BERT can do one thing, and one thing only - understand human language. Bees need to fly, find flowers, make honey, communicate with other bees, navigate hostile environments, and do just about anything else that makes up their simple, yet diverse life. Given the wide range of tasks they can perform, I have no trouble believing that a bee is actually smarter than BERT.

Side note: a model capable of performing well across generic tasks is what we commonly call AGI - artificial general intelligence. Spoiler alert: We're not there yet.

But back to animal brains: The house mouse has about 70 million neurons and 1 trillion synapses. There we go, that's about the size of GPT-4 (the foundation of ChatGPT). As explained above, this comparison emphasizes the difference between a model that does one thing really well (but doesn't know how to deal with anything else) and one that is versatile and can perform many things in a very basic way. Let's skip all the way to the human brain, which has more than 100 trillion parameters in total (86 billion neurons and 100 trillion synapses).
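The comparison can be redone with a few lines of Python. The numbers are the rough figures quoted in this post, except for the human neuron count, where the commonly cited estimate of roughly 86 billion is used; the dictionary layout and function name are just for illustration.

```python
# Rough figures as (neurons, synapses); all values are order-of-magnitude estimates.
brains = {
    "sea squirt": (200, 8_000),
    "honey bee": (1_000_000, 1_000_000_000),
    "house mouse": (70_000_000, 1_000_000_000_000),
    "human": (86_000_000_000, 100_000_000_000_000),  # ~86 billion neurons
}
# Model sizes in parameters; the GPT-4 figure is an estimate, not an official number.
models = {
    "BERT (largest)": 340_000_000,
    "GPT-4 (estimated)": 1_000_000_000_000,
}

def brain_parameters(neurons, synapses):
    """Treat each neuron and each synapse as one parameter."""
    return neurons + synapses

for model, params in models.items():
    for animal, (neurons, synapses) in brains.items():
        ratio = params / brain_parameters(neurons, synapses)
        print(f"{model} has {ratio:.3g}x the parameters of a {animal}")
```

In every case the synapses dominate the total, so the neuron count barely moves the result.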

Does this mean that once we train a 100 trillion parameter model, we will have achieved human-level intelligence? I'd say probably not, at least not in a general sense. We might get to models that feel even more like us in a conversational context (in fact, it's already hard to distinguish these models from humans), and maybe they will make fewer obvious mistakes - if you scratch below the surface, today's models do get a lot of things wrong. But many experts agree that artificial general intelligence isn't going to happen anytime soon.

Conclusion

Going by the Moore's Law of LLMs, AGI should happen within the next few years, at least in terms of parameters, right? Well, size isn't everything (like with everything in life, huh?). Evolution has figured out other ways to develop our brain, too: for example, rewiring synapses (a.k.a. changing the network structure), or even changing the size of the brain from species to species. And I'm pretty sure we humans aren't actually capable of understanding how our own mind works, at least not fully - let alone of fully replicating or even surpassing it.

Nevertheless, when it comes to AI, we live in exciting times. Let's enjoy our front row seat :)


