Solvro Talk - How Modern Language Models Work
Introduction
Recently, large language models (LLMs) have been gaining more and more popularity. The reason for this is undoubtedly the success of tools like ChatGPT. Despite the undeniable success of such solutions, they are still merely large mathematical models whose operation rests mainly on the skillful use of statistics. In this post, I will try to shed some light on how they work and help in understanding them. Enjoy the read.
Artificial Intelligence, Machine Learning, and Deep Learning
I will begin this post with the very basics, that is, defining what artificial intelligence is. It is a field of computer science encompassing algorithms whose operation in some way mimics human intelligence: genetic algorithms, expert systems, metaheuristics, machine learning, and so on. It is to innovations in machine learning that we owe the advancements seen in recent years. Algorithms in this category work primarily by finding and exploiting patterns in the datasets on which a model is trained. Neural networks, on which modern language models are based, fall into their own subcategory of machine learning called deep learning.

What are Neural Networks
The simplest neural networks, called feedforward networks, are nothing but linear transformations of vectors. The graphical representation of a single layer looks as follows

Mathematically, we can write this as

y = σ(Wx + b)

where W and b are the matrix and vector of parameters, respectively, and σ(·) is the so-called activation function; more on that later, for now assume σ(x) = x. As you can see, a neural network layer simply multiplies an input vector by some matrix and then adds a vector to the result. A single layer isn't very impressive, but combining several layers has a very interesting property, which will be shown later.
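To make this concrete, here is a minimal sketch of such a layer in Python with NumPy. All sizes and parameter values below are made up for illustration.

```python
import numpy as np

def layer(x, W, b, activation=lambda z: z):
    # A single feedforward layer: multiply by W, add b, apply the activation.
    # The default activation is the identity, i.e. sigma(x) = x.
    return activation(W @ x + b)

# Made-up parameters: a layer mapping a 3-dimensional input to 2 outputs.
W = np.array([[0.2, -0.5, 1.0],
              [0.7,  0.1, -0.3]])
b = np.array([0.1, -0.2])
x = np.array([1.0, 2.0, 3.0])

print(layer(x, W, b))  # [2.3, -0.2] - just a matrix-vector product plus a vector
```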
Example: Linear Regression
Assume we have some set of points X, arranged in the following manner

Our goal is to find the parameters W and b that best fit the data, visually represented as follows

In our two-dimensional example, the parameters W and b are just scalars; only in models with more dimensions do we need full matrices and vectors of parameters. We begin the search for the best model parameters by randomly initializing the weights W and b. These are then adjusted to fit the data, a process we call training the model. So how do we find the appropriate parameters, that is, train this neural network? Traditionally, this was done using an algorithm called stochastic gradient descent (SGD); nowadays its refinements, such as Adam and AdamW, are typically used. A detailed explanation of how they work is beyond the scope of this post.
The above example is, of course, extremely simplified. Moreover, linear regression has a closed-form analytical solution, so using gradient algorithms, which give only approximate solutions, is not actually necessary here.
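Still, as an illustration, here is a minimal sketch of fitting such scalar parameters with plain gradient descent (a simplified, non-stochastic relative of SGD). The synthetic data and learning rate are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: points scattered around the line y = 2x + 1.
X = rng.uniform(-1, 1, size=100)
y = 2.0 * X + 1.0 + rng.normal(scale=0.1, size=100)

# Random initialization of the scalar parameters W and b.
W, b = rng.normal(), rng.normal()
lr = 0.1  # learning rate

for step in range(500):
    y_pred = W * X + b
    error = y_pred - y
    # Gradients of the mean squared error with respect to W and b.
    grad_W = 2 * np.mean(error * X)
    grad_b = 2 * np.mean(error)
    W -= lr * grad_W
    b -= lr * grad_b

print(f"W ≈ {W:.2f}, b ≈ {b:.2f}")  # should approach 2 and 1
```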
Example 2: Polynomial Regression
Now let's consider a slightly more complicated example, one where we deal with nonlinear dependencies:

We have here a third-degree polynomial. Let's try to find its approximation using a neural network. In the previous example, we used a single linear layer, and the activation function was an identity function, meaning we were simply looking for the equation of a line. In this case, our model requires more complexity. We need an additional layer of the neural network called a hidden layer, mathematically represented as follows
y = W_2 σ(W_1 x + b_1) + b_2

where W_1, b_1 are the parameters of the hidden layer and W_2, b_2 are those of the output layer.
When we add more layers to a neural network, it becomes apparent why activation functions are necessary for it to work properly. Let me explain: a single neural network layer is a linear transformation, and a composition of linear transformations can always be replaced by a single, equivalent transformation, making the additional layers pointless. For example, take the functions f(x) = x + 5 and g(x) = x + 3. Their composition f(g(x)) = (x + 3) + 5 can simply be replaced by the single function h(x) = x + 8. The same happens in neural networks when we don't use activation functions. The most popular activation function today is ReLU (rectified linear unit):

Or mathematically:

ReLU(x) = max(0, x)
The role of the activation function is to introduce non-linearity between the individual layers of the network. This is what lets a model approximate virtually any functional dependency; it suffices to use a hidden layer that is large enough or a network that is deep enough. As seen in the image below, the obtained approximation doesn't differ significantly from the actual function:

It's worth adding here that a network with identical architecture but different parameter values can take on a completely different shape

As you can see, a neural network is simply a model capable of finding patterns in datasets. The application of a sufficiently large hidden layer allows it to adapt to any functional dependency and thus fit any dataset. Modern architectures often consist of many hidden layers.
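To tie these pieces together, here is a minimal sketch of a network with one hidden layer and ReLU, trained by hand-derived gradient descent to approximate a third-degree polynomial. The layer sizes, data, and hyperparameters are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from a third-degree polynomial.
X = np.linspace(-2, 2, 200).reshape(-1, 1)
y = X**3 - 2 * X

def relu(z):
    return np.maximum(0, z)

# One hidden layer with 32 units; sizes chosen arbitrarily.
W1 = rng.normal(scale=0.5, size=(1, 32))
b1 = np.zeros(32)
W2 = rng.normal(scale=0.5, size=(32, 1))
b2 = np.zeros(1)
lr = 0.01

for step in range(5000):
    # Forward pass: y_hat = W2 relu(W1 x + b1) + b2.
    h_pre = X @ W1 + b1
    h = relu(h_pre)
    y_hat = h @ W2 + b2

    # Backward pass for the mean squared error, derived by hand.
    grad_out = 2 * (y_hat - y) / len(X)
    grad_W2 = h.T @ grad_out
    grad_b2 = grad_out.sum(axis=0)
    grad_h = grad_out @ W2.T
    grad_pre = grad_h * (h_pre > 0)  # derivative of ReLU
    grad_W1 = X.T @ grad_pre
    grad_b1 = grad_pre.sum(axis=0)

    for p, g in ((W1, grad_W1), (b1, grad_b1), (W2, grad_W2), (b2, grad_b2)):
        p -= lr * g

print("final MSE:", np.mean((y_hat - y) ** 2))
```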
Real Application: Image Classification
Now we'll move on to an example practical application of such a network. Suppose we deal with two classes of objects
Class 0 
Class 1

Our task is to build a model that assigns a given image to one of these two classes. How can a neural network help us here? Let's recall the previous subsections: in each situation presented there was a certain dependency within a set of points. It so happens that the above images can also be represented as points in a multidimensional space. Such a raw representation works to some extent, but it discards the dependencies between individual pixels of the image. For this reason, a common step is to first extract features from the image. A convolutional neural network can serve as such a feature extractor: the image it processes is reduced to a vector, based on which a feedforward neural network assigns the image to a category. The entire procedure thus looks like this:

Passing the image through a feature extractor allows us to extract key information for the task at hand, in this case, image classification. In an ideal situation, examples belonging to individual classes are separated as in the example below:

In the previous examples, the neural network was used to estimate a certain functional dependency y = f(x). Here, to solve the classification problem, the network returns 2 numbers as output. The output of such a network is often passed through a softmax function at the end, which lets the returned numbers be treated as the probabilities that a given observation belongs to each category. In the diagram, the region where the probability of class 0 is higher than that of class 1 is marked in red, and the region where it is lower in blue; hence the visible decision boundary.
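Here is a minimal sketch of the softmax step itself; the two logit values below are made up.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; exponentiate and normalize.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5])  # raw network scores for class 0 and class 1
probs = softmax(logits)
print(probs)        # ~[0.82, 0.18] - interpretable as class probabilities
print(probs.sum())  # 1.0
```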
Neural Networks in Language Processing
As mentioned earlier, the universality of neural networks makes them applicable to language-related tasks as well. For simplicity, we omit the process of converting a word into a vector here and treat it as a black box. The simplest application of a neural network in language processing is predicting the next word in a sentence: we input a vector representing one word, e.g., "good," and receive a vector representing a likely next word, e.g., "day."

In this way, we can generate a sequence of any length:

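A minimal sketch of this generation loop, with a hypothetical predict_next stub standing in for the trained network and the word-to-vector conversion:

```python
# Hypothetical next-word predictor: a stand-in for a trained network
# plus the word-to-vector conversion, which we treat as a black box.
transitions = {"good": "day", "day": "to", "to": "you"}

def predict_next(word):
    return transitions.get(word, "<end>")

def generate(start, max_len=10):
    sequence = [start]
    while len(sequence) < max_len:
        nxt = predict_next(sequence[-1])  # depends only on the last word!
        if nxt == "<end>":
            break
        sequence.append(nxt)
    return " ".join(sequence)

print(generate("good"))  # good day to you
```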
A sequence generated this way will most likely not resemble a real human statement. Each subsequent word is predicted based only on the previous one, which effectively prevents the generation of meaningful, realistic text. For that, knowledge of all the words generated so far would be needed. Fortunately, there is a modification of the traditional feedforward network that provides exactly this: the recurrent neural network.
Recurrent Neural Networks
A recurrent neural network is a modification of the traditional feedforward network. What sets it apart is the hidden state, which can be thought of as the network's memory. Mathematically, a recurrent layer can be written as:
h_t = σ(W_x x_t + W_h h_{t-1} + b)

where x_t is the input at step t and h_{t-1} is the hidden state carried over from the previous step.
The hidden state h ensures that each generated word depends on all the previous words in the sequence, not just the last one, allowing for more realistic sentence generation. The output of the recurrent layer is the new hidden state.
Thanks to the hidden state, a recurrent network does not need to receive a new input vector in every iteration, as it can generate new sequence elements based on the hidden state.
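Here is a minimal sketch of a single recurrent step in NumPy, following the formula above; tanh is a common choice for the activation σ, and all sizes and values are made up.

```python
import numpy as np

rng = np.random.default_rng(2)

hidden_size, input_size = 4, 3
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # The new hidden state depends on the current input AND the previous state.
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(hidden_size)                     # initial "empty memory"
for x_t in rng.normal(size=(5, input_size)):  # a sequence of 5 input vectors
    h = rnn_step(x_t, h)  # h accumulates information about the whole sequence

print(h)
```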
Recurrent neural networks have many applications; we will focus on encoder-decoder architectures. The role of such an architecture is to transform an input sequence into another sequence. The simplest example is machine translation, where we input a sequence in the source language and receive its translation in the target language as output.

Another application is responding to questions posed to the model. That's right, GPT models grew out of this sequence-to-sequence setup, although, as we will see, they rely on the Transformer rather than on recurrent networks, and strictly speaking use only its decoder part.

Problems of Recurrent Neural Networks
Recurrent neural networks have particular difficulties in modeling long sequences. Information contained in the hidden state from the beginning of a sequence tends to gradually fade. Over the years, various modifications have been developed to mitigate this problem, such as LSTM and GRU networks. However, we will focus on another approach, namely attention mechanisms.
Attention Mechanisms
Attention mechanisms are the innovation that has made today's language models so effective. They are, to some extent, a response to the problem of vanishing memory. In a traditional RNN encoder-decoder, the output sequence is generated solely from the hidden state. As mentioned in the previous section, relying on the hidden state vector alone is not the best way of remembering sequences, especially during decoding, when the model must be aware of what the original sequence looked like and what has been translated so far. Here the attention mechanism comes to the rescue, providing the model with a weighted sum of the elements of the original sequence, in which the fragments most relevant to what is currently being decoded receive greater weight.
This is where the name "attention mechanism" comes from. Besides recalling the original sequence, the model also receives information about which part it should now focus on.
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural Machine Translation by Jointly Learning to Align and Translate." arXiv preprint arXiv:1409.0473 (2014).
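A minimal sketch of the core idea: the attention weights come from a softmax over relevance scores, and the context passed to the decoder is the weighted sum of encoder states. For brevity the score here is a plain dot product; Bahdanau et al. actually use a small learned scoring network.

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Made-up encoder states for a 6-word source sequence (dimension 8 each).
encoder_states = rng.normal(size=(6, 8))
# Current decoder state, summarizing what has been translated so far.
decoder_state = rng.normal(size=8)

scores = encoder_states @ decoder_state  # relevance of each source word
weights = softmax(scores)                # attention weights, sum to 1
context = weights @ encoder_states       # weighted sum fed to the decoder

print(weights.round(2))  # where the model is "paying attention" right now
print(context.shape)     # (8,)
```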
Attention mechanisms proved to be so effective that in 2017 a paper titled "Attention Is All You Need" was published. Its authors proposed a new architecture called the Transformer, which relies solely on attention mechanisms instead of using recursion.
Transformer, the Foundation of Modern Language Models
The Transformer is an encoder-decoder architecture, but unlike recurrent neural networks, it does not process individual elements of a sequence one by one. Instead, the entire input sequence is processed simultaneously. This ensures that the initial elements of the sequence are not forgotten by the model. However, the downside of this solution is the predetermined maximum length of the sequence.
The self-attention mechanism used in Transformers models dependencies between individual words in the sequence. This representation allows for the generation of better target sequences.
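Here is a minimal sketch of the scaled dot-product self-attention from the paper, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, with random stand-ins for the learned projection matrices:

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model = 5, 16
X = rng.normal(size=(seq_len, d_model))  # one vector per word in the sequence

# Learned projections (random stand-ins here) producing queries, keys, values.
W_q, W_k, W_v = (rng.normal(scale=0.2, size=(d_model, d_model)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
weights = softmax(Q @ K.T / np.sqrt(d_model))  # (seq_len, seq_len) word-to-word
output = weights @ V                           # each output mixes all words

print(weights.round(2))  # row i: how much word i attends to every other word
```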
Another issue with Transformers needs to be discussed. The use of the self-attention mechanism means the model has no idea of the actual order in which elements appeared in the sequence. For this reason, so-called positional encoding was introduced, which provides the model with information about the order of elements in the sequence.
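A minimal sketch of the sinusoidal positional encoding proposed in the original paper, which is simply added to the word vectors so the model can distinguish positions:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    # Assumes an even d_model for simplicity.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=5, d_model=16)
# The encoding is added to the word vectors before the first layer.
print(pe.shape)  # (5, 16)
```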
It's also important to distinguish between the attention mechanisms used in the two architectures described here. In the RNN encoder-decoder, attention connects the decoder's hidden state with individual elements of the input sequence. In the Transformer, the connections run between the individual elements of the sequence themselves rather than through a hidden state, which is why the mechanism is called self-attention. The Transformer schematic is shown in the image below:

Vaswani, Ashish, et al. "Attention Is All You Need." arXiv preprint arXiv:1706.03762 (2017).
Modern language models are heavily inspired by the architecture described in this subsection. The abbreviation GPT stands for Generative Pre-trained Transformer. Due to their widespread use, it's worthwhile to understand the basic concept of their operation.