Introduction to Large Language Models

Authors
  • Antonio Castaldo

Large Language Models (LLMs) have demonstrated impressive performance across a wide variety of tasks and have amazed the world with their abilities. Yet how these models work remains unknown to many. In this article, we will explain in simple terms what LLMs are, how they work, and how to use them to satisfy our needs.

What are Large Language Models?

LLMs are an example of what we call foundation models. These sophisticated models, built on the Transformer architecture [@vaswani2017attention], are designed to understand and generate text by training on vast amounts of unlabeled data. The magic lies in their ability to grasp the sequence and context of words, thanks to the unique structure of the Transformer.

How do LLMs Work?

Training Process

LLMs acquire their knowledge from large amounts of data during a process called "training". This involves iteratively adjusting the model's weights to minimize the gap between its predictions and the actual data, refining the model's understanding and predictions.
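To make this concrete, below is a minimal sketch of a single training step in PyTorch. The `model`, `tokens`, and `optimizer` names are hypothetical placeholders, and real training pipelines add many details (batching, learning-rate schedules, distributed execution) that are omitted here.

```python
import torch.nn.functional as F

# Hypothetical setup: `model` maps token ids to next-token logits and
# `tokens` is a batch of training sequences of shape (batch, seq_len).
def training_step(model, tokens, optimizer):
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict each next token
    logits = model(inputs)                           # (batch, seq_len - 1, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),         # flatten for the loss
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()   # gradients of the gap between predictions and data
    optimizer.step()  # adjust the weights to shrink that gap
    return loss.item()
```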

Next Token Prediction

The primary objective used in LLM training is called next token prediction. Tokens are the units of text that LLMs process during training and inference. When these models receive an input, a tokenizer first splits the content into smaller units, which can be words, subwords, or characters. The LLM then processes these tokens and predicts the most likely next token based on the probability distribution learned during training. Therefore, when an LLM generates seemingly human text, it is doing nothing more than selecting, one at a time, the most plausible tokens from a probability distribution.
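We can inspect this probability distribution directly. The sketch below uses the Hugging Face transformers library with GPT-2, chosen only because it is small and publicly available; any causal language model would illustrate the same idea.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# Probability distribution over the next token, after the last input token
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}: {float(prob):.3f}")
```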

Tokenization and Embeddings

Tokenization is the process of converting raw text into a sequence of tokens that the model can understand. For LLMs, we typically use subword tokenization methods like Byte-Pair Encoding (BPE), which balances the representation of common and rare words by combining characters into larger units [@sennrich2016bpe].

After tokenization, the model converts each token into a numerical representation called an embedding. These embeddings are learned during the training process and capture semantic and syntactic information about the tokens. The embedding layer of the model maps each token to a high-dimensional vector. In addition to token embeddings, positional embeddings are added to provide information about the position of each token in the sequence. This is crucial for the model to understand the order and context of the tokens, especially in transformer-based architectures, which process all tokens in parallel.

Tokenization and embedding are interdependent but separate processes. The tokenizer splits the text into tokens, while the embedding layer within the model converts these tokens into vector representations that the neural network can process.
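The sketch below shows both steps with GPT-2's BPE tokenizer, again chosen purely as a small public example; the subword pieces shown in the comment are indicative, since they depend on the tokenizer's learned vocabulary.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # a BPE tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Tokenization is fun"
token_ids = tokenizer(text)["input_ids"]
print(tokenizer.convert_ids_to_tokens(token_ids))
# Subword pieces, e.g. ['Token', 'ization', 'Ġis', 'Ġfun']
# ('Ġ' marks a leading space in GPT-2's vocabulary)

embedding_layer = model.get_input_embeddings()      # lookup table: id -> vector
vectors = embedding_layer(torch.tensor(token_ids))  # shape: (num_tokens, 768)
print(vectors.shape)
```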

Data Selection and Scaling Laws

Data selection is possibly the most important task when training a model. The amount and quality of the data used to train an LLM are what give it its abilities. @kaplan2020scaling described the relationship between the resources used to train language models (model size, dataset size, and compute) and the performance obtained, showing that the loss falls as a power law in each of them; this is often referred to as Kaplan's scaling laws.
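As a rough illustration, the sketch below evaluates the model-size term of that law, L(N) = (N_c / N)^α_N, using the approximate constants reported in the paper; treat the numbers as indicative fits, not exact predictions.

```python
# Approximate fitted constants from Kaplan et al. (2020) for the
# model-size term of the loss: L(N) = (N_C / N) ** ALPHA_N
ALPHA_N = 0.076   # power-law exponent for model size
N_C = 8.8e13      # characteristic parameter count

def predicted_loss(n_params: float) -> float:
    """Predicted test loss (nats/token) for a model with n_params non-embedding parameters."""
    return (N_C / n_params) ** ALPHA_N

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.2f}")
```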

How LLMs Generate Text

Most LLMs are autoregressive models, meaning they generate each token by considering the previous ones and the context they have seen so far. This process of generating text is called "decoding", and several strategies are available:

  1. Greedy Decoding
  2. Random Sampling
  3. Beam Search with Temperature Sampling (most commonly used)

Greedy Decoding selects the token with the highest probability from the distribution (the argmax). This is simple to compute; however, it is not the norm for LLMs, as it makes text generation deterministic.
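As a minimal sketch, assuming `logits` holds the model's scores for every candidate next token:

```python
import torch

def greedy_next_token(logits: torch.Tensor) -> int:
    """Greedy decoding: always pick the single most probable token."""
    return int(torch.argmax(logits))  # deterministic: same input, same output
```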

Random Sampling instead selects the next token at random, weighted by its probability. This removes the deterministic behavior, but it can generate incoherent text.
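A minimal sketch, under the same assumption about `logits`:

```python
import torch

def random_next_token(logits: torch.Tensor) -> int:
    """Random sampling: draw the next token in proportion to its probability."""
    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```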

What is actually used in most cases is Beam Search with temperature sampling [@ficler2017controlling]. Beam Search keeps several candidate continuations (beams) at once, where each beam is a partial sentence the algorithm considers, and scores each beam by the overall probability of its tokens taken together. To beam search, we add the concept of temperature sampling: a low temperature sharpens the distribution toward the most probable words, while a high temperature flattens it, so the temperature controls the amount of randomness in the outputs.
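Beam search itself needs bookkeeping for several partial sequences and is too long to sketch here, but the temperature part is compact. A minimal sketch, with the same `logits` assumption as above:

```python
import torch

def sample_with_temperature(logits: torch.Tensor, temperature: float = 0.7) -> int:
    """Temperature sampling: rescale the logits before the softmax.

    temperature < 1 sharpens the distribution (closer to greedy decoding),
    temperature > 1 flattens it (more random outputs).
    """
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```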

In-Context Learning

@brown2020language introduced the concept of in-context learning. Thanks to their autoregressive nature, LLMs can learn to perform new tasks without being explicitly fine-tuned for them: they adapt their predictions to the examples and instructions they see in the context.

  • Zero-shot prompting: Using the LLM directly, without providing any examples
  • Few-shot prompting: Feeding the LLM a few examples before asking it to perform the task (see the sketch below)
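To illustrate the difference, here are two hypothetical prompts for the same translation task; the wording is invented for the example:

```python
# Zero-shot: ask for the task directly, with no examples.
zero_shot = "Translate to French: The weather is nice today."

# Few-shot: show a few input/output pairs first, then the new input.
few_shot = """Translate to French.

English: Good morning.
French: Bonjour.

English: Thank you very much.
French: Merci beaucoup.

English: The weather is nice today.
French:"""
```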

Challenges and Considerations

Biases in LLMs

Several studies, such as @Bender2021dangers, have highlighted the importance of data quality in preventing the various types of bias that models may inherit if the training data is not effectively curated.

The Tay Incident

An important cautionary tale in the world of AI is Microsoft's Tay Chatbot [@holley2016tay], released in 2016 and taken down less than 24 hours after its release due to its rapid descent into generating toxic content. This incident underscores the critical importance of supervised and curated training data.

Conclusion

Large Language Models represent a significant leap forward in AI technology, offering impressive capabilities in understanding and generating human-like text. However, their development and deployment come with important considerations regarding data quality, bias prevention, and responsible use.

References

[^ref]