Language Models
In :numref:sec_text-sequence, we saw how to map text sequences into tokens, where these tokens can be viewed as a sequence of discrete observations such as words or characters. Assume that the tokens in a text sequence of length $T$ are in turn $x_1, x_2, \ldots, x_T$. The goal of a language model is to estimate the joint probability of the sequence

$$P(x_1, x_2, \ldots, x_T),$$
where statistical tools in :numref:sec_sequence can be applied.
Language models are incredibly useful. For instance, an ideal language model should generate natural text on its own, simply by drawing one token at a time $x_t \sim P(x_t \mid x_{t-1}, \ldots, x_1)$. All text emerging from such a model would pass as natural language, and it would even suffice for generating a meaningful dialogue, simply by conditioning on previous dialogue fragments. Clearly we are still very far from designing such a system, since it would need to understand the text rather than just generate grammatically sensible content.
Nonetheless, language models are of great service even in their limited form. For instance, the phrases "to recognize speech" and "to wreck a nice beach" sound very similar. This can cause ambiguity in speech recognition, which is easily resolved through a language model that rejects the second transcription as outlandish. Likewise, in a document summarization algorithm it is worthwhile knowing that "dog bites man" is much more frequent than "man bites dog", or that "I want to eat grandma" is a rather disturbing statement, whereas "I want to eat, grandma" is much more benign.
using Pkg; Pkg.activate("../../d2lai")
using d2lai
using Flux
using Downloads
using StatsBase
using Plots

Learning Language Models
The obvious question is how we should model a document, or even a sequence of tokens. Suppose that we tokenize text data at the word level. Let's start by applying basic probability rules:

$$P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^T P(x_t \mid x_1, \ldots, x_{t-1}).$$

For example, the probability of a text sequence containing four words would be given as:

$$P(\textrm{deep}, \textrm{learning}, \textrm{is}, \textrm{fun}) = P(\textrm{deep}) P(\textrm{learning} \mid \textrm{deep}) P(\textrm{is} \mid \textrm{deep}, \textrm{learning}) P(\textrm{fun} \mid \textrm{deep}, \textrm{learning}, \textrm{is}).$$
Markov Models and $n$-grams
Recall the sequence models analyzed in :numref:sec_sequence; let's now apply Markov models to language modeling. A distribution over sequences satisfies the Markov property of first order if $P(x_{t+1} \mid x_t, \ldots, x_1) = P(x_{t+1} \mid x_t)$. Higher orders correspond to longer dependencies. This leads to a number of approximations that we could apply to model a sequence:

$$\begin{aligned}
P(x_1, x_2, x_3, x_4) &= P(x_1) P(x_2) P(x_3) P(x_4),\\
P(x_1, x_2, x_3, x_4) &= P(x_1) P(x_2 \mid x_1) P(x_3 \mid x_2) P(x_4 \mid x_3),\\
P(x_1, x_2, x_3, x_4) &= P(x_1) P(x_2 \mid x_1) P(x_3 \mid x_1, x_2) P(x_4 \mid x_2, x_3).
\end{aligned}$$
The probability formulae that involve one, two, and three variables are typically referred to as unigram, bigram, and trigram models, respectively. In order to compute the language model, we need to calculate the probability of words and the conditional probability of a word given the previous few words. Note that such probabilities are language model parameters.
Word Frequency
Here, we assume that the training dataset is a large text corpus, such as all Wikipedia entries, Project Gutenberg, and all text posted on the web. The probability of a word can be calculated from the relative frequency of that word in the training dataset. For example, the estimate $\hat{P}(\textrm{deep})$ can be calculated by counting all occurrences of the word "deep" and dividing by the total number of words in the corpus. Conditional probabilities can be estimated in the same way, e.g.,

$$\hat{P}(\textrm{learning} \mid \textrm{deep}) = \frac{n(\textrm{deep, learning})}{n(\textrm{deep})},$$

where $n(x)$ and $n(x, x')$ are the number of occurrences of single words and consecutive word pairs, respectively. Unfortunately, such pair counts are harder to estimate reliably, since occurrences of "deep learning" are far less frequent. As we saw in :numref:subsec_natural-lang-stat, things take a turn for the worse for three-word combinations and beyond. There will be many plausible three-word combinations that we likely will not see in our dataset. Unless we provide some solution to assign such word combinations a nonzero count, we will not be able to use them in a language model. If the dataset is small or if the words are very rare, we might not find even a single one of them.
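To make the counting concrete, here is a minimal sketch of relative-frequency estimation on a tiny made-up corpus (the sentence and the helper names, e.g. p_learning_given_deep, are assumptions for illustration only; in practice the counts would come from a large corpus):

using StatsBase

# Toy corpus for illustration only.
tokens = String.(split("deep learning is fun and deep learning is powerful"))
unigrams = countmap(tokens)                                        # n(x)
bigrams = countmap(collect(zip(tokens[1:end-1], tokens[2:end])))   # n(x, x')

n_total = length(tokens)
p_deep = unigrams["deep"] / n_total                                       # relative frequency of "deep"
p_learning_given_deep = bigrams[("deep", "learning")] / unigrams["deep"]  # estimate of P(learning | deep)
println((p_deep, p_learning_given_deep))

Even on such a toy example it is easy to see how quickly pair and triple counts become sparse as the vocabulary grows.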
Laplace Smoothing
A common strategy is to perform some form of Laplace smoothing, where the solution is to add a small constant to all counts. Denote by $n$ the total number of words in the training set and by $m$ the number of unique words. For example,

$$\begin{aligned}
\hat{P}(x) & = \frac{n(x) + \epsilon_1/m}{n + \epsilon_1}, \\
\hat{P}(x' \mid x) & = \frac{n(x, x') + \epsilon_2 \hat{P}(x')}{n(x) + \epsilon_2}, \\
\hat{P}(x'' \mid x, x') & = \frac{n(x, x', x'') + \epsilon_3 \hat{P}(x'')}{n(x, x') + \epsilon_3}.
\end{aligned}$$

Here $\epsilon_1, \epsilon_2$, and $\epsilon_3$ are hyperparameters. Take $\epsilon_1$ for instance: when $\epsilon_1 = 0$, no smoothing is applied; when $\epsilon_1$ approaches positive infinity, $\hat{P}(x)$ approaches the uniform probability $1/m$.

Unfortunately, models like this get unwieldy rather quickly for the following reasons. First, as discussed in :numref:subsec_natural-lang-stat, many $n$-grams occur very rarely, making Laplace smoothing rather unsuitable for language modeling. Second, we need to store all counts. Third, this entirely ignores the meaning of the words, so it cannot adjust estimates for semantically similar contexts. Last, long word sequences are almost certain to be novel, so a model that simply counts the frequency of previously seen word sequences is bound to perform poorly on them.
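As a quick illustration of the formulas above, here is a minimal sketch of Laplace-smoothed unigram and bigram estimates on the same kind of toy corpus (the corpus, the choice eps1 = eps2 = 1, and the helper name phat are all assumptions for this example):

using StatsBase

tokens = String.(split("deep learning is fun and deep learning is powerful"))   # toy corpus
unigrams = countmap(tokens)
bigrams = countmap(collect(zip(tokens[1:end-1], tokens[2:end])))
n = length(tokens)        # total number of words
m = length(unigrams)      # number of unique words
eps1, eps2 = 1.0, 1.0     # smoothing hyperparameters

# Smoothed unigram estimate of P(x)
phat(x) = (get(unigrams, x, 0) + eps1 / m) / (n + eps1)
# Smoothed bigram estimate of P(x' | x)
phat(xprime, x) = (get(bigrams, (x, xprime), 0) + eps2 * phat(xprime)) / (get(unigrams, x, 0) + eps2)

println(phat("banana"))              # an unseen word now gets a small but nonzero probability
println(phat("learning", "deep"))    # smoothed estimate of P(learning | deep)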
Perplexity
Next, let's discuss how to measure the quality of a language model, which we will then use to evaluate our models in the subsequent sections. One way is to check how surprising the text is. A good language model is able to predict, with high accuracy, the tokens that come next. Consider the following continuations of the phrase "It is raining", as proposed by different language models:

1. "It is raining outside"
2. "It is raining banana tree"
3. "It is raining piouw;kcj pwepoiut"
In terms of quality, Example 1 is clearly the best. The words are sensible and logically coherent. While it might not quite accurately reflect which word follows semantically ("in San Francisco" and "in winter" would have been perfectly reasonable extensions), the model is able to capture which kind of word follows. Example 2 is considerably worse by producing a nonsensical extension. Nonetheless, at least the model has learned how to spell words and some degree of correlation between words. Last, Example 3 indicates a poorly trained model that does not fit data properly.
We might measure the quality of the model by computing the likelihood of the sequence. Unfortunately this is a number that is hard to understand and difficult to compare. After all, shorter sequences are much more likely to occur than the longer ones, hence evaluating the model on Tolstoy's magnum opus War and Peace will inevitably produce a much smaller likelihood than, say, on Saint-Exupery's novella The Little Prince. What is missing is the equivalent of an average.
Information theory comes in handy here. We defined entropy, surprisal, and cross-entropy when we introduced softmax regression (:numref:subsec_info_theory_basics). If we want to compress text, we can ask about predicting the next token given the current set of tokens. A better language model should allow us to predict the next token more accurately, and thus spend fewer bits in compressing the sequence. So we can measure the quality of a model by the cross-entropy loss averaged over all the $n$ tokens of a sequence:

$$\frac{1}{n} \sum_{t=1}^n -\log P(x_t \mid x_{t-1}, \ldots, x_1),$$
:eqlabel:eq_avg_ce_for_lm

where $P$ is given by a language model and $x_t$ is the actual token observed at time step $t$ from the sequence. This makes the performance on documents of different lengths comparable. For historical reasons, scientists in natural language processing prefer to use a quantity called perplexity. In a nutshell, it is the exponential of :eqref:eq_avg_ce_for_lm:

$$\exp\left(-\frac{1}{n} \sum_{t=1}^n \log P(x_t \mid x_{t-1}, \ldots, x_1)\right).$$
Perplexity can be best understood as the reciprocal of the geometric mean of the number of real choices that we have when deciding which token to pick next. Let's look at a number of cases:
In the best case scenario, the model always perfectly estimates the probability of the target token as 1. In this case the perplexity of the model is 1.
In the worst case scenario, the model always predicts the probability of the target token as 0. In this situation, the perplexity is positive infinity.
At the baseline, the model predicts a uniform distribution over all the available tokens of the vocabulary. In this case, the perplexity equals the number of unique tokens of the vocabulary. In fact, if we were to store the sequence without any compression, this would be the best we could do for encoding it. Hence, this provides a nontrivial upper bound that any useful model must beat.
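A small numerical sketch (with made-up probability vectors) confirms these cases, assuming we are handed the model's predicted probability of each observed next token:

using Statistics

# Perplexity is the exponential of the average cross-entropy over the sequence.
perplexity(probs) = exp(-mean(log.(probs)))

println(perplexity([1.0, 1.0, 1.0]))     # best case: every target predicted with probability 1, giving 1.0
println(perplexity(fill(1/28, 5)))       # uniform guess over a 28-token vocabulary, giving 28.0
println(perplexity([0.1, 0.05, 0.2]))    # a poorly fitted model, giving a large perplexity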
Partitioning Sequences
We will design language models using neural networks and use perplexity to evaluate how good the model is at predicting the next token given the current set of tokens in text sequences. Before introducing the model, let's assume that it processes a minibatch of sequences with predefined length at a time. Now the question is how to read minibatches of input sequences and target sequences at random.
Suppose that the dataset takes the form of a sequence of $T$ token indices in corpus. We will partition it into subsequences, where each subsequence has $n$ tokens (time steps). To iterate over (almost) all the tokens of the entire dataset for each epoch and obtain all possible length-$n$ subsequences, we can introduce randomness: at the beginning of each epoch, discard the first $d$ tokens, where $d \in [0, n)$ is uniformly sampled at random, and partition the rest of the sequence into $m = \lfloor (T-d)/n \rfloor$ subsequences. Denote by $\mathbf{x}_t = [x_t, \ldots, x_{t+n-1}]$ the length-$n$ subsequence starting from token $x_t$ at time step $t$; each such subsequence will be used as an input sequence into the language model.

For language modeling, the goal is to predict the next token based on the tokens we have seen so far; hence the targets (labels) are the original sequence, shifted by one token. The target sequence for any input sequence $\mathbf{x}_t$ is $\mathbf{x}_{t+1} = [x_{t+1}, \ldots, x_{t+n}]$, also of length $n$.
Figure: Obtaining five pairs of input sequences and target sequences from partitioned length-5 subsequences.

The figure above shows an example of obtaining five pairs of input sequences and target sequences with $n = 5$ and $d = 2$.
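Before building the real dataset, here is a tiny hypothetical illustration of this shift-by-one relationship, using a made-up corpus of token indices 1 through 10 and windows of five tokens:

# Toy corpus of token indices (hypothetical). Each length-5 input window is paired
# with the same window shifted by one token as its target.
toy_corpus = collect(1:10)
nsteps = 5
inputs  = [toy_corpus[i:i+nsteps-1] for i in 1:length(toy_corpus)-nsteps]
targets = [toy_corpus[i+1:i+nsteps] for i in 1:length(toy_corpus)-nsteps]
println(inputs[1], " => ", targets[1])   # [1, 2, 3, 4, 5] => [2, 3, 4, 5, 6]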
function d2lai.TimeMachine(batchsize::Int, num_steps::Int, num_train = 10000, num_val = 5000)
    # Load the corpus (a vector of token indices) and the vocabulary of The Time Machine.
    corpus, vocab = d2lai.build(TimeMachine, d2lai._download(TimeMachine))
    # Stack every overlapping window of num_steps + 1 consecutive tokens as a column.
    array = reduce(hcat, [corpus[i:i+num_steps] for i in 1:(length(corpus) - num_steps)])
    # Inputs are the first num_steps tokens of each window; targets are the same window
    # shifted by one token.
    X, y = array[1:end-1, :], array[2:end, :]
    d2lai.TimeMachine(X, y, vocab, corpus, (batchsize = batchsize, num_steps = num_steps, num_train = num_train, num_val = num_val))
end

To train language models, we will randomly sample pairs of input sequences and target sequences in minibatches. The following data loader randomly generates a minibatch from the dataset each time. The argument batchsize specifies the number of subsequence examples in each minibatch and num_steps is the subsequence length in tokens.
function d2lai.get_dataloader(data::d2lai.TimeMachine; train = true)
    # Use the first num_train subsequences for training and the next num_val for validation.
    idxs = train ? (1:data.args.num_train) : (data.args.num_train+1):(data.args.num_train+data.args.num_val)
    # Shuffle only the training minibatches.
    return Flux.DataLoader((data.X[:, idxs], data.y[:, idxs]), shuffle = train, batchsize = data.args.batchsize)
end

As we can see in the following, a minibatch of target sequences can be obtained by shifting the input sequences by one token.
data = d2lai.TimeMachine(2, 10)
x, y = get_dataloader(data) |> first

([3 1; 16 7; … ; 22 1; 1 21], [16 7; 21 26; … ; 1 21; 5 3])

Summary and Discussion
Language models estimate the joint probability of a text sequence. For long sequences, $n$-grams provide a convenient model by truncating the dependence on the distant past. However, there is a lot of structure but not enough frequency to deal with infrequent word combinations efficiently via Laplace smoothing. Thus, we will focus on neural language modeling in subsequent sections.
Language models can be scaled up with increased data size, model size, and amount of training compute. Large language models can perform desired tasks by predicting output text given input text instructions. As we will discuss later (e.g., :numref:sec_large-pretraining-transformers), at the present moment large language models form the basis of state-of-the-art systems across diverse tasks.
Exercises
Suppose there are 100,000 words in the training dataset. How many word frequencies and multi-word adjacent frequencies does a four-gram model need to store?
How would you model a dialogue?
What other methods can you think of for reading long sequence data?
Consider our method for discarding a uniformly random number of the first few tokens at the beginning of each epoch.
Does it really lead to a perfectly uniform distribution over the sequences in the document?
What would you have to do to make things even more uniform?
If we want a sequence example to be a complete sentence, what kind of problem does this introduce in minibatch sampling? How can we fix it?