
Converting Raw Text into Sequence Data

Throughout this book, we will often work with text data represented as sequences of words, characters, or word pieces. To get going, we will need some basic tools for converting raw text into sequences of the appropriate form. Typical preprocessing pipelines execute the following steps:

  1. Load text as strings into memory.

  2. Split the strings into tokens (e.g., words or characters).

  3. Build a vocabulary dictionary to associate each vocabulary element with a numerical index.

  4. Convert the text into sequences of numerical indices.
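
As a rough sketch of these four steps, here is a toy version on a single sentence using only Base Julia. It is illustrative only; the names inside the `let` block are not used later, and the rest of this section builds the real pipeline.

julia
# Toy version of the four steps on one sentence (wrapped in `let` so the
# names do not leak into the rest of the notebook).
let
    raw = "The Time Machine, by H. G. Wells"                        # 1. load text as a string
    toks = split(lowercase(replace(raw, r"[^A-Za-z]+" => " ")))     # 2. split into word tokens
    vocab = Dict(tok => i for (i, tok) in enumerate(unique(toks)))  # 3. token => index vocabulary
    [vocab[tok] for tok in toks]                                    # 4. text as numerical indices
end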

julia
using Pkg; Pkg.activate("../../d2lai")
using d2lai
using Flux 
using Downloads
using StatsBase
using Plots

Reading the Dataset

Here, we will work with H. G. Wells' The Time Machine, a book containing just over 30,000 words. While real applications will typically involve significantly larger datasets, this is sufficient to demonstrate the preprocessing pipeline. The following _download method reads the raw text into a string.

julia
mutable struct TimeMachine{X, Y, RT} <: AbstractData
    X::X 
    y::Y
    raw_text::RT
end

function _download(::Type{TimeMachine}) 
    file_path = d2lai._download("timemachine.txt")
    s = open(file_path, "r") do f
        read(f, String)
    end
    return s
end
_download (generic function with 1 method)
julia
data = TimeMachine(nothing, nothing, nothing)
raw_text = _download(TimeMachine)
raw_text[1:60]
"The Time Machine, by H. G. Wells [1898]\n\n\n\n\nI\n\n\nThe Time Tra"

For simplicity, we ignore punctuation and capitalization when preprocessing the raw text.

julia
function _preprocess(::Type{TimeMachine}, raw_text)
    text = replace(raw_text, r"[^A-Za-z]+" => " ") |> lowercase
end 
text_ = _preprocess(TimeMachine, raw_text)
text_[1:60]
"the time machine by h g wells i the time traveller for so it"

Tokenization

Tokens are the atomic (indivisible) units of text. Each time step corresponds to one token, but what precisely constitutes a token is a design choice. For example, we could represent the sentence "Baby needs a new pair of shoes" as a sequence of 7 words, where the set of all words comprises a large vocabulary (typically tens or hundreds of thousands of words). Or we could represent the same sentence as a much longer sequence of 30 characters, using a much smaller vocabulary (there are only 256 distinct ASCII characters). Below, we tokenize our preprocessed text into a sequence of characters.

julia
function _tokenize(::Type{TimeMachine}, text)
    return string.(collect(text))
end 
tokens = _tokenize(TimeMachine, text_[1:30])
join(tokens, " ,")
"t ,h ,e ,  ,t ,i ,m ,e ,  ,m ,a ,c ,h ,i ,n ,e ,  ,b ,y ,  ,h ,  ,g ,  ,w ,e ,l ,l ,s , "

Vocabulary

These tokens are still strings. However, the inputs to our models must ultimately be numerical. Next, we introduce a class for constructing vocabularies, i.e., objects that associate each distinct token value with a unique index. First, we determine the set of unique tokens in our training corpus. We then assign a numerical index to each unique token. Rare vocabulary elements are often dropped for convenience. Whenever we encounter a token at training or test time that had not been previously seen or was dropped from the vocabulary, we represent it by a special "<unk>" token, signifying that this is an unknown value.

julia
struct Vocab{TF, IT, TI}
    token_freqs::TF 
    idx_to_token::IT 
    token_to_idx::TI
end 

function Vocab(; tokens = [], min_freq = 0, reserved_tokens = [])
    # Flatten a 2D list if needed
    if !isempty(tokens) && tokens[1] isa Vector
        tokens = vcat(tokens...)
    end

    # Count token frequencies and sort in decreasing order of frequency
    counter = countmap(tokens)
    token_freqs = sort(collect(counter), by = x -> x[2], rev = true)

    # The sorted list of unique tokens, including "<unk>" and any reserved tokens
    idx_to_token = sort(unique(vcat(["<unk>"], reserved_tokens,
        [string(token) for (token, freq) in token_freqs if freq >= min_freq])))

    # Token-to-index mapping
    token_to_idx = Dict(token => idx for (idx, token) in enumerate(idx_to_token))

    Vocab(token_freqs, idx_to_token, token_to_idx)
end

Base.length(v::Vocab) = length(v.idx_to_token)
unk(v::Vocab) = v.token_to_idx["<unk>"]

function Base.getindex(v::Vocab, tokens)
    if !(tokens isa AbstractVector)
        # Tokens not in the vocabulary map to the index of "<unk>"
        token = string(tokens)
        return haskey(v.token_to_idx, token) ? v.token_to_idx[token] : unk(v)
    else
        return map(t -> Base.getindex(v, t), tokens)
    end
end

to_tokens(v::Vocab, idx::Int) = v.idx_to_token[idx]
to_tokens(v::Vocab, indices::AbstractVector{<:Int}) = to_tokens.(Ref(v), indices)
to_tokens (generic function with 2 methods)

We now construct a vocabulary for our dataset, converting the sequence of strings into a list of numerical indices. Note that we have not lost any information and can easily convert our dataset back to its original (string) representation.

julia
vocab = Vocab(;tokens)
indices = vocab[tokens[1:10]]
println("indices:", indices)
println("words:", to_tokens(vocab, indices))
indices:[14, 8, 6, 1, 14, 9, 11, 6, 1, 11]
["t", "h", "e", " ", "t", "i", "m", "e", " ", "m"]

Putting It All Together

Using the above classes and methods, we package everything into the following build method for the TimeMachine type, which returns corpus, a list of token indices, and vocab, the vocabulary of The Time Machine corpus. The modifications we make here are: (i) we tokenize text into characters, not words, to simplify the training in later sections; (ii) corpus is a single list, not a list of token lists, since each text line in The Time Machine dataset is not necessarily a sentence or paragraph.

julia
function build(T::Type{TimeMachine}, raw_text, vocab = nothing)
    tokens = _tokenize(T, _preprocess(T, raw_text))
    if isnothing(vocab)
        vocab = Vocab(; tokens)
    end
    corpus = [vocab[token] for token in tokens]
    return corpus, vocab
end
corpus, vocab = build(TimeMachine, raw_text)
length(corpus), length(vocab)
(173428, 28)

Exploratory Language Statistics

Using the real corpus and the Vocab class defined over words, we can inspect basic statistics concerning word use in our corpus. Below, we construct a vocabulary from words used in The Time Machine and print the ten most frequently occurring of them.

julia
words = split(text_)
vocab = Vocab(; tokens = words)
vocab.token_freqs[1:10]
10-element Vector{Pair{SubString{String}, Int64}}:
  "the" => 2261
    "i" => 1267
  "and" => 1245
   "of" => 1155
    "a" => 816
   "to" => 695
  "was" => 552
   "in" => 541
 "that" => 443
   "my" => 440

Note that the ten most frequent words are not all that descriptive. You might even imagine that we would see a very similar list if we had chosen any book at random. Articles like "the" and "a", pronouns like "i" and "my", and prepositions like "of", "to", and "in" occur often because they serve common syntactic roles. Such words that are common but not particularly descriptive are often called stop words and, in previous generations of text classifiers based on so-called bag-of-words representations, they were most often filtered out. However, they carry meaning and it is not necessary to filter them out when working with modern RNN- and Transformer-based neural models. If you look further down the list, you will notice that word frequency decays quickly: the 10th most frequent word is less than 1/5 as common as the most frequent one. Word frequency tends to follow a power law distribution (specifically, a Zipfian distribution) as we go down the ranks. To get a better idea, we plot the word frequencies.

julia
freqs = [freq for (_, freq) in vocab.token_freqs]
plot(freqs, xlabel = "token: x", ylabel = "frequency: n(x)", xscale = :log10, yscale = :log10)

After dealing with the first few words as exceptions, all the remaining words roughly follow a straight line on a log–log plot. This phenomenon is captured by Zipf's law, which states that the frequency $n_i$ of the $i$-th most frequent word satisfies

$$n_i \propto \frac{1}{i^\alpha},$$

which is equivalent to

$$\log n_i = -\alpha \log i + c,$$

where $\alpha$ is the exponent that characterizes the distribution and $c$ is a constant.
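
The linear form suggests a simple way to estimate $\alpha$: fit a straight line to the log–log data by least squares. The sketch below does this for the word frequencies computed above, arbitrarily skipping the first ten ranks, which lie above the straight-line trend (the cutoff and variable names are illustrative choices, not part of the original pipeline):

julia
# Estimate the Zipf exponent: fit log n_i ≈ -alpha * log i + c by least
# squares, using ranks 10 and beyond of the word frequencies `freqs`.
xs = log.(10:length(freqs))
ys = log.(freqs[10:end])
A  = hcat(xs, ones(length(xs)))
slope, intercept = A \ ys      # least-squares coefficients
alpha_hat = -slope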

This should already give us pause for thought if we want to model words by counting statistics. After all, we will significantly overestimate the frequency of the tail, also known as the infrequent words. But what about other word combinations, such as two consecutive words (bigrams), three consecutive words (trigrams), and beyond? Let's see whether the bigram frequency behaves in the same manner as the single word (unigram) frequency.

julia
bigram_tokens = [join(pair, "--") for pair in zip(words[1:end-1], words[2:end])]
bigram_vocab = Vocab(; tokens = bigram_tokens)
bigram_vocab.token_freqs[1:10]
10-element Vector{Pair{String, Int64}}:
   "of--the" => 309
   "in--the" => 169
    "i--had" => 130
    "i--was" => 112
  "and--the" => 109
 "the--time" => 102
   "it--was" => 99
   "to--the" => 85
     "as--i" => 78
     "of--a" => 73

One thing is notable here. Out of the ten most frequent word pairs, nine are composed entirely of stop words and only one is relevant to the actual book: "the time". Furthermore, let's see whether the trigram frequency behaves in the same manner.

julia
trigram_tokens = [join(triple, "--") for triple in zip(words[1:end-2], words[2:end-1], words[3:end])]
trigram_vocab = Vocab(; tokens = trigram_tokens)
trigram_vocab.token_freqs[1:10]
10-element Vector{Pair{String, Int64}}:
 "the--time--traveller" => 59
   "the--time--machine" => 30
    "the--medical--man" => 24
       "it--seemed--to" => 16
     "here--and--there" => 15
           "it--was--a" => 15
          "i--did--not" => 14
       "seemed--to--me" => 14
         "i--began--to" => 13
          "i--saw--the" => 13

Now, let’s visualize the token frequency among these three models: unigrams, bigrams, and trigrams.

julia
bigram_freqs = [freq for (_, freq) in bigram_vocab.token_freqs]
trigram_freqs = [freq for (_, freq) in trigram_vocab.token_freqs]

plot(freqs, xlabel = "token: x", ylabel = "frequency: n(x)", xscale = :log10, yscale = :log10, label = "unigram")
plot!(bigram_freqs, label = "bigram")
plot!(trigram_freqs, label = "trigram")

This figure is quite exciting. First, beyond unigram words, sequences of words also appear to follow Zipf's law, albeit with a smaller exponent $\alpha$, depending on the sequence length. Second, the number of distinct n-grams is not that large. This gives us hope that there is quite a lot of structure in language. Third, many n-grams occur very rarely. This makes certain methods unsuitable for language modeling and motivates the use of deep learning models. We will discuss this in the next section.
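
To make the second and third observations concrete, we can count how many distinct n-grams there are and what fraction of them occur only once (a quick check, output omitted):

julia
# Number of distinct unigrams, bigrams, and trigrams (each vocabulary also counts "<unk>") ...
length(vocab), length(bigram_vocab), length(trigram_vocab)
# ... and the fraction of distinct trigrams that appear exactly once.
count(freq == 1 for (_, freq) in trigram_vocab.token_freqs) / length(trigram_vocab.token_freqs)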

Summary

Text is among the most common forms of sequence data encountered in deep learning. Common choices for what constitutes a token are characters, words, and word pieces. To preprocess text, we usually (i) split text into tokens; (ii) build a vocabulary to map token strings to numerical indices; and (iii) convert text data into token indices for models to manipulate. In practice, the frequency of words tends to follow Zipf's law. This is true not just for individual words (unigrams), but also for n-grams.

Exercises

  1. In the experiment of this section, tokenize text into words and vary the min_freq argument value of the Vocab instance. Qualitatively characterize how changes in min_freq impact the size of the resulting vocabulary.

  2. Estimate the exponent of Zipfian distribution for unigrams, bigrams, and trigrams in this corpus.

  3. Find some other sources of data (download a standard machine learning dataset, pick another public domain book, scrape a website, etc.). For each, tokenize the data at both the word and character levels. How do the vocabulary sizes compare with The Time Machine corpus at equivalent values of min_freq? Estimate the exponent of the Zipfian distribution corresponding to the unigram and bigram distributions for these corpora. How do they compare with the values that you observed for The Time Machine corpus?
