Converting Raw Text into Sequence Data
Throughout this book, we will often work with text data represented as sequences of words, characters, or word pieces. To get going, we will need some basic tools for converting raw text into sequences of the appropriate form. Typical preprocessing pipelines execute the following steps:
Load text as strings into memory.
Split the strings into tokens (e.g., words or characters).
Build a vocabulary dictionary to associate each vocabulary element with a numerical index.
Convert the text into sequences of numerical indices.
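Before building the full pipeline for our dataset, here is a minimal, self-contained sketch of these four steps on a toy string, using only base Julia; the variable names (toy_text, toy_vocab, and so on) are illustrative only.
# Hypothetical toy example of the four preprocessing steps (base Julia only).
toy_text = "the time machine by h g wells"           # 1. text loaded as a string
toy_tokens = split(toy_text)                          # 2. split into word tokens
toy_vocab = Dict(tok => i for (i, tok) in enumerate(unique(toy_tokens)))  # 3. token -> index
toy_indices = [toy_vocab[tok] for tok in toy_tokens]  # 4. sequence of numerical indices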
using Pkg; Pkg.activate("../../d2lai")
using d2lai
using Flux
using Downloads
using StatsBase
using Plots
Reading the Dataset
Here, we will work with H. G. Wells' The Time Machine, a book containing just over 30,000 words. While real applications will typically involve significantly larger datasets, this is sufficient to demonstrate the preprocessing pipeline. The following _download method reads the raw text into a string.
mutable struct TimeMachine{X, Y, RT} <: AbstractData
X::X
y::Y
raw_text::RT
end
function _download(::Type{TimeMachine})
file_path = d2lai._download("timemachine.txt")
s = open(file_path, "r") do f
read(f, String)
end
return s
end
data = TimeMachine(nothing, nothing, nothing)
raw_text = _download(TimeMachine)
raw_text[1:60]"The Time Machine, by H. G. Wells [1898]\n\n\n\n\nI\n\n\nThe Time Tra"For simplicity, we ignore punctuation and capitalization when preprocessing the raw text.
function _preprocess(::Type{TimeMachine}, raw_text)
text = replace(raw_text, r"[^A-Za-z]+" => " ") |> lowercase
end
text_ = _preprocess(TimeMachine, raw_text)
text_[1:60]"the time machine by h g wells i the time traveller for so it"Tokenization
Tokens are the atomic (indivisible) units of text. Each time step corresponds to one token, but what precisely constitutes a token is a design choice. For example, we could represent the sentence "Baby needs a new pair of shoes" as a sequence of 7 words, where the set of all words comprises a large vocabulary (typically tens or hundreds of thousands of words). Or we could represent the same sentence as a much longer sequence of 30 characters, using a much smaller vocabulary (the ASCII character set contains only 128 distinct characters). Below, we tokenize our preprocessed text into a sequence of characters.
function _tokenize(::Type{TimeMachine}, text)
return string.(collect(text))
end
tokens = _tokenize(TimeMachine, text_[1:30])
join(tokens, " ,")"t ,h ,e , ,t ,i ,m ,e , ,m ,a ,c ,h ,i ,n ,e , ,b ,y , ,h , ,g , ,w ,e ,l ,l ,s , "Vocabulary
Vocabulary
These tokens are still strings. However, the inputs to our models must ultimately be numerical. Next, we introduce a class for constructing vocabularies, i.e., objects that associate each distinct token value with a unique index. First, we determine the set of unique tokens in our training corpus. We then assign a numerical index to each unique token. Rare vocabulary elements are often dropped for convenience. Whenever we encounter a token at training or test time that had not been previously seen or was dropped from the vocabulary, we represent it by a special "<unk>" token, signifying that this is an unknown value.
struct Vocab{TF, IT, TI}
token_freqs::TF
idx_to_token::IT
token_to_idx::TI
end
function Vocab(; tokens = [], min_freq = 0, reserved_tokens = [])
# Flatten a 2D list if needed
if !isempty(tokens) && tokens[1] isa Vector
tokens = vcat(tokens...)
end
# Count token frequencies
counter = countmap(tokens)
token_freqs = sort(collect(counter), by=x->x[2], rev=true)
# The list of unique tokens
idx_to_token = sort(vcat(["<unk>"], reserved_tokens,
[(string(token)) for (token, freq) in token_freqs if freq >= min_freq]))
# Token to index mapping
token_to_idx = Dict(token => idx for (idx, token) in enumerate(idx_to_token))
Vocab(token_freqs, idx_to_token, token_to_idx)
end
Base.length(v::Vocab) = length(v.idx_to_token)
unk(v::Vocab) = v.token_to_idx["<unk>"]
function Base.getindex(v::Vocab, tokens)
if !(typeof(tokens) <: AbstractVector)
return get(v.token_to_idx, string(tokens), unk(v))
else
return map(t -> Base.getindex(v, t), string.(tokens))
end
end
to_tokens(v::Vocab, idx::Int) = v.idx_to_token[idx]
to_tokens(v::Vocab, indices::AbstractVector{<:Int}) = to_tokens.(Ref(v), indices)
We now construct a vocabulary for our dataset, converting the sequence of strings into a list of numerical indices. Note that we have not lost any information and can easily convert our dataset back to its original (string) representation.
vocab = Vocab(;tokens)
indices = vocab[tokens[1:10]]
println("indices:", indices)
println("words:", to_tokens(vocab, indices))indices:[14, 8, 6, 1, 14, 9, 11, 6, 1, 11]
["t", "h", "e", " ", "t", "i", "m", "e", " ", "m"]Putting It All Together
Putting It All Together
Using the types and methods defined above, we package everything into the following build method for the TimeMachine class, which returns corpus, a list of token indices, and vocab, the vocabulary of The Time Machine corpus. The modifications we made here are: (i) we tokenize text into characters, not words, to simplify the training in later sections; (ii) corpus is a single list, not a list of token lists, since each text line in The Time Machine dataset is not necessarily a sentence or paragraph.
function build(T::Type{TimeMachine}, raw_text, vocab = nothing)
tokens = _tokenize(T, _preprocess(T, raw_text))
if isnothing(vocab)
vocab = Vocab(; tokens)
end
corpus = [vocab[token] for token in tokens]
return corpus, vocab
end
corpus, vocab = build(TimeMachine, raw_text)
length(corpus), length(vocab)
(173428, 28)
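As a quick sanity check (a sketch, not part of the book's code), we can confirm the round trip: mapping the first few indices back to tokens and joining them reproduces the start of the preprocessed text.
# Round-trip check: indices -> characters -> string matches the preprocessed text.
join(to_tokens(vocab, corpus[1:30])) == _preprocess(TimeMachine, raw_text)[1:30]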
Exploratory Language Statistics
Using the real corpus and the Vocab class defined over words, we can inspect basic statistics concerning word use in our corpus. Below, we construct a vocabulary from words used in The Time Machine and print the ten most frequently occurring of them.
words = split(text_)
vocab = Vocab(; tokens = words)
vocab.token_freqs[1:10]
10-element Vector{Pair{SubString{String}, Int64}}:
"the" => 2261
"i" => 1267
"and" => 1245
"of" => 1155
"a" => 816
"to" => 695
"was" => 552
"in" => 541
"that" => 443
"my" => 440Note that (the ten most frequent words) are not all that descriptive. You might even imagine that we might see a very similar list if we had chosen any book at random. Articles like "the" and "a", pronouns like "i" and "my", and prepositions like "of", "to", and "in" occur often because they serve common syntactic roles. Such words that are common but not particularly descriptive are often called (stop words) and, in previous generations of text classifiers based on so-called bag-of-words representations, they were most often filtered out. However, they carry meaning and it is not necessary to filter them out when working with modern RNN- and Transformer-based neural models. If you look further down the list, you will notice that word frequency decays quickly. The
freqs = [freq for (_, freq) in vocab.token_freqs]
plot(freqs, xlabel = "token: x", ylabel = "frequency: n(x)", xscale = :log10, yscale = :log10)
After dealing with the first few words as exceptions, all the remaining words roughly follow a straight line on a log-log plot. This phenomenon is captured by Zipf's law, which states that the frequency $n_i$ of the $i$-th most frequent word satisfies
$$n_i \propto \frac{1}{i^\alpha},$$
which is equivalent to
$$\log n_i = -\alpha \log i + c,$$
where $\alpha$ is the exponent that characterizes the distribution and $c$ is a constant.
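The log-linear form suggests a simple way to estimate the exponent: fit a line to log frequency against log rank. Below is a rough sketch (skipping the first few exceptional words; not a rigorous fit, and alpha_hat is an illustrative name), after which we count consecutive word pairs (bigrams) to see whether their frequencies behave in a similar manner.
# Rough estimate of the Zipf exponent alpha: slope of the least-squares line
# through (log rank, log frequency), skipping the first few exceptional words.
ranks = 10:length(freqs)
xs, ys = log.(ranks), log.(freqs[ranks])
alpha_hat = -cov(xs, ys) / var(xs)   # slope of the fitted line is -alpha, so negate it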
bigram_tokens = [join(pair, "--") for pair in zip(words[1:end-1], words[2:end])]
bigram_vocab = Vocab(; tokens = bigram_tokens)
bigram_vocab.token_freqs[1:10]
10-element Vector{Pair{String, Int64}}:
"of--the" => 309
"in--the" => 169
"i--had" => 130
"i--was" => 112
"and--the" => 109
"the--time" => 102
"it--was" => 99
"to--the" => 85
"as--i" => 78
"of--a" => 73One thing is notable here. Out of the ten most frequent word pairs, nine are composed of both stop words and only one is relevant to the actual book–-"the time". Furthermore, let's see whether the trigram frequency behaves in the same manner.
trigram_tokens = [join(triple, "--") for triple in zip(words[1:end-2], words[2:end-1], words[3:end])]
trigram_vocab = Vocab(; tokens = trigram_tokens)
trigram_vocab.token_freqs[1:10]
10-element Vector{Pair{String, Int64}}:
"the--time--traveller" => 59
"the--time--machine" => 30
"the--medical--man" => 24
"it--seemed--to" => 16
"here--and--there" => 15
"it--was--a" => 15
"i--did--not" => 14
"seemed--to--me" => 14
"i--began--to" => 13
"i--saw--the" => 13Now, let’s visualize the token frequency among these three models: unigrams, bigrams, and trigrams.
bigram_freqs = [freq for (_, freq) in bigram_vocab.token_freqs]
trigram_freqs = [freq for (_, freq) in trigram_vocab.token_freqs]
plot(freqs, xlabel = "token: x", ylabel = "frequency: n(x)", xscale = :log10, yscale = :log10, label = "unigram")
plot!(bigram_freqs, label = "bigram")
plot!(trigram_freqs, label = "trigram")
This figure is quite exciting. First, beyond unigram words, sequences of words also appear to follow Zipf's law, albeit with a smaller exponent $\alpha$, depending on the sequence length. Second, the number of distinct n-grams is not that large, which gives us hope that there is quite a lot of structure in language.
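To back up the second observation, we can compare the number of distinct unigrams, bigrams, and trigrams directly (a quick check; vocab here is the word-level vocabulary built above):
# Number of distinct unigrams, bigrams, and trigrams in the corpus.
length(vocab), length(bigram_vocab), length(trigram_vocab)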
Summary
Text is among the most common forms of sequence data encountered in deep learning. Common choices for what constitutes a token are characters, words, and word pieces. To preprocess text, we usually (i) split text into tokens; (ii) build a vocabulary to map token strings to numerical indices; and (iii) convert text data into token indices for models to manipulate. In practice, the frequency of words tends to follow Zipf's law. This is true not just for individual words (unigrams), but also for longer n-grams such as bigrams and trigrams.
Exercises
1. In the experiment of this section, tokenize text into words and vary the min_freq argument of the Vocab constructor. Qualitatively characterize how changes in min_freq impact the size of the resulting vocabulary.
2. Estimate the exponent of the Zipfian distribution for unigrams, bigrams, and trigrams in this corpus.
3. Find some other sources of data (download a standard machine learning dataset, pick another public domain book, scrape a website, etc.). For each, tokenize the data at both the word and character levels. How do the vocabulary sizes compare with The Time Machine corpus at equivalent values of min_freq? Estimate the exponent of the Zipfian distribution corresponding to the unigram and bigram distributions for these corpora. How do they compare with the values that you observed for The Time Machine corpus?