Recurrent Neural Networks
In :numref:sec_language-model we described Markov models and $n$-grams for language modeling, where the conditional probability of token $x_t$ at time step $t$ only depends on the $n-1$ previous tokens. If we want to incorporate the possible effect of tokens earlier than time step $t-(n-1)$ on $x_t$, we need to increase $n$. However, the number of model parameters would also increase exponentially with it, as we need to store $|\mathcal{V}|^n$ numbers for a vocabulary set $\mathcal{V}$. Hence, rather than modeling $P(x_t \mid x_{t-1}, \ldots, x_{t-n+1})$ it is preferable to use a latent variable model,

$$P(x_t \mid x_{t-1}, \ldots, x_1) \approx P(x_t \mid h_{t-1}),$$

where $h_{t-1}$ is a hidden state that stores the sequence information up to time step $t-1$. In general, the hidden state at any time step $t$ could be computed based on both the current input $x_t$ and the previous hidden state $h_{t-1}$:

$$h_t = f(x_t, h_{t-1}).$$
:eqlabel:eq_ht_xt

For a sufficiently powerful function $f$ in :eqref:eq_ht_xt, the latent variable model is not an approximation. After all, $h_t$ may simply store all the data it has observed so far. However, this could potentially make both computation and storage expensive.
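To make the recursion in :eqref:eq_ht_xt concrete, here is a minimal Julia sketch. The update function f below is a hypothetical, hand-picked choice used purely for illustration (an RNN instead learns it); folding it over a toy sequence leaves a single state vector that summarizes everything observed so far.

# A hypothetical update function f; an RNN replaces this with a learned layer.
f(x, h) = tanh.(x .+ 0.5 .* h)

X = rand(4, 10)        # a toy sequence of ten 4-dimensional observations (one per column)
h0 = zeros(4)          # initial hidden state
hT = foldl((h, x) -> f(x, h), eachcol(X); init = h0)   # h_t = f(x_t, h_{t-1}) for t = 1, ..., 10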
Recall that we have discussed hidden layers with hidden units in :numref:chap_perceptrons. It is noteworthy that hidden layers and hidden states refer to two very different concepts. Hidden layers are, as explained, layers that are hidden from view on the path from input to output. Hidden states are technically speaking inputs to whatever we do at a given step, and they can only be computed by looking at data at previous time steps.
Recurrent neural networks (RNNs) are neural networks with hidden states. Before introducing the RNN model, we first revisit the MLP model introduced in :numref:sec_mlp.
using Pkg; Pkg.activate("../../d2lai")
using d2lai
using Flux
using Downloads
using StatsBase
using Plots

Activating project at `/workspace/d2l-julia/d2lai`

Neural Networks without Hidden States
Let's take a look at an MLP with a single hidden layer. Let the hidden layer's activation function be $\phi$. Given a minibatch of examples $\mathbf{X} \in \mathbb{R}^{n \times d}$ with batch size $n$ and $d$ inputs, the hidden layer output $\mathbf{H} \in \mathbb{R}^{n \times h}$ is calculated as

$$\mathbf{H} = \phi(\mathbf{X} \mathbf{W}_{xh} + \mathbf{b}_h).$$
:eqlabel:rnn_h_without_state

In :eqref:rnn_h_without_state, we have the weight parameter $\mathbf{W}_{xh} \in \mathbb{R}^{d \times h}$, the bias parameter $\mathbf{b}_h \in \mathbb{R}^{1 \times h}$, and the number of hidden units $h$ for the hidden layer; broadcasting (see :numref:subsec_broadcasting) is applied during the summation. Next, the hidden layer output $\mathbf{H}$ is used as input of the output layer, which is given by

$$\mathbf{O} = \mathbf{H} \mathbf{W}_{hq} + \mathbf{b}_q,$$

where $\mathbf{O} \in \mathbb{R}^{n \times q}$ is the output variable, $\mathbf{W}_{hq} \in \mathbb{R}^{h \times q}$ is the weight parameter, and $\mathbf{b}_q \in \mathbb{R}^{1 \times q}$ is the bias parameter of the output layer. If it is a classification problem, we can use $\mathrm{softmax}(\mathbf{O})$ to compute the probability distribution of the output categories.
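As a quick illustration, here is a sketch of this single-hidden-layer MLP with Flux layers. Note that Flux stores a minibatch as features × batch, the transpose of the row-wise notation above, and that the sizes chosen here are arbitrary.

using Flux

n, d, h, q = 3, 5, 4, 2                # arbitrary batch size and layer widths
X = rand(Float32, d, n)                # minibatch of n examples with d features each
hidden = Dense(d => h, tanh)           # the analogue of H = ϕ(X W_xh + b_h)
output = Dense(h => q)                 # the analogue of O = H W_hq + b_q
O = output(hidden(X))                  # shape (q, n)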
This is entirely analogous to the regression problem we solved previously in :numref:sec_sequence, hence we omit details. Suffice it to say that we can pick feature-label pairs at random and learn the parameters of our network via automatic differentiation and stochastic gradient descent.
Recurrent Neural Networks with Hidden States
Matters are entirely different when we have hidden states. Let's look at the structure in some more detail.
Assume that we have a minibatch of inputs $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ at time step $t$. In other words, for a minibatch of $n$ sequence examples, each row of $\mathbf{X}_t$ corresponds to one example at time step $t$ from the sequence. Next, denote by $\mathbf{H}_t \in \mathbb{R}^{n \times h}$ the hidden layer output of time step $t$. Unlike with the MLP, here we save the hidden layer output $\mathbf{H}_{t-1}$ from the previous time step and introduce a new weight parameter $\mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$ to describe how to use the hidden layer output of the previous time step in the current time step. Specifically, the calculation of the hidden layer output of the current time step is determined by the input of the current time step together with the hidden layer output of the previous time step:

$$\mathbf{H}_t = \phi(\mathbf{X}_t \mathbf{W}_{xh} + \mathbf{H}_{t-1} \mathbf{W}_{hh} + \mathbf{b}_h).$$
:eqlabel:rnn_h_with_state

Compared with :eqref:rnn_h_without_state, :eqref:rnn_h_with_state adds one more term $\mathbf{H}_{t-1} \mathbf{W}_{hh}$ and thus instantiates :eqref:eq_ht_xt. From the relationship between the hidden layer outputs $\mathbf{H}_t$ and $\mathbf{H}_{t-1}$ of adjacent time steps, we know that these variables capture and retain the sequence's historical information up to their current time step, just like the state or memory of the neural network's current time step. Therefore, such a hidden layer output is called a hidden state. Since the hidden state uses the same definition of the previous time step in the current time step, the computation of :eqref:rnn_h_with_state is recurrent. Hence, as we said, neural networks with hidden states based on recurrent computation are named recurrent neural networks. Layers that perform the computation of :eqref:rnn_h_with_state in RNNs are called recurrent layers.
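A minimal sketch of one recurrent update, again in Flux's features × batch layout (so the weights multiply from the left); the sizes, the tanh activation, and the variable names are arbitrary choices for illustration.

n, d, h = 3, 1, 4                                   # arbitrary batch size, input size, hidden size
Wxh, Whh, bh = rand(Float32, h, d), rand(Float32, h, h), zeros(Float32, h)

# One step of the recurrence: the analogue of H_t = ϕ(X_t W_xh + H_{t-1} W_hh + b_h)
rnn_step(Xt, Hprev) = tanh.(Wxh * Xt .+ Whh * Hprev .+ bh)

Xt = rand(Float32, d, n)                            # input of the current time step
Hprev = zeros(Float32, h, n)                        # hidden state from the previous time step
Ht = rnn_step(Xt, Hprev)                            # hidden state of the current time step, (h, n)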
There are many different ways for constructing RNNs. Those with a hidden state defined by :eqref:rnn_h_with_state are very common. For time step $t$, the output of the output layer is similar to the computation in the MLP:

$$\mathbf{O}_t = \mathbf{H}_t \mathbf{W}_{hq} + \mathbf{b}_q.$$

Parameters of the RNN include the weights $\mathbf{W}_{xh} \in \mathbb{R}^{d \times h}$, $\mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$, and the bias $\mathbf{b}_h \in \mathbb{R}^{1 \times h}$ of the hidden layer, together with the weights $\mathbf{W}_{hq} \in \mathbb{R}^{h \times q}$ and the bias $\mathbf{b}_q \in \mathbb{R}^{1 \times q}$ of the output layer. It is worth mentioning that even at different time steps, RNNs always use these same model parameters. Therefore, the parametrization cost of an RNN does not grow as the number of time steps increases.
The figure below illustrates the computational logic of an RNN at three adjacent time steps. At any time step $t$, the computation of the hidden state can be treated as: (i) concatenating the input $\mathbf{X}_t$ at the current time step $t$ and the hidden state $\mathbf{H}_{t-1}$ at the previous time step $t-1$; (ii) feeding the concatenation result into a fully connected layer with the activation function $\phi$. The output of such a fully connected layer is the hidden state $\mathbf{H}_t$ of the current time step $t$. In this case, the model parameters are the concatenation of $\mathbf{W}_{xh}$ and $\mathbf{W}_{hh}$, and a bias of $\mathbf{b}_h$, all from :eqref:rnn_h_with_state. The hidden state of the current time step $t$, $\mathbf{H}_t$, will participate in computing the hidden state $\mathbf{H}_{t+1}$ of the next time step $t+1$. What is more, $\mathbf{H}_t$ will also be fed into the fully connected output layer to compute the output $\mathbf{O}_t$ of the current time step $t$.
An RNN with a hidden state.
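Continuing the sketch above, unrolling the same step over a few time steps makes the parameter sharing explicit: the same Wxh, Whh, Whq, and biases are reused at every step, so the parameter count does not depend on the sequence length. (All names and sizes here are ours, for illustration only.)

q = 2                                              # arbitrary output size
Whq, bq = rand(Float32, q, h), zeros(Float32, q)   # output layer parameters, shared across time

function unroll(Xs, H0)
    H, Os = H0, Matrix{Float32}[]
    for Xt in Xs
        H = rnn_step(Xt, H)            # the hidden state carries information across steps
        push!(Os, Whq * H .+ bq)       # the analogue of O_t = H_t W_hq + b_q
    end
    return Os, H
end

Xs = [rand(Float32, d, n) for _ in 1:3]            # inputs for three adjacent time steps
Os, HT = unroll(Xs, zeros(Float32, h, n))          # three outputs and the final hidden state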
We just mentioned that the calculation of $\mathbf{X}_t \mathbf{W}_{xh} + \mathbf{H}_{t-1} \mathbf{W}_{hh}$ for the hidden state is equivalent to matrix multiplication of the concatenation of $\mathbf{X}_t$ and $\mathbf{H}_{t-1}$ and the concatenation of $\mathbf{W}_{xh}$ and $\mathbf{W}_{hh}$. Though this can be proven mathematically, in the following we just use a simple code snippet to show it. To begin with, we define matrices X, Wxh, H, and Whh, whose shapes are (1, 3), (4, 1), (4, 3), and (4, 4), respectively. Since the code below places the feature dimension first and the batch dimension second (the convention also used by Flux), the weights multiply from the left: multiplying Wxh by X, and Whh by H, and then adding these two products, we obtain a matrix of shape (4, 3).
X, Wxh = rand(1, 3), rand(4, 1)   # input (1 feature × 3 examples) and input-to-hidden weights
H, Whh = rand(4, 3), rand(4, 4)   # hidden state (4 units × 3 examples) and hidden-to-hidden weights
Wxh*X + Whh*H

4×3 Matrix{Float64}:
1.05201 1.00062 1.67581
1.60482 1.98626 2.43147
0.93348 1.02113 1.48314
 1.47263 1.4634 2.25598

Now we concatenate the matrices Wxh and Whh along columns (with hcat), and the matrices X and H along rows (with vcat). These two concatenations result in matrices of shape (4, 5) and of shape (5, 3), respectively. Multiplying these two concatenated matrices, we obtain the same output matrix of shape (4, 3) as above.
hcat(Wxh, Whh)*vcat(X, H)

4×3 Matrix{Float64}:
1.05201 1.00062 1.67581
1.60482 1.98626 2.43147
0.93348 1.02113 1.48314
 1.47263 1.4634 2.25598

RNN-Based Character-Level Language Models
Recall that for language modeling in :numref:sec_language-model, we aim to predict the next token based on the current and past tokens; thus we shift the original sequence by one token as the targets (labels). Bengio et al. [159] first proposed to use a neural network for language modeling. In the following we illustrate how RNNs can be used to build a language model. Let the minibatch size be one, and the sequence of the text be "machine". To simplify training in subsequent sections, we tokenize text into characters rather than words and consider a character-level language model. The figure below demonstrates how to predict the next character based on the current and previous characters via an RNN for character-level language modeling.
A character-level language model based on the RNN. The input and target sequences are "machin" and "achine", respectively.
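The input and target sequences in the figure are obtained simply by shifting the character sequence by one position; a short sketch (the variable names are ours):

seq = collect("machine")                        # tokenize into characters
inputs, targets = seq[1:end-1], seq[2:end]      # shift by one token
join(inputs), join(targets)                     # ("machin", "achine")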
During the training process, we run a softmax operation on the output from the output layer for each time step, and then use the cross-entropy loss to compute the error between the model output and the target. Because of the recurrent computation of the hidden state in the hidden layer, the output of time step 3 in the figure above, $\mathbf{O}_3$, is determined by the text sequence "m", "a", and "c". Since the next character of the sequence in the training data is "h", the loss of time step 3 will depend on the probability distribution of the next character generated based on the feature sequence "m", "a", "c" and the target "h" of this time step.
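As a rough sketch of this per-time-step loss with Flux (the vocabulary size and the target index below are made up), logitcrossentropy combines the softmax and the cross-entropy in one numerically stable call:

using Flux

vocab_size, n = 28, 1                            # made-up vocabulary size and batch size
Ot = rand(Float32, vocab_size, n)                # output-layer scores at one time step
yt = Flux.onehotbatch([4], 1:vocab_size)         # one-hot target character (index 4 is arbitrary)
loss_t = Flux.logitcrossentropy(Ot, yt)          # softmax + cross-entropy for this time step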
In practice, each token is represented by a $d$-dimensional vector, and we use a batch size $n > 1$. Therefore, the input $\mathbf{X}_t$ at time step $t$ will be an $n \times d$ matrix, which is identical to what we discussed in :numref:subsec_rnn_w_hidden_states.
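One common concrete choice, which the following sections make precise, is to represent each character by a one-hot vector over the vocabulary; in Flux's features × batch layout the resulting input of one time step is then a d × n matrix (the transpose of the notation above). A small sketch with a toy vocabulary:

using Flux

vocab = sort(unique(collect("machine")))         # toy vocabulary of 7 distinct characters
Xt = Flux.onehotbatch(['m', 'a'], vocab)         # one time step for a batch of n = 2 sequences
size(Xt)                                         # (7, 2), i.e. d × n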
In the following sections, we will implement RNNs for character-level language models.
Summary
A neural network that uses recurrent computation for hidden states is called a recurrent neural network (RNN). The hidden state of an RNN can capture historical information of the sequence up to the current time step. With recurrent computation, the number of RNN model parameters does not grow as the number of time steps increases. As for applications, an RNN can be used to create character-level language models.
Exercises
If we use an RNN to predict the next character in a text sequence, what is the required dimension for any output?
Why can RNNs express the conditional probability of a token at some time step based on all the previous tokens in the text sequence?
What happens to the gradient if you backpropagate through a long sequence?
What are some of the problems associated with the language model described in this section?