Implementation of Multilayer Perceptrons

Multilayer perceptrons (MLPs) are not much more complex to implement than simple linear models. The key conceptual difference is that we now concatenate multiple layers.

julia

using Pkg
Pkg.activate("../../d2lai")
using d2lai, Flux, Plots, Distributions

  Activating project at `/workspace/workspace/d2l-julia/d2lai`

Initializing Model Parameters

Recall that Fashion-MNIST contains 10 classes, and that each image consists of a $28 \times 28 = 784$ grid of grayscale pixel values. As before, we will disregard the spatial structure among the pixels for now, so we can think of this as a classification dataset with 784 input features and 10 classes. To begin, we will implement an MLP with one hidden layer and 256 hidden units. Both the number of layers and their width are adjustable (they are considered hyperparameters). Typically, we choose the layer widths to be divisible by larger powers of 2. This is computationally efficient due to the way memory is allocated and addressed in hardware.

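Again, we will represent our parameters with several tensors. Note that for every layer, we must keep track of one weight matrix and one bias vector. We do not allocate memory for the gradients by hand here: marking the fields as trainable with Flux.@layer lets the framework's automatic differentiation compute the gradients of the loss with respect to these parameters.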

julia

struct MLP <: AbstractClassifier 
    W1::AbstractArray
    W2::AbstractArray
    B1::AbstractArray 
    B2::AbstractArray 
    args::NamedTuple
end

function MLP(num_inputs::Int, num_outputs::Int, num_hiddens::Int, lr, sigma = 0.01)
    W1 = rand(Normal(0, sigma), (num_hiddens, num_inputs))
    B1 = zeros(num_hiddens, 1)
    W2 = rand(Normal(0, sigma), (num_outputs, num_hiddens))
    B2 = zeros(num_outputs, 1)
    args = (num_inputs = num_inputs, num_hiddens = num_hiddens, num_outputs = num_outputs, lr = lr)
    MLP(W1, W2, B1, B2, args)
end
Flux.@layer MLP trainable=(W1,W2,B1,B2)
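As a quick sanity check on the shapes, the following sketch recreates the same four arrays and counts the parameters. It uses plain `randn` scaled by `sigma` instead of `rand(Normal(0, sigma), ...)`, an assumption made only so that it runs without `Distributions`, `Flux`, or `d2lai`.

```julia
# Sketch only: plain-Julia stand-ins for the arrays built in the MLP constructor.
num_inputs, num_hiddens, num_outputs = 28 * 28, 256, 10
sigma = 0.01

W1 = sigma .* randn(num_hiddens, num_inputs)   # hidden-layer weights, 256 x 784
B1 = zeros(num_hiddens, 1)                     # hidden-layer bias, 256 x 1
W2 = sigma .* randn(num_outputs, num_hiddens)  # output-layer weights, 10 x 256
B2 = zeros(num_outputs, 1)                     # output-layer bias, 10 x 1

# One weight matrix and one bias vector per layer.
num_params = sum(length, (W1, B1, W2, B2))
println("W1: ", size(W1), ", W2: ", size(W2))
println("total parameters: ", num_params)
```

With these sizes the count is $784 \times 256 + 256 + 256 \times 10 + 10 = 203530$, almost all of it in the first weight matrix.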

Model

To make sure we know how everything works, we will implement the ReLU activation ourselves rather than invoking the built-in relu function directly.

julia

relu_custom(x) = max(x, 0.)

relu_custom (generic function with 1 method)
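To see what this function does on a batch of values, here is a small sketch that repeats the definition and applies it elementwise with Julia's dot syntax. The input vector is made up for illustration.

```julia
# Same definition as in the text, repeated so the snippet is self-contained.
relu_custom(x) = max(x, 0.)

x = [-2.0, -0.5, 0.0, 1.5]
y = relu_custom.(x)            # the dot broadcasts over every element
println(y)

# The literal `0.` is a Float64, so a Float32 input is promoted.
println(typeof(relu_custom(1f0)))
```

Negative entries are clipped to zero and positive entries pass through unchanged. If preserving the input element type matters, `max(x, zero(x))` is one possible alternative, since it avoids the promotion to Float64.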

Since we are disregarding spatial structure, we reshape each two-dimensional image into a flat vector of length num_inputs. Finally, we implement our model with just a few lines of code. Since we use the framework's built-in autograd, this is all that it takes.

julia

function d2lai.forward(m::MLP, x)
    H = relu_custom.(m.W1*x .+ m.B1)
    O = softmax(m.W2*H .+ m.B2)
    return O
end
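The shapes flowing through this forward pass can be checked with a standalone sketch. It assumes a random batch of four flattened images and a hand-written column-wise softmax (`softmax_cols` is a hypothetical helper standing in for Flux's `softmax`, which also normalizes over the first dimension by default), so that it runs without any packages.

```julia
# Sketch only: a standalone version of the forward pass.
relu_custom(x) = max(x, 0.)

# Hypothetical helper: softmax over each column (one column per example).
function softmax_cols(z)
    e = exp.(z .- maximum(z; dims = 1))  # subtract the column max for numerical stability
    return e ./ sum(e; dims = 1)
end

num_inputs, num_hiddens, num_outputs, batchsize = 784, 256, 10, 4
W1 = 0.01 .* randn(num_hiddens, num_inputs)
B1 = zeros(num_hiddens, 1)
W2 = 0.01 .* randn(num_outputs, num_hiddens)
B2 = zeros(num_outputs, 1)

X = rand(num_inputs, batchsize)      # one flattened image per column
H = relu_custom.(W1 * X .+ B1)       # 256 x 4 hidden activations
O = softmax_cols(W2 * H .+ B2)       # 10 x 4 class probabilities

println(size(H), " ", size(O))
println(sum(O; dims = 1))            # each column sums to approximately 1
```

The bias arrays have a trailing dimension of size 1, so `.+` broadcasts them across all columns of the batch.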

Training

Fortunately, the training loop for MLPs is exactly the same as for softmax regression. We define the model, data, and trainer, then finally invoke the fit method on model and data.

julia

function d2lai.loss(m::MLP, y_pred, y)
    # cross entropy 
    # y_pred is an array of n_outputs x batchsize 
    # y actual is a vector of labels 
    y_prob = getindex.(eachcol(y_pred), y .+ 1)
    mean(-1*log.(y_prob))
end
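The indexing step in this loss is easiest to follow on a tiny example. The sketch below uses made-up probabilities for two examples and three classes; the labels are zero-based, which is why they are shifted by one before indexing.

```julia
using Statistics  # provides mean

# Two examples (columns), three classes (rows); values are made up for illustration.
y_pred = [0.7 0.2;
          0.2 0.5;
          0.1 0.3]
y = [0, 2]   # zero-based labels, hence the `.+ 1` below

# For each column, pick the predicted probability of the true class.
y_prob = getindex.(eachcol(y_pred), y .+ 1)   # picks 0.7 and 0.3
loss = mean(-1 * log.(y_prob))
println(y_prob)
println(loss)
```

Only the probability assigned to the correct class enters the loss, so a confident correct prediction (0.7) contributes less than a hesitant one (0.3).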

julia

model = MLP(28*28, 10, 256, 0.01)
opt = Descent(0.01)
data = d2lai.FashionMNISTData(; batchsize = 256, flatten=true)
trainer = Trainer(model, data, opt; max_epochs = 10)
d2lai.fit(trainer)

    [ Info: Train Loss: 1.8086399910502988, Val Loss: 1.8758227031595318, Val Acc: 0.4375
    [ Info: Train Loss: 1.1667184916733622, Val Loss: 1.2518912257203245, Val Acc: 0.625
    [ Info: Train Loss: 1.0179561515744096, Val Loss: 0.9484510179121753, Val Acc: 0.6875
    [ Info: Train Loss: 0.8624210317622039, Val Loss: 0.792685096123279, Val Acc: 0.75
    [ Info: Train Loss: 0.8150524736874862, Val Loss: 0.6771740790875137, Val Acc: 0.75
    [ Info: Train Loss: 0.7064154271294161, Val Loss: 0.5977226518346531, Val Acc: 0.75
    [ Info: Train Loss: 0.6743462036084669, Val Loss: 0.5277229711836189, Val Acc: 0.8125
    [ Info: Train Loss: 0.5125813502508515, Val Loss: 0.47928279763561915, Val Acc: 0.9375
    [ Info: Train Loss: 0.6353711775451608, Val Loss: 0.4239313740731752, Val Acc: 0.9375
    [ Info: Train Loss: 0.49330677526811845, Val Loss: 0.39833012812119917, Val Acc: 0.875

(MLP([0.0025907253012318 0.004434211742339771 … 0.005948848386828566 -0.004396632680610081; -0.01389325653004344 -0.0011132944731336177 … 0.001565572069784555 -0.011690286663672693; … ; -0.0030419971399973694 -0.009797153218174827 … 0.017208869386144 -0.0021644320780200683; 0.005653869642510044 0.00516614739146323 … -0.00930094933047343 0.00125522016678587], [-0.02261086552697285 0.12511211528987537 … 0.00770935608775288 -0.008336313757029881; 0.12241688893245524 0.06049709111587903 … 0.006157700732419801 0.004634774334394043; … ; -0.0032371654640280965 0.036300996101492164 … -0.004576708890541834 0.0032142054671962664; -0.03554650737048207 -0.12095986363824825 … 0.004313900447319184 0.002968760071406733], [0.003992542412849325; 0.00498983816799759; … ; -0.0007639085639713631; -0.0011848515594961512;;], [0.00885574653302873; 0.018424048539937087; … ; -0.16186837388373113; -0.2571295877214314;;], (num_inputs = 784, num_hiddens = 256, num_outputs = 10, lr = 0.01)), (val_loss = [0.6075688169558727, 0.5900464218227306, 0.7035459637079877, 0.6316132833104677, 0.7028046369944622, 0.6129771166217814, 0.589745203955325, 0.6413945676206254, 0.5793911786651581, 0.6014784633828493  …  0.664900760190743, 0.7017198448903048, 0.6059283986378813, 0.593572766971733, 0.6947916744852096, 0.6572553180778933, 0.6345673023591967, 0.6922486000942907, 0.6389265185331557, 0.39833012812119917], val_acc = [0.76953125, 0.8125, 0.76171875, 0.78515625, 0.7421875, 0.78515625, 0.81640625, 0.76171875, 0.80859375, 0.78125  …  0.765625, 0.75390625, 0.7890625, 0.80859375, 0.7421875, 0.76171875, 0.75, 0.7890625, 0.796875, 0.875]))

Concise Implementation

As you might expect, by relying on the high-level APIs, we can implement MLPs even more concisely.

Model

Compared with our concise implementation of softmax regression (:numref:sec_softmax_concise), the only difference is that we add two fully connected layers where we previously added only one. The first is the hidden layer, the second is the output layer.

julia

struct MLPConcise{N, A} <: AbstractClassifier 
    net::N 
    args::A
end

function MLPConcise(num_inputs::Int64, num_outputs::Int64, num_hiddens::Int64, lr, sigma = 0.01)
    args = (num_inputs = num_inputs, num_hiddens = num_hiddens, num_outputs = num_outputs, lr = lr)
    net = Chain(Dense(num_inputs, num_hiddens, relu), Dense(num_hiddens, num_outputs), Flux.softmax)
    MLPConcise(net, args)
end
d2lai.forward(m::MLPConcise, x) = m.net(x)
d2lai.loss(m::MLPConcise, y_pred, y) = Flux.crossentropy(y_pred, Flux.onehotbatch(y, 0:9))
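The one-hot formulation of the loss used here computes the same quantity as the index-based loss from the from-scratch version. The following plain-Julia sketch illustrates this under stated assumptions: `onehot_matrix` is a hypothetical helper written by hand rather than `Flux.onehotbatch`, and Flux's own `crossentropy` may differ in details such as numerical safeguards.

```julia
using Statistics  # provides mean

# Hypothetical helper: one column per example, a single 1 in the row of the true class.
onehot_matrix(y, labels) = Float64.(labels .== permutedims(y))

# Made-up probabilities: two examples (columns), three classes (rows).
y_pred = [0.7 0.2;
          0.2 0.5;
          0.1 0.3]
y = [0, 2]                      # zero-based labels
Y = onehot_matrix(y, 0:2)       # 3 x 2 one-hot matrix

# Cross-entropy as a sum over classes, averaged over the batch.
loss_onehot = mean(-sum(Y .* log.(y_pred); dims = 1))
# Index-based version, as in the from-scratch loss.
loss_index = mean(-log.(getindex.(eachcol(y_pred), y .+ 1)))

println(loss_onehot, " ", loss_index)
```

Multiplying by the one-hot matrix zeroes out every term except the log-probability of the true class, which is exactly what the indexing does directly.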

julia


model = MLPConcise(28*28, 10, 256, 0.01)
opt = Descent(0.01)
data = d2lai.FashionMNISTData(; batchsize = 256, flatten=true)
trainer = Trainer(model, data, opt; max_epochs = 10)
d2lai.fit(trainer)

    [ Info: Train Loss: 1.032749, Val Loss: 0.7838099, Val Acc: 0.75
    [ Info: Train Loss: 0.6469836, Val Loss: 0.54223603, Val Acc: 0.8125
    [ Info: Train Loss: 0.7175854, Val Loss: 0.44240537, Val Acc: 0.8125
    [ Info: Train Loss: 0.5304204, Val Loss: 0.39751008, Val Acc: 0.8125
    [ Info: Train Loss: 0.5611671, Val Loss: 0.37504986, Val Acc: 0.75
    [ Info: Train Loss: 0.51782006, Val Loss: 0.35605115, Val Acc: 0.8125
    [ Info: Train Loss: 0.5623277, Val Loss: 0.33210504, Val Acc: 0.8125
    [ Info: Train Loss: 0.5539348, Val Loss: 0.32132548, Val Acc: 0.8125
    [ Info: Train Loss: 0.5783891, Val Loss: 0.32704777, Val Acc: 0.875
    [ Info: Train Loss: 0.46418503, Val Loss: 0.31624204, Val Acc: 0.875

(MLPConcise{Chain{Tuple{Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, typeof(softmax)}}, @NamedTuple{num_inputs::Int64, num_hiddens::Int64, num_outputs::Int64, lr::Float64}}(Chain(Dense(784 => 256, relu), Dense(256 => 10), softmax), (num_inputs = 784, num_hiddens = 256, num_outputs = 10, lr = 0.01)), (val_loss = Float32[0.47921216, 0.48217854, 0.5917104, 0.49460533, 0.54532135, 0.48910785, 0.46038538, 0.5588525, 0.44173148, 0.51100296  …  0.51876605, 0.5761145, 0.48369744, 0.5024649, 0.5960579, 0.54009736, 0.4811007, 0.5494064, 0.49265826, 0.31624204], val_acc = [0.828125, 0.85546875, 0.8046875, 0.828125, 0.81640625, 0.8359375, 0.86328125, 0.8359375, 0.84765625, 0.828125  …  0.8359375, 0.78515625, 0.8203125, 0.81640625, 0.7890625, 0.80859375, 0.80078125, 0.82421875, 0.84765625, 0.875]))

Summary

Now that we have more practice in designing deep networks, the step from a single to multiple layers of deep networks does not pose such a significant challenge any longer. In particular, we can reuse the training algorithm and data loader. Note, though, that implementing MLPs from scratch is nonetheless messy: naming and keeping track of the model parameters makes it difficult to extend models. For instance, imagine wanting to insert another layer between layers 42 and 43. This might now be layer 42b, unless we are willing to perform sequential renaming. Moreover, if we implement the network from scratch, it is much more difficult for the framework to perform meaningful performance optimizations.

Nonetheless, you have now reached the state of the art of the late 1980s when fully connected deep networks were the method of choice for neural network modeling. Our next conceptual step will be to consider images. Before we do so, we need to review a number of statistical basics and details on how to compute models efficiently.

Exercises

1. Change the number of hidden units num_hiddens and plot how its number affects the accuracy of the model. What is the best value of this hyperparameter?
2. Try adding a hidden layer to see how it affects the results.
3. Why is it a bad idea to insert a hidden layer with a single neuron? What could go wrong?
4. How does changing the learning rate alter your results? With all other parameters fixed, which learning rate gives you the best results? How does this relate to the number of epochs?
5. Let's optimize over all hyperparameters jointly, i.e., learning rate, number of epochs, number of hidden layers, and number of hidden units per layer.
    1. What is the best result you can get by optimizing over all of them?
    2. Why is it much more challenging to deal with multiple hyperparameters?
    3. Describe an efficient strategy for optimizing over multiple parameters jointly.
6. Compare the speed of the framework and the from-scratch implementation for a challenging problem. How does it change with the complexity of the network?
7. Measure the speed of tensor–matrix multiplications for well-aligned and misaligned matrices. For instance, test for matrices with dimension 1024, 1025, 1026, 1028, and 1032.
    1. How does this change between GPUs and CPUs?
    2. Determine the memory bus width of your CPU and GPU.
8. Try out different activation functions. Which one works best?
9. Is there a difference between weight initializations of the network? Does it matter?
