
Convolutional Neural Networks (LeNet)

We now have all the ingredients required to assemble a fully functional CNN. In our earlier encounter with image data, we applied a linear model with softmax regression (:numref:sec_softmax_scratch) and an MLP (:numref:sec_mlp-implementation) to pictures of clothing in the Fashion-MNIST dataset. To make such data amenable to these models, we first flattened each image from a 28×28 matrix into a fixed-length 784-dimensional vector, and thereafter processed it with fully connected layers. Now that we have a handle on convolutional layers, we can retain the spatial structure in our images. As an additional benefit of replacing fully connected layers with convolutional layers, we will enjoy more parsimonious models that require far fewer parameters.

In this section, we will introduce LeNet, among the first published CNNs to capture wide attention for its performance on computer vision tasks. The model was introduced by (and named for) Yann LeCun, then a researcher at AT&T Bell Labs, for the purpose of recognizing handwritten digits in images [38]. This work represented the culmination of a decade of research developing the technology; LeCun's team published the first study to successfully train CNNs via backpropagation [83].

At the time, LeNet achieved outstanding results, matching the performance of support vector machines, then a dominant approach in supervised learning, with an error rate of less than 1% per digit. LeNet was eventually adapted to recognize digits for processing deposits in ATMs. To this day, some ATMs still run the code that Yann LeCun and his colleague Leon Bottou wrote in the 1990s!

julia
using Pkg; Pkg.activate("../../d2lai")
using d2lai
using Flux 
using CUDA, cuDNN
  Activating project at `c:\Users\abhar\Personal\d2l-julia\d2lai`

LeNet

At a high level, LeNet (LeNet-5) consists of two parts: (i) a convolutional encoder consisting of two convolutional layers; and (ii) a dense block consisting of three fully connected layers. The architecture is summarized in :numref:img_lenet.

Figure (🏷️img_lenet): The overall architecture of LeNet-5, a convolutional encoder followed by a dense block.

The basic units in each convolutional block are a convolutional layer, a sigmoid activation function, and a subsequent average pooling operation. Note that while ReLUs and max-pooling work better, they had not yet been discovered. Each convolutional layer uses a 5×5 kernel and a sigmoid activation function. These layers map spatially arranged inputs to a number of two-dimensional feature maps, typically increasing the number of channels. The first convolutional layer has 6 output channels, while the second has 16. Each 2×2 pooling operation (stride 2) reduces dimensionality by a factor of 4 via spatial downsampling. The convolutional block emits an output with shape (width, height, number of channels, batch size), following Flux's convention for image data.
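
To make the shape bookkeeping concrete, here is a minimal sketch (not part of the original text) of a single LeNet-style block applied to a dummy 28×28 single-channel image. Flux stores image batches as (width, height, channels, batch) arrays.

julia
# Sketch: one convolutional block of LeNet on a dummy image.
block = Chain(
    Conv((5,5), 1 => 6, sigmoid; pad = 2),   # padding of 2 keeps the 28×28 spatial size
    MeanPool((2,2), stride = 2)              # 2×2 average pooling halves height and width
)
x = randn(Float32, 28, 28, 1, 1)             # (width, height, channels, batch)
size(block(x))                               # (14, 14, 6, 1)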

In order to pass output from the convolutional block to the dense block, we must flatten each example in the minibatch. In other words, we take this four-dimensional input and transform it into the two-dimensional input expected by fully connected layers: as a reminder, the two-dimensional representation that we desire uses the first dimension to hold the flat vector representation of each example and the second to index examples in the minibatch, following Flux's feature-first convention. LeNet's dense block has three fully connected layers, with 120, 84, and 10 outputs, respectively. Because we are still performing classification, the 10-dimensional output layer corresponds to the number of possible output classes.
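
As a quick illustration (a sketch, not part of the original code), Flux.flatten collapses every dimension except the trailing batch dimension:

julia
# Sketch: flattening (5, 5, 16, N) feature maps into 400-dimensional columns.
X = randn(Float32, 5, 5, 16, 2)   # two examples from the convolutional block
size(Flux.flatten(X))             # (400, 2): features first, batch second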

While getting to the point where you truly understand what is going on inside LeNet may have taken a bit of work, we hope that the following code snippet will convince you that implementing such models with modern deep learning frameworks is remarkably simple. We need only instantiate a Chain and compose the appropriate layers, using Xavier (Glorot) initialization as introduced in :numref:subsec_xavier.

julia
struct LeNet{N} <: AbstractClassifier
    net::N
end

function LeNet(; num_classes = 10, init = Flux.glorot_normal)
    net = Chain(
        Conv((5,5), 1 => 6, sigmoid; pad = 2, init = init),
        MeanPool((2,2), stride = 2),
        Conv((5,5), 6 => 16, sigmoid, init = init),
        MeanPool((2,2), stride = 2),
        Flux.flatten,
        Dense(400 => 120, sigmoid, init = init),
        Dense(120 => 84, sigmoid, init = init),
        Dense(84 => num_classes, init = init),
        Flux.softmax
    ) |> f64
    LeNet(net)
end

Flux.@layer LeNet

We have taken some liberty in the reproduction of LeNet insofar as we have replaced the Gaussian activation layer by a softmax layer. This greatly simplifies the implementation, not least due to the fact that the Gaussian decoder is rarely used nowadays. Other than that, this network matches the original LeNet-5 architecture. Let's see what happens inside the network. By passing a single-channel (black-and-white) 28×28 image through the network and printing the output shape at each layer, we can inspect the model to ensure that its operations line up with what we expect from :numref:img_lenet.

julia
function layer_summary(model::AbstractClassifier, X_shape)
    X = randn(X_shape)
    for layer in model.net.layers 
        X = layer(X)
        println(typeof(layer).name.wrapper, " output shape :\t", size(X))
    end
end
model = LeNet()
layer_summary(model, (28,28,1,1))
Conv output shape :	(28, 28, 6, 1)
MeanPool output shape :	(14, 14, 6, 1)
Conv output shape :	(10, 10, 16, 1)
MeanPool output shape :	(5, 5, 16, 1)
typeof(Flux.flatten) output shape :	(400, 1)
Dense output shape :	(120, 1)
Dense output shape :	(84, 1)
Dense output shape :	(10, 1)
typeof(softmax) output shape :	(10, 1)

Note that the height and width of the representation at each layer throughout the convolutional block is reduced (compared with the previous layer). The first convolutional layer uses two pixels of padding to compensate for the reduction in height and width that would otherwise result from using a 5×5 kernel. As an aside, the image size of 28×28 pixels in the original MNIST OCR dataset is a result of trimming two pixel rows (and columns) from the original scans that measured 32×32 pixels. This was done primarily to save space (a 30% reduction) at a time when megabytes mattered.

In contrast, the second convolutional layer forgoes padding, and thus the height and width are both reduced by four pixels. As we go up the stack of layers, the number of channels increases layer-over-layer from 1 in the input to 6 after the first convolutional layer and 16 after the second convolutional layer. Meanwhile, each pooling layer halves the height and width. Finally, each fully connected layer reduces dimensionality, eventually emitting an output whose dimension matches the number of classes.
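
The spatial arithmetic behind these numbers can be checked directly; the helper functions below are ours, introduced only for illustration.

julia
# Sketch: a k×k convolution with padding p maps n -> n + 2p - k + 1,
# and a 2×2 pooling with stride 2 maps n -> n ÷ 2.
conv_out(n; k = 5, p = 0) = n + 2p - k + 1
pool_out(n) = n ÷ 2
n1 = pool_out(conv_out(28, p = 2))   # 28 -> 28 -> 14 after the first conv + pool
n2 = pool_out(conv_out(n1))          # 14 -> 10 -> 5 after the second conv + pool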

Training

Now that we have implemented the model, let's run an experiment to see how the LeNet-5 model fares on Fashion-MNIST.

While CNNs have fewer parameters, they can still be more expensive to compute than similarly deep MLPs because each parameter participates in many more multiplications. If you have access to a GPU, this might be a good time to put it into action to speed up training. Note that the d2lai.Trainer class takes care of all details. By default, it initializes the model parameters on the available devices. Just as with MLPs, our loss function is cross-entropy, and we minimize it via minibatch stochastic gradient descent.
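
For readers who want to see the moving parts that the Trainer hides, here is a minimal hand-written Flux loop for the same objective. This is a sketch only, not the d2lai.Trainer implementation; it assumes the data iterator yields (x, y) batches with x of shape (28, 28, 1, B) and one-hot labels y of shape (10, B), and that the model exposes its Chain as model.net as defined above.

julia
# A minimal sketch of one training epoch in plain Flux (not the d2lai.Trainer internals).
function train_epoch!(model, train_batches, opt_state)
    for (x, y) in train_batches
        loss, grads = Flux.withgradient(model) do m
            # the network already ends in softmax, so plain cross-entropy applies
            Flux.crossentropy(m.net(x), y)
        end
        Flux.update!(opt_state, model, grads[1])
    end
end

# opt_state = Flux.setup(Descent(0.1), model)   # then call train_epoch! once per epoch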

julia
data = d2lai.FashionMNISTData(batchsize = 128)
opt = Descent(0.1)
trainer = Trainer(model, data, opt; max_epochs = 10, gpu = true, board_yscale = :identity)
t = d2lai.fit(trainer)
┌ Info: Train Loss: 2.2856421, Val Loss: 2.330483, Val Acc: 0.0
└ @ d2lai c:\Users\abhar\Personal\d2l-julia\d2lai\src\train.jl:49
┌ Info: Train Loss: 2.2564824, Val Loss: 2.232867, Val Acc: 0.1875
└ @ d2lai c:\Users\abhar\Personal\d2l-julia\d2lai\src\train.jl:49
┌ Info: Train Loss: 2.2793787, Val Loss: 2.2695622, Val Acc: 0.0625
└ @ d2lai c:\Users\abhar\Personal\d2l-julia\d2lai\src\train.jl:49
┌ Info: Train Loss: 1.4100746, Val Loss: 1.1644549, Val Acc: 0.75
└ @ d2lai c:\Users\abhar\Personal\d2l-julia\d2lai\src\train.jl:49
┌ Info: Train Loss: 1.0515851, Val Loss: 0.785407, Val Acc: 0.6875
└ @ d2lai c:\Users\abhar\Personal\d2l-julia\d2lai\src\train.jl:49
┌ Info: Train Loss: 0.85749763, Val Loss: 0.6361971, Val Acc: 0.8125
└ @ d2lai c:\Users\abhar\Personal\d2l-julia\d2lai\src\train.jl:49
┌ Info: Train Loss: 0.81882715, Val Loss: 0.56473196, Val Acc: 0.875
└ @ d2lai c:\Users\abhar\Personal\d2l-julia\d2lai\src\train.jl:49
┌ Info: Train Loss: 0.6714182, Val Loss: 0.4811849, Val Acc: 0.8125
└ @ d2lai c:\Users\abhar\Personal\d2l-julia\d2lai\src\train.jl:49
┌ Info: Train Loss: 0.629397, Val Loss: 0.4333808, Val Acc: 0.9375
└ @ d2lai c:\Users\abhar\Personal\d2l-julia\d2lai\src\train.jl:49
┌ Info: Train Loss: 0.67057157, Val Loss: 0.40765715, Val Acc: 0.875
└ @ d2lai c:\Users\abhar\Personal\d2l-julia\d2lai\src\train.jl:49
(LeNet{Chain{Tuple{Conv{2, 4, typeof(σ), Array{Float32, 4}, Vector{Float32}}, MeanPool{2, 4}, Conv{2, 4, typeof(σ), Array{Float32, 4}, Vector{Float32}}, MeanPool{2, 4}, typeof(Flux.flatten), Dense{typeof(σ), Matrix{Float32}, Vector{Float32}}, Dense{typeof(σ), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, typeof(softmax)}}}(Chain(Conv((5, 5), 1 => 6, σ, pad=2), MeanPool((2, 2)), Conv((5, 5), 6 => 16, σ), MeanPool((2, 2)), flatten, Dense(400 => 120, σ), Dense(120 => 84, σ), Dense(84 => 10), softmax)), (val_loss = Float32[0.67991555, 0.74102235, 0.66995776, 0.70574814, 0.88130754, 0.7767141, 0.65413445, 0.794901, 0.77748394, 0.8065926  …  0.72315204, 0.7248938, 0.76078093, 0.7686513, 0.6747246, 0.79584336, 0.74073476, 0.68909705, 0.7938775, 0.40765715], val_acc = [0.765625, 0.71875, 0.734375, 0.6875, 0.703125, 0.75, 0.765625, 0.703125, 0.734375, 0.6953125  …  0.7265625, 0.71875, 0.6796875, 0.65625, 0.734375, 0.7109375, 0.75, 0.78125, 0.7265625, 0.875]))

Summary

We have made significant progress in this chapter. We moved from the MLPs of the 1980s to the CNNs of the 1990s and early 2000s. The architectures proposed, e.g., in the form of LeNet-5 remain meaningful, even to this day. It is worth comparing the error rates on Fashion-MNIST achievable with LeNet-5 both to the very best possible with MLPs (:numref:sec_mlp-implementation) and those with significantly more advanced architectures such as ResNet (:numref:sec_resnet). LeNet is much more similar to the latter than to the former. One of the primary differences, as we shall see, is that greater amounts of computation enabled significantly more complex architectures.

A second difference is the relative ease with which we were able to implement LeNet. What used to be an engineering challenge worth months of C++ and assembly code, engineering to improve SN, an early Lisp-based deep learning tool [17], and finally experimentation with models can now be accomplished in minutes. It is this incredible productivity boost that has democratized deep learning model development tremendously. In the next chapter, we will journey down this rabbit hole to see where it takes us.

Exercises

  1. Let's modernize LeNet. Implement and test the following changes:
     1. Replace average pooling with max-pooling.
     2. Replace the softmax layer with ReLU.

  2. Try to change the size of the LeNet-style network to improve its accuracy in addition to max-pooling and ReLU:
     1. Adjust the convolution window size.
     2. Adjust the number of output channels.
     3. Adjust the number of convolution layers.
     4. Adjust the number of fully connected layers.
     5. Adjust the learning rates and other training details (e.g., initialization and number of epochs).

  3. Try out the improved network on the original MNIST dataset.

  4. Display the activations of the first and second layer of LeNet for different inputs (e.g., sweaters and coats); a sketch appears at the end of this section.

  5. What happens to the activations when you feed significantly different images into the network (e.g., cats, cars, or even random noise)?

Exercise 1: modernize LeNet by replacing average pooling with max-pooling and the softmax layer with ReLU.

julia
struct ModernLeNet{N} <: AbstractClassifier
    net::N
end

function ModernLeNet(; num_classes = 10, init = Flux.glorot_normal)
    net = Chain(
        Conv((5,5), 1 => 6, relu; pad = 2, init = init),   # ReLU instead of sigmoid
        MaxPool((2,2), stride = 2),                        # max-pooling instead of average pooling
        Conv((5,5), 6 => 16, relu, init = init),
        MaxPool((2,2), stride = 2),
        Flux.flatten,
        Dense(400 => 120, relu, init = init),
        Dense(120 => 84, relu, init = init),
        Dense(84 => num_classes, init = init),
        Flux.softmax
    ) |> f64
    ModernLeNet(net)
end

Flux.@layer ModernLeNet
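
As a quick check (not in the original), the layer_summary helper defined earlier applies unchanged to the modernized model; the shapes should match the LeNet table above.

julia
# Sketch: confirm that the modernized model produces the same shapes as LeNet.
layer_summary(ModernLeNet(), (28, 28, 1, 1))
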
julia
struct LeNetCustom{N} <: AbstractClassifier
    net::N
end

function LeNetCustom(; num_classes = 10, init = Flux.glorot_normal, kernel_size = (5,5), num_conv = 1, channels = (6, 16))
    @assert length(channels) == num_conv + 1
    # @autosize infers the input size of the first Dense layer from a dummy batch.
    net = Flux.@autosize (28, 28, 1, 128) Chain(
        Conv(kernel_size, 1 => channels[1], relu; pad = 2, init = init),
        MaxPool((2,2), stride = 2),
        [Conv(kernel_size, channels[i] => channels[i+1], relu, init = init) for i in 1:num_conv]...,
        MaxPool((2,2), stride = 2),
        Flux.flatten,
        Dense(_ => 120, relu, init = init),
        Dense(120 => 84, relu, init = init),
        Dense(84 => num_classes, init = init),
        Flux.softmax
    )
    LeNetCustom(f64(net))
end

Flux.@layer LeNetCustom
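
A quick sanity check of the configurable constructor (a sketch; the argument values here are arbitrary):

julia
# Sketch: build a two-convolution variant with a 3×3 kernel and inspect its shapes.
custom = LeNetCustom(kernel_size = (3,3), num_conv = 2, channels = (6, 12, 16))
layer_summary(custom, (28, 28, 1, 1))
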
julia
# kernel_size, channel, num_conv, lr, epochs,
test_hyperparams = ([(3,3), (5,5), (7,7)], [8, 16, 24], [1, 2], [0.1, 0.05], [10, 15, 20])

hyperparameter_combinations = Iterators.product(test_hyperparams...) |> collect |> vec
108-element Vector{Tuple{Tuple{Int64, Int64}, Int64, Int64, Float64, Int64}}:
 ((3, 3), 8, 1, 0.1, 10)
 ((5, 5), 8, 1, 0.1, 10)
 ((7, 7), 8, 1, 0.1, 10)
 ((3, 3), 16, 1, 0.1, 10)
 ((5, 5), 16, 1, 0.1, 10)
 ((7, 7), 16, 1, 0.1, 10)
 ((3, 3), 24, 1, 0.1, 10)
 ((5, 5), 24, 1, 0.1, 10)
 ((7, 7), 24, 1, 0.1, 10)
 ((3, 3), 8, 2, 0.1, 10)

 ((3, 3), 8, 2, 0.05, 20)
 ((5, 5), 8, 2, 0.05, 20)
 ((7, 7), 8, 2, 0.05, 20)
 ((3, 3), 16, 2, 0.05, 20)
 ((5, 5), 16, 2, 0.05, 20)
 ((7, 7), 16, 2, 0.05, 20)
 ((3, 3), 24, 2, 0.05, 20)
 ((5, 5), 24, 2, 0.05, 20)
 ((7, 7), 24, 2, 0.05, 20)
julia
results = []
data = d2lai.FashionMNISTData(batchsize = 128)

for (i, combination) in enumerate(hyperparameter_combinations) 
    println("--------------COMBINATION $i -----------")
    try
        kernel_size = combination[1]
        num_conv = combination[3]
        channels = collect((6, collect(combination[2] for i in 1:num_conv)...))
        trial_model = LeNetCustom(; kernel_size, num_conv, channels = channels)
        trial_opt = Descent(combination[4])
        trial_trainer = Trainer(trial_model, data, trial_opt; max_epochs = combination[5], gpu = true, board_yscale = :identity, verbose = false)
        t = d2lai.fit(trial_trainer) 
        push!(results, t)
    catch e
        println("Combination $i failed with error: ", e)
        CUDA.reclaim()
        continue
    end
    CUDA.reclaim()
end
--------------COMBINATION 1 -----------
--------------COMBINATION 2 -----------
--------------COMBINATION 3 -----------
⋮
--------------COMBINATION 108 -----------
julia
# Indices of runs whose last recorded validation-batch accuracy reached 1.0
findall(isone, getindex.(getindex.(results, 2), :val_acc) .|> last)
25-element Vector{Int64}:
  6
 11
 13
 14
 17
 19
 21
 24
 36
 37

 58
 63
 66
 69
 72
 75
 84
 85
 90
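
To see which settings those indices correspond to, we can map them back onto the grid. Note that this sketch assumes every combination trained successfully; if any run hit the catch branch above, results would be shorter than hyperparameter_combinations and the indices would no longer line up.

julia
# Sketch: map winning result indices back to their hyperparameter settings
# (assumes `results` and `hyperparameter_combinations` stayed aligned, i.e. no run failed).
best = findall(isone, getindex.(getindex.(results, 2), :val_acc) .|> last)
for i in best
    kernel, channels, num_conv, lr, epochs = hyperparameter_combinations[i]
    println("kernel = $kernel, channels = $channels, convs = $num_conv, lr = $lr, epochs = $epochs")
end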
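
For the exercises on inspecting activations, a possible starting point is to slice the trained Chain and look at the intermediate feature maps. The sketch below uses a random array as a stand-in for an image; replace it with real Fashion-MNIST examples (e.g., sweaters and coats) to compare activations.

julia
# Sketch: feature maps after the first and second convolutional layers.
# Layer indices follow the LeNet definition above: 1 = first Conv, 3 = second Conv.
x = randn(28, 28, 1, 1)            # stand-in for one 28×28 grayscale image
first_acts  = model.net[1:1](x)    # (28, 28, 6, 1)
second_acts = model.net[1:3](x)    # (10, 10, 16, 1)
size(first_acts), size(second_acts)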