Skip to content

Network in Network (NiN)

LeNet, AlexNet, and VGG all share a common design pattern: extract features exploiting spatial structure via a sequence of convolutions and pooling layers and post-process the representations via fully connected layers. The improvements upon LeNet by AlexNet and VGG mainly lie in how these later networks widen and deepen these two modules.

This design poses two major challenges. First, the fully connected layers at the end of the architecture consume tremendous numbers of parameters. For instance, even a simple model such as VGG-11 requires a monstrous matrix, occupying almost 400MB of RAM in single precision (FP32). This is a significant impediment to computation, in particular on mobile and embedded devices. After all, even high-end mobile phones sport no more than 8GB of RAM. At the time VGG was invented, this was an order of magnitude less (the iPhone 4S had 512MB). As such, it would have been difficult to justify spending the majority of memory on an image classifier.

Second, it is equally impossible to add fully connected layers earlier in the network to increase the degree of nonlinearity: doing so would destroy the spatial structure and require potentially even more memory.

The network in network (NiN) blocks [78] offer an alternative, capable of solving both problems in one simple strategy. They were proposed based on a very simple insight: (i) use 1×1 convolutions to add local nonlinearities across the channel activations and (ii) use global average pooling to integrate across all locations in the last representation layer. Note that global average pooling would not be effective, were it not for the added nonlinearities. Let's dive into this in detail.

julia
using Pkg; Pkg.activate("../../d2lai")
using d2lai
using Flux 
using CUDA, cuDNN
  Activating project at `/workspace/d2l-julia/d2lai`

NiN Blocks

Recall :numref:subsec_1x1. In it we said that the inputs and outputs of convolutional layers consist of four-dimensional tensors with axes corresponding to the example, channel, height, and width. Also recall that the inputs and outputs of fully connected layers are typically two-dimensional tensors corresponding to the example and feature. The idea behind NiN is to apply a fully connected layer at each pixel location (for each height and width). The resulting 1×1 convolution can be thought of as a fully connected layer acting independently on each pixel location.

Figure illustrates the main structural differences between VGG and NiN, and their blocks. Note both the difference in the NiN blocks (the initial convolution is followed by 1×1 convolutions, whereas VGG retains 3×3 convolutions) and at the end where we no longer require a giant fully connected layer.

:width:600px 🏷️fig_nin

julia
function NiN_block(in_channels, out_channels, kernel_size; pad = 0, stride = 1)
    Chain(
        Conv((kernel_size, kernel_size), in_channels => out_channels, relu, stride = stride, pad = pad),
        Conv((1,1), out_channels => out_channels, relu)
    )
end
NiN_block (generic function with 3 methods)

[NiN Model]

NiN uses the same initial convolution sizes as AlexNet (it was proposed shortly thereafter). The kernel sizes are 11×11, 5×5, and 3×3, respectively, and the numbers of output channels match those of AlexNet. Each NiN block is followed by a max-pooling layer with a stride of 2 and a window shape of 3×3.

The second significant difference between NiN and both AlexNet and VGG is that NiN avoids fully connected layers altogether. Instead, NiN uses a NiN block with a number of output channels equal to the number of label classes, followed by a global average pooling layer, yielding a vector of logits. This design significantly reduces the number of required model parameters, albeit at the expense of a potential increase in training time.

julia
struct NiN{N} <: AbstractClassifier 
    net::N
end

function NiN(num_classes::Int = 10)
    net = Chain(
        NiN_block(1, 96, 11; pad = 0, stride = 4),
        MaxPool((3,3), stride = 2),
        
        NiN_block(96, 256, 5; pad = 2, stride = 1),
        MaxPool((3,3), stride = 2),
        
        NiN_block(256, 384, 3; pad = 1, stride = 1),
        MaxPool((3,3), stride = 2),

        Dropout(0.5),

        NiN_block(384, num_classes, 3; pad = 1, stride = 1),

        GlobalMeanPool(),

        Flux.flatten,

        softmax
        
    )
    NiN(net)
    
end
Flux.@layer NiN

(n::NiN)(x) = n.net(x)

We create a data example to see [the output shape of each block].

julia
Flux.outputsize(NiN().net, (224, 224, 1, 1))
(10, 1)

[Training]

As before we use Fashion-MNIST to train the model using the same optimizer that we used for AlexNet and VGG.

julia
model = NiN()
data = d2lai.FashionMNISTData(batchsize = 128, resize = (224, 224))
opt = Descent(0.05)
trainer = Trainer(model, data, opt; max_epochs = 10, gpu = true, board_yscale = :identity)
d2lai.fit(trainer);
    [ Info: Train Loss: 1.1040267, Val Loss: 0.9623538, Val Acc: 0.6875
    [ Info: Train Loss: 0.48630658, Val Loss: 0.41931215, Val Acc: 0.9375
    [ Info: Train Loss: 0.49091217, Val Loss: 0.44845864, Val Acc: 0.75
    [ Info: Train Loss: 0.51929885, Val Loss: 0.3203296, Val Acc: 0.9375
    [ Info: Train Loss: 0.3298808, Val Loss: 0.28225675, Val Acc: 0.9375
    [ Info: Train Loss: 0.39819065, Val Loss: 0.43181264, Val Acc: 0.875
    [ Info: Train Loss: 0.4049001, Val Loss: 0.39596462, Val Acc: 0.8125
    [ Info: Train Loss: 0.39740062, Val Loss: 0.36362204, Val Acc: 0.9375
    [ Info: Train Loss: 0.203655, Val Loss: 0.29819828, Val Acc: 0.9375
    [ Info: Train Loss: 0.23114383, Val Loss: 0.23663433, Val Acc: 0.9375

Exercises

  1. Why are there two 1×1 convolutional layers per NiN block? Increase their number to three. Reduce their number to one. What changes?

  2. What changes if you replace the 1×1 convolutions by 3×3 convolutions?

  3. What happens if you replace the global average pooling by a fully connected layer (speed, accuracy, number of parameters)?

  4. Calculate the resource usage for NiN.

  5. What is the number of parameters?

  6. What is the amount of computation?

  7. What is the amount of memory needed during training?

  8. What is the amount of memory needed during prediction?

  9. What are possible problems with reducing the 384×5×5 representation to a 10×5×5 representation in one step?

  10. Use the structural design decisions in VGG that led to VGG-11, VGG-16, and VGG-19 to design a family of NiN-like networks.

julia