Designing Convolution Network Architectures

The previous sections have taken us on a tour of modern network design for computer vision. Common to all the work we covered was that it greatly relied on the intuition of scientists. Many of the architectures are heavily informed by human creativity and to a much lesser extent by systematic exploration of the design space that deep networks offer. Nonetheless, this network engineering approach has been tremendously successful.

Ever since AlexNet (:numref:sec_alexnet) beat conventional computer vision models on ImageNet, it has become popular to construct very deep networks by stacking blocks of convolutions, all designed according to the same pattern. In particular, 3×3 convolutions were popularized by VGG networks (:numref:sec_vgg). NiN (:numref:sec_nin) showed that even 1×1 convolutions could be beneficial by adding local nonlinearities. Moreover, NiN solved the problem of aggregating information at the head of a network by aggregating across all locations. GoogLeNet (:numref:sec_googlenet) added multiple branches of different convolution width, combining the advantages of VGG and NiN in its Inception block. ResNets (:numref:sec_resnet) changed the inductive bias towards the identity mapping (from f(x)=0). This allowed for very deep networks. Almost a decade later, the ResNet design is still popular, a testament to its design. Lastly, ResNeXt (:numref:subsec_resnext) added grouped convolutions, offering a better trade-off between parameters and computation. A precursor to Transformers for vision, the Squeeze-and-Excitation Networks (SENets) allow for efficient information transfer between locations [148]. This was accomplished by computing a per-channel global attention function.

Up to now we have omitted networks obtained via neural architecture search (NAS) [149], [150]. We chose to do so since their cost is usually enormous, relying on brute-force search, genetic algorithms, reinforcement learning, or some other form of hyperparameter optimization. Given a fixed search space, NAS uses a search strategy to automatically select an architecture based on the returned performance estimation. The outcome of NAS is a single network instance. EfficientNets are a notable outcome of this search [151].

In the following we discuss an idea that is quite different from the quest for the single best network. It is computationally relatively inexpensive, it leads to scientific insights along the way, and it is quite effective in terms of the quality of outcomes. Let's review the strategy by Radosavovic et al. [97] to design network design spaces. The strategy combines the strengths of manual design and NAS. It accomplishes this by operating on distributions of networks and optimizing the distributions so as to obtain good performance for entire families of networks. The outcome is the RegNet family, specifically RegNetX and RegNetY, plus a range of guiding principles for the design of performant CNNs.

julia
using Pkg; Pkg.activate("../../d2lai")
using d2lai
using Flux 
using CUDA, cuDNN
  Activating project at `/workspace/d2l-julia/d2lai`

The AnyNet Design Space

The description below closely follows the reasoning in Radosavovic et al. [97] with some abbreviations to make it fit in the scope of the book. To begin, we need a template for the family of networks to explore. One of the commonalities of the designs in this chapter is that the networks consist of a stem, a body and a head. The stem performs initial image processing, often through convolutions with a larger window size. The body consists of multiple blocks, carrying out the bulk of the transformations needed to go from raw images to object representations. Lastly, the head converts this into the desired outputs, such as via a softmax regressor for multiclass classification. The body, in turn, consists of multiple stages, operating on the image at decreasing resolutions. In fact, both the stem and each subsequent stage quarter the spatial resolution. Lastly, each stage consists of one or more blocks. This pattern is common to all networks, from VGG to ResNeXt. Indeed, for the design of generic AnyNet networks, Radosavovic et al. [97] used the ResNeXt block of Figure.

The AnyNet design space. The numbers $(c, r)$ along each arrow indicate the number of channels $c$ and the resolution $r \times r$ of the images at that point. From left to right: generic network structure composed of stem, body, and head; body composed of four stages; detailed structure of a stage; two alternative structures for blocks, one without downsampling and one that halves the resolution in each dimension. Design choices include depth $d_i$, the number of output channels $c_i$, the number of groups $g_i$, and bottleneck ratio $k_i$ for any stage $i$.

Let's review the structure outlined in Figure in detail. As mentioned, an AnyNet consists of a stem, body, and head. The stem takes as its input RGB images (3 channels), using a $3 \times 3$ convolution with a stride of 2, followed by a batch norm, to halve the resolution from $r \times r$ to $r/2 \times r/2$. Moreover, it generates $c_0$ channels that serve as input to the body.

Since the network is designed to work well with ImageNet images of shape $224 \times 224 \times 3$, the body serves to reduce this to $7 \times 7 \times c_4$ through 4 stages (recall that $224 / 2^{1+4} = 7$), each with an eventual stride of 2. Lastly, the head employs an entirely standard design via global average pooling, similar to NiN (:numref:sec_nin), followed by a fully connected layer to emit an $n$-dimensional vector for $n$-class classification.
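The resolution arithmetic above can be checked with a few lines of plain Julia. This is only a sketch: the function name `resolution_trace` and its keyword `num_stages` are hypothetical and not part of `d2lai` or Flux, and it assumes that the stem and every stage halve the side length exactly.

```julia
# Side length of the feature map after the stem and after each body stage,
# assuming each of them halves the resolution (stride 2).
# `resolution_trace` is a hypothetical helper used only for illustration.
function resolution_trace(input_size::Int; num_stages::Int=4)
    sizes = [input_size]
    for _ in 1:(1 + num_stages)      # one stem plus `num_stages` stages
        push!(sizes, sizes[end] ÷ 2) # integer division: halve the side length
    end
    return sizes
end

trace = resolution_trace(224)
println(trace)                       # [224, 112, 56, 28, 14, 7]
println(trace[end] == 224 ÷ 2^(1 + 4))  # true
```

The first entry is the input size, the second is the output of the stem, and the remaining four are the outputs of the stages.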

Most of the relevant design decisions are inherent to the body of the network. It proceeds in stages, where each stage is composed of the same type of ResNeXt blocks as we discussed in :numref:subsec_resnext. The design there is again entirely generic: we begin with a block that halves the resolution by using a stride of 2 (the rightmost in Figure). To match this, the residual branch of the ResNeXt block needs to pass through a $1 \times 1$ convolution. This block is followed by a variable number of additional ResNeXt blocks that leave both resolution and the number of channels unchanged. Note that a common design practice is to add a slight bottleneck in the design of convolutional blocks. As such, with bottleneck ratio $k_i \geq 1$ we afford some number of channels, $c_i/k_i$, within each block for stage $i$ (as the experiments show, this is not really effective and should be skipped). Lastly, since we are dealing with ResNeXt blocks, we also need to pick the number of groups $g_i$ for grouped convolutions at stage $i$.
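To make the layout of a single stage concrete, the following sketch lists the blocks of a stage as plain named tuples, without building any layers. The helper `stage_plan` and its field names are hypothetical and are not the book's implementation. The sketch assumes that $k_i$ divides $c_i$, and that the number of groups divides the bottleneck width $c_i/k_i$, since a grouped convolution needs its channel count to be divisible by the number of groups.

```julia
# Hypothetical helper: describe the blocks of one AnyNet stage.
# c_in  : channels entering the stage
# c_out : channels c_i produced by the stage
# depth : number of blocks d_i
# k     : bottleneck ratio k_i (k = 1 means no bottleneck)
# g     : number of groups g_i of the grouped convolution
function stage_plan(c_in::Int, c_out::Int, depth::Int, k::Int, g::Int)
    depth >= 1 || throw(ArgumentError("depth must be at least 1"))
    c_out % k == 0 || throw(ArgumentError("k must divide c_out"))
    c_mid = c_out ÷ k                  # bottleneck width c_i / k_i
    c_mid % g == 0 || throw(ArgumentError("g must divide c_out ÷ k"))
    # Only the first block downsamples and changes the channel count, so only
    # it needs a 1x1 convolution on the residual branch.
    return [(in_channels = (j == 1 ? c_in : c_out),
             out_channels = c_out,
             mid_channels = c_mid,
             groups = g,
             stride = (j == 1 ? 2 : 1),
             use_1x1_shortcut = (j == 1)) for j in 1:depth]
end

# Illustrative values only: 32 -> 64 channels, 3 blocks, no bottleneck, 16 groups.
plan = stage_plan(32, 64, 3, 1, 16)
for block in plan
    println(block)
end
```

In this example the first block maps 32 channels to 64 with stride 2, and the remaining two blocks keep both the resolution and the 64 channels unchanged.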

This seemingly generic design space provides us nonetheless with many parameters: we can set the block width (number of channels) $c_0, \ldots, c_4$, the depth (number of blocks) per stage $d_1, \ldots, d_4$, the bottleneck ratios $k_1, \ldots, k_4$, and the group widths (numbers of groups) $g_1, \ldots, g_4$. In total this adds up to 17 parameters, resulting in an unreasonably large number of configurations that would warrant exploring. We need some tools to reduce this huge design space effectively. This is where the conceptual beauty of design spaces comes in. Before we do so, let's implement the generic design first.

julia