Designing Convolution Network Architectures ​
The previous sections have taken us on a tour of modern network design for computer vision. Common to all the work we covered was that it greatly relied on the intuition of scientists. Many of the architectures are heavily informed by human creativity and to a much lesser extent by systematic exploration of the design space that deep networks offer. Nonetheless, this network engineering approach has been tremendously successful.
Ever since AlexNet (:numref:sec_alexnet) beat conventional computer vision models on ImageNet, it has become popular to construct very deep networks by stacking blocks of convolutions, all designed according to the same pattern. In particular, sec_vgg). NiN (:numref:sec_nin) showed that even sec_googlenet) added multiple branches of different convolution width, combining the advantages of VGG and NiN in its Inception block. ResNets (:numref:sec_resnet) changed the inductive bias towards the identity mapping (from subsec_resnext) added grouped convolutions, offering a better trade-off between parameters and computation. A precursor to Transformers for vision, the Squeeze-and-Excitation Networks (SENets) allow for efficient information transfer between locations [148]. This was accomplished by computing a per-channel global attention function.
Up to now we have omitted networks obtained via neural architecture search (NAS) [149], [150]. We chose to do so since their cost is usually enormous, relying on brute-force search, genetic algorithms, reinforcement learning, or some other form of hyperparameter optimization. Given a fixed search space, NAS uses a search strategy to automatically select an architecture based on the returned performance estimation. The outcome of NAS is a single network instance. EfficientNets are a notable outcome of this search [151].
In the following we discuss an idea that is quite different to the quest for the single best network. It is computationally relatively inexpensive, it leads to scientific insights on the way, and it is quite effective in terms of the quality of outcomes. Let's review the strategy by Radosavovic et al. [97] to design network design spaces. The strategy combines the strength of manual design and NAS. It accomplishes this by operating on distributions of networks and optimizing the distributions in a way to obtain good performance for entire families of networks. The outcome of it are RegNets, specifically RegNetX and RegNetY, plus a range of guiding principles for the design of performant CNNs.
using Pkg; Pkg.activate("../../d2lai")
using d2lai
using Flux
using CUDA, cuDNN Activating project at `/workspace/d2l-julia/d2lai`The AnyNet Design Space ​
The description below closely follows the reasoning in Radosavovic et al. [97] with some abbreviations to make it fit in the scope of the book. To begin, we need a template for the family of networks to explore. One of the commonalities of the designs in this chapter is that the networks consist of a stem, a body and a head. The stem performs initial image processing, often through convolutions with a larger window size. The body consists of multiple blocks, carrying out the bulk of the transformations needed to go from raw images to object representations. Lastly, the head converts this into the desired outputs, such as via a softmax regressor for multiclass classification. The body, in turn, consists of multiple stages, operating on the image at decreasing resolutions. In fact, both the stem and each subsequent stage quarter the spatial resolution. Lastly, each stage consists of one or more blocks. This pattern is common to all networks, from VGG to ResNeXt. Indeed, for the design of generic AnyNet networks, Radosavovic et al. [97] used the ResNeXt block of Figure.
The AnyNet design space. The numbers
Let's review the structure outlined in Figure in detail. As mentioned, an AnyNet consists of a stem, body, and head. The stem takes as its input RGB images (3 channels), using a
Since the network is designed to work well with ImageNet images of shape sec_nin), followed by a fully connected layer to emit an
Most of the relevant design decisions are inherent to the body of the network. It proceeds in stages, where each stage is composed of the same type of ResNeXt blocks as we discussed in :numref:subsec_resnext. The design there is again entirely generic: we begin with a block that halves the resolution by using a stride of
This seemingly generic design space provides us nonetheless with many parameters: we can set the block width (number of channels)