Pooling

In many cases our ultimate task asks some global question about the image, e.g., does it contain a cat? Consequently, the units of our final layer should be sensitive to the entire input. By gradually aggregating information, yielding coarser and coarser maps, we accomplish this goal of ultimately learning a global representation, while keeping all of the advantages of convolutional layers at the intermediate layers of processing. The deeper we go in the network, the larger the receptive field (relative to the input) to which each hidden node is sensitive. Reducing spatial resolution accelerates this process, since the convolution kernels cover a larger effective area.

Moreover, when detecting lower-level features, such as edges (as discussed in :numref:sec_conv_layer), we often want our representations to be somewhat invariant to translation. For instance, if we take the image X with a sharp delineation between black and white and shift the whole image by one pixel to the right, i.e., Z[i, j] = X[i, j + 1], then the output for the new image Z might be vastly different. The edge will have shifted by one pixel. In reality, objects hardly ever occur exactly at the same place. In fact, even with a tripod and a stationary object, vibration of the camera due to the movement of the shutter might shift everything by a pixel or so (high-end cameras are loaded with special features to address this problem).

This section introduces pooling layers, which serve the dual purposes of mitigating the sensitivity of convolutional layers to location and of spatially downsampling representations.

julia

using Pkg; Pkg.activate("d2lai")
using d2lai, Flux, Statistics

  Activating project at `~/Projects/D2L/d2lai`

Maximum Pooling and Average Pooling

Like convolutional layers, pooling operators consist of a fixed-shape window that is slid over all regions in the input according to its stride, computing a single output for each location traversed by the fixed-shape window (sometimes known as the pooling window). However, unlike the cross-correlation computation of the inputs and kernels in the convolutional layer, the pooling layer contains no parameters (there is no kernel). Instead, pooling operators are deterministic, typically calculating either the maximum or the average value of the elements in the pooling window. These operations are called maximum pooling (max-pooling for short) and average pooling, respectively.

Average pooling is essentially as old as CNNs. The idea is akin to downsampling an image. Rather than just taking the value of every second (or third) pixel for the lower resolution image, we can average over adjacent pixels to obtain an image with a better signal-to-noise ratio, since we are combining the information from multiple adjacent pixels. Max-pooling was introduced in Riesenhuber and Poggio [81] in the context of cognitive neuroscience to describe how information might be aggregated hierarchically for the purpose of object recognition; there already was an earlier version in speech recognition [82]. In almost all cases, max-pooling, as it is also referred to, is preferable to average pooling.
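The signal-to-noise argument can be checked numerically. The sketch below is an illustration, not part of the original text: it assumes a constant image corrupted by independent Gaussian noise of unit variance, and the names `noisy`, `sub`, and `avg` are made up for this example. It compares keeping every second pixel with averaging each non-overlapping $2 \times 2$ block.

```julia
using Random, Statistics

Random.seed!(0)                       # fixed seed so the run is repeatable
signal = fill(5.0, 512, 512)          # assumed constant "image"
noisy = signal .+ randn(512, 512)     # independent unit-variance noise

# Downsample by keeping every second pixel in each direction.
sub = noisy[1:2:end, 1:2:end]

# Downsample by averaging each non-overlapping 2x2 block.
avg = [mean(noisy[i:i+1, j:j+1]) for i in 1:2:511, j in 1:2:511]

(var(sub), var(avg))
```

Under these assumptions the first variance should come out close to 1 and the second close to 0.25, since averaging four independent noise samples divides the noise variance by four while leaving the signal unchanged.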

In both cases, as with the cross-correlation operator, we can think of the pooling window as starting from the upper-left of the input tensor and sliding across it from left to right and top to bottom. At each location that the pooling window hits, it computes the maximum or average value of the input subtensor in the window, depending on whether max or average pooling is employed.

Max-pooling with a pooling window shape of $2 \times 2$. The shaded portions are the first output element as well as the input tensor elements used for the output computation: $\max(0, 1, 3, 4) = 4$.

The output tensor in the figure has a height of 2 and a width of 2. The four elements are derived from the maximum value in each pooling window:

$$\max(0, 1, 3, 4) = 4, \quad \max(1, 2, 4, 5) = 5, \quad \max(3, 4, 6, 7) = 7, \quad \max(4, 5, 7, 8) = 8.$$

More generally, we can define a $p \times q$ pooling layer by aggregating over a region of said size. Returning to the problem of edge detection, we use the output of the convolutional layer as input for $2 \times 2$ max-pooling. Denote by X the input of the convolutional layer and by Y the pooling layer output. Regardless of whether or not the values of X[i, j], X[i, j + 1], X[i+1, j] and X[i+1, j + 1] are different, the pooling layer always outputs Y[i, j] = 1. That is to say, using the $2 \times 2$ max-pooling layer, we can still detect the pattern recognized by the convolutional layer if it moves no more than one element in height or width.

In the code below, we implement the forward propagation of the pooling layer in the pool2d function. This function is similar to the corr2d function in :numref:sec_conv_layer. However, no kernel is needed: the output is computed as either the maximum or the average of each region in the input.

julia

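function pool2d(X, pool_size; mode = :max)
    ph, pw = pool_size
    # Output shape for stride 1 and no padding
    Y = zeros(size(X, 1) - ph + 1, size(X, 2) - pw + 1)
    for i in 1:size(Y, 1)
        for j in 1:size(Y, 2)
            # Pooling window whose upper-left corner sits at (i, j)
            window = X[i:(i+ph-1), j:(j+pw-1)]
            if mode == :max
                Y[i, j] = maximum(window)
            elseif mode == :avg
                Y[i, j] = mean(window)
            else
                # Fail loudly instead of silently returning zeros
                throw(ArgumentError("mode must be :max or :avg"))
            end
        end
    end
    Y
end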

pool2d (generic function with 1 method)

We can construct the input tensor X from the figure to validate the output of the two-dimensional max-pooling layer.

julia

X = [0. 1. 2.; 3. 4. 5.; 6. 7. 8.]
pool2d(X, (2,2))

2×2 Matrix{Float64}:
 4.0  5.0
 7.0  8.0

Also, we can experiment with the average pooling layer.

julia

pool2d(X, (2, 2); mode = :avg)

2×2 Matrix{Float64}:
 2.0  3.0
 5.0  6.0
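Returning to the edge-detection argument earlier in this section, the following sketch illustrates the tolerance to a one-pixel shift. It is a minimal example under stated assumptions: the helpers `edge_response` and `maxpool2x2` are hypothetical names defined here so that the block runs on its own, and the edge detector is the horizontal difference corresponding to the kernel [1 -1].

```julia
# Horizontal difference D[i, j] = X[i, j] - X[i, j+1],
# i.e. cross-correlation with the kernel [1 -1].
edge_response(X) = X[:, 1:end-1] .- X[:, 2:end]

# 2x2 max-pooling with stride 1 and no padding.
maxpool2x2(Y) = [maximum(Y[i:i+1, j:j+1]) for i in 1:size(Y, 1)-1, j in 1:size(Y, 2)-1]

X = ones(4, 6); X[:, 4:6] .= 0    # white (1) on the left, black (0) on the right
Z = ones(4, 6); Z[:, 5:6] .= 0    # the same edge, one pixel further right

Yx, Yz = edge_response(X), edge_response(Z)
Px, Pz = maxpool2x2(Yx), maxpool2x2(Yz)

# Column 3 of the raw edge maps differs between the two images ...
(Yx[:, 3], Yz[:, 3])
# ... but column 3 of the pooled maps responds in both.
(Px[:, 3], Pz[:, 3])
```

The raw edge maps fire in column 3 for X and in column 4 for Z, so they disagree at every row. After pooling, column 3 equals 1 for both images, because each pooling window spans two adjacent columns.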

Padding and Stride

As with convolutional layers, pooling layers change the output shape. And as before, we can adjust the operation to achieve a desired output shape by padding the input and adjusting the stride. We can demonstrate the use of padding and strides in pooling layers via the built-in two-dimensional max-pooling layer from the deep learning framework. We first construct an input tensor X whose shape has four dimensions, where the number of examples (batch size) and number of channels are both 1.

julia

X = reshape(0:15, 4, 4, 1, 1)

4×4×1×1 reshape(::UnitRange{Int64}, 4, 4, 1, 1) with eltype Int64:
[:, :, 1, 1] =
 0  4   8  12
 1  5   9  13
 2  6  10  14
 3  7  11  15

Since pooling aggregates information from an area, deep learning frameworks default to matching pooling window sizes and stride. For instance, if we use a pooling window of shape (3, 3) we get a stride shape of (3, 3) by default.

julia

pool2d_layer = MaxPool((3,3))
pool2d_layer(X)

1×1×1×1 Array{Int64, 4}:
[:, :, 1, 1] =
 10

Needless to say, the stride and padding can be manually specified to override framework defaults if required.

julia

pool2d_layer = MaxPool((3,3), pad = 1, stride=2)
pool2d_layer(X)

2×2×1×1 Array{Int64, 4}:
[:, :, 1, 1] =
 5  13
 7  15
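To see what padding and stride do mechanically, here is a plain-Julia sketch of max-pooling on a single matrix. It is an illustration, not the framework's implementation: the name `maxpool2d` is hypothetical, `pad` is taken to mean the number of rows or columns added on each side, and padded cells are assumed to hold the smallest representable value so that they never win the maximum.

```julia
# Max-pooling over a 2-D matrix with per-side padding and a stride.
function maxpool2d(X::AbstractMatrix, window; pad=(0, 0), stride=window)
    kh, kw = window
    ph, pw = pad
    sh, sw = stride
    h, w = size(X)
    # Assumption: padding is filled with typemin so it cannot be the maximum.
    Xp = fill(typemin(eltype(X)), h + 2ph, w + 2pw)
    Xp[ph+1:ph+h, pw+1:pw+w] .= X
    # Number of window positions along each dimension.
    oh = fld(h + 2ph - kh, sh) + 1
    ow = fld(w + 2pw - kw, sw) + 1
    [maximum(Xp[(i-1)*sh+1:(i-1)*sh+kh, (j-1)*sw+1:(j-1)*sw+kw])
     for i in 1:oh, j in 1:ow]
end

X = reshape(0:15, 4, 4)
maxpool2d(X, (3, 3); pad=(1, 1), stride=(2, 2))
```

On this input the sketch yields the same $2 \times 2$ values as the framework output shown above, and leaving `stride` unset makes it default to the window shape.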

Of course, we can specify a rectangular pooling window with arbitrary height and width, as the example below shows.

julia

pool2d_layer = MaxPool((2,3), pad = (0,1), stride=(2,3))
pool2d_layer(X)

2×2×1×1 Array{Int64, 4}:
[:, :, 1, 1] =
 5  13
 7  15
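The output shapes in this section follow the same arithmetic as for convolutions. The helper below is a small sketch with a hypothetical name, `pool_out_size`; it assumes that `pad` counts the rows or columns added on each side of one dimension, which is consistent with the outputs shown above.

```julia
# Output length along one dimension for input length n and window length k.
# Assumption: pad is applied on both sides, so the padded length is n + 2pad.
pool_out_size(n, k; pad=0, stride=k) = fld(n + 2pad - k, stride) + 1

pool_out_size(4, 3)                    # window 3, default stride 3
pool_out_size(4, 3; pad=1, stride=2)   # window 3, pad 1, stride 2
pool_out_size(4, 2; stride=2)          # height of the (2, 3) window example
pool_out_size(4, 3; pad=1, stride=3)   # width of the (2, 3) window example
```

These calls evaluate to 1, 2, 2, and 2, matching the $1 \times 1$ and $2 \times 2$ spatial shapes of the three pooling examples.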

Multiple Channels

When processing multi-channel input data, the pooling layer pools each input channel separately, rather than summing the inputs up over channels as in a convolutional layer. This means that the number of output channels for the pooling layer is the same as the number of input channels. Below, we will concatenate tensors X and X + 1 on the channel dimension to construct an input with two channels.

julia

X = cat(X, X .+ 1, dims = 3)

4×4×2×1 Array{Int64, 4}:
[:, :, 1, 1] =
 0  4   8  12
 1  5   9  13
 2  6  10  14
 3  7  11  15

[:, :, 2, 1] =
 1  5   9  13
 2  6  10  14
 3  7  11  15
 4  8  12  16

julia

pool2d_layer = MaxPool((3,3), pad = 1, stride = 2)
pool2d_layer(X)

2×2×2×1 Array{Int64, 4}:
[:, :, 1, 1] =
 5  13
 7  15

[:, :, 2, 1] =
 6  14
 8  16
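The per-channel behavior can also be written out by hand. The sketch below uses hypothetical helper names (`pool2x2`, `pool_per_channel`) and a simpler configuration than the framework call above: a $2 \times 2$ window with stride 1 and no padding, applied to an array laid out as height × width × channels.

```julia
# 2x2 max-pooling on one channel (stride 1, no padding).
pool2x2(M) = [maximum(M[i:i+1, j:j+1]) for i in 1:size(M, 1)-1, j in 1:size(M, 2)-1]

# Apply a 2-D pooling function to each channel independently and restack.
pool_per_channel(pool, X) = cat((pool(X[:, :, c]) for c in axes(X, 3))...; dims=3)

A = reshape(0:15, 4, 4)
X = cat(A, A .+ 1; dims=3)      # two channels, as in the text
Y = pool_per_channel(pool2x2, X)
size(Y)
```

The result has shape (3, 3, 2): the channel count is unchanged, and because no values are mixed across channels, the second output channel is the first plus one.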

Summary

Pooling is an exceedingly simple operation. It does exactly what its name indicates: it aggregates results over a window of values. All convolution semantics, such as strides and padding, apply in the same way as they did previously. Note that pooling is indifferent to channels, i.e., it leaves the number of channels unchanged and it applies to each channel separately. Lastly, of the two popular pooling choices, max-pooling is preferable to average pooling, as it confers some degree of invariance to output. A popular choice is to pick a pooling window size of $2 \times 2$ to quarter the spatial resolution of output.

Note that there are many more ways of reducing resolution beyond pooling. For instance, in stochastic pooling (Zeiler and Fergus, 2013) and fractional max-pooling (Graham, 2014), aggregation is combined with randomization. This can slightly improve the accuracy in some cases. Lastly, as we will see later with the attention mechanism, there are more refined ways of aggregating over outputs, e.g., by using the alignment between a query and representation vectors.

Exercises

Implement average pooling through a convolution.
Prove that max-pooling cannot be implemented through a convolution alone.
Max-pooling can be accomplished using ReLU operations, i.e., $\mathrm{ReLU}(x) = \max(0, x)$.
1. Express $\max(a, b)$ by using only ReLU operations.
2. Use this to implement max-pooling by means of convolutions and ReLU layers.
3. How many channels and layers do you need for a $2 \times 2$ convolution? How many for a $3 \times 3$ convolution?
What is the computational cost of the pooling layer? Assume that the input to the pooling layer is of size $c \times h \times w$ , the pooling window has a shape of $p_{h} \times p_{w}$ with a padding of $(p_{h}, p_{w})$ and a stride of $(s_{h}, s_{w})$ .
Why do you expect max-pooling and average pooling to work differently?
Do we need a separate minimum pooling layer? Can you replace it with another operation?
We could use the softmax operation for pooling. Why might it not be so popular?
Implement average pooling through a convolution.

julia

function avg_pool(X, pool_size)
    K = ones(pool_size)
    return corr2d(X, K)./prod(pool_size)
end

avg_pool (generic function with 1 method)

Prove that max-pooling cannot be implemented through a convolution alone.
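Convolution (cross-correlation) is a linear operation: the output for a sum of two inputs equals the sum of the two outputs. Max-pooling does not have this property. For the windows $x = (1, 0)$ and $y = (0, 1)$ we get $\max(x) + \max(y) = 2$ but $\max(x + y) = 1$, so no choice of kernel in corr2d can reproduce it as we did for average pooling; a nonlinearity is required.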
Max-pooling can be accomplished using ReLU operations, i.e., $\mathrm{ReLU}(x) = \max(0, x)$.
1. Express $\max(a, b)$ by using only ReLU operations.
2. Use this to implement max-pooling by means of convolutions and ReLU layers.
3. How many channels and layers do you need for a $2 \times 2$ convolution? How many for a $3 \times 3$ convolution?

julia

max_relu(a,b) = relu(b-a) + a
max_relu(a, b, c) = relu(relu(b-a) + a - c) + c

max_relu (generic function with 2 methods)
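One possible way to carry the pairwise identity over to a whole $2 \times 2$ window is sketched below. The names `relu_sketch`, `hmax`, `vmax`, and `maxpool2x2_relu` are hypothetical, and `relu_sketch` is a local stand-in so that the block runs without any framework. Each stage adds a ReLU of a neighbor difference, and that difference is itself a cross-correlation with the kernel [-1 1] (or its transpose).

```julia
relu_sketch(x) = max(zero(x), x)

# max(X[i, j], X[i, j+1]) = X[i, j] + relu(X[i, j+1] - X[i, j])
hmax(X) = X[:, 1:end-1] .+ relu_sketch.(X[:, 2:end] .- X[:, 1:end-1])

# The same identity applied along the vertical direction.
vmax(X) = X[1:end-1, :] .+ relu_sketch.(X[2:end, :] .- X[1:end-1, :])

# 2x2 max-pooling (stride 1) as two difference-plus-ReLU stages.
maxpool2x2_relu(X) = vmax(hmax(X))

X = [0. 1. 2.; 3. 4. 5.; 6. 7. 8.]
maxpool2x2_relu(X)
```

On the $3 \times 3$ example from earlier in the section this gives the same four values as pool2d, namely 4, 5, 7, and 8.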

What is the computational cost of the pooling layer? Assume that the input to the pooling layer is of size $c \times h \times w$ , the pooling window has a shape of $p_{h} \times p_{w}$ with a padding of $(p_{h}, p_{w})$ and a stride of $(s_{h}, s_{w})$ .
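Write the padding as $(q_h, q_w)$ rows and columns added on each side, to keep it distinct from the window shape $p_h \times p_w$. The number of pooling windows is then: $$ \left( \lfloor (h - p_h + 2 q_h) / s_h \rfloor + 1 \right) \times \left( \lfloor (w - p_w + 2 q_w) / s_w \rfloor + 1 \right) \times c $$

Number of operations per pooling window: $$ p_h \times p_w $$
Total operations: $$ \left( \lfloor (h - p_h + 2 q_h) / s_h \rfloor + 1 \right) \times \left( \lfloor (w - p_w + 2 q_w) / s_w \rfloor + 1 \right) \times c \times p_h \times p_w $$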

Why do you expect max-pooling and average pooling to work differently? Max-pooling passes only the strongest activation in each window on to later layers, whereas average pooling downsamples by blending every value in the window, so information from each pixel contributes to later layers.
Do we need a separate minimum pooling layer? Can you replace it with another operation? No, we do not need a separate minimum pooling layer. Negating the input and the output of max-pooling gives the same result: MinPool(x) = -MaxPool(-x).
We could use the softmax operation for pooling. Why might it not be so popular? Softmax is computationally more expensive than a maximum or an average, since it requires exponentials. Beyond that, assigning probabilities to each element of the window does not make much sense in this context.
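The minimum-pooling identity from the answers above can be checked with a few lines. This is only a sketch: `maxpool` and `minpool` are hypothetical helpers for a $2 \times 2$ window with stride 1 and no padding.

```julia
# 2x2 max-pooling (stride 1, no padding).
maxpool(M) = [maximum(M[i:i+1, j:j+1]) for i in 1:size(M, 1)-1, j in 1:size(M, 2)-1]

# Minimum pooling expressed through max-pooling and two negations.
minpool(M) = -maxpool(-M)

X = [0. 1. 2.; 3. 4. 5.; 6. 7. 8.]
minpool(X)
```

Computing the minimum of each window directly gives the same matrix, so no dedicated layer is needed.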
