12.10. Transposed Convolution

The layers we have introduced so far for convolutional neural networks, including convolutional layers (Section 6.2) and pooling layers (Section 6.5), typically reduce the input width and height or keep them unchanged. Applications such as semantic segmentation (Section 12.9) and generative adversarial networks (Section 14.2), however, need to predict a value for each pixel and therefore need to increase the input width and height. Transposed convolution, also called fractionally-strided convolution [Dumoulin & Visin, 2016] or deconvolution [Long et al., 2015], serves this purpose.

from mxnet import nd, init
from mxnet.gluon import nn
import d2l

12.10.1. Basic 2D Transposed Convolution

Let’s consider a basic case in which both the input and the output have a single channel, with padding 0 and stride 1. Fig. 12.10.1 illustrates how transposed convolution with a \(2\times 2\) kernel is computed on a \(2\times 2\) input matrix.


Fig. 12.10.1 Transposed convolution layer with a \(2\times 2\) kernel.

We can implement this operation for a given kernel matrix \(K\) and input matrix \(X\).

def trans_conv(X, K):
    h, w = K.shape
    Y = nd.zeros((X.shape[0] + h - 1, X.shape[1] + w - 1))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            Y[i: i + h, j: j + w] += X[i, j] * K
    return Y

Recall that convolution computes its results by Y[i, j] = (X[i: i + h, j: j + w] * K).sum() (see corr2d in Section 6.2), which summarizes input values through the kernel. Transposed convolution, in contrast, broadcasts input values through the kernel, which results in a larger output shape.

Verify the results in Fig. 12.10.1.

X = nd.array([[0,1], [2,3]])
K = nd.array([[0,1], [2,3]])
trans_conv(X, K)
[[ 0.  0.  1.]
 [ 0.  4.  6.]
 [ 4. 12.  9.]]
<NDArray 3x3 @cpu(0)>

Alternatively, we can use nn.Conv2DTranspose to obtain the same results. As with nn.Conv2D, both the input and the kernel should be 4-D tensors.

X, K = X.reshape((1, 1, 2, 2)), K.reshape((1, 1, 2, 2))
tconv = nn.Conv2DTranspose(1, kernel_size=2)
tconv.initialize(init.Constant(K))  # use K as the layer's weight
tconv(X)
[[[[ 0.  0.  1.]
   [ 0.  4.  6.]
   [ 4. 12.  9.]]]]
<NDArray 1x1x3x3 @cpu(0)>

12.10.2. Padding, Strides, and Channels

In convolution, padding elements are applied to the input, whereas in transposed convolution they are applied to the output. A \(1\times 1\) padding means that we first compute the output as usual and then remove the first and last rows and columns.

tconv = nn.Conv2DTranspose(1, kernel_size=2, padding=1)
tconv.initialize(init.Constant(K))  # use K as the layer's weight
tconv(X)
<NDArray 1x1x1x1 @cpu(0)>
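
To make the cropping interpretation concrete, here is a minimal sketch written with plain Python lists, chosen only so that it runs without MXNet. The function name trans_conv_padded and its padding argument are hypothetical and not part of any library. The sketch computes the full transposed convolution and then drops padding rows and columns from each border of the output.

```python
def trans_conv_padded(X, K, padding=0):
    """Transposed convolution on nested lists, then crop the output border.

    Hypothetical helper for illustration; assumes stride 1.
    """
    n_h, n_w = len(X), len(X[0])
    k_h, k_w = len(K), len(K[0])
    # The full (uncropped) output has shape (n_h + k_h - 1, n_w + k_w - 1)
    Y = [[0] * (n_w + k_w - 1) for _ in range(n_h + k_h - 1)]
    for i in range(n_h):
        for j in range(n_w):
            for a in range(k_h):
                for b in range(k_w):
                    Y[i + a][j + b] += X[i][j] * K[a][b]
    # Padding removes rows and columns from the output, not the input
    p = padding
    return [row[p:len(row) - p] for row in Y[p:len(Y) - p]]


X = [[0, 1], [2, 3]]
K = [[0, 1], [2, 3]]
full = trans_conv_padded(X, K)                # 3x3 result from Fig. 12.10.1
cropped = trans_conv_padded(X, K, padding=1)  # only the center element remains
print(full)
print(cropped)
```

With the input and kernel from Fig. 12.10.1, the cropped result keeps only the center element of the \(3\times 3\) output, which is consistent with the \(1\times 1\times 1\times 1\) shape reported by the layer.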

Similarly, strides are applied to outputs as well.

tconv = nn.Conv2DTranspose(1, kernel_size=2, strides=2)
tconv.initialize(init.Constant(K))  # use K as the layer's weight
tconv(X)
[[[[0. 0. 0. 1.]
   [0. 0. 2. 3.]
   [0. 2. 0. 3.]
   [4. 6. 6. 9.]]]]
<NDArray 1x1x4x4 @cpu(0)>
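
The effect of the stride can be sketched in the same way. The following is an illustrative plain-Python version, not the library's implementation; the name trans_conv_strided and its stride argument are hypothetical. The only change from the basic case is that the scaled kernel copy for input element (i, j) is written starting at output position (i * stride, j * stride).

```python
def trans_conv_strided(X, K, stride=1):
    """Transposed convolution on nested lists with a stride and no padding.

    Hypothetical helper for illustration.
    """
    n_h, n_w = len(X), len(X[0])
    k_h, k_w = len(K), len(K[0])
    # Along each dimension the output size is (n - 1) * stride + k
    out_h = (n_h - 1) * stride + k_h
    out_w = (n_w - 1) * stride + k_w
    Y = [[0] * out_w for _ in range(out_h)]
    for i in range(n_h):
        for j in range(n_w):
            # The stride shifts where each scaled kernel copy is accumulated
            for a in range(k_h):
                for b in range(k_w):
                    Y[i * stride + a][j * stride + b] += X[i][j] * K[a][b]
    return Y


X = [[0, 1], [2, 3]]
K = [[0, 1], [2, 3]]
Y1 = trans_conv_strided(X, K, stride=1)  # overlapping kernel copies, 3x3
Y2 = trans_conv_strided(X, K, stride=2)  # non-overlapping kernel copies, 4x4
for row in Y2:
    print(row)
```

With a stride of 2 and a \(2\times 2\) kernel the kernel copies no longer overlap, so each \(2\times 2\) block of the output is simply one input element times \(K\).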

The multi-channel extension of transposed convolution works in the same way as for convolution. When the input has \(c_i\) channels, the transposed convolution assigns a \(k_h\times k_w\) kernel matrix to each input channel. If the output has \(c_o\) channels, then we have a \(c_i\times k_h\times k_w\) kernel for each output channel.

As a result, if we feed \(X\) into a convolutional layer \(f\) to compute \(Y=f(X)\) and create a transposed convolution layer \(g\) with the same hyperparameters as \(f\), except that the number of output channels is set to the number of channels of \(X\), then \(g(Y)\) should have the same shape as \(X\). Let’s verify this statement.

X = nd.random.uniform(shape=(1, 10, 16, 16))
conv = nn.Conv2D(20, kernel_size=5, padding=2, strides=3)
tconv = nn.Conv2DTranspose(10, kernel_size=5, padding=2, strides=3)
conv.initialize()   # both layers need initialized parameters before use
tconv.initialize()
tconv(conv(X)).shape == X.shape
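
The shape arithmetic behind this check can be sketched as follows. The two helper names are hypothetical, and the formulas assume no dilation and no output padding: a convolution maps a size \(n\) to \(\lfloor (n + 2p - k)/s \rfloor + 1\), and a transposed convolution maps a size \(n\) to \((n - 1)s + k - 2p\).

```python
def conv_out_size(n, k, p, s):
    """Output size of a convolution along one dimension.

    Assumes no dilation; hypothetical helper for illustration.
    """
    return (n + 2 * p - k) // s + 1


def trans_conv_out_size(n, k, p, s):
    """Output size of a transposed convolution along one dimension.

    Assumes no dilation and no output padding; hypothetical helper.
    """
    return (n - 1) * s + k - 2 * p


# Hyperparameters from the example: 16x16 input, kernel 5, padding 2, stride 3
n, k, p, s = 16, 5, 2, 3
m = conv_out_size(n, k, p, s)
restored = trans_conv_out_size(m, k, p, s)
print(m, restored)
```

Here the convolution maps 16 to 6 and the transposed convolution maps 6 back to 16. Note that the round trip depends on \(n + 2p - k\) being divisible by \(s\): with the same hyperparameters an input size of 17 also maps to 6, so the transposed convolution would return 16 rather than 17.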

12.10.3. Analogy to Matrix Transposition

Transposed convolution takes its name from matrix transposition. In fact, convolution operations can also be implemented by matrix multiplication. In the example below, we define a \(3\times 3\) input \(X\) and a \(2\times 2\) kernel \(K\), and then use corr2d to compute the convolution output.

X = nd.arange(9).reshape((3,3))
K = nd.array([[0,1], [2,3]])
Y = d2l.corr2d(X, K)
Y
[[19. 25.]
 [37. 43.]]
<NDArray 2x2 @cpu(0)>

Next, we rewrite the convolution kernel \(K\) as a matrix \(W\). Its shape will be \((4,9)\), where the \(i\)-th row represents applying the kernel to the input to generate the \(i\)-th output element.

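def kernel2matrix(K):
    # k lays out the 2x2 kernel along a flattened 3x3 input:
    # [K[0,0], K[0,1], 0, K[1,0], K[1,1]]
    k, W = nd.zeros(5), nd.zeros((4, 9))
    k[:2], k[3:5] = K[0,:], K[1,:]
    # Row i of W places k at the offset of the i-th output element
    W[0, :5], W[1, 1:6], W[2, 3:8], W[3, 4:] = k, k, k, k
    return W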

W = kernel2matrix(K)
W
[[0. 1. 0. 2. 3. 0. 0. 0. 0.]
 [0. 0. 1. 0. 2. 3. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 2. 3. 0.]
 [0. 0. 0. 0. 0. 1. 0. 2. 3.]]
<NDArray 4x9 @cpu(0)>

Then the convolution operator can be implemented by matrix multiplication with proper reshaping.

Y == nd.dot(W, X.reshape((-1))).reshape((2,2))
[[1. 1.]
 [1. 1.]]
<NDArray 2x2 @cpu(0)>

We can implement transposed convolution as a matrix multiplication as well by reusing kernel2matrix. To reuse the generated \(W\), we construct a \(2\times 2\) input, so that the corresponding weight matrix has shape \((9,4)\), which is \(W^T\). Let’s verify the results.

X = nd.array([[0,1], [2,3]])
Y = trans_conv(X, K)
Y == nd.dot(W.T, X.reshape((-1))).reshape((3,3))
[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
<NDArray 3x3 @cpu(0)>
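
The hard-coded kernel2matrix only handles a \(2\times 2\) kernel on a \(3\times 3\) input. As a rough sketch of how the same construction might generalize, the following plain-Python version builds \(W\) for an arbitrary kernel and input size. The helpers kernel_to_matrix, matvec, and transpose are hypothetical names introduced here for illustration, and plain lists are used only so that the example runs without MXNet.

```python
def kernel_to_matrix(K, in_h, in_w):
    """Build W so that W times the flattened input gives the flattened
    cross-correlation output (no padding, stride 1). Hypothetical helper."""
    k_h, k_w = len(K), len(K[0])
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1
    W = [[0] * (in_h * in_w) for _ in range(out_h * out_w)]
    for i in range(out_h):
        for j in range(out_w):
            for a in range(k_h):
                for b in range(k_w):
                    # Output element (i, j) reads input element (i + a, j + b)
                    W[i * out_w + j][(i + a) * in_w + (j + b)] = K[a][b]
    return W


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    """Return the transpose of matrix M."""
    return [list(col) for col in zip(*M)]


K = [[0, 1], [2, 3]]
W = kernel_to_matrix(K, 3, 3)                    # shape (4, 9)
conv_flat = matvec(W, list(range(9)))            # convolution of the 3x3 input
tconv_flat = matvec(transpose(W), [0, 1, 2, 3])  # transposed convolution of the 2x2 input
print(conv_flat)
print(tconv_flat)
```

For the inputs used in this section, conv_flat is [19, 25, 37, 43] and tconv_flat is [0, 0, 1, 0, 4, 6, 4, 12, 9], which are the flattened versions of the two results shown above.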

12.10.4. Summary

  • Compared to convolutions that reduce inputs through kernels, transposed convolutions broadcast inputs.

  • If a convolution layer reduces the input width and height by factors of \(n_w\) and \(n_h\), respectively, then a transposed convolution layer with the same kernel sizes, padding, and strides will increase the input width and height by factors of \(n_w\) and \(n_h\), respectively.

  • We can implement convolution operations by matrix multiplication; the corresponding transposed convolutions can be implemented by multiplication with the transposed matrix.

12.10.5. Exercises

  • Is it efficient to use matrix multiplication to implement convolution operations? Why?