What is convolution?

a[i, j] = Σ_m Σ_n x[i+m, j+n] · k[m, n]

Convolution is a mathematical operation which, in the context of image processing, is used to transform an image: blurring, sharpening, edge detection and more. It can take inputs ranging from two-dimensional to four-dimensional matrices, and convolutional filters which likewise range from two-dimensional to four-dimensional matrices. The above equation gives a unified view of this, where x is the input and k is the convolutional kernel, giving us a feature map called a.

[Figure: a 4x4 input x convolved with a 3x3 filter w to produce the 2x2 feature map A]

In the above picture, x of shape (4, 4) represents the input and w represents the convolutional filter of shape (3, 3), giving us A, which is our feature map. The process of convolution is to divide the input x into several smaller patches (matrices) of the same size as the convolutional kernel, and then perform an element-wise multiply and sum with the kernel.

In the following example we show the patches of size (3, 3) created out of the input x; convolving each with the filter w gives us the values of A, starting with a11. In general, a dimension of size n with kernel size k and stride s yields (n - k) // s + 1 patch positions, so a matrix of size (4, 4) with a kernel of size (3, 3) and stride = 1 gives 2 positions along each axis, i.e. 4 patches, and hence the final result has 4 values.

[Figure: the four 3x3 patches of x, each convolved with the filter w to produce a11, a12, a21 and a22]

Now the element-wise multiply and sum requires us to iterate over the elements using for loops, accumulating the sum as shown below.

from fastai.vision.all import *
np.set_printoptions(linewidth=140)
x = np.arange(16, dtype=np.int32).reshape(4, 4)
w = np.random.random(size=(3, 3))  # unseeded, so the value printed below will differ from run to run

x1 = x[:3, :3]

c = 0

for i in range(x1.shape[0]):
  for j in range(x1.shape[1]):
    c += (x1[i, j] * w[i, j])

print(c)
16.574367764256806

The key insight is to view convolution as matrix multiplication. The two loops above perform element-wise multiplication and addition, which is exactly what we get from the dot product of a row vector and a column vector. So the idea is to flatten the patch into a row vector and the weights into a column vector.
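To see the equivalence concretely, here is a quick sketch reusing x, w and c from the cell above; flattening both the patch and the kernel turns the double loop into a single dot product:

patch = x[:3, :3].reshape(-1)   # first 3x3 patch, flattened
np.dot(patch, w.reshape(-1))    # same value as c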

[Figure: a patch flattened into a row vector multiplied by the kernel flattened into a column vector, giving a11]

Performing the same operation over all 4 patches gives us a new matrix, which we will call X.

[Figure: the four flattened patches stacked as the rows of the matrix X]

In the same spirit, we can flatten w into a column vector, which we will call W.

[Figure: the kernel w flattened into the column vector W]

Now convolution is just a matrix multiplication between X and W.

[Figure: the matrix product X · W producing the four feature-map values]

Let's see how we could perform these operations using NumPy; we will validate the results against the built-in F.conv2d provided by PyTorch.

Single-channel input

x = np.arange(16, dtype=np.int32).reshape(4, 4)
x
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]], dtype=int32)

The first patch would look something like

x1 = x[:3, :3]
x1
array([[ 0,  1,  2],
       [ 4,  5,  6],
       [ 8,  9, 10]], dtype=int32)

Now transform this into a row vector

x1 = x1.reshape(1, -1)
x1
array([[ 0,  1,  2,  4,  5,  6,  8,  9, 10]], dtype=int32)

Do this for the rest of the patches and stack them together to get the matrix

x2 = x[:3, 1:4]
x2 = x2.reshape(1, -1)

x3 = x[1:4, :3]
x3 = x3.reshape(1, -1)

x4 = x[1:4, 1:4]
x4 = x4.reshape(1, -1)

X  = np.vstack((x1, x2, x3, x4))
X
array([[ 0,  1,  2,  4,  5,  6,  8,  9, 10],
       [ 1,  2,  3,  5,  6,  7,  9, 10, 11],
       [ 4,  5,  6,  8,  9, 10, 12, 13, 14],
       [ 5,  6,  7,  9, 10, 11, 13, 14, 15]], dtype=int32)

Now all of the row vectors have been stacked together to form X.

Transform the convolution kernel into a weight column vector.

np.random.seed(41)
w = np.random.random(size=(3, 3))
w
array([[0.25092362, 0.04609582, 0.67681624],
       [0.04346949, 0.1164237 , 0.60386569],
       [0.19093066, 0.66851572, 0.91744785]])
W = w.reshape(-1, 1)
W
array([[0.25092362],
       [0.04609582],
       [0.67681624],
       [0.04346949],
       [0.1164237 ],
       [0.60386569],
       [0.19093066],
       [0.66851572],
       [0.91744785]])
conv = np.dot(X, W)
conv
array([[22.49748414],
       [26.01197293],
       [36.55543931],
       [40.0699281 ]])
conv.reshape(2, 2)
array([[22.49748414, 26.01197293],
       [36.55543931, 40.0699281 ]])

Let's verify that the 2D convolution from PyTorch yields the same values.

F.conv2d(tensor(x[None, None, :, :]).float(), tensor(w[None, None, :, :]).float(), bias=None, stride=1)
tensor([[[[22.4975, 26.0120],
          [36.5554, 40.0699]]]])

Indeed the values match up!!

Let's take a look at how the convolution operation changes when we consider multi-channel inputs and multiple filters.

Multi-channel input

Let's start with the case where we have a multi-channel input. Earlier we considered inputs of shape (H, W); now we will consider inputs of shape (C, H, W) and see how we can use our im2col algorithm to perform the convolution operation.

x = np.arange(32, dtype=np.int32).reshape(2, 4, 4)
x
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11],
        [12, 13, 14, 15]],

       [[16, 17, 18, 19],
        [20, 21, 22, 23],
        [24, 25, 26, 27],
        [28, 29, 30, 31]]], dtype=int32)

In all of the cases presented here we have assumed stride=1 for simplicity. Let's write a method that captures each patch, transforms it into a row vector, and stacks all of the row vectors together to form the matrix we are referring to as X.

def patches_to_row(inp, kh, kw):
  # inp: 2D input; kh, kw: kernel height and width (stride assumed to be 1)
  H, W = inp.shape
  xm = H - kh
  ym = W - kw

  X = []

  # slide the window over the input, flattening each patch into a row vector
  for i in range(0, xm+1):
    for j in range(0, ym+1):
      rv = inp[i:i+kh, j:j+kw]
      rv = rv.reshape(1, -1)
      X.append(rv)

  return np.vstack(X)

mat = patches_to_row(np.arange(16, dtype=np.int32).reshape(4, 4), 3, 3)
mat
array([[ 0,  1,  2,  4,  5,  6,  8,  9, 10],
       [ 1,  2,  3,  5,  6,  7,  9, 10, 11],
       [ 4,  5,  6,  8,  9, 10, 12, 13, 14],
       [ 5,  6,  7,  9, 10, 11, 13, 14, 15]], dtype=int32)

For a multi-channel input we perform the same operation for both channels and then stack the results together to give us X.

X1 = patches_to_row(x[0, :, :], kh=3, kw=3)
X2 = patches_to_row(x[1, :, :], kh=3, kw=3)
X1
array([[ 0,  1,  2,  4,  5,  6,  8,  9, 10],
       [ 1,  2,  3,  5,  6,  7,  9, 10, 11],
       [ 4,  5,  6,  8,  9, 10, 12, 13, 14],
       [ 5,  6,  7,  9, 10, 11, 13, 14, 15]], dtype=int32)
X2
array([[16, 17, 18, 20, 21, 22, 24, 25, 26],
       [17, 18, 19, 21, 22, 23, 25, 26, 27],
       [20, 21, 22, 24, 25, 26, 28, 29, 30],
       [21, 22, 23, 25, 26, 27, 29, 30, 31]], dtype=int32)

Now we would like to concatenate the 1st row of X1 with the 1st row of X2, and likewise for the remaining rows, which np.hstack does for us.

X  = np.hstack((X1, X2))
X
array([[ 0,  1,  2,  4,  5,  6,  8,  9, 10, 16, 17, 18, 20, 21, 22, 24, 25, 26],
       [ 1,  2,  3,  5,  6,  7,  9, 10, 11, 17, 18, 19, 21, 22, 23, 25, 26, 27],
       [ 4,  5,  6,  8,  9, 10, 12, 13, 14, 20, 21, 22, 24, 25, 26, 28, 29, 30],
       [ 5,  6,  7,  9, 10, 11, 13, 14, 15, 21, 22, 23, 25, 26, 27, 29, 30, 31]], dtype=int32)

We need to ensure that the number of channels in the convolutional kernel matches the input.

np.random.seed(41)
w = np.random.random(size=(2, 3, 3))
W = w.reshape(-1, 1)
W
array([[0.25092362],
       [0.04609582],
       [0.67681624],
       [0.04346949],
       [0.1164237 ],
       [0.60386569],
       [0.19093066],
       [0.66851572],
       [0.91744785],
       [0.41878009],
       [0.33225985],
       [0.28303364],
       [0.18628227],
       [0.31711047],
       [0.48116867],
       [0.06952047],
       [0.70498257],
       [0.31467693]])
fmap = np.dot(X, W)
fmap.reshape(2, 2)
array([[ 88.38632016,  95.0086239 ],
       [114.87553514, 121.49783888]])
F.conv2d(tensor(x[None, :, :, :]).float(),
         tensor(w[None, :, :, :]).float(),
         bias=None,
         stride=1
        )
tensor([[[[ 88.3863,  95.0086],
          [114.8755, 121.4978]]]])

As we can see, the two values match; so far so good.

Multi-channel input and multiple filters

In the previous case we considered only 1 filter with 2 channels, which led us to a feature map of the same size as in the single-channel case. Now we can use multiple conv filters to capture different features in the image.

x = np.arange(32, dtype=np.int32).reshape(2, 4, 4)
x
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11],
        [12, 13, 14, 15]],

       [[16, 17, 18, 19],
        [20, 21, 22, 23],
        [24, 25, 26, 27],
        [28, 29, 30, 31]]], dtype=int32)

Now we would like to consider 2 weight filters, so our filter tensor's shape would be (2, 2, 3, 3), representing (kn, C, kh, kw): kn is the number of filters, C the number of channels, kh the height of the filter, and kw its width.

np.random.seed(41)
w = np.random.random(size=(2, 2, 3, 3))
w
array([[[[0.25092362, 0.04609582, 0.67681624],
         [0.04346949, 0.1164237 , 0.60386569],
         [0.19093066, 0.66851572, 0.91744785]],

        [[0.41878009, 0.33225985, 0.28303364],
         [0.18628227, 0.31711047, 0.48116867],
         [0.06952047, 0.70498257, 0.31467693]]],


       [[[0.74528235, 0.3982128 , 0.60822646],
         [0.72845649, 0.42175804, 0.39390821],
         [0.23222257, 0.4416646 , 0.37302139]],

        [[0.58360604, 0.10003138, 0.74135188],
         [0.08319793, 0.12622394, 0.32289153],
         [0.64292729, 0.99947173, 0.28100165]]]])
X1 = patches_to_row(x[0, :, :], kh=3, kw=3)
X2 = patches_to_row(x[1, :, :], kh=3, kw=3)

X  = np.hstack((X1, X2))
print(X)
print(X.shape)
[[ 0  1  2  4  5  6  8  9 10 16 17 18 20 21 22 24 25 26]
 [ 1  2  3  5  6  7  9 10 11 17 18 19 21 22 23 25 26 27]
 [ 4  5  6  8  9 10 12 13 14 20 21 22 24 25 26 28 29 30]
 [ 5  6  7  9 10 11 13 14 15 21 22 23 25 26 27 29 30 31]]
(4, 18)

For the convolutional filters, first convert each one into a column vector, then combine them together into W = [W1, W2]

np.random.seed(41)
w = np.random.random(size=(2, 2, 3, 3))
W = np.hstack((w[0, :, :, :].reshape(-1, 1),
               w[1, :, :, :].reshape(-1, 1)))
W
array([[0.25092362, 0.74528235],
       [0.04609582, 0.3982128 ],
       [0.67681624, 0.60822646],
       [0.04346949, 0.72845649],
       [0.1164237 , 0.42175804],
       [0.60386569, 0.39390821],
       [0.19093066, 0.23222257],
       [0.66851572, 0.4416646 ],
       [0.91744785, 0.37302139],
       [0.41878009, 0.58360604],
       [0.33225985, 0.10003138],
       [0.28303364, 0.74135188],
       [0.18628227, 0.08319793],
       [0.31711047, 0.12622394],
       [0.48116867, 0.32289153],
       [0.06952047, 0.64292729],
       [0.70498257, 0.99947173],
       [0.31467693, 0.28100165]])
fmap = np.dot(X, W)
fmap.T.reshape(2, 2, 2)  # one 2x2 feature map per filter: shape (kn, oh, ow)
array([[[ 88.38632016,  95.0086239 ],
        [114.87553514, 121.49783888]],

       [[102.08763737, 110.31109367],
        [134.98146258, 143.20491888]]])
F.conv2d(tensor(x[None, :, :, :]).float(),
         tensor(w).float(),
         bias=None,
         stride=1
        )
tensor([[[[ 88.3863,  95.0086],
          [114.8755, 121.4978]],

         [[102.0876, 110.3111],
          [134.9815, 143.2049]]]])

Checking against PyTorch's implementation to ensure correctness.

Batch of multi-channel inputs and multiple convolutional filters

Extending to a batch of inputs only requires us to stack the row matrices of the individual inputs in the batch to form X.

Let's take an example where we have only 2 inputs in our batch.

x = np.arange(64, dtype=np.int32).reshape(2, 2, 4, 4)
x
array([[[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11],
         [12, 13, 14, 15]],

        [[16, 17, 18, 19],
         [20, 21, 22, 23],
         [24, 25, 26, 27],
         [28, 29, 30, 31]]],


       [[[32, 33, 34, 35],
         [36, 37, 38, 39],
         [40, 41, 42, 43],
         [44, 45, 46, 47]],

        [[48, 49, 50, 51],
         [52, 53, 54, 55],
         [56, 57, 58, 59],
         [60, 61, 62, 63]]]], dtype=int32)
X11 = patches_to_row(x[0, 0, :, :], kh=3, kw=3)
X12 = patches_to_row(x[0, 1, :, :], kh=3, kw=3)

X1  = np.hstack((X11, X12))

X21 = patches_to_row(x[1, 0, :, :], kh=3, kw=3)
X22 = patches_to_row(x[1, 1, :, :], kh=3, kw=3)

X2  = np.hstack((X21, X22))
X   = np.vstack((X1, X2))
X.shape
(8, 18)

Each input produces 4 row vectors, so joining the outputs of the 2 inputs yields a matrix of shape (8, 18).

np.random.seed(41)
w = np.random.random(size=(2, 2, 3, 3))
W = np.hstack((w[0, :, :, :].reshape(-1, 1),
               w[1, :, :, :].reshape(-1, 1)))
W
array([[0.25092362, 0.74528235],
       [0.04609582, 0.3982128 ],
       [0.67681624, 0.60822646],
       [0.04346949, 0.72845649],
       [0.1164237 , 0.42175804],
       [0.60386569, 0.39390821],
       [0.19093066, 0.23222257],
       [0.66851572, 0.4416646 ],
       [0.91744785, 0.37302139],
       [0.41878009, 0.58360604],
       [0.33225985, 0.10003138],
       [0.28303364, 0.74135188],
       [0.18628227, 0.08319793],
       [0.31711047, 0.12622394],
       [0.48116867, 0.32289153],
       [0.06952047, 0.64292729],
       [0.70498257, 0.99947173],
       [0.31467693, 0.28100165]])
fmap = np.dot(X, W)
fmap.T.reshape(2, 2, 2, 2)  # note: this layout is (kn, N, oh, ow), not PyTorch's (N, kn, oh, ow)
array([[[[ 88.38632016,  95.0086239 ],
         [114.87553514, 121.49783888]],

        [[300.30003998, 306.92234373],
         [326.78925496, 333.4115587 ]]],


       [[[102.08763737, 110.31109367],
         [134.98146258, 143.20491888]],

        [[365.238239  , 373.4616953 ],
         [398.1320642 , 406.3555205 ]]]])
F.conv2d(tensor(x).float(),
         tensor(w).float(),
         bias=None,
         stride=1
        )
tensor([[[[ 88.3863,  95.0086],
          [114.8755, 121.4978]],

         [[102.0876, 110.3111],
          [134.9815, 143.2049]]],


        [[[300.3000, 306.9224],
          [326.7893, 333.4115]],

         [[365.2382, 373.4617],
          [398.1320, 406.3555]]]])

The values match again, although note the arrangement: our reshape gives (kn, N, oh, ow) while F.conv2d returns (N, kn, oh, ow), so the blocks appear in a different order.
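If we want the result in exactly PyTorch's layout, a small sketch (reusing fmap from above) is to swap the first two axes:

out = fmap.T.reshape(2, 2, 2, 2).transpose(1, 0, 2, 3)  # (N, kn, oh, ow), matching F.conv2d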

Using an efficient method in NumPy to perform convolutions

So far we have seen that extracting the patches of an image based on the kernel size and stride requires most of the effort. We naively implemented it with a double loop assuming a stride of 1, but there is a faster way to get these patches of an input based on the filter size and stride, using the method

np.lib.stride_tricks.as_strided

Let's see how this works out

[Figure: the 3x3 input X]

Let's say our input is X and the convolutional kernel is of size (2, 2) with stride 1; then the patch tensor should look something like this

[Figure: the (2, 2, 2, 2) view A containing the four 2x2 patches of X]

X = np.arange(1, 10, dtype=np.int32).reshape(3, 3)
A = np.lib.stride_tricks.as_strided(X, shape=(2, 2, 2, 2), strides=(12, 4, 12, 4))
A
array([[[[1, 2],
         [4, 5]],

        [[2, 3],
         [5, 6]]],


       [[[4, 5],
         [7, 8]],

        [[5, 6],
         [8, 9]]]], dtype=int32)

Lots of stuff to unpack here; let's start with the function signature and see what the individual parameters mean.

np.lib.stride_tricks.as_strided?

np.lib.stride_tricks.as_strided(x, shape=None, strides=None, subok=False, writeable=True)

This method creates a view on the matrix x, where shape and strides are attributes of the view created. We are all familiar with what shape signifies for a matrix, but strides may be something new.

NumPy officially describes strides as how many bytes we need to jump over in the data buffer to go from one item to the next. So to create our desired view we need to pass its shape and strides in the function arguments.

Let's try to understand it with an example.

a = np.arange(9, dtype=np.int32).reshape(3,3)
print(a)
[[0 1 2]
 [3 4 5]
 [6 7 8]]

To move from a[i, 0] to a[i, 1] we need to jump 4 bytes, since the dtype is int32, and to move from a[0, j] to a[1, j] (one full row of three int32 values) we need to jump 12 bytes; hence the strides for a would be (12, 4).

a.strides
(12, 4)

Coming back to the original array A, how would we determine its strides? We could start with the lowest dimension: how many bytes do we need to jump when moving from A[i, j, k, l] to A[i, j, k, l+1]? Comparing with X, we know from the previous exercise that moving across the lowest dimension requires only 4 bytes, so the strides would be (?, ?, ?, 4). Similarly, the jump required to move from A[i, j, k, l] to A[i, j, k+1, l] is 12 bytes when compared with our previous example, leading to (?, ?, 12, 4).

[Figure: the elements marked 1 and 2 in A, showing the jump along dimension j]

For the next dimension j we need to look at the bytes we need to jump to reach from 1 to 2, marked in the above figure. Essentially we are trying to figure out the jump from A[i, j, k, l] to A[i, j+1, k, l]. If we look at the original array X, both 1 and 2 are beside each other in the 0th row, so it requires a jump of only 4 bytes; hence the strides would be (?, 4, 12, 4)

[Figure: the elements marked 1 and 4 in red in A, showing the jump along dimension i]

For the highest dimension we are looking at the jump between 1 and 4, marked in red in the figure above. If we map these numbers back to our original array X, we see that the two numbers are in the same column but in adjacent rows, so it takes 12 bytes to reach from 1 to 4 and the stride would be 12. The final value of strides for the array A would therefore be (12, 4, 12, 4)
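As a sanity check, we can derive these strides programmatically from X.strides (a small sketch for this single-channel case, with filter stride s = 1):

X = np.arange(1, 10, dtype=np.int32).reshape(3, 3)
s = 1
(X.strides[-2]*s, X.strides[-1]*s, *X.strides[-2:])  # (12, 4, 12, 4)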

Multi-channel inputs

X = np.arange(1, 19, dtype=np.int32).reshape(2, 3, 3)
X
array([[[ 1,  2,  3],
        [ 4,  5,  6],
        [ 7,  8,  9]],

       [[10, 11, 12],
        [13, 14, 15],
        [16, 17, 18]]], dtype=int32)

If the size of the convolutional kernel is (2, 2) then we will get 8 matrices of size (2, 2), so the shape of A would be (2, 2, 2, 2, 2). Notice that it has one more dimension than the previous case; this is because we now have 2 channels instead of one, so the shape of A would be

A.shape = (X.shape[0], *A.shape)  # prepend the channel dimension to the single-channel shape

For the stride calculation everything else stays the same; we have only added a channel dimension, hence the strides would be (?, 12, 4, 12, 4), and we only need to find the value for the highest dimension.

[Figure: the multi-channel input X with the elements 1 and 10 marked, showing the jump across the channel dimension]

Let's look at the strides of the input for reference here

X.strides
(36, 12, 4)

For the highest dimension, taking an example, we want to find the jump between 1 and 10. If we compare the positions of 1 and 10 in X we find that they are 9 positions apart, and moving a single position takes 4 bytes, hence moving 9 positions takes 36 bytes.

Hence the final value of the strides would be (36, 12, 4, 12, 4). If we pay close attention, there is a relationship between X.strides and A.strides, which is the following

A.strides = (*X.strides[:-2], X.strides[-2]*s, X.strides[-1]*s, *X.strides[-2:])

where s would be the stride used by the convolution filter to go over the input X.
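Putting this together for our (2, 3, 3) input X above, a quick sketch of the multi-channel view:

A = np.lib.stride_tricks.as_strided(X, shape=(2, 2, 2, 2, 2), strides=(36, 12, 4, 12, 4))
A.shape  # (C, oh, ow, kh, kw)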

The next question would be how to determine the shape of A. Since X is a 4-dim tensor with shape (N, C, H, W) and the convolutional kernel has height kh and width kw, the shape of A can be written as (N, C, ?, ?, kh, kw).

The output feature map should have height oh and width ow respectively, which can be calculated as follows

oh = ( H - kh ) // stride + 1
ow = ( W - kw ) // stride + 1
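
For example, with H = 4, kh = 3 and stride = 1 we get oh = (4 - 3) // 1 + 1 = 2, which matches the 2x2 feature maps we computed earlier.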

def get_mat_strides(X, kh, kw, s):
  N, C, H, W = X.shape
  oh = ( H - kh ) // s + 1
  ow = ( W - kw ) // s + 1

  # strides of the patch view, derived from X.strides as shown above
  strides = (*X.strides[:-2], X.strides[-2]*s, X.strides[-1]*s, *X.strides[-2:])
  A = np.lib.stride_tricks.as_strided(X, shape=(N, C, oh, ow, kh, kw), strides=strides)
  return A


X = np.arange(1, 33, dtype=np.int32).reshape(1, 2, 4, 4)
A = get_mat_strides(X, kh=2, kw=2, s=1)
A
array([[[[[[ 1,  2],
           [ 5,  6]],

          [[ 2,  3],
           [ 6,  7]],

          [[ 3,  4],
           [ 7,  8]]],


         [[[ 5,  6],
           [ 9, 10]],

          [[ 6,  7],
           [10, 11]],

          [[ 7,  8],
           [11, 12]]],


         [[[ 9, 10],
           [13, 14]],

          [[10, 11],
           [14, 15]],

          [[11, 12],
           [15, 16]]]],



        [[[[17, 18],
           [21, 22]],

          [[18, 19],
           [22, 23]],

          [[19, 20],
           [23, 24]]],


         [[[21, 22],
           [25, 26]],

          [[22, 23],
           [26, 27]],

          [[23, 24],
           [27, 28]]],


         [[[25, 26],
           [29, 30]],

          [[26, 27],
           [30, 31]],

          [[27, 28],
           [31, 32]]]]]], dtype=int32)

NumPy provides a method named np.tensordot to perform tensor multiplication. Here we will perform tensor multiplication between A and the convolution filters W.
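As a quick illustration of the axes argument (a toy sketch, unrelated to our data): np.tensordot multiplies and sums over the paired axes, generalizing the matrix product.

a = np.ones((2, 3))
b = np.ones((3, 4))
np.tensordot(a, b, axes=[(1,), (0,)]).shape  # (2, 4), same as (a @ b).shape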

Let's consider a single filter whose channels match the 2 channels of the input X.

np.random.seed(41)
W = np.random.random(size=(1, 2, 2, 2))

W
array([[[[0.25092362, 0.04609582],
         [0.67681624, 0.04346949]],

        [[0.1164237 , 0.60386569],
         [0.19093066, 0.66851572]]]])
X.shape, A.shape, W.shape
((1, 2, 4, 4), (1, 2, 3, 3, 2, 2), (1, 2, 2, 2))
# Shape of filter -> (kn, C, kh, kw)
# Shape of feature map -> (N, kn, oh, ow)

fmap = np.tensordot(A, W, axes=[(1, 4, 5), (1, 2, 3)])
fmap.transpose(0, 3, 1, 2) # transpose to match the output of `F.conv2d`
array([[[[35.55368844, 38.15072938, 40.74777032],
         [45.94185221, 48.53889315, 51.1359341 ],
         [56.33001598, 58.92705693, 61.52409787]]]])
conv = F.conv2d(tensor(X).float(), tensor(W).float(), bias=None, stride=1)
conv
tensor([[[[35.5537, 38.1507, 40.7478],
          [45.9419, 48.5389, 51.1359],
          [56.3300, 58.9271, 61.5241]]]])

As you can see, the values match: our implementation of convolution using np.lib.stride_tricks.as_strided followed by np.tensordot agrees with F.conv2d.

Note: here we have assumed that X has already been padded beforehand, if padding is required.
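If padding is needed, a minimal sketch using np.pad with zero padding (p is a hypothetical padding width chosen purely for illustration):

p = 1
Xp = np.pad(X, ((0, 0), (0, 0), (p, p), (p, p)))  # zero-pad only the H and W dimensions
Ap = get_mat_strides(Xp, kh=2, kw=2, s=1)
Ap.shape  # (1, 2, 5, 5, 2, 2)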

%timeit -n 1000 np.tensordot(A, W, axes=[(1, 4, 5), (1, 2, 3)])
21 µs ± 2.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit -n 1000 F.conv2d(tensor(X).float(), tensor(W).float(), bias=None, stride=1)
39.4 µs ± 4.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Our implementation is competitive with (here even faster than) PyTorch's, at least on this tiny input.

It is useful to implement the core building blocks of deep learning in multiple ways to better understand what's happening behind the scenes and gain insights. This exercise gave a different perspective on why convolutions are modeled as matrix multiplications to gain the performance speedups of accelerators like GPUs.