im2col
Implement convolutions as matrix operations using im2col
Convolution is a mathematical operation which, in the context of image processing, is used to transform an image, producing effects such as blurring, sharpening, or edge detection. Its inputs range from two-dimensional to four-dimensional arrays, and the convolutional filters likewise range from two-dimensional to four-dimensional arrays. The following equation gives a unified view of the 2-D case, where x is the input and k is the convolutional kernel, giving us a feature map a:

$$a_{i,j} = \sum_{m}\sum_{n} x_{i+m,\,j+n}\,k_{m,n}$$
In the picture above, x of shape (4, 4) represents the input and w represents the convolutional filter of shape (3, 3), giving us A, which is our feature map. The process of convolution is to divide the input x into several smaller patches (matrices) of the same size as the convolutional kernel, and then perform an element-wise multiply-and-sum of each patch with the kernel.
In the following example we show patches of size (3, 3) created out of the input x; performing convolution of a patch with the filter w gives us the value a11. A matrix of size (4, 4) with a kernel of size (3, 3) and stride = 1 gives us 4 patches, and hence the final result has 4 values.
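As a quick sanity check on that count, here is the patch count computed with the output-size formula we derive later in this post (a small sketch of my own):

H, W, kh, kw, stride = 4, 4, 3, 3, 1
n_patches = ((H - kh) // stride + 1) * ((W - kw) // stride + 1)
print(n_patches)  # 4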
The element-wise multiply and sum requires us to iterate over the elements using for loops and accumulate the sum, as shown below.
from fastai.vision.all import *

np.set_printoptions(linewidth=140)

x = np.arange(16, dtype=np.int32).reshape(4, 4)
w = np.random.random(size=(3, 3))

x1 = x[:3, :3]  # the first (3, 3) patch of x
c = 0
for i in range(x1.shape[0]):
    for j in range(x1.shape[1]):
        c += (x1[i, j] * w[i, j])
print(c)
The key insight is to view convolution as matrix multiplication. The two loops above perform element-wise multiplication and addition, which is exactly what a dot product of a row vector and a column vector computes. So the idea is to flatten each patch into a row vector and the weights into a column vector.
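As a quick check (my own addition, using the c, x1 and w defined above), the loop result should equal the dot product of the flattened patch and the flattened kernel:

np.isclose(c, np.dot(x1.reshape(1, -1), w.reshape(-1, 1)).item())  # True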
Perform the same operation over all 4 patches, giving us a new matrix which we will call X. In the same spirit we can flatten w into a column vector, which we will call W. Now convolution is just a matrix multiplication between X and W. Let's see how we can perform these operations using NumPy; we will validate the results using the built-in F.conv2d provided by PyTorch.
x = np.arange(16, dtype=np.int32).reshape(4, 4)
x
The first patch looks something like this:
x1 = x[:3, :3]
x1
Now flatten this into a row vector:
x1 = x1.reshape(1, -1)
x1
Do the same for the rest of the patches and stack them together to get the matrix X:
x2 = x[:3, 1:4]
x2 = x2.reshape(1, -1)
x3 = x[1:4, :3]
x3 = x3.reshape(1, -1)
x4 = x[1:4, 1:4]
x4 = x4.reshape(1, -1)
X = np.vstack((x1, x2, x3, x4))
X
Now all of the row vectors have been stacked together to form X.
Next, transform the convolution kernel into a weight column vector.
np.random.seed(41)
w = np.random.random(size=(3, 3))
w
W = w.reshape(-1, 1)
W
conv = np.dot(X, W)
conv
conv.reshape(2, 2)
Verify that the 2D convolution from PyTorch yields the same values:
F.conv2d(tensor(x[None, None, :, :]).float(), tensor(w[None, None, :, :]).float(), bias=None, stride=1)
Indeed the values match up!!
Let's take a look at how the convolution operation changes when we consider a multi-channel input and multiple filters.
x = np.arange(32, dtype=np.int32).reshape(2, 4, 4)
x
In all of the cases presented here we assume stride=1 for simplicity. Let's write a method that captures each patch, flattens it into a row vector, and then stacks all of the row vectors together to form the matrix we are referring to as X.
def patches_to_row(inp, kh, kw):
    H, W = inp.shape
    xm = H - kh  # last valid row offset for a patch
    ym = W - kw  # last valid column offset for a patch
    X = []
    for i in range(0, xm + 1):
        for j in range(0, ym + 1):
            rv = inp[i:i+kh, j:j+kw]  # extract a (kh, kw) patch
            rv = rv.reshape(1, -1)    # flatten it into a row vector
            X.append(rv)
    return np.vstack(X)
mat = patches_to_row(np.arange(16, dtype=np.int32).reshape(4, 4), 3, 3)
mat
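As a sanity check (my addition), this should reproduce the X we assembled by hand earlier from x1 through x4:

np.array_equal(mat, X)  # True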
For a multi-channel input we perform the same operation for both channels and then stack the results together to give us X.
X1 = patches_to_row(x[0, :, :], kh=3, kw=3)
X2 = patches_to_row(x[1, :, :], kh=3, kw=3)
X1
X2
Now we would like to concatenate the 1st row of X1 with the 1st row of X2, and likewise for each subsequent row.
X = np.hstack((X1, X2))
X
We need to ensure that the number of channels in the convolutional kernel matches the input.
np.random.seed(41)
w = np.random.random(size=(2, 3, 3))
W = w.reshape(-1, 1)
W
fmap = np.dot(X, W)
fmap.reshape(2, 2)
F.conv2d(tensor(x[None, :, :, :]).float(),
tensor(w[None, :, :, :]).float(),
bias=None,
stride=1
)
As we can see the two values match, so far so good.
x = np.arange(32, dtype=np.int32).reshape(2, 4, 4)
x
Now we would like to consider 2 weight filters, so our filter's shape would be (2, 2, 3, 3), representing (kn, C, kh, kw): kn is the number of filters, C the number of channels, kh the height of the filter, and kw the width of the filter.
np.random.seed(41)
w = np.random.random(size=(2, 2, 3, 3))
w
X1 = patches_to_row(x[0, :, :], kh=3, kw=3)
X2 = patches_to_row(x[1, :, :], kh=3, kw=3)
X = np.hstack((X1, X2))
print(X)
print(X.shape)
For the convolutional filters, first convert each one into a column vector, then combine them together into W = [W1, W2].
np.random.seed(41)
w = np.random.random(size=(2, 2, 3, 3))
W = np.hstack((w[0, :, :, :].reshape(-1, 1),
w[1, :, :, :].reshape(-1, 1)))
W
fmap = np.dot(X, W)
fmap.T.reshape(2, 2, 2)  # (kn, oh, ow); F.conv2d's output has an extra leading batch dim
F.conv2d(tensor(x[None, :, :, :]).float(),
tensor(w).float(),
bias=None,
stride=1
)
Checking against PyTorch's implementation to ensure correctness.
x = np.arange(64, dtype=np.int32).reshape(2, 2, 4, 4)
x
X11 = patches_to_row(x[0, 0, :, :], kh=3, kw=3)
X12 = patches_to_row(x[0, 1, :, :], kh=3, kw=3)
X1 = np.hstack((X11, X12))
X21 = patches_to_row(x[1, 0, :, :], kh=3, kw=3)
X22 = patches_to_row(x[1, 1, :, :], kh=3, kw=3)
X2 = np.hstack((X21, X22))
X = np.vstack((X1, X2))
X.shape
Each input produces 4 row vectors, so joining the outputs of the 2 inputs yields a matrix of shape (8, 18).
np.random.seed(41)
w = np.random.random(size=(2, 2, 3, 3))
W = np.hstack((w[0, :, :, :].reshape(-1, 1),
w[1, :, :, :].reshape(-1, 1)))
W
fmap = np.dot(X, W)
# fmap.T has shape (kn, N*oh*ow); reorder to (N, kn, oh, ow) to line up with F.conv2d
fmap.T.reshape(2, 2, 2, 2).transpose(1, 0, 2, 3)
F.conv2d(tensor(x).float(),
tensor(w).float(),
bias=None,
stride=1
)
Matching again!!
Using an efficient method in NumPy to perform convolutions.
We have seen so far that extracting patches from an image based on the kernel size and stride requires most of the effort. We naively implemented it with a double loop assuming a stride of 1, but there is a faster way to get these patches of an input based on the filter size and stride, using the method np.lib.stride_tricks.as_strided. Let's see how this works.
Let's say our input is X, a (3, 3) matrix, and the convolutional kernel is of size (2, 2) with stride 1; then the matrix of patches should look something like this:
X = np.arange(1, 10, dtype=np.int32).reshape(3, 3)
A = np.lib.stride_tricks.as_strided(X, shape=(2, 2, 2, 2), strides=(12, 4, 12, 4))
A
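To convince ourselves the view is correct, here are the same four (2, 2) patches extracted with plain slicing (a quick comparison of my own):

np.array([[X[i:i+2, j:j+2] for j in range(2)] for i in range(2)])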
Lots of stuff to unpack here. Let's start with the function signature and see what the individual parameters mean.
np.lib.stride_tricks.as_strided?
np.lib.stride_tricks.as_strided(x, shape=None, strides=None, subok=False, writeable=True)
This method creates a view on the matrix x, where shape and strides are attributes of the view being created. We are all familiar with what shape signifies for a matrix, but strides is something new. NumPy officially describes strides as how many bytes we need to jump over in the data buffer to go from one item to the next. So to create our desired view we need to pass its shape and strides as the function arguments.
Let's try to understand it with an example.
a = np.arange(9, dtype=np.int32).reshape(3,3)
print(a)
To move from a[i, 0] to a[i, 1] we need to jump 4 bytes, since the dtype is int32; and across the ith dimension, to move from a[0, j] to a[1, j] we need to jump 12 bytes. Hence the strides for a are (12, 4).
a.strides
Coming back to the original array A, how would we determine its strides? We could start with the lowest dimension: how many bytes do we need to jump when moving from A[i, j, k, l] to A[i, j, k, l+1]? Comparing with X, we know from the previous exercise that moving across the lowest dimension requires only 4 bytes, so the strides would be (?, ?, ?, 4). Similarly, the jump required to move from A[i, j, k, l] to A[i, j, k+1, l] is 12 bytes when compared with our previous example, leading to (?, ?, 12, 4).
For the next dimension j we need to look at the bytes we need to jump to reach from 1 to 2, marked in the above figure. Essentially we are trying to figure out the jump from A[i, j, k, l] to A[i, j+1, k, l]. If we look at the original array X, 1 and 2 are beside each other in the 0th row, so it requires a jump of only 4 bytes; hence the strides would be (?, 4, 12, 4).
For the highest dimension we are looking at the jump between 1 and 4, marked in red in the figure above. If we map these numbers to our original array X we see that they are in the same column but in adjacent rows, so it takes 12 bytes to reach from 1 to 4; hence the stride is 12, and the final value of strides for the array A is (12, 4, 12, 4).
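We can confirm this directly on the view we created earlier (my own check):

A.strides  # (12, 4, 12, 4)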
X = np.arange(1, 19, dtype=np.int32).reshape(2, 3, 3)
X
If the size of the convolutional kernel is (2, 2) then we will get 8 matrices of size (2, 2), so the shape of A would be (2, 2, 2, 2, 2). If you notice, it has one more dimension than the previous case, because we now have 2 channels instead of one; the new shape of A is the old one with the channel dimension prepended:
A.shape = (X.shape[0], *A.shape)
For the stride calculation everything else stays the same; we have only added an additional channel dimension, hence the strides value would be (?, 12, 4, 12, 4), so we only need to find the value for the highest dimension.
Let's look at the strides of the input for reference:
X.strides
For the highest dimension, taking an example, we want to find the jump between 1 and 10. If we compare the positions of 1 and 10 in X we find that they are 9 positions apart, and moving a single position takes 4 bytes, hence moving 9 positions takes 36 bytes.
Hence the final value of strides would be (36, 12, 4, 12, 4). If we pay close attention there is a relationship between X.strides and A.strides, which is the following:
A.strides = (*X.strides[:-2], X.strides[-2]*s, X.strides[-1]*s, *X.strides[-2:])
where s is the stride used by the convolution filter to go over the input X.
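We can verify this relationship on the 3-d X above with a filter stride of 1 (a minimal check of my own; s is a name I introduce here):

s = 1
(*X.strides[:-2], X.strides[-2]*s, X.strides[-1]*s, *X.strides[-2:])  # (36, 12, 4, 12, 4)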
The next question is how to determine the shape of A. Since X is a 4-dim tensor with shape (N, C, H, W) and the convolutional kernel has height kh and width kw, the shape of A can be written as (N, C, ?, ?, kh, kw).
The missing dimensions are the height oh and width ow of the output feature map, which can be calculated as follows:
oh = ( H - kh ) // stride + 1
ow = ( W - kw ) // stride + 1
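Plugging in our running example of a (4, 4) input with a (3, 3) kernel and stride 1 (a quick arithmetic check of my own), we recover the (2, 2) feature maps seen earlier:

# oh, ow for H = W = 4, kh = kw = 3, stride = 1
(4 - 3) // 1 + 1, (4 - 3) // 1 + 1  # (2, 2)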
def get_mat_strides(X, kh, kw, s):
    N, C, H, W = X.shape
    oh = (H - kh) // s + 1  # output height
    ow = (W - kw) // s + 1  # output width
    strides = (*X.strides[:-2], X.strides[-2]*s, X.strides[-1]*s, *X.strides[-2:])
    A = np.lib.stride_tricks.as_strided(X, shape=(N, C, oh, ow, kh, kw), strides=strides)
    return A
X = np.arange(1, 33, dtype=np.int32).reshape(1, 2, 4, 4)
A = get_mat_strides(X, kh=2, kw=2, s=1)
A
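A quick shape check (my addition): the view is laid out as (N, C, oh, ow, kh, kw):

A.shape  # (1, 2, 3, 3, 2, 2)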
NumPy provides a method named np.tensordot to perform tensor multiplication: a generalized dot product that sums over the axes you specify.
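A tiny illustration of the axes argument (my own example, not from the original post): the listed axes of the two arrays are multiplied together and summed away, so contracting p's axis 1 with q's axis 0 reproduces an ordinary matrix product.

p = np.arange(6).reshape(2, 3)
q = np.arange(6).reshape(3, 2)
np.array_equal(np.tensordot(p, q, axes=[(1,), (0,)]), p @ q)  # True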
Here we perform the tensor multiplication between A and the convolution filters W. Let's consider a single filter matching the 2 channels of the input X.
np.random.seed(41)
W = np.random.random(size=(1, 2, 2, 2))
W
X.shape, A.shape, W.shape
# Shape of filter -> (kn, C, kh, kw)
# Shape of feature map -> (N, kn, oh, ow)
fmap = np.tensordot(A, W, axes=[(1, 4, 5), (1, 2, 3)])
fmap.transpose(0, 3, 1, 2) # transpose to match the output of `F.conv2d`
conv = F.conv2d(tensor(X).float(), tensor(W).float(), bias=None, stride=1)
conv
As you can see, the values from our implementation of convolution using np.lib.stride_tricks.as_strided followed by np.tensordot match those of F.conv2d.
Note: here we have assumed that X is already padded beforehand; our implementation applies no padding of its own.
%timeit -n 1000 np.tensordot(A, W, axes=[(1, 4, 5), (1, 2, 3)])
%timeit -n 1000 F.conv2d(tensor(X).float(), tensor(W).float(), bias=None, stride=1)
It matches the performance of PyTorch's implementation as well.
It is useful to implement the core building blocks of deep learning in multiple ways, to better understand what's happening behind the scenes and to gain insights. This exercise gave a different perspective on why convolutions are modeled as matrix multiplications: to gain the performance speedup of accelerators like GPUs.