Objective

Deep CNN architectures are made up of many components, such as convolutional layers, activation functions, pooling layers and batch-norm layers. Each of them is there for a specific reason, and the best way to understand its effect is to experiment with it. For example, layers such as pooling and strided convolution are included to reduce the spatial size of the input.

If we look at the architecture of VGG below, we see that a lot of max-pooling layers are used. The idea is to enlarge the receptive field while shrinking the feature maps, which cuts computation and the number of parameters in any fully connected layers that follow. But then a question arises: do we really need to downsample? Luckily I came across this repository, where this specific question is addressed. Its author replaces the downsampling layers with either dilated convolutions or large kernels with appropriate padding. The purpose of this post is to implement and validate these ideas by running the same experiments on the CIFAR-10 dataset using fastai. The intention is also to show how easy it is to experiment with fastai.

Libraries

Let's start by installing fastai2 and keras.

!pip install fastai2 keras > /dev/null
from fastai2.vision.all import *
from keras.datasets import cifar10

Dataset

We are going to train our network on the CIFAR-10 dataset, an established computer-vision dataset used for object recognition. It is a subset of the 80 Million Tiny Images dataset and consists of 60,000 32x32 color images, each containing one of 10 object classes, with 6,000 images per class. It was collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The Keras loader below returns the standard split of 50,000 training and 10,000 test images.

(x_train,y_train),(x_test,y_test) = cifar10.load_data()

Datasets API

This is where fastai comes in, with a flexible API for building a dataset that the models can consume easily. We pass a list of (image, label) pairs to the pipeline. No data augmentation is applied. To validate our ideas we keep a separate holdout set, taken from the training images. If you want to understand more about the Datasets API, please read the official docs.

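# Pair each image with its label. A plain list is used because recent NumPy
# versions refuse to build an array from (32x32x3 image, scalar label) pairs.
items = list(zip(x_train, y_train.ravel()))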

# 80-20 percent split
splits = RandomSplitter(seed=42)(items)
tfms   = [[lambda x: x[0], PILImage.create], [lambda x: x[1], Categorize()]]

dsets = Datasets(items, tfms, splits=splits)
dls = dsets.dataloaders(bs=64, after_item=[ToTensor(), IntToFloatTensor()])
dls.show_batch(figsize=(4, 4))

VGG (4-layer) network

Let's train a 4-layer VGG-style network with downsampling and see how it performs on CIFAR-10.

class VGG_4(nn.Module):
    def __init__(self, c_in=3, n_out=10):
        super(VGG_4, self).__init__()
        
        self.n_out = n_out
        self.model = nn.Sequential(nn.Conv2d(in_channels=c_in, out_channels=16, padding=(1, 1), kernel_size=(3, 3)),
                                   nn.BatchNorm2d(16),
                                   nn.ReLU(inplace=True),
                                   nn.MaxPool2d(kernel_size=2, stride=2),
                                   nn.Conv2d(in_channels=16, out_channels=24, padding=(1, 1), kernel_size=(3, 3)),
                                   nn.BatchNorm2d(24),
                                   nn.ReLU(inplace=True),
                                   nn.MaxPool2d(kernel_size=2, stride=2),
                                   nn.Conv2d(in_channels=24, out_channels=32, padding=(1, 1), kernel_size=(3, 3)),
                                   nn.BatchNorm2d(32),
                                   nn.ReLU(inplace=True),
                                   nn.MaxPool2d(kernel_size=2, stride=2),
                                   nn.Conv2d(in_channels=32, out_channels=48, padding=(1, 1), kernel_size=(3, 3)),
                                   nn.BatchNorm2d(48),
                                   nn.ReLU(inplace=True),
                                   nn.MaxPool2d(kernel_size=2, stride=2),
                                   nn.AdaptiveAvgPool2d(output_size=(1, 1)),
                                   nn.Conv2d(in_channels=48, out_channels=self.n_out, kernel_size=(1, 1))
                                  )
    
    def forward(self, x):
        x = self.model(x)
        x = x.view(-1, self.n_out)
        return x

learn = Learner(dls, VGG_4(), loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.lr_find()

Let's train it for 30 epochs.

learn.fit_one_cycle(30, 1e-2)

learn.summary()

VGG with dilated convolution

The idea is to progressively increase the dilation across the four convolutions (1, 2, 4, 8) so that the receptive field of the network grows without any pooling.
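To get a feel for how much dilation helps, here is a rough sketch that computes the theoretical receptive field of a stack of layers. It uses the usual recurrence (each layer adds `(effective_kernel - 1) * jump` pixels, where `jump` is the product of the strides so far). The function name `receptive_field` and the tuple layout are made up for this illustration, and the numbers ignore padding and border effects.

```python
def receptive_field(layers):
    """Theoretical receptive field of layers given as (kernel, stride, dilation)."""
    rf, jump = 1, 1  # jump: spacing, in input pixels, between neighbouring outputs
    for kernel, stride, dilation in layers:
        effective_kernel = dilation * (kernel - 1) + 1
        rf += (effective_kernel - 1) * jump
        jump *= stride
    return rf

conv = (3, 1, 1)  # 3x3 convolution, stride 1, no dilation
pool = (2, 2, 1)  # 2x2 max pooling, stride 2

rf_pooled = receptive_field([conv, pool] * 4)                       # conv + pool, four times
rf_plain = receptive_field([conv] * 4)                              # four 3x3 convs, no pooling
rf_dilated = receptive_field([(3, 1, d) for d in (1, 2, 4, 8)])     # dilated variant

print(rf_pooled, rf_plain, rf_dilated)  # 46 9 31
```

Simply dropping the pooling layers would leave a receptive field of only 9 pixels, while the dilation schedule 1, 2, 4, 8 brings it to 31, which is close to the full 32x32 CIFAR-10 image.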

class VGG4_Dilation(nn.Module):
    def __init__(self, c_in=3, n_out=10):
        super(VGG4_Dilation, self).__init__()
          
        self.n_out = n_out
        self.model = nn.Sequential(nn.Conv2d(in_channels=c_in, out_channels=16, padding=(1, 1), kernel_size=(3, 3), dilation=1),
                                  nn.BatchNorm2d(16),
                                  nn.ReLU(inplace=True),
                                  nn.Conv2d(in_channels=16, out_channels=24, padding=(1, 1), kernel_size=(3, 3), dilation=2),
                                  nn.BatchNorm2d(24),
                                  nn.ReLU(inplace=True),
                                  nn.Conv2d(in_channels=24, out_channels=32, padding=(1, 1), kernel_size=(3, 3), dilation=4),
                                  nn.BatchNorm2d(32),
                                  nn.ReLU(inplace=True),
                                  nn.Conv2d(in_channels=32, out_channels=48, padding=(1, 1), kernel_size=(3, 3), dilation=8),
                                  nn.BatchNorm2d(48),
                                  nn.ReLU(inplace=True),
                                  nn.AdaptiveAvgPool2d(output_size=(1, 1)),
                                  nn.Conv2d(in_channels=48, out_channels=self.n_out, kernel_size=(1, 1))
                                  )
      
    def forward(self, x):
        x = self.model(x)
        x = x.view(-1, self.n_out)
        return x
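The class above keeps `padding=(1, 1)` in every layer while the dilation grows, so it is worth checking what that does to the feature-map size. The sketch below applies the output-size formula from the PyTorch `Conv2d` documentation to a 32x32 input. The helper names are invented for this check, and the `padding = dilation` variant is only a suggestion for comparison, not what the experiment in this post ran.

```python
def conv_out_size(size, kernel, padding, dilation=1, stride=1):
    """Spatial output size of a convolution (formula from the Conv2d docs)."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

def trace_sizes(size, dilations, padding_for):
    """Feature-map size after each 3x3 conv; padding_for maps dilation -> padding."""
    sizes = [size]
    for d in dilations:
        size = conv_out_size(size, kernel=3, padding=padding_for(d), dilation=d)
        sizes.append(size)
    return sizes

dilations = (1, 2, 4, 8)
fixed_padding = trace_sizes(32, dilations, lambda d: 1)    # as in VGG4_Dilation
matched_padding = trace_sizes(32, dilations, lambda d: d)  # hypothetical alternative

print(fixed_padding)    # [32, 32, 30, 24, 10]
print(matched_padding)  # [32, 32, 32, 32, 32]
```

With a fixed padding of 1 the feature maps still shrink, from 32x32 down to 10x10, although there is no pooling. The network runs either way because `AdaptiveAvgPool2d` accepts any spatial size. Setting the padding equal to the dilation would keep the full 32x32 resolution throughout.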

In the head of the model, instead of a fully connected layer we use AdaptiveAvgPool2d followed by a 1x1 convolution. Global average pooling collapses each feature map to a single value, so the head needs far fewer parameters than a fully connected layer applied to the flattened feature maps, and researchers have observed that this does not hurt performance.
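The saving is easy to quantify with a little arithmetic. The sketch below compares the two kinds of head for the 48-channel output used in this post. The function names are hypothetical, and the 32x32 and 2x2 spatial sizes are assumptions chosen as examples (full resolution, and a 32x32 input halved four times).

```python
def fc_head_params(channels, height, width, n_out):
    """Fully connected layer on the flattened feature maps: weights + biases."""
    return channels * height * width * n_out + n_out

def gap_conv_head_params(channels, n_out):
    """Global average pooling (no parameters) followed by a 1x1 convolution."""
    return channels * n_out + n_out

gap_head = gap_conv_head_params(48, 10)
fc_head_full_res = fc_head_params(48, 32, 32, 10)  # no downsampling at all
fc_head_pooled = fc_head_params(48, 2, 2, 10)      # after four 2x2 poolings

print(gap_head, fc_head_pooled, fc_head_full_res)  # 490 1930 491530
```

The pooled head has the same size whatever the resolution of the feature maps, which matters here because the variants without downsampling keep much larger feature maps.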

learn = Learner(dls, VGG4_Dilation(), loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.lr_find()

learn.fit_one_cycle(30, 1e-2)

learn.summary()

Note: The number of trainable parameters is the same in the vanilla 4-layer VGG model with downsampling and in the 4-layer VGG model with dilation, because dilation only spaces out the kernel taps and adds no weights.

VGG with large kernels

We progressively increase the size of the kernels from 3 to 9. A larger kernel increases the receptive field of the network, but we have to make sure that we use adequate padding so that the feature maps do not shrink.
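For an odd kernel size k and stride 1, "adequate" padding means (k - 1) / 2 pixels on each side. A minimal sketch to confirm this for the kernel sizes used below (the helper names are made up for this check):

```python
def same_padding(kernel):
    """Padding that preserves spatial size for an odd kernel at stride 1."""
    return (kernel - 1) // 2

def out_size_stride1(size, kernel, padding):
    """Output size of an undilated, stride-1 convolution."""
    return size + 2 * padding - (kernel - 1)

kernels = (3, 5, 7, 9)
paddings = [same_padding(k) for k in kernels]
sizes = [out_size_stride1(32, k, p) for k, p in zip(kernels, paddings)]

print(paddings)  # [1, 2, 3, 4]
print(sizes)     # [32, 32, 32, 32]
```

These are the padding values used in the class below, so every layer of this variant works on full 32x32 feature maps.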

class VGG4_large_filter(nn.Module):
    def __init__(self, c_in=3, n_out=10):
        super(VGG4_large_filter, self).__init__()
          
        self.n_out = n_out
        self.model = nn.Sequential(nn.Conv2d(in_channels=c_in, out_channels=16, padding=(1, 1), kernel_size=(3, 3)),
                                   nn.BatchNorm2d(16),
                                   nn.ReLU(inplace=True),
                                   nn.Conv2d(in_channels=16, out_channels=24, padding=(2, 2), kernel_size=(5, 5)),
                                   nn.BatchNorm2d(24),
                                   nn.ReLU(inplace=True),
                                   nn.Conv2d(in_channels=24, out_channels=32, padding=(3, 3), kernel_size=(7, 7)),
                                   nn.BatchNorm2d(32),
                                   nn.ReLU(inplace=True),
                                   nn.Conv2d(in_channels=32, out_channels=48, padding=(4, 4), kernel_size=(9, 9)),
                                   nn.BatchNorm2d(48),
                                   nn.ReLU(inplace=True),
                                   nn.AdaptiveAvgPool2d(output_size=(1, 1)),
                                   nn.Conv2d(in_channels=48, out_channels=self.n_out, kernel_size=(1, 1))
                                  )
      
    def forward(self, x):
        x = self.model(x)
        x = x.view(-1, self.n_out)
        return x

learn = Learner(dls, VGG4_large_filter(), loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.lr_find()

learn.fit_one_cycle(30, 5e-3)

learn.summary()

Even though the loss improves, the number of trainable parameters jumps from 25,474 to 172,930.
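Both figures can be reproduced by hand, without building the models. The sketch below counts weights and biases for each convolution plus the scale and shift of each batch-norm layer, using the channel widths from the classes above. The helper names are illustrative only.

```python
def conv_params(c_in, c_out, kernel):
    """Weights plus biases of a square-kernel convolution."""
    return c_in * c_out * kernel * kernel + c_out

def bn_params(channels):
    """Learnable scale and shift of a batch-norm layer."""
    return 2 * channels

def vgg4_params(kernels, channels=(3, 16, 24, 32, 48), n_out=10):
    """Trainable parameters of the 4-layer network for the given kernel sizes."""
    total = 0
    for k, c_in, c_out in zip(kernels, channels[:-1], channels[1:]):
        total += conv_params(c_in, c_out, k) + bn_params(c_out)
    return total + conv_params(channels[-1], n_out, 1)  # 1x1 conv head

params_3x3 = vgg4_params((3, 3, 3, 3))    # pooled and dilated variants
params_large = vgg4_params((3, 5, 7, 9))  # large-kernel variant

print(params_3x3, params_large)  # 25474 172930
```

Dilation never appears in the formula, which is why the dilated model has exactly as many parameters as the baseline. In the large-kernel model, the final 9x9 convolution alone accounts for 124,464 parameters.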

Conclusion

  • If we compare the 4-layer VGG network with downsampling against the dilated and large-kernel variants, we observe that there is an increase in performance.
  • Using large kernels improves performance, but at the cost of more trainable parameters.
  • Using dilation improves performance without increasing the number of trainable parameters.

Next steps

  • To see whether this generalizes, we would have to test the idea on other architectures (e.g. resnet18) and check whether it improves performance there too.
  • It would also be interesting to see what kind of features our models learn when we use dilation.