Bilinear CNN model with Attention module for Fine-Grained Visual Recognition (FGVR)

Shreya Pamecha
Nov 9, 2021

Introduction

Fine-grained visual recognition deals with categories that look very similar and are therefore hard to tell apart. For instance, a California Gull and a Ring-billed Gull differ mainly in the color of a mark on their beaks (orange vs. black), and a crow and a raven differ in the shape of their beaks (straight vs. pointed). Traditional CNN models alone often fail to produce the desired classification results on such fine-grained images.

Courtesy: Difference between a crow & a raven

Thus, in this article, we will work with the Bilinear CNN (BCNN) model and employ different state-of-the-art techniques to increase its accuracy, as follows:

  1. BCNN with Bilinear Pooling
  2. Usage of transfer learning (fine-tuning)
  3. Convolutional Block Attention Module (CBAM)

Dataset Used: CUB-200-2011 dataset (11,788 images of 200 bird species)

Bilinear Model with Bilinear Pooling — Architecture & Understanding of the layers used

The Bilinear CNN model consists of two parallel VGG networks (the D or M variant), which act as feature extractors (feature extractor A and feature extractor B), followed by a pooling function and a classification layer. Two-stream architectures are generally used to analyze videos along two aspects, temporal and spatial; here, however, the two streams act as part and local feature extractors (local features are patterns such as edges, corners, and small points of interest), as sketched below.
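As a rough illustration, here is how the two truncated VGG-16 backbones might be set up in PyTorch (a sketch, not the authors' exact code; the variable names are mine, and the 448x448 input size is borrowed from the BCNN paper):

```python
import torch
import torchvision.models as models

# Two parallel VGG-16 backbones truncated after the last conv block,
# playing the roles of feature extractors A and B.
vgg_a = models.vgg16(weights="IMAGENET1K_V1").features
vgg_b = models.vgg16(weights="IMAGENET1K_V1").features

x = torch.randn(1, 3, 448, 448)   # one RGB image; 448x448 follows the BCNN paper
feat_a = vgg_a(x)                 # -> (1, 512, 14, 14)
feat_b = vgg_b(x)                 # -> (1, 512, 14, 14)
```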

Courtesy: Bilinear CNN model for Image Classification

The outputs of the two extractors are forwarded to the pooling function, which is essentially a matrix multiplication: if the extractor outputs are reshaped to c x M and c x N matrices A and B (c spatial locations, with M- and N-dimensional features), their product A'B is an M x N matrix. We then reshape this matrix into an image descriptor of size MN x 1. The pairwise interactions of the bilinear features in the pooling function allow them to be conditioned on each other; the outer product captures the interactions between part features. This matrix multiplication is called bilinear sum-pooling because it aggregates the per-location outer products across the whole image. An alternative way to capture the correlation between features is max-pooling, where we take the maximum value rather than adding values up. The resulting image descriptor is called orderless, since sum-pooling and max-pooling ignore pixel locations. Finally, the descriptor is passed through a softmax layer for classification. A sketch of the sum-pooling variant appears below.
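A minimal PyTorch sketch of this sum-pooled outer product (the function name is mine; the signed square-root and L2 normalization steps come from the original BCNN paper, not this article):

```python
import torch
import torch.nn.functional as F

def bilinear_pool(feat_a, feat_b):
    """Sum-pooled outer product of two conv feature maps (a sketch)."""
    b, m, h, w = feat_a.shape
    n = feat_b.shape[1]
    fa = feat_a.view(b, m, h * w)              # (b, M, c), c = h*w locations
    fb = feat_b.view(b, n, h * w)              # (b, N, c)
    phi = torch.bmm(fa, fb.transpose(1, 2))    # outer products summed over locations -> (b, M, N)
    phi = phi.view(b, m * n)                   # flatten to the MN-dim image descriptor
    # The BCNN paper additionally applies signed square-root and L2 normalization:
    phi = torch.sign(phi) * torch.sqrt(phi.abs() + 1e-10)
    return F.normalize(phi, dim=1)
```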

This network/model is end-to-end trainable. Since the domain-specific dataset is limited (only a few images per class), it would be challenging to train the model well and get reasonable results by training on the CUB dataset alone after random initialization of the weights.

Employing Transfer Learning

To overcome this issue, we use a widely used transfer-learning technique: fine-tuning. Instead of randomly initializing the weights, we start from VGG weights pre-trained on the ImageNet dataset, which generalize well to other classification problems, and fine-tune them on the CUB dataset. Training from scratch is analogous to finding a new path filled with hurdles (poor minima), whereas fine-tuning means walking a path someone has already laid out (the pre-trained weights). In other words, we pass the knowledge acquired while training one model to another model with a similar architecture.
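A sketch of what this fine-tuning setup could look like in PyTorch (the two-group learning rates are a common recipe and an assumption on my part, not taken from the article):

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Start from ImageNet weights instead of random initialization,
# then fine-tune on CUB's 200 bird classes.
model = models.vgg16(weights="IMAGENET1K_V1")
model.classifier[6] = nn.Linear(4096, 200)   # swap the 1000-way ImageNet head

# A common recipe: a small learning rate for pre-trained layers and a
# larger one for the freshly initialized classifier head.
optimizer = torch.optim.SGD(
    [
        {"params": model.features.parameters(), "lr": 1e-4},
        {"params": model.classifier.parameters(), "lr": 1e-2},
    ],
    momentum=0.9,
)
```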

Convolutional Block Attention Module (CBAM)

Courtesy: CBAM - Convolutional Block Attention Module

To increase the accuracy, we employ a Convolutional Block Attention Module (CBAM), which has two submodules: channel and spatial. CBAM infers attention maps along two dimensions, channel and spatial, and multiplies these maps with the input features for adaptive feature refinement. The channel submodule exploits the inter-channel relationships of the features: we squeeze the spatial dimension of the input feature map using average-pooling and max-pooling (this gathers the features that are essential for distinguishing objects, giving better channel-wise attention). It focuses on 'what' the object is. The spatial submodule exploits the inter-spatial relationships: here, we squeeze the channel dimension using average-pooling and max-pooling. It focuses on 'where' the object is. The two submodules can be combined either in parallel or in a sequential fashion.

After experimenting with the different arrangements, we observe that placing the channel submodule before the spatial submodule ('what' -> 'where') in a sequential fashion renders better results on the CUB dataset.
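A compact PyTorch sketch of CBAM in this sequential channel-then-spatial arrangement (class and parameter names are my own; the reduction ratio of 16 and the 7x7 spatial convolution follow the CBAM paper):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel submodule: attends to 'what' the object is."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # squeeze spatial dims (average)
        mx = self.mlp(x.amax(dim=(2, 3)))    # squeeze spatial dims (max)
        scale = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * scale

class SpatialAttention(nn.Module):
    """Spatial submodule: attends to 'where' the object is."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)    # squeeze channel dim (average)
        mx = x.amax(dim=1, keepdim=True)     # squeeze channel dim (max)
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale

class CBAM(nn.Module):
    """Sequential arrangement: channel attention first, then spatial."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))
```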

The attention blocks boost the representational power of the CNN rather than working independently of it, thereby yielding improved features.

You may also read about the Channel-Spatial Attention Module (CSAM) and Few-Shot Learning, which is used when you have a limited number of training samples.

I hope you found some insightful information in this article.

Happy Learning!!

References

  1. Sanghyun Woo, Jongchan Park, Joon-Young Lee, In So Kweon, "CBAM: Convolutional Block Attention Module," ECCV 2018.
  2. Tsung-Yu Lin, Aruni RoyChowdhury, Subhransu Maji, "Bilinear CNN Models for Fine-Grained Visual Recognition," ICCV 2015.
