A brief review of different architectures of Optical Flow algorithms
Optical flow is the apparent motion of pixels between consecutive frames of a sequence, caused by relative motion between the camera and the scene. Optical flow algorithms estimate a flow vector either for every pixel (dense optical flow) or only at selected salient positions (sparse optical flow). In this article, we will only talk about dense optical flow algorithms. Each flow vector has two components that give the horizontal and vertical displacement of a pixel. For two input images of size (H x W x 3), the flow field has size (H x W x 2), where the 2 channels hold the horizontal and vertical displacement of every pixel, measured in pixels.
Optical flow estimation has several applications: object tracking, video compression, structure from motion, segmentation, correcting camera jitter (stabilization), panoramic image construction, navigation for autonomous machines and robots that interact with their environment, and surveillance systems.
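As a small illustration of the (H x W x 2) layout, here is a NumPy sketch. The array names and the toy displacement values are assumptions made for this example; channel 0 is taken to hold the horizontal displacement and channel 1 the vertical one.

```python
import numpy as np

H, W = 4, 6
# One 2-vector per pixel: flow[y, x] = (dx, dy), measured in pixels.
flow = np.zeros((H, W, 2), dtype=np.float32)

# Toy motion: every pixel moves 2 px to the right and 1 px down.
flow[..., 0] = 2.0   # horizontal displacement
flow[..., 1] = 1.0   # vertical displacement

# Where does the pixel at (x=1, y=2) of frame 1 land in frame 2?
x, y = 1, 2
dx, dy = flow[y, x]
target = (x + float(dx), y + float(dy))   # (3.0, 3.0)
```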
Ways to estimate optical flow
The main approaches to estimating it are:
- Classical energy-based approaches: Flow estimation is posed as an energy minimization problem. This is computationally expensive and therefore not suitable for real-time use.
- CNN feature extractor combined with a classical optimizer: Useful features are extracted with CNNs, and flow estimation is then still treated as an energy minimization problem. This yields less blurry results than the 3rd approach, but it is not end-to-end trainable, and the runtime in the training phase is significantly longer.
- CNN regression architectures: CNNs are used both for feature extraction and for estimating the flow vectors. These models are end-to-end trainable, can run in real time, and can compete with the 1st approach in terms of accuracy.
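To make "energy minimization" concrete, the best-known example of such an energy is the Horn-Schunck functional. It is shown here as one representative formulation, not as the energy used by any specific method in this article. $I_x$, $I_y$ and $I_t$ are the spatial and temporal image derivatives, $(u, v)$ is the flow field, and $\alpha$ weights the smoothness term:

```latex
E(u, v) = \iint \left[ (I_x u + I_y v + I_t)^2
        + \alpha^2 \left( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \right) \right] \, dx \, dy
```

The first term penalizes violations of brightness constancy and the second penalizes non-smooth flow. Minimizing such an energy iteratively over every pixel is what makes the classical approaches slow.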
CNN regression architectures
In this section, we will go through different CNN regression architectures: their components/modules, advantages, limitations, and how each one improves on the previous architecture.
1. FlowNet
There are two variants of FlowNet: FlowNetSimple (or FlowNetS) and FlowNetCorr. Both have an encoder-decoder (auto-encoder style) architecture, where the decoder acts as a refinement module. This kind of architecture is generally used when the input and output have similar spatial sizes. The encoder uses CNN layers to extract features, and the decoder uses them to estimate the optical flow vectors. After passing through all the CNN layers, the output resolution is still four times smaller than that of the input, so it is bilinearly upsampled at the end.
In FlowNetSimple, the two input images are stacked together along the channel dimension and fed into the network, passing through the CNN layers (encoder -> decoder). In FlowNetCorr, the two images first go through separate convolutional streams; the resulting feature maps are then compared by a descriptor matching unit (correlation layer), and the output continues through the rest of the network.
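The two input strategies can be sketched in NumPy as follows. This is a minimal illustration, not FlowNet's actual implementation: the function name `correlation`, the `max_disp` parameter, the single-pixel patch size and the one-hot toy features are all assumptions chosen to keep the example short and deterministic.

```python
import numpy as np

def correlation(feat1, feat2, max_disp=1):
    """Correlate two (H, W, C) feature maps over a square search window.

    Returns an (H, W, D*D) cost volume with D = 2 * max_disp + 1. Each
    channel holds the channel-wise dot product between feat1 at (y, x) and
    feat2 at (y + dy, x + dx) for one candidate displacement (dy, dx).
    """
    H, W, C = feat1.shape
    d = max_disp
    # Zero-pad feat2 so that shifted windows stay inside the array.
    padded = np.pad(feat2, ((d, d), (d, d), (0, 0)))
    out = []
    for dy in range(-d, d + 1):
        for dx in range(-d, d + 1):
            shifted = padded[d + dy:d + dy + H, d + dx:d + dx + W]
            out.append((feat1 * shifted).sum(axis=-1) / C)
    return np.stack(out, axis=-1)

# FlowNetSimple-style input: stack two RGB frames along the channel axis.
img1 = np.zeros((3, 3, 3))
img2 = np.zeros((3, 3, 3))
stacked = np.concatenate([img1, img2], axis=-1)      # shape (3, 3, 6)

# FlowNetCorr-style matching on toy features: every pixel gets a unique
# one-hot descriptor, and the second map is shifted 1 px to the right.
f1 = np.eye(9).reshape(3, 3, 9)
f2 = np.roll(f1, shift=1, axis=1)
cost = correlation(f1, f2, max_disp=1)
best = int(cost[1, 1].argmax())   # index 5 corresponds to (dy=0, dx=+1)
```

The index of the strongest response in the cost volume tells us which displacement best explains the centre pixel, which is the signal the later layers build on.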
Advantages: CNNs are good at extracting high-level abstract features. This architecture opened the door to applying machine learning to this problem.
Limitations: Its accuracy falls short of the classical approaches. The model also fails to estimate large displacements of objects.
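When accuracy is compared across optical flow models, the usual metric is the average end-point error (EPE): the mean Euclidean distance between predicted and ground-truth flow vectors. A minimal sketch follows; the function name and the toy values are assumptions for illustration.

```python
import numpy as np

def average_epe(flow_pred, flow_gt):
    """Mean Euclidean distance between predicted and ground-truth vectors."""
    return float(np.linalg.norm(flow_pred - flow_gt, axis=-1).mean())

gt = np.zeros((2, 2, 2))
pred = np.zeros((2, 2, 2))
pred[0, 0] = (3.0, 4.0)        # one of four pixels is off by 5 px
epe = average_epe(pred, gt)    # 5 / 4 = 1.25
```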
2. FlowNet 2.0
It is a cascade of the FlowNet variants (FlowNetS & FlowNetCorr). In other words, it is a stack of multiple networks where every network refines the previous estimate. The output of the 1st network is used to warp the 2nd image (intuitively, bringing the 2nd image closer to the 1st one). The warped second image, the first image, and the previous flow estimate are then used as inputs for the next network in the stack. The three sequentially stacked networks mainly handle large displacements, so the authors added another network (FlowNetSD) in parallel to accommodate small displacements. Noise tends to be a problem with small displacements, so FlowNetSD has an elongated decoder with additional convolutions to obtain smoother estimates. The results of both branches are fused by a small fusion network to obtain the final output.
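The warping step can be sketched as below. This is a simplified illustration under stated assumptions: the function name `warp_nearest` is hypothetical, and nearest-neighbour lookup with border clamping is used for brevity, whereas a trainable network needs a differentiable (bilinear) sampler.

```python
import numpy as np

def warp_nearest(img2, flow):
    """Backward-warp img2 (H, W, C) toward frame 1 using flow (H, W, 2).

    For every pixel (x, y) of frame 1, fetch the pixel of frame 2 that the
    flow vector points to, so a perfect flow makes the result look like
    frame 1.
    """
    H, W = flow.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, H - 1)
    return img2[src_y, src_x]

# Toy check: frame 2 is frame 1 shifted 1 px to the right.
img1 = np.arange(20, dtype=np.float32).reshape(4, 5, 1)
img2 = np.roll(img1, shift=1, axis=1)
flow = np.zeros((4, 5, 2), dtype=np.float32)
flow[..., 0] = 1.0             # every pixel moved 1 px to the right

warped = warp_nearest(img2, flow)
# Except for the last column (clamped at the border), warped matches img1.
```

Whatever difference remains between the warped image and the first image is what the next network in the stack has to explain.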
Advantages: It accommodates both small and large displacements. Its accuracy is comparable to that of classical approaches.
Limitations: It has a memory footprint of about 640 MB (roughly 160M parameters), which makes it poorly suited to hardware implementation. It includes dense (fully connected) layers, making it less hardware-friendly (energy consumption is high). It produces blurry estimates near motion boundaries.
3. SPyNet
The model aims to deliver decent results with a much smaller memory requirement. It introduces a coarse-to-fine strategy with residual flow calculation. Let’s say the images are downsampled into a pyramid of six levels (L5, L4, L3, L2, L1, L0), where L0 is the smallest (coarsest) and L5 is the largest (finest). We start at L0. The two images at L0 are fed into a series of convolutional layers (G0). G0 outputs a flow estimate, which is added to the previous estimate; for the 1st level, that estimate is zero. The images at L1 are then fed into G1. Its output is the residual flow, which is added to the upsampled estimate from the previous level (L0). The previous estimate must be upsampled so that its resolution matches the residual flow at the current level. In summary, the model residually updates the flow across the spatial pyramid levels in a coarse-to-fine fashion.
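The coarse-to-fine residual update can be sketched as a short loop. This is a simplified sketch, not the published implementation: the names `upsample_flow`, `coarse_to_fine` and `dummy_G` are hypothetical, the per-level networks are replaced by a stand-in that returns a constant residual, and the warping of the second image before each level is left out to keep the focus on the update rule.

```python
import numpy as np

def upsample_flow(flow):
    """Double the resolution of an (H, W, 2) flow field.

    Nearest-neighbour repetition is used for brevity. The vectors are also
    multiplied by 2, because a 1 px shift at a coarse level spans 2 px at
    the next finer level.
    """
    return 2.0 * flow.repeat(2, axis=0).repeat(2, axis=1)

def coarse_to_fine(pyramid1, pyramid2, estimators):
    """Run the residual update from the coarsest (L0) to the finest level.

    pyramid1, pyramid2: lists of images ordered coarsest to finest.
    estimators: one callable G_k(img1, img2, init_flow) -> residual per level.
    """
    h, w = pyramid1[0].shape[:2]
    flow = np.zeros((h, w, 2))           # "previous estimate" at L0 is zero
    for k, (im1, im2, G) in enumerate(zip(pyramid1, pyramid2, estimators)):
        if k > 0:
            flow = upsample_flow(flow)   # match this level's resolution
        flow = flow + G(im1, im2, flow)  # add the residual flow
    return flow

def dummy_G(im1, im2, init_flow):
    """Stand-in for a trained network: always predicts a 0.5 px residual."""
    return np.full(init_flow.shape, 0.5)

sizes = [2, 4, 8]                        # three pyramid levels for the demo
pyr1 = [np.zeros((s, s, 3)) for s in sizes]
pyr2 = [np.zeros((s, s, 3)) for s in sizes]
flow = coarse_to_fine(pyr1, pyr2, [dummy_G] * len(sizes))
# Per level: 0.5, then 2 * 0.5 + 0.5 = 1.5, then 2 * 1.5 + 0.5 = 3.5
```

Because every level only has to predict a small correction on top of the upsampled estimate, a large motion at full resolution is built up from small motions at coarse resolutions.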
Advantages: It is more accurate than FlowNet while using fewer parameters. The coarse-to-fine strategy combined with residual flow calculation helps it accommodate large displacements. It has a small network size (1.2 M parameters). Unlike FlowNet 2.0, it doesn’t include any dense layer, making it somewhat more hardware-friendly.
Limitations: It doesn’t give state-of-the-art results. It renders vague motion/flow boundaries. It fails to capture small-magnitude flow details.
4. LiteFlowNet
As the name suggests, it is a lighter version of FlowNet 2.0, yet with more accurate results. The architecture consists of NetC (pyramidal feature extractor) and NetE (optical flow estimator). NetC generates 6-level feature maps, and NetE predicts flow fields for levels 6 to 2. Level 1 is the largest (finest) feature map. It also employs a coarse-to-fine strategy and residually updates the flow vectors. It uses feature warping instead of image warping: intuitively, instead of bringing image 2 closer to image 1, we bring the feature map of the 2nd image closer to that of the 1st, thus reducing the feature-space distance at each level. This turns the long-range matching problem between feature maps into a short-range one, which requires a smaller cost volume.
Note: The level numbering here differs from SPyNet’s: level 1 is the finest level in LiteFlowNet, whereas L0 is the coarsest level in SPyNet.
NetC contains the feature descriptor. NetE includes a cascaded flow inference module [M:S] and a flow regularization module [R]. M is the descriptor matching unit (also known as a cost volume layer or correlation layer), and S is the sub-pixel refinement unit. The descriptor matching unit correlates the feature maps of both images. The sub-pixel refinement unit refines the flow estimate to sub-pixel accuracy, which prevents erroneous flows from being amplified by the upsampling that follows. R is the flow regularization module, which uses a feature-driven local convolution. It mitigates outliers and vague flow boundaries, thus smoothing the flow field.
Advantages: It is 30 times smaller than FlowNet 2.0 and 1.36 times faster in running speed. It handles large displacements (coarse-to-fine with residual flow calculation) while preserving flow details. It refines and regularizes the output using feature-driven local convolution. The sub-pixel refinement unit lets the model capture small-magnitude flows, which appear as light colors in the flow visualization. It resolves the issue of vague flow boundaries.
Limitations: Due to the coarse-to-fine strategy, the model tends to miss small, fast-moving objects.
5. PWC-Net
It takes an approach similar to LiteFlowNet, i.e., a coarse-to-fine strategy with residual updates of the optical flow vectors, and it uses feature warping instead of image warping. The coarsest feature maps produced by the pyramidal feature extractor are fed into a correlation layer and then passed through a decoder consisting of CNN layers. The resulting flow estimate is upsampled and used to warp the feature maps of the 2nd image at the next level, which are then passed through a correlation layer and an optical flow decoder, and so on. At the last level, before the final output is produced, the decoder output goes to a context network, which uses dilated convolutions to refine the flow. The context network integrates contextual information, i.e., the relationship between nearby pixels.
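To see why dilated convolutions help a context network look at a wider neighbourhood, here is a 1-D sketch. The function name and the toy signal are assumptions for illustration; the real context network uses learned 2-D dilated convolutions.

```python
import numpy as np

def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1-D convolution (cross-correlation form) whose kernel taps
    are spaced `dilation` samples apart."""
    k = len(kernel)
    span = (k - 1) * dilation + 1           # receptive field of one output
    n_out = len(signal) - span + 1
    out = np.zeros(n_out)
    for i in range(n_out):
        taps = signal[i:i + span:dilation]  # every `dilation`-th sample
        out[i] = np.dot(taps, kernel)
    return out

x = np.arange(10, dtype=float)
k = np.ones(3)
dense = dilated_conv1d(x, k, dilation=1)    # each output sees 3 neighbours
wide = dilated_conv1d(x, k, dilation=4)     # same 3 weights span 9 samples
```

With the same three weights, the dilated version covers a window three times as wide, so context from farther pixels is integrated without adding parameters.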
Advantages: It is about 17 times smaller than FlowNet 2.0 and about two times faster at inference. The warping operation compensates for geometric distortions. It handles large displacements thanks to the coarse-to-fine strategy with residual flow calculation.
Limitations: Due to the coarse-to-fine strategy, the model tends to miss small, fast-moving objects.
6. IRR-PWCNet
If you look at the 1st figure, we have FlowNet 2.0: a stack of networks (FlowNetS & FlowNetCorr), each with its own encoder and decoder weights. If we instead share the same encoder weights and the same decoder weights across all the networks, and residually add the networks’ outputs, what we get is an iterative residual refinement (IRR) model: ‘iterative’ because the same weights are reused at every step, and ‘residual’ because each step predicts a residual flow. Figure 3 is the rolled version of the 2nd figure. The model in this figure outputs not only optical flow vectors but also occlusion maps.
Imagine some parts of the 1st image are occluded in the next image, but the warping operation is executed without considering the occlusions. This tends to give erroneous results. That’s where IRR-PWCNet comes to the rescue: it estimates both occlusion maps and the flow. However, it is still unclear how the model uses occlusion maps to enhance the result.
This joint estimation setup includes bidirectional estimation, bilateral refinement of the flow and occlusion estimates, and an occlusion upsampling layer.
Bidirectional estimation: In the top branch of the model, the feature map of the 2nd image is warped using the forward flow, and in the bottom branch, the feature map of the 1st image is warped using the backward flow. Intuitively, the two feature maps are brought towards one another in their respective branches, so the parts that remain unmatched should give us some notion of the occluded pixels.
Bilateral refinement: It refines both the flow and occlusion estimates using a bilateral filter, which is an edge-preserving and noise-reducing smoothing filter.
Occlusion upsampling layer: The flow estimate is upsampled bilinearly, but upsampling the occlusion estimate in the same way causes a heavy loss of accuracy. Thus, a dedicated occlusion upsampling layer is used to upsample the occlusion estimates.
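The intuition behind bidirectional estimation, that unmatched regions hint at occlusion, can be illustrated with the classical forward-backward consistency check. Note that this is a hand-written heuristic shown only for intuition; IRR-PWCNet predicts occlusion maps with learned layers. The function name, the tolerance `tol` and the toy flows are assumptions.

```python
import numpy as np

def fb_occlusion(flow_fw, flow_bw, tol=0.5):
    """Flag pixels of frame 1 whose forward and backward flows disagree.

    For a visible pixel p, following the forward flow into frame 2 and then
    the backward flow from there should return to p, so
    flow_fw(p) + flow_bw(p + flow_fw(p)) should be close to zero.
    """
    H, W = flow_fw.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    tx = np.clip(np.rint(xs + flow_fw[..., 0]).astype(int), 0, W - 1)
    ty = np.clip(np.rint(ys + flow_fw[..., 1]).astype(int), 0, H - 1)
    round_trip = flow_fw + flow_bw[ty, tx]
    return np.linalg.norm(round_trip, axis=-1) > tol

fw = np.zeros((3, 4, 2))
fw[..., 0] = 1.0               # forward: everything moves 1 px right
bw = np.zeros((3, 4, 2))
bw[..., 0] = -1.0              # backward: everything moves 1 px left
bw[1, 2] = (0.0, 0.0)          # one inconsistent backward vector

occ = fb_occlusion(fw, bw)
# Only the frame-1 pixel (x=1, y=1), which lands on (x=2, y=1), is flagged.
```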
Note: number of iterations (i.e., the sequence of the same encoder and decoder) == number of pyramidal levels
Advantages: It has 26.4 % fewer parameters than PWC-Net while improving accuracy by 17.7 %. It has 2.4 times fewer parameters than LiteFlowNet. Less memory is required because weights are shared across iterations instead of being duplicated in stacked networks. Bilateral refinement helps in refining the flow.
Limitations: The dataset required for training should include ground truth values for forward flow, backward flow, and occlusion masks. Due to the coarse-to-fine strategy, the model tends to miss small, fast-moving objects.
Some points to keep in mind during the training phase:
- The schedule in which data is presented matters: not only the kind of data is crucial, but also the order in which datasets with different properties are used. Start with a simple, general dataset and then fine-tune on more difficult ones. For instance, train on FlyingChairs and then fine-tune on FlyingThings3D. The simple Chairs dataset helps the network learn the general concept of color matching without developing confusing priors for 3D motion and realistic lighting too early.
- The number of pyramidal levels can be increased or decreased.
- Use learning rate schedulers.
- Use data augmentation schemes.
- Use transfer learning to improve the accuracy of the model.
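One detail worth illustrating about data augmentation for optical flow is that geometric transforms must be applied to the flow values as well as to the images. Below is a minimal sketch of a horizontal flip; the function name `hflip_pair` and the toy arrays are assumptions for this example.

```python
import numpy as np

def hflip_pair(img1, img2, flow):
    """Mirror both frames and the flow field left-to-right.

    Flipping the arrays alone is not enough: a motion to the right becomes
    a motion to the left, so the horizontal flow component changes sign.
    """
    img1_f = img1[:, ::-1].copy()
    img2_f = img2[:, ::-1].copy()
    flow_f = flow[:, ::-1].copy()
    flow_f[..., 0] *= -1.0
    return img1_f, img2_f, flow_f

img1 = np.zeros((2, 4, 3))
img2 = np.zeros((2, 4, 3))
flow = np.zeros((2, 4, 2))
flow[0, 0] = (1.0, 0.5)        # top-left pixel moves right and down

_, _, flow_f = hflip_pair(img1, img2, flow)
# The vector now sits in the top-right corner and points left and down.
```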
References
- FlowNet: Learning Optical Flow with Convolutional Networks
- FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks
- Optical Flow Estimation using a Spatial Pyramid Network
- LiteFlowNet: A Lightweight Convolutional Neural Network for Optical Flow Estimation
- PWC-Net: CNNs for Optical Flow using Pyramid, Warping, and Cost Volume
- Iterative Residual Refinement for Joint Optical Flow and Occlusion Estimation
I hope you enjoyed this article. Happy Learning!