Introduction
There is no doubt that machine learning is becoming well known these days, not only to developers but also to everyday businesses, thanks to the efforts of creators to make libraries, tools, and frameworks more concise and user-friendly.
As a student majoring in Computer Science and Engineering, my interest in Data Science and ML has been awakened with the help of the Machine Learning Engineer course by DIVE INTO CODE. Over more than one year as a learner, the course has conveyed a lot of useful theory and practical problems to brush up my skills. This blog was created to demonstrate a portion of the final assignment for my graduation at DIVE INTO CODE.
For the project, I chose to work on a past contest called Zalo AI Challenge 2021, the fourth year of an annual online competition for Vietnam's AI engineers to explore AI technologies and impact life in exciting new ways. In 2021, the competition consisted of three main problems, and my work is related to one of them: 5K Compliance.
Project Overview
In this section, I will briefly explain the meaning of the name, the description and rules of the challenge, and the dataset to be worked on.
During the Covid-19 outbreak, the Vietnamese government pushed the "5K" public health safety message. In the message, masking and keeping a safe distance are two key rules that have been shown to be extremely successful in preventing people from contracting or spreading the virus. Enforcing these principles on a large scale is where technology may help. In this challenge, participants create an algorithm to detect whether or not a person or group of individuals in a picture adheres to the mask and distance standards.
Basic rules
We are given a dataset containing images of people who are either wearing masks or not, standing either close to or far from each other. Our mission is to predict whether the formation of these people adheres to the 5K standard.
The 5K standard is based on two conditions: mask (0 = not wearing, 1 = wearing) and distancing (0 = too close, 1 = far enough). People who adhere to the 5K standard are unlikely to expose each other to the virus even if some of them have caught it, which prevents the spread of the COVID-19 pandemic through human interaction.
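To make the labeling concrete, here is a toy sketch of how I interpret the two conditions combining into the final label. This is my own illustration, not the official challenge scoring code:

```python
# Toy illustration (my interpretation, NOT the official challenge code):
# an image is 5K-compliant only when both conditions hold.
def is_5k_compliant(mask: int, distancing: int) -> int:
    """mask: 1 = wearing, 0 = not; distancing: 1 = far enough, 0 = too close."""
    return int(mask == 1 and distancing == 1)

print(is_5k_compliant(1, 1))  # 1 -> adheres to the 5K standard
print(is_5k_compliant(1, 0))  # 0 -> masked but standing too close
```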
EfficientNet
As mentioned, in this blog we only discuss the model my project implemented and applied, which is EfficientNetV2. However, I'd also like to write about its predecessor, EfficientNet. This will enable readers to understand the mechanism behind both models and how useful they are in this specific image detection field.
Origin
Convolutional Neural Networks (ConvNets) are frequently constructed with a limited resource budget and then scaled up for improved accuracy as more resources become available. The creators of EfficientNet conducted a study of model scaling and learned that better accuracy and resource efficiency can be reached by balancing the network depth, width, and resolution. They proposed a new scaling method that uses a simple but very effective mechanism to scale all depth/width/resolution dimensions uniformly.
The researchers also designed a family of models to evaluate the method's performance and size. The dataset used in their study is ImageNet, a large database of annotated photographs intended for computer vision research. It has more than 14 million images capturing different objects, each labeled with what it shows, and at least one million of the images also come with bounding boxes.
When training a model with more computational resources, we can increase the network depth, width, or resolution. The factors (coefficients) can be determined by a small grid search on the original smaller model. Figure 1 demonstrates the difference between the creators' compound scaling and other conventional methods.
Optimization Problem
The model scaling formulation is quite complex, yet it remains one of the most interesting parts of this research, so I have tried my best to make the explanation simpler.
First, ConvNet layer $i$ can be defined as a function $$Y_i = F_i(X_i) \tag{1}\label{eq1}$$ where: $$ \scriptsize{ \begin{array}{lll} F_i & = & \text{the operator applied at layer } i \\ X_i & = & \text{input tensor with shape } \langle H_i, W_i, C_i \rangle \\ Y_i & = & \text{output tensor} \\ H_i, W_i & = & \text{spatial dimensions} \\ C_i & = & \text{channel dimension} \end{array} } $$
Then, a whole ConvNet $\eta$ is the composition of its layers: $$ \eta = F_k \odot ... \odot F_2 \odot F_1(X_1) = \odot_{j=1...k} F_j(X_1) \tag{2}\label{eq2} $$
To be clear about the notation of equation $\eqref{eq2}$: despite the symbol, $\odot$ here does not denote an element-wise (Hadamard) product of matrices; in the paper it denotes layer composition. That is, $F_2 \odot F_1(X_1)$ simply means $F_2(F_1(X_1))$: the output of layer 1 is fed into layer 2.
In a network, the layers are often repeated $L_i$ times in each stage $i$, thus the general formula of $\eta$ over $s$ stages: $$\eta = \odot_{i=1...s} F_i^{L_i}\big(X_{\langle H_i, W_i, C_i \rangle}\big) \tag{3}\label{eq3}$$
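As a tiny illustration (my own sketch, not from the paper), the stage-wise composition in equation $\eqref{eq3}$ behaves like ordinary function composition in Python:

```python
# Minimal sketch: a "network" as the composition of per-stage functions,
# mirroring eta = F_s(...F_2(F_1(X))...) from equation (3).
from functools import reduce

def compose(*stages):
    """Apply stages left to right: compose(f1, f2)(x) == f2(f1(x))."""
    return lambda x: reduce(lambda acc, f: f(acc), stages, x)

# Stand-ins for the per-stage operators F_1, F_2, F_3.
stage1 = lambda x: x + 1
stage2 = lambda x: x * 2
stage3 = lambda x: x - 3

net = compose(stage1, stage2, stage3)
print(net(5))  # ((5 + 1) * 2) - 3 = 9
```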
So far, we know the formula that describes the architecture $F_i$. Most regular ConvNet designs focus on finding the best $F_i$; model scaling, in contrast, opts to change the network length ($L_i$), network width ($C_i$), and/or resolution ($H_i, W_i$) without changing the $F_i$ predefined in the baseline network. However, changing these factors still leaves a large design space to explore, since different $L_i, C_i, H_i, W_i$ can be chosen for each layer.
The creators of EfficientNet came up with an idea that restricts all layers to be scaled uniformly by a constant ratio. Their target is to maximize the model accuracy for any given resource constraints, which can be described as follows:
The yellow, orange, and blue rectangles are the scaling constants for the method, the white box then takes all required parameters to form the formula, and the red rectangles represent the target of applying model scaling. In the next section, we will discuss some difficulties with this optimization problem.
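For readers without the figure at hand, the underlying optimization problem, as formulated in the EfficientNet paper, is:
$$ \begin{aligned} \max_{d, w, r} \quad & Accuracy\big(\eta(d, w, r)\big) \\ \text{s.t.} \quad & \eta(d, w, r) = \odot_{i=1...s} \hat{F}_i^{d \cdot \hat{L}_i}\big(X_{\langle r \cdot \hat{H}_i,\ r \cdot \hat{W}_i,\ w \cdot \hat{C}_i \rangle}\big) \\ & \text{Memory}(\eta) \le \text{target memory} \\ & \text{FLOPS}(\eta) \le \text{target FLOPS} \end{aligned} $$
where $\hat{F}_i, \hat{L}_i, \hat{H}_i, \hat{W}_i, \hat{C}_i$ are the predefined parameters of the baseline network, and $d, w, r$ are the coefficients for scaling depth, width, and resolution.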
Scaling Dimensions
As we know from the formula in Figure 2, choosing the optimal parameters d, w, r is quite tricky, since they depend on each other and their best values may change under different resource constraints. Therefore, conventional methods like Figure 1 (a), (b), (c), (d) mostly scale ConvNets in just one of these dimensions. Here is a table that sums up the performance of each approach:
Dimension | Pros | Cons
---|---|---
Depth (d) | Deeper networks capture richer and more complex features, and generalize well to other tasks | Harder to train due to vanishing gradients; the accuracy gain diminishes for very deep networks
Width (w) | Wider networks capture more fine-grained features and are easier to train | Wide but shallow networks have difficulty capturing higher-level features; accuracy saturates quickly as width grows
Resolution (r) | Higher-resolution inputs let the network capture more fine-grained patterns | The accuracy gain diminishes at very high resolutions, while FLOPS grows quadratically
From their observation, the researchers conclude that "scaling up any dimension of network width, depth, or resolution improves accuracy, but the accuracy gain diminishes for bigger models" (EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, 2020, p.4).
Compound Scaling
In a later experiment, the researchers pointed out that balancing the scaling ratios of different dimensions works better than conventional single-dimension scaling. To validate this point, they applied width scaling under different network depths and resolutions. Results show that width scaling without changing depth (d) and resolution (r) causes the accuracy to saturate quickly.
However, with deeper d and higher resolution r, width scaling achieves better accuracy while maintaining the same cost in floating-point operations (FLOPS). As a result, they gave a second observation: "In order to pursue better accuracy and efficiency, it is critical to balance all dimensions of network width, depth, and resolution during ConvNet scaling." (EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, 2020, p.4).
As said, the creators of EfficientNet proposed a new compound scaling method, which uses a coefficient ϕ to uniformly scale network depth, width, and resolution: $$ \begin{aligned} \text{depth: } & d = \alpha^\phi \\ \text{width: } & w = \beta^\phi \\ \text{resolution: } & r = \gamma^\phi \end{aligned} \tag{4}\label{eq4} $$
The coefficient ϕ determines how many more resources will be used for the model scaling operation, and it is specified by the user. Meanwhile, α, β, γ specify how to assign these extra resources to the network depth, width, and resolution.
What is more notable is that increasing d, w, r also increases the FLOPS of a convolution operation, in proportion to d, $w^2$, and $r^2$. Specifically, doubling the depth doubles the FLOPS, while doubling the width or resolution quadruples it. As convolution operations dominate the computation in ConvNets, scaling the model with equation $\eqref{eq4}$ will increase the total FLOPS by roughly $(\alpha \cdot \beta^2 \cdot \gamma^2)^\phi$. The researchers opted to limit this growth by putting a constraint $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$, thus the total FLOPS will increase by approximately $2^\phi$ for any given ϕ.
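As a quick sanity check, here is a small sketch using the grid-searched coefficients the paper reports for EfficientNet-B0 (α = 1.2, β = 1.1, γ = 1.15); everything else is purely illustrative:

```python
# Compound scaling sketch with the coefficients from the EfficientNet paper
# (alpha=1.2, beta=1.1, gamma=1.15). They satisfy alpha * beta^2 * gamma^2
# ~= 1.92 ~= 2, so FLOPS roughly doubles per unit of phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for a given phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

for phi in range(4):
    d, w, r = compound_scale(phi)
    flops_factor = (ALPHA * BETA**2 * GAMMA**2) ** phi  # ~= 2**phi
    print(f"phi={phi}: d={d:.3f} w={w:.3f} r={r:.3f} FLOPS x{flops_factor:.2f}")
```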
Model Architecture
The researchers decided to design a new mobile-size baseline called EfficientNet to evaluate their scaling method. They adapted the multi-objective neural architecture search inspired by MnasNet (Mingxing Tan et al. 2019) to develop a baseline that optimizes both accuracy and FLOPS. The search uses an optimization objective involving the model's accuracy, its FLOPS, a target FLOPS, and a hyperparameter for controlling the trade-off between accuracy and FLOPS.
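For reference, the search objective as reported in the EfficientNet paper is:
$$ ACC(m) \times \left[ \frac{FLOPS(m)}{T} \right]^{w} $$
where $ACC(m)$ and $FLOPS(m)$ denote the accuracy and FLOPS of model $m$, $T$ is the target FLOPS, and $w = -0.07$ is the hyperparameter that controls the trade-off between accuracy and FLOPS.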
From the beginning, they completed the baseline EfficientNet-B0, and then applied the proposed compound scaling method to scale it up with two steps:
- Step 1: fix ϕ = 1, assuming twice more resources are available, and do a small grid search for α, β, γ based on equation $\eqref{eq4}$ and the constraint $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$; for EfficientNet-B0, the best found values are α = 1.2, β = 1.1, γ = 1.15
- Step 2: fix α, β, γ as constants and scale up the baseline network with different ϕ, obtaining EfficientNet-B1 to B7
Performance experiments
First, the researchers evaluated their scaling method on existing ConvNets such as MobileNets and ResNet, using the ImageNet database. The results showed that their compound scaling method, compared with single-dimension scaling methods, improved the accuracy on all evaluated models, proving its effectiveness.
Second, they trained the EfficientNet models on the ImageNet database using an RMSProp optimizer configured similarly to MnasNet, along with other customized settings. EfficientNet achieved much better accuracy than GPipe and ConvNets such as ResNet, while being dramatically smaller in size and computationally cheaper (fewer FLOPS) than those models. One example is that EfficientNet-B3 achieves higher accuracy than ResNeXt-101 using 18x fewer FLOPS.
Achievements
The creators of EfficientNet and the compound scaling method studied and proposed a solution that balances the network width, depth, and resolution in a more principled way for any given resource constraints. Based on this, they developed a mobile-size EfficientNet model that can be scaled effectively to different sizes, achieving state-of-the-art accuracy with far fewer parameters and FLOPS. Their work was evaluated on ImageNet and five other commonly used datasets, proving its effectiveness and influence.
EfficientNetV2
A year later, the same creators, Mingxing Tan and Quoc V. Le, proposed EfficientNetV2, a new family of convolutional networks and an upgraded version of EfficientNet, with faster training speed and better parameter efficiency compared to the previous models.
The researchers use a combination of training-aware Neural Architecture Search (NAS) and scaling. They indicated that the training process can be sped up by progressively increasing the image size, but this may result in a drop in accuracy. Therefore, they proposed a new method of progressive learning, which adaptively adjusts regularization along with image size. EfficientNetV2 achieved better accuracy than previous models and trained significantly faster while using the same computational resources (FLOPS).
Progressive learning? That sounds cool though.
NAS & Model architecture
Before heading to the explanation of progressive learning, we will have a look at the NAS and the model structure of EfficientNetV2.
The researchers developed a training-aware NAS framework, with EfficientNet as the backbone, that optimizes accuracy, parameter efficiency, and training efficiency. The NAS uses a search space consisting of multiple design choices: convolutional operation types, number of layers, kernel sizes, and expansion ratios.
In short, the researchers will use the proposed training-aware NAS to search for the best combinations of architecture design in order to improve the training speed.
As mentioned, the search space includes many options, so the researchers removed unnecessary ones, such as pooling skip operations, since they were not used in the original EfficientNet. This leads to a smaller search space, on which they can then apply reinforcement learning. In addition, they reuse the same channel sizes from the backbone used in the EfficientNet study.
They came up with the first version, EfficientNetV2-S (S stands for Small). The major differences between EfficientNetV2 and the backbone EfficientNet are:
- EfficientNetV2 uses both MBConv, from MobileNetV2 and EfficientNet, and the newly added Fused-MBConv (see the sketch after this list)
- EfficientNetV2 prefers smaller expansion ratios for MBConv, since smaller expansion ratios tend to have less memory access overhead
- EfficientNetV2 prefers smaller 3x3 kernel sizes, but adds more layers to compensate for the reduced receptive field resulting from the smaller kernels
- EfficientNetV2 completely removes the last stride-1 stage in the original EfficientNet
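To make the first difference more concrete, below is a minimal sketch of the two block types in tf.keras. This is my own simplification, not the official implementation: the squeeze-and-excitation block of MBConv is omitted for brevity, and the hyperparameters are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

def mbconv(x, out_ch, expand=4, stride=1):
    """MBConv: 1x1 expansion -> 3x3 depthwise conv -> 1x1 projection."""
    mid = x.shape[-1] * expand
    h = layers.Conv2D(mid, 1, padding="same", use_bias=False)(x)
    h = layers.BatchNormalization()(h)
    h = layers.Activation("swish")(h)
    h = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    h = layers.Activation("swish")(h)
    h = layers.Conv2D(out_ch, 1, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    if stride == 1 and x.shape[-1] == out_ch:
        h = layers.Add()([x, h])  # residual connection
    return h

def fused_mbconv(x, out_ch, expand=4, stride=1):
    """Fused-MBConv: the expansion and depthwise convs are fused into one regular 3x3 conv."""
    mid = x.shape[-1] * expand
    h = layers.Conv2D(mid, 3, strides=stride, padding="same", use_bias=False)(x)
    h = layers.BatchNormalization()(h)
    h = layers.Activation("swish")(h)
    h = layers.Conv2D(out_ch, 1, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    if stride == 1 and x.shape[-1] == out_ch:
        h = layers.Add()([x, h])
    return h

inputs = tf.keras.Input((224, 224, 24))
print(mbconv(inputs, 24).shape, fused_mbconv(inputs, 24).shape)
```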
As in their previous work, they also scale up EfficientNetV2-S to obtain the expanded EfficientNetV2-M/L using the compound scaling method. A small experiment was conducted to compare the training speed of EfficientNetV2 and other models without progressive learning. With training-aware NAS and scaling alone, the EfficientNetV2 model already trains faster than other recent models.
Progressive Learning
As the term progressive learning was mentioned in the EfficientNetV2 section above, here we will investigate the effect of this proposed method further.
Basically, some prior works dynamically change image sizes during training, which may cause a drop in accuracy. This likely comes from unbalanced regularization: the regularization strength stays fixed while the image size changes. Therefore, the researchers argue that instead of fixed regularization, we should adjust it according to the image size.
Since the procedure of progressive learning is hard to describe in a detailed mathematical way, I will reuse a statement from their paper that briefly demonstrates its mechanism: "...in the early training epochs, we train the network with smaller images and weak regularization, such that the network can learn simple representations easily and fast. Then, we gradually increase image size but also making learning more difficult by adding stronger regularization." Some of the types of regularization used are Dropout, RandAugment, and Mixup.
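Here is a rough sketch of that schedule. This is my own paraphrase with illustrative numbers: the paper interpolates image size and regularization strength linearly across a number of training stages, and `train_one_stage` is a hypothetical helper:

```python
def progressive_schedule(stage, num_stages, size_min=128, size_max=300,
                         dropout_min=0.1, dropout_max=0.3):
    """Return (image_size, dropout_rate) for a training stage: both grow linearly."""
    t = stage / max(num_stages - 1, 1)  # progress from 0.0 to 1.0
    image_size = int(size_min + t * (size_max - size_min))
    dropout = dropout_min + t * (dropout_max - dropout_min)
    return image_size, dropout

for stage in range(4):
    size, drop = progressive_schedule(stage, 4)
    print(f"stage {stage}: image_size={size}, dropout={drop:.2f}")
    # train_one_stage(model, image_size=size, dropout=drop)  # hypothetical helper
```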
ImageNet1k and ImageNet21k
ImageNet1k is a dataset containing about 1.28M training images and 50,000 validation images with 1,000 classes. Meanwhile, ImageNet21k (the full ImageNet, Fall 2011 release) contains about 13M training images with 21,841 classes. The researchers used ImageNet21k to pretrain EfficientNetV2, followed by fine-tuning it on ImageNet1k using cosine learning rate decay. In the end, the EfficientNetV2 pretrained on ImageNet21k and fine-tuned on ImageNet1k improved the accuracy while using 2.5 times fewer parameters and 3.6 times fewer FLOPS, and running 6-7 times faster.
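For intuition, here is a minimal sketch of this fine-tuning recipe in tf.keras, assuming TF 2.8+ where `keras.applications` ships EfficientNetV2. The dataset objects, input size, and hyperparameters below are placeholders of my own, not the paper's exact settings:

```python
import tensorflow as tf

NUM_CLASSES = 1000  # ImageNet1k classes

# Backbone with pretrained weights (stand-in for ImageNet21k pretraining).
base = tf.keras.applications.EfficientNetV2S(
    include_top=False, weights="imagenet",
    input_shape=(300, 300, 3), pooling="avg")
base.trainable = True  # fine-tune the whole network

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

# Cosine learning-rate decay, as mentioned for the ImageNet1k fine-tuning.
lr = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-3, decay_steps=10_000)
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=lr),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=15)  # placeholder datasets
```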
Conclusion
In conclusion, EfficientNetV2 surpasses earlier models while being much quicker and more parameter-efficient, thanks to training-aware NAS and model scaling. The researchers presented an enhanced approach, progressive learning, that raises image size and regularization strength simultaneously during training to speed up the process even more. Extensive testing shows that EfficientNetV2 performs well on ImageNet and CIFAR/Flowers/Cars.
In addition, EfficientNetV2 trains up to 11 times quicker while being up to 6.8 times smaller than EfficientNet and other recent models. They also scaled up the baseline EfficientNetV2-S into larger models like EfficientNetV2-M/L, and scaled down into smaller sizes like EfficientNetV2-B0/B1/B2/B3 to compare with the original EfficientNet variants, along with some pretrained and fine-tuned versions.
Personal comment
As for myself, the problem and the competition were very interesting and gave me a lot of experience as well as practice. I have learned a lot from carrying out the graduation assignment.
EfficientNet and EfficientNetV2 are indeed interesting architectures with impressive mechanisms and methods integrated into them. As said, they would be a good experience, or even a good choice, for specific problems where you need faster training time and decent accuracy under limited computational resources.
Acknowledgements
I deeply thank Quan Thanh Tho, Noro Hiroyoshi, Mouhamed Diop, Jules Ntaganda, Iradukunda Peter Yves, Cedrick Justin, and the other mentors and staff of DIVE INTO CODE for offering this course and for their support throughout the lessons. Without their help, I could not have carried out the work nor written this blog by myself.
References
- Saining Xie et al. Aggregated Residual Transformations for Deep Neural Networks. 2017. arXiv: 1611.05431 [cs.CV].
- Mark Sandler et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks. 2019. arXiv: 1801.04381 [cs.CV].
- Mingxing Tan et al. MnasNet: Platform-Aware Neural Architecture Search for Mobile. 2019. arXiv: 1807.11626 [cs.CV].
- Suyog Gupta and Berkin Akin. Accelerator-aware Neural Network Design using AutoML. 2020. arXiv: 2003.02838 [eess.SP].
- Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. 2020. arXiv: 1905.11946 [cs.LG].
- Mingxing Tan and Quoc V. Le. EfficientNetV2: Smaller Models and Faster Training. 2021. arXiv: 2104.00298 [cs.CV].