【EdgeYOLO】《EdgeYOLO: An Edge-Real-Time Object Detector》


Liu S, Zha J, Sun J, et al. EdgeYOLO: An edge-real-time object detector[C]//2023 42nd Chinese Control Conference (CCC). IEEE, 2023: 7507-7512.

CCC-2023

Source code: https://github.com/LSH9832/edgeyolo

Paper: https://arxiv.org/pdf/2302.07483

Contents

1、Background and Motivation
2、Related Work
3、Advantages / Contributions
4、Method
- 4.1、Enhanced-Mosaic & Mixup
- 4.2、Lite-Decoupled Head
- 4.3、Staged Loss Function
5、Experiments
- 5.1、Datasets and Metrics
- 5.2、Results & Comparison
- 5.3、Ablation Study
- 5.4、Tricks for Edge Computing Devices
6、Conclusion（own） / Future work

1、Background and Motivation

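Demand for edge computing devices is growing.
Limitations of existing object detectors: traditional two-stage detectors (such as the R-CNN family) achieve good accuracy, but their complex structure and high computational cost make real-time operation on edge devices difficult. Lightweight one-stage detectors (such as MobileNet- and ShuffleNet-based ones) can run on edge devices, but usually at the cost of accuracy.
Evolution of the YOLO series: accuracy keeps improving with each new version, but real-time performance on edge devices is hard to guarantee.
Small object detection remains challenging.
When designing and evaluating a detector, the whole detection task should be considered, including pre-processing, model inference and post-processing time, to ensure truly real-time performance on edge devices.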

This paper proposes an efficient, low-complexity and anchor-free object detector based on the state-of-the-art YOLO framework, which can be implemented in real time on edge computing platforms.

2、Related Work

Anchor-free Object Detector
- anchor-point-based (this paper)
- keypoint-based
Data Augmentation
- geometric augmentation
- photometric augmentation (e.g. HSV and brightness adjustment)
Model Reduction
- lossy reduction (builds smaller networks)
- lossless reduction (e.g. re-parameterizing techniques)
Decoupled Regression
- Different tasks use the same convolution kernel if they are closely related. However, relations between the object's location, confidence and category are not close enough in numerical logic
- Advantage: accelerates loss convergence
- Disadvantage: brings extra inference cost
Small Object Detecting Optimization
- Small objects carry limited information
- Small objects always account for a smaller proportion of the total loss during training
- Remedies: (1) replication augmentation, (2) zooming and splicing (large objects are scaled down into small ones, which raises the share of small objects), (3) loss function design
- Drawbacks of remedy (1): scale mismatch and background mismatch. The authors explore (2) and (3)

3、Advantages / Contributions

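An anchor-free object detector, EdgeYOLO, is designed.
A more powerful data augmentation method is proposed (it ensures the quantity and validity of training data).
A lightweight decoupled head is designed, and structures that can be re-parameterized are used (to reduce inference time).
A loss function is designed to improve the precision on small objects.
Strong results are achieved on public datasets.
The code and model weights are open-sourced.
Optimization tricks such as a multi-process/multi-thread computing architecture further improve EdgeYOLO's real-time performance on edge devices.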

4、Method

4.1、Enhanced-Mosaic & Mixup

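This is still a combination of mosaic and mixup: the authors split the images into groups for mosaic and then apply mixup, with group = 2 (the group number can be set according to the richness of the average number of labels in a single picture in the dataset).

From the paper's description I did not fully grasp the authors' idea; the example they give only shows a difference in the number of objects per image.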


Does it increase the number of images used by mosaic, for example from 4 to 8?
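One possible reading of "grouped mosaic, then mixup" can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: each group of four images is stitched into a plain 2x2 mosaic (no random scaling, cropping or label handling), and the per-group mosaics are then blended with uniform mixup weights. The helper names `mosaic`, `mixup` and `enhanced_mosaic_mixup` are hypothetical.

```python
import random
from typing import List

Image = List[List[float]]  # tiny grayscale image stored as rows of pixel values


def mosaic(tiles: List[Image]) -> Image:
    """Stitch four equally sized tiles into a 2x2 grid (no random scaling or cropping)."""
    if len(tiles) != 4:
        raise ValueError("a mosaic group needs exactly four tiles")
    top = [left + right for left, right in zip(tiles[0], tiles[1])]
    bottom = [left + right for left, right in zip(tiles[2], tiles[3])]
    return top + bottom


def mixup(images: List[Image]) -> Image:
    """Blend equally sized images with uniform weights."""
    height, width = len(images[0]), len(images[0][0])
    return [
        [sum(img[y][x] for img in images) / len(images) for x in range(width)]
        for y in range(height)
    ]


def enhanced_mosaic_mixup(pool: List[Image], groups: int = 2, seed: int = 0) -> Image:
    """Hypothetical helper: one mosaic per group, then mixup across the groups."""
    rng = random.Random(seed)
    picked = rng.sample(pool, 4 * groups)
    mosaics = [mosaic(picked[4 * g:4 * g + 4]) for g in range(groups)]
    return mixup(mosaics)


# Eight constant 2x2 "images" with pixel values 0..7 stand in for a dataset.
pool = [[[float(v)] * 2 for _ in range(2)] for v in range(8)]
mixed = enhanced_mosaic_mixup(pool, groups=2)
print(len(mixed), len(mixed[0]))  # 4 4
```

Under this reading, group = 2 means a single training sample draws on 8 source images instead of 4, so it carries more labels on average.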

4.2、Lite-Decoupled Head


The decoupled head of FCOS is made lighter, and two techniques are introduced: re-parameterization (some structures are merged at inference time) and implicit knowledge.

With the method of re-parameterizing, implicit representation layers are integrated into convolutional layers for lower inference costs.

Implicit knowledge comes from

Wang C Y, Yeh I H, Liao H Y M. You only learn one representation: Unified network for multiple tasks[J]. arXiv preprint arXiv:2105.04206, 2021.

YOLOv7 also adopts this technique

【YOLOv7】《YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors》
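To see why implicit layers add no inference cost once re-parameterized, here is a minimal sketch. It assumes the YOLOR-style form of implicit knowledge (a learned vector added before a 1x1 convolution and a learned vector multiplied after it) and looks at a single pixel, where a 1x1 convolution reduces to a matrix-vector product. The function names and the numbers are illustrative only and are not taken from the EdgeYOLO code.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def conv1x1(weight: Matrix, bias: Vector, x: Vector) -> Vector:
    """A 1x1 convolution at one pixel: matrix-vector product plus bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def head_training(weight: Matrix, bias: Vector, implicit_add: Vector,
                  implicit_mul: Vector, x: Vector) -> Vector:
    """Training-time form: add implicit vector, convolve, multiply implicit vector."""
    shifted = [xi + ai for xi, ai in zip(x, implicit_add)]
    out = conv1x1(weight, bias, shifted)
    return [mi * oi for mi, oi in zip(implicit_mul, out)]


def fuse(weight: Matrix, bias: Vector, implicit_add: Vector,
         implicit_mul: Vector) -> Tuple[Matrix, Vector]:
    """Fold both implicit vectors into an equivalent (weight, bias) pair."""
    # W(x + a) + b = Wx + (Wa + b): the additive vector is absorbed into the bias.
    bias_with_add = conv1x1(weight, bias, implicit_add)
    # m * (Wx + b') = (diag(m) W)x + m * b': the multiplier scales each output row.
    fused_weight = [[mi * w for w in row] for mi, row in zip(implicit_mul, weight)]
    fused_bias = [mi * bi for mi, bi in zip(implicit_mul, bias_with_add)]
    return fused_weight, fused_bias


# Illustrative numbers (2 input channels, 2 output channels).
weight = [[1.0, 2.0], [0.5, -1.0]]
bias = [0.1, -0.2]
implicit_add = [0.3, -0.4]
implicit_mul = [2.0, 0.5]
x = [1.5, 2.5]

train_out = head_training(weight, bias, implicit_add, implicit_mul, x)
fused_weight, fused_bias = fuse(weight, bias, implicit_add, implicit_mul)
infer_out = conv1x1(fused_weight, fused_bias, x)
print(train_out, infer_out)
```

Both outputs agree up to floating-point rounding, so at inference time only the single fused convolution needs to run.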

4.3、Staged Loss Function

Overall loss structure: the total loss combines the IOU loss, the classification loss, the object loss and a regulation loss $L_{\Delta}$.

The loss schedule is split into three stages, each with a different configuration.

Stage 1

gIOU loss is used as the IOU loss, Balanced Cross Entropy loss as the classification loss and object loss, and the regulation loss is set to 0.

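Stage 2

This stage covers the last few data-augmentation-enabled epochs.

The classification and object losses use the Hybrid-Random Loss, which appears to be the authors' own design; no reference is cited for it.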


It is a modification of the cross-entropy loss.

Stage 3

Data augmentation is closed.

The L1 loss is set as the regulation loss, and gIOU loss is replaced by cIOU loss.
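The three stages above can be summarized as an epoch-to-loss lookup. This is a hedged sketch rather than the repository's training loop: `max_epoch=300` and `close_aug_epochs=15` mirror `max_epoch` and `close_mosaic_epochs` in the default config, while `hybrid_epochs=30` is a made-up length for stage 2, and keeping the Hybrid-Random loss in stage 3 is an assumption. The short names `bce`, `hyb`, `giou` and `ciou` follow the `loss_use` comment in the config.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossConfig:
    stage: int
    iou_loss: str
    cls_obj_loss: str
    regulation_loss: str  # "none" means the regulation term is set to 0
    augmentation: bool


def loss_for_epoch(epoch: int, max_epoch: int = 300, close_aug_epochs: int = 15,
                   hybrid_epochs: int = 30) -> LossConfig:
    """Return the loss setup for a 0-based epoch index (hypothetical helper)."""
    stage3_start = max_epoch - close_aug_epochs
    stage2_start = stage3_start - hybrid_epochs  # hybrid_epochs is an assumed value
    if epoch >= stage3_start:
        # Stage 3: augmentation closed, L1 regulation loss, cIOU replaces gIOU.
        # Assumption: the Hybrid-Random loss from stage 2 is kept here.
        return LossConfig(3, "ciou", "hyb", "l1", False)
    if epoch >= stage2_start:
        # Stage 2: last few augmentation-enabled epochs, Hybrid-Random loss.
        return LossConfig(2, "giou", "hyb", "none", True)
    # Stage 1: gIOU plus cross-entropy-style losses, regulation loss set to 0.
    return LossConfig(1, "giou", "bce", "none", True)


for epoch in (0, 254, 255, 284, 285, 299):
    print(epoch, loss_for_epoch(epoch))
```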

5、Experiments

Network configuration parameters used for training

Default parameters

# models & weights------------------------------------------------------------------------------------------------------
model_cfg: "params/model/edgeyolo.yaml"              # model structure config file
weights: "output/train/edgeyolo_coco/last.pth"       # contains model_cfg, set null or a no-exist filename if not use it
use_cfg: false                                       # force using model_cfg instead of cfg in weights to build model

# output----------------------------------------------------------------------------------------------------------------
output_dir: "output/train/edgeyolo_coco"             # all train output file will save in this dir
save_checkpoint_for_each_epoch: true                 # save models for each epoch (epoch_xxx.pth, not only best/last.pth)
log_file: "log.txt"                                  # log file (in output_dir)

# dataset & dataloader--------------------------------------------------------------------------------------------------
dataset_cfg: "params/dataset/coco.yaml"              # dataset config
batch_size_per_gpu: 8                                # batch size for each GPU
loader_num_workers: 4                                # number data loader workers for each GPU
num_threads: 1                                       # pytorch threads number for each GPU

# device & data type----------------------------------------------------------------------------------------------------
device: [0, 1, 2, 3]                                 # training device list
fp16: false                                          # train with fp16 precision
cudnn_benchmark: false                               # it's useful when multiscale_range is set zero

# train hyper-params----------------------------------------------------------------------------------------------------
optimizer: "SGD"                                     # or Adam
max_epoch: 300                                       # or 400
close_mosaic_epochs: 15                              # close data augmentation at last several epochs

# learning rate---------------------------------------------------------------------------------------------------------
lr_per_img: 0.00015625                               # total_lr = lr_per_img * batch_size_per_gpu * len(devices)
warmup_epochs: 5                                     # warm-up epochs at the beginning of training
warmup_lr_ratio: 0.0                                 # warm-up learning rate start from value warmup_lr_ratio * total_lr
final_lr_ratio: 0.05                                 # final_lr_per_img = final_lr_ratio * lr_per_img

# training & dataset augmentation---------------------------------------------------------------------------------------
#      [cls_loss, conf_loss, iou_loss]
loss_use: ["bce", "bce", "giou"]  # bce: BCE loss. bcf: Balanced Focal loss. hyb: HR loss, iou, c/g/s iou is available
input_size: [640, 640]            # image input size for model
multiscale_range: 5               # real_input_size = input_size + randint(-multiscale_range, multiscale_range) * 32
weight_decay: 0.0005              # optimizer weight decay
momentum: 0.9                     # optimizer momentum
enhance_mosaic: true              # use enhanced mosaic method
use_ema: true                     # use EMA method
enable_mixup: true                # use mixup
mixup_scale: [0.5, 1.5]           # mixup image scale
mosaic_scale: [0.1, 2.0]          # mosaic image scale
flip_prob: 0.5                    # flip image probability
mosaic_prob: 1                    # mosaic probability
mixup_prob: 1                     # mixup probability
degrees: 10                       # maximum rotate degrees
hsv_gain: [0.0138, 0.664, 0.464]  # hsv gain ratio

# evaluate--------------------------------------------------------------------------------------------------------------
eval_at_start: false              # evaluate loaded model before training
val_conf_thres: 0.001             # confidence threshold when doing evaluation
val_nms_thres: 0.65               # NMS IOU threshold when doing evaluation
eval_only: false                  # do not train, run evaluation program only for all weights in output_dir
obj_conf_enabled: true            # use object confidence when doing inference
eval_interval: 1                  # evaluate interval epochs

# show------------------------------------------------------------------------------------------------------------------
print_interval: 100               # print result after every $print_interval iterations

# others----------------------------------------------------------------------------------------------------------------
load_optimizer_params: true       # load optimizer params when resume train, set false if there is an error.
train_backbone: true              # set false if you only want to train yolo head
train_start_layers: 51            # if not train_backbone, train from this layer, see params/models/edgeyolo.yaml
force_start_epoch: -1             # set -1 to disable this option
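As a quick check of what the defaults above imply, the following sketch evaluates the two formulas given in the config comments (`total_lr` and `real_input_size`). It assumes that the final learning rate scales with the batch size in the same way as `total_lr`, and that the same random offset is applied to both image sides; neither is stated explicitly in the config.

```python
# Values copied from the default config above.
lr_per_img = 0.00015625
batch_size_per_gpu = 8
devices = [0, 1, 2, 3]
final_lr_ratio = 0.05
input_size = (640, 640)
multiscale_range = 5

# total_lr = lr_per_img * batch_size_per_gpu * len(devices)
total_batch_size = batch_size_per_gpu * len(devices)
total_lr = lr_per_img * total_batch_size

# Assumption: the final learning rate is scaled by batch size like total_lr.
final_lr = final_lr_ratio * total_lr

# real_input_size = input_size + randint(-multiscale_range, multiscale_range) * 32
# Assumption: one offset is shared by width and height.
offsets = range(-multiscale_range, multiscale_range + 1)
candidate_sizes = [(input_size[0] + k * 32, input_size[1] + k * 32) for k in offsets]

print("images per step:", total_batch_size)
print("total lr (about 0.005):", total_lr)
print("final lr (about 0.00025):", final_lr)
print("smallest / largest training size:", candidate_sizes[0], candidate_sizes[-1])
```

With these defaults, multi-scale training draws square inputs between 480 and 800 pixels per side in steps of 32.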

5.1、Datasets and Metrics

VisDrone2019-DET dataset：https://github.com/VisDrone/VisDrone-Dataset
MS COCO2017

The metric is the COCO-style mAP.

5.2、Results & Comparison

The baseline is the ELAN-Darknet of YOLOv7.

The authors' method improves most noticeably on small objects.

The models for VisDrone are pre-trained on MS COCO2017-train.

FPS is measured on a Jetson AGX Xavier.

5.3、Ablation Study

（1）Decoupled head


The improved head is both faster and more accurate.

（2）Segmentation labels (poor effect)

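After rotation augmentation the bbox may no longer fit the object tightly (the rotated box is not parallel to the image borders, so its axis-aligned envelope grows). The authors use segmentation labels to help generate the bbox after rotation, so that it no longer contains extra invalid background information.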

When the data augmentation is enabled and the loss enters a stable decline phase, using segmentation labels can bring a significant increase by 2% - 3% AP.

Late in training, data augmentation is turned off and all labels become more accurate; even if the segmentation labels are not used, the final accuracy decreases only by about 0.04% AP (does this suggest that plain bbox labels are less accurate than segmentation-derived ones?)
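The effect of rotation on bbox tightness can be illustrated with a small geometric sketch. It is not the authors' code: the diamond-shaped polygon is a made-up object, and the 10-degree angle matches the `degrees: 10` default in the config. The sketch compares the axis-aligned box obtained by rotating the four bbox corners with the one obtained by rotating the segmentation polygon itself.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def rotate(points: List[Point], degrees: float, center: Point) -> List[Point]:
    """Rotate points counter-clockwise around a center."""
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx, cy = center
    return [
        (cx + (x - cx) * cos_t - (y - cy) * sin_t,
         cy + (x - cx) * sin_t + (y - cy) * cos_t)
        for x, y in points
    ]


def bounding_box(points: List[Point]) -> Box:
    """Tightest axis-aligned box around a set of points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def area(box: Box) -> float:
    return (box[2] - box[0]) * (box[3] - box[1])


# Hypothetical diamond-shaped object whose original bbox is the square [0, 4] x [0, 4].
polygon = [(2.0, 0.0), (4.0, 2.0), (2.0, 4.0), (0.0, 2.0)]
bbox_corners = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
center = (2.0, 2.0)

box_from_corners = bounding_box(rotate(bbox_corners, 10, center))
box_from_polygon = bounding_box(rotate(polygon, 10, center))
print("area from rotated bbox corners:", area(box_from_corners))
print("area from rotated polygon:", area(box_from_polygon))
```

The box derived from the rotated corners is larger than the original 4x4 box, while the one derived from the polygon stays tight, which is the extra "invalid background" the segmentation labels avoid. A diamond is a deliberately extreme shape; for objects that fill their bbox the gap is smaller.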

（3）Loss function


To sum up, a better precision can be obtained by using HR loss and cIOU loss in later training stages
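For reference, the two IOU losses compared here can be written in a few lines. The sketch below uses the standard published definitions of gIOU and cIOU for boxes given as (x1, y1, x2, y2); it is not taken from the EdgeYOLO repository, and it assumes boxes with positive width and height.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou_and_union(a: Box, b: Box) -> Tuple[float, float]:
    """Plain IOU and the union area of two boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union, union


def enclosing_size(a: Box, b: Box) -> Tuple[float, float]:
    """Width and height of the smallest box enclosing both boxes."""
    return max(a[2], b[2]) - min(a[0], b[0]), max(a[3], b[3]) - min(a[1], b[1])


def giou_loss(pred: Box, gt: Box) -> float:
    iou, union = iou_and_union(pred, gt)
    cw, ch = enclosing_size(pred, gt)
    enclose = cw * ch
    # gIOU penalizes the part of the enclosing box not covered by the union.
    return 1.0 - (iou - (enclose - union) / enclose)


def ciou_loss(pred: Box, gt: Box) -> float:
    iou, _ = iou_and_union(pred, gt)
    cw, ch = enclosing_size(pred, gt)
    # Squared centre distance, normalized by the squared enclosing diagonal.
    dx = (pred[0] + pred[2]) / 2 - (gt[0] + gt[2]) / 2
    dy = (pred[1] + pred[3]) / 2 - (gt[1] + gt[3]) / 2
    distance_term = (dx * dx + dy * dy) / (cw * cw + ch * ch)
    # Aspect-ratio consistency term.
    v = (4 / math.pi ** 2) * (
        math.atan((gt[2] - gt[0]) / (gt[3] - gt[1]))
        - math.atan((pred[2] - pred[0]) / (pred[3] - pred[1]))
    ) ** 2
    alpha = v / (1.0 - iou + v) if v > 0 else 0.0
    return 1.0 - iou + distance_term + alpha * v


pred, gt = (0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)
print("gIOU loss:", giou_loss(pred, gt))
print("cIOU loss:", ciou_loss(pred, gt))
```

cIOU adds explicit centre-distance and aspect-ratio terms, which is one plausible reason it suits the final stage when labels are no longer distorted by augmentation; the paper's ablation table, not this sketch, is the evidence for that choice.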

5.4、Tricks for Edge Computing Devices

（1）Input size adaptation.

Training uses 640x640; at deployment the input size is adapted to the device's aspect ratio, 4:3 or 16:9, which gives a significant speed-up.
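A rough way to see where the speed-up comes from: convolutional cost scales approximately with the number of input pixels, so a non-square input that matches the camera's aspect ratio processes fewer pixels than a 640x640 letterboxed one. The sketch below assumes a maximum network stride of 32 (the same step used by `multiscale_range` in the config) and uses hypothetical helper names; real speed-ups depend on the device and runtime.

```python
import math
from typing import Tuple


def adapted_input_size(long_side: int, aspect_w: int, aspect_h: int,
                       stride: int = 32) -> Tuple[int, int]:
    """Return (width, height) for a landscape aspect ratio.

    The short side is rounded up to a multiple of the stride so that
    every feature map keeps integer dimensions.
    """
    short_side = long_side * aspect_h / aspect_w
    aligned = math.ceil(short_side / stride) * stride
    return long_side, aligned


def relative_pixels(size: Tuple[int, int], reference: Tuple[int, int] = (640, 640)) -> float:
    """Pixel count of `size` relative to the reference input."""
    return (size[0] * size[1]) / (reference[0] * reference[1])


for aspect in ((4, 3), (16, 9)):
    size = adapted_input_size(640, *aspect)
    print(aspect, size, relative_pixels(size))
```

With a long side of 640, this gives 640x480 for 4:3 (75% of the pixels) and 640x384 for 16:9 (60% of the pixels).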


（2）Multi-process & multi-thread computing architecture

Multiple threads or processes are used to speed up the three stages of running the network:

pre-process, model input and post-process

achieve about 8%-14% FPS increase.
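The idea of overlapping the three stages can be sketched with standard-library threads and queues. This is a generic producer-consumer pipeline, not the architecture from the EdgeYOLO repository; the stage functions passed in are placeholders, and `None` is reserved as an end-of-stream marker, so frames and stage results must not be `None`.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List


def stage(worker: Callable[[Any], Any], inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Run one pipeline stage until the end-of-stream marker (None) arrives."""
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)  # pass the marker on to the next stage
            break
        outbox.put(worker(item))


def run_pipeline(frames: Iterable[Any], preprocess: Callable, infer: Callable,
                 postprocess: Callable, depth: int = 4) -> List[Any]:
    """Process frames through three concurrently running stages, preserving order."""
    q_raw, q_tensor, q_pred = (queue.Queue(maxsize=depth) for _ in range(3))
    q_done: queue.Queue = queue.Queue()  # unbounded, so the last stage never blocks
    threads = [
        threading.Thread(target=stage, args=(preprocess, q_raw, q_tensor)),
        threading.Thread(target=stage, args=(infer, q_tensor, q_pred)),
        threading.Thread(target=stage, args=(postprocess, q_pred, q_done)),
    ]
    for t in threads:
        t.start()
    for frame in frames:
        q_raw.put(frame)  # blocks when the first queue is full (back-pressure)
    q_raw.put(None)
    for t in threads:
        t.join()
    results = []
    while True:
        item = q_done.get()
        if item is None:
            break
        results.append(item)
    return results


# Placeholder stage functions standing in for resize, model forward and NMS.
results = run_pipeline(range(20), lambda x: x + 1, lambda x: x * 2, lambda x: x - 1)
print(results[:5])
```

While frame n is in post-processing, frame n+1 can already be in the model and frame n+2 in pre-processing. In CPython, threads only overlap when the stages release the GIL (for example GPU inference or native image operations); otherwise separate processes are the usual alternative, which matches the "multi-process & multi-thread" wording above.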

Visualized results


6、Conclusion（own） / Future work

pre-process, model inference and post-process
edge computing device
time latency in post-processing is almost proportional to the number of anchors of each grid cell
Decouple: however, relations between the object's location, confidence and category are not close enough in numerical logic
Multi-process & multi-thread computing architecture
we believe that the framework can be extended to other pixel level recognition tasks such as instance segmentation
Jetson AGX Xavier
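The note above on post-processing latency and anchors per grid cell can be made concrete by counting candidate boxes. The sketch assumes the usual three YOLO output strides (8, 16, 32), which these notes do not state explicitly, and a hypothetical helper name.

```python
from typing import Sequence


def num_predictions(width: int, height: int, strides: Sequence[int] = (8, 16, 32),
                    anchors_per_cell: int = 1) -> int:
    """Number of candidate boxes that post-processing has to filter."""
    cells = sum((width // s) * (height // s) for s in strides)
    return anchors_per_cell * cells


print(num_predictions(640, 640))                      # 8400 (anchor-free, one box per cell)
print(num_predictions(640, 640, anchors_per_cell=3))  # 25200 (three anchors per cell)
print(num_predictions(640, 384))                      # 5040 (16:9-adapted input)
```

An anchor-free head with one prediction per cell therefore hands one third as many candidates to confidence filtering and NMS as a three-anchor head at the same input size.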