ShuffleNetV2-Yolov5: a lighter, faster, easier-to-deploy yolov5

[GiantPandaCV Intro] As part of my graduation project, I recently ran a series of ablation experiments on yolov5 to make it lighter (smaller Flops, lower memory footprint, fewer parameters), faster (adding channel shuffle and pruning the channels of the yolov5 head; at a 320 input_size it runs at least 10 FPS on a Raspberry Pi 4B), and easier to deploy (removing the Focus layer and its four slice operations, keeping the accuracy drop after quantization within an acceptable range). Copyright belongs to GiantPandaCV; do not reproduce without permission.
1. Ablation study comparison
| ID | Model | Input_size | Flops | Params | Size (MB) | [email protected] | [email protected]:0.95 |
|---|---|---|---|---|---|---|---|
| 001 | yolo-faster | 320×320 | 0.25G | 0.35M | 1.4 | 24.4 | - |
| 002 | nanodet-m | 320×320 | 0.72G | 0.95M | 1.8 | - | 20.6 |
| 003 | shufflev2-yolov5 | 320×320 | 1.43G | 1.62M | 3.3 | 35.5 | - |
| 004 | nanodet-m | 416×416 | 1.2G | 0.95M | 1.8 | - | 23.5 |
| 005 | shufflev2-yolov5 | 416×416 | 2.42G | 1.62M | 3.3 | 40.5 | 23.5 |
| 006 | yolov4-tiny | 416×416 | 5.62G | 8.86M | 33.7 | 40.2 | 21.7 |
| 007 | yolov3-tiny | 416×416 | 6.96G | 6.06M | 23.0 | 33.1 | 16.6 |
| 008 | yolov5s | 640×640 | 17.0G | 7.3M | 14.2 | 55.4 | 36.7 |
Note: the FLOPS script shipped with yolov5 has a bug; compute FLOPs with the thop library instead:

```python
import torch
import thop

input = torch.randn(1, 3, 416, 416)
flops, params = thop.profile(model, inputs=(input,))  # model: your loaded network
print('GFLOPs:', flops / 1E9 * 2)  # thop counts MACs; multiply by 2 for FLOPs
print('params:', params)
```
2. Detection results






3. Related Work
The network structure of shufflev2-yolov5 is actually very simple: the backbone mainly uses shuffle blocks (with channel shuffle), while the head is still the yolov5 head, just a cut-down version of it.
shuffle block:

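For reference, the channel shuffle at the heart of the shuffle block can be sketched in a few self-contained lines (a minimal illustration, not the project's exact code):

```python
import torch

def channel_shuffle(x, groups):
    # reshape (N, C, H, W) -> (N, g, C//g, H, W), swap the two channel axes,
    # then flatten back: channels from different groups become interleaved,
    # so information can flow across the groups of a group convolution
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

x = torch.arange(8.).view(1, 8, 1, 1)  # channels 0..7, i.e. two groups of 4
print(channel_shuffle(x, 2).flatten().tolist())  # [0.0, 4.0, 1.0, 5.0, 2.0, 6.0, 3.0, 7.0]
```

Note that the shuffle is a pure memory permutation: no FLOPs beyond the reshape/transpose, which is exactly why it is so cheap on edge devices.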
yolov5 head:

yolov5 backbone:
In the original Ultralytics yolov5 backbone, the author uses four slice operations at the top of the feature-extraction stack to form the Focus layer.

The Focus layer takes every 2×2 group of adjacent pixels in the image and produces a feature map with 4× the channels, which is similar to downsampling the input four phase-shifted ways and concatenating the results. Its main purpose is to reduce parameters and speed the model up without weakening its feature-extraction ability.
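The four slices are easy to reproduce; the sketch below shows just the space-to-depth step (shapes only, not the full yolov5 module, which also applies a conv afterwards):

```python
import torch

def focus_slice(x):
    # take every other pixel in four phase-shifted patterns and stack them
    # on the channel axis: (N, C, H, W) -> (N, 4C, H/2, W/2)
    return torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                      x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)

x = torch.randn(1, 3, 640, 640)
print(focus_slice(x).shape)  # torch.Size([1, 12, 320, 320])
```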
```
1.7.0+cu101 cuda _CudaDeviceProperties(name='Tesla T4', major=7, minor=5, total_memory=15079MB, multi_processor_count=40)
Params FLOPS forward (ms) backward (ms) input output
7040 23.07 62.89 87.79 (16, 3, 640, 640) (16, 64, 320, 320)
7040 23.07 15.52 48.69 (16, 3, 640, 640) (16, 64, 320, 320)

1.7.0+cu101 cuda _CudaDeviceProperties(name='Tesla T4', major=7, minor=5, total_memory=15079MB, multi_processor_count=40)
Params FLOPS forward (ms) backward (ms) input output
7040 23.07 11.61 79.72 (16, 3, 640, 640) (16, 64, 320, 320)
7040 23.07 12.54 42.94 (16, 3, 640, 640) (16, 64, 320, 320)
```
As the profiling results above show, the Focus layer does speed the model up while also reducing parameters.
But! This speedup comes with a caveat: it only shows up when running on a GPU. For cloud deployment, cache usage is rarely a concern for a GPU; the fetch-and-process style makes the Focus layer work very well on GPU devices.
For edge chips, however, especially ones without GPU or NPU acceleration, the frequent slice operations only inflate cache usage and add to the processing burden. On top of that, converting the Focus layer during chip deployment is extremely unfriendly to newcomers.
4. The lightweight design philosophy
The design philosophy of shufflenetv2 offers plenty of guidance for resource-constrained chips. It proposes four guidelines for lightweight models:
(G1) Equal channel widths minimize memory access cost (MAC)
(G2) Excessive group convolution increases MAC
(G3) Network fragmentation (especially multi-path designs) reduces parallelism
(G4) Element-wise operations (such as shortcut and Add) are not negligible
The shufflev2-yolov5 design decisions:
(G1) Remove the Focus layer, avoiding repeated slice operations
(G2) Avoid stacking C3 Layers, especially high-channel C3 Layers
The C3 Layer is the YOLOv5 author's improved version of the CSPBottleneck: simpler, faster, and lighter, achieving better results at nearly the same cost. But the C3 Layer uses multi-path separated convolutions, and testing shows that frequent use of C3 Layers, and of C3 Layers with high channel counts, eats up a lot of cache and slows inference down.
(Why are high-channel C3 Layers unfriendly to CPUs? It mostly comes back to shufflenetv2's guideline G1: the higher the channel count, the larger the jump between the hidden channels and c1/c2. A not-quite-apt analogy: picture jumping one step versus ten steps of a staircase. Jumping ten steps gets you there in one go, but you need a run-up, adjustment, and wind-up, and it may well take longer overall.)
```python
import torch.nn as nn
# Conv and Bottleneck are yolov5 modules (see models/common.py)

class C3(nn.Module):
    # CSP Bottleneck with 3 convolutions
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super(C3, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1)
        self.m = nn.Sequential(*[Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)])
        # self.m = nn.Sequential(*[CrossConv(c_, c_, 3, 1, g, 1.0, shortcut) for _ in range(n)])
```
(G3) Prune the channels of the yolov5 head, with the pruning rules following guideline G1
(G4) Remove the 1024-dim conv and the 5×5 pooling from the shufflenetv2 backbone
These modules were designed for ImageNet leaderboard runs. Real business scenarios rarely involve that many classes, so they can safely be removed: accuracy barely suffers, while speed improves substantially, as the ablation study also confirms.
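As a rough sketch (toy stand-in stages, hypothetical module names, not the project's real backbone), dropping the ImageNet tail simply means the backbone ends at its last stage and hands multi-scale features straight to the detection head:

```python
import torch
import torch.nn as nn

class DetBackbone(nn.Module):
    # keeps only the feature stages: no 1024-dim 1x1 conv, no 5x5 global
    # pooling, no fc classifier; the detection head consumes stage outputs
    def __init__(self, stages):
        super().__init__()
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # multi-scale feature maps for the head

# toy stages just to show the shapes flowing out
stages = [nn.Conv2d(3, 24, 3, 2, 1), nn.Conv2d(24, 48, 3, 2, 1), nn.Conv2d(48, 96, 3, 2, 1)]
outs = DetBackbone(stages)(torch.randn(1, 3, 64, 64))
print([tuple(o.shape) for o in outs])  # [(1, 24, 32, 32), (1, 48, 16, 16), (1, 96, 8, 8)]
```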
5. What can it be used for?
(G1) Training
Okay, that one is a bit of a truism... shufflev2-yolov5's ablations were run on top of yolov5 release v5.0 (i.e. the latest version), so you can keep using every v5.0 feature without modification, for example:
Exporting heat maps:
Exporting a confusion matrix for data analysis:
Exporting PR curves:
(G2) Export to onnx with no further modification (deployment-wise)
(G3) DNN or ort calls no longer need extra concatenation for the Focus layer (this is where I was stuck for a long time with yolov5: it could be called, but accuracy dropped quite a bit):
(G4) ncnn int8 quantization keeps accuracy intact (covered in the next post)
(G5) Real-time yolov5 on a 0.1-TOPS Raspberry Pi
Running yolov5 on a Raspberry Pi used to be unthinkable: a single frame took around 1000 ms, and even at 160*120 input it still needed around 200 ms. It simply couldn't be done.
But shufflev2-yolov5 manages it now. My graduation project targets scenes such as elevator cars and stairwell corners, where an effective detection range of 3 m is enough. With the resolution set to 160*120, shufflev2-yolov5 reaches up to 18 FPS, and stays around a stable 15 FPS including post-processing.
Excluding the first three warm-up runs, with the device temperature stable above 45°C and ncnn as the forward-inference framework, here are two recorded benchmarks for comparison:
```
# Run 4
pi@raspberrypi:~/Downloads/ncnn/build/benchmark $ ./benchncnn 8 4 0
loop_count = 8
num_threads = 4
powersave = 0
gpu_device = -1
cooling_down = 1
shufflev2-yolov5 min = 90.86 max = 93.53 avg = 91.56
shufflev2-yolov5-int8 min = 83.15 max = 84.17 avg = 83.65
shufflev2-yolov5-416 min = 154.51 max = 155.59 avg = 155.09
yolov4-tiny min = 298.94 max = 302.47 avg = 300.69
nanodet_m min = 86.19 max = 142.79 avg = 99.61
squeezenet min = 59.89 max = 60.75 avg = 60.41
squeezenet_int8 min = 50.26 max = 51.31 avg = 50.75
mobilenet min = 73.52 max = 74.75 avg = 74.05
mobilenet_int8 min = 40.48 max = 40.73 avg = 40.63
mobilenet_v2 min = 72.87 max = 73.95 avg = 73.31
mobilenet_v3 min = 57.90 max = 58.74 avg = 58.34
shufflenet min = 40.67 max = 41.53 avg = 41.15
shufflenet_v2 min = 30.52 max = 31.29 avg = 30.88
mnasnet min = 62.37 max = 62.76 avg = 62.56
proxylessnasnet min = 62.83 max = 64.70 avg = 63.90
efficientnet_b0 min = 94.83 max = 95.86 avg = 95.35
efficientnetv2_b0 min = 103.83 max = 105.30 avg = 104.74
regnety_400m min = 76.88 max = 78.28 avg = 77.46
blazeface min = 13.99 max = 21.03 avg = 15.37
googlenet min = 144.73 max = 145.86 avg = 145.19
googlenet_int8 min = 123.08 max = 124.83 avg = 123.96
resnet18 min = 181.74 max = 183.07 avg = 182.37
resnet18_int8 min = 103.28 max = 105.02 avg = 104.17
alexnet min = 162.79 max = 164.04 avg = 163.29
vgg16 min = 867.76 max = 911.79 avg = 889.88
vgg16_int8 min = 466.74 max = 469.51 avg = 468.15
resnet50 min = 333.28 max = 338.97 avg = 335.71
resnet50_int8 min = 239.71 max = 243.73 avg = 242.54
squeezenet_ssd min = 179.55 max = 181.33 avg = 180.74
squeezenet_ssd_int8 min = 131.71 max = 133.34 avg = 132.54
mobilenet_ssd min = 151.74 max = 152.67 avg = 152.32
mobilenet_ssd_int8 min = 85.51 max = 86.19 avg = 85.77
mobilenet_yolo min = 327.67 max = 332.85 avg = 330.36
mobilenetv2_yolov3 min = 221.17 max = 224.84 avg = 222.60
```
```
# Run 8
pi@raspberrypi:~/Downloads/ncnn/build/benchmark $ ./benchncnn 8 4 0
loop_count = 8
num_threads = 4
powersave = 0
gpu_device = -1
cooling_down = 1
nanodet_m min = 84.03 max = 87.68 avg = 86.32
nanodet_m-416 min = 143.89 max = 145.06 avg = 144.67
shufflev2-yolov5 min = 84.30 max = 86.34 avg = 85.79
shufflev2-yolov5-int8 min = 80.98 max = 82.80 avg = 81.25
shufflev2-yolov5-416 min = 142.75 max = 146.10 avg = 144.34
yolov4-tiny min = 276.09 max = 289.83 avg = 285.99
nanodet_m min = 81.15 max = 81.71 avg = 81.33
squeezenet min = 59.37 max = 61.19 avg = 60.35
squeezenet_int8 min = 49.30 max = 49.66 avg = 49.43
mobilenet min = 72.40 max = 74.13 avg = 73.37
mobilenet_int8 min = 39.92 max = 40.23 avg = 40.07
mobilenet_v2 min = 71.57 max = 73.07 avg = 72.29
mobilenet_v3 min = 54.75 max = 56.00 avg = 55.40
shufflenet min = 40.07 max = 41.13 avg = 40.58
shufflenet_v2 min = 29.39 max = 30.25 avg = 29.86
mnasnet min = 59.54 max = 60.18 avg = 59.96
proxylessnasnet min = 61.06 max = 62.63 avg = 61.75
efficientnet_b0 min = 91.86 max = 95.01 avg = 92.84
efficientnetv2_b0 min = 101.03 max = 102.61 avg = 101.71
regnety_400m min = 76.75 max = 78.58 avg = 77.60
blazeface min = 13.18 max = 14.67 avg = 13.79
googlenet min = 136.56 max = 138.05 avg = 137.14
googlenet_int8 min = 118.30 max = 120.17 avg = 119.23
resnet18 min = 164.78 max = 166.80 avg = 165.70
resnet18_int8 min = 98.58 max = 99.23 avg = 98.96
alexnet min = 155.06 max = 156.28 avg = 155.56
vgg16 min = 817.64 max = 832.21 avg = 827.37
vgg16_int8 min = 457.04 max = 465.19 avg = 460.64
resnet50 min = 318.57 max = 323.19 avg = 320.06
resnet50_int8 min = 237.46 max = 238.73 avg = 238.06
squeezenet_ssd min = 171.61 max = 173.21 avg = 172.10
squeezenet_ssd_int8 min = 128.01 max = 129.58 avg = 128.84
mobilenet_ssd min = 145.60 max = 149.44 avg = 147.39
mobilenet_ssd_int8 min = 82.86 max = 83.59 avg = 83.22
mobilenet_yolo min = 311.95 max = 374.33 avg = 330.15
mobilenetv2_yolov3 min = 211.89 max = 286.28 avg = 228.01
```
(G6) shufflev2-yolov5 vs. yolov5s

Note: 100 images were randomly sampled for inference; the average per-image latency is rounded.
6. Afterword
I had previously trained yolov3-tiny, yolov4-tiny, nanodet, efficientnet-lite, and other lightweight networks on my own dataset, but none met expectations, while yolov5 actually exceeded them. That said, yolov5 was never designed under a lightweight philosophy, which is what sparked the idea of modifying it, hoping that its powerful data augmentation and positive/negative anchor matching would still deliver satisfying results. All in all, shufflev2-yolov5 trains on the yolov5 platform and works well on small-sample datasets.
There are no elaborate interleaved parallel structures; the network is kept as simple as possible. shufflev2-yolov5 is designed purely for industrial deployment and is better suited to Arm processors; running it on a GPU is terrible value for money.

So!!!
Does shufflev2-yolov5 beat nanodet on the speed/accuracy balance? No: at 320×320 it falls behind nanodet on [email protected]:0.95.
Does it beat yolo-fastest on speed? Also no: yolo-fastest wipes the floor with it.
And against yolox, released just last month, it gets beaten even harder.

Part of the motivation for this optimization is sentimental, since much of my earlier work was built on yolov5; part of it is that it genuinely works well on my personal dataset (more precisely: for someone as starved for data as me, yolov5's various mechanisms really are robust on small-sample datasets).
Project address: https://github.com/ppogg/shufflev2-yolov5
The project will keep being updated and iterated on; stars and forks are welcome!
One last aside: I've been keeping an eye on YOLOv5's development, and the Ultralytics author has picked up the pace of updates lately, so a sixth release of YOLOv5 is probably coming soon~
- END -