Downloads: 1,248 | Citations: 11 | Reads: 14
Abstract: To address the high complexity, heavy computational load, and large parameter count of the YOLOv4-Tiny algorithm, as well as the difficulty of deploying it on resource-constrained embedded platforms, a joint software-hardware optimization scheme is proposed. On the algorithm side, the residual blocks of the original backbone network are replaced with GhostNet residual structures, and the network channels are then pruned to compress the model; the improved network is 97% smaller than YOLOv4-Tiny. To improve hardware resource efficiency, weights and biases are quantized to 16-bit dynamic fixed-point; the burst length of bus reads and writes is increased to raise bandwidth; and highly parallel, pipelined operators for standard convolution, depthwise convolution, pooling, and upsampling are designed to raise network throughput. Experiments show that the improved algorithm achieves 4.04 GOP/s on the PYNQ-Z2. Compared with YOLOv4-Tiny running on an ARM Cortex-A9, the improved network achieves a 35.2x speedup on the FPGA. Joint software-hardware optimization therefore accelerates the algorithm more effectively.
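The abstract states that weights and biases are quantized to 16-bit dynamic fixed-point, where the fractional bit width is chosen per tensor according to its value range. The following is a minimal illustrative sketch of that idea, not the authors' implementation; the function name and the per-tensor scaling strategy shown here are assumptions for illustration.

```python
import math

def dynamic_fixed_point_quantize(values, word_len=16):
    """Quantize floats to word_len-bit dynamic fixed point (illustrative sketch).

    The fractional bit width is chosen per tensor so that the largest
    magnitude fits; the remaining bits carry the fraction (1 bit is the sign).
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        # all-zero tensor: any fractional length works
        return [0] * len(values), word_len - 1, [0.0] * len(values)
    # integer bits needed to cover the magnitude (sign bit kept separate)
    int_bits = max(0, math.ceil(math.log2(max_abs)))
    frac_bits = word_len - 1 - int_bits
    scale = 1 << frac_bits
    qmin, qmax = -(1 << (word_len - 1)), (1 << (word_len - 1)) - 1
    # round to the nearest representable value and clamp to the word range
    quantized = [min(qmax, max(qmin, round(v * scale))) for v in values]
    dequantized = [q / scale for q in quantized]
    return quantized, frac_bits, dequantized
```

For example, a tensor whose largest magnitude is 3.0 gets 2 integer bits and 13 fractional bits, so values that are exact multiples of 2^-13 survive quantization losslessly; tensors with smaller ranges keep more fractional precision, which is the point of making the fixed-point format dynamic per tensor.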
Basic information:
DOI:
CLC classification number: TP391.41; TN791
Citation:
[1] CAO Y J, GAO Y X, DU X C, et al. FPGA acceleration method based on improved YOLOv4-Tiny [J]. Radio Engineering, 2022, 52(4): 604-611.
Funding:
University Innovation Team Project of the Sichuan Provincial Department of Education (15TD0022)