无线电工程 (Radio Engineering), 2024, Vol. 54, No. 11: 2547-2557
Lightweight Video Captioning Models Based on Knowledge Distillation and Performance Optimization
Foundation items: National Natural Science Foundation of China (62171145); Guangxi Natural Science Foundation (2021GXNSFAA220058)
Author affiliation:

School of Computer, Electronics and Information, Guangxi University

Abstract:

Video captioning is the process of converting video content into textual descriptions using computer vision and natural language processing techniques. It has a wide range of applications, including signal recognition and decoding, online video conferencing, video surveillance and security, video translation, and content retrieval. Deep-learning-based video captioning models have achieved remarkable performance gains, but their computational cost and complexity are typically high, making them difficult to deploy and run on mobile communication terminals with limited computing resources. To address this problem, two lightweight models are proposed, for general video captioning and dense video captioning respectively. Taking the UniVL model as the baseline, experiments determine the minimal model architecture that still satisfies the video captioning task. To reduce model size further, an adaptive embedding compression strategy is proposed that compresses the model according to the type of video dataset. Knowledge distillation over information from different layers is then used to optimize the training of the lightweight models, allowing sufficient information exchange with the teacher model and improving their performance. Experimental results show that, compared with the baseline, the proposed lightweight models reduce the parameter count by 75% while the performance metrics drop by no more than 10%.
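
A short PyTorch sketch can make the layer-wise distillation objective described above concrete. This is a hypothetical reconstruction in the spirit of logit distillation (Hinton et al. [10]) and intermediate-layer distillation (TinyBERT [17]), not the authors' released code: the function name layerwise_kd_loss, the uniform layer mapping, the temperature T, the weight alpha, and the assumption of equal hidden widths are all illustrative choices.

```python
import torch.nn.functional as F

def layerwise_kd_loss(student_hiddens, teacher_hiddens, layer_map,
                      student_logits, teacher_logits, T=2.0, alpha=0.5):
    """Distillation loss = hidden-state matching + softened logit matching.

    student_hiddens / teacher_hiddens: per-layer [batch, seq, dim] tensors
        (equal widths assumed; a learned linear projection would be needed
        if the student and teacher hidden sizes differ).
    layer_map[i]: index of the teacher layer that student layer i mimics.
    """
    # 1) Intermediate-layer distillation: each student layer matches the
    #    hidden states of its mapped teacher layer (TinyBERT-style [17]).
    hidden_loss = sum(
        F.mse_loss(student_hiddens[i], teacher_hiddens[j])
        for i, j in enumerate(layer_map)
    ) / len(layer_map)

    # 2) Output distillation (Hinton et al. [10]): KL divergence between
    #    temperature-softened distributions, scaled by T^2 so gradient
    #    magnitudes stay comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    return alpha * hidden_loss + (1.0 - alpha) * soft_loss

# Example mapping: a 12-layer teacher distilled into a 3-layer student
# (consistent in spirit with the roughly 75% parameter reduction above).
layer_map = [3, 7, 11]
```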

Keywords: video captioning; model compression; lightweight; knowledge distillation; pre-trained model
References

[1] HU H Y,YE Q H,YAN M,et al.mPLUG-2:A Modularized Multi-modal Foundation Model Across Text,Image and Video[EB/OL].(2023-02-01)[2024-01-10].https://arxiv.org/abs/2302.00402.

[2] HE X J,CHEN S H,MA F,et al.VLAB:Enhancing Video Language Pre-training by Feature Adapting and Blending[EB/OL].(2023-05-22)[2024-01-10].https://arxiv.org/abs/2305.13167.

[3] CHEN S H,HE X J,GUO L T,et al.VALOR:Vision-audio-language Omni-perception Pretraining Model and Dataset[EB/OL].(2023-04-17)[2024-01-10].https://arxiv.org/abs/2304.08345.

[4] HSIEH H Y,LEY J S,HUANG S A.Implementing a Real-time Image Captioning Service for Scene Identification Using Embedded System[C]//2019 16th Annual IEEE International Conference on Sensing,Communication,and Networking (SECON).Boston:IEEE,2019:1-2.

[5] KRISHNA R,HATA K,REN F,et al.Dense-captioning Events in Videos[C]//Proceedings of the IEEE International Conference on Computer Vision.Venice:IEEE,2017:706-715.

[6] YANG A,NAGRANI A,SEO P H,et al.Vid2Seq:Large-scale Pretraining of a Visual Language Model for Dense Video Captioning[EB/OL].[2024-01-20].https://openaccess.thecvf.com/content/CVPR2023/papers/Yang_Vid2Seq_Large-Scale_Pretraining_of_a_Visual_Language_Model_for_Dense_CVPR_2023_paper.pdf.

[7] LUO H S,JI L,SHI B T,et al.UniVL:A Unified Video and Language Pre-training Model for Multimodal Understanding and Generation[EB/OL].(2020-02-15)[2024-01-20].https://arxiv.org/abs/2002.06353.

[8] WANG J,LIU X F.Remote Sensing Image Enhancement Algorithm Based on Attention Mechanism Generative Adversarial Network[J].无线电工程,2023,53(6):1382-1389.

[9] LIANG L M,YANG Y,HE A J,et al.Cross-level Deformable Transformer Encoder-Decoder Algorithm for Retinal Image Segmentation[J].无线电工程,2023,53(9):1990-2001.

[10] HINTON G,VINYALS O,DEAN J.Distilling the Knowledge in a Neural Network[EB/OL].(2015-03-09)[2024-01-20].https://arxiv.org/abs/1503.02531.

[11] WANG Y,CHENG L,DUAN M,et al.Improving Knowledge Distillation via Regularizing Feature Norm and Direction[EB/OL].(2023-05-26)[2024-01-20].https://arxiv.org/abs/2305.17007.

[12] MILES R,MIKOLAJCZYK K.A Closer Look at the Training Dynamics of Knowledge Distillation[EB/OL].(2023-03-20)[2024-01-20].https://arxiv.org/pdf/2303.11098v1.

[13] CHAI D,WU W,HAN Q H,et al.Description Based Text Classification with Reinforcement Learning[EB/OL].(2020-02-08)[2024-01-10].https://arxiv.org/abs/2002.03067.

[14] MIECH A,ZHUKOV D,ALAYRAC J B,et al.HowTo100M:Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips[C]//2019 IEEE/CVF International Conference on Computer Vision.Seoul:IEEE,2019:2630-2640.

[15] XIE S N,SUN C,HUANG J,et al.Rethinking Spatiotemporal Feature Learning:Speed-accuracy Trade-offs in Video Classification[EB/OL].(2017-12-13)[2024-01-20].https://arxiv.org/abs/1712.04851.

[16] JOULIN A,CISSÉ M,GRANGIER D,et al.Efficient Softmax Approximation for GPUs[EB/OL].(2016-09-14)[2024-01-22].https://arxiv.org/abs/1609.04309.

[17] JIAO X,YIN Y,SHANG L,et al.TinyBERT:Distilling BERT for Natural Language Understanding[EB/OL].(2019-09-23)[2024-01-22].https://arxiv.org/abs/1909.10351.

[18] LOSHCHILOV I,HUTTER F.Decoupled Weight Decay Regularization[EB/OL].(2017-11-14)[2024-01-23].https://arxiv.org/abs/1711.05101.

[19] XU J,MEI T,YAO T,et al.MSR-VTT:A Large Video Description Dataset for Bridging Video and Language[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition.Las Vegas:IEEE,2016:5288-5296.

[20] ZHOU L W,XU C L,CORSO J J.Towards Automatic Learning of Procedures from Web Instructional Videos[EB/OL].(2017-03-28)[2024-01-11].https://arxiv.org/abs/1703.09788.

[21] PAPINENI K,ROUKOS S,WARD T,et al.BLEU:A Method for Automatic Evaluation of Machine Translation[EB/OL].(2001-09-17)[2024-01-25].http://www1.cs.columbia.edu/nlp/sgd/bleu.pdf.

[22] BANERJEE S,LAVIE A.METEOR:An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments[C]//Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.Ann Arbor:ACL,2005:65-72.

[23] LIN C Y.ROUGE:A Package for Automatic Evaluation of Summaries[C]//Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004).Barcelona:ACL,2004:74-81.

[24] VEDANTAM R,ZITNICK C L,PARIKH D.CIDEr:Consensus-based Image Description Evaluation[C]//2015 IEEE Conference on Computer Vision and Pattern Recognition.Boston:IEEE,2015:4566-4575.

[25] HOLLINGSHEAD A B,ZIPF G K.Human Behavior and the Principle of Least Effort:An Introduction to Human Ecology[J].American Sociological Review,1949,14(6):822.

[26] RAJAGOPALAN S S,MORENCY L P,BALTRUSAITIS T,et al.Extending Long Short-term Memory for Multi-view Structured Learning[C]//Computer Vision-ECCV 2016.Amsterdam:Springer,2016:338-353.

[27] ZHOU L W,ZHOU Y B,CORSO J J,et al.End-to-End Dense Video Captioning with Masked Transformer[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Salt Lake City:IEEE,2018:8739-8748.

[28] ZHU L C,YANG Y.ActBERT:Learning Global-Local Video-Text Representations[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Seattle:IEEE,2020:8746-8755.

[29] WANG J K,CHEN D D,WU Z X,et al.OmniVL:One Foundation Model for Image-language and Video-language Tasks[EB/OL].[2024-01-25].https://www.lamda.nju.edu.cn/conf/mla22/poster/wjk-NeurIPS%202022.pdf.

[30] CHEN Y Y,WANG S H,ZHANG W G,et al.Less is More:Picking Informative Frames for Video Captioning[EB/OL].(2018-05-05)[2024-01-25].https://arxiv.org/abs/1803.01457.

[31] PEI W J,ZHANG J Y,WANG X R,et al.Memory-attended Recurrent Network for Video Captioning[EB/OL].[2024-01-25].https://openaccess.thecvf.com/content_CVPR_2019/papers/Pei_Memory-Attended_Recurrent_Network_for_Video_Captioning_CVPR_2019_paper.pdf.

[32] LIU S,REN Z,YUAN J S.SibNet:Sibling Convolutional Encoder for Video Captioning[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2021,43(9):3259-3272.

[33] ALAVI S A,JAVADIPOUR M,MEHRAN K.State Monitoring for Situational Awareness in Rural Microgrids Using the IoT Infrastructure[EB/OL].(2019-06-02)[2024-01-26].https://arxiv.org/pdf/1906.00437.

[34] ZHANG Z Q,SHI Y Y,YUAN C F,et al.Object Relational Graph with Teacher-recommended Learning for Video Captioning[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Seattle:IEEE,2020:13275-13285.

Basic Information:

CLC number: TP391.41

Citation:

[1] CHEN K,TANG Z H,CUI Z L,et al.Lightweight Video Captioning Models Based on Knowledge Distillation and Performance Optimization[J].无线电工程,2024,54(11):2547-2557.
