无线电工程 (Radio Engineering), 2025, Vol. 55, No. 9: 1727-1742

多模态大模型的发展综述及思考 (A Survey and Reflections on the Development of Multimodal Large Models)
Abstract:

In recent years, Large Language Models (LLMs) represented by ChatGPT have achieved breakthrough progress toward artificial general intelligence, sparking a global research boom in large-model applications. Human information acquisition and processing typically involve multiple modalities, including vision, audition, and text, making text-only language models insufficient to fully understand and express complex real-world information. Consequently, researchers have begun extending LLMs to multimodal domains, constructing Multimodal Large Models (MLMs) with cross-modal understanding capabilities through unified modeling of diverse data types such as text, images, and videos. This paper presents a comprehensive review of the current state of MLM development, with particular emphasis on mainstream model architectures, training strategies, and evaluation methods, and analyzes the challenges and future directions of the field. With the substantial expansion of model parameter scale and training data, MLMs have demonstrated performance that far surpasses traditional methods on cross-modal tasks, laying a crucial foundation for the development of artificial general intelligence. These models show exceptional understanding and generation capabilities on typical tasks such as Visual Question Answering (VQA), image captioning, and multimodal dialogue. However, current MLMs still suffer from technical bottlenecks in long-sequence processing efficiency, computational resource requirements, and model reliability. Future research will focus on improving computational efficiency while maintaining model performance and on promoting the shift from general-purpose frameworks to domain-specific solutions, laying a key technological foundation for the realization of artificial general intelligence and the intelligent transformation of industries.
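The mainstream architecture the abstract refers to typically couples a pretrained vision encoder to an LLM through a lightweight connector, as in BLIP-2 (Q-Former) and LLaVA (MLP projector). Below is a minimal PyTorch sketch of this connector pattern, offered only as an illustration, not an implementation from the paper; the module names, the two-layer MLP choice, and all dimensions (a 1024-d ViT-style encoder feeding a 4096-d LLM) are assumptions.

import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Projects vision-encoder patch features into the LLM embedding space.
    LLaVA-style two-layer MLP projector; BLIP-2 instead uses a Q-Former."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim)
        return self.proj(patch_feats)  # (batch, num_patches, llm_dim)

def build_multimodal_inputs(image_feats, text_embeds, connector):
    """Concatenate projected visual tokens with text token embeddings so a
    decoder-only LLM can attend over both modalities in one sequence."""
    visual_tokens = connector(image_feats)
    return torch.cat([visual_tokens, text_embeds], dim=1)

# Toy shapes: a 1024-d vision encoder feeding a 4096-d LLM (illustrative only).
connector = VisionLanguageConnector(vision_dim=1024, llm_dim=4096)
img = torch.randn(1, 256, 1024)   # 256 patch features from the vision encoder
txt = torch.randn(1, 32, 4096)    # 32 embedded text tokens
seq = build_multimodal_inputs(img, txt, connector)
print(seq.shape)  # torch.Size([1, 288, 4096])

Once visual features are mapped into the LLM's token space, the decoder-only LLM attends over the concatenated sequence exactly as it would over pure text, which is why instruction tuning and other LLM training strategies transfer to the multimodal setting.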


Basic Information:

CLC Number: TP18

Citation:

[1] 王金桥, 杨蓓莹. 多模态大模型的发展综述及思考 (A Survey and Reflections on the Development of Multimodal Large Models) [J]. 无线电工程 (Radio Engineering), 2025, 55(09): 1727-1742.
