Radio Engineering (无线电工程)

2024, Vol. 54, No. 11, pp. 2576-2584

Scene Text Recognition Based on Multi-Head Attention Fusion

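Foundation: National Natural Science Foundation of China (62171135, 62071131); Fujian Provincial Distinguished Young Scholars Project (2022J06010); Key Research Project of the Provincial Department of Education (2023XQ004)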


Abstract:

Scene Text Recognition (STR) enables computers to read the text in natural scene images. Recognition accuracy has always been the focus of STR research. For edge devices with limited computational resources, however, the model's parameter count and computational efficiency are equally important. To address this, a scene text recognition algorithm based on Multi-Head Attention Fusion (MAF) is proposed. A visual encoder built on the Multi-Head Attention (MHA) mechanism performs deep extraction of visual features from both regular and irregular scene text images. To strengthen the model's perception of variations in character spacing and of semantic similarity, an enhanced position encoding and a semantic encoder that combines input context with a permutation model are proposed. Finally, MHA fuses the visual and semantic features to improve character recognition accuracy against complex backgrounds. Experimental results show that MAF has only 7.6×10⁶ parameters and requires 1.0×10⁹ FLOPs, while achieving an average recognition accuracy of 95.6% on real STR datasets. It effectively balances recognition accuracy against computational efficiency and shows promising application potential.

Keywords: computer vision; scene text recognition (STR); attention mechanism; feature information association
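The fusion step named in the abstract, combining visual and semantic features through multi-head attention, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`fuse`, `attention_head`, `split_heads`), the choice of semantic features as queries and visual features as keys and values, the residual connection, and the omission of learned projection matrices are all assumptions made for illustration. It uses only the Python standard library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(queries, keys, values):
    """Scaled dot-product attention for one head; inputs are lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return outputs

def split_heads(vectors, num_heads):
    """Split each vector into num_heads equal slices; one list of vectors per head."""
    head_dim = len(vectors[0]) // num_heads
    return [[v[h * head_dim:(h + 1) * head_dim] for v in vectors] for h in range(num_heads)]

def fuse(semantic, visual, num_heads=2):
    """Fuse semantic queries with visual keys/values via multi-head attention.

    Hypothetical simplification: learned projections are omitted (identity
    projections), and a residual connection adds the attended visual context
    back onto each semantic vector.
    """
    if len(semantic[0]) != len(visual[0]) or len(visual[0]) % num_heads != 0:
        raise ValueError("feature sizes must match and be divisible by num_heads")
    q_heads = split_heads(semantic, num_heads)
    kv_heads = split_heads(visual, num_heads)
    per_head = [attention_head(q, kv, kv) for q, kv in zip(q_heads, kv_heads)]
    fused = []
    for i, sem in enumerate(semantic):
        context = [x for head in per_head for x in head[i]]  # concatenate heads
        fused.append([s + c for s, c in zip(sem, context)])
    return fused

# Toy inputs: 2 character positions (semantic) attending over 3 image patches (visual).
semantic = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
visual = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, 0.5], [1.0, 1.0, 0.0, 0.0]]
fused = fuse(semantic, visual, num_heads=2)
print(len(fused), len(fused[0]))
```

Each output row keeps the size of the semantic input, so a fused sequence of this kind could feed a per-character classifier. A trained model would add learned query, key, value, and output projections, which this sketch leaves out.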



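Basic information:

CLC number: TP391.41

Citation:

[1] HUANG Junyang (黄俊炀), CHEN Honghui (陈宏辉), WANG Jiabao (王嘉宝), et al. Scene Text Recognition Based on Multi-Head Attention Fusion[J]. Radio Engineering, 2024, 54(11): 2576-2584.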
