Abstract
Scene text recognition is a computer vision task that analyses a scene image and recognizes the text present in it. The task has many applications and will become more important if it can run on handheld devices. Existing methods often rely on models with a large number of parameters and complex architectures; the resulting file sizes make them difficult to deploy on mobile devices. The aim of this paper is therefore to propose a light-weight model, that is, a model with fewer parameters, a smaller file size and lower complexity, which can be used on platforms with limited resources while achieving accuracy comparable to that of heavy-weight models. The proposed models rely on deep learning to handle most of the steps automatically, take less time and give precise results despite the many challenges involved. The proposed scene text recognition model is a convolutional-recurrent neural network in which the convolutional network extracts features from cropped scene-text images and the recurrent network processes the variable-length sequential data present in those images. After training, this model produces a weight file of 12 MB with 1 M parameters. To reduce the number of parameters and the weight-file size, and to show the trade-off between efficiency and accuracy, MobileNetV2 is used in place of the convolutional network, which produces a weight file of 6 MB with 0.5 M parameters. Performance on the ICDAR 2013, IIIT 5K and Total-Text datasets shows that the proposed work performs well in detecting and recognizing text in natural scene images.
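The size figures quoted in the abstract can be compared with a few lines of arithmetic. The sketch below is only illustrative: it uses the rounded values reported above, assumes decimal megabytes (1 MB = 1,000,000 bytes), and the model labels and function names are hypothetical, not taken from the paper.

```python
# Rounded figures as quoted in the abstract; the dictionary keys are
# hypothetical labels chosen for this sketch.
MODELS = {
    "convolutional backbone": {"params": 1_000_000, "file_mb": 12},
    "MobileNetV2 backbone": {"params": 500_000, "file_mb": 6},
}


def bytes_per_parameter(params, file_mb, bytes_per_mb=1_000_000):
    """Average bytes stored per parameter (assumes decimal megabytes)."""
    return file_mb * bytes_per_mb / params


def reduction(before, after):
    """Fractional reduction when going from `before` to `after`."""
    return (before - after) / before


base = MODELS["convolutional backbone"]
light = MODELS["MobileNetV2 backbone"]

for name, m in MODELS.items():
    avg = bytes_per_parameter(m["params"], m["file_mb"])
    print(f"{name}: {avg:.1f} bytes per parameter on average")

print(f"parameter reduction: {reduction(base['params'], light['params']):.0%}")
print(f"file-size reduction: {reduction(base['file_mb'], light['file_mb']):.0%}")
```

With these rounded inputs, the parameter count and the file size shrink by the same fraction. The abstract does not say what else the weight file stores, so the per-parameter value is only an average and should not be read as the numeric precision of the weights.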
Availability of data and materials
Details of the data and materials are given in Section 3 of this paper.
Funding
This project was self-funded by the authors.
Author information
Contributions
Jyoti Ghosh proposed the models, performed the experiments and wrote the paper. Anjan Kumar Talukdar supervised the whole work and the experiments conducted. Kandarpa Kumar Sarma supervised the originality and novelty of the whole work. All authors reviewed and edited the manuscript and approved the final version.
Ethics declarations
Conflict of interest/Competing interests
The authors declare that they have no conflict of interest or competing interests.
About this article
Cite this article
Ghosh, J., Talukdar, A.K. & Sarma, K.K. A light-weight natural scene text detection and recognition system. Multimed Tools Appl 83, 6651–6683 (2024). https://doi.org/10.1007/s11042-023-15696-0
DOI: https://doi.org/10.1007/s11042-023-15696-0