Abstract
Named entity recognition (NER) is an information extraction subtask that identifies named entities in unstructured text and classifies them into predefined categories such as the names of people, organizations, and locations. Machine learning approaches such as the hidden Markov model (HMM), as well as hybrid methods, have recently been widely used for NER. To the best of our knowledge, no publicly available data set exists for machine learning-based NER in Persian. Because of the inherent weaknesses of HMMs, in this paper we combine a hidden Markov model with a rule-based method to recognize named entities in Persian texts. Combining the rule-based and machine learning methods yields highly accurate recognition. The machine learning component of the proposed system uses an HMM and the Viterbi algorithm, and the rule-based component employs a set of lexical resources and pattern bases to recognize the names of people, locations, and organizations. For this study, we annotated our own training and test data sets. On an annotated test corpus of 32,606 tokens, our hybrid approach achieves 89.73% precision, 82.44% recall, and an F-measure of 85.93% on Persian text.
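The abstract names an HMM and the Viterbi algorithm without showing how decoding works, so the following is a minimal sketch of Viterbi decoding for a tagging HMM. It is not the authors' implementation: the tag set, the toy probabilities, the romanized tokens, and the `unk` smoothing value are all assumptions chosen for illustration.

```python
import math

def viterbi(tokens, states, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most probable tag sequence for tokens under an HMM."""
    # Work in log space to avoid underflow on long sentences.
    # Unseen (state, token) pairs get a small assumed probability `unk`.
    def emit(state, token):
        return math.log(emit_p[state].get(token, unk))

    # best[t][s] = (log-probability of the best path ending in s at t, previous state)
    best = [{s: (math.log(start_p[s]) + emit(s, tokens[0]), None) for s in states}]
    for t in range(1, len(tokens)):
        column = {}
        for s in states:
            score, prev = max(
                (best[t - 1][p][0] + math.log(trans_p[p][s]), p) for p in states
            )
            column[s] = (score + emit(s, tokens[t]), prev)
        best.append(column)

    # Backtrack from the highest-scoring final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for t in range(len(tokens) - 1, 0, -1):
        state = best[t][state][1]
        path.append(state)
    return path[::-1]

# Toy model (hypothetical numbers, not estimated from the paper's corpus).
STATES = ["PER", "LOC", "O"]
START_P = {"PER": 0.3, "LOC": 0.1, "O": 0.6}
TRANS_P = {
    "PER": {"PER": 0.3, "LOC": 0.1, "O": 0.6},
    "LOC": {"PER": 0.1, "LOC": 0.2, "O": 0.7},
    "O": {"PER": 0.2, "LOC": 0.2, "O": 0.6},
}
EMIT_P = {
    "PER": {"ali": 0.5},
    "LOC": {"tehran": 0.5},
    "O": {"dar": 0.4, "ast": 0.4},
}

# Romanized "Ali dar Tehran ast" ("Ali is in Tehran").
sentence = ["ali", "dar", "tehran", "ast"]
tags = viterbi(sentence, STATES, START_P, TRANS_P, EMIT_P)
print(list(zip(sentence, tags)))
```

In a hybrid system of the kind described, tags produced this way would then be combined with the output of the rule-based component; how the two are merged is not specified in the abstract, so that step is left out of the sketch.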
Notes
Surrounding words are defined as the words that are around the named entities (usually before them) and help with identifying named entities.
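As a rough illustration of how surrounding words can signal an entity, the sketch below labels the token that follows a cue word. The cue list, the labels, and the romanized example sentence are hypothetical; the paper's actual lexical resources and pattern bases are not reproduced here.

```python
# Hypothetical cue words (romanized Persian) mapped to the entity type they precede.
CUE_WORDS = {
    "aghaye": "PER",   # "Mr."
    "shahr": "LOC",    # "city"
    "sherkat": "ORG",  # "company"
}

def tag_by_surrounding_words(tokens, cues=CUE_WORDS):
    """Label the token right after a cue word; all other tokens get "O"."""
    tags = ["O"] * len(tokens)
    for i in range(1, len(tokens)):
        label = cues.get(tokens[i - 1])
        if label is not None:
            tags[i] = label
    return tags

# Romanized "Mr. Ahmadi went to the city of Tabriz".
tokens = ["aghaye", "ahmadi", "be", "shahr", "tabriz", "raft"]
rule_tags = tag_by_surrounding_words(tokens)
print(list(zip(tokens, rule_tags)))
```

A real rule base would need to handle multi-token names and cue words that follow the entity; this sketch covers only the single-token, cue-before-entity case mentioned in the note.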
Cite this article
Moradi, H., Ahmadi, F. & Feizi-Derakhshi, MR. A Hybrid Approach for Persian Named Entity Recognition. Iran J Sci Technol Trans Sci 41, 215–222 (2017). https://doi.org/10.1007/s40995-017-0209-x
DOI: https://doi.org/10.1007/s40995-017-0209-x