1 Introduction

Lemmatization is the process of recovering the base morphological form of a given word while considering its context. It is required for applications such as text mining [35], biomedical studies [17], chatbots [9, 15, 32], and question answering [33, 36]. Unlike stemming, lemmatization always returns a valid, meaningful dictionary word for an inflected word. Moreover, since lemmatization depends on context, the same inflected word can have different roots in different contexts. For example, the Bangla word ’কমলে’ can be pronounced as either ’kamoley’ or ’komley’ and carries different meanings accordingly.

Table 1 Context Sensitive Lemmatization Example

Table 1 shows two different lemmata for the word ’কমলে’ in two different sentences (contexts): in the first sentence it functions as a noun meaning ’lotus’, and in the second it is a verb meaning ’decrease’. Such context-dependent root-word derivation is not possible with stemming. There are numerous similar cases in Bangla, making a lemmatizer a better choice than a stemmer for semantic analysis.

Much work has been done on lemmatization for different languages, but heavily inflected languages like Bangla, Hindi, and Sanskrit have mostly remained out of the limelight. Before the popularity of deep learning, rule-based approaches dominated lemmatizer design for individual languages [26, 28, 29, 37]. With the arrival of deep neural networks, especially recurrent architectures, languages that exhibit only slight morphological variation readily saw significant improvement in lemmatization. To achieve decent performance with rule-based lemmatization for morphologically rich languages like Bangla, Russian, and Hungarian, a vast set of rules needs to be identified. Conversely, deep learning-based approaches require an extensively large annotated dataset. Thus the choice of lemmatization approach for a morphologically rich but low-resource language like Bangla poses a genuine dilemma. In 2014, Bhattacharyya et al. [3] proposed a method that creates a tree and ranks the candidate root words; however, it does not consider the context of the lexicon. In 2016, Abhisek et al. [5] proposed an approach called Benlem, but its accuracy drops significantly when the “bag of words” size is large. The shallow depth of previous work and the unavailability of any large open-source Bangla dataset are the driving forces behind this work.

In this paper, we propose a Bangla Neural Lemmatizer (BaNeL) that takes an inflected word and returns the most likely lemma considering its context, represented by parts-of-speech (POS) rather than the entire sentence. The dataset we have created contains more than 22,300 inflected words and 6,300 root words with their POS. The input and expected output of our proposed approach are illustrated below.

Input(A Bangla sentence and POS) with pronunciation:

শফিককে(Shafiqke) ঢাকাতে(Dhakatey) আসতে(ashtey) বলেছিলাম(bolesilam)। Parts-of-Speech: Noun; Noun; Verbal; Verb

Translation in English: I told Shafiq to come to Dhaka.

Output(Lemma): শফিক(shafiq); ঢাকা(Dhaka); আসা(asha); বলা(bola)

Although Bangla is spoken by around 260 million people worldwide, minimal work has been done on its lemmatization. Due to the lack of a Bangla lemmatizer, many NLP (Natural Language Processing) systems are yet to be developed for this language. This work intends to create a pathway that makes it easier for other researchers to build systems such as voice recognition, handwriting recognition, text summarization, topic modeling, and sentiment analysis in the future. The main challenge in developing the BaNeL model was the Bangla language itself: its morphological diversity is so high that it is difficult to build a system that resolves each form correctly, since combinations of characters produce different meaningful words in many different ways. We hope this system will help other researchers build more NLP systems for Bangla and make the language more technology-friendly and more accessible worldwide. The main contributions of this paper are:

  • A larger dataset (in terms of the number of unique inflected words) for Bangla lemmatization

  • An efficient encoder-decoder based lemmatizer called ’BaNeL’ that gives better accuracy than state-of-the-art Bangla lemmatizers.

The paper is organized as follows: Section 2 discusses the current state of Bangla lemmatization along with commonly followed practices in lemmatizer development for high-resource languages; Section 3 describes the preparation methodology and statistics of our dataset; Section 4 focuses on data preprocessing and the BaNeL model architecture; Section 5 presents model accuracy along with hyper-parameter tuning. Finally, Section 6 summarizes the findings of this study and the applicability and potential of the proposed BaNeL framework.

2 Related works

Throughout the digitalization of languages, much work has been done to improve communication through machines. Roughly 6500 languages are spoken around the world today. Many languages have gone extinct over time owing to their dependence on particular communities and their structural complications, while others have become popular worldwide and are spoken by many different communities. Much work has been done on such widely studied languages as English, Chinese, and French. Languages like Bangla and Hindi are also widely spoken, but their internal structure is more complicated, and the morphological variety in their word formation is much broader than in most other languages. This structural complexity and morphological richness are why there is less NLP work on these languages. Although there are very few works on the Bangla language, this section discusses work on different languages to acquaint readers with various theories and established methodologies.

2.1 Non-Bangla lemmatizers

One of the early notable works in this field was done by Wolfgang Lezius et al. [16] in 1998, when resources were very limited and channels of knowledge sharing were narrow. They proposed a POS-tagging-based lemmatizer for German. They created a morphology module that provides all possible lemmata for a context; the POS tagger then matches the POS of the words in the context against the morphology module’s lemmata, and if the two forms do not fall into the same grammatical category, the lemma is discarded. They reported 99.2% accuracy with a large tag set and 99.3% with a small tag set. Their work has been cited many times by other researchers in this field.

Another work on German was done by Praharshana Perera et al. [27] in 2005, focusing mainly on nouns. To find the maximum number of nouns within a lexicon, they used a POS tagger [34]. Their proposed algorithm can continuously train itself on the documents it processes to achieve higher coverage and accuracy. Moreover, they integrated automatic lexicon generation with the lemmatization algorithm.

Snigdha Paul et al. [25] developed a lemmatizer for Hindi in 2013. They proposed a rule-based approach that strips suffixes from a given word to find the lemma. Working manually on a dataset of around 20000 sentences, they created 112 rules. Their rules mainly emphasized time optimization, and their system achieved 91% accuracy. Later, they added a knowledge base storing exceptional root words; however, it was limited to commonly used terms other than nouns. As they emphasized time optimization even more, the accuracy dropped to 89.08% [26].

Svanhvit Ingolfsdottir et al. [12] proposed a system named Nefnir, which they compared with two other lemmatizers: the rule-based CST (Center for Sprogteknologi/Center for Language and Technology) lemmatizer [13] and Lemmald [11], which is part of the IceNLP toolkit. Nefnir mainly relies on suffix substitution rules. It lemmatizes tagged text, where tags come either from an automatic POS tagger (IceTagger tags) or from manually assigned POS (gold tags). They found 99.55% accuracy with gold tags and 96.88% with IceTagger tags. Although the results are excellent, as a rule-based lemmatizer for a language from a different language family, this system is of limited relevance to Bangla.

Akhmetov et al. [1] proposed a random forest classification model for language-independent lemmatization. They create a character co-occurrence embedding from inflected-word-lemma pairs of 25 languages, which is then fed into a decision tree algorithm to generate the final model. The accuracy of their model ranges from 28% (for Farsi) to 96% (for Turkish), with an average of 72%. Although Bangla is not among these 25 languages, we use this approach as one of the baseline algorithms for comparison with our proposed model.

As part of the Stanza toolkit [31], a lemmatizer is implemented for selected languages as an ensemble of a dictionary-based lemmatizer and a seq2seq-based [38] lemmatizer. The seq2seq lemmatizer is constructed from a bidirectional LSTM encoder, a soft dot-attention layer, and a greedy decoder. Milintsevich et al. [20] attempt to enhance the performance of the Stanza lemmatizer using an external system that supplies additional lemma candidates, but achieve only marginal improvement over the original system for the 23 supported languages. This paper implements a Stanza-like lemmatizer for Bangla (which is not among those 23 languages) by preparing a new large dataset and modifying the preprocessing (character-level two-hot encoding with POS) and the decoding phase during inference (beam decoding).

2.2 Bangla lemmatizers

Alok Ranjan Pal et al. [23] proposed a lemmatization technique for Bangla nouns. They used a suffix-stripping method where the longest matching suffix is removed to find the root word. Their input nouns were tagged within a lexicon and then processed through the system to obtain the lemma or base form. A total of 1273 nouns were used to evaluate the system, which was found to be 94% accurate. In contrast, our proposed BaNeL model is not specific to nouns; it can lemmatize words of any part of speech if sufficient entries are available during the training phase.

In 2019, an information retrieval system named the Bengali Information Retrieval System (BIRS) was introduced by Md. Kowsher et al. [14]. They proposed two novel techniques to lemmatize Bangla verbs that cannot be lemmatized with rule-based algorithms. The first approach, Dictionary Based Search by Removing Affix (DBSRA), has the lowest time and space complexity and removes prefixes from a given word to find the root. The second is a trie, a tree-based data structure that retrieves all possible lemmata; in a trie, a single edge connects two nodes, and each node holds a single character. Briefly, their approach uses a corpus of root words and, for an inflected word, selects the root word based on an edit-distance algorithm without considering the context of the sentence. In principle, this approach cannot identify the correct root word for cases like those shown in Table 1.

Alok Ranjan Pal et al. [24] proposed a method for Bangla Word Sense Disambiguation (WSD) using both supervised and unsupervised techniques. Their dataset contained lemmatized words and regular words, much like ours, and they observed higher accuracy on lemmatized terms; however, they did not describe their algorithm or method for lemmatizing Bangla words.

Arijit Das et al. [8] proposed a system to extract Bangla root verbs automatically using Paninian grammar. They used a supervised method to classify verbs according to tense and person, after which a set of Panini rules was applied to extract the verb roots. They argue that their work will significantly impact Bangla NLP, as no previous lemmatizer handles inflected verbs well due to their morphological variants. They reported accuracies from 61% to 92% across five different classes of Bangla verbs.

Abhishek Chakrabarty et al. proposed a lemmatization algorithm named Benlem [5], exploring different aspects of NLP. They used a suffix list following the work of Paik and Parui [22]. As resources were limited, they created their own train and test datasets. Their algorithm retrieves the root or lemma through subtraction and addition of suffixes and prefixes. Benlem is a rule-based lemmatizer for Bangla that requires a valid suffix set and POS-annotated inflected words for better accuracy; it achieved 83.82% accuracy.

A neural lemmatizer was proposed for Bangla by Akshay Chaturvedi et al. [4]. They used the word2vec tool [19] to create 200-dimensional word vectors, trained the continuous bag-of-words model of word2vec, and used 10-fold cross-validation to create the optimal train-test split. In this method, a root word is selected as the lemma if its vector has the maximum cosine similarity with the surface word vector from the test data. However, their model could not capture sequential relations among the characters of a word, as they used a feed-forward architecture with nearby words as contextual neighbors; as a result, it yielded only 69.57% accuracy.

The BLSTM-BLSTM Bangla lemmatizer [6] uses two successive bidirectional LSTM-based neural networks to capture the edit-tree sequence from an inflected word to its lemma.
Lemming [21] is a token-based statistical lemmatizer that outperforms BLSTM-BLSTM (90.84% accuracy) with 91.69% accuracy. Their dataset is freely available, but despite containing around 20000 POS-tagged inflected words, it has relatively few unique inflected words. Although these lemmatizers give moderately good performance, the potential of encoder-decoder based sequence-to-sequence models [38] has not yet been explored for Bangla. Lemmatizers for Balto-Slavic languages, which are also highly inflectional like Bangla, have achieved significant accuracy improvements using encoder-decoder models [18, 30]. Following this intuition, in this work we apply an encoder-decoder model to lemmatization in Bangla. On top of that, we apply an attention layer while preparing the context vector from the encoder to better align the input with the output of the training data. To prevent locally greedy choices from hurting the global output of the decoder, we apply beam decoding in the inference phase. Combining these modifications, our proposed framework achieves a significant improvement in accuracy over existing approaches to Bangla lemmatization.

3 Data collection

To train a robust neural model, a large dataset is required. A few open-source datasets are available for Bangla lemmatization but, unfortunately, they are not large enough. Chakrabarty et al. [6] described a dataset containing 20,257 POS-annotated inflected-word-lemma pairs, which is a superset of the Benlem dataset [5]. As it is built from continuous text, it contains only 7476 unique inflected words and 4436 unique lemmata. To overcome this limitation, we have collected inflected words manually. For convenience of data collection, we arranged the data in a spreadsheet format where the first column contains the inflected word, the second column contains the root word or lemma, and the third column contains the parts-of-speech.

3.1 Word selection

Widely read Bangla newspapers, novels, plays, articles, blogs, etc., were considered while creating the vocabulary set. The dataset was then augmented with the lemma and parts-of-speech of the selected vocabulary. We have given preference to lemmata that change frequently depending on the context. As nouns show the most inflection, the words in our dataset are mostly noun lemmata inflected with suffixes such as টা(ta), টি(ti/tee), টাকে(taake), টিকে(teeke/tike), তা(ta), বা(ba), বে(be), বি(bi/bee), রা(ra), দের(der). Articles, suffixes, inflections, case markers, and number markers are added to the lemma. For example, ’রা’ (ra) added to ’কুমার’ (kumar) makes the inflected word ’কুমাররা’ (kumarra), and ’কুমার’ (kumar) combined with ’ঈ’ (ee) makes the inflected word ’কুমারী’ (kumaree). Similarly, ’কলম’ (kalom) from ’কলমটি’ (kalomti), ’কলমটাকে’ (kalomtake), ’কলমের’ (kalomer) and ’পাথর’ (pathor) from ’পাথরের’ (pathorer), ’পাথরটা’ (pathorta), ’পাথরে’ (pathore) are some examples of inflected and base forms from the dataset. Such inflected words and their corresponding root words were collected and stored in a spreadsheet database. A sample format is shown in Table 2.

Table 2 Spreadsheet format (with pronunciation and translation)

Two different annotators annotated the dataset, and the annotation was then verified by a native language expert, whose corrections were considered final. There were some ambiguities regarding parts-of-speech, as in the case of the word “লেখা (pronunciation: lekha, meaning: to write (verb) or writing (noun))”. Consider the following two sentences:

  1.

    তখন(tokhon) তিনি(tini) কিছু(kisu) একটা(ekta) লিখছিলেন(likhsilen)। (He was writing something then.)

  2.

    তার(tar) সেই(shei) লেখাটি(lekhati) বেশ(besh) শিক্ষণীয়(shikhoniyo)। (That writing of his is quite instructive.)

In the first sentence, the word “লিখছিলেন(likhsilen)” is an inflected verb form of “লেখা(lekha)”, and in the second sentence, the word “লেখাটি(lekhati)” is a verbal noun; both have the same lemma “লেখা(lekha)”, which can function as either a noun or a verb in its base form. In our dataset, “লিখছিলেন(likhsilen)” is annotated as a verb and “লেখাটি(lekhati)” as a noun. Although we present these two words with context (sentences), “লিখছিলেন(likhsilen)” can only be used as a verb, so contextual information is not necessary to identify it as such. The same argument holds for “লেখাটি(lekhati)”, which can only be used as a noun, removing all ambiguity for this type of inflection. The word “লেখা(lekha)” itself, although describable as both a verb and a noun, can never be used as a verb in its base form in a Bangla sentence; so whenever we expect to find “লেখা(lekha)” in a sentence, we can annotate it as a noun without considering the context. However, some words can function as different parts-of-speech depending on the context, like the word “হাত (haat: hand)”. Consider the following two phrases:

  1.

    হাতের (haater) কাছে(kasey): Near one’s hand

  2.

    হাতের(haater) লেখা(lekha): hand-writing

In both cases, “হাত (haat)” appears in the same inflected form; however, in phrase 1 it functions as a noun, while in phrase 2 it functions as an adjective. In such cases, we have added two tuples to the dataset: (inflected: হাতের(haater), lemma: হাত(haat), POS: বিশেষ্য(Noun)) and (inflected: হাতের(haater), lemma: হাত(haat), POS: বিশেষণ(Adjective)). As we are not using POS as the target value, no ambiguity arises here. Moreover, this gives the model the capability to find the correct lemma of an inflected word in different contexts.

Although parts-of-speech tagging was done manually, we could have used an existing Bangla POS tagger that uses the Stanford POS tagger at its core and reports 92% accuracy. To ensure 100% correctly POS-tagged train and test data, we avoided using this model. However, until a perfect POS tagger arrives, our proposed BaNeL framework can be combined with this automatic POS tagger in practical applications for decently accurate lemmatization.

3.2 Dataset analysis

We have assembled a total of 22353 unique inflected words with 6382 unique lemmata. A particular word can be inflected in various ways and can change its part of speech. However, we have maintained inflected-word-to-root-word relations only where the part of speech does not change; so each dictionary word qualifies as a lemma in our dataset if it shows some inflection without changing its part of speech. For example, consider the following words: “সুন্দর” (sundor: Adjective (beautiful)/Noun (beautiful man)), “সুন্দরী” (sundori: Noun (beautiful woman)), “সুন্দরভাবে” (sundorvabey: Adverb (in a beautiful way)), “সৌন্দর্য” (soundorjo: Noun (beauty)). All of these words are derivatives of “সুন্দর”, but we have treated each of them as a separate lemma, and inflected words are mapped to the lemma that has the same part of speech and the same meaning. Borrowed words from other languages that have made their way into regularly used Bangla and take inflections like those found in native Bangla vocabulary are also included in our dataset.

There are three columns in our spreadsheet dataset. The first contains the inflected word, showing how the root word changes among its contextual neighbors. The second contains the lemma, i.e., the root word. The last contains the parts-of-speech. Valid inflected words are created through transformations of their respective lemmata (by adding or subtracting an ’Inflexion’, ’Case’, ’Suffix’, ’Number’, or ’Article’). Surface words converted from a root word retain the same sense as the root word, as in [5].

The dataset contains two separate types of inflection. One follows regular patterns, where the simple addition or subtraction of a frequent ’Inflexion’, ’Case’, ’Suffix’, ’Number’, or ’Article’ to or from a root produces a surface word [5]. An inflected word can be formed in the following ways (among others):

  1.

    By subtracting a valid suffix from its lemma (e.g., ‘কর’ = ‘করা’ - ‘া’): to do;

  2.

    By adding a valid suffix with the lemma (e.g., ‘নগরই’ = ‘নগর’ + ‘ই’): town;

  3.

    By subtraction followed by addition with the lemma( e.g., ‘করছেন’ = [‘করা’ - ‘া’] + ‘ছেন’): to do;

  4.

    By adding article with the lemma ( e.g., ‘কলমটি’ = কলম’ + ‘টি’):pen;

  5.

    By adding inflexion with the lemma ( e.g., ‘কলমের’ = ‘কলম’ + ‘এর’):pen;

  6.

    By adding both article and inflexion with the lemma ( e.g., ‘আকাশটিকে’ = [‘আকাশ’ + ‘টি’] + ‘কে’):sky;

    The other type consists of irregular inflected words for which no well-formed transformation rule works. In most cases, a root word and its morphological variants are both orthographically and semantically similar; for irregular inflections, however, only semantic similarity exists between the lemma and its variants. Here the surface word is not formed by adding or subtracting any ’Inflexion’, ’Case’, ’Suffix’, ’Number’, or ’Article’:

  7.

    None of the above transformations works (e.g., ‘যাওয়া (to go)’ -> ‘গিয়েছিলো (went)’);

Words in our dataset belong to seven parts of speech: noun, verb, pronoun, adjective, preposition, verbal noun, and adverb. Most of the root words have at least five inflected forms in the dataset; on average, 3.50 inflected words have been created per root word. Table 3 shows the parts-of-speech distribution of the prepared dataset, where nouns dominate massively.

Table 3 POS Distribution of Root Words

To assess the coverage of our dataset over the Bangla vocabulary, we collected 94,578 articles (784 MB of text) from a popular Bangla newspaper, containing 98389 unique words. In contrast, our annotated dataset has only 6382 unique lemmata. Nevertheless, the reassuring fact is that only 27835 words in the collected corpus have a frequency above twenty, and 10261 words have a frequency above one hundred. Of our root words, 2643 belong to the set of words with a minimum frequency of one hundred. Moreover, these 10261 words also include proper nouns (which we deliberately avoided while preparing the lemma set) and inflected words. This analysis indicates a high probability of covering the most frequent Bangla root words that show inflection. The dataset is publicly available at https://github.com/cseduashraful/BaNeL-Dataset.

4 Proposed BaNeL framework

This section discusses preprocessing of the dataset, the BaNeL model architecture in detail, and the implementation process. To present these details comprehensively, we demonstrate the data manipulations with examples alongside the theoretical discussion.

4.1 Data preprocessing

In order to make the data compatible with neural networks, several levels of encoding and preprocessing are required. A suitable data representation is crucial for the model to converge towards the goal. We have used character-level encoding to train the model. Character-level encoding preserves the structural information of words, enabling the model to correctly predict new unseen lemmata if the type of inflection is familiar. For example, if the model has seen inflected words like “করানোর (karanor)”, “খেলানোর (khelanor)”, and “পড়ানোর (paranor)”, then it is more likely to correctly predict the lemma of the word “মেশানোর (meshanor)”, which is “মেশা (mesha): to mix”, since they all follow the same pattern (lemma + inflection নোর), even though the model never saw this lemma during training. Sections 4.1.1 to 4.1.4 discuss the data preprocessing phases in detail.

4.1.1 Character set creation

A complete list of the Bangla alphabet has been created along with padding token < $ > as shown in Fig. 1. This alphabet file has been used to map each unique character to a unique integer.

Fig. 1 Bangla Character Set with Padding Token < $ >

In this list, we have 61 characters (indices 0-60), and index 61 holds the symbol < $ >, which is needed during data processing to indicate the start and padding tokens.

4.1.2 Integer encoding

Inflected and root words are split into characters. For example, the word “নিয়োগের (niyoger)” is split into [ ’ন’, ’ি’, ’য়’, ’ো’, ’গ’, ’ে’, ’র’ ]. Each character is then mapped to an integer using its index in the alphabet list; e.g., the integer encoding of “নিয়োগের” will be [ 30, 56, 45, 50, 13, 59, 37 ]. The integer encodings of all inflected words are added to a list called “xlist”, and the integer encodings of all lemmata are added to a list called “ylist”. In the meantime, the maximum word lengths for inflected words and lemmata are calculated.
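For illustration, a minimal sketch of this encoding step is given below. The helper names and toy word lists are ours, not from the original implementation; the real pipeline uses the fixed alphabet of Fig. 1, which is what yields the exact indices quoted above.

```python
# Minimal sketch of the integer-encoding step (Section 4.1.2); names are ours.
# The paper uses the fixed 62-entry alphabet of Fig. 1 (padding token '$' at
# index 61); here the alphabet is simply derived from a toy word list.
# Note: real Bangla text may need Unicode normalization so that signs such as
# 'ো' map to single alphabet entries rather than pairs of code points.

inflected_words = ["নিয়োগের", "কলমটি"]   # stand-ins for dataset column 1
lemmata = ["নিয়োগ", "কলম"]               # stand-ins for dataset column 2

alphabet = sorted(set("".join(inflected_words + lemmata))) + ["$"]
char2idx = {ch: i for i, ch in enumerate(alphabet)}

def integer_encode(word):
    """Split a word into characters and map each one to its alphabet index."""
    return [char2idx[ch] for ch in word]

xlist = [integer_encode(w) for w in inflected_words]
ylist = [integer_encode(w) for w in lemmata]
max_word_len = max(len(x) for x in xlist)   # 16 for the full BaNeL dataset
```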

4.1.3 One hot encoding of inflected word

Root words (the contents of the output column) are kept in integer-encoded format, whereas inflected words are converted to one-hot encoding. Before that, each word is padded with the padding token < $ > up to the maximum encoded word length of the inflected words; in our case this is 16, since the longest inflected word in the dataset has 15 characters and at least one < $ > token marks the end of the word. The padding character < $ > has index 61, as shown in Fig. 1. For example, the integer encoding of the word ‘নিয়োগের’ is [ 30, 56, 45, 50, 13, 59, 37], and after padding the encoding list becomes [30, 56, 45, 50, 13, 59, 37, 61, 61, 61, 61, 61, 61, 61, 61, 61]. Using these integer values, each character is represented as a vector of length 62. The one-hot encoding for the word ’নিয়োগের’ = [ ’ন’, ’ি’, ’য়’, ’ো’, ’গ’, ’ে’, ’র’ ], i.e., [30, 56, 45, 50, 13, 59, 37, 61, 61, 61, 61, 61, 61, 61, 61, 61], is: [ [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], …, [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1 ] ]

The one-hot encoded size for each inflected word is: (max_word_len*alphabet_len) = (16*62)

4.1.4 Two hot encoding of inflected word—POS pair

The parts-of-speech of the inflected words are encoded separately. There are eight parts-of-speech in the dataset, so the one-hot POS encoding of each inflected word has length eight. We augment each character of an inflected word with this information: the eight-length POS one-hot encoding is appended to the original 62-length one-hot encoding of each character, creating a two-hot encoding of length 70 per character. The alphabet list ends at index 61, and the POS indices start from 62, as shown in Table 4.

Table 4 POS Index

For the input data, each character of a word is now represented by 70 values, where the last eight indices encode the parts-of-speech of that word. Among these eight indices, exactly one is ‘1’ and the remaining seven are ‘0’; the POS one-hot vector (e.g., [0, 1, 0, 0, 0, 0, 0, 0] for ‘বিশেষণ’) is appended to the end of the one-hot encoding of each character.

For example, the final encoding of the word ‘কমে’, whose POS is ‘বিশেষণ’ (adjective) and whose padded integer encoding is [11, 35, 59, 61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 61], is given below:

Encoding(‘কমে’) = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], ......, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]]

The final encoding size is max_word_len*(alphabet_len + pos_len) = 16*70. Every character is encoded by a vector of length 70, where the first 62 positions are for the alphabet and the last eight are for the POS.
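A minimal sketch of this two-hot encoding is given below, assuming the index layout of Fig. 1 and Table 4 (62 alphabet indices followed by 8 POS indices); the function and constant names are ours.

```python
import numpy as np

ALPHABET_LEN = 62    # 61 Bangla characters plus the padding token '$' at index 61
POS_COUNT = 8        # POS tags occupy indices 62-69 (Table 4)
MAX_WORD_LEN = 16
PAD_IDX = 61

def two_hot_encode(int_encoded_word, pos_index):
    """Return a (MAX_WORD_LEN, ALPHABET_LEN + POS_COUNT) two-hot matrix.

    Each row one-hot encodes one character and, in its last eight positions,
    the POS tag of the whole inflected word.
    """
    padded = int_encoded_word + [PAD_IDX] * (MAX_WORD_LEN - len(int_encoded_word))
    enc = np.zeros((MAX_WORD_LEN, ALPHABET_LEN + POS_COUNT), dtype=np.float32)
    for row, char_idx in enumerate(padded):
        enc[row, char_idx] = 1.0                   # character one-hot (indices 0-61)
        enc[row, ALPHABET_LEN + pos_index] = 1.0   # POS one-hot (indices 62-69)
    return enc

# 'কমে' tagged as adjective (POS index 1), integer-encoded as [11, 35, 59]
x = two_hot_encode([11, 35, 59], pos_index=1)      # shape: (16, 70)
```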

4.2 Proposed network architecture

A combination of recurrent neural structures is applied to capture the inherent sequential nature of word formation. We use a sequence-to-sequence encoder-decoder technique to design the lemmatizer. Two-hot encoded characters are passed as sequential input to the encoder. The encoder outputs are then used by the decoder with Bahdanau attention [2] to create time-step-aligned context vectors, which in turn are used to predict the characters of the lemma sequentially.

4.2.1 Encoder

The encoder takes a batch of character-level two-hot encoded (as described in Section 4.1), POS-tagged inflected words as input. Our encoder is a bidirectional, multi-layer stacked Recurrent Neural Network (RNN) where each encoded character of an inflected word serves as the input of one RNN time-step. Before an encoded character is passed to the RNN cell, it goes through a neural embedding layer; we have used PyTorch's built-in embedding module for this purpose.

Fig. 2 Encoder

Figure 2 shows the diagram of the encoder of our proposed BaNeL framework. The figure shows a single-layer bidirectional network; during hyper-parameter tuning, however, the optimal number of layers is traced for the given dataset. Each RNN unit can be either an LSTM (Long Short-Term Memory) [10] or a GRU (Gated Recurrent Unit) [7].

Each vector \(X_i\) in Fig. 2 is an embedded two-hot encoded character of the POS-tagged inflected word. These vectors are fed sequentially from start (\(X_1\)) to end (\(X_M\)) in the forward pass and from end (\(X_M\)) to start (\(X_1\)) in the backward pass. Here \(M\) (\(max\_word\_len\)) is the length of the longest inflected word in the current batch including the end-of-word token < $ >. The encoder output after each character in the forward pass (\(O^f_i\)) is concatenated with the corresponding output of the backward pass (\(O^b_{M-i+1}\)). All such concatenated vectors are recorded as ’Encoder Outputs’, which are later used to produce time-step-aligned context vectors during the decoding phase. The last hidden states of the forward pass (\(h^f_M\)) and the backward pass (\(h^b_M\)) are concatenated and used in the decoder as the ’Encoder last hidden’ state.
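The sketch below illustrates such an encoder in PyTorch. The dimensions, the dense projection used in place of the embedding module, and all names are illustrative assumptions, not the exact BaNeL implementation.

```python
import torch
import torch.nn as nn

class BaNeLEncoder(nn.Module):
    """Illustrative bidirectional stacked-GRU encoder over two-hot characters."""

    def __init__(self, input_dim=70, emb_dim=128, hidden_size=768, num_layers=2):
        super().__init__()
        # dense projection standing in for the embedding layer of the paper
        self.embed = nn.Linear(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_size, num_layers=num_layers,
                          bidirectional=True, batch_first=True)

    def forward(self, x):                   # x: (batch, max_word_len, 70)
        emb = self.embed(x)                 # (batch, max_word_len, emb_dim)
        outputs, hidden = self.rnn(emb)     # outputs: (batch, T, 2 * hidden_size)
        # concatenate the final forward and backward hidden states of the top layer
        last_hidden = torch.cat([hidden[-2], hidden[-1]], dim=1)   # (batch, 2 * hidden_size)
        return outputs, last_hidden

encoder = BaNeLEncoder()
batch = torch.zeros(4, 16, 70)              # a dummy batch of 4 encoded words
enc_outputs, enc_last_hidden = encoder(batch)
```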

4.2.2 Decoder

Unlike the encoder, the decoder of the BaNeL model is a unidirectional recurrent neural network. The initial hidden state of the decoder RNN unit is set by the context vector \(C_1\), which is calculated from the encoder outputs and the encoder's last hidden state. As shown in Fig. 3, the encoder's last hidden state and the encoder outputs for an entire batch pass through the Bahdanau attention layer to generate the attention weights \(W^a_1\). Applying batch matrix multiplication between the encoder outputs and \(W^a_1\), the first time-step-aligned context vector \(C_1\) is generated. Later context vectors \(C_2, C_3, C_4, \dots, C_{M-1}, C_M\) are calculated using the previous decoder hidden states \(h_1, h_2, h_3, \dots, h_{M-2}, h_{M-1}\) respectively, instead of the encoder's last hidden state.

Fig. 3 Decoder

At each time-step, the decoder uses the embedded decoder output and context vector of the previous time-step as input, except for the first time-step, when an embedded dummy start token and an all-zero context vector are fed to the decoder RNN unit (LSTM or GRU cell). A softmax activation function computes the probability of each character of the output alphabet from the LSTM/GRU cell's output. In the training phase, we use greedy decoding, i.e., the character with the highest softmax probability is chosen as the output of the current step. In the evaluation phase, we use beam search to maintain a pool of top-K choices for generating the final sequence, since the locally best choice may not yield the globally optimal result. Although beam search does not guarantee the optimal result, the probability of finding it increases significantly. Beam search is not used during training because at that stage we want to train the model so that the decoder always emits, with the highest local probability, the character that is part of the globally optimal solution.
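A sketch of one decoding step with Bahdanau (additive) attention is given below; the wiring of the context vector, the dimensions, and all names are our assumptions rather than the exact BaNeL code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BahdanauAttention(nn.Module):
    """Additive attention that builds a context vector from encoder outputs."""

    def __init__(self, enc_dim, dec_dim, attn_dim=256):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim)
        self.W_dec = nn.Linear(dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, dec_hidden, enc_outputs):   # (B, dec_dim), (B, T, enc_dim)
        scores = self.v(torch.tanh(self.W_enc(enc_outputs)
                                   + self.W_dec(dec_hidden).unsqueeze(1)))     # (B, T, 1)
        weights = F.softmax(scores, dim=1)                                     # over time-steps
        context = torch.bmm(weights.transpose(1, 2), enc_outputs).squeeze(1)   # (B, enc_dim)
        return context, weights

class DecoderStep(nn.Module):
    """One decoding step: previous character + attention context -> next character."""

    def __init__(self, vocab=62, emb_dim=128, enc_dim=2 * 768, hidden=768):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb_dim)
        self.attn = BahdanauAttention(enc_dim, hidden)
        self.rnn = nn.GRUCell(emb_dim + enc_dim, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, prev_char, dec_hidden, enc_outputs):
        context, _ = self.attn(dec_hidden, enc_outputs)
        rnn_input = torch.cat([self.embed(prev_char), context], dim=1)
        dec_hidden = self.rnn(rnn_input, dec_hidden)
        probs = F.softmax(self.out(dec_hidden), dim=1)   # distribution over characters
        next_char = probs.argmax(dim=1)                  # greedy choice (training phase)
        return next_char, dec_hidden, probs
```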

Fig. 4 BaNeL Model

However, we use a teacher-force ratio (TFR) in the training phase, as shown in Fig. 4, which randomly (with 50% probability) skips the previous decoder output and feeds the actual target character as the decoder input. Since the decoder may frequently predict wrong outputs in the initial epochs, those wrong outputs would otherwise be fed back to produce subsequent decoder outputs, slowing convergence. The teacher-force ratio is therefore used to speed up convergence.
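The training loop sketched below shows how such a teacher-force ratio can be applied, reusing the decoder step sketched above; the 50% ratio and the use of the < $ > token (index 61) are from the paper, while the function and variable names and the loss handling are our assumptions.

```python
import random
import torch

TEACHER_FORCE_RATIO = 0.5    # probability of feeding the gold character (from the paper)
START_PAD_IDX = 61           # the < $ > token doubles as start/padding symbol

def decode_with_teacher_forcing(decoder_step, enc_outputs, dec_hidden, target_seq):
    """Training-time decoding loop with teacher forcing (illustrative sketch).

    target_seq: (batch, target_len) integer-encoded lemma characters.
    Returns the per-step character distributions for computing the loss.
    """
    batch_size, target_len = target_seq.shape
    prev_char = torch.full((batch_size,), START_PAD_IDX, dtype=torch.long)
    step_probs = []
    for t in range(target_len):
        pred_char, dec_hidden, probs = decoder_step(prev_char, dec_hidden, enc_outputs)
        step_probs.append(probs)
        # with probability TEACHER_FORCE_RATIO feed the true character,
        # otherwise feed back the decoder's own (possibly wrong) prediction
        use_teacher = random.random() < TEACHER_FORCE_RATIO
        prev_char = target_seq[:, t] if use_teacher else pred_char
    return torch.stack(step_probs, dim=1)    # (batch, target_len, vocab)
```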

5 Experimental results

To trace the best combination of hyper-parameters and analyze comparative performance, we have run our models with various configurations. Instead of independent testing, we conducted simultaneous validation with the test dataset: if ten consecutive iterations show no improvement in validation accuracy, training stops. We randomly sampled approximately 10% of the entire dataset for testing, and the remaining 90% was used for training (80% as train-set, 10% as validation-set). Accuracy is measured by subtracting the character error rate from 100. As shown in Eq. 1, the Levenshtein distance between each predicted lemma and its target lemma is calculated first, and the character error rate is obtained as the ratio of the summed distances to the total number of characters among all lemmata of the dataset in consideration.

$$\begin{aligned} Accuracy = (1 - \frac{ \sum _{I, L \in D}levenshtein\_distance(I, L)}{\sum _{L \in D} length(L)}) \times 100 \end{aligned}$$
(1)

To report performance relative to other approaches, we use EM (Exact Match Percentage), the percentage of correctly predicted lemmata in the test/validation phase. Equation 2 is used for this purpose.

$$\begin{aligned} EM = \frac{\sum _{I, L \in D}I==L?1:0}{\sum _{I \in D}1} \times 100 \end{aligned}$$
(2)

In Equations 1 and 2, the symbols I, L, and D denote the lemma predicted for an inflected word, the target lemma, and the dataset, respectively. We conducted the training phase of the proposed BaNeL model on Google Colab Pro; details of the environmental setup are given in Table 5.
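For concreteness, the two metrics can be computed as in the following sketch (a plain-Python Levenshtein implementation; the function names are ours).

```python
def levenshtein(a, b):
    """Standard edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def accuracy_and_em(predicted, gold):
    """Character accuracy (Eq. 1) and exact-match percentage (Eq. 2)."""
    total_dist = sum(levenshtein(p, g) for p, g in zip(predicted, gold))
    total_chars = sum(len(g) for g in gold)
    accuracy = (1 - total_dist / total_chars) * 100
    em = 100 * sum(p == g for p, g in zip(predicted, gold)) / len(gold)
    return accuracy, em

# e.g. accuracy_and_em(["কলম", "পাথর"], ["কলম", "পাথরে"])
```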

Table 5 Environmental setup

5.1 Hyper-parameter tuning

Alongside its parameters (the weight values of the inter-layer connections of a neural network), the performance of a model also depends heavily on several hyper-parameters. We found the optimal combination of such hyper-parameters manually, tracing out the best combination of the following:

  • Learning Rate

  • Batch Size

  • Hidden Size

  • Encoder RNN Layer

After that, we have also analyzed the effect of applying beam search in the inference phase.

5.1.1 Tracing optimal learning rate

We trained our model with four different learning rates: 0.0001, 0.0005, 0.001, and 0.0015. As expected, a lower learning rate yielded better accuracy but required more epochs to converge. Epoch-wise accuracy graphs for these learning rates are shown in Fig. 5.

Fig. 5 Learning Rate Tuning (Enc_layer = 2, hidden_size = 1024, batch_size = 64)

The highest observed accuracies for learning rates 0.0001, 0.0005, 0.001, and 0.0015 are 94.24% (23rd epoch), 94.03% (22nd epoch), 92.98% (16th epoch), and 92.17% (14th epoch), respectively. From the graph, we can also see that the lower learning rates (0.0001 and 0.0005) show more consistent performance, whereas the accuracy fluctuates considerably for the higher learning rates (0.001 and 0.0015). Although learning rate 0.0001 is ten times smaller than 0.001, the accuracy difference is not that pronounced. The use of the Adam optimizer in the back-propagation phase contributes to this, as the learning rate is continuously adjusted when no improvement is observed in validation accuracy.

5.1.2 Tracing optimal encoder layer

We traced out the optimal stack depth of the encoder, starting with a single-layer GRU to prepare the encoder outputs and recording accuracy changes up to four layers, as shown in Fig. 6.

Fig. 6 Encoder Layer Tuning (lr = 0.0001, hidden_size = 1024, batch_size = 64)

During this tracing, we kept the learning rate fixed at 0.0001. The highest observed accuracies for layer counts 1, 2, 3, and 4 are 94.17% (29th epoch), 94.24% (23rd epoch), 94.01% (31st epoch), and 94.27% (16th epoch), respectively. Although the four-layer encoder achieved the highest accuracy, the improvement is not significant. Moreover, the highest observed accuracy for the three-layer encoder is smaller than that of the two-layer encoder. The convergence time of the four-layer encoder is higher without yielding any noticeable improvement, so we considered the two-layer encoder best suited for our task.

5.1.3 Tracing optimal batch size

Batch size also has an impact on performance, as weight updates in the network follow a mini-batch gradient descent approach. We trained with four different batch sizes: 32, 64, 96, and 128. During this time, the other hyper-parameters were kept fixed at their optimal values, as discussed in the previous sections, with hidden_size 1024. The accuracy levels for these batch sizes are shown in Fig. 7.

Fig. 7 Batch_size tuning (Enc_layer = 2, lr = 0.0001, hidden_size = 1024)

From Fig. 7, we can see that batch size 64 outperformed the other batch sizes by a clear margin.

5.1.4 Tracing optimal hidden size

Keeping the learning rate, encoder layers, and batch size fixed at 0.0001, 2, and 64 respectively, we trained the model with various encoder/decoder hidden sizes ranging from 512 to 1024. Epoch-wise accuracy for these hidden sizes is shown in Fig. 8.

Fig. 8 Hidden_size Tuning (Enc_layer = 2, lr = 0.0001, batch_size = 64)

Training the model with hidden_size = 768 gave the highest accuracy, 94.49% at the 23rd epoch, compared to 94.33% for hidden_size = 512 and 94.24% for hidden_size = 1024. The absence of accuracy fluctuations with hidden sizes 512 and 768 indicates that both are better choices than 1024. Hidden_size = 512 requires slightly less time to converge than hidden_size = 768, but hidden_size = 768 gives slightly better accuracy. So our final combination of optimal hyper-parameter values is:

  • Learning Rate: 0.0001

  • RNN Layer Count: 2

  • Hidden Size: 512 or 768

  • Batch Size: 64

With this optimal combination, our proposed BaNeL model achieved 94.49% accuracy using beam search decoding with beam_width = 10. The next section describes in detail the impact of greedy decoding and beam decoding during inference.

5.1.5 Effect of beam search

A typical encoder-decoder model suffers heavily under greedy decoding. The prediction of the first character in the decoding phase may not be perfect; it may be the case that the expected character has a good probability after the softmax layer but not the highest one. Making such a locally best choice prevents the globally best sequence from being generated. Incorporating beam search in the decoding phase maintains the K best choices instead of only the single best one; for greedy decoding, K equals one. K is called the beam width.
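A generic character-level beam search is sketched below to make the idea concrete; the interface (a `step_fn` returning per-character log-probabilities and a new decoder state) is our own simplification, not the BaNeL API.

```python
def beam_search(step_fn, init_state, start_token, end_token, beam_width=15, max_len=16):
    """Keep the beam_width best partial hypotheses at every decoding step.

    step_fn(prev_token, state) must return (log_probs, new_state), where
    log_probs is a sequence of per-character log-probabilities.
    """
    beams = [([start_token], init_state, 0.0)]       # (sequence, state, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, state, score in beams:
            if seq[-1] == end_token:                  # finished hypothesis: carry over as-is
                candidates.append((seq, state, score))
                continue
            log_probs, new_state = step_fn(seq[-1], state)
            for token, lp in enumerate(log_probs):
                candidates.append((seq + [token], new_state, score + lp))
        beams = sorted(candidates, key=lambda c: c[2], reverse=True)[:beam_width]
    return beams[0][0]                                # characters of the best hypothesis
```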

Fig. 9 Tracing best beam width (Enc_layer = 2, lr = 0.0001, hidden_size = 768)

Figure 9 shows the effect of beam width on accuracy. A higher beam width is almost guaranteed to yield higher accuracy; however, the computational cost increases rapidly as the beam width grows. From Fig. 9 we can see that the accuracy did not improve much for beam widths 10, 15, and 20 compared to beam widths 1 (greedy decoding) and 5. The highest accuracies achieved with beam widths 10, 15, and 20 are 94.49%, 95.75%, and 96.81%, respectively. As the accuracy improvement from increasing the beam width from 15 to 20 is marginal relative to the added computational cost, we have chosen a beam width of 15 as the optimal value.

5.2 Over-fitting issue

Although we achieved 95.75% validation accuracy at the 22nd epoch, the training vs. validation loss graph in Fig. 10, obtained with the traced optimal hyper-parameter combination, shows that the training loss was still decreasing even after the 38th epoch.

Fig. 10 Train vs Validation Loss (Enc_Layer = 2, lr = 0.0001, batch_size = 64, hidden_size = 768, beam_width = 15)

At this stage, our model started to over-fit the training data, which is why the training loss kept decreasing while the validation loss stopped improving. A larger dataset could circumvent this phenomenon and yield even better accuracy with our BaNeL lemmatizer. This also indicates that there is scope for further improvement of the BaNeL model.

5.3 Comparative performance analysis

To report how our proposed model performs, Benlem [5], BLSTM-BLSTM [6], and Akhmetov et al. [1] have been used as baselines. Although the character accuracy of BaNeL is 95.75%, the exact match measure EM (Eq. 2) is a bit lower: among the 2236 predicted lemmata of the test-set, 183 entries did not exactly match the target lemma, giving an observed EM of 91.81%.

We used the same training data to train the BLSTM-BLSTM model and to build the random forest of Akhmetov et al. [1]. Both of these approaches, along with our proposed BaNeL model, were then tested on the test split of the dataset.

Fig. 11 Exact Match Percentage Comparison with BaNeL, BLSTM-BLSTM and Akhmetov et al.

Figure 11 shows the achieved exact match percentages. From the figure we can see that the BLSTM-BLSTM approach performed better than that of Akhmetov et al., which is not surprising since even the original Akhmetov et al. paper reported lower accuracy for highly inflectional languages like Estonian and Farsi. The BLSTM-BLSTM structure is, to some extent, similar to our decoder structure, except for the use of a time-step-aligned context vector, and its performance also reflects this similarity: 87.29% of its predicted lemmata matched the actual target lemma, compared to 91.81% for the BaNeL model.

As we had no access to the original Benlem model, we used the dataset described in that article for comparison. Using the Benlem dataset in the following configurations, this section presents a comprehensive performance comparison of the proposed BaNeL model with tuned hyper-parameters.

  • C1: Randomly sampled 90% of the BaNeL dataset for training (80% as train-set and 10% as validation-set) and entire Benlem dataset as test

  • C2: Entire BaNeL dataset as train and entire Benlem dataset as test

  • C3: Reported accuracy on Benlem model in the original paper

  • C4: 90% of the Benlem dataset for training (80% as train-set and 10% as validation-set) and the remaining 10% of the Benlem dataset for testing.

Figure 12 shows the achieved (C1, C2, C4) and previously reported (C3) exact match percentages for these configurations.

Fig. 12 Exact Match Percentage Comparison with Benlem

In configuration C1, the BaNeL model is trained with 90% of the tuples from our collected dataset. When tested on the entire Benlem dataset, a 96.21% exact match of the predicted lemmata is recorded, which is significantly higher than the accuracy reported for the Benlem model (C3). Training the model with the entire collected dataset (C2) reaches a similar accuracy level to C1. We also trained the BaNeL model with 90% of the original Benlem dataset, keeping the remaining 10% for testing (C4). However, as that dataset is considerably small, the BaNeL model quickly overfits the training data, resulting in only 87.21% test accuracy. This shows how crucial dataset size is for training the BaNeL model.

6 Conclusion

This paper describes a new large POS-tagged dataset for Bangla and presents a new lemmatization framework for Bangla called BaNeL, which is based on a neural encoder-decoder. The accuracy of the framework is significantly higher than that of other Bangla lemmatization techniques. Moreover, the BaNeL model relies on character-level sequence-to-sequence mapping without considering language-dependent word-inflection rules, unlike existing rule-based approaches. This generic design makes the model applicable to other languages that show inflections similar to Bangla, especially languages originating from “Magadhi Prakrit”, such as Bhojpuri, Bihari, Odia, and Marathi. The performance of BaNeL on Eastern Magadhan languages like Assamese and Chakma is expected to be similar to that on Bangla if trained with a suitable dataset, owing to the striking similarity in word formation and contextual inflection. However, the requirement of a POS-annotated dataset is a limitation of the BaNeL model when applying it to a new language. Further research can be conducted to make applying BaNeL to new languages easier.