1 Introduction

The preservation of spoken colloquial language is an important task, which requires the collection of relevant materials and their careful curation. The Donate Speech (Lahjoita puhetta) campaign embarked on the quest of preserving the current state of the spoken Finnish language and boosting the development of AI that understands spoken Finnish. To this end, a large collection campaign was initiated that resulted in the creation of a large-scale colloquial Finnish speech corpus. In this paper, we explain how the collection and curation of the data were performed to maximise the number of participants while still ensuring a high quality of the dataset. Furthermore, we demonstrate with pilot projects and their results how the materials can be used to study and develop new technology and services in the Finnish language.

Currently, there is only one large freely available transcribed Finnish speech corpus, the Finnish Parliament ASR Corpus.Footnote 1 It contains over 3000 h of professionally transcribed speech which is rather formal in style and often read from the speaker’s notes. However, colloquial, spontaneous Finnish differs significantly from formal Finnish in multiple aspects. Considering phonological features, for instance, durations of phones are longer in read speech than in spontaneous speech (Lennes, 2009). From the morphological and lexical point of view, it is common to truncate or combine words, and to use non-standard word inflections in addition to words not used in written text. Finnish has a near-phonemic orthography: there is usually a one-to-one mapping from letters to phonemes, except for some rare cases such as certain loan words and the “ng” letter pair, which is not pronounced as /n/ followed by /g/ (the normal phonemes for the letters “n” and “g”, respectively) but instead has its own phoneme, /ŋ/. Because of the near-phonemic orthography, the phonological variation can mostly be transcribed unambiguously into text. Since there is no standard transcription style for colloquial speech, the spelling variants of a single word can be numerous [for example, the word “minä” (“I”, first person singular) can be written as “mä”, “mie” or “mää”], which further increases the distance between the domains of formal and colloquial Finnish.

There are a few smaller corpora that include carefully transcribed spontaneous, colloquial Finnish speech. The SPEECON (Iskra et al., 2002) corpus is a collection of speech for multiple languages, recorded in varying environments. It includes both read and spontaneous speech from 550 speakers. The spontaneous Finnish part includes ten sentences from each speaker, in total about 18.8 h. The FinDialogueFootnote 2 part of the FinINTAS (Lennes, 2009) corpus contains 6338 utterances by 22 speakers. The speech comes from spontaneous and unmonitored conversations between participants, and includes about 10.4 h of speech in total. The DSPCONFootnote 3 corpus consists of free-form conversations between students, recorded at Aalto University between 2013 and 2016. It includes 5281 spontaneous sentences from 218 different male students and 24 female students, totalling 9.8 h (Enarvi, 2018). Combining these three corpora, there are, to the best of our knowledge, about 40 h of transcribed spontaneous Finnish speech currently available for research (non-commercial) use.Footnote 4 We note that substantial amounts of colloquial Finnish speech were collected in the 1960s and 1970s by the Institute for the Languages of Finland as well as some cultural foundations, but that data is not yet available for commercial development use under the European data protection legislation.

For major languages like English, large spontaneous and colloquial speech corpora are available for research and commercial use. The Switchboard corpus (Godfrey et al., 1992) consists of about 260 h of telephone conversations among 302 male and 241 female speakers. The Fisher corpus (Cieri et al., 2004) includes approximately 2000 h of colloquial telephone conversations. These two corpora, for example, have been actively used in speech research for many years, and technologies built for spontaneous English have greatly benefited from them. Even though Finnish has far fewer speakers than the major languages (it is not even in the top 100), the new Lahjoita puhetta corpus covers more speakers per language (over 20k) than probably any other publicly available spontaneous speech corpus. The Lahjoita puhetta 2021 release consists of 3600 h of speech, out of which about 1600 h have been transcribed. The data covers all regions of Finland and has both male and female, mostly native, speakers in all age brackets. In this work, we describe this dataset and how it was curated, and demonstrate its use cases. Specifically, the contributions of this work are the following:

  1. Presenting an open large colloquial speech corpus for Finnish.

  2. Describing a successful concept for large-scale speech data curation.

  3. Demonstrating the utility of the corpus in speech recognition and metadata (gender, age, dialect and topic) classification.

  4. Defining relevant benchmarks for speech recognition and metadata classification.

  5. Providing trained, downloadable baseline systems for the benchmarks, and open-source code for reproducing the systems.

All of the tools and resources described in this work can be accessed online.Footnote 5

2 Data collection

The speech material donated during the campaign is shared by the Language Bank of Finland (Kielipankki),Footnote 6 coordinated by the University of Helsinki. Since speech samples may contain personal data, they are protected by European and national data protection legislation, most notably by the General Data Protection Regulation (GDPR).Footnote 7 The speech material has been collected based on the legitimate interest of individual researchers, universities, research organisations and private companies to study language or artificial intelligence, to develop AI solutions and to provide higher education in the aforementioned areas. To use legitimate interest as the legal basis for the processing of personal data, it was necessary to carry out a balancing test to ensure that the legitimate interests are not overridden by the interests or fundamental rights and freedoms of the data subjects.

To inform the individuals who donated their speech to the campaign, two essential documents were drafted: a short information page including simple conditions of participation, and a more comprehensive data protection policy. The donors had to acknowledge that they were informed of the conditions of participation before they could start donating.

An ethical review of the data collection was not needed, as such a review applies only to the research settingsFootnote 8 specified by the Finnish National Board on Research Integrity (TENK). It was also considered that the risks to the rights and freedoms of natural persons were rather low, but to be sure, a data protection impact assessment (DPIA) was carried out. For a more detailed description of the campaign and its legal background documents, see Lindén et al. (2022).

The goal of the campaign was not merely to collect a vast amount of any kind of speech, but to reach out to as many different groups of Finnish speakers and to as many individuals as possible. In marketing the campaign to citizens, it was emphasised that all variants of spoken Finnish are welcome, including speech from second language Finnish learners. However, in order to understand the privacy notice and the instructions, a certain level of language proficiency was required from the speech donors.

Key issues and challenges in designing the user interface included determining elicitation methods that entice a person to speak freely, gaining the trust of the speakers and making them feel comfortable while also satisfying the legal requirement of presenting the necessary information in an easy-to-understand format, as well as more technical choices concerning the supported platforms, presentation forms, and visual and auditory feedback on the ongoing recording and its quality. After some ideas for themes had been formulated and tested, Yle (the Finnish Broadcasting Company) settled on the fail-safe recurring format of showing a video, a picture or some textual content to entice a person to speak, with easy-to-use one-button starting and stopping of the recording.

Cooperating with Yle was crucial for marketing the campaign and for attracting the attention of the citizens of Finland. In the end, Yle developed around 40 straightforward topics, within ten different themes, to stimulate the collection of speech data. As part of the campaign, Yle made comical infomercials asking the general public to donate speech. These were broadcast during programme breaks on national radio and TV channels during the summer and autumn of the Covid-19 pandemic in 2020, with some trailing reruns during spring 2021. In 2021, the data collection campaign was awarded the best European Digital Audio Project prize by PRIX EUROPA, which was founded by the European Parliament, the European Commission and the European Cultural Foundation in 1987.

To illustrate the campaign results with regard to collection speed, the number of recordings received in each month of the campaign is shown in Fig. 1. The peaks at the beginning and the end of 2020 reflect the effects of increased public advertising activity.

Fig. 1: The number of recordings received in each month during the campaign

2.1 Metadata complementing the speech corpus

Identities of speakers were not collected explicitly, but we assume that one application client identity (of the browser or smartphone application used for recording) corresponds to one speaker. This assumption is not watertight, since one person may use multiple application clients, or multiple persons may use one client, but the correspondence generally holds. Under this assumption, the number of speakers is well over 20k, which is quite a good sample of all Finnish speakers, who number fewer than six million in total.

When opening the Lahjoita puhetta website or phone app, the user is offered a few different themes to choose from. To focus the campaign, not all of the themes are always available on the website. The complete list of themes, with their English translations and the abbreviations used in this text, is the following:

  • “Eläinystävät” (“Animal friends”, A)

  • “Urheiluhetket” (“Sports moments”, SP)

  • “K-18” (“Rated R”, R)

  • “Luonto, sää ja mää” (“Nature”, N)

  • “Lähelläni juuri nyt” (“My surroundings”, M)

  • “Mediataidot 4–6 lk.” (“Media skills—grade 4–6”, MS4)

  • “Mediataidot 8–9 lk.” (“Media skills—grade 8–9”, MS8)

  • “Mediataidot lukio” (“Media skills—high school”, MSH)

  • “Kirottu korona” (“The cursed covid”, C)

  • “Sukella kesään” (“Summer”, S)

Each theme includes up to eight different topics that ask a question or in some other way invite the user to speak about the topic. Each recording therefore pertains to a general theme, as well as to a certain topic within that theme. The theme and topic are metadata that can be used to categorise the recordings.

Between the recording prompts, the participant is asked multiple questions about his or her background. These metadata questions include dialect background, gender, native language, age, place of residence, birthplace, occupation and education. In this paper, we focus on the first four of these metadata types.

The dialect background question offers 20 options to choose from. In order to have fewer classes, we clustered these dialect regions into eight larger dialect groups, based on the information provided by the Institute for the Languages of Finland.Footnote 9 The dialect groups and the abbreviations used in this paper are listed below (a sketch of the grouping as a simple lookup table is given after the list):

  1. The Southwestern dialects (SW)

     • Varsinais-Suomi

     • Ahvenanmaa

  2. The transitional dialects between the Southwestern and Häme dialects (TRAN)

     • Uusimaa

     • Satakunta

  3. The Häme (Tavastian) dialects (HÄME)

     • Pirkanmaa

     • Häme

  4. The dialects of South Ostrobothnia (Pohjanmaa) (SO)

     • Etelä-Pohjanmaa

     • Pohjanmaa

  5. The dialects of Central and North Ostrobothnia (Pohjanmaa) (CNO)

     • Keski-Pohjanmaa

     • Pohjois-Pohjanmaa

  6. The dialects of Peräpohjola (the Far North) (FN)

     • Lappi

  7. The Savo dialects (SAVO)

     • Pohjois-Savo

     • Etelä-Savo

     • Kainuu

     • Keski-Suomi

     • Pohjois-Karjala

     • Kymenlaakso

     • Päijät-Häme

  8. The Southeastern dialects and a few transitional dialects bordering on them (SE)

     • Etelä-Karjala

  9. Non-native Finnish speakers (NN)
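To make the clustering concrete, the following is a minimal sketch of the grouping as a lookup table. The region names are exactly those listed above; the `native` flag, the helper function and the "N/A" fallback are illustrative assumptions about how a donor's metadata answers could be mapped to the classes used in this paper, not a description of the actual processing code.

```python
# A minimal, hypothetical sketch of the region-to-dialect-group clustering.
DIALECT_GROUPS = {
    "SW":   ["Varsinais-Suomi", "Ahvenanmaa"],
    "TRAN": ["Uusimaa", "Satakunta"],
    "HÄME": ["Pirkanmaa", "Häme"],
    "SO":   ["Etelä-Pohjanmaa", "Pohjanmaa"],
    "CNO":  ["Keski-Pohjanmaa", "Pohjois-Pohjanmaa"],
    "FN":   ["Lappi"],
    "SAVO": ["Pohjois-Savo", "Etelä-Savo", "Kainuu", "Keski-Suomi",
             "Pohjois-Karjala", "Kymenlaakso", "Päijät-Häme"],
    "SE":   ["Etelä-Karjala"],
}

# Invert the table: region name -> dialect-group abbreviation.
REGION_TO_GROUP = {region: group
                   for group, regions in DIALECT_GROUPS.items()
                   for region in regions}

def dialect_group(region: str, native: bool = True) -> str:
    """Map a donor's dialect-background answer to a dialect-group label."""
    if not native:
        return "NN"                          # separate class for non-native speakers
    return REGION_TO_GROUP.get(region, "N/A")  # "N/A": unanswered or contradictory

print(dialect_group("Kainuu"))          # -> "SAVO"
print(dialect_group("Lappi"))           # -> "FN"
print(dialect_group("", native=False))  # -> "NN"
```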

2.2 Corpus statistics

In the Lahjoita puhetta 2021 release, there are about 3600 h of recordings in total, and over 20k different speakers. The median speaker donated eight recordings, while the top donor donated 1039 recordings. The median duration of a recording is about 40 s and the longest are about 10 min.

Silent parts were trimmed from the beginnings and ends of the recordings using the silence effect of SoX,Footnote 10 with a threshold of 0.5% and a duration of 0.05 s. After trimming, 3270 h remained, and randomly selected recordings were sent to human transcribers. The transcribed subset we received contained 512 recordings with empty transcriptions. Some of these were silent audio and some were left empty by mistake by the transcribers, but all 512 were discarded at this point. To verify the quality of the human transcriptions, we generated ASR transcriptions with a hybrid HMM/DNN (hidden Markov model/deep neural network) system trained on the previously existing colloquial Finnish speech data: DSPCON, FinDialogue and SPEECON (see Sect. 1). The average WER (word error rate) was around 38% and the CER (character error rate) about 15%. We then filtered out recordings for which both the WER and the CER were over 94%, in order to mitigate the chance of having low-quality samples in the ASR training corpus. From the set of about 100k transcribed recordings, 392 had WER and CER over the threshold and were excluded. Combined with the 512 empty-transcript recordings, the excluded 904 recordings amounted to about 9.1 h.
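The exact SoX invocation is not given in the text; the following is a minimal sketch, assuming the SoX command-line tool is installed, that reproduces the reported parameters (threshold 0.5%, duration 0.05 s) and uses the common reverse/silence/reverse chain to trim silence from both ends. The file names are placeholders.

```python
# A minimal sketch of the silence-trimming step using SoX from Python.
import subprocess

def trim_silence(in_wav: str, out_wav: str) -> None:
    """Trim leading and trailing silence with the SoX `silence` effect."""
    subprocess.run(
        [
            "sox", in_wav, out_wav,
            # remove leading silence: require 0.05 s of audio above 0.5%
            "silence", "1", "0.05", "0.5%",
            # reverse, trim the (now leading) trailing silence, reverse back
            "reverse", "silence", "1", "0.05", "0.5%", "reverse",
        ],
        check=True,
    )

trim_silence("donation.wav", "donation_trimmed.wav")
```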

We sampled a 10-h test set and a 10-h development set from the transcribed speech data, each including at least 10 min of speech for each metadata class in each of the five metadata domains. We also modified the speaker gender ratio of the test and development sets so that they have over 40% male speakers, although the training set has just over 20%. As a second test dataset, we used a 1-h set that was transcribed by four different transcribers and includes 58 recordings from 57 speakers. If we add all recordings by those 57 speakers to this subset, we get a 10-h test set that we call “test multi-transcriber speakers” in Table 1. The rest of the transcribed speech is used as training data. The train, dev and test sets have no overlapping speakers. Some untranscribed recordings are by speakers of the dev or test sets; these are left unused, leaving about 3230 h in the complete dataset that we use. Table 1 lists the sizes of the corpus subsets.
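The paper does not describe the exact algorithm used to construct these sets; the following is a rough, hypothetical sketch of one greedy way to sample a held-out set at the speaker level, enforcing a minimum amount of speech per metadata class and avoiding speaker overlap with the training data. All field names and the example data are illustrative.

```python
# A rough sketch of greedy speaker-level sampling with per-class coverage.
import random
from collections import defaultdict

MIN_PER_CLASS = 10 * 60    # at least 10 min of speech per metadata class
TARGET_TOTAL = 10 * 3600   # roughly a 10-h set

def sample_held_out_speakers(recordings, seed=0):
    """recordings: dicts with 'speaker', 'duration' (seconds) and 'classes'
    (metadata labels such as age bracket, gender, dialect and theme)."""
    by_speaker = defaultdict(list)
    for rec in recordings:
        by_speaker[rec["speaker"]].append(rec)

    speakers = list(by_speaker)
    random.Random(seed).shuffle(speakers)

    chosen, total, coverage = [], 0.0, defaultdict(float)
    for spk in speakers:
        recs = by_speaker[spk]
        helps_rare_class = any(coverage[c] < MIN_PER_CLASS
                               for r in recs for c in r["classes"])
        # Keep adding speakers until the target size is reached, then only
        # accept speakers that still add speech to an under-covered class.
        if total < TARGET_TOTAL or helps_rare_class:
            chosen.append(spk)
            for r in recs:
                total += r["duration"]
                for c in r["classes"]:
                    coverage[c] += r["duration"]
    # Speakers in `chosen` (with all their recordings) go to the held-out set;
    # everyone else stays in training, so there is no speaker overlap.
    return set(chosen)

recs = [
    {"speaker": "spk1", "duration": 45.0, "classes": ["female", "21-30", "SAVO", "A"]},
    {"speaker": "spk2", "duration": 80.0, "classes": ["male", "41-50", "TRAN", "N"]},
]
print(sample_held_out_speakers(recs))
```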

Table 1 The sizes of the corpus and its subsets

Figure 2 presents the amount of speech for each metadata type as a portion of the whole training set (transcribed and untranscribed pooled together) and of the 10-h main test set. As the transcribed training set is a 1600-h random sample, and the whole training set covers about 3199 h of the complete 3230-h dataset, the training data accurately represents the overall distribution of the whole dataset. The corpus has varying amounts of speech from the different metadata classes. Children younger than 11 years have donated a relatively small amount, as have people older than 80 years. Of the dialects, SAVO and TRAN have the most data, roughly a quarter each. Women have donated significantly more than men: over three times as much. Investigating the reason for this male-female imbalance is beyond the scope of the present work, but we may speculate, for example, that the campaign might have been advertised between TV and radio shows whose audience is predominantly female, although unfortunately we do not know the demographics of the audiences to which the campaign was advertised. Another factor could be that women might be more likely to respond to surveys in general (Smith, 2008); although speech donation is not exactly a survey, it is similar enough that trends in survey response rates could give some clues to why our dataset is imbalanced. Four themes have a low amount of speech: Rated R and the three Media skills themes, as they were added only after the official marketing campaign had already ended. In all metadata domains, the test set was smoothed to have at least 10 min of speech from each metadata class, as is visible in the figure.

Fig. 2: The distribution of the speaker metadata in the corpus. The “training set” includes both the “train transcribed” and “train untranscribed” sets described in Table 1. “N/A” means the user has not answered the question about his or her background, or has given multiple contradicting answers

Figure 3 displays the distribution of recording lengths. The majority of the recordings are shorter than 2 min, but longer recordings are not uncommon. There are spikes at the 2-min and 10-min marks. The spike at 10 min was caused by the limit on recording duration: donors who would have spoken for longer were cut off at 10 min. The other spike, at 2 min, corresponds to the duration of a video clip that was played to the user in one topic. The theme was “Summer”, and in this topic the user was asked to describe to an alien what is happening in the video clip, while the video displayed scenes of Finnish summer pastimes.

Fig. 3: The recording length distribution. The recording durations are pooled into 1-s bins

3 Annotation procedure

Because high-quality manual transcription of 1600 h of spontaneous speech is a significant investment, we made an effort to develop a careful process, described in detail in this section. The aim was an exact transcription, which included not only the verbal content of the speech (full words, repetitions, hesitations, and partially pronounced or only partially audible words) but also non-verbal communication such as laughs, growls, and coughs. The guidelines that were given to the transcribers are reproduced in the Appendix.

3.1 First phase: annotator selection

To choose the best transcriber companies, we ran a pilot transcription competition in which we shared a 20-h subset of the data with all candidates, along with carefully constructed annotation instructions. The dataset consisted of 19 h of randomly selected data per participant mixed with a common 1-h evaluation set (the composition of the data was not disclosed to the companies). After the competitors submitted their transcripts, we evaluated them automatically and manually, using the overlapping 1-h set to determine the quality of their work, as well as an hour of random samples from the non-overlapping parts to verify the automated comparisons manually. During the evaluation process, we had no information about individual annotators, so we treated each company as a single transcriber. Our primary goal was to validate that they could produce high-quality transcripts for the collected data.

The automatic evaluation focused on comparing the transcripts of the different annotators with each other and with multiple ASR systems. Our goal was to select companies that could produce high-quality transcripts meeting the standards of the Language Bank of Finland (Kielipankki). First, standard ASR metrics, the word error rate (WER) and the character error rate (CER), were used to estimate the inter-annotator agreement. Specifically, one annotator's transcript was used as the reference to calculate edit distances to the others, treating them as speech recognition systems. This allowed us to create a preference order from the perspective of one of the annotators. Repeating this process for all transcriber companies gave us multiple rankings, and we tried to identify outliers by aggregating these preference rankings. In the case of an outlier, we could verify that its transcription was of lower quality than the others by manually checking the transcripts with the most differences to the other transcribers. During these analyses, we ignored the non-word symbols, as they were annotated with considerable discrepancies by the different annotators.
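For illustration, the following is a minimal sketch of such a pairwise comparison, assuming the open-source `jiwer` package for WER/CER computation. The toy transcripts stand in for the shared 1-h evaluation set and are purely hypothetical; the ranking step mirrors the round-based elimination described below.

```python
# A minimal sketch of pairwise inter-annotator WER/CER comparison.
import itertools
import jiwer

# Hypothetical transcripts of the shared evaluation set, keyed by recording id.
transcripts = {
    "T1": {"rec001": "no siis mä kävelin eilen metsässä", "rec002": "se oli tosi kiva"},
    "T2": {"rec001": "no siis mää kävelin eilen metsässä", "rec002": "se oli tosi kiva"},
    "T3": {"rec001": "no mä kävelin eilen metsässä", "rec002": "se oli tosi kivaa"},
    "T4": {"rec001": "no siis minä kävelin eilen metsässä", "rec002": "se oli kiva"},
}

def pairwise_scores(transcripts):
    """Use one annotator as the reference and score the others like ASR output."""
    scores = {}
    for ref_id, hyp_id in itertools.permutations(transcripts, 2):
        common = sorted(set(transcripts[ref_id]) & set(transcripts[hyp_id]))
        refs = [transcripts[ref_id][k] for k in common]
        hyps = [transcripts[hyp_id][k] for k in common]
        scores[(ref_id, hyp_id)] = (jiwer.wer(refs, hyps), jiwer.cer(refs, hyps))
    return scores

# Average each annotator's disagreement with the others; the worst annotator
# per round can then be eliminated, as described in the text.
scores = pairwise_scores(transcripts)
n = len(transcripts) - 1
avg_wer = {t: sum(w for (ref, hyp), (w, _) in scores.items() if hyp == t) / n
           for t in transcripts}
print(sorted(avg_wer.items(), key=lambda kv: kv[1]))
```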

The inter-annotator disagreements in terms of WER and CER were generally high due to the nature of the data (see Table 2). Still, we can observe considerable differences. These metrics allowed us to create a ranking per annotator. Fortunately, we only wanted to ensure the high quality of the transcription, so we did not have to use complex methods (such as the Borda count) to produce a complete order. In the end, we opted for a straightforward scheme to aggregate the individual preference orders by simply eliminating the worst annotator in each round until the desired number of annotators remained.

Table 2 Pairwise comparison between transcribers (T) using the word and character level edit distances

Looking at the values in Table 2, we can see that T1 had the highest disagreement with the others, both in terms of WER and CER. This assessment of the transcription quality was also substantiated by manually inspecting 1-h random samples from each candidate. Thus, T1 was the first to be eliminated. Of the remaining annotators, T2 and T3 disagreed most with T4. Nevertheless, the differences between these three annotators were relatively small, so in the end we opted to accept all three in this round of selection.

Next, we repeated the experiments, but this time we compared the transcripts with ASR outputs. Two models were selected for this purpose: a hybrid HMM/DNN system and a Wav2Vec2-based (Baevski et al., 2020) end-to-end network. The hybrid HMM/DNN system was trained on the existing spontaneous colloquial Finnish speech datasets, DSPCON, FinDialogue and SPEECON (spontaneous part), totalling about 37 h. The first-pass n-gram LM and the second-pass RNN LM were trained on the WEBCON (Enarvi, 2018) corpus and the speech transcripts, in total about 76 million words. For the end-to-end model, we decided to utilise the publicly available multilingual Wav2Vec2 Large model pre-trained on 100k h of the VoxPopuli dataset (Wang et al., 2021). The model was fine-tuned on the same 37-h colloquial Finnish corpus used to train the hybrid system.
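The text does not specify which toolkit was used for this fine-tuning; the following is a minimal sketch assuming the Hugging Face `transformers` library, a character vocabulary file built from the 37-h transcripts (`vocab.json`, hypothetical), and the publicly released VoxPopuli checkpoint name, which is an assumption. The actual training loop is omitted.

```python
# A minimal sketch of initialising CTC fine-tuning from a VoxPopuli checkpoint.
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
    Wav2Vec2ForCTC,
)

# Hypothetical character vocabulary built from the colloquial Finnish transcripts.
tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                             padding_value=0.0, do_normalize=True,
                                             return_attention_mask=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Load the pre-trained encoder and attach a randomly initialised CTC head.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-100k-voxpopuli",   # assumed checkpoint name
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()  # common practice when fine-tuning on little data
```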

Comparing with ASR models reaffirmed our previous findings (Table 3). We can see that comparing the ASR models with T1 leads to the highest error rates. An interesting observation is that both models seem to favour T2, yielding the lowest error rates, followed by T3 and T4.

Table 3 Pairwise comparison between transcribers (T) and ASR systems using the word and character level edit distances

Lastly, we also validated the conclusions of all automatic experiments by manually checking the utterances with the largest differences (revealed by the previous examinations). The manual inspection revealed that T4 had transcribed files mostly correctly, but they often used the formally correct spelling instead of writing the verbatim spoken version. This resulted in slightly higher error rates compared with T2 and T3. Comparing T2 and T3 we saw that the latter skipped the extremely noisy part of an utterance, resulting in T2 being selected as the most diligent annotator.

Combining all observations, we concluded that companies T2, T3, and T4 were all capable of creating sufficiently high-quality transcripts, so we continued to work with them to transcribe a large portion of the collected corpus.

3.2 Second phase: quality control

After the initial selection phase, we continued to utilise our ASR models to perform automatic quality control checks. Our goal was to highlight recordings with unusually high error rates (WER \(\ge 94\%\)) for manual inspection. In practice, once we received the transcriptions from the companies, we applied the same ASR models as in the first phase to get the WER and CER for each utterance. To avoid unnecessary checks, we only selected files with a high WER and CER according to both models.
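The following is a minimal sketch of this flagging step, assuming the per-utterance error rates have already been computed against the two ASR systems (for example with the `jiwer` package used above). The data structure, field names, and the use of the same 94% threshold for the CER are illustrative assumptions.

```python
# A minimal sketch of the automatic quality-control flagging.
from dataclasses import dataclass

THRESHOLD = 0.94  # flag utterances with error rates >= 94% under both models

@dataclass
class UttScores:
    utt_id: str
    wer_hybrid: float
    cer_hybrid: float
    wer_w2v2: float
    cer_w2v2: float

def flag_for_manual_check(scores):
    """Return utterance ids whose WER and CER are high under both ASR models."""
    return [
        s.utt_id for s in scores
        if min(s.wer_hybrid, s.wer_w2v2) >= THRESHOLD
        and min(s.cer_hybrid, s.cer_w2v2) >= THRESHOLD
    ]

# Example: only the second utterance exceeds the threshold under both models.
scores = [
    UttScores("utt001", 0.35, 0.12, 0.40, 0.15),
    UttScores("utt002", 1.20, 0.98, 1.05, 0.96),
]
print(flag_for_manual_check(scores))  # -> ['utt002']
```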

Our manual examinations revealed several problems that we could address during the annotation process. One of the primary issues we identified was a mismatch between the transcription and the audio files (approx. 20 transcripts had been assigned to the wrong recording). Naturally, with the help of the annotators, we could fix this problem quickly. The second source of high ASR error rates was the presence of extreme noise, which made it hard for the ASR systems to recognise the speech. We kept these noisy recordings in the corpus to enable the building of noise-robust models.

Figure 4 depicts the error rates of the hybrid ASR model for each transcriber company. Note that due to legal constraints, we were unable to match the transcribing organisations' ids used here to those in the first phase; thus we could not analyse how their performance changed on the large dataset. Overall, we can see that the distributions are quite similar, meaning that from the ASR model's viewpoint the companies were equally good at providing the gold standard texts. There is a considerable number of utterances with more than 100% WER, but overall, the vast majority of recordings are recognisable with less than 50% error. The CER statistics further reassured us that the transcription quality is high; more than 75% of the utterances had a CER below 20%. The high errors can be explained by the problems discovered (noise, low volume, speaking far from the microphone).

Fig. 4: The distribution of word-level (top) and character-level (bottom) error rates per annotator on the transcribed dataset. Utterances with more than 100% error were pooled together for this visualisation. Note that the transcriber ids of the second phase do not match those of the first phase

4 ASR experiments and results

In this section, several ASR experiments with various architectures are presented. The goal of the ASR experiments is first to establish that the transcribed Lahjoita puhetta data is useful for creating ASR systems, and then to provide baseline results and recipe starting points for a few different ASR techniques. The trained ASR systems are also used to provide time alignments of the manually transcribed part, as well as ASR decoding outputs for the untranscribed part, which can later be used for indexing, searching, or statistical studies on the data, as attested, for example, by Carrive et al. (2021).

One initial difficulty in using the transcribed Lahjoita puhetta data for ASR is that many of the recordings are longer than is ideal for many speech recognition methods. Bootstrapping alignments for long recordings is more difficult, and long recordings exacerbate the vanishing gradient problem and also present practical issues related to memory consumption (Narayanan et al., 2019). In these experiments, we were able to bootstrap alignments and create shorter segmentations for the different systems by starting from simple monophone HMM/GMM (Gaussian mixture model) systems trained on the shortest utterances.

It is good to note that, as Finnish is an agglutinative language, the WER results are not directly comparable to those of, say, English. Hirsimäki et al. (2006) found that as one long Finnish word corresponds to several English words, the WER becomes multiplied. For this reason, we also report the CER results, which do not have this problem. Furthermore, some previous works (Enarvi et al., 2017) have used normalisation of colloquial Finnish words in order to mitigate the effect of the various spelling variants on the WER results. However, this method is partly manual and thus not easily scalable to large corpora, so we did not use such normalisation. Additionally, the transcripts contain special markers (e.g. for noise and pauses), and some decision has to be made about them in speech recognition: either to predict them or to simply discard them. We opted for the latter. Before calculating the WER and CER, we removed all the special tokens, such as “.laugh”, as well as the dash symbols “-” that indicate dysfluencies in speech, for example false starts (“predi- presidentti” was changed to “predi presidentti”).
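The following is a minimal sketch of this scoring-time clean-up. Only the “.laugh” marker and the false-start example come from the text above; the rest of the marker inventory and the trailing-dash handling are assumptions for illustration.

```python
# A minimal sketch of stripping special tokens and dysfluency dashes
# from a transcript before WER/CER scoring.
SPECIAL_TOKENS = {".laugh", ".noise", ".cough"}  # assumed marker inventory

def clean_for_scoring(transcript: str) -> str:
    """Drop special markers and trailing dysfluency dashes before scoring."""
    words = []
    for token in transcript.split():
        if token in SPECIAL_TOKENS:
            continue                    # discard non-word markers entirely
        token = token.rstrip("-")       # "predi-" -> "predi" (false start)
        if token:
            words.append(token)
    return " ".join(words)

print(clean_for_scoring("predi- presidentti .laugh sanoi niin"))
# -> "predi presidentti sanoi niin"
```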

4.1 Hybrid HMM/DNN ASR systems

The HMM/GMM approach and, later, the hybrid HMM/DNN approach have been popular in speech recognition for the last couple of decades. Although they are now outperformed by newer approaches (mainly end-to-end systems; see Sects. 4.2 and 4.3), they are still useful since they require relatively small training corpora, and versatile toolkits have been built around them. In particular, the Kaldi (Povey et al., 2011) toolkit provides optimised “recipes” to train and apply ASR systems, which we used to train baseline systems with our data. We then used the best baseline system to align the text and audio and to segment the speech. Since large end-to-end systems cannot handle long segments of speech (see Sect. 4.2), segmentation was necessary before we could train the end-to-end models.

In the first phase, we trained two models using mostly standard Kaldi recipes without hyperparameter tuning: one with a 100-h subset (denoted initial-100h-TDNN) and another with the complete transcribed training corpus (initial-1600h-TDNN). To train the HMM/GMM system for monophones and triphones, we used the Kaldi WSJ recipe. This recipe trains the initial monophone model on the shortest utterances in the data, which helps in bootstrapping the alignments. As a deviation from the standard WSJ recipe, we trained the final triphone system using the discriminative MMI (Bahl et al., 1986) training criterion, which is available as an optional addition in the WSJ recipe. The time-delay neural network (TDNN) (Peddinti et al., 2015; Waibel et al., 1989) models were trained using the HMM/GMM alignments. The TDNN architecture and other hyperparameters were adopted from the Switchboard recipe, since it trains a larger neural network, more suitable for the large training corpus. The TDNN has 15 layers with a dimension of 1536 and a bottleneck dimension of 160; in total, it has about 17M parameters.

Using the SRILM (Stolcke, 2002) toolkit, we trained 4-gram language models (LMs) on the Lahjoita puhetta (LP) 100-h training corpus transcripts, on the whole 1600-h training corpus transcripts, as well as on the LP transcripts pooled with other available colloquial Finnish text corpora, namely the WEBCON corpus and the DSPCON transcripts. The systems that utilised the external language modelling data are marked with “ext. LM data” in Table 5. We used the Morfessor (Creutz & Lagus, 2002, 2007) toolkit to segment words into subword units. We trained the Morfessor model on the same LP transcripts appended with the WEBCON and DSPCON corpora as used for the large LMs, with a corpus weight of 0.05. The resulting sizes of the LMs and their training corpora are listed in Table 4. We also trained LMs with a word vocabulary, but subword units yielded better results: for example, the word-based initial-1600h-TDNN system got a WER of 25.12% on the test set, compared with 24.00% using subword units, so we opted for subword units in the remaining experiments. For more details about the language models, see the published recipes.
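As an illustration of the subword segmentation step, the following is a minimal sketch using the Morfessor 2.0 Python API; the input file name is a hypothetical placeholder for the pooled LM training text, and whether the authors used the Python API or the command-line tools is not stated, only the corpus weight of 0.05 is taken from the text.

```python
# A minimal sketch of training a Morfessor model and segmenting a word.
import morfessor

io = morfessor.MorfessorIO()
# Hypothetical text file: the LP transcripts pooled with WEBCON and DSPCON.
train_data = list(io.read_corpus_file("lm_training_text.txt"))

model = morfessor.BaselineModel(corpusweight=0.05)  # corpus weight from the text
model.load_data(train_data)
model.train_batch()

# Segment a word into subword units for the subword lexicon and LM.
constructions, logp = model.viterbi_segment("presidentti")
print(constructions)
```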

Table 4 Sizes of the language models and their training corpora

We used the initial-1600h-TDNN to segment the training data so that it could be used for training the E2E ASR systems. The initial-100h-TDNN with the large LM was used to generate transcriptions for the rest of the training corpus, which we then used for training the topic and dialect classification systems (see Sect. 5).

After training the initial ASR systems, we did some simple hyperparameter tuning for the HMM/GMM system to get an idea of how much room for improvement there is compared with the Kaldi WSJ recipe. The tuning experiments focused mainly on increasing the number of parameters of the GMMs. By increasing the number of tied states (leaves) from 4200 (in the WSJ recipe) to 14,000, and the total number of Gaussians from 40,000 to 200,000, the WER of the penultimate, speaker-adaptive triphone system on the development set decreased from 42.86% to 39.71%. Training the MMI triphone system on top of the alignments from these systems, the WERs decreased to 37.08% and 35.36%, respectively, for the smaller and larger GMM/HMM systems. Finally, training the TDNN system on top of these MMI triphone models, the word error rates dropped to 22.09% (smaller GMM/HMM) and 21.98% (larger GMM/HMM) on the dev set, and to 24.00% and 23.88% on the test set.

Decoding with a large language model trained on external data brings an additional improvement compared with the LM trained on the 100-h transcriptions (see the second and third rows in Table 5). However, the 1600-h transcriptions seem to be enough to train a decent language model, and adding external data brings only a small improvement in the WER and CER results (see the last two rows in Table 5). It is good to note, however, that the external text data is not exactly in the same domain as the test corpus, although it is colloquial in style.

Table 5 Error rates of various ASR systems

Additionally, we wanted to demonstrate that the sizeable untranscribed portion of the corpus can be leveraged via semi-supervised training. For this experiment, we chose the approach presented by Manohar et al. (2018). To demonstrate that the recordings without annotations could be used to improve the ASR systems, we started the semi-supervised training by generating transcriptions of the additional data with the initial-100h-TDNN. Afterwards, we pooled the automatically transcribed portion (approx. 1587 h) and the 100-h set for model training. The resulting model (semisup-100h-model) had the same architecture as the initial-100h-TDNN to ensure a fair comparison. From the results (see Table 5), we can conclude that the additional unsupervised data is indeed valuable: the error rates dropped significantly. On the other hand, we can also see that having more accurately transcribed data is far more beneficial: the initial-1600h-TDNN outperforms the semi-supervised system by a large margin, and the hyperparameter tuning offers some additional improvements.

4.2 AED ASR systems

Various end-to-end ASR approaches, such as Connectionist Temporal Classification (CTC) (Graves et al., 2006), the Recurrent Neural Network Transducer (RNN-T) (Graves et al., 2013), and Attention-based Encoder–Decoder (AED) (Bahdanau et al., 2016; Chan et al., 2016) models, became popular in the 2010s, both in research and in industrial applications. We train AED models on the transcribed data to serve as end-to-end baselines. Our AED models are trained with the SpeechBrain toolkit (Ravanelli et al., 2021). They consist of a stack of convolutional, recurrent, and feed-forward layers in the encoder, a location-and-content-aware attention mechanism, and recurrent layers in the decoder, with altogether \(\approx 28\)M parameters. The inputs are log-Mel filterbank energies, and at each output step the network computes a distribution over a vocabulary of 1750 SentencePiece subword units. We trained with dynamic batching, targeting 50 s of audio per batch, for 100 nominal epochs of 10,000 updates each. For the first 20 nominal epochs, the encoder learning was aided by an additional multi-task CTC loss (Kim et al., 2017). We do not use any external language model with our AED system, making it fully end-to-end. For further details, we refer to the published recipe.
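As an illustration of the subword target preparation, the following is a minimal sketch of training a 1750-unit SentencePiece model, assuming the `sentencepiece` Python package. The input file name is a hypothetical placeholder for the cleaned training transcripts, and the unigram model type is an assumption; only the vocabulary size is taken from the text.

```python
# A minimal sketch of training the SentencePiece subword vocabulary for the AED targets.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="lp_train_transcripts.txt",  # one cleaned transcript per line (assumed)
    model_prefix="lp_subword_1750",
    vocab_size=1750,                   # vocabulary size used for the AED models
    model_type="unigram",              # assumption; the paper does not state the algorithm
    character_coverage=1.0,            # cover the full (small) Finnish character set
)

sp = spm.SentencePieceProcessor(model_file="lp_subword_1750.model")
print(sp.encode("predi presidentti sanoi niin", out_type=str))
```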

End-to-end models seem to have difficulties with long-form speech, both in learning and in generalising (Chiu et al., 2019; Narayanan et al., 2019). Our preliminary experiments with AED systems showed similar issues: the models did not converge with full-length utterances. Using segmentations produced with the HMM-based ASR systems, we split the data into shorter utterances. Training converges well on short (up to 10 s) segments and slightly slower on medium-length (up to 50 s) segments. Decoding an ad-hoc segmented version of the development set yields a WER of \(\approx 22\%\) with both models. However, on the official development set, which has longer utterances, both models show pathological behaviour on a minority of utterances, which increases the error rate considerably. Similar to reports by Keung et al. (

Fig. 9: Confusion matrix for the dialect classification model on the test set

On the topic classification plot, we can see that the model is performing well on almost all the classes. The weakest one seems to be the Rated R class, which generally has a low number of samples in the training and testing sets.