1 LANGUAGE DOCUMENTATION SYSTEMS OVERVIEW

LingvoDoc is a software system designed for collaborative language documentation and analysis by groups of researchers. The system is largely influenced by version control systems (e.g., GitHub, Bitbucket) that are widely used by programmers but too complex for daily use by linguists. The main goal of the project was to design a system that provides most of its features through a web interface for the most frequent documentation and processing tasks (input of archival and field audio and video material, its transcription, building connections through cognates, glossing, and comparison with other languages) and at the same time provides means for external data processing through an HTTP API for advanced users familiar with programming and natural language processing. Several existing systems can store similar language data; the closest to LingvoDoc are the Starling project [1], the LEGO project [2], TypeCraft [3], Kielipankki [4], and the corpus-tools project [5], but all these systems have functional limitations that we tried to overcome.

The Starling project is desktop software that allows users to create etymologically connected dictionaries. A single Starling dictionary looks like a table of lexical entries; each lexical entry has an integer identifier that is unique within a particular dictionary. A table can have an unlimited number of named columns of two types: columns that hold textual data and columns that hold pointers to other dictionaries. The cells of the latter type contain integer values that correspond to identifiers inside the dictionary whose filename matches the name of the column. The Starling user interface helps to navigate through connected entries and offers a wide range of possibilities for data input and analysis. Starling also has some export functions, including export to an HTML representation, so dictionaries can be published for read-only access on the Web.

Nevertheless, Starling has certain limitations.

This system is proprietary and has no sources open to the public; thus, it cannot be modified or extended by external developers. However, even if the sources were opened to the community, it would not change the situation much: the system has been developed in the Harbour programming language [

2 LINGVODOC SYSTEM OVERVIEW

LingvoDoc is a dictionary- and corpora-oriented system. The main page (lingvodoc.ispras.ru) provides the user with a list of published dictionaries and corpora, structured by grant or organization affiliation combined with the language tree, or simply organized by language. A quick navigation menu is located at the top of the page. The number in square brackets shows how many dictionaries or corpora a language or a language family contains (Fig. 1).

Fig. 1.

Navigation by language.

Next to each language displayed on the screen there is a number that shows how many dictionaries or corpora belong to it. A dictionary is presented as a multilayer structure, and each of the layers contains its own typed columns. As shown in Fig. 2, the number indicates how many layers a dictionary contains, and a click on the “View” button shows a list of available layers. We call these layers “perspectives,” after the similar term in the Eclipse integrated development environment. A click on the name of a selected perspective leads the user inside the chosen dictionary layer or corpus collection.

Fig. 2.

Dictionary perspectives.

The language tree structure is flexible, so one can add new dialects to it, but it still has some subsets of protected languages and dialects that can be changed by system administrators only. The system provides drag-and-drop features for structuring the language tree. Each non-protected language can be moved to the needed tree node as a daughter or “sibling” language, and it is possible to add a daughter node to any language with the help of the “Create” button (Fig. 3).

Fig. 3.

Language tree structuring interface.

The language tree is used during the dictionary or corpus collection creation process (Figs. 4, 5). A dictionary can be created in two modes: from scratch or by importing a dictionary from one of the sources supported by the system. Currently, the following import sources are supported: the format of early LingvoDoc versions and CSV with a special delimiter symbol. CSV files can be produced from Excel files, using the Starling export dialog, or in any other way. The LingvoDoc import module for CSV supports bulk import of several files, retaining the interconnections among imported dictionaries if the source of the CSV files is Starling.

Fig. 4.

Language selection in the dictionary creation dialog.

Fig. 5.

Most of the system entities can be translated into all supported locales.

Each dictionary has a flexible, customizable structure. A user can create dictionary perspectives with a custom set of typed columns (fields). If there is no appropriate column in the existing list, one can create a new column type; the system already provides several dozen column types, for example, Sound, Markup, Cognates, and Paradigm forms and contexts. The selected fields are represented as column names inside a dictionary. The types selected for the fields specify user interface interactions on the data in table cells (Fig. 6): if the field type is text, the user sees a text input for the cells of the column; if it is sound, the user sees a file selector for uploading and buttons to play files that have already been uploaded. This dialog also supports selecting nested fields, which is useful for dependent data such as markup files for audio files: one audio file may have more than one markup.

Fig. 6.

Dictionary structure customization mode.

After creating a dictionary, the user becomes the only owner of the new dictionary and can edit it or share their permissions with other users for collaborative work. Another option is to transfer their permissions to a grant registered in the system (grants can be created by system administrators), losing a number of their own permissions, mainly those related to the publishing process.

The editing process looks quite similar to Excel tables, with a number of differences (Fig. 7). Each cell in the table may hold several separate values (used for separate variants of the data from different authors, e.g., sounds). The cells act in accordance with the data type of the column: text, sound, markup, directed links, bidirectional links, or a linked set. Every new cell is created in an “unpublished” state, so it can be seen only by dictionary collaborators. Also, data edits are performed in a “mark as deleted and create new” transactional manner, so any editing of cell data marks it as “unpublished” again.
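
This edit pattern can be sketched as follows, under a simplified, hypothetical data model (the class and field names are ours, not the actual LingvoDoc schema):

# "Mark as deleted and create new": edits never mutate stored data in place.
from dataclasses import dataclass

@dataclass
class Entity:
    client_id: int           # part of the composite primary key
    object_id: int           # part of the composite primary key
    content: str
    published: bool = False  # every new version starts unpublished
    marked_for_deletion: bool = False

def edit_entity(old: Entity, new_content: str, next_object_id: int) -> Entity:
    """Mark the old version as deleted and create a fresh unpublished one."""
    old.marked_for_deletion = True
    return Entity(client_id=old.client_id,
                  object_id=next_object_id,
                  content=new_content)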

Fig. 7.

Table representation of a perspective of the Ob Mansi dictionary.

Every dictionary and perspective has a fine-grained list of permissions for allowed actions. Users that have publishing-related permissions for a perspective can change the state of each cell from unpublished to published and vice versa. There are also special permissions related to the state of the whole dictionary. Each dictionary and perspective can be in one of the following states: “WiP” (work in progress), “Hidden,” or “Published.” Published dictionaries become visible on the main page in the language tree.

As shown in Fig. 8, each dictionary can have additional metadata: a list of authors in free form, tags, linked files (pdf, docx, and xlsx), and a location.

Fig. 8.

Dictionary metadata edit dialog.

LingvoDoc provides a number of computational features for dictionary data. A user can see contribution statistics for a selected time range, run a number of computations on sounds that have Praat markup (see below), and get suggestions about possible duplicates inside a perspective.

Every dictionary has a “Tools” menu, which includes 13 programs made for the authors of the dictionary (only 4 of them are available to all users of the system). These programs analyze language data from the phonetic, morphological, and etymological points of view. This analysis was previously performed manually by linguists; our programs allow it to be done tens and sometimes hundreds of times faster. More information about some of them can be found in the description of the Ob-Ugric languages below. It is possible to upload texts (in the .odt, .wav, and .mp4 formats, as well as video or audio annotated/glossed in the Elan program) as corpora in LingvoDoc.

A user can also store Elan corpora collections in the system. A corpus collection looks just the same as a regular dictionary, but it has some predefined columns, which can be changed if needed: sound, sound with markup, and text comment. LingvoDoc has its own web-based viewer for the Elan format for browsing corpora (Fig. 9). The engine also supports the creation of dictionaries from specifically structured Elan files.

Fig. 9.

LingvoDoc Elan/Praat viewer.

The system also has a search dialog that supports queries over both dictionaries and corpora with “or”/“and” predicates (Fig. 10). It allows users to construct complex queries with many options. A user can combine several search results on the world map to see their intersections.

Fig. 10.

LingvoDoc search dialog.

LingvoDoc provides free registration for new users (Fig. 11). To sign up, a new guest of the system just needs to click the Sign up button, provide minimal information about themselves, and wait until one of the system moderators confirms that they are a real person who was not previously banned from the system.

Fig. 11.

Main page and Sign up button.

3 LINGVODOC INTERNALS

The LingvoDoc software system provides the following features:

(1) Collaborative work on dictionaries (similar to that provided by Google Docs or GitHub).

(2) HTTP GraphQL API to allow integration with any other software.

(3) Web interface (reference client application) that uses GraphQL API.

(4) Flexible access control lists (ACLs) for collaborative editing, viewing, and publishing. Each dictionary in the system can be shared with any other user of the system and organized for read-only, read-write, and publishing purposes. Any user of the system without direct access to a particular dictionary can propose edits that can be reviewed by the dictionary editors.

(5) Multilanguage translations for dictionaries based on the same data. All dictionaries may contain translations into any language while sharing the same media, markups, transcriptions, and other data.

(6) Scalable architecture (designed to utilize cloud resources for scaling).

(7) Semi-offline clients with two-way synchronization. A user can be virtually anywhere and still synchronize their data if they want to and have an internet connection. Furthermore, a user needs an internet connection only for the first launch of the application.

(8) A possibility to make one’s own portals with data that belongs to a group of users or an organization; the system features two-way synchronization with the central system.

(9) Multitenancy. The system natively supports total access isolation among dictionary contributors: a single user can access separate dictionaries for personal use, collaborative work, and internal use, with each dictionary hosted at one’s own institution and shared with other users or institutions at will.

(10) Security. We do not know the users’ passwords; the system is designed to hold data using the most up-to-date techniques to make sure the users’ data is secure.

The sources are available under the Apache 2.0 license.

The main feature of the system is native support for semi-offline synchronization of user data. This feature is unique for this kind of system and is based on the concept of a composite primary key [9]. The main idea behind this kind of synchronization is that each user, on each login (including logins from offline client installations), acquires a unique big-integer client identifier. After that, each object created by that particular client is enumerated by a special per-client sequence. During each synchronization, the offline client acquires a new unique client identifier. Thus, each object in the system is identified by a unique combination of a client identifier and an object identifier. This technique allows us to implement the “anytime synchronization” concept: a unique object identification key space is generated for each offline application instance or each online client login.
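
The scheme can be sketched as follows in Python (all names are illustrative, not the actual implementation):

# A sketch of composite-key allocation for "anytime synchronization".
import itertools

class CentralServer:
    """Hands out a fresh, globally unique client id on every login or sync."""
    def __init__(self):
        self._next_client_id = itertools.count(1)

    def register_client(self) -> int:
        return next(self._next_client_id)

class Client:
    """Numbers its own objects locally; no coordination needed until sync."""
    def __init__(self, client_id: int):
        self.client_id = client_id
        self._next_object_id = itertools.count(1)

    def new_object_key(self) -> tuple[int, int]:
        # (client_id, object_id) is globally unique by construction, so
        # objects created offline can never collide with anyone else's.
        return (self.client_id, next(self._next_object_id))

server = CentralServer()
offline_client = Client(server.register_client())
print(offline_client.new_object_key())  # e.g. (1, 1)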

Undoubtedly, such systems have existed for a long time, although almost all of them require manual conflict resolution. Some of the most famous examples are the GitHub and GitLab projects based on the git version control system. These projects are rather easy to use if one is a single programmer without any need to synchronize their projects with anyone else’s. But even a small version conflict requires manual conflict resolution, which is a difficult process, and it works only on non-binary data. Our approach is quite similar to Couchbase DBMS conflict resolution [7] but is applied to a classic relational DBMS, and it seems likely that we invented it a little earlier, since we cannot find any occurrences of this method before the year 2015.

The second basic concept is a virtual entity that does not contain any data but serves as an anchor to be referenced by other objects. To illustrate, let us imagine that one has a certain concept from the real world that is universal for the particular dialect they are trying to describe. In the LingvoDoc system, almost every database entity has a unique ID combination in composite-key terms. Dictionaries, perspectives, columns, cells, and rows in a table are the main examples of such entities. A row in a table is an intuitive illustration of such an anchor. The data stored in table cells references its row’s composite ID as an anchor and is organized into a table by the frontend web application; internally, the data is organized in a more tree-like than table-like form. Each author can have as many versions of such anchored data as they want; the system places no limits. Multiple versions by one or many authors are represented as a list of versions for an entity, and the publisher is responsible for choosing which of them are correct. Sometimes there can be many correct versions: for example, if a narrator repeats the same word three times, there will be three correct sound files for one lexical entry, and the system will display all of them.

In data-model terms, this means that we store “version” data in denormalized relational form and combine the data server-side.
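
The server-side combination step can be sketched as follows; the row layout here is an assumption for illustration, not the real schema:

# Combining denormalized version rows under their anchor (the virtual entity).
from collections import defaultdict

# Each row: (anchor_key, author, content, published).
rows = [
    ((1, 42), "author_a", "transcription v1", True),
    ((1, 42), "author_b", "transcription v2", False),
    ((1, 42), "author_a", "word.wav", True),
]

def combine(rows, published_only=False):
    """Group all version rows under the composite key of their anchor."""
    anchors = defaultdict(list)
    for anchor_key, author, content, published in rows:
        if published_only and not published:
            continue  # guest views see approved versions only
        anchors[anchor_key].append((author, content))
    return dict(anchors)

print(combine(rows, published_only=True))
# {(1, 42): [('author_a', 'transcription v1'), ('author_a', 'word.wav')]}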

Let us imagine, for instance, that M authors have different opinions on some object (such as the translation of a particular term). The system does not limit the number of listed translations, provided each of the M authors has the corresponding rights to edit the dictionary.

To manage this multiplicity, the system provides special view modes (see the overview in the first part of the paper). The main view, of course, is the editor’s view: from there, the editors of the data can do anything they want. The second view is the publisher’s view. Using that view, the people responsible for a dictionary can approve one or more correct entities attached to the virtual anchor object. For instance, if a lexical entry has 5 versions of its transcription and 10 versions of its translation, and the owner of the dictionary thinks that only one of the transcriptions and three of the translations are correct, they can select only those and publish their choices for other researchers.

The last view is the guest (data researcher) view: it shows the data that has been uploaded by authors and approved by publishers.

4 LINGVODOC GRAPHQL API

LingvoDoc certainly offers standard web-interface access, which is of no special technological interest. Full access to the LingvoDoc system can be gained through its HTTP-based GraphQL API. GraphQL is a way of constructing APIs for web applications proposed by Facebook in 2012 and widely used nowadays.

The main features of the GraphQL approach are as follows:

(1) An API user does not see the system internals, only abstractions, which are much simpler and more intuitive.

(2) The whole system API can be introspected and observed using standard GraphQL clients, such as the Altair browser extension, among others.

(3) An API user can get exactly the data they need at the moment. Classic REST API applications return all the data for each method or use complex data selectors to limit the output; for instance, the method /person will return the name, birthdate, surname, contact number, etc. The same method implemented in the GraphQL manner allows one to request just the surnames and nothing more.

Each object in the system has a clear access method; our web interface is just a reference JavaScript client. All levels of access are available using our simple HTTP-based API. All the data in the system is accessible via the API and returned in JSON representation.
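
For illustration, such a request can be issued with any HTTP client. In the following Python sketch, the endpoint path and the queried field names are our assumptions for the example, not a documented contract:

# Querying a GraphQL endpoint over HTTP; field names are illustrative.
import requests

query = """
{
  dictionaries {
    translation   # request only the fields we need, nothing more
  }
}
"""

response = requests.post("https://lingvodoc.ispras.ru/graphql",
                         json={"query": query})
print(response.json())  # GraphQL always answers with a JSON document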

Most of the data is stored in the PostgreSQL database management system. A notable feature of the database schema is its denormalized structure with massively used SQL composite primary keys; with this trick, we effectively add our own data types to the system. Tables in dictionaries are not stored as tables inside the database but are formed dynamically by the backend API using the tables listed in Fig. 12.
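
A minimal SQLAlchemy sketch of such a composite primary key (the table and column names are illustrative, not the actual LingvoDoc schema):

# A composite (client_id, object_id) primary key declared with SQLAlchemy.
from sqlalchemy import BigInteger, Boolean, Column, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class LexicalEntry(Base):
    __tablename__ = "lexical_entry"
    client_id = Column(BigInteger, primary_key=True)  # who created the object
    object_id = Column(BigInteger, primary_key=True)  # per-client sequence
    marked_for_deletion = Column(Boolean, default=False)

engine = create_engine("sqlite://")  # in-memory stand-in for PostgreSQL
Base.metadata.create_all(engine)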

Fig. 12.

LingvoDoc database scheme.

All the data that is uploaded by users as files is stored in a POSIX-compliant filesystem.

The backend part of LingvoDoc uses Pyramid as the web framework, Graphene for building the GraphQL API, Celery for task distribution, and SQLAlchemy for accessing the PostgreSQL database management system.

The frontend part of LingvoDoc is a classic web application built on the React framework and the Apollo GraphQL client.

Figure 13 shows the scheme of interaction.

Fig. 13.

LingvoDoc interaction scheme.

5 OB-UGRIC LANGUAGES ANALYSIS

At the beginning of the 20th century, the Ob-Ugrians (the speakers of the Khanty and Mansi languages) still occupied a vast territory extending from the upper reaches of the Pechora River, in the northern Urals, to the Yugan, Vasyugan, and Vakh rivers in the Tomsk oblast (a span of about 3000 km from northwest to southeast). It is not surprising that related languages distributed over such a large area show significant dialectal differences. The Khanty and Mansi languages are each divided into four dialect groups, and there is no mutual understanding between the speakers of these groups. At the beginning of the 20th century, each of these groups comprised several dialects with significantly different morphological and phonetic systems.

Sadly, this situation is now changing extremely quickly. Some dialect groups have already become extinct: the last speakers of Southern and Eastern Mansi died in the middle of the 20th century, as did the last speakers of Southern Khanty. Some Khanty dialects, like Nizyam (a transitional group between Southern and Northern Khanty) and Salym (a transitional group between Western and Eastern Khanty), had been considered dead, but field expeditions conducted by researchers within our project have found a few remaining speakers.

The problem with the study of the Ob-Ugric dialects lies in the fact that many of them lack complete descriptions of their grammar or dictionaries; the existing descriptions do not follow any single standard and are difficult to access. For example, Western descriptions are mainly made in the Latin Finno-Ugric transcription using many additional characters, but it remains questionable whether these characters are meaningful. Furthermore, these descriptions mainly use as their source material the data collected in the 19th century, which may have varying degrees of accuracy (see, for example, [10], for a comparison of the transcription of Mansi speech by speakers from the same settlements in the works of B. Munkácsi and A. Kannisto, in which there are many contradictions that were carried over into subsequent studies). The Russian and Soviet descriptions, on the other hand, use a transcription based on the Cyrillic alphabet, in which there are practically no additional letters. An adequate translation from one transcription to another is currently impossible without an experimental phonetic analysis of the speech of the native speakers themselves.

In the last few years, linguists from different countries, realizing the critical condition of the Ob-Ugric languages, have undertaken important work to record and study them. The two largest of these projects are as follows:

• EuroBABEL Ob-Ugric Languages: an Ob-Ugric database of analyzed text corpora and dictionaries for less-described Ob-Ugric dialects, led by E. Skribnik, at http://www.babel.gwi.uni-muenchen.de/; this site also provides detailed links to numerous resources on Ob-Ugric languages and ethnography. Field data on the Kazym, Surgut, and Yugan Khanty dialects, along with data on the Sos’va Mansi dialect, was collected within the scope of this project, as well as glossed texts of the northern, western, and eastern Mansi dialects and the northwestern and eastern Khanty dialects.

• Multimedia documentation of the endangered Vasyugan and Alexandrovo Khanty dialects of the Tomsk oblast of Siberia, led by A. Filchenko, at http://www.policy.hu/filtchenko/FTG%20ELDP%20project/audio.htm. This project collects and analyzes Eastern Khanty texts.

Although such important projects do exist, before our work started there had been no publicly accessible audio data on the Eastern Mansi dialect, the Ob sub-dialect of Northern Mansi, or the Nizyam and Salym transitional dialects. Furthermore, numerous Khanty and Mansi texts created in Russia in the 19th century had never been analyzed. Our project has uncovered dozens of books in the archives and libraries of St. Petersburg and Finland, including gospels, liturgical texts, and various dictionaries of partly disappeared dialects of Khanty and Mansi.

Since 2012, our team has been conducting research to identify accent systems with moving stress in the Ob-Ugric languages. We have organized a number of expeditions into some remote regions of western Siberia and, with the help of local administration, found numerous Khanty and Mansi speakers.

In 2012, S.V. Onina led an expedition to the last remaining Khanty speakers who live in the woods along the river Nazim; the expedition identified them as speakers of the Nizyam dialect, previously considered extinct. Field work with native speakers of the Yukonda dialect of Eastern Mansi also took place in 2013, with material collected by M.K. Amelina in Shugur village in the Kondinsk area, Khanty-Mansi Autonomous Okrug (KhMAO). I.A. Stenin conducted field work with speakers of the Ob Mansi dialect living in two villages of the Oktyabrsky region of KhMAO, Nizhnie Narynkary and Peregrebnoe.

A major component of our project, in addition to field collection of the material on the Ob-Ugric languages, involves analyzing that data using Praat phonetic software and identifying the etymological connections between the various dialectal materials. It is now possible to do this work online at our project website: http://lingvodoc.ispras.ru/ (for more technical information, see the first part of this article above).

LingvoDoc allows any researcher who possesses field audio recordings to create online multimedia dictionaries. Not only do these dictionaries unite phonetic, dialectal, and etymological components, but they also allow the researcher to connect each word entry to a corresponding phonetic wordform recording processed with Praat software. Further work with uploaded words is also supported by this program.

Such software is indispensable for researchers working with endangered languages. Standard dictionaries only provide word transcriptions for extinct languages and dialects, and there is often no way to determine how accurate a transcription is. It should be noted that mistakes in transcriptions occur quite often. An illustrative example comes from the Mansi dictionary of Munkácsi and Kálmán (1986), which on average lists 2 to 3 transcriptional variants for certain wordforms. It is no longer possible to determine which variant is correct, as almost all eastern and western Mansi dialects are extinct (only one fluent native speaker of the East dialect remains). The program we are piloting at LingvoDoc will offer both scholars and future Ob-Ugric people the opportunity to hear pronunciations of words in these dialects long after the last speakers are gone (an eventuality we sadly anticipate happening within the next ten or twenty years for, e.g., East Mansi, whose last surviving native speaker is over 80 years old). This project will allow researchers to verify the transcriptions in Munkácsi and Kálmán (1986) by comparing them with the audio files in Praat.

The fact that every user of the dictionary will be able not only to view fixed phonetic images processed in Praat, but also to work directly with the software to verify optimal processing, will dramatically increase the validity of the achieved results and improve worldwide communication among researchers studying endangered languages. The availability of both the dictionaries and the software online means that suggestions to increase the accuracy of data processing can be easily communicated and considered via the hotline.

LingvoDoc currently hosts 18 dictionaries of Mansi dialects and 32 dictionaries of Khanty dialects. Each of the online Ob-Ugric audio dictionaries comprises about 600–1000 lexemes listed with paradigmatic forms. The content of each entry is as follows:

(1) The initial form of the word, presented in the following way: (a) the dictionary form (in contemporary orthography); (b) a phonological or phonetic transcription of the word; (c) an audio file containing pronunciations of the word; (d) an image of the audio file processed using the Praat phonetic software, with all the main parameters reflected (intensity, duration, frequency, and tone). Note that it is possible to proceed from this image to the Praat software and independently analyze the wordform.

(2) Inflectional wordforms attached to every initial form (a full paradigm in some cases). Every paradigmatic form and pronunciation is presented in the same manner as the initial wordform.

(3) Links, where possible, from every initial word to etymological cognates of the lexeme in other dictionaries created by a user or a group of users who have agreed to allow public access to their dictionaries. Pressing the “Etymology” button yields a list of etymological cognates of the word, listed in the order of their relationship proximity, with more closely related terms listed first (for example, etymological cognates from other dialects of the same language), followed by more distant cognates. Thus, for East Mansi, the first words listed will be forms from other Mansi dialects (Sos’va, Middle Ob, Pelym) and then from Khanty dialects (Vakh, Vasyugan, Alexandrovo, Nizyam, Salym). In the future, when dictionaries of more Finno-Ugric, Turkic, and other languages have been created and hosted using this software, words in these dictionaries will also be linked to East Mansi words.

The “Tools” option in every dictionary then opens the list of programs for analyzing the material and checking its correctness.

Results of spectrogram analysis. LingvoDoc supports the compilation of vowel phonologies for dictionaries with sound recordings and markup data. The phonology for a particular dictionary can be compiled by selecting the Tools > Phonology menu option in the view of this dictionary in the LingvoDoc web interface.

Given a dictionary with a set of paired sound recordings and markup files (sound files in uncompressed WAV format, markup files in Praat TextGrid format), vowel phonology compilation proceeds as follows:

(1) For each sound recording/markup pair, a set of vowel sounds is selected for analysis.

A user can select either all vowel sounds, or only the longest vowel sound and the vowel sound with the highest intensity in each sound recording. The method of selection is the same for all sound recording/markup pairs.

(2) First, second and third formants are computed for each distinct vowel sound found during the first step.

The data on each vowel sound includes its length, its length relative to average length of all sounds marked up in its sound recording, and its intensity (Fig. 14).

Fig. 14.

Example of the vowel sound data extracted from a dictionary.

(3) Formant data is grouped by vowel; each vowel’s first/second formant data is modeled as a bivariate normal distribution, and then a vowel formant plot is created from the joint data on all the vowels.

(4) After vowel phonology compilation is completed, the computed formant dataset and the constructed formant plot become available for download as a Microsoft Excel .xlsx file.

The algorithms used for the computation of sound intensity and formants mimic the corresponding algorithms used by Praat [8]; see the sections “Sound: To Intensity…” and “Sound: To Formant (burg)…” of the Praat manual, accessible from the Praat website.
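
For readers who wish to reproduce such measurements outside LingvoDoc, the parselmouth Python library exposes these same Praat routines; the following sketch is illustrative only (“word.wav” is a placeholder file name) and is not the LingvoDoc implementation:

# Formants via Praat's Burg algorithm, through the parselmouth library.
import parselmouth

sound = parselmouth.Sound("word.wav")
formant = sound.to_formant_burg()     # Praat's "Sound: To Formant (burg)..."
intensity = sound.to_intensity()      # Praat's "Sound: To Intensity..."

t = 0.5 * sound.get_total_duration()  # probe the middle of the recording
f1 = formant.get_value_at_time(1, t)  # first formant, Hz
f2 = formant.get_value_at_time(2, t)  # second formant, Hz
print(f"F1 = {f1:.0f} Hz, F2 = {f2:.0f} Hz")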

We have created special software (Tools > Phonology; it can be used in every dictionary that has spectrograms) that collects all the physical characteristics of every phoneme, for example “a,” in a given Ob-Ugric language or dialect by processing spectrograms marked up in Praat. The output is a table listing all the physical characteristics of every phoneme across the array of lexemes of a given dialect. Sound formants are computed as follows: the sound is partitioned into small chunks, the formant values of each chunk are estimated from its LPC coefficients computed via Burg’s method, and the formants of the sound are then computed as the averages of the formant values of its chunks. The mean and covariance matrix of the bivariate normal distribution of the first/second formants of each vowel are estimated using maximum likelihood estimation, i.e., as the sample mean and sample covariance matrix of the vowel’s formant data set.

The formant plot constructed from the formant vector data for each vowel contains a scatterplot of the vowel’s formant vectors, the mean formant vector, and a standard deviation ellipse computed from the covariance matrix (Fig. 15).
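
This statistical step can be sketched in a few lines of Python; the formant values below are synthetic, for illustration only:

# Sample mean, sample covariance, and a one-standard-deviation ellipse.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

f1_f2 = np.array([[700, 1200], [680, 1150], [720, 1300], [650, 1250]])  # Hz

mean = f1_f2.mean(axis=0)          # maximum-likelihood estimate of the mean
cov = np.cov(f1_f2, rowvar=False)  # sample covariance matrix

# Ellipse axes follow the eigenvectors of the covariance matrix; the
# half-axis lengths are one standard deviation (sqrt of the eigenvalues).
eigvals, eigvecs = np.linalg.eigh(cov)
angle = np.degrees(np.arctan2(eigvecs[1, -1], eigvecs[0, -1]))

fig, ax = plt.subplots()
ax.scatter(f1_f2[:, 0], f1_f2[:, 1], label="formant vectors")
ax.scatter(*mean, marker="x", label="mean")
ax.add_patch(Ellipse(mean, 2 * np.sqrt(eigvals[1]), 2 * np.sqrt(eigvals[0]),
                     angle=angle, fill=False, label="1-sigma ellipse"))
ax.set_xlabel("F1, Hz")
ax.set_ylabel("F2, Hz")
ax.legend()
plt.show()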

Fig. 15.

Example of the analysis of a correct transcription of the vowel sound data extracted from a dictionary of the Vakh Khanty dialect: http://lingvodoc.ispras.ru/dictionary/1230/570/perspective/1230/571/view.

As shown in Fig. 16, if the transcription is incorrect, the clouds will intersect quite significantly.

Fig. 16.

Example of the analysis of an incorrect transcription of the vowel sound data extracted from a dictionary of the Yukonda Mansi dialect: http://lingvodoc.ispras.ru/dictionary/1230/570/perspective/1230/571/view.

This information makes it possible to assess the correctness of the phonetic transcription used for each dialect on a new level and to refine it substantially. Then, having revised the phonetic transcription, we plan to proceed to the next stage of our project.

LingvoDoc etymological analysis. In the 19th century, as a result of the activity of the Translation Committee of the Russian Bible Society and some outstanding researchers, the first Cyrillic books and dictionaries for all groups of the Ob-Ugric languages were created. A preliminary philological analysis of these written records conducted by our group shows that for most languages, material from several dialects was used.

(1) Phonetic correspondences

For every language and every group of languages with common etymologies, we then created (Tools > Cognate analysis) systems of etymological correspondences of all phonemes between manuscripts and contemporary dialects for every chosen language (Figs. 17, 18); furthermore, one can verify, modify, or propose new phonetic laws accordingly.

Fig. 17.

Example of the automatic correspondences of the first syllable vowels in Mansi dialects.

Fig. 18.

Example of the result of Cognate analysis: 2D-graph of phonetic proximity of the Mansi dialects.

(2) Dialect classification

Taking into account that at the previous stages of the project the dictionaries of the manuscripts and of the living dialects have been connected lexeme-by-lexeme, and that Praat spectrogram markup has been performed for the audio dictionaries, we created a special program (Tools > Cognate analysis) that gives, for every grapheme in a manuscript, the corresponding phonemes of all contemporary dialects with a full list of their physical characteristics (Figs. 18, 19). We then plan to use the program to process the dictionaries of the manuscripts and living dialects with the Wagner-Fischer algorithm and compute the Levenshtein distances between them. This will allow us to determine the degree of similarity between a manuscript and all the considered dialects fully automatically; see the sketch below.
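
For reference, the Wagner-Fischer algorithm computes the Levenshtein distance by dynamic programming; a minimal sketch (the input strings are illustrative) follows:

# Wagner-Fischer dynamic programming for the Levenshtein distance.
def levenshtein(a: str, b: str) -> int:
    # dist[j] holds the distance between a[:i] and b[:j] for the current i.
    dist = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        prev_diag, dist[0] = dist[0], i
        for j, cb in enumerate(b, start=1):
            prev_diag, dist[j] = dist[j], min(
                dist[j] + 1,             # deletion
                dist[j - 1] + 1,         # insertion
                prev_diag + (ca != cb))  # substitution (free if chars match)
    return dist[-1]

print(levenshtein("akan", "okan"))  # -> 1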

Fig. 19.

Example of the result of Cognate analysis: 3D-graph of phonetic proximity of the Mansi dialects.

(3) Reconstruction of the proto-forms

As shown in Fig. 20, based on the sound correspondences, which were created automatically, we can devise a system of correspondences between languages or dialects and reconstruct the proto-phonemes and proto-forms (Tools > Cognate reconstruction or Multi-language cognate reconstruction).

Fig. 20.

The Proto-Mansi sound correspondences based on a comparison of 16 dictionaries.

As a result of this work, we obtain precise numerical values that determine the degree of closeness of each of the languages of the first Cyrillic manuscripts to the contemporary dialect groups of the Ob-Ugric languages. We also measure the degree of closeness of all the analyzed idioms to each other.

The precision of the data in audio dictionaries is higher than in traditional dictionaries by several orders of magnitude, since we analyze not transcriptions relying on researchers’ hearing but high-quality digital audio recordings (in .wav format). The physical parameters of sounds are determined precisely and transformed into transcriptions by means of experimental phonetics and computational linguistics.

Achieving this level of verification and accuracy is becoming possible now thanks to the creation of the LingvoDoc virtual laboratory, which provides functions for phonetic and etymological analysis. Without this resource, it would not have been possible to achieve these results on the Ob-Ugric languages.

We have also created computer software that can be used in future research. This software will help analyze audio dictionaries of living languages and concordances of written sources to determine (1) the accuracy of the transcription for living dialects and (2) the degree of closeness between idioms.

The anticipated results will be significant both for linguistics in general and for the history of Russia in particular, since we will discover new data about linguistic change and about the places where certain languages were spoken about 250 years ago.

6 CONCLUSIONS

This paper presents the LingvoDoc system for collaborative language documentation, with a focus on dictionaries with inter-dictionary connections and support for flexible dictionary structure. The system provides a web interface with its own read-only viewer for Elan and Praat markup and computational features on these markups. The backend of the system exposes a GraphQL HTTP API, which can be introspected with standard tools and makes it possible to use the data in the system for external computational analysis. The project is open source under the permissive Apache 2.0 license. The central system is used by more than 300 active registered users.

Furthermore, the paper discusses the analysis of the Ob-Ugric languages as an example of data collection procedures and as a real application of the system to a researcher’s needs.

Our future plans for the LingvoDoc project are: improving the institution-local version of the system, expanding the community for computational extensions, improving the accuracy of transcriptions for living dialects and the measures of closeness between idioms, and user interface improvements.