1 Introduction

Text is a focal point of study in content-oriented social sciences research and communication studies, serving as one of its primary research subjects. In addition, researchers commonly employ automatic content analysis as a research method to explore extensive volumes of textual data. From a historical perspective, the field of computational social science (CSS) has emerged over the past decade, taking advantage of the vast amounts of digital data now accessible on the internet, which includes diverse sources like newspapers, parliamentary protocols, social media platforms, administrative records, and historical archives. The term itself was coined by Lazer et al. in 2009, who identified CSS as a broad field with a focus on network analysis and the aim of better understanding both the structure and content of relationships. More recently, in a review of computational social science and sociology, Edelmann et al. (2020, p. 64) demonstrated that we can observe a rapid growth of new techniques and tools since 2010 that help to analyze these large, complex datasets; in particular, various forms of automated text analysis for the ever-increasing amount of textual data (for example, see the textbooks of Ignatow and Mihalcea 2016 and Macanovic 2022; for an example of such a study, see Wiedemann 2016). Similarly, an increasing number of resources have been developed to support training and education in these emerging methodological paradigms (for instance, see the Language Technology and Data Analysis Laboratory at the University of Queensland, https://ladal.edu.au; Wiedemann and Niekler 2017).

While more and more researchers in the (digital) humanities and (computational) social sciences continue to embrace programming skills, it is worth noting that the development of sophisticated research software remains a complex undertaking, often necessitating the expertise of trained software engineers. As a consequence, applied research mostly relies on ready-to-use research software that is typically designed for a specific task or method, such as creating concordances and calculating word frequencies (for example, https://www.laurenceanthony.net/software/antconc/; on which, see Anthony 2005) or for creating topic models (for example, https://dariah-de.github.io/TopicsExplorer/; on which, see Simmler et al. 2019). While such tools are typically easy to use, they also impose limitations on researchers, confining their study designs to the predefined boundaries of the software. Consequently, there is limited flexibility to adapt the methods to accommodate more intricate research designs.

As an alternative to such immediately applicable yet relatively static and inflexible tools, one may instead advocate for the use of more versatile software packages that implement fundamental text mining methods. As Grimmer and Stewart (2013, pp. 267–297) noted, clustering methods and supervised or unsupervised methods for text classification, often based on prior human hand coding of documents into a predetermined set of categories, are key elements of computer-supported systematic analysis of large-scale text collections. Most of the methodological requirements that Grimmer and Stewart mentioned are in fact implemented in tools and open source software packages for Natural Language Processing (NLP) that have been available for the last decade, such as StanfordNLP (see Manning et al. 2014), OpenNLP (http://opennlp.apache.org/), NLTK (see Bird et al. 2009), Gensim (see Řehůřek and Sojka 2010), SpaCy (see Honnibal et al. 2022), or Quanteda (see Benoit et al. 2018).

However, employing these software packages for intricate workflows can pose technical challenges, demanding a profound understanding of the frameworks and their associated application programming interfaces (APIs). The level of expertise required might act as a deterrent for many researchers. In order to make such technically demanding software frameworks more accessible, several attempts have been made to integrate rudimentary NLP pipelines into ready-to-use tools, such as Clarin Weblicht (see Heyer and Böhlke 2021; Hinrichs et al. 2010), Triple (see Dumouchel et al. 2020), or Textgrid (see Neuroth et al. 2015). Nevertheless, this approach leads to similar limitations as those of the static tools mentioned earlier, as researchers can only recombine a predetermined set of processes within the tool, without the ability to fundamentally modify or expand upon them.

It should also be noted that there are numerous tools available for qualitative data analysis (QDA), such as VERBI software’s MAXQDA (https://maxqda.com/), ATLAS.ti (https://atlasti.com), or NVivo (see Richards 2021). These tools are specifically devised to support QDA research design, involving the manual annotation of texts based on a provided codebook. These software solutions generally lack the flexibility to accommodate research designs that deviate from this traditional approach, such as incorporating automatic machine learning methods for text classification during the coding process.

Due to the growing need and challenges identified, we have designed the interactive Leipzig Corpus Miner (iLCM; also see Niekler et al. 2014, 2018), which is the result of the development of an integrated research environment for the analysis of text data (https://ilcm.informatik.uni-leipzig.de/). The key features of the iLCM compared to existing software tools for computer-assisted text analysis are its flexibility and scalability. Most importantly, the tool’s functionality offers commonly needed methods for automatic processing of text—such as preprocessing, standard text analysis, and visualization—which would be time consuming without a ready-to-use software tool. In order to also provide more methodological flexibility, the iLCM is not tied to one specific class of research question, but can easily be ported to other applications. Users can initially explore the tool’s functions through an easy-to-use graphical user interface (GUI) and customize or expand specific features as needed.

The iLCM’s extensibility is made possible because it is built entirely using the open-source environment R (https://www.r-project.org/). This means that all the tool’s functions are based on modular R scripts. Because these are “hidden” behind an RShiny-based (https://shiny.rstudio.com/) GUI, researchers can choose whether or not they wish to modify the predefined scripts, doing so based on their needs or abilities. To facilitate this customization, the tool provides an environment within its GUI where the integrated R scripts can be edited and saved as custom scripts. When initiating analysis, researchers can then utilize these custom scripts instead of the standard ones. This customizability and extensibility also allow researchers to integrate a wide range of relevant text mining methods that are already available via R implementations into the tool. As a result, the functionality of the software offers more than existing standalone tools, inasmuch as it brings a range of functionality together through the one GUI.

The range of functions offered by the iLCM includes, among others, retrieval and management of document collections, the analysis of word frequency, time series analysis, topic models, and the automatic coding and annotation of categories (that is, supervised text classification), all delivered in a “Software as a Service” (SaaS) architecture. Its built-in ability to produce custom scripts and to export results and script-based adaptations of the available analyses circumvents some restrictions of other tools for text-oriented analysis methodologies. In short, the iLCM research environment addresses 1) the requirements for quantitative analysis of large qualitative data using text mining methods and 2) the requirements for reproducibility, intersubjectivity, and validity of data-driven research design in the social sciences.

In this article, we will focus on the iLCM’s capabilities and adaptability, extensibility, and data exchange with other tools from the field of empirical content analysis. We will first present the features of the iLCM and showcase individual examples. In addition to providing an overview of the tool’s functionality, in Section 3, we will showcase the practical use of the iLCM in a real research project within the field of communication studies. This case study will exemplify the utilization of iLCM across all stages of the research process, offering a comprehensive illustration of its capabilities. The text is thus intended to help the reader learn about the methods implemented and to generate an understanding of how the different functions can contribute to different research paradigms.

2 Overview of the interactive Leipzig Corpus Miner (iLCM)

The iLCM provides a variety of different functionalities that are useful for dealing with large text corpora. First and foremost, it serves as a text mining infrastructure specifically tailored to content-based research tasks. It is based on the Leipzig Corpus Miner, which was developed as part of an interdisciplinary project titled “Postdemokratie und Neoliberalismus” (ePol; Wiedemann et al. 2013). The aim of the ePol project was to analyze over 3.5 million news items from 60 years of German newspaper history. The iLCM, on the other hand, has a broader focus. It was initially based on the idea of combining quantitative (e.g., exploratory search or automatic classification) and qualitative approaches (e.g., through manual annotation of textual documents on the basis of a codebook) in a single tool to support extensive mixed method approaches. To this end, a number of different options have been implemented for the analysis of textual data.

In this section, we describe the tool’s capabilities in detail and contextualize their usage for different research tasks in content analysis. We describe the main functions of the iLCM, aligning them with a typical research workflow in social sciences and communication studies (see Fig. 1). The workflow graphic illustrates that the iLCM comes with a wide variety of functions and methods to support researchers throughout the research process. We will briefly summarize these functions and, when appropriate, provide methodological reflections and usage examples.

Fig. 1. Stages of a typical research workflow in social sciences and communication studies aligned with the functionality that the iLCM provides for each of these stages

2.1 Installation

As the iLCM consists of a number of different modules, setting it up can be challenging on different operating systems. To ensure an easy setup, we decided to utilize virtualization. This means the iLCM comes as an image, which was defined in advance by the developers as a self-contained environment that includes all the necessary libraries and dependencies. The end user is therefore not required to manually install all the necessary software packages, but merely needs to install the software that executes such images, regardless of the operating system in use.

Specifically, we utilized the Docker framework to develop an image that can be downloaded and launched with a single command (https://hub.docker.com/r/ckahmann/ilcm/tags). Although setting up the iLCM on desktop machines in this manner is convenient, it is important to acknowledge that not all machines are equally suitable, and the availability of computing resources significantly impacts the handling of large and complex data. Consequently, for projects reliant on extensive datasets, we highly recommend running the iLCM on suitable server environments.

2.2 Data import

The iLCM’s flexible import and export interface allows users to enter data in structured (CSV, Excel, Rotterdam Exchange Format Initiative [REFI]) as well as unstructured (PDF, DOCX, TXT) formats. No single standard format is required. Instead, the tool interactively maps the existing data structure, including the given metadata, to the internally used data format. An interface is available in the tool to assign which information is mapped to the internal data fields of the iLCM.

During the import process, it is essential to specify at least the date and title of the document. However, the iLCM allows for the inclusion of up to nine additional metadata fields for more comprehensive document organization. In cases where the date or title is unavailable, the iLCM provides options to set these automatically or with the assistance of an R script. This ensures a smooth and efficient import process for various types of documents.
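The mapping step described above can be pictured with a minimal Python sketch. It is an illustration only and not the iLCM’s own import code, which is written in R; the function name `map_rows`, the internal field names, and the fallback values for a missing title or date are assumptions made for this example.

```python
import csv
import io
from datetime import date

MAX_EXTRA_FIELDS = 9  # the text mentions up to nine additional metadata fields
CORE_FIELDS = ("title", "date", "text")


def map_rows(csv_text, mapping, default_date=None):
    """Map source columns onto a hypothetical internal record layout.

    `mapping` maps internal field names to source column names. Fields other
    than title, date and text are stored as additional metadata.
    """
    extra = [name for name in mapping if name not in CORE_FIELDS]
    if len(extra) > MAX_EXTRA_FIELDS:
        raise ValueError("at most nine additional metadata fields are allowed")
    fallback_date = (default_date or date.today()).isoformat()
    records = []
    for number, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=1):
        # A missing or empty title/date is filled in automatically (assumed rule).
        title = row.get(mapping.get("title", "")) or f"document_{number}"
        day = row.get(mapping.get("date", "")) or fallback_date
        records.append({
            "title": title,
            "date": day,
            "text": row[mapping["text"]],
            "metadata": {name: row[mapping[name]] for name in extra},
        })
    return records


# Toy input invented for this example: the second row has no date.
sample = "day,country,speech\n2000-09-12,Country A,First text.\n,Country B,Second text.\n"
records = map_rows(
    sample,
    {"date": "day", "text": "speech", "country": "country"},
    default_date=date(1977, 1, 1),
)
print(records[1]["title"], records[1]["date"])
```

Keeping the mapping as plain data (a dictionary from internal to source names) mirrors the idea that no single input format is required; only the mapping changes from corpus to corpus.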

2.3 Data processing

Within the iLCM environment, there are various data processing mechanisms available, encompassing linguistic preprocessing as well as a deduplication process for the text data. The linguistic preprocessing of the text data is carried out automatically with the SpaCy library when importing new data into the tool. This includes sentence segmentation, tokenization, and lemmatization as well as part of speech (POS) tagging, syntactic parsing, and named entity recognition (NER). Large parts of the implemented functions of the iLCM rely on such pre-processed linguistic information. Besides the need to be able to perform sentence- or word-level analysis, it is, for example, beneficial to be able to provide meta information (POS, named entities, grammatical categories) for each word during the annotation phase. The results of the preprocessing are stored in the database so that they are available for later analysis without having to process the text data repeatedly.

SpaCy enables the use of pre-trained models for various languages. By default, the iLCM includes models for English and German. However, if additional language models are required, users can easily add these with just a few clicks. The language options interface in the iLCM displays the models that are currently installed and provides a straightforward installation process for additional language models.

In the context of content analysis and the application of text-mining methods, deduplication plays a crucial role in ensuring the reliability and significance of results by establishing a duplicate-free corpus. As Benko has argued (2013, p. 27), the increased availability of web corpora, particularly those compiled partially or fully through automated methods, has underscored the necessity of document deduplication in content analysis. Duplicates in text can greatly distort procedures, such as co-occurrence analysis, frequency analysis, and topic modeling; the presence of duplicate words can have a substantial impact on accuracy and validity. Depending on the extent of duplicate occurrences in the text, the entire analysis of the corpus using automatic language processing methods may become unreliable or even obsolete without proper document deduplication.
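To make the idea of document deduplication concrete, the following Python sketch removes near-duplicates by comparing sets of word trigrams with the Jaccard coefficient. This is one common technique chosen for illustration; the text does not specify which procedure the iLCM uses, and the function names, the trigram size, and the threshold of 0.8 are assumptions.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (shingles) of a text."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(documents, threshold=0.8, n=3):
    """Keep a document only if it is not too similar to an already kept one."""
    kept, kept_shingles = [], []
    for doc in documents:
        current = shingles(doc, n)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(current)
    return kept


# Toy corpus invented for this example.
corpus = [
    "the cat sat on the mat today",
    "the cat sat on the mat today !",
    "a completely different sentence about climate policy",
]
unique_docs = deduplicate(corpus)
print(len(unique_docs))
```

The pairwise comparison is quadratic in the number of documents, so for large corpora an approximate scheme such as MinHash would usually be preferred; the sketch favors readability over scale.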

2.4 Data analysis

Upon importing and processing the text data, the iLCM enables the application of various text mining and machine learning methods for analysis. The following sections provide a concise overview of the different functions.

2.4.1 Full text search

After importing the corpus into the tool, the iLCM offers functions for document search and the creation of sub-corpora known as collections. Often, only specific portions of a corpus are relevant to a given research question. Thus, to facilitate filtering, the iLCM allows for real-time search queries that combine complex keyword searches with existing metadata conditions. This enables the creation of customized searches based on specific criteria. Additionally, the iLCM provides options for sampling documents into collections, which is particularly useful when working with extensive amounts of text data. For instance, if the analysis of a given corpus were to focus only on those texts that are somehow related to the war in Ukraine, one can narrow down the corpus using specific search terms, as shown in Fig. 2.

Fig. 2. Search interface of the iLCM, in which complex search queries can be built by using operators such as “*” (Wildcards), “AND”, “OR”, and “NOT”
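The combination of keyword operators and metadata conditions described in this section can be sketched in a few lines of Python. The sketch builds queries from small functions instead of parsing a query string; the names `term`, `AND`, `OR`, `NOT`, and `search`, as well as the document layout, are hypothetical and do not reflect how the iLCM implements its search.

```python
import re
from fnmatch import fnmatchcase


def term(pattern):
    """Match if any token fits the pattern; '*' acts as a wildcard."""
    pattern = pattern.lower()
    return lambda tokens: any(fnmatchcase(token, pattern) for token in tokens)


def AND(*queries):
    return lambda tokens: all(query(tokens) for query in queries)


def OR(*queries):
    return lambda tokens: any(query(tokens) for query in queries)


def NOT(query):
    return lambda tokens: not query(tokens)


def search(documents, query, **metadata):
    """Return documents that satisfy the query and all metadata conditions."""
    hits = []
    for doc in documents:
        tokens = re.findall(r"\w+", doc["text"].lower())
        if query(tokens) and all(doc.get(k) == v for k, v in metadata.items()):
            hits.append(doc)
    return hits


# Toy documents invented for this example.
documents = [
    {"id": 1, "text": "Ukraine war dominates the summit", "year": 2022},
    {"id": 2, "text": "Ukrainian harvest exports resume", "year": 2022},
    {"id": 3, "text": "War in Ukraine, a retrospective", "year": 2014},
]
query = AND(term("ukrain*"), NOT(term("harvest")))
collection = search(documents, query, year=2022)
print([doc["id"] for doc in collection])
```

The resulting list plays the role of a collection: a sub-corpus that later analysis steps can operate on.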

2.4.2 Diachronic word frequencies

One popular approach to analyzing text data is studying diachronic word frequencies (see, for example, Michel et al. 2011). This involves tracking the frequency curve of specific words or word groups over time and comparing them. This perspective allows researchers to identify the most frequent words within different temporal aggregation ranges. Frequency data can be measured at the word and document levels. Researchers can analyze absolute frequencies, based on time points, or normalized frequencies, based on document quantities at different time points.
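The distinction between absolute and normalized frequencies can be illustrated with a short Python sketch. It aggregates by month and normalizes by the number of documents in each month; the function name and the input layout (pairs of ISO date and text) are assumptions for this example, not the iLCM’s interface.

```python
import re
from collections import Counter


def diachronic_frequencies(documents, word):
    """Return {month: (absolute count, count per document)} for one word.

    `documents` is an iterable of (iso_date, text) pairs, e.g. ("2017-01-03", "...").
    """
    word = word.lower()
    absolute = Counter()
    documents_per_month = Counter()
    for iso_date, text in documents:
        month = iso_date[:7]  # "YYYY-MM"
        documents_per_month[month] += 1
        absolute[month] += re.findall(r"\w+", text.lower()).count(word)
    return {
        month: (absolute[month], absolute[month] / documents_per_month[month])
        for month in sorted(documents_per_month)
    }


# Toy data invented for this example.
articles = [
    ("2017-01-03", "spd and cdu"),
    ("2017-01-20", "cdu cdu"),
    ("2017-02-01", "spd"),
]
frequencies = diachronic_frequencies(articles, "cdu")
print(frequencies)
```

Normalizing by document counts matters whenever the number of documents varies strongly between time slices, because raw counts would then mostly reflect corpus size.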

The following brief example further illustrates the iLCM’s capabilities with regard to this approach. Here, diachronic word frequency is applied to analyze the mentions of political parties over time in texts from the German-language newspaper taz. This was part of an investigation into the correlation between the number of mentions and the respective parties’ poll results. Figure 3 displays the monthly number of mentions for four selected party names. This counting approach can be expanded by utilizing dictionaries, which enable the inclusion of additional synonyms or the names of top politicians associated with the parties. By doing so, the analysis goes beyond counting official party names and encompasses a broader understanding of party mentions in the text data.

Fig. 3. The visualization interface for diachronic frequency analysis in the iLCM. Here, the frequencies of appearance of various German political parties in 2017 are presented by month. The data was retrieved from the archive of the German-language newspaper taz. The designations cdu, spd, afd and csu are abbreviations for German parties that are examined here

2.4.3 Co-occurrence analysis

Another popular way to explore large text corpora involves the analysis of significant co-occurrences of words, which can be calculated with various statistical measures in the iLCM (see Biemann et al. 2022 for more details on significance measures). Results of co-occurrence analysis are displayed both by means of a Keyword in Context view of the words as well as a network visualization, making it easy to discover patterns of co-occurring concepts in a corpus. As an advanced co-occurrence feature, the iLCM also provides a measure called context volatility (see, for example, Heyer et al. 2009), which takes into account how much the context of a word (i.e., its co-occurrences) changes over the course of time (= volatility).
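As a small illustration of sentence-level co-occurrence scoring, the following Python sketch ranks the neighbours of a target word with the Dice coefficient, 2 · n(a, b) / (n(a) + n(b)), where n counts sentences. Dice is used here only because it is easy to follow; the text does not list which measures the iLCM offers, and the function name is hypothetical.

```python
import re
from collections import Counter


def dice_cooccurrences(sentences, target):
    """Rank words by their Dice coefficient with `target` at sentence level."""
    target = target.lower()
    sentence_freq = Counter()  # number of sentences containing each word
    joint_freq = Counter()     # number of sentences containing word and target
    for sentence in sentences:
        types = set(re.findall(r"\w+", sentence.lower()))
        sentence_freq.update(types)
        if target in types:
            joint_freq.update(types - {target})
    scores = {
        word: 2 * joint / (sentence_freq[target] + sentence_freq[word])
        for word, joint in joint_freq.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy sentences invented for this example.
sentences = [
    "water supply is scarce",
    "water supply projects",
    "energy supply grows",
]
ranking = dice_cooccurrences(sentences, "water")
print(ranking[:3])
```

The ranked pairs correspond to the edges that a network view would draw around the target word, with the score available as an edge weight.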

The example in Fig. 4 shows a co-occurrence network at sentence level for the word “water”. The corpus includes texts from the Nationally Determined Contributions (NDCs), which are the heart of the Paris Agreement and “embody efforts by each country to reduce national emissions and adapt to the impacts of climate change” (United Nations Framework Convention on Climate Change 2022). The representation as a network makes it possible to quickly determine the most important related words of the keyword under investigation.

Fig. 4. Network visualization for the word “water” and its statistically significant co-occurrences in the NDCs corpus

2.4.4 Topic modeling

Topic models offer an unsupervised approach for clustering documents. The method is based on the Latent Dirichlet Allocation (LDA) algorithm, first presented by Blei et al. (2003). LDA models the word compositions of documents and organizes them into coherent groups (= topics) based on word usage. In essence, the model assigns a probability distribution to each document indicating its likelihood of belonging to different topics and assigns a probability distribution to each inferred topic indicating its usage of words from the entire vocabulary. Chen et al. (2023) offer a systematic review of what can (and cannot) be done with the topic modeling method in communication studies. In addition to LDA topic modeling, the iLCM also offers Dynamic Topic Modeling (cf. Blei and Lafferty 2006) and Structural Topic Models (STM; cf. Roberts et al. 2016).

The iLCM offers multiple approaches for evaluating the quality of topic models, which is necessary for a valid application (Maier et al. 2018 give an overview of a valid methodological approach in automatic content analysis). These approaches include measures such as topic coherence and topic intrusion, which provide insights into the coherence and relevance of topics. Additionally, the tool can check topic reliability, allowing researchers to examine the consistency and stability of the identified topics.

The iLCM also provides capabilities for evaluating correlations between metadata and topic distributions. It allows for the mapping of diachronic trends in topic distributions at specific time points. For qualitative analysis, researchers can select the most relevant documents for a chosen topic, which are then displayed with key words highlighted in the corresponding color. This enables quick identification and verification of particularly relevant text passages. This approach enhances the interpretation of topics beyond simple word lists and helps to identify potential systematic errors in the data or the modeling process. To facilitate interpretation, the iLCM includes a labeling tool that allows researchers to assign uniform names to topics after an interpretation step has been conducted.

An application scenario for topic modeling could be to identify the main thematic discussion items in a set of texts and evaluate their distribution over the corpus. In the excerpt shown in Fig. 5, national climate strategies were evaluated. The texts were downloaded from different online sources and provided by the research project TRANSNORMS (see www.transnorms.eu). Topics such as “funding”, “renewable energy”, or “waste management” emerged and can be further analyzed with respect to metadata or accompanying information in the text data.

Fig. 5. The image shows the interface for the evaluation of topic model results. The distribution of topic importance for a specified period is shown. The topics were previously labeled using the available labeling tool

2.4.5 Supervised classification

One of the fundamental requirements in quantitative content analysis is the process of coding a measurable variable into identifiable categories within the texts being analyzed (see Früh 2001, p. 80; Krippendorff 2018, pp. 155–161). Thus, in addition to unsupervised clustering through topic modeling, the iLCM also offers integrated procedures for supervised classification, a machine learning technique in which documents are assigned predefined labels based on a training dataset. Researchers can create codebooks within the iLCM’s GUI, adhering to the requirements of content analysis procedures. Using the annotation interface, documents can be annotated based on these codebooks. These annotations serve as a training dataset for building a classifier.

Additionally, the iLCM supports the initiation of an Active Learning (AL) process (see Settles 2012; Schröder et al. 2022) based on an initial training set or a dictionary-based search. With the help of a classifier, new examples for AL can be generated. This approach suggests potential instances of texts for different codebook categories to the user automatically. This significantly reduces time and effort in comparison to manual qualitative annotation. AL facilitates the efficient creation of a sufficiently large training set, enabling the classifier to be applied accurately to entire document sets. The results of this classification can be further examined quantitatively, exported to other tools, or utilized in subsequent analysis, such as co-occurrence calculations based on the classification examples.
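One step of an Active Learning loop, the choice of which texts to show the annotator next, can be sketched in Python as follows. The sketch uses margin sampling, a standard uncertainty-based strategy from the AL literature (see Settles 2012); the text does not state which selection strategy the iLCM applies, so both the strategy and the names used here are assumptions.

```python
def margin(distribution):
    """Difference between the two highest class probabilities."""
    ranked = sorted(distribution.values(), reverse=True)
    runner_up = ranked[1] if len(ranked) > 1 else 0.0
    return ranked[0] - runner_up


def select_for_annotation(predictions, k=2):
    """Return the k document ids the classifier is least sure about.

    `predictions` maps a document id to {label: probability}, as produced by
    any probabilistic classifier trained on the current annotations.
    """
    return sorted(predictions, key=lambda doc_id: (margin(predictions[doc_id]), doc_id))[:k]


# Toy classifier output invented for this example.
predictions = {
    "doc_a": {"affection": 0.90, "violence": 0.10},
    "doc_b": {"affection": 0.55, "violence": 0.45},
    "doc_c": {"affection": 0.40, "violence": 0.60},
}
candidates = select_for_annotation(predictions, k=2)
print(candidates)
```

In a full loop, the annotator would confirm or correct the suggested labels, the classifier would be retrained on the enlarged training set, and the selection would be repeated until the classifier is judged good enough.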

The following example showcases the application of the iLCM’s supervised classification functions, examining the changing relationship between the portrayal of affection and violence in movies over the past four decades. The analysis draws on a dataset consisting of short descriptions of movies from the past 40 years (derived from Kaggle 2022). A set of labels with the two categories affection and violence, defined in a given codebook, was used to annotate the data (Fig. 6), and AL was employed to quickly expand the training dataset. Finally, the supervised classifier was applied to the entire set of movie descriptions; the result can be seen in Fig. 7. Based on the movie descriptions, we can observe that the concepts of affection and violence both show a rising trend from the year 2000 onward. Furthermore, we can observe a larger proportion of content reflecting the concept of affection.

Fig. 6. An illustration of the document view in the iLCM, where textual evidence can be annotated according to the categories (affection, violence) of a selected codebook

Fig. 7. View of the chronological distribution of classification results. The results are derived from applying the trained classifier to the entire set of texts in the corpus

2.4.6 Sentiment analysis

Sentiment analysis, which involves quantifying the emotional tone expressed in a text, plays a crucial role in understanding the emotionality of media content. By analyzing sentiment, we can explain various media effects and gain insights into how emotional aspects shape the presentation of topics, political campaigns, and historical events. Recognizing the influence of emotions in media content is essential for comprehending their impact and implications in shaping audience perceptions and societal narratives (see Döveling and Konijn 2021, pp. 48–66; Kühne et al. 2021, p. 128; Nabi 2019, pp. 163–178).

Sentiment dictionaries (on which, see Khoo and Johnkhan 2018; Ribeiro et al. 2016) are curated collections of words or phrases along with their associated sentiment scores. Such dictionaries serve as a reference for sentiment analysis tasks, allowing the classification of text based on the presence of positive, negative, or neutral words, and providing a basis for quantifying the sentiment expressed in a given text. By default, sentiment dictionaries for English and German are available in the iLCM; however, these can also be extended or supplemented as desired.
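Dictionary-based scoring as described above can be sketched in a few lines of Python. The tiny lexicon, the scoring rule (sum of word scores divided by the number of tokens), and the function name are assumptions made for illustration; they are not the dictionaries or the formula shipped with the iLCM.

```python
import re

# Hypothetical mini-lexicon: positive words score +1, negative words -1.
TOY_LEXICON = {"good": 1, "happy": 1, "brutal": -1, "threatening": -1}


def sentiment(text, lexicon=TOY_LEXICON):
    """Score a text and report which words carried positive or negative weight."""
    tokens = re.findall(r"\w+", text.lower())
    matched = [(token, lexicon[token]) for token in tokens if token in lexicon]
    total = sum(score for _, score in matched)
    return {
        "score": total / len(tokens) if tokens else 0.0,
        "positive": [token for token, score in matched if score > 0],
        "negative": [token for token, score in matched if score < 0],
    }


result = sentiment("A brutal and threatening story")
print(result)
```

Returning the matched words alongside the score makes it possible to highlight them in the text, and extending the dictionary amounts to adding entries to the lexicon.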

After extracting the document’s sentiments based on a dictionary, results can be evaluated according to the metadata. Figure 8 illustrates the iLCM’s sentiment analysis function with an example of a textual description of a movie (derived from Kaggle 2022). We chose the movie Seven because its description depicts a threatening scenario. This example also shows that sentiment analysis does not always have to be about value judgements; it simply looks at the emotional components of a text, whether these are value judgements or expressed sentiment.

Fig. 8. A detailed view of the sentiment analysis results, with red-highlighted words indicating sentiments with negative connotations and green-highlighted words representing those with positive connotations

2.4.7 Keyword extraction

By extracting keywords (i.e., terms that are characteristic for a specific text), we can efficiently obtain a concise summary of essential information within a document collection. Keyword extraction methods use statistical characteristics of word distributions to assign weights to words based on their statistical significance. In the iLCM, standard keyword extraction procedures such as RAKE or Textrank are available to users (on which, see Ganiger and Rajashekharaiah 2018).
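The core of the RAKE procedure can be written down compactly. The Python sketch below follows the usual description of the algorithm: candidate phrases are the word sequences between stopwords and punctuation, each word is scored by degree divided by frequency, and a phrase receives the sum of its word scores. The short stopword list is a stand-in invented for this example, and the code is not the implementation used inside the iLCM.

```python
import re
from collections import defaultdict

# Toy stopword list for illustration; real applications use a full list.
STOPWORDS = {"a", "an", "the", "of", "and", "in", "on", "for", "to", "is", "was"}


def rake(text, stopwords=STOPWORDS):
    """Return candidate phrases with their RAKE scores, best first."""
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):  # split at punctuation
        current = []
        for word in fragment.split():
            if word in stopwords:  # a stopword ends the current phrase
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))

    frequency = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)  # co-occurrences within the phrase, incl. itself

    scores = {
        phrase: sum(degree[word] / frequency[word] for word in phrase)
        for phrase in set(phrases)
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy text invented for this example.
keywords = rake("New cases rise. The press conference on new cases was short.")
bigrams = [(phrase, score) for phrase, score in keywords if len(phrase) == 2]
print(bigrams)
```

Filtering the ranked phrases by length, as in the last step, is one simple way to restrict the output to two-word keywords.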

Figure 9 showcases keyword extraction from a news corpus of the British daily newspaper The Guardian using the RAKE algorithm. This analysis provides a swift understanding of the primary focus of the reporting, which revolves around the handling and impact of the Covid-19 pandemic. In this analysis, keywords have been selected to be bigrams (that is, two-word phrases such as “new cases”, “press conference”, “first dose”, etc.).

Fig. 9. Results for a bigram keyword analysis of Guardian newspaper articles that mainly report on the Covid-19 pandemic

2.4.8 Word embeddings

Since the release of Word2Vec by Mikolov et al. (

Fig. 11. Import interface of the iLCM, showing the mapping of the UNGA general debate CSV to the iLCM standard

3.2 Data analysis

3.2.1 Full text search

As we are interested in the analysis of speech acts in political communication, the first task is to check for the general availability of speech acts in the corpus. This can easily be achieved by looking up specific performative verbs, which, as Searle (1976, p. 16) has argued, regularly co-occur with speech acts. To gain a first overview of the corpus and potential speech acts, we use the full-text search, looking for the verbs “will” and “support”, which are often associated with declarative speech acts. With a total of 7625 sentences containing at least one of the targeted verbs, we can now leverage the iLCM to conduct a close reading of selected sentences, ensuring that they accurately represent the desired speech act types. Two example sentences found by the iLCM are shown here:

We support the peoples of Mozambique and Angola in their struggle to defend national independence against interference and aggression by imperialists and their reactionary lackeys. Laos, 1977

We will continue to combat illicit trade and the spread of small arms. Norway, 2000

3.2.2 Diachronic word frequencies

Based on the above qualitative findings, we can confidently state that, as would be expected of the UNGA, our corpus encompasses a notable quantity of expressive and declarative speech acts. This sets the stage for a subsequent quantitative analysis focusing on selected performative verbs and their probable associations with specific speech act types. For this purpose, we utilize the tool’s frequency analysis function to investigate the diachronic frequencies of selected word forms. Figure 12 provides a snapshot of three performative verbs that indicate how declarative speech acts have evolved in the corpus over time.

Fig. 12. Diachronic frequency plot of the verbs “support”, “help”, and “condemn”

3.2.3 Co-occurrence analysis

Furthermore, the context in which the performative verbs are used can be examined in more detail with the help of co-occurrence analysis to discover possible further indicators or peculiarities in the environment of the key words. For example, Fig. 13 shows words like “struggle”, “full”, “efforts”, “community”, and “assure” in semantic proximity to the performative verb “support”. The graph also shows that there is a particularly strong relation between the words “international” and “community”, which likely form an established bigram.

Fig. 13. Plot of the co-occurrence network for the term “support”

3.2.4 Supervised classification

The ultimate goal of this case study is to automatically identify and classify sentences according to different speech act types, so we can empirically investigate their distribution with respect to different speaker groups and time (who uses which speech acts, and how does this change over time?). To achieve this, we first create a basic codebook with categories for each of the five different speech act types (see Fig. 14).

Fig. 14. Codebook created and used in the iLCM for the classification of speech acts

For the individual classes, references are then searched for in the corpus and annotated accordingly (see Fig. 15). Given that the process of annotating these speech acts can be time-consuming, we leverage the Active Learning feature of the iLCM to streamline the process. As noted in Sect. 2.4.5, AL involves the ability to train an initial classifier using manually annotated data, which is subsequently employed to present the user with new candidate examples for each class. This approach minimizes the manual annotation process, reducing it to evaluating the suggested samples. Furthermore, this iterative process can be repeated to achieve a robust classifier that can automatically annotate speech acts in all texts. These annotated speech acts can then be evaluated alongside other metadata for comprehensive analysis.

Fig. 15. Interface of the iLCM for displaying and annotating texts

Figure 16 indicates that between 2002 and 2004 there is an increase in declaratives and a simultaneous decrease in expressives. From here, further qualitative and quantitative analysis can be performed, such as working with topic models to identify prevalent themes or subjects, analyzing the named entities mentioned in the text (which are already extracted by the linguistic preprocessing), or conducting a detailed analysis of individual documents.

Fig. 16. Diachronic development of speech act types, declaratives and expressives, aggregated by year

3.2.5 Export of results

The results of the analysis can be extracted from the tool both in the form of example figures and as actual data in the form of CSV or R data objects. This data can then be further processed and analyzed in other environments (such as the RStudio IDE), which allows for a very flexible handling of diverse research questions and research hypotheses.