
1 Introduction

Prediction of students’ performance has been a central theme within the field of learning analytics (LA) since the early days [1]. In fact, the initial conceptualization of the field has highlighted the use of digital data collected from learners to predict their success—among other usages. Such predictions hold the promise to help identify those who are at risk of low achievement, in order to proactively offer early support and appropriate intervention strategies based on insights derived from learners’ data [1, 2]. Nevertheless, the prediction of students’ performance is not unique to LA and was an important theme in related fields even before LA, e.g., academic analytics [3], educational data mining [4], and even far earlier in education research at large [5].

Such widespread, longstanding and continuous centrality of early and accurate prediction of students’ performance lends itself to the premise that detection of early signs could allow a timely prevention of, e.g., dropout, low achievement, or undesired outcomes in general [6]. More importantly, identifying the predictors could help inform interventions, explain variations in outcomes and inform educators of why such outcomes happened—a form of predictive modelling that is often referred to as explanatory modelling [7]. A notable example is the Signals system at Purdue University, where predictions were based on digital data collected from an online learning platform [8]. Signals produced predictions and classified students into three categories according to “safety”, presenting students with a traffic-light-inspired dashboard in which at-risk students received a red light. However, the influence of Signals on retention rates is unclear and often debated [9]. Several other systems have been designed, built and applied in practice, e.g., OU Analyse at the Open University, where the system offers informative dashboards to students and teachers as well as predictive models to forecast students’ performance [10].

Successful prediction of students’ performance has been demonstrated repeatedly in several LA studies across the years [11]. In general, the majority of the published studies used features extracted from logged learning trace data (i.e., data about students’ interactions with online learning activities and resources) and achieved accurate predictions for a considerable number of students. Yet, most of such studies examined a single course or what is often referred to as a convenience sample (i.e., a course with a sufficiently large and accessible dataset) [11]. Studies that attempted to apply predictive modelling across several courses have not found similar success [12,13,14,15]. For instance, Finnegan et al. [15] examined 22 courses across three academic domains using student log-trace data recorded from the learning management system. The authors found considerable differences among predictive models developed for individual courses regarding their predictive power as well as the significance of features. Similar results were reported by Gašević et al. [12] who used data from nine undergraduate courses in different disciplines to examine how instructional variations affected the prediction of academic success. Gašević et al. [12] found that predictors were remarkably different across courses with no consistent pattern that would allow for having one model applicable across all courses. Similarly, Conijn et al. [13] examined 17 courses across several subjects and confirmed the considerable variability of the indicators and predictive models across courses.

Studies within the same domain have also found significant differences in predictors and predictive models. For instance, a recent study [14] examined 50 courses with a similar course design and homogeneous pedagogical underpinning. The authors found variations among different offerings of the same course, that is, the same predictor was statistically significantly correlated with performance in one course offering, but not in the same course offered to similar students in the next year. Furthermore, some predictors were more consistent than others, e.g., the frequency of online sessions was more consistent than the frequency of lectures. In a similar vein, Jovanović et al. [16] applied mixed-effects linear modelling to the combined data of fifty courses and developed several predictive models with different combinations of features. All predictive models in the work by Jovanović et al. [16] were able to explain only a limited proportion of the variation in students’ grades. The intraclass correlation coefficient (a measure of the source of variability) of all models revealed that the main source of variability was the students themselves, that is, students’ specific features not captured in the logged data, pointing to the importance of taking students’ internal conditions into account.

The goal of this chapter is to introduce the reader to predictive LA. The next section is a review of the existing literature, including the main objectives, indicators and algorithms that have been operationalized in previous works. The remainder of the chapter is a step-by-step tutorial of how to perform predictive LA using R. The tutorial describes how to predict student success using students’ online trace log data extracted from a learning management system. The reader is guided through all the required steps to perform prediction, including the data preparation and exploration, the selection of the relevant indicators (i.e., feature engineering) and the actual prediction of student success.

2 Predictive Modelling: Objectives, Features, and Algorithms

Extensive research in the LA field has been devoted to the prediction of different measures of student success, as evidenced by the multiple reviews and meta-analyses on the topic [17,18,19,20]. Among the measures of student success that have been examined in the literature are student retention [21], grades [22], and course completion [23]. Predicting lack of success has also been a common target of predictive analytics, mostly in the form of dropout [24], with special interest in the early prediction of at-risk students [25, 26].

To predict student success, numerous indicators from varying data sources have been examined in the literature. Initially, indicators were derived from students’ demographic data and/or academic records. Some examples of such indicators are age, gender, and previous grades [27]. More recent research has focused on indicators derived from students’ online activity in the learning management system (LMS) [17, 20]. Many such indicators are derived directly from the raw log data, such as the total number of clicks, number of online sessions, number of clicks on the learning materials, number of views of the course main page, number of assignments completed, number of videos watched, and number of forum posts [13, 14, 28,29,30,31]. Other indicators relate to the time devoted to learning, rather than to the mere count of clicks, such as login time, login frequency, active days, time-on-task, average time per online session, late submissions, and periods of inactivity [13, 14, 32,33,34,35]. More complex indicators are often derived from the time, frequency, and order of online activities, such as the regularity of online activities, e.g., regularity of accessing lecture materials [16, 36, 37], or regularity of active days [14, 16]. Network centrality measures derived from network analysis of interactions in collaborative learning settings have also been considered, as they capture how interactions relate to each other and how important they are [38]. Research has found that predictive models relying on generic indicators are able to explain only a small portion of the overall variability in students’ performance [36]. Moreover, it is important to take the learning design into account and to focus on the quality, rather than the mere quantity, of learning [17, 20].

The variety of predictive algorithms that have been operationalized in LA research is also worth discussing. Basic algorithms, such as linear and logistic regression, or decision trees, have been used for their explainability, which allows teachers to make informed decisions and interventions related to the students “at risk” [37]. Other machine learning algorithms have also been operationalized, such as kNN or random forest [39, 40], although their interpretability is less straightforward. Lastly, the most cutting-edge techniques in the field of machine learning have also made their way to LA, such as XGBoost [41] or neural networks [42]. Although these complex algorithms often achieve high accuracy, their lack of interpretability is often pointed out as a reason why teachers avoid making decisions based on their outputs [7, 43].

It is beyond the scope of this review to offer comprehensive coverage of the literature. Interested readers are encouraged to consult the cited literature and the literature reviews on these topics [11, 14, 17,18,19,20].

3 Predicting Students’ Course Success Early in the Course

3.1 Prediction Objectives and Methods

The overall objective of this section is to illustrate predictive modelling in LA through a typical LA task of making early-in-the-course predictions of the students’ course outcomes based on the logged learning-related data (e.g., making predictions of the learners’ course outcomes after log data has been gathered for the first 2–3 weeks). The course outcomes will be examined and predicted in two distinct ways: (1) as success categories (high vs. low achievement), meaning that the prediction task is approached with classification models; (2) as a success score (the final grade), in which case regression models are required.

To meet the stated objectives, the following overall approach will be applied: create several predictive models, each one with progressively more learning trace data (i.e., logged data about the learners’ interactions with course resources and activities), as the data become available during the course. In particular, the first model will be built using the learning traces available at the end of the first week of the course; the second model will be built using the data available after the completion of the second week of the course (i.e., the data logged over the first 2 weeks); then, the next one will be built by further accumulating the data, so that we have learning traces for the first 3 weeks, and so on. In all these models, the outcome variable will be the final course outcome (high/low achievement for classification models and the final grade for regression models). We will evaluate all the models on a small set of properly chosen evaluation metrics and examine when (that is, how early in the course) we can make reasonably good predictions of the course outcome. In addition, we will examine which learning-related indicators (i.e., features of the predictive models) had the highest predictive power.

3.2 Context

The context of the predictive modelling presented in this chapter is a postgraduate course on learning analytics (LA), taught at the University of Eastern Finland. The course was 6 weeks long, though some assignments were due in the week after the official end of the course. The course covered several LA themes (e.g., Introductory topics, Learning theories, Applications, Ethics), and each theme was covered in roughly 1 week of the course. Each theme had a set of associated learning materials, mostly slides and reading resources. The course reading resources included seminal articles, book chapters, and training materials for practical work. The course also contained collaborative project work (referred to as the group project), in which students worked together in small groups to design an LA system. The group project ran throughout the course and was designed to align with the course themes. For instance, when students learned about LA data collection, they were required to discuss the data collection of their own project. The group project had two grades, one for the group project as a whole and another for the individual contribution to the project. It is important to note here that the dataset used in this chapter is a synthetic, anonymized version of the original dataset, augmented to three times the original size. For more details on the course and the dataset, please refer to the dataset chapter [44] of the book.

3.3 An Overview of the Required Tools (R Packages)

In addition to a set of tidyverse packages that facilitate general purpose data exploration, wrangling, and analysis tasks (e.g., dplyr, tidyr, ggplot2, lubridate), in this chapter, we will also need a few additional R packages relevant for the prediction modelling tasks:

  • The caret (Classification And REgression Training) package [45] offers a wide range of functions that facilitate the overall process of development and evaluation of prediction models. In particular, it includes functions for data pre-processing, feature selection, model tuning through resampling, estimation of feature importance, and the like. Comprehensive documentation of the package, including tutorials, is available online.

  • The randomForest package [46] provides an implementation of the Random Forest prediction method [47] that can be used both for the classification and regression tasks.

  • The performance package [48] offers utilities for computing indices of model quality and goodness of fit for a range of regression models. In this chapter, it will be used for estimating the quality of linear regression models. The package documentation, including usage examples, is available online.

  • The corrplot package [49] allows for seamless visual exploration of correlation matrices and thus facilitates understanding of the connections among variables in high-dimensional datasets. A detailed introduction to the functionality the package offers is available online.

3.4 Data Preparation and Exploration

The data that will be used for predictive modelling in this chapter originates from the LMS of a blended course on LA. The dataset is publicly available in a GitHub repository, while its detailed description is given in the book’s chapter on datasets [44]. In particular, we will make use of learning trace data (stored in the Events.xlsx file) and data about the students’ final grades (available in the Results.xlsx file).

We will start by familiarising ourselves with the data through exploratory data analysis.

Three lines of code load the tidyverse, lubridate, and rio libraries.

After loading the required packages, we will load the data from the two aforementioned data files:

A code snippet uses the import function for events and results with the path of the files.
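
For orientation, the two snippets just described might look roughly as follows; the file paths are placeholders (assumptions), not the repository’s actual ones.

  library(tidyverse)
  library(lubridate)
  library(rio)

  # Hypothetical file locations; adjust to where the dataset files are stored
  events  <- import("data/Events.xlsx")
  results <- import("data/Results.xlsx")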

We will start by exploring the events data, and looking first into its structure:

One line of a code reads glimpse of events.

Rows: 95,626
Columns: 7
$ Event.context <chr> "Assignment: Final Project", "Assignment: Final Project"~
$ user          <chr> "9d744e5bf", "91489f7a9", "278a75edf", "53d6ab60c", "aab~
$ timecreated   <dttm> 2019-10-26 09:37:12, 2019-10-26 09:09:34, 2019-10-18 12~
$ Component     <chr> "Assignment", "Assignment", "Assignment", "Assignment", ~
$ Event.name    <chr> "Course module viewed", "The status of the submission ha~
$ Log           <chr> "Assignment: Final Project", "Assignment: Final Project"~
$ Action        <chr> "Assignment", "Assignment", "Assignment", "Assignment", ~

Since we intend to build separate predictive models for each week of the course, we need to be able to organise the events data into weeks. Therefore, we will extend the events data frame with additional variables that allow for examining temporal aspects of the course events from the weekly perspective. To that end, we first order the events data based on the events’ timestamp (timecreated) and then add three auxiliary variables for creating the course_week variable: the weekday of the current event (wday), the weekday of the previous event (prev_wday), and an indicator variable for the start of a new week (new_week). The assumption applied here is that each course week starts on Monday, so the beginning of a new week (new_week) can be identified by the current event being on Monday (wday == "Mon") while the previous one was on any day other than Monday (prev_wday != "Mon"):

A code snippet orders events in a data frame by timestamp using the time created keyword. Four variables are added to the data frame using mutate functions. They are weekday, previous weekday, new week, and course week.
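
A minimal sketch of the described transformation, assuming lubridate’s wday function is used for the weekday labels (the exact arguments, and an English locale, are assumptions):

  events <- events |>
    arrange(timecreated) |>
    mutate(
      wday = wday(timecreated, label = TRUE, abbr = TRUE),  # weekday of the current event
      prev_wday = lag(wday),                                # weekday of the previous event
      new_week = !is.na(prev_wday) & wday == "Mon" & prev_wday != "Mon",
      course_week = cumsum(new_week) + 1                    # assumes the data starts in week 1
    )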

Having created the variable that denotes the week of the course (course_week), we can remove the three auxiliary variables, to keep our data frame tidy:

One line of code uses the select function to remove the weekday, previous weekday, and new week variables (combined with the c operator) from the events data frame.

We can now explore the distribution of the events across the course weeks. The following code will give us the count and proportion of events per week (with proportions rounded to the fourth decimal):

A code snippet calculates the number of events per course week and the corresponding proportions, rounded to 4 decimal places, using the count and mutate functions.
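
The described computation could look, for instance, like this (a sketch):

  events |>
    count(course_week) |>
    mutate(proportion = round(n / sum(n), 4))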

The output of the above lines shows that we have data for 7 weeks: the 6 weeks of the course plus one more week right after the course officially ended, during which students were still able to submit assignments. We can also observe that the level of students’ interaction with course activities steadily increased up until week 5 and then started going down.

Let us now move to examining the factor variables that represent different types of actions and logged events. First, we can check how many distinct values each of these variables has:

One line of code computes the number of distinct values of Event.context, Component, Event.name, Log, and Action using the summarise function.

  Event.context Component Event.name Log Action
1            80        13         27  80     12

We can also examine the unique values of each of the four variables, but it is better to examine them together, as that will help us better understand how they relate to one another and give us a better idea of the semantics of the events they denote. For example, we can examine how often distinct Component, Event, and Action values co-occur (Table 1):

Table 1 Count of all combinations of Component, Event, and Action
A code snippet counts the combinations of Component, Event.name, and Action in the events data and arranges them in decreasing order of n.

Likewise, we can explore how Action and Log values are related (i.e., co-occur) (Table 2):

Table 2 Count of all combinations of Log and Action
A code snippet uses the count and arrange functions to count the combinations of Log and Action in the events data and arrange them.
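
For reference, the two counting snippets described above (producing Tables 1 and 2) might look roughly like this:

  # Count of all combinations of Component, Event.name, and Action (Table 1)
  events |>
    count(Component, Event.name, Action) |>
    arrange(desc(n))

  # Count of all combinations of Log and Action (Table 2)
  events |>
    count(Log, Action) |>
    arrange(desc(n))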

Having explored the four categorical variables that capture information about the students’ interactions with course resources and activities, we will select the Action variable as the most suitable one for further analysis. The reason for choosing the Action variable is twofold: (1) it is not overly granular (it has 12 distinct values), and thus allows for the detection of patterns in the learning trace data; (2) it captures sufficient information about the distinct kinds of interaction the events refer to. In fact, the Action variable was manually coded by the course instructor to offer a more nuanced way of analysis. The coding was performed to group actions that essentially indicate the same activities under the same label. For instance, logs of viewing feedback from the teacher were grouped under the label feedback. Practical activities (Social network analysis or Process mining) were grouped under the label practicals. In the same way, accessing the group work forums designed for collaboration, browsing, reading others’ comments, or writing were all grouped under the label group_work [50].

We will rename some of the Action values to make it clear that they refer to distinct topics of the course materials:

A code snippet uses the mutate function with an if-else expression to relabel the topic-related Action values (General, Applications, Theory, Ethics, Feedback, La_types) so that they clearly refer to topics of the course materials.

Let us now visually examine the distribution of events across different action types and course weeks:

A code snippet computes event counts across action types and course weeks and visualizes the event distribution. It includes a ggplot with x = course week, y = n, and fill = action.
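
A sketch of such plotting code, assuming a stacked bar chart of weekly action proportions (as in Fig. 1); the geom and labels are assumptions:

  events |>
    count(course_week, Action) |>
    ggplot(aes(x = course_week, y = n, fill = Action)) +
    geom_col(position = "fill") +    # stacking to proportions within each week
    labs(x = "Course week", y = "Action proportion")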

From the plot produced by the above lines of code (Fig. 1), we can observe, for example, that group work (Group_work) was the most represented type of action from week 2 until the end of the course (week 6). It is followed by browsing the main page of the course containing the course materials, announcements and updates (Course_view) and working on practical tasks (Practicals). We can also note that the assignment-related actions (Assignment) are present mostly towards the end of the course.

Fig. 1
A stacked bar chart of action proportion versus course week indicates the highest values for the actions, group work and course view. The material ethics indicate 0 values in weeks 1 and 2 and decreasing values from week 3 to 7.

Distribution of action types across the course weeks

Now that we have familiarised ourselves with the events data and done some initial data preparation steps, we should do some final ‘polishing’ of the data and store it to have it ready for further analysis.

A code snippet keeps only the variables to be used for further analysis, renames some of the remaining ones to keep naming consistent, and saves the prepared data in the R native format.

The next step is to explore the grades data that we previously loaded into the results data frame:

A code reads glimpse of results.

Rows: 130
Columns: 15
$ user               <chr> "6eba3ff82", "05b604102", "111422ee7", "b4658c3a9",~
$ Grade.SNA_1        <dbl> 0, 8, 10, 5, 10, 7, 9, 10, 10, 10, 7, 10, 9, 9, 9, ~
$ Grade.SNA_2        <dbl> 0, 10, 10, 5, 10, 10, 9, 10, 10, 10, 8, 10, 10, 10,~
$ Grade.Review       <dbl> 6.67, 6.67, 10.00, 0.00, 10.00, 9.67, 6.67, 7.00, 1~
$ Grade.Group_self   <dbl> 5, 1, 10, 1, 10, 6, 10, 9, 10, 10, 6, 10, 10, 10, 9~
$ Grade.Group_All    <dbl> 4.00, 3.00, 9.11, 4.00, 9.18, 4.00, 8.56, 8.56, 9.2~
$ Grade.Excercises   <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 3.33, 10.00, 10.~
$ Grade.Project      <dbl> 0.00, 7.00, 9.33, 6.00, 5.33, 7.67, 0.00, 9.33, 10.~
$ Grade.Literature   <dbl> 6.67, 6.67, 10.00, 4.33, 10.00, 9.67, 5.00, 6.67, 1~
$ Grade.Data         <dbl> 4, 3, 5, 3, 5, 5, 1, 4, 5, 4, 5, 4, 4, 5, 3, 4, 5, ~
$ Grade.Introduction <dbl> 6, 6, 10, 4, 10, 10, 4, 8, 10, 8, 10, 10, 6, 8, 6, ~
$ Grade.Theory       <dbl> 2, 2, 10, 2, 10, 10, 2, 8, 6, 2, 8, 2, 2, 8, 8, 6, ~
$ Grade.Ethics       <dbl> 2, 8, 10, 2, 10, 10, 4, 6, 10, 6, 10, 10, 6, 8, 4, ~
$ Grade.Critique     <dbl> 4, 4, 10, 6, 10, 10, 2, 2, 10, 6, 10, 10, 6, 6, 6, ~
$ Final_grade        <dbl> 2.626970, 4.670169, 9.244600, 0.000000, 8.238179, 5~

Even though the results dataset includes the students’ grades on individual assignments, we will use only the final grade (Final_grade), since we do not have information about when, during the course, the individual assignment grades became available.

To get an overall understanding of the final grade distribution, we will compute the summary statistics and plot the density function for the Final_grade variable:

A code reads summary(results$Final_grade).

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   5.666   7.954   7.254   9.006  10.000

A code snippet creates a ggplot of the results with x = final grade and title = distribution of the final grade.

We can clearly notice both in the summary statistics and the distribution plot (Fig. 2) that the final grade is not normally distributed, but skewed towards higher grade values.

Fig. 2
A line graph of density versus final grade has a left-skewed curve that starts at (0.0, 0.0125), slightly decreases to (1.25, 0.01), gradually increases to (8, 0.22) with fluctuation, and then drops to (10, 0.12). The values are approximate.

Distribution of the final course grade

As noted in Sect. 3.1, we will build two kinds of prediction models: models that predict the final grade (regression models) as well as models that predict whether a student belongs to the group of high or low achievers (classification models). For the latter group of models, we need to create a binary variable (e.g., Course_outcome) indicating if a student is in the high or low achievement group. Students whose final grade is above the 50th percentile (i.e., above the median) will be considered as being high achievers in this course (High), the rest will be considered as having low course achievement (Low):

A code snippet uses an if-else expression to assign the High and Low categories based on whether the final grade is greater than the median final grade, and stores the result in a Course_outcome variable in the results data frame.
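
A possible implementation of the described median split (a sketch; turning the variable into a factor is an assumption, made because the classification models built later require a factor outcome):

  results <- results |>
    mutate(Course_outcome = if_else(Final_grade > median(Final_grade), "High", "Low"),
           Course_outcome = as.factor(Course_outcome))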

Now that we have prepared the outcome variables both for regression and classification models (Final_grade and Course_outcome, respectively), we can save them for later use in model building:

A code snippet uses the select and saveRDS functions to save the user identifiers, final grades, and course outcomes to a final_grades RDS file.

3.5 Feature Engineering

After the data has been preprocessed, we can focus on feature engineering, that is, the creation of new variables (features) to be used for model development. This step needs to be informed by the course design and any learning theory that underpins the course design, so that the features we create and use for predictive modelling are able to capture relevant aspects of the learning process in the given learning settings. In addition, we should consult the literature on predictive modelling in LA (see Sect. 2), to inform ourselves about the kinds of features that were good predictors in similar learning settings. Following such an approach, we have identified the following event-based features as potentially relevant:

  A. Features based on learning action counts:

    1. Total number of each type of learning action

    2. Average number of actions (of any type) per day

    3. Entropy of action counts per day

  B. Features based on learning sessions:

    1. Total number of learning sessions

    2. Average (median) session length (time)

    3. Entropy of session length

  C. Features based on the number of active days (i.e., days with at least one learning session):

    1. Number of active days

    2. Average time distance between two consecutive active days

In addition to the course-specific features (A1), the feature set includes several course-design-agnostic features (i.e., features not directly related to a specific course design, e.g., A2 and A3) that have proven to be good predictors in similar (blended) learning settings [14, 16, 36, 51]. Furthermore, the chosen features allow for capturing both the amount of engagement with the course activities (features A1, A2, B1, B2, C1) and the regularity of engagement (features A3, B3, C2) at different levels of granularity (actions, sessions, days).

To compute features based on action counts per day (group A), we need to extend the events dataset with date as an auxiliary variable:

A line of code extends the events dataset with a date variable, computed as as.Date(ts).

To compute features based on learning sessions, we need to add sessions to the events data. It is often the case that learning management systems and other digital learning platforms do not explicitly log beginning and end of learning sessions. Hence, LA researchers have used heuristics to detect learning sessions in learning events data. An often used approach to session detection consists of identifying overly long periods of time between two consecutive learning actions (of the same student) and considering them as the end of one session and beginning of the next one [14, 16, 36]. To determine such overly long time periods that could be used as “session delimiters”, LA researchers would examine the distribution of time periods between consecutive events in a time-ordered dataset, and set the delimiter to the value corresponding to a high percentile (e.g., 85th or 90th percentile) of the time distance distribution. We will rely on this approach to add sessions to the event data.

First, we need to compute time distance between any two consecutive actions of each student:

A code snippet groups the events by user, arranges them by ts, uses the mutate function to compute ts_diff = ts - lag(ts), ungroups, and assigns the result back to events.
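
A sketch of the described computation (note that the ordering is done before grouping, since dplyr’s arrange ignores groups by default):

  events <- events |>
    arrange(user, ts) |>
    group_by(user) |>
    mutate(ts_diff = ts - lag(ts)) |>   # time since the student's previous event
    ungroup()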

Next, we should examine the distribution of time differences between any two consecutive actions of each student, to set up a threshold for splitting action sequences into sessions:

A code snippet takes the ts_diff values, converts them to hours, and outputs a statistical summary of the time differences in hours.

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
  0.00000   0.00028   0.00028   1.40238   0.01694 307.05028       130

As the summary statistics are not sufficiently informative, we should examine the upper percentiles:

A code snippet calls the quantile function on the time differences in hours, with a sequence of probability values and na.rm set to TRUE.

    80%     81%     82%     83%     84%     85%     86%     87%     88%     89% 
  0.017   0.017   0.034   0.034   0.034   0.050   0.067   0.084   0.117   0.167 
    90%     91%     92%     93%     94%     95%     96%     97%     98%     99% 
  0.234   0.350   0.617   1.000   1.834   3.367   7.434  14.300  22.035  39.767 
   100% 
307.050
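
The two inspection steps above could be implemented roughly as follows (a sketch; the helper variable name is an assumption):

  ts_diff_hours <- as.numeric(events$ts_diff, units = "hours")  # convert difftime to hours
  summary(ts_diff_hours)
  quantile(ts_diff_hours, probs = seq(0.80, 1, 0.01), na.rm = TRUE)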

Considering the computed percentile values, on the one hand, and the expected length of learning activities in the online part of the course (which included long forum discussions), on the other, we will set 1.5 hours (a value between the 93rd and 94th percentiles) as the threshold for splitting event sequences into learning sessions:

A code snippet converts the time differences to hours, groups the events by user, arranges the data, and starts a new session whenever the time difference is greater than or equal to 1.5 hours. Finally, it ungroups the data.
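
A minimal sketch of the session detection step; the gap_hours, new_session, and session_id names are assumptions:

  events <- events |>
    arrange(user, ts) |>
    group_by(user) |>
    mutate(
      gap_hours = as.numeric(ts - lag(ts), units = "hours"),
      new_session = is.na(gap_hours) | gap_hours >= 1.5,     # 1.5 h threshold set above
      session_id = paste0(user, "_", cumsum(new_session))    # session counter per student
    ) |>
    ungroup()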

We will also add session length variable, which can be computed as the difference between the first and last action in each session, as it will be required for the computation of some of the features (B2 and B3):

A code snippet groups the events by user and session id, computes the session length as the difference between the maximum and minimum timestamps (in seconds), ungroups, and assigns the result to events_with_sessions.
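
The session length computation might then look like this (a sketch; the session_length name is an assumption):

  events_with_sessions <- events |>
    group_by(user, session_id) |>
    mutate(session_length = as.numeric(max(ts) - min(ts), units = "secs")) |>
    ungroup()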

After adding the necessary variables for feature computation, we can tidy up the dataset before proceeding to the feature computation. In particular, we will keep only the variables required for feature computation:

A code snippet uses the select function to keep only the necessary variables: user, ts, date, week, action, session number, session id, and session length.

All functions for computing the event-based features outlined above are given in the feature_creation R script. In the following, we give a quick overview of those functions, while the script with further explanations is available at the book’s GitHub repository:

  • the total_counts_per_action_type function computes the set of features labelled as A1, that is, counts of each type of learning action, up to the current week

  • the avg_action_cnt_per_day function computes feature A2, that is, the average (median) number of learning actions per day, up to the current week

  • the daily_cnt_entropy function computes feature A3, namely entropy of action counts per day, up to the current week

  • the session_based_features function computes all session-based features up to the current week: total number of sessions (B1), average (median) session length (B2), and entropy of session length (B3)

  • the active_days_count function computes the number of active days (C1), up to the current week

  • the active_days_avg_time_dist function computes avg. (median) time distance between two consecutive active days (C2), up to the current week

  • finally, the create_event_based_features function makes use of the above functions to compute all event-based features, up to the given week

Having defined the feature set and functions for feature computation, cumulatively for each week of the course, we can proceed to the development of predictive models. In the next section (Sect. 3.6), we will present the creation and evaluation of models for predicting the overall course success (high/low), whereas Sect. 3.7 will present models for predicting the final course grade.

3.6 Predicting Success Category

To build predictive (classification) models, we will use Random Forest [47]. This decision was motivated by the generally high performance of this algorithm on a variety of prediction tasks [52], as well as its high performance on prediction tasks specific to the educational domain [43].

Random forest (RF) is a very efficient machine learning method that can be used both for classification and regression tasks. It belongs to the group of ensemble methods, that is, machine learning methods that build and combine multiple individual models to do the prediction. In particular, RF builds and combines the output of several decision or regression trees, depending on the task at hand (classification or regression). The way it works can be briefly explained as follows: the method starts by creating a number of bootstrapped training samples to be used for building a number of decision trees (e.g. 100). When building each tree, each time a split in a tree is to be made, instead of considering all predictors, a random sample of predictors is chosen as split candidates from the full set of predictors (typically the size of the sample is set to be equal to the square root of the number of predictors). The reason for choosing a random sample of predictors is to make a diverse set of trees, which has proven to increase the performance. After all the trees have been built, each one is used to generate a prediction, and those predictions are then aggregated into the overall prediction of the RF model. In case of a classification task, the aggregation of predictions is done through majority vote, that is, the class voted (i.e., predicted) by the majority of the classification trees is the final prediction. In case of a regression task, the aggregation is done by averaging predictions of individual trees. For a thorough explanation of the RF method (with examples in R), an interested reader is referred to Chapter 8 of [53].

We will first load the additional required R packages as well as R scripts with functions for feature computation and model building and evaluation:

A code snippet uses the library function to load the caret and randomForest packages. The source function loads the feature creation and the model development and evaluation R scripts.

The following code snippet shows the overall process of model building and evaluation, one model for each week of the course, starting from week 1 to week 5. Note that week 5 is set as the last week for prediction purposes since it is the last point during the course when some pedagogical intervention, informed by the model’s output, can be applied by the course instructors.

A code snippet demonstrates the process of building and evaluating a model for each of weeks 1 through 5 of the course using a for loop. A variable ds holds the dataset created for course success prediction, and a variable train_indices holds a data partition created based on the course outcome.

The process consists of the following steps, each of which will be explained in more detail below:

  (1) Creation of a dataset for prediction of the course outcomes, based on the logged events data (events_with_sessions) up to the given week (k) and the available course outcomes data (results)

  (2) Splitting of the dataset into a part for training the model (train_ds) and a part for evaluating the model’s performance (test_ds)

  (3) Building a RF model based on the training portion of the dataset

  (4) Evaluating the model based on the test portion of the dataset

All built models and their evaluation measures are stored (in models and eval_measures lists) so that they can later be compared.
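
Put together, the loop described above might look roughly as follows. The helper function names for creating the dataset and for building and evaluating the classification model are assumptions modelled after the chapter’s descriptions and the regression counterparts named in Sect. 3.7, and the seed value is arbitrary:

  models <- list()
  eval_measures <- list()
  set.seed(2023)                           # arbitrary seed, for reproducibility
  for (k in 1:5) {
    # (1) features computed from the events logged up to week k, joined with the course outcome
    ds <- create_dataset_for_course_success_prediction(events_with_sessions, k, results)
    # (2) stratified 80/20 train/test split
    train_indices <- createDataPartition(ds$Course_outcome, p = 0.8, list = FALSE)
    train_ds <- ds[train_indices, ]
    test_ds  <- ds[-train_indices, ]
    # (3) build a Random Forest classification model on the training portion
    models[[k]] <- build_RF_classification_model(train_ds)
    # (4) evaluate the model on the test portion
    eval_measures[[k]] <- get_classification_evaluation_measures(models[[k]], test_ds)
  }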

Going now into details of each step, we start with the creation of a dataset to be used for predictive modelling in week k. This is done by first computing all features based on the logged events data (events_data) up to the week k, and then adding the course outcome variable (Course_outcome) from the dataset with course results (grades):

A code snippet defines a function that creates a dataset for course success prediction from the events data, the current week, and the grades. It uses the select and inner_join functions to combine each user’s course outcome with the user’s features.
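
Based on this description, the dataset-creation function might be sketched as follows (the function and argument names are assumptions; create_event_based_features comes from the feature_creation script described in Sect. 3.5):

  create_dataset_for_course_success_prediction <- function(events_data, current_week, grades) {
    features <- create_event_based_features(events_data, current_week)
    grades |>
      select(user, Course_outcome) |>
      inner_join(features, by = "user")
  }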

Next, to be able to properly evaluate the performance of the built model, we need to test its performance on a dataset that the model “has not seen”. This requires splitting the overall feature set into two parts: one for training the model (training set) and the other for testing its performance (test set). This is done in such a way that a larger portion of the dataset (typically 70–80%) is used for training the model, whereas the rest is used for testing. In our case, we use 80% of the feature set for training (train_ds) and 20% for evaluation purposes (test_ds). Since observations (in this case, students) are randomly selected for the training and test sets, to ensure that we can replicate the obtained results, we initialise the random number generator with an (arbitrary) seed value (set.seed).

In the next step, we use the training portion of the dataset to build a RF model, as shown in the code snippet below. We train a model by tuning its mtry hyper-parameter and choose the model with optimal mtry value based on the Area under the ROC curve (AUC ROC) metric. The mtry hyper-parameter defines the number of features that are randomly chosen at each step of tree branching, and thus controls how much variability will be present among the trees that RF will build. It is one of the key hyper-parameters for tuning RF models and its default value (default_mtry) is equal to the square root of the number of features (n_features). Hence, we create a grid that includes the default value and a few values around it.

A code defines the model hyperparameter to be tuned, includes the settings to train the model through 10-fold cross-validation, initializes the training process, and sets the evaluation measure for choosing the best value of the tuned hyperparameter.
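
A sketch of the described tuning and training with caret, assuming the outcome and user-id columns are named Course_outcome and user, and that the function itself is named as below:

  build_RF_classification_model <- function(train_ds) {
    n_features <- ncol(train_ds) - 2                       # excluding the user id and the outcome
    default_mtry <- round(sqrt(n_features))
    grid <- expand.grid(mtry = (default_mtry - 2):(default_mtry + 2))
    control <- trainControl(method = "cv", number = 10,    # 10-fold cross-validation
                            classProbs = TRUE, summaryFunction = twoClassSummary)
    train(Course_outcome ~ ., data = select(train_ds, -user),
          method = "rf", metric = "ROC", tuneGrid = grid, trControl = control)
  }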

The parameter tuning is done through 10-fold cross-validation (CV). K-fold CV is a widely used method for tuning the parameters of machine learning models. It is an iterative process, consisting of k iterations, where the training dataset is randomly split into k folds of equal size, and in each iteration, k-1 folds are used for training the model whereas the remaining fold is used for evaluating the model on the chosen performance measure (e.g., ROC AUC, as in our case). In particular, in each iteration, a different fold is used for evaluation purposes, whereas the remaining k-1 folds are used for training. When this iterative process is finished, the performance estimates computed in the individual iterations are averaged, thus giving a more stable estimate of the performance for a particular value of the parameter being tuned. CV is often done in 10 iterations, hence the name 10-fold CV.

The final step is to evaluate each model based on the test data. To that end, we compute four standard evaluation metrics for classification models—Accuracy, Precision, Recall, and F1—as shown in the code snippet below. These four metrics are based on the so-called confusion matrix, which is, in fact, a cross-tabulation of the actual and predicted counts for each value of the outcome variable (i.e., class).

A code snippet uses the model to make predictions on the test set, creates a confusion matrix, and computes evaluation measures based on the confusion matrix.
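
The evaluation step might be implemented along these lines (a sketch; the function name is an assumption, and, as explained below, low course achievement is treated as the positive class):

  get_classification_evaluation_measures <- function(model, test_ds) {
    predicted <- predict(model, newdata = test_ds)
    actual <- factor(test_ds$Course_outcome, levels = c("High", "Low"))
    cm <- table(Actual = actual, Predicted = factor(predicted, levels = c("High", "Low")))
    TP <- cm["Low", "Low"]    # low achievers correctly identified as such
    TN <- cm["High", "High"]
    FP <- cm["High", "Low"]
    FN <- cm["Low", "High"]
    accuracy  <- (TP + TN) / sum(cm)
    precision <- TP / (TP + FP)
    recall    <- TP / (TP + FN)
    F1 <- 2 * precision * recall / (precision + recall)
    data.frame(Accuracy = accuracy, Precision = precision, Recall = recall, F1 = F1)
  }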

In our case, the confusion matrix has the structure shown in Fig. 3. Its rows contain the counts of the actual numbers of students in the high and low achievement groups, whereas its columns give the predicted numbers of high and low achievers. We consider low course achievement as the positive class, since we are primarily interested in spotting those students who might benefit from a pedagogical intervention (to prevent a poor course outcome). Hence, TP (True Positive) is the count of students who had low course achievement and were predicted by the model as such. TN (True Negative) is the count of those who were high achieving in the course and the model predicted they would be high achievers. FP (False Positive) is the count of those who were high achievers in the course, but the model falsely predicted that they would have low achievement. Finally, FN (False Negative) is the count of students who were predicted to have high achievement in the course, but actually ended up in the low achievement group. These four count-based values forming the confusion matrix serve as the input for computing the aforementioned standard evaluation measures (Accuracy, Precision, Recall, and F1) based on the formulae given in the code snippet above.

Fig. 3
A confusion matrix of actual values versus predicted values indicates the high and low categories. The first-row entries are TN and FP in high-high and high-low. The second-row entries are FN and TP in low-high and low-low.

Confusion matrix for the prediction of the students’ overall course success

After the predictive models for weeks 1–5 are built and evaluated, we combine and compare their performance measures:

A code snippet uses the bind rows function to evaluate the measures through weeks 1 to 5, rounds off the values of accuracy and F 1 to 4 digits, and selects the week, and accuracy with F 1.

Table 3 shows the resulting comparison of the built models. According to all measures, models 2 and 3, that is, the models with the data from the first two and the first three weeks of the course, are the best. In other words, the students’ interactions with the course activities in the first 2–3 weeks are the most predictive of their overall course success. In particular, the accuracy of these models is 84%, meaning that for 84 out of 100 students, the models will correctly predict whether the student will be a high or low achiever in this course. These models have a precision of 75%, meaning that out of all the students for whom the models predict low achievement in the course, 75% will actually have low course achievement. In other words, the models will underestimate students’ performance in 25% of the predictions they make, by wrongly predicting that students will have low course achievement. The two best models have perfect recall (100%), meaning that they would identify all the students who will actually have low course performance. These models outperform the other three models also in terms of the F1 measure, which was expected considering that this measure combines precision and recall, giving them equal relevance. Interestingly, studies exploring predictive models on a weekly basis have found similarly high predictive power for models developed around the second week of the course [54].

Table 3 Comparison of prediction models for successive course weeks

RF allows for estimating the relevance of the features used for model building. In a classification task, RF estimates feature relevance as the total decrease in node impurity (measured by the Gini index) resulting from splits on a particular feature, averaged over all the trees that a RF model builds [46]. We will use this functionality to compute and plot the importance of features in the best model. The function that does the computation and plotting is given below.

A code snippet utilizes RF functionality to compute and plot feature importance for the best model. It creates a ggplot with x = reorder(variable, importance), y = importance, and title = feature importance.
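
A sketch of such a function, using caret’s varImp wrapper around the Random Forest importance scores (the plotting details are assumptions):

  compute_and_plot_variable_importance <- function(model) {
    imp <- varImp(model)$importance               # for classification: mean decrease in Gini impurity
    imp_df <- data.frame(variable = rownames(imp), importance = imp[[1]])
    ggplot(imp_df, aes(x = reorder(variable, importance), y = importance)) +
      geom_col() +
      coord_flip() +
      labs(x = NULL, y = "Importance", title = "Feature importance")
  }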

The plot produced by this function for one of the best models (Model 2) is given in Fig. 4.

Fig. 4
A horizontal bar graph of feature importance indicates the decreasing values from top to bottom for 14 categories. The average session length has the highest value of 6.5, while practicals count has the lowest of 1.8. Values are approximate.

The importance of features in the best course outcome prediction model, as estimated by the RF algorithm

A line of code uses the compute and plot variable importance function for model 2.

As Fig. 4 shows, features denoting the overall level of activity (avg_session_len, session_cnt) are those with the highest predictive power. They are followed by entropy-based features, that is, features reflective of the regularity of study. These findings are in line with the LA literature (e.g., [14, 36, 37]). It should also be noted that the feature reflective of the level of engagement in the group work (Group_work_cnt) is among the top 5 predictors, which can be explained by the prominent role of group work in the course design.

3.7 Predicting Success Score

To predict the students’ final grades, we will first try to build linear regression models, since linear regression is one of the most often used regression methods in LA [43]. To that end, we will first load a few additional R packages:

Two lines of code load the performance and corrplot libraries.

Considering that a linear regression model can be considered valid only if it satisfies the set of assumptions that linear regression, as a statistical method, is based upon (linearity, homogeneity of variance, normally distributed residuals, and absence of multicollinearity and influential points), we will first examine if our data satisfies these assumptions. In particular, we will compute the features based on the events data from the first week of the course, build a linear regression model using the computed features, and examine if the resulting model satisfies the assumptions. Note that we limit our initial exploration to the logged events data over the first week of the course since we aim to employ a regression method that can be applied to any number of course weeks; if the data from the first course week allow for building a valid linear regression model, we can explore the same method further; otherwise, we need to choose a more robust regression method, that is, a method that is not as susceptible to imperfections in the input data.

Having created the dataset for final grade prediction based on the week 1 events data, we will split it into training and test sets (as done for the prediction of the course outcome, Sect. 3.6), and examine correlations among the features. The latter step is due to the fact that one of the assumptions of linear regression is the absence of high correlation among predictor variables. In the code below, we use the corrplot function to visualise the computed correlation values (Fig. 5), so that highly correlated variables can be easily observed.

Fig. 5
A correlation matrix with number of cells increasing from 1 to 13 from top to bottom compares the correlation of 13 categories. Correlation values range from negative 1 to 1, with 2 shades denoting positive and negative correlations. Assignment count in course materials has the most of 13 cells.

Correlations among variables in the feature set

A code snippet creates the dataset for grade prediction and examines the correlations among the variables, since they must not be highly mutually correlated for a linear regression model.
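
A sketch of the described exploration for the week 1 data; create_dataset_for_grade_prediction is the helper function named later in this section, while the corrplot display options are assumptions:

  ds <- create_dataset_for_grade_prediction(events_with_sessions, 1, results)
  set.seed(2023)                                   # arbitrary seed, as before
  train_indices <- createDataPartition(ds$Final_grade, p = 0.8, list = FALSE)
  train_ds <- ds[train_indices, ]
  test_ds  <- ds[-train_indices, ]
  corr_matrix <- cor(select(train_ds, -user, -Final_grade))
  corrplot(corr_matrix, type = "lower", diag = FALSE)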

Figure 5 indicates that there are a couple of features that are highly mutually correlated. These will be removed before proceeding with the model building. While there is no universal agreement on the correlation threshold above which features should be considered overly correlated, correlation coefficients of 0.75 and −0.75 are often used as the cut-off values [53].

A code snippet removes the session count, course view count, active days count, and entropy of daily counts from the train_ds variable and stores the result in a new variable called train_ds_sub.
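
Based on the feature names that appear in the collinearity check further below, the removal step might look as follows (a sketch):

  train_ds_sub <- train_ds |>
    select(-c(session_cnt, Course_view_cnt, active_days_cnt, entropy_daily_cnts))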

We can now build a model and check if it satisfies the assumptions of linear regression:

A code snippet fits a linear regression model of the final grade on the train_ds_sub data, stores it in lr, and then applies the check_model function to the fitted model.
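
A sketch of the described model fitting and assumption check (excluding the user identifier from the predictors is an assumption):

  lr <- lm(Final_grade ~ . - user, data = train_ds_sub)
  check_model(lr)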

The check_model function from the performance R package [48] allows for seamless, visual verification of whether the assumptions are met. The output of this function when applied to our linear regression model (lr) is shown in Fig. 6. As the figure shows, two important assumptions of linear models are not met, namely linearity and homoscedasticity (i.e., homogeneity of variance). Therefore, linear regression cannot be used with the given feature set. Instead, we have to use a regression method that does not impose such requirements on the data distribution. Since Random Forest is such a method and it has already proven successful with our dataset on the classification task (Sect. 3.6), we will use it to build regression models that predict students’ final grades.

Fig. 6
6 graphs. Multiple bell curves for posterior predictive check. 4 scatterplots with a curve for linearity and homogeneity of variance, with a solid and 2 dashed curves for influential observations, and with a positive correlation for normality of residuals. Dots with error bars for collinearity.

The output of the check_model function enables visual verification of the assumptions that linear regression is based upon

Before moving on to regression with Random Forest, it is worth noting that, in addition to checking all model assumptions at once using the check_model function, one can also check each assumption individually using the appropriate functions from the performance R package. For example, in Fig. 6 one cannot clearly see the X-axis of the collinearity plot and might want to explore this assumption more closely. That can be easily done using the check_collinearity function:

A line of code reads check_collinearity(lr).

# Check for Multicollinearity

Low Correlation

                 Term  VIF   VIF 95% CI Increased SE Tolerance Tolerance 95% CI
 Course_materials_cnt 3.02 [2.32, 4.08]         1.74      0.33     [0.25, 0.43]
       Group_work_cnt 4.10 [3.09, 5.59]         2.02      0.24     [0.18, 0.32]
     Instructions_cnt 4.70 [3.52, 6.44]         2.17      0.21     [0.16, 0.28]
       Practicals_cnt 2.30 [1.81, 3.08]         1.52      0.43     [0.32, 0.55]
           Social_cnt 3.74 [2.83, 5.09]         1.93      0.27     [0.20, 0.35]
       Assignment_cnt 2.04 [1.63, 2.71]         1.43      0.49     [0.37, 0.61]
        avg_daily_cnt 2.89 [2.23, 3.90]         1.70      0.35     [0.26, 0.45]
      avg_session_len 2.24 [1.77, 3.00]         1.50      0.45     [0.33, 0.56]
  session_len_entropy 2.65 [2.06, 3.56]         1.63      0.38     [0.28, 0.49]
        avg_aday_dist 2.34 [1.84, 3.13]         1.53      0.43     [0.32, 0.54]

From the function’s output, we can clearly see the VIF (Variance Inflation Factor) values for all the features and a confirmation that the assumption of the absence of multicollinearity is satisfied. The documentation of the performance package provides the whole list of functions for different ways of checking regression models.

To build and compare regression models in each course week, we will follow a similar procedure to the one applied when building classification models (Sect. 3.6); the code that implements it is given below. The differences are in the way that the dataset for grade prediction is built (create_dataset_for_grade_prediction), the way that regression models are built (build_RF_regression_model) and evaluated (get_regression_evaluation_measures), and these will be explained in more detail below.

A code snippet creates the dataset for grade prediction, creates a data partition, builds the RF regression model, and computes the regression evaluation measures, using the rf, train_ds, and test_ds variables.

To create a dataset for final grade prediction in week k, we first compute all features based on the logged events data (events_data) up to the week k, and then add the final grade variable (Final_grade) from the dataset with course results (grades):

A code snippet defines a function that creates the dataset for grade prediction: it creates event-based features from the events data up to the current week, selects the user and final grade from the grades data, and uses the inner_join function to combine them with each user’s features.

As can be observed in the code snippet below, building a RF regression model is very similar to building a RF classification model. The main difference is in the evaluation measure that is used for selecting the optimal mtry value in the cross-validation process - here, we are using RMSE (Root Mean Squared Error), which is a standard evaluation measure for regression models [53]. As its name suggests, RMSE is the square root of the average squared differences between the actual and predicted values of the outcome variable on the test set.

A code snippet builds the RF regression model using the n_features, default_mtry, grid, control, and rf variables. The rf block includes method = "rf", metric = "RMSE", tuneGrid = grid, and trControl = control.

Finally, to evaluate each model on the test data, we compute three standard evaluation metrics for regression models, namely RMSE, MAE (Mean Absolute Error), and R² (R-squared). MAE is the average value of the absolute differences between the actual and predicted values of the outcome variable (final grade) on the test set. R² is a measure of the variability in the outcome variable that is explained by the given regression model. The computation of the three evaluation measures is shown in the code below.

A code computes 3 standard evaluation metrics for regression models, Root Mean Squared Error, Mean Absolute Error, and R-squared. Mean Absolute Error measures average absolute difference between the actual and predicted values. R-squared indicates the proportion of variance in the outcome variable.
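
The computation might be implemented as follows (a sketch; the R-squared formula shown is one common definition, based on the residual and total sums of squares):

  get_regression_evaluation_measures <- function(model, test_ds) {
    actual    <- test_ds$Final_grade
    predicted <- predict(model, newdata = test_ds)
    RMSE <- sqrt(mean((actual - predicted)^2))     # root mean squared error
    MAE  <- mean(abs(actual - predicted))          # mean absolute error
    R2   <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
    data.frame(RMSE = RMSE, MAE = MAE, R2 = R2)
  }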

After the regression models for weeks 1–5 are built and evaluated, we combine and compare their performance measures, with the results reported in Table 4.

Table 4 Comparison of grade prediction models for successive course weeks
A code snippet combines the regression evaluation measures into a data frame with the bind_rows function, uses the mutate function to add week numbers 1 to 5, rounds the values to 4 digits, and selects the week along with R2, RMSE, and MAE.

As shown in Table 4, the situation is not as clear-cut as it was with the classification task (Table 3), since the three evaluation measures point to different models as potentially the best ones. In particular, according to R², the best model would be model 5 (i.e., the model based on the data from the first 5 weeks), whereas the other two measures point to the second or the third model as the best. Considering that (1) RMSE and MAE are considered more important than R² when evaluating the predictive performance of regression models [13] and (2) the RMSE and MAE values for models 2 and 3 are very close, while the second model is better in terms of R², we conclude that the second model, that is, the model based on the logged event data from the first 2 weeks of the course, is the best model. This model explains 84.23% of the variability in the outcome variable (final grade), and predicts it with an average absolute error of 0.4531, which can be considered a small value with respect to the value range of the final grade (0–10).

To estimate the importance of features in the best regression model, we will again leverage RF’s ability to estimate feature importance. We will use the same function as before (compute_and_plot_variable_importance) to estimate and plot feature importance. The only difference is that the importance function (from the randomForest package) will internally use the residual sum of squares as the measure of node impurity when estimating the features’ importance. Figure 7 shows that, as in the case of predicting the overall course success (Fig. 4), regularity-of-study features (avg_aday_dist, entropy_daily_cnts, session_len_entropy) are among the most important ones. In addition, the number of learning sessions (session_cnt), as an indicator of overall activity in the course, is also among the top predictors.

Fig. 7
A horizontal bar graph of feature importance indicates decreasing values for 14 categories from top to bottom. The average a day distance and session count have the highest value of 62, while practicals count has the lowest of 15. Values are estimated.

The importance of features in the best final grade prediction model, as estimated by the RF algorithm

A line of code reads compute_and_plot_variable_importance(regression_models[[2]]).

4 Concluding Remarks

The results of the predictive modelling presented in the previous section show that, in the examined postgraduate course on LA, we can make fairly accurate predictions of the students’ course outcomes already in the second week of the course. In fact, both the classification and the regression models, that is, the prediction of the students’ overall course success and of their final grades, proved to be the most accurate when based on the logged learning events data from the first two or three course weeks. This means that students’ learning behaviour in the first part of the course is highly predictive of their course performance, which is in line with related research on predictive modelling (e.g., [54,55,56]). It should also be noted that the high performance of the presented predictive models can be partially explained by the well-chosen feature set and the used algorithm (Random Forest), which generally performs well on prediction tasks. However, it may also be due to the relatively large dataset. As noted in Sect. 3.2, we used a synthetic anonymized version of the original dataset that is three times larger than the original dataset.

Considering the features that proved particularly relevant for predicting the students’ course performance, we note that in both kinds of prediction tasks—course success and final grade prediction—features reflective of the regularity of study stand out. In addition, features denoting the overall level of engagement with online learning activities and resources also have high predictive power. It is also worth noting that the highly predictive features are session-level features, suggesting that the learning session is the right level of granularity (better than actions or active days) for predictive modelling in the given course. In fact, this finding is in line with earlier studies that examined the predictive power of a variety of features derived from learning traces [13, 14, 36]. Note that, since this chapter is intended as an introductory reading on predictive modelling in LA, we based the feature set on relatively simple features and used only one data source for feature creation. For more advanced and diverse feature creation options, interested readers are referred to, for example, [57,58,59].

The algorithm used for building the predictive models, namely Random Forest, offers the advantage of flexibility in terms of the kinds of data it can work with (unlike, for example, linear regression, which is based on several assumptions about the data distribution), as well as the fairly good prediction results it tends to produce. On the other hand, the algorithm is not as transparent as simpler algorithms (e.g., linear regression or decision trees), and thus its use might raise issues of teachers’ trust and willingness to rely on the models’ output. On the positive side, the algorithm offers an estimate of feature importance, thus shedding some light on the underlying “reasoning” process that led to its output (i.e., predictions).

To sum up, predictive modelling, as applied in LA, can bring about important benefits in the form of early-in-the-course detection of students who might be struggling with the course and the identification of indicators of learning behaviour that are associated with poor course outcomes. With such insights available, teachers can make better informed decisions as to which students need support and the kind of support they might benefit from. However, predictive modelling is also associated with challenges, especially practical challenges related to the development and use of such models, including the availability of and access to data, the interpretation of models and their results, and the associated issue of trust in the models’ output.

5 Suggested Readings