Introduction

Traditional recommendation strategies build user profiles using information directly supplied by users through surveys, ratings, and registration forms [13]. When the available information about the courses is in textual form, vectorial representations are generally used as descriptors of courses and users [11].

CF bases its recommendations on student ratings for courses or learning materials. It is the most popular technique for online course recommendation, outperforming CB recommenders [11]. In this work, in the absence of explicit user ratings, we estimate them using the semantic representation of courses and users.

SA recommender systems [8] have gained attention with the advent of Linked Open Data (LOD). LOD provides a variety of structured knowledge bases in RDF format that are freely accessible on the Web and interconnected with each other, and that can be exploited to build better recommendation systems [8]. The term Knowledge Graph was coined by Google in 2012 and refers to LOD knowledge bases such as DBpedia (Footnote 1) that concentrate the knowledge of multiple domains and specify a large number of interrelationships between concepts. Typically, SA recommenders have been used in domains with a direct mapping between the recommended object and a concept in the knowledge base. For example, the ESWC 2014 Challenge [1] worked on books that were mapped to their corresponding DBpedia concepts. Other SA recommenders are not limited to a candidate set in which each item has a direct resource representing it in the Knowledge Graph. Instead, they take textual information as input and identify the set of concepts present as annotations in the text [6]. In our work, we also follow this approach and build an SA-CB recommender based on a semantic representation of courses and users.

The major contribution of our work lies in the generation of recommendations in limited user information scenarios. Although the aforementioned works have used and combined recommendation techniques, all of them operate with explicit user information: demographic information, evaluation results, perception surveys, and explicit ratings. To our knowledge, only our previous work has addressed limited information scenarios where identifying the user and their interactions on the platform is not straightforward [10, 11]. In contrast to previous works using this dataset, our paper employs a semantic approach to address this challenge.

Course Recommendation System Overview

We propose a hybrid recommender system that combines CB and CF strategies. GCFGlobal uses GA to register, in a log format, the user interactions with the lessons’ web pages that compose a course. This is the only information available about the users of the site. To extract the courses viewed by a user from the logs, we have to make the following two assumptions: (i) a user is identified by a unique GA number assigned to each physical device, so we suppose that each user uses a single device to access the platform. This holds in most cases when short periods of time are analyzed, but we cannot guarantee it; and (ii) accessing a lesson is equivalent to studying it. We do not have a reliable way to measure the percentage of the lesson studied or evaluation results.
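As an illustration, the course viewed in a logged page access can be recovered from the URL path. The sketch below is ours (function name and example URL are hypothetical), assuming the “language/course/lesson/additional-params” URL pattern described in “Google Analytics Logs Processing”:

```python
from urllib.parse import urlparse

def course_from_url(url):
    """Extract the course slug from a GCFGlobal-style URL.

    Assumes the path follows "language/course/lesson/additional-params";
    returns None when the path is too shallow to contain a course.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    return parts[1] if len(parts) >= 2 else None

# Hypothetical example URL:
print(course_from_url("https://edu.gcfglobal.org/en/excel/intro-to-excel/1/"))  # → excel
```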

Recent research shows that the CF model yields better results than the CB model [10, 11], but it requires enough information about users. On the GCFGlobal platform, around 70% of the traffic comes from new users, so we need a hybrid approach that combines the CF and CB models. Our recommendation system operates as follows: if the user is new or does not have log records of lessons from at least three courses, the recommendation is made by the CB model; in any other case, it is made by the CF model.
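The dispatch rule above can be sketched as follows (a minimal illustration; the function name and the log-record shape are assumptions, not the platform’s actual data model):

```python
def choose_model(user_course_logs, min_courses=3):
    """Route a user to CF or CB following the hybrid rule: CF only when
    the user has log records from at least `min_courses` distinct
    courses, CB otherwise (including brand-new users)."""
    distinct_courses = {log["course"] for log in user_course_logs}
    return "CF" if len(distinct_courses) >= min_courses else "CB"

print(choose_model([]))  # → CB (new user, no logs)
```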

Fig. 1: Overview of the recommendation system. GA4 stands for Google Analytics 4

Figure 1 shows the general structure of the recommendation system. The “Data Processing” section explains the process used to extract the user information from GA, the course content, and the semantic enrichment processing. Then, the “Content-Based Recommendation with Semantic Enrichment” section presents the strategy used in the CB system, and the “Collaborative Filtering Recommendation with Semantic Enrichment” section explains how the users’ rating matrix for the CF system is estimated by exploiting the semantic representation.

Data Processing

The dataset contains the GA logs and the course content of GCFGlobal, and is the same dataset used in [10]. In the following sections, we explain the processing carried out.

Google Analytics Logs Processing

The information in GA (Footnote 2) is organized into sessions. A session starts when a user enters the site and finishes automatically after 30 min of inactivity or when the user leaves the site (Footnote 3). The content within a session comprises user IDs that facilitate the tracking of user activities on the web page, segmented into events. Among these events, the “Page View” event stands out, encompassing the URL of the accessed page. URLs in GCFGlobal are specified as “language/course/lesson/additional-params”, so we are able to identify the course and the lesson accessed by each user. In previous studies with this dataset [10], researchers explored additional information from these events, such as session duration. However, while the incorporation of this information improved the model’s precision, it also led to a significant reduction in the number of users in the dataset.

Fig. 2: Data processing steps

Figure 2 shows the processing steps. We start by loading the user log information, and then we transform the timestamp field to temporally sort the logs, both to optimize some operations and to respect the chronological order of the logs. Note that we only extract one year of logs (2021-01-31 to 2022-01-31). We then identify the “Page View” events to analyze their URL structure and obtain the names of the courses viewed by a user. We filter the users by the number of courses accessed. This filter discards users with fewer than three courses accessed and users with outlier behavior (more than 50 courses accessed, which corresponds to more than three standard deviations above the mean). Finally, we split the data in temporal order, using the first 70% to build the models and the remaining 30% to test. Taking three courses as a lower limit allows us, in the worst case, to build recommendation models with two courses and test with the third one.
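The filtering and temporal split can be sketched as follows (the tuple-based event format is an assumption for illustration; the actual pipeline operates on GA logs):

```python
from collections import defaultdict

def filter_and_split(events, min_courses=3, max_courses=50, train_frac=0.7):
    """events: iterable of (user_id, timestamp, course) tuples.

    Keeps users with min_courses..max_courses distinct courses, then
    splits each user's chronologically sorted course sequence into a
    train prefix (first 70%) and a test suffix (last 30%)."""
    by_user = defaultdict(list)
    for user, ts, course in sorted(events, key=lambda e: e[1]):
        by_user[user].append(course)
    train, test = {}, {}
    for user, courses in by_user.items():
        n_distinct = len(set(courses))
        if not (min_courses <= n_distinct <= max_courses):
            continue  # discard sparse users and outliers
        cut = max(1, int(len(courses) * train_frac))
        train[user], test[user] = courses[:cut], courses[cut:]
    return train, test
```

With exactly three courses, the worst case described above is reproduced: the model is built from the first two and tested on the third.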

Course Content Processing

The GCF platform uses a MongoDB to store the course information in HTML format. The GCFGlobal learning content is organized into courses, and each course is divided into lessons. For each course, we create a consolidated version of the course content by appending the raw text (all HTML tags were deleted) from its different lessons. The final corpus contains an average of 3704.27 words and 216.68 per course.
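The HTML-tag removal can be done, for instance, with Python’s standard html.parser module (a minimal sketch; it does not special-case content inside script or style tags):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the raw text of a document while dropping all HTML tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_html(html_doc):
    parser = TextExtractor()
    parser.feed(html_doc)
    # Normalize whitespace left behind by removed tags.
    return " ".join(" ".join(parser.chunks).split())

print(strip_html("<h1>Excel</h1><p>Learn <b>formulas</b> fast.</p>"))
# → Excel Learn formulas fast.
```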

Semantic Enrichment Processing

As mentioned in “Related Work”, there are different SA strategies proposed in the literature for recommendation systems. Due to the nature of the elements to be recommended (i.e., courses), we opted for an exogenous approach in which we identify concepts in the text and link them to the DBpedia knowledge graph. DBpedia covers the topics of the courses offered and is available in the three languages offered by GCFGlobal: English, Spanish, and Portuguese. We use the plain text of each course, extract relevant concepts, and describe the course in terms of a concept set. Each concept is identified with a URI that links it to its representation in the DBpedia ontology, allowing cross-referencing between different annotation services. Additionally, each concept annotation comes with a confidence score that indicates the semantic agreement between the context in the source text, the concept in the knowledge graph, and the other concepts found in the text. Figure 3 shows the workflow used to extract the concepts from each course.

Fig. 3: Semantic enrichment process

In the first step, concept mentions found in the text of the resource are extracted. This process takes an input text and returns a set of URIs of structured LOD entities to allow further reasoning on related concepts. Services such as DBpedia Spotlight (Footnote 4), Babelfy (Footnote 5), or TextRazor (Footnote 6) can be employed for this task. In this work, we used the TextRazor and Babelfy services. It is important to mention that we do not perform any additional verification of the discovered annotations. As mentioned in [14], there is no guarantee that the above services identify annotations correctly, so manual cleaning is suggested. However, in a realistic automatic scenario, a manual correction process is not feasible.

In the second step, for each service and concept, we extract the associated confidence score. The TextRazor confidence score does not have a defined range: it is a number in the interval \([0, \infty )\). According to the TextRazor documentation, a value of 10 or more is enough to consider the concept a reliable annotation (Footnote 7). In Babelfy, the confidence score is in the interval [0, 1], and values closer to one imply higher confidence. Later, we use the confidence of each service as a filter. We create three concept sets: (i) the “Razor set”, which consolidates the concepts annotated via TextRazor and contains a total of 3,960 concepts with a mean of 82 concepts per course; (ii) the “Babel set”, which contains 6,734 concepts annotated by Babelfy with a mean of 148 concepts per course; and (iii) the “General set”, which merges the concepts of TextRazor and Babelfy, with a total of 8,241 concepts and a mean of 184 concepts per course.
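A per-service confidence cut-off can be sketched as below. The TextRazor threshold of 10 follows its documentation; the 0.5 Babelfy threshold is purely illustrative, since (as the experiments later show) we actually filter Babelfy scores by percentile groups rather than a fixed value:

```python
def reliable_concepts(annotations, service):
    """Keep annotations whose confidence clears a service-specific
    threshold: >= 10 for TextRazor (unbounded score), and an
    illustrative >= 0.5 for Babelfy (score in [0, 1]).

    `annotations` maps concept URIs to confidence scores."""
    threshold = 10.0 if service == "textrazor" else 0.5
    return {uri for uri, score in annotations.items() if score >= threshold}
```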

Content-Based Recommendation with Semantic Enrichment

The CB recommendation system seeks to recommend courses that are thematically similar to those the user has already seen. As a first step, it is necessary to transform the text into a vectorized form that allows the calculation of a similarity metric. Subsequently, we create a simple user model composed of the vector representations of the courses accessed by the user. The vectors of these courses are averaged in each dimension to create a single user-vector. Finally, we use this user-vector to calculate the cosine similarity with all courses and find the most similar ones, generating a top-N list of courses to recommend to the user.
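The user-model and ranking steps can be sketched as follows (toy two-dimensional vectors; a real run would use the topic-based course representations discussed below):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend_cb(course_vectors, viewed, n=5):
    """course_vectors: course name -> vector; viewed: courses the user saw.

    Averages the viewed courses' vectors dimension by dimension into a
    single user-vector, then ranks the unseen courses by cosine
    similarity to produce a top-N list."""
    dims = len(next(iter(course_vectors.values())))
    user_vec = [sum(course_vectors[c][d] for c in viewed) / len(viewed)
                for d in range(dims)]
    candidates = [c for c in course_vectors if c not in viewed]
    candidates.sort(key=lambda c: cosine(user_vec, course_vectors[c]),
                    reverse=True)
    return candidates[:n]
```

Usage with hypothetical vectors: a user who viewed “excel” gets “word” (a nearby vector) ranked above “yoga” (an orthogonal one).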

The main step in the CB model is the selection of an appropriate vectorized representation of the course. According to the literature review, one of the most used strategies is topic modeling via algorithms like Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) [5, 13].

Collaborative Filtering Recommendation with Semantic Enrichment

Collaborative filtering (CF) assumes that if two users rate the same course in the same way, they might also rate another course equally. Therefore, the CF model needs ratings to find similar users; yet on the GCF platform, we do not have feedback about the courses from the users, and the scores obtained by the users in the activities are not saved. For this reason, we propose to infer the ratings using the user-vector representation of the CB system (see “Content-Based Recommendation with Semantic Enrichment”) to create the rating matrix.

The idea is to use the user-vector as a representation of user interest to estimate a rating for the courses viewed by the user. In the end, for each course viewed by the user, we have the cosine similarity with the user-vector, which represents the course’s rating. Although this is not the ideal way to build the ratings, in a scenario of limited user information like ours it is an alternative that enables the use of CF models [11]. With the rating of each user-course pair, we use the turicreate library (Footnote 8) to create a user-course rating matrix, identify similar users via cosine similarity, and select the top five recommended courses according to the courses viewed by the selected similar users. With this simple CF model, we achieved a significant improvement over the CB recommendation models (see “Collaborative Filtering Recommendation with Semantic Enrichment”).
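The rating estimation can be sketched as follows (the dictionary-based matrix layout is illustrative; in our pipeline, turicreate consumes the resulting user-course ratings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def estimate_ratings(course_vectors, user_histories):
    """Build the inferred user-course rating matrix: each rating is the
    cosine similarity between the user-vector (the dimension-wise
    average of the viewed courses' vectors) and the course vector."""
    ratings = {}
    dims = len(next(iter(course_vectors.values())))
    for user, viewed in user_histories.items():
        user_vec = [sum(course_vectors[c][d] for c in viewed) / len(viewed)
                    for d in range(dims)]
        ratings[user] = {c: cosine(user_vec, course_vectors[c])
                         for c in viewed}
    return ratings
```

By construction, a user who viewed a single course rates it with similarity 1.0, and the ratings of users with broader histories spread below that.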

Experiments and Results

For the evaluation, we use the information extracted from Google Analytics as our ground truth (explained in “Google Analytics Logs Processing”). The resulting dataset contains the historical courses viewed by 7071 users. Our recommendation model is built using the first \(70\%\) of the courses accessed by a user in timeline order. The remaining \(30\%\) is used for testing. Users in our dataset have accessed at least three courses. Therefore, in the worst case, the recommendation model can be trained with the first two courses and tested with the last course accessed. The recommendations were evaluated using the precision at K (P@K) metric [4, 9, 17]. In the literature, the K value is usually set in the interval [5, 10], with \(k=10\) being the most common value. In our case, the majority of users do not have 5 courses in the test set, so a fixed value of 5 could over-penalize the recommendation models. For this reason, we decided to use a custom K that depends on the courses viewed by the user. Our K is in the interval [j, 5], where j is the number of courses viewed by a user in the test set, with a limit of 5 courses. The maximum of 5 courses corresponds to the length of the final list suggested to the user. To compare the different recommendation models, we also run several hypothesis tests. In all cases, the tests were performed as follows: (a) generate 100 subsets of 3000 users from the full dataset; (b) calculate the global average P@5 of the recommendations generated by the two models to be compared over the 100 subsets; (c) with the precision results, use a Wilcoxon rank-sum test to assess statistical significance (\(\rho < 0.05\)).
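The adaptive P@K can be sketched as follows (the function name is ours):

```python
def precision_at_k(recommended, relevant, k_max=5):
    """P@K with the adaptive K described above: K = min(len(relevant), k_max),
    so a user with fewer than k_max test courses is not over-penalized.

    recommended: ranked list of recommended courses.
    relevant: courses the user actually viewed in the test split."""
    k = min(len(relevant), k_max)
    if k == 0:
        return 0.0
    relevant_set = set(relevant)
    hits = sum(1 for course in recommended[:k] if course in relevant_set)
    return hits / k
```

For a user with only two test courses, K drops to 2, so one hit in the top 2 already yields a precision of 0.5 instead of 0.2 under a fixed K of 5.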

Content-Based Recommendation with Semantic Enrichment

Our first experiment compares the LSA and LDA models using our semantically enriched representation. We use the Gensim library and perform a grid search for the best value of the “number of topics” hyperparameter (Footnote 9). Figure 4 shows the recommendation results for different values in the interval [5, 395] with steps of 10.

Fig. 4: P@5 results for LSA and LDA using the “General set” for different “number of topics” values

We can see an irregular behavior for LDA, with a slight tendency for precision to increase as the number of topics increases. For the LSA model, it is clear that an increase in the number of topics implies an increase in recommendation precision, with a maximum of 0.22 after 215 topics.
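The experiments use Gensim’s LsiModel; as a rough illustration of the underlying operation, the sketch below computes the topic projection via truncated SVD (the decomposition behind LSA) on a toy term-document matrix with hypothetical counts:

```python
import numpy as np

def lsa_project(term_doc, num_topics):
    """Project a term-document count matrix into `num_topics` latent
    dimensions via truncated SVD, the decomposition underlying LSA."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    k = min(num_topics, len(s))
    # Each row of the result is a document in the latent topic space,
    # scaled by the corresponding singular values.
    return (np.diag(s[:k]) @ vt[:k]).T

# Toy matrix: 3 terms (rows) x 3 documents (columns), hypothetical counts.
docs = lsa_project(np.array([[2., 0., 1.],
                             [0., 3., 1.],
                             [1., 1., 0.]]), num_topics=2)
print(docs.shape)  # → (3, 2)
```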

We run a hypothesis test to compare LSA and LDA. Following the previous results, we compare the LDA model with 255 topics and the LSA model with 215 topics. Table 1 presents the results obtained in terms of P@5. The first column contains the results using the complete dataset. The second column corresponds to the average P@5 over the 100 subsets used for hypothesis testing. We test whether the results obtained via LSA are statistically significantly different from those of LDA (i.e., evidence that the medians of the two populations differ). The last column of the table corresponds to the \(\rho\)-value obtained. According to the results, the LSA representation leads to a better recommendation.

Table 1 Results comparison LDA vs LSA

In the second experiment, we filter the concept sets of the semantic representation using the confidence score. Our hypothesis is that the more confident the annotations, the better the representation of the course and, therefore, the better the recommendation. The filter operates using percentiles and creates four groups. The first group (Group 1) takes all of the concepts, the second one (Group 2) excludes the concepts with a confidence value below the 25th percentile, the third group (Group 3) filters according to the 50th percentile, and the last one (Group 4) uses the 75th percentile. We also added an additional filter, inspired by natural language processing representations, that removes concepts that are highly frequent in the corpus. We call this the “frequency filter”; it filters out concepts that appear in 95% or more of the courses.
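The two filters can be sketched as follows (a stdlib sketch; the exact percentile interpolation method is an assumption and may differ from our implementation):

```python
from statistics import quantiles

def confidence_filter(concept_scores, group):
    """The four percentile groups described above: Group 1 keeps all
    concepts; Groups 2-4 drop concepts below the 25th/50th/75th
    percentile of the confidence scores."""
    if group == 1:
        return set(concept_scores)
    q25, q50, q75 = quantiles(concept_scores.values(), n=4)
    cut = {2: q25, 3: q50, 4: q75}[group]
    return {c for c, score in concept_scores.items() if score >= cut}

def frequency_filter(course_concepts, max_doc_frac=0.95):
    """Drop concepts appearing in 95% or more of the courses.
    course_concepts: course name -> set of concept URIs."""
    n = len(course_concepts)
    all_concepts = set().union(*course_concepts.values())
    return {c for c in all_concepts
            if sum(c in cs for cs in course_concepts.values()) / n < max_doc_frac}
```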

Table 2 LSA results for different filters’ combinations

Table 2 shows the results of the different combinations of concept set, confidence score group, and the application or not of the frequency filter using the LSA model. For each combination, we indicate the resulting dictionary size (i.e., the number of concepts). The results suggest that the confidence score does not have a relevant impact on the recommendation models. The best case uses all the concepts from the Babel set (Combination 5). The top three recommendation models are Combinations 5, 11, and 12, which use the Babel set. We can conclude that the Razor set leads, in general, to the worst recommendation precision. In most cases, the precision increases with the frequency filter.

Finally, we want to compare our recommendation model that uses semantic enrichment with the top-performing models documented in [11]. In that work, different textual representations are compared for course recommendation. According to the results, the best representations were the MPNet pre-trained model [12] and LSA. MPNet is a recent multilingual contextual embedding model pre-trained on large-scale unlabelled corpora that employs the transformer architecture. As in [11], we did not perform a model refinement process (i.e., fine-tuning) on MPNet; we used the version reported in the literature without any modification.

Table 3 presents the comparison results. The evidence is clear: LSA operating on the enriched semantic representation is better than LSA operating only on textual information. A second conclusion is that our representation is also superior to the best model (MPNet) reported in the literature on this dataset [11], and the difference is statistically significant. It is important to consider that the MPNet model does not require extra training: for new courses on the platform, MPNet can be used without any additional process. In contrast, in our approach, when a new course is added, it is necessary to extract the concepts (“Course Content Processing”), and the LSA model should be re-trained to generate a better topic-based representation. Although concept annotation and LSA re-training can be done offline before including the course on the platform, we consider this a disadvantage compared to using a representation like MPNet.

Table 3 Comparison of our approach with textual representations via MPNet and LSA

Collaborative Filtering Recommendation with Semantic Enrichment

We employ the CF system described in “Collaborative Filtering Recommendation with Semantic Enrichment” and take the top 5 courses for each user. We obtained an outstanding \(P@5=0.41\) evaluated on the complete dataset. This is clearly superior to CB and validates the superiority of CF described in the recommender systems literature. Despite these results, we cannot depend exclusively on a CF model in our scenario. As mentioned above, 70% of users are new users, so it is not possible to estimate ratings for them. For this reason, a hybrid approach is needed that alternates between CF and CB depending on whether user log information is available or not.

Conclusions and Future Work

This paper presented a strategy to create a hybrid recommendation system based on a semantic representation of courses to build CB and CF models for limited user information scenarios. Our research shows that an approach using semantic concepts from knowledge graphs (DBpedia) combined with classical algorithms like LSA can obtain better results than cutting-edge models like transformers when a CB system is used. However, transformer models could be better in environments where the number of courses on the platform is high, because these models do not require new training when a new course is added.

The results show that the best model in terms of P@5 was the CF recommendation model, with a value of 0.41, i.e., two out of five recommended courses could be of the user’s interest. We find this result satisfactory given our limited user information scenario. The results obtained in this paper outperform previous works on this dataset.

Future work involves using the concept sets to generate a concept graph that connects users and courses, enabling us to explore graph-based recommender strategies [16] and expansion strategies using knowledge graphs [7].