1 Introduction

Recommender systems have become essential for speeding up internet users' searches for relevant content. In various hubs, including e-commerce and e-libraries [1, 2], the recommender system (RS) is typically used as an information retrieval tool to address the problem of information overload. Specifically, the item recommendation process in RS relies on three well-known techniques: content-based filtering (CBF), collaborative filtering (CF), and hybrid filtering. CF is the most widely used method across many RS domains [3]. Locating the nearest neighbors using the kNN algorithm is the most important component of CF for enhancing recommendation quality; thus, the similarity measure has an undeniable impact on how well CF performs. To find the closest neighbors (i.e., comparable users/items), numerous similarity measures have been proposed in the literature [3,4,5,6]. The target user's preferences on different items are predicted using the rating matrix and the similarity levels of the nearest neighbors. This idea serves as the foundation for the Top-N recommendation list [7].

CF itself is subdivided into memory-based and model-based models. In general, memory-based recommendation uses similarity metrics and correlation to suggest items to a user based on past user-item ratings. Memory-based approaches in the literature can be categorized as either user-based or item-based CF [8]. Since the 1990s, the significance of recommendation has grown steadily, transforming RS into a separate field of study. The categories of RS approaches are displayed in Fig. 1.

Fig. 1

The three basic categories of RS methods are content-based filtering, collaborative filtering, and hybrid filtering. The two types of collaborative filtering are memory-based and model-based. Deep learning models and SVD methods make up the two subcategories of model-based methods, while user-based and item-based methods are the two subcategories of memory-based techniques

Our work in this paper centers on collaborative filtering (CF), which is based on the idea that we can forecast an active user's ratings for new products by taking into account previous users' ratings of related products [4]. CF approaches are further subdivided into neighborhood (memory-based) and model-based models. Memory-based models predict ratings for new items based on the ratings provided by the target user's nearest neighbors. In contrast, in the model-based method applying SVD, the first subcategory of model-based CF, a linear model is used to learn latent components that explain ratings in a lower-dimensional latent space [1]. Despite the effectiveness of SVD algorithms in RS, deep learning models, the second subtype of model-based CF, have been adopted by the research community and industry to learn latent features of users and items [1, 7]. However, despite the advantages of deep learning, nearest neighbor and similarity measures are still widely used in commercial recommender systems [16, 22] because of their simplicity. This fact drives us to continue developing similarity measures to improve item-based model performance.

While the item-based model has a great impact on CF's performance, the vast majority of CF efforts have refrained from looking into the item-based model due to the model's complexity and the lengthy time needed to implement it using similarity measure(s). As a result, a thorough analysis of the effect of the item-based model on CF performance is currently lacking in the CF literature. Item-based CF, originally introduced by the GroupLens Research Group [9], is a more practical model for item recommendation, although it is more complex than the user-based model. The item-based model's straightforward premise is that people frequently like items analogous to those they previously loved. To forecast the target item [10] for the active user, the item-based model uses the degrees of similarity between items. Overall, the item-based model has the capacity to make more accurate recommendations than the user-based model, which is one of its main advantages [9,10,11].

In addition to the reasons mentioned above for choosing the item-based model, this work's primary objective is to introduce new similarity measures that are empirically demonstrated to be promising and to investigate how effective the similarity measures are on the Movielens-100 K and Film Trust datasets using the item-based model. We are also motivated to study the impact of the item-based model on CF performance using dozens of similarity measures because of other significant factors, such as the item-based model's straightforward implementation, its ability to handle cold starts, its ability to scale well with correlated items [8], as well as the glaring lack of a substantial investigation into the item-based model's impact on CF performance in the literature.

Simply put, the following contributions help this work bridge some of the gaps in the literature: (1) We develop five new item-based similarity measures to properly address the cold-start problem using the item-based model. These measures are simple and straightforward in design, and experimental evidence demonstrates that they are highly effective in producing precise predictions and recommendations. (2) Some important aspects are implicitly taken into account during the formulation of the proposed similarity measures, without the use of weighting factors, to consistently increase RS performance and quality. These aspects include decreasing or eliminating the reliance on co-rated items, fully utilizing all rated and non-rated items, and simultaneously treating similarity as symmetric and asymmetric. Experimentally, the effectiveness of recommender systems based on item-based CF is improved using our proposed similarity measures. (3) We leverage the item-based model with fivefold cross-validation to conduct a comprehensive, empirically driven comparative analysis of 30 similarity measures on two commonly known datasets (Movielens-100 K and Film Trust). To demonstrate the effectiveness and precision of our proposed methods, we also provide comprehensive empirical results; one key goal is to use the item-based model to benchmark these measures on the selected datasets. According to experimental findings on these benchmark datasets, our proposed measures—in particular, NPSM—have competitive prediction and recommendation quality. Empirically, NPSM addresses data sparseness successfully when compared to existing representative measures.

The rest of the paper is organized as follows: the most pertinent works are covered in Sect. 2. All considered measures, including the new measures and their rivals, are introduced in Sect. 3. The experimental results are presented and concisely discussed in Sect. 4. Finally, our work is concluded, along with a look at future work, in Sect. 5.

2 Related Work

Researchers have been attracted to memory-based collaborative filtering (CF), whose ultimate goal is to improve the performance of the recommender system by addressing its flaws. The general goal of memory-based CF is to ascertain the user's preferences by taking into account their ratings of various items. In reality, CF has become a popular strategy for the recommendation process due to the growing amounts of data on the internet. Recommendation uses the two-phase item-based CF process, which involves calculating the similarity between items/users and predicting ratings using similar items/users. However, on sparse datasets, the quality of the recommendations is severely diminished [12]. The top-k most similar items/users are still found using a variety of similarity measures in both user-based and item-based CF, prediction methodologies are used for rating prediction, and the top N recommended items are then generated. Numerous research studies that use new, combined, or modified similarity measures have been published in the literature to enhance CF performance. In this work, on the one hand, we use item-based models in our empirical investigation to explore as many potential similarity measures as we can to gauge their actual influence on CF. On the other hand, we theoretically rely on the most recent research that addresses item-based CF.

For instance, a modified Bhattacharyya coefficient was proposed in [8] to lessen the impact of data sparsity. The user-user similarity, which served as a weight in the item-based CF, was determined using the proposed similarity measure. When user-user similarity was computed by the proposed modified similarity measure, the experimental study on the MovieLens dataset indicated a modest improvement in item-based CF. Memory-based CF techniques, such as KNNBasic, KNNBaseline, KNNWithMeans, SVD, and SVD++, were evaluated in [13] to run RS by (1) determining a user-item similarity matrix and prediction matrix, and (2) applying the model-based CF technique via co-clustering to find a user-item similarity matrix and prediction matrix. The results showed that, when applied to MovieTrust datasets using cross-validation (CV = 5, 10, and 15), the CF run utilizing the KNNBaseline technique achieved a reduced error rate.

On the same theme, a technique to incorporate item-variance weighting into item-based CF was proposed in [14] to improve CF performance. A time-related correlation degree was also introduced and employed to create a time-aware similarity computation to improve prediction accuracy. This was done to recognize the relationship between two items and reduce the weight of items rated over an extended period of time. According to their experimental findings, the proposed technique performed better than conventional item-based CFs. On the other hand, formal concept analysis (FCA), a mathematical model-based CF, was utilized in [15] to improve the rating prediction of unknown users. To estimate the ratings of the unidentified users, the FCA approach was applied using Boolean matrix factorization (i.e., the best formal concepts). The proposed method was tested on the Movielens dataset using the item-based CF technique, and the results compared favorably with rivals.

The improvement of the conventional similarity measures that are frequently applied in item-based CF was the main goal of the work in [16]. The three conventional measures used—cosine-based similarity, Pearson correlation similarity, and modified cosine similarity—were balanced using an algorithm, which was also contrasted with the existing conventional measures. According to the results, the proposed approach provided better item-based CF for recommendation systems than the previous ones. In [17], an item-based CF algorithm was suggested. It comprised an item-based prediction method with a new neighbor selection approach and an item-based similarity measure based on fuzzy sets [Kullback–Leibler (KL) divergence was employed as the similarity measure]. The proposed method offered improved prediction and recommendation quality, according to the results.

Meanwhile, the improvement of rating prediction was the focus of the effort in [18]. To obtain a more precise calculation of the weight between the active user and each of their neighbors, an objective function was created and optimized to be employed in the rating prediction phase. The results demonstrated that the proposed strategy performed better than others, particularly when there was a small or medium number of carefully chosen neighbors. A user-internal similarity-based recommendation algorithm (UISRC) was subsequently proposed in [19]. The recommendation score was modified to produce score predictions and suggestions using the internal similarity of users. The findings demonstrated that UISRC can outperform conventional CF and MD algorithms in terms of recommendation accuracy and diversity.

In addition, to make effective recommendations, Bag et al. [20] proposed a new structural similarity named RJaccard, which not only took into account co-rated cases but also emphasized the importance of the number of non-co-rated cases. They tested the measure over the Movielens-100 K dataset, and the results showed the proposed measure to be promising. Gazdar et al. [21] proposed a new similarity measure, OS, which further added the influence of the proportion of co-rated cases. Similarly, to calculate item similarity, Wang et al. [22] established a divergence-based similarity measure that further incorporated the effect of the fraction of co-rated cases. Finally, in [23], a new item-based Kullback–Leibler (KL) divergence-based CF method was suggested, in which an item similarity measure based on fuzzy sets and an item-based prediction method (ISP) make up the proposed algorithm. The results showed that the proposed method is promising.

3 Methodology

3.1 Motivation

In practice, the efficacy of both neighborhood-based and model-based RS techniques is significantly affected by the similarity measure (either user-based or item-based). More specifically, the similarity method can be considered a key element of CF success given the impact that similarity has on the accuracy of its recommendations. In this regard, our experimental findings show that most of the existing similarity measures miss the effect of sparse data on prediction outcomes and fail to address the cold-start issue when the item-based model is taken into account. Thus, the ultimate goal of this research is to add new simple-yet-effective similarity measures to the item-based CF literature. These similarity measures are mathematically designed to significantly improve CF's performance by implicitly addressing the data sparsity challenge (i.e., the cold-start problem). Moreover, the influence of the item-based model, in contrast to user-based CF, has not been given convincing attention. Due to its complexity and prolonged run time compared to the user-based model, only a small number of studies (already mentioned in the related work section) have explored the item-based model using similarity measures for collaborative filtering. Meanwhile, this paper aims to give a thorough empirical investigation of similarity measures for collaborative filtering using the item-based model, in addition to proposing new similarity measures.

Furthermore, the comparative evaluation findings reported in this article could be a greatly helpful guide for researchers in this field as they select the best similarity measure for the item-based model. Most crucially, our empirical study is the first to fully address the item-based model utilizing over 30 similarity measures. The item-based empirical work, which addresses this research gap, is a more difficult challenge than the user-based model and has not yet received enough attention in the literature. In conclusion, even though a variety of similarity measures have been offered from various angles to improve the accuracy of similarity results, our review of the literature and experimental study indicate that more research is still required to produce highly competitive similarity measures.

3.2 The Proposed Similarity Measures

The creation of an active user's neighborhood depends critically on the calculation of similarity, which also influences how accurate the CF-based recommendation approach is. To improve prediction accuracy, this section describes five novel similarity measures that try to account for the actual similarity between items. Technically speaking, advancing the accuracy of memory-based CF approaches is the main goal of the proposed work. Memory-based CF approaches search for comparable users/items for an active user/item using a database of user ratings of items. The ratings of items not viewed by the active user are then predicted using ratings provided by comparable users. The effectiveness of the similarity measure has a significant impact on the recommendations' accuracy.

In an attempt to find the best-performing CF using the item-based model, we develop a new measure that is based on the ideas of TF and IDF. Given two rating vectors u1 = (r11, r12,…, r1n) and u2 = (r21, r22,…, r2n) of user 1 and user 2, respectively, in which some rij can be missing (empty), this measure can be seen as a quasi-IDF, which is defined in Eq. 1, as follows:

$$\mathrm{quasiIDF}\left({u}_{1},{u}_{2}\right)=1-\frac{\left(\sum_{j\in {I}_{1}\backslash {I}_{2}}{r}_{1j}\right)\left(\sum_{j\in {I}_{2}\backslash {I}_{1}}{r}_{2j}\right)}{\left(\sum_{j\in {I}_{1}}{r}_{1j}\right)\left(\sum_{j\in {I}_{2}}{r}_{2j}\right)}$$
(1)

The notation "\" denotes the set difference operator. Similarly, following the idea of term frequency (TF), we define the so-called quasi-TF in Eq. 2, as follows:

$$\mathrm{quasiTF}\left({u}_{1},{u}_{2}\right)=\frac{\left(\sum_{j\in {I}_{1}\cap {I}_{2}}{r}_{1j}\right)\left(\sum_{j\in {I}_{1}\cap {I}_{2}}{r}_{2j}\right)}{\left(\sum_{j\in {I}_{1}}{r}_{1j}\right)\left(\sum_{j\in {I}_{2}}{r}_{2j}\right)}$$
(2)

Note that the notation “\(\cap \)” denotes the intersection operator in set theory. The new measure, called quasi-TF-IDF (QTI), is the product of the quasi-TF and the quasi-IDF, according to Eq. 3.

$$\mathrm{QTI}\left({u}_{1},{u}_{2}\right)=\mathrm{quasiTF}\left({u}_{1},{u}_{2}\right)\times \mathrm{quasiIDF}\left({u}_{1},{u}_{2}\right)=\frac{\left(\sum_{j\in {I}_{1}\cap {I}_{2}}{r}_{1j}\right)\left(\sum_{j\in {I}_{1}\cap {I}_{2}}{r}_{2j}\right)}{\left(\sum_{j\in {I}_{1}}{r}_{1j}\right)\left(\sum_{j\in {I}_{2}}{r}_{2j}\right)}\times \left(1-\frac{\left(\sum_{j\in {I}_{1}\backslash {I}_{2}}{r}_{1j}\right)\left(\sum_{j\in {I}_{2}\backslash {I}_{1}}{r}_{2j}\right)}{\left(\sum_{j\in {I}_{1}}{r}_{1j}\right)\left(\sum_{j\in {I}_{2}}{r}_{2j}\right)}\right)$$
(3)
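As a concrete reference for Eqs. 1–3, the following is a minimal Python sketch (not the authors' Java implementation); it assumes each user's ratings are stored in a dictionary mapping item ids to rating values, with unrated items simply absent, and all function names are ours.

```python
def quasi_idf(u1, u2):
    """Eq. 1: 1 minus the product of the rating sums over items unique to each user,
    normalized by the product of each user's total rating sums."""
    i1, i2 = set(u1), set(u2)
    only1 = sum(u1[j] for j in i1 - i2)
    only2 = sum(u2[j] for j in i2 - i1)
    return 1 - (only1 * only2) / (sum(u1.values()) * sum(u2.values()))

def quasi_tf(u1, u2):
    """Eq. 2: product of the co-rated rating sums over the product of the total rating sums."""
    common = set(u1) & set(u2)
    s1 = sum(u1[j] for j in common)
    s2 = sum(u2[j] for j in common)
    return (s1 * s2) / (sum(u1.values()) * sum(u2.values()))

def qti(u1, u2):
    """Eq. 3: QTI = quasi-TF x quasi-IDF."""
    return quasi_tf(u1, u2) * quasi_idf(u1, u2)
```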

In the meantime, we also derive a Jaccard-combined QTI, which yields the QTIJ similarity measure according to Eq. 4.

$$\mathrm{QTIJ}\left({u}_{1},{u}_{2}\right)=\frac{\left(\sum_{j\in {I}_{1}\cap {I}_{2}}{r}_{1j}\right)\left(\sum_{j\in {I}_{1}\cap {I}_{2}}{r}_{2j}\right)*\mathrm{Jaccard}\left({u}_{1},{u}_{2}\right)}{\left(\sum_{j\in {I}_{1}}{r}_{1j}\right)\left(\sum_{j\in {I}_{2}}{r}_{2j}\right)}\times \left(1-\frac{\left(\sum_{j\in {I}_{1}\backslash {I}_{2}}{r}_{1j}\right)\left(\sum_{j\in {I}_{2}\backslash {I}_{1}}{r}_{2j}\right)\times \left(1-\mathrm{Jaccard}\left({u}_{1},{u}_{2}\right)\right)}{\left(\sum_{j\in {I}_{1}}{r}_{1j}\right)\left(\sum_{j\in {I}_{2}}{r}_{2j}\right)}\right)$$
(4)

Based on the Jaccard measure, Eq. 4 can be written compactly as Eq. 5:

$$\mathrm{QTIJ}\left({u}_{1},{u}_{2}\right)=\mathrm{quasiTFJ}\left({u}_{1},{u}_{2}\right)\times \mathrm{quasiIDFJ}\left({u}_{1},{u}_{2}\right)$$
(5)

where,

$$\mathrm{quasiTFJ}\left({u}_{1},{u}_{2}\right)=\frac{\left(\sum_{j\in {I}_{1}\cap {I}_{2}}{r}_{1j}\right)\left(\sum_{j\in {I}_{1}\cap {I}_{2}}{r}_{2j}\right)\times \mathrm{Jaccard}\left({u}_{1},{u}_{2}\right)}{\left(\sum_{j\in {I}_{1}}{r}_{1j}\right)\left(\sum_{j\in {I}_{2}}{r}_{2j}\right)}$$
(5.1)
$$\mathrm{quasiIDFJ}\left({u}_{1},{u}_{2}\right)=1-\frac{\left(\sum_{j\in {I}_{1}\backslash {I}_{2}}{r}_{1j}\right)\left(\sum_{j\in {I}_{2}\backslash {I}_{1}}{r}_{2j}\right)\times \left(1-\mathrm{Jaccard}\left({u}_{1},{u}_{2}\right)\right)}{\left(\sum_{j\in {I}_{1}}{r}_{1j}\right)\left(\sum_{j\in {I}_{2}}{r}_{2j}\right)}$$
(5.2)

The following example demonstrates how the QTIJ measure works. Given two rating vectors u1 = (r11 = 2, r12 = 5, r13 = 7, r14 = 8, r15 = ?, r16 = 9) and u2 = (r21 = 9, r22 = ?, r23 = ?, r24 = 6, r25 = 5, r26 = 1), we have I1 = {1, 2, 3, 4, 6} and I2 = {1, 4, 5, 6}, and the following:

$$\mathrm{quasiTFJ}\left({u}_{1},{u}_{2}\right)=\frac{\left(2+8+9\right)\left(9+6+1\right)}{\left(2+5+7+8+9\right)\left(9+6+5+1\right)}\times \frac{\left|{I}_{1}\cap {I}_{2}\right|}{\left|{I}_{1}\cup {I}_{2}\right|}=\frac{19\times 16}{31\times 21}\times \frac{3}{6}\cong 0.23,$$
$$\mathrm{quasiIDFJ}\left({u}_{1},{u}_{2}\right)=1-\frac{\left(5+7\right)*5}{\left(2+5+7+8+9\right)\left(9+6+5+1\right)}\times \left(1-\frac{\left|{I}_{1}\cap {I}_{2}\right|}{\left|{I}_{1}\cup {I}_{2}\right|}\right)=1-\frac{12\times 5}{31\times 21}\left(1-\frac{3}{6}\right)\cong 0.95.$$

This implies:

$$\mathrm{QTIJ}\left({u}_{1},{u}_{2}\right)=\mathrm{quasiTFJ}\left({u}_{1},{u}_{2}\right)\times \mathrm{quasiIDFJ}\left({u}_{1},{u}_{2}\right)=0.23\times 0.95\cong 0.22.$$
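The worked example above can be reproduced with the following sketch of Eqs. 4–5.2, using the same dictionary convention as before (function names are ours):

```python
def jaccard(u1, u2):
    """Jaccard coefficient over the sets of rated items."""
    i1, i2 = set(u1), set(u2)
    return len(i1 & i2) / len(i1 | i2)

def qtij(u1, u2):
    """Eqs. 4-5.2: Jaccard-combined quasi-TF-IDF."""
    i1, i2 = set(u1), set(u2)
    common, denom = i1 & i2, sum(u1.values()) * sum(u2.values())
    j = jaccard(u1, u2)
    quasi_tfj = (sum(u1[x] for x in common) * sum(u2[x] for x in common) * j) / denom              # Eq. 5.1
    quasi_idfj = 1 - (sum(u1[x] for x in i1 - i2) * sum(u2[x] for x in i2 - i1) * (1 - j)) / denom  # Eq. 5.2
    return quasi_tfj * quasi_idfj                                                                   # Eq. 5

u1 = {1: 2, 2: 5, 3: 7, 4: 8, 6: 9}   # r15 is missing
u2 = {1: 9, 4: 6, 5: 5, 6: 1}         # r22 and r23 are missing
print(round(qtij(u1, u2), 2))          # ~0.22, matching the worked example
```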

However, the Jaccard measure, although a powerful similarity measure, concentrates on the rating status (whether items were rated) rather than the magnitudes of the rating values, unlike cosine, for instance. So, we have also tried to resolve this limitation by putting forth the so-called numerical proximity similarity measure (NPSM). NPSM maintains the benefit of the Jaccard measure while focusing on the magnitudes of rating values. In other words, NPSM incorporates both the cardinalities of the item sets and the sums of the rating values. Equation 6 provides the NPSM formulation.

$$\mathrm{NPSM }\left({u}_{1},{u}_{2}\right)=\frac{\left|{I}_{1}\cap {I}_{2}\right|\sum_{j\in {I}_{1}\cap {I}_{2}}{r}_{1j}{r}_{2j}}{\left|{I}_{1}\right|\sum_{j\in {I}_{1}}{r}_{1j}+\left|{I}_{2}\right|\sum_{j\in {I}_{2}}{r}_{2j}}.$$
(6)

Observe that |I1 ∩ I2| represents the total number of items rated by both users 1 and 2, |I1| represents the total number of items rated by user 1, and |I2| represents the total number of items rated by user 2. NPSM can be viewed as an advanced variant of the cosine measure with Jaccard support. As such, we are also motivated to show that NPSM differs greatly from the proposed cosine-Jaccard combination, which is described in Eq. 7, as follows:

$$\mathrm{COSINEJ}\left({u}_{1},{u}_{2}\right)=\mathrm{cosine}\left({u}_{1},{u}_{2}\right)\times \mathrm{Jaccard}\left({u}_{1},{u}_{2}\right)=\frac{\sum_{j\in {I}_{1}\cap {I}_{2}}{r}_{1j}{r}_{2j}}{\sqrt{\sum_{j\in {I}_{1}}{\left({r}_{1j}\right)}^{2}}\sqrt{\sum_{j\in {I}_{2}}{\left({r}_{2j}\right)}^{2}}}\times \frac{\left|{I}_{1}\cap {I}_{2}\right|}{\left|{I}_{1}\cup {I}_{2}\right|}$$
(7)

Our experimental study has shown that, in numerous cases, NPSM is significantly more straightforward and effective than both Jaccard and our proposed COSINEJ. To illustrate how NPSM works, consider the same two rating vectors u1 = (r11 = 2, r12 = 5, r13 = 7, r14 = 8, r15 = ?, r16 = 9) and u2 = (r21 = 9, r22 = ?, r23 = ?, r24 = 6, r25 = 5, r26 = 1). With I1 = {1, 2, 3, 4, 6} and I2 = {1, 4, 5, 6}, NPSM is computed as follows:

$$\mathrm{NPSM }\left({u}_{1},{u}_{2}\right)=\frac{\left|{I}_{1}\cap {I}_{2}\right|\left(2\times 9+8\times 6+9\times 1\right)}{\left|{I}_{1}\right|\left(2+5+7+8+9\right)+\left|{I}_{2}\right|\left(9+6+5+1\right)}=\frac{3\times 75}{5\times 31+4\times 21}\cong 0.94.$$
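The following sketch covers Eq. 6 (NPSM) and Eq. 7 (COSINEJ) under the same dictionary convention and reproduces the example above (function names are ours):

```python
import math

def npsm(u1, u2):
    """Eq. 6: the co-rated dot product weighted by |I1 ∩ I2|, normalized by the
    cardinality-weighted sums of each user's ratings."""
    common = set(u1) & set(u2)
    num = len(common) * sum(u1[j] * u2[j] for j in common)
    den = len(u1) * sum(u1.values()) + len(u2) * sum(u2.values())
    return num / den

def cosinej(u1, u2):
    """Eq. 7: cosine over co-rated items multiplied by the Jaccard coefficient."""
    i1, i2 = set(u1), set(u2)
    common = i1 & i2
    cos = sum(u1[j] * u2[j] for j in common) / (
        math.sqrt(sum(v * v for v in u1.values())) * math.sqrt(sum(v * v for v in u2.values())))
    return cos * len(common) / len(i1 | i2)

u1 = {1: 2, 2: 5, 3: 7, 4: 8, 6: 9}
u2 = {1: 9, 4: 6, 5: 5, 6: 1}
print(round(npsm(u1, u2), 2))  # ~0.94, matching the example above
```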

In conclusion, we propose the measure families QTI, QTIJ, and NPSM in this research. The accuracy of similarity is further enhanced by NPSM, which implicitly takes into account the impact of rating quantity. Moreover, the NPSM equation has been designed to adjust prediction outcomes by taking into account more rating data. Every vector in the user-based rating matrix is, by default, a user rating vector. Therefore, the equations for these measures are not altered semantically when the user-based rating matrix is transformed into an item-based rating matrix, in which every vector is an item rating vector. According to experimental findings on benchmark datasets, our proposed measures—in particular, NPSM—have competitive prediction and recommendation quality and successfully address the data sparseness issue when compared to existing representative measures.

3.3 The Evaluation Mechanism

Each dataset used in our experiments (implemented in Java) was divided into five folds, each containing a training set and a testing set. The training set and testing set in the same fold are independent. The proportion of the testing set to the overall dataset is controlled by the testing parameter r, which ranges from 0.1 to 0.9. Figure 2 shows the strategy that we adopted for our experimental study. In the case of Movielens-100 K, for example, with r = 0.1 the training set has 90,000 ratings, while the testing set has 10,000 ratings (10% × 100,000), indicating that the testing set covers 10% of the dataset. In our experimental setting, the parameter r takes nine different values: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9. We observed that the smaller r is, the larger the training set and the more accurate the measures are.

It is important to note that the r parameter is used to test how well all similarity measures, including the ones we present, perform while taking into account how much training data sparsity they can handle. Therefore, the more values of r for which a similarity measure X surpasses its rivals, the more likely it is that measure X is robustly applicable to sparse datasets. This method of testing has allowed us to show that some of the similarity measures we propose to solve the cold-start issue are indeed effective (see Fig. 2).

Fig. 2

Experimental design scheme
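To make the splitting protocol concrete, the following is an illustrative Python sketch of a single train/test split controlled by r (the paper's experiments were implemented in Java); the function name, seeding, and data layout are our assumptions.

```python
import random

def split_ratings(ratings, r, seed=0):
    """Hold out a fraction r of the (user, item, rating) triples as the testing set;
    the remaining triples form the training set. For Movielens-100 K and r = 0.1,
    this yields 90,000 training ratings and 10,000 testing ratings."""
    rng = random.Random(seed)
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * r)
    return shuffled[cut:], shuffled[:cut]  # (training set, testing set)

# One possible reading of the protocol: repeating the split with five different seeds
# gives five folds whose results are averaged, as in Tables 1-6.
```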

We used both datasets—Movielens-100 K and Film Trust—to investigate how the item-based model affects CF performance, and our proposed similarity measures proved promising on both. Assuming that the active item has k neighbors, Eq. 8 computes the missing value raj as follows:

$${r}_{aj}={\overline{r} }_{a}+\frac{\sum_{i=1}^{k}\left({r}_{ij}-{\overline{r} }_{i}\right)\mathrm{sim}\left({r}_{a},{r}_{i}\right)}{\sum_{i=1}^{k}\left|\mathrm{sim}\left({r}_{a},{r}_{i}\right)\right|}$$
(8)

where \({\overline{r} }_{a}\) and \({\overline{r} }_{i}\) are, respectively, the mean values of \({r}_{a}\) and \({r}_{i}\). Tables 1, 2, 3, 4, 5 and 6 report, for each tested measure, the performance averaged across the 5 folds for every value of r = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9 under the associated evaluation metric. The last column of each table gives the results of the relevant measure averaged over all values of r, with the five best values highlighted in grey shading.
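As a concrete illustration of Eq. 8, the sketch below computes a prediction from the k most similar items; the function name and signature are ours (not the paper's), and it assumes the neighbor similarities and the target user's ratings of the neighbor items have already been collected.

```python
def predict_rating(active_item_mean, neighbor_ratings, neighbor_means, sims):
    """Eq. 8: mean-centred, similarity-weighted prediction for the active item.
    neighbor_ratings[i] is the target user's rating of neighbor item i,
    neighbor_means[i] is neighbor item i's mean rating, and
    sims[i] is the similarity between the active item and neighbor item i."""
    num = sum((r - m) * s for r, m, s in zip(neighbor_ratings, neighbor_means, sims))
    den = sum(abs(s) for s in sims)
    return active_item_mean if den == 0 else active_item_mean + num / den
```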

3.4 The Rival Similarity Measures

Most of the rival similarity measures included in this study are publicly accessible through our GitHub repository (see footnote 1).

3.5 Evaluation Metrics

We have used the standard metrics for CF algorithms, namely the mean absolute error (MAE) (Eq. 9), the mean squared error (MSE) (Eq. 10), and the R metric (Eq. 11), which analyzes the correlation between the prediction vector v′ and the true vector v.

$$\mathrm{MAE}=\frac{1}{n}\sum_{j=1}^{n}\left|{v}_{j}{^\prime}-{v}_{j}\right|$$
(9)

The smaller the MAE, the more effectively the approach operates. The number n represents the number of predicted items, and vj and vj′ stand for the true and predicted ratings of item j, respectively. On the other hand, given the prediction vector v′ and the true (tested) vector v, the MSE is determined as follows:

$$\mathrm{MSE}=\frac{1}{n}\sum_{j=1}^{n}{\left({v}_{j}{^\prime}-{v}_{j}\right)}^{2}$$
(10)

When the MSE is lower, the algorithm performs better and the measurement is more accurate. Finally, the accuracy metric R is shown in Eq. 11. The measure is better if R is higher.

$$R=\frac{\sum_{j=1}^{n}\left({v}_{j}{^\prime}-\overline{{v }{^\prime}}\right)\left({v}_{j}-\overline{v }\right)}{\sqrt{\sum_{j=1}^{n}{\left({v}_{j}{^\prime}-\overline{{v }{^\prime}}\right)}^{2}}\sqrt{\sum_{j=1}^{n}{\left({v}_{j}-\overline{v }\right)}^{2}}}$$
(11)

where \(\overline{{v }{^\prime}}\) and \(\overline{v }\) are the mean values of the predicted and tested (true) ratings, respectively:

$$\overline{{v }{^\prime}}=\frac{1}{n}\sum_{j=1}^{n}{v}_{j}{^\prime}$$
$$\overline{v }=\frac{1}{n}\sum_{j=1}^{n}{v}_{j}$$
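For completeness, the three metrics of Eqs. 9–11 can be sketched as follows (function names are ours; pred and true are the predicted and tested rating lists):

```python
import math

def mae(pred, true):   # Eq. 9: mean absolute error
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def mse(pred, true):   # Eq. 10: mean squared error
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)

def r_metric(pred, true):  # Eq. 11: correlation between predicted and true ratings
    mp, mt = sum(pred) / len(pred), sum(true) / len(true)
    num = sum((p - mp) * (t - mt) for p, t in zip(pred, true))
    den = (math.sqrt(sum((p - mp) ** 2 for p in pred))
           * math.sqrt(sum((t - mt) ** 2 for t in true)))
    return num / den
```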

Overall, the three metrics, namely the MAE, MSE, and R, enable a precise assessment of the algorithms. A similarity measure is preferred when its MAE and MSE values are lower and its R value is higher.

4 Results and Discussion

Tables 1, 2, 3, 4, 5 and 6 display the results of each similarity measure for the MAE, MSE, and R metrics over both datasets. According to experimental findings on these benchmark datasets, our proposed measures—in particular, NPSM—have competitive prediction and recommendation quality and successfully address the data sparseness issue when compared to existing representative measures.

According to Table 1, the top six similarity measures, in order of averaged MAE, are: NPSM, NHSM, CosineJ, HSMDJ, TMJ, and QTIJ. On the other hand, the first three similarity measures, in order of performance across the r parameter (Figs. 3, 4), are: NPSM, QTIJ, and NHSM. The r parameter specifies the sparsity level of the dataset. We can note that the proposed measures are more stable and better than all considered measures when the sparsity level of the data is very high (the 10–90, 20–80, 30–70, and 40–60 splits, and even 50–50). These three measures (QTIJ, NPSM, and NHSM) also showed competitive averaged performance against the other rivals, as seen in Fig. 4.

Table 1 MAE results—Movielens-100 K
Fig. 3

Top six similarity measures based on MAE—Movielens-100 K

Fig. 4

Top six similarity measures on average—Movielens-100 K

According to Table 2, the top six similarity measures, in order of averaged MSE, are: NPSM, NHSM, CosineJ, HSMDJ, TMJ, and Jaccard. On the other hand, the first three similarity measures, in order of performance across the r parameter (Figs. 5, 6), are: NPSM, CosineJ, and NHSM. We can note that the proposed measures are more stable and better than all considered measures when the sparsity level of the data is very high (the 10–90, 20–80, 30–70, and 40–60 splits, and even 50–50). These measures also showed competitive averages against the other rivals, as seen in Fig. 6.

Table 2 MSE results—Movielens-100 K
Fig. 5

Top six similarity measures based on MSE—Movielens-100 K

Fig. 6

Top six similarity measures on average—Movielens-100 K

According to Table 3, the top six similarity measures, in order of averaged R, are: NHSM, Jaccard, NPSM, HSMDJ, QTI, and QTIJ. On the other hand, the first three similarity measures, in order of performance across the r parameter (Figs. 7, 8), are: Jaccard, NPSM, and NHSM. These measures also showed competitive averaged performance against the other rivals, as seen in Fig. 8.

Table 3 R results—Movielens-100 K
Fig. 7

Top six similarity measures based on R—Movielens-100 K

Fig. 8

Top six similarity measures on average—Movielens-100 K

According to Table 4, the top six similarity measures, in order of averaged MAE, are: Mu, NPSM, NHSM, TMJ, CosineJ, and CjacMD. On the other hand, the first three similarity measures, in order of performance across the r parameter (Figs. 9, 10), are: NPSM, Mu, and CosineJ.

Table 4 MAE results—Film Trust
Fig. 9

Top five similarity measures based on MAE—Film Trust

Fig. 10

Top five similarity measures on average—Film Trust

According to Table 5, the top five similarity measures, in order of averaged MSE, are: Mu, NPSM, CjacMD, WPC, and TMJ. On the other hand, the first three similarity measures, in order of performance across the r parameter (Figs. 11, 12), are: Mu, NPSM, and CjacMD.

Table 5 MSE results—Film Trust
Fig. 11

Top five similarity measures based on MSE—Film Trust

Fig. 12

Top five similarity measures on average—Film Trust

Finally, according to Table 6, the top five similarity measures, in order of averaged R, are: CosineJ, QTIJ, QuasiTFJ, NPSM, and NHSM. On the other hand, the first three similarity measures, in order of performance across the r parameter (Figs. 13, 14), are: QTIJ, QuasiTFJ, and CosineJ.

Table 6 R results—Film Trust
Fig. 13

Top five similarity measures based on R—Film Trust

Fig. 14

Top five similarity measures on average—Film Trust

To sum up, it is worth stressing that the CF literature has been producing similarity measures built on several notions to adequately represent the concept of similarity between users/items. Like our proposed measures in this article, some of these measures take into account a pair of users' preferences for a given item when computing similarity; to do this, we looked at how strongly a pair of users liked or disliked a certain item relative to the majority. To verify the effectiveness of all similarity measures at varied rates of training and testing data, several scenarios/cases (r values) were addressed. According to the experimental findings, currently available similarity measures do not adequately account for the true similarity of items. Additionally, the results showed that the computation of similarity over the majority of these cases improved significantly with the proposed similarity measures.

4.1 Similarity Measure Ranking

The following formula is proposed to rank the similarity measures (SM) according to how well each performed across both datasets (the formula can be expanded to cover N datasets):

$$\mathrm{Rank}\left(\mathrm{SM}\right)= \frac{\sum_{i=1}^{N}\sum_{j=1}^{M}{P}_{ij}\left(\mathrm{SM}\right)}{M \times N}$$
(12)

where \({P}_{ij}\left(\mathrm{SM}\right)\) counts one when the SM satisfies the desired performance regarding evaluation metric j on dataset i, and zero otherwise. N and M, respectively, stand for the number of datasets and the number of evaluation metrics (see Table 7).

Table 7 Similarity measure ranking

The Top-Rank I (where I ∈ {1, 2, 3}) indicates the number of evaluation metrics on which the similarity measure achieved first (respectively, second or third) place in performance (MAE, MSE, and R) over both datasets. The statistics in Table 7 lead us to the conclusion that NPSM, NHSM, and CosineJ performed best on average across both datasets, making these similarity measures quite promising, with NPSM being the top one.
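Under the reading of Eq. 12 given above, a minimal sketch of the ranking score is shown below; the indicator dictionary, the top-three criterion, and all names are illustrative assumptions rather than the paper's implementation.

```python
def rank_score(hits, n_datasets, n_metrics):
    """Eq. 12 (one reading): hits[(dataset, metric)] is 1 when the similarity measure
    satisfies the desired performance (e.g., a top-three placing) for that dataset/metric
    pair and 0 otherwise; the score is the fraction of pairs on which it succeeds."""
    return sum(hits.values()) / (n_datasets * n_metrics)

# Example: 2 datasets x 3 metrics (MAE, MSE, R); the measure hits the target on 5 of 6 pairs.
hits = {("ML-100K", "MAE"): 1, ("ML-100K", "MSE"): 1, ("ML-100K", "R"): 1,
        ("FilmTrust", "MAE"): 1, ("FilmTrust", "MSE"): 1, ("FilmTrust", "R"): 0}
print(rank_score(hits, 2, 3))  # ~0.83
```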

4.2 Work Limitations

While we admit that some recent research on model-based recommender systems has focused on stochastic gradient descent [24, 25] and deep learning methods [26], it is worth stressing that nearest neighbor and similarity measures are still highly preferred in industry [2, 6,7,8] and, due to their simplicity, are still widely employed in commercial recommender systems [16, 22]—a fact that motivates us to keep introducing such measures. In other words, online businesses most frequently use memory-based RS because it is the most effective and the most straightforward to deploy. The scope of this paper is therefore restricted to demonstrating the vital impact of similarity measures, including the proposed and combined ones, on memory-based collaborative filtering performance using the item-based model with kNN. The results do imply that further research into nearest neighbor-driven memory-based collaborative filtering approaches is worthwhile. It is also worth noting that the proposed similarity measures can still be effectively leveraged, with high performance, in other similarity measure-based fields (e.g., information retrieval and natural language processing).

5 Conclusions and Future Work

In this study, we employed an item-based model to experimentally examine the effectiveness of collaborative filtering. Dozens of similarity measures were taken into account to conduct this investigation thoroughly. In addition, we have developed five new similarity measures of simple design compared to their rivals. Our proposed "simple" similarity measures have proved to be promising in comparison with their contenders, including the most recent ones. Additionally, it is clear from the outcomes of the MAE, MSE, and R metrics that Jaccard played a significant role in the discovery of highly competitive measures such as the Jaccard-fused ones (QTIJ and CosineJ). Surprisingly, when an item-based model is considered, some of the top SOTA performers (i.e., SMD and NHSM) have not been able to compete significantly against their rivals. In other words, we observe that the top-performing SMD and NHSM [2] are not superior when the item-based model is used. It remains to be confirmed whether the proposed measures of this study can compete as effectively against SMD and NHSM using the user-based model, since all experiments in the current work were conducted using the item-based model only. We believe that one important factor in the effective performance of our proposed measures is their straightforward design, which constructively takes into account both the magnitude and the presence of rating values. Nevertheless, we think the performance of CF using these measures still needs to be further enhanced to achieve the best results using both item-based and user-based models. In conclusion, according to experimental findings on benchmark datasets, our proposed measures—in particular, NPSM—had competitive prediction and recommendation quality and successfully addressed the data sparseness issue when compared to existing representative measures.

Therefore, future research will focus on adopting more novel similarity measures and carefully examining how each measure, including the Jaccard combinations, affects the performance of CF, using both user-based and item-based models. In particular, future work seeks to introduce an improved version of SMD to address NHSM's and SMD's shortcomings when the item-based model is taken into account. The main objective of this work was to efficiently use item- and user-based models to reduce the impact of the sparsity problem on CF performance. Including as many similarity measures as possible, especially the most recent ones [27,28,29], in the comparative study is another objective of future work. Moreover, we plan to integrate our proposed measures with multi-criteria ratings [3] and deep learning [30] for preference learning, and to apply these models to several CF-related applications such as aware graph neural networks [31], deep learning [32], and fuzzy-based recommender systems [33].