Abstract
This paper proposes a hybrid model, named IC+SP, to improve Information Content (IC) related metrics of semantic similarity between words. It is based on the essential hypothesis that IC and the shortest path are two relatively independent pieces of semantic evidence with approximately equal influence on the semantic similarity metric. The paradigm of IC+SP is to linearly combine an IC-related metric with the shortest path. Meanwhile, a transformation from the semantic similarity of concepts to that of words is presented by maximizing every component of IC+SP. 13 improved IC-related metrics based on IC+SP are formed and implemented on the experimental platform HESML Lastra-Díaz (Inf Syst 66:97–118, 2017). To evaluate IC+SP, Pearson’s and Spearman’s correlation coefficients on well-accepted benchmarks for the improved metrics are compared with those for the original ones. Besides the T-Test, which requires a normal distribution hypothesis on samples of small size, I introduce the Wilcoxon Signed-Rank Test, which needs no such hypothesis. Both tests are conducted on the differences of the correlation coefficients for the improved and original metrics. It is expected that the improved IC-related metrics significantly outperform their corresponding original ones, and the experimental results, including comparisons of the mean and maximum of the correlation coefficients as well as the p-values and confidence intervals of both tests, confirm this expectation in the vast majority of cases.
Introduction
In the area of Artificial Intelligence (AI) and numerous scientific fields, it is always challenging to improve the cognitive skills of machines to automate various complex tasks [2].Footnote 1 One of the keys to this problem is to computationally assess semantic similarity/relatedness between things in the ways human beings do. In fact, in the last decades, numerous researchers from various fields have studied and developed many different approaches to estimate semantic similarity/relatedness between words. These approaches have been applied to facilitate various natural language processing (NLP) and information retrieval (IR) tasks [3] such as document categorization or clustering [4], word sense disambiguation [5], query interpretation [6, 7], information extraction [8], etc. On some occasions, semantic similarity and semantic relatedness are used interchangeably, as both of them can refer to semantic likeness. However, in most works in the literature, as well as in this work, the concept of semantic similarity is more specific than semantic relatedness [9]. Generally speaking, the semantic similarity between concepts only states how taxonomically (is-a relationship) near they are, while their semantic relatedness includes any other relations between them besides taxonomy, such as meronymy, antonymy, cause-effect, etc. For example, “car” is similar to “truck”, but is also highly related to “wheel” and “road”, which are not highly similar to “car”. Because the semantic relationships implied in semantic similarity and relatedness are not the same, the computational strategies used to assess them are normally different. The corpus-based semantic metrics mainly rely on the distributional hypothesis that the statistical co-occurrence between word contexts in a corpus reflects the degree of semantic relationship between words [2].
As co-occurrence can arise from various semantic relations, it is considered that corpus-based semantic metrics might confuse semantic similarity with relatedness [10] and are, to some extent, better suited to estimating semantic relatedness. On the other hand, ontology-based strategies have proved to be the most successful semantic similarity metric approaches [11], for ontology-based semantic metrics are formed depending on “is-a” taxonomies which imply the degree of semantic similarity between concepts. However, since semantic graphs or taxonomies can be constructed from public corpora, e.g., Strube and Ponzetto [12] introduced WikiRelate, a taxonomy formed from the pages of Wikipedia, the standard ontology-based semantic metrics can be combined with corpus-based ones to form hybrid models [13,14,15]. Furthermore, in recent years, deep learning architectures have promoted hybrid models [16, 17]. This work concentrates on ontology-based semantic metrics, not only to improve them but also to benefit the study of hybrid semantic metrics. With the emergence of BERT [17], semantic similarity research has made significant progress. However, there is still much room for improvement, judging by the results of improved BERT models tested on standard text datasets [18,19,20,21]. The semantic similarity between words is widely used in the construction of these models; thus, this work will also benefit semantic similarity research on sentences, texts, etc.
A large number of ontology-based semantic similarity metrics between words have been proposed and applied in various domains in recent years, but robust semantic similarity assessment between words remains a challenging task. In most of the related works, ontology-based semantic similarity metrics are divided into three main categories as follows:
-
(1)
Feature-based metrics. The essential idea of these metrics, originating from the feature model proposed by Tversky [22], is to regard the feature set of a concept as the core semantic evidence. The common features of two concepts reflect their degree of semantic similarity, while the differing features imply their degree of dissimilarity.
-
(2)
Path-based or edge-counting metrics. The original model was proposed by Rada et al. [23] and developed in the later works of Wu and Palmer [24], Leacock and Chodorow [25], Li et al. [26] and so on. The key idea of this category is to take the length of the shortest path between concepts as the main semantic evidence.
-
(3)
IC-based metrics. The classic models of this category came from [9, 27, 28]. Their essential idea is to take the IC of a concept as the main semantic evidence. Since the intrinsic IC (iIC) was proposed in the pioneering work of Seco et al. [29], the definition of IC has been the principal point of IC-based metric research.
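To make this category concrete, the following sketch illustrates two standard formulas from the literature: the intrinsic IC of Seco et al. [29] and the IC-based similarity of Lin [28]. The inputs (hyponym counts, IC values) are hypothetical placeholders, not values from this work's experiments.

```python
import math

def seco_iic(num_hyponyms: int, max_nodes: int) -> float:
    """Intrinsic IC of Seco et al. [29]: a concept with many hyponyms
    carries little information; a leaf concept (0 hyponyms) has IC 1."""
    return 1.0 - math.log(num_hyponyms + 1) / math.log(max_nodes)

def lin_similarity(ic_a: float, ic_b: float, ic_lcs: float) -> float:
    """Lin's similarity [28]: ratio of the shared information (IC of the
    lowest common subsumer) to the total information of both concepts."""
    return 2.0 * ic_lcs / (ic_a + ic_b) if ic_a + ic_b > 0 else 0.0
```

For instance, a leaf concept gets the maximal iIC of 1, and two concepts whose information is entirely shared get a Lin similarity of 1.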
Apart from the aforementioned three categories of ontology-based semantic similarity metrics, many researchers nowadays try to take advantage of more than one of these types of semantic evidence to propose new hybrid semantic metrics. The shortest path and the IC are the most commonly combined semantic evidences [26, 30,31,32,33]. Furthermore, since the shortest path and features are usually utilized in the definitions of almost all iICs, an iIC can also be regarded as a hybrid metric model. In the work of Seco et al. [29], the number of descendants of a concept, which can be considered its feature, is utilized to define the iIC. In the later iICs proposed by Zhou et al. [34], Sebti and Barfroush [35], Sánchez et al. [36], Yuan et al. [37], Hadj Taieb [38] and Zhu and Iglesias [33], the taxonomic features of the concept, e.g., the descendant set, the ancestor set, the subsumed leaves, the entities, are normally used together with the concept’s depth. During the last decade, the iIC has become one of the essential points of semantic metric research. This research tendency accords with hybrid metrics having become one of the research mainstreams, because their core ideas, i.e., trying to take advantage of more types of semantic evidence, can be regarded as the same. Lastra-Díaz and García-Serrano [39] constructed a series of well-founded iICs by extracting different taxonomic features based on two axioms.
During the last decade, hybrid ontology-based semantic similarity metrics (including iIC models) have become the mainstream in the area. Most of the state-of-the-art ontology-based metrics took the IC and the shortest path as the most influential semantic evidences. However, the state-of-the-art metrics either extracted taxonomic features to build new IC models or nonlinearly combined these two types of semantic evidence into complicated computational algorithms for semantic similarity. These newly proposed metrics generally share a problem: it is usually difficult, even impossible, to explain the proposed iICs and the complicated algorithms. Furthermore, the tuning parameters used in some of the metrics make the explanation problem worse, as Pirró [40] and Sánchez et al. [41] pointed out that tuning parameters can exceedingly influence the measurement results obtained from the methods proposed in [26]. Therefore, the main motive of this work is to construct an improved model of IC-related semantic similarity metrics which remains interpretable while keeping a simple structure. Moreover, most of the state-of-the-art ontology-based semantic similarity metrics proposed in the literature just reported their raw results without statistical testing. In the studies that do conduct statistical tests, the T-Test is commonly performed on samples of small size. However, the T-Test on a small sample generally hypothesizes a normal distribution of the sample, which is not tested in those studies. Therefore, the other motive of this work is to make extensive statistical tests on the correlation coefficients for different IC-related metrics, including not only the T-Test but also a nonparametric test, the Wilcoxon Signed-Rank Test, which does not require the hypothesis of a normal distribution of the sample.
In this work, I reviewed in detail the classical and state-of-the-art ontology-based semantic similarity metrics of recent years. Based on the IC-related semantic similarity metrics, a hybrid ontology-based model denoted as IC+SP is proposed, linearly combining the IC-related semantic similarity metric and the shortest path. I developed the improved IC-related semantic similarity metrics based on IC+SP on the platform of HESML [1]. WordNet 3.0 is applied as the ontology to measure the semantic similarity of concept pairs. Furthermore, 5 gold benchmarks (including 65 pairs of nouns from Rubenstein and Goodenough [42], 28 pairs of nouns from Miller and Charles [43], 201 pairs of nouns from WordSim353 [44], 65 pairs of nouns from Pirró [40] and 665 pairs of nouns from SimLex-999 [45]) are used as the experimental datasets to compare the performance of the improved IC-related metrics based on the IC+SP model with that of their corresponding original ones. Pearson’s and Spearman’s correlation coefficients between computed and human-judged semantic similarities are both used to compare the IC-related metrics with their derived improved versions formed on the IC+SP model. There are four principal quantitative indicators for comparison in this work, i.e., the average and maximum values of the raw correlation coefficients as well as the p-value and confidence interval of the statistical tests on the raw correlations. The T-Test is commonly applied as a standard test in the related literature, so it is also utilized in this work. However, the set of correlation coefficients generally does not accord with a normal distribution. Therefore, the Wilcoxon Signed-Rank Test is introduced in this work, for it is a nonparametric test needing no hypothesis of a normal distribution. Based on these comparative indicators, it can be confirmed whether the improved IC-related metric based on the IC+SP model statistically significantly outperforms its original one or not.
The remainder of the paper is organized as follows. The “Ontology-based semantic similarity metrics” section reviews the state-of-the-art ontology-based semantic similarity metrics, mainly focusing on the IC-related ones and the iICs. The “A hybrid IC-related model IC+SP” section presents the improved model, IC+SP. The “Evaluation” section introduces the experimental data and method as well as reports and analyzes the experimental results. The final section summarizes the paper and outlines future work.
Ontology-based semantic similarity metrics
Ontology-based approaches to measuring semantic similarity between concepts are commonly divided into three categories: feature-based approaches, path-based or edge-counting approaches, and IC-based approaches. This classification is commonly based on the different types of semantic evidence. The state-of-the-art strategies in this research field generally combine different types of semantic evidence to make hybrid semantic similarity metrics, or design new iIC models of concepts to be utilized in IC-related metrics. For the latest and detailed survey, readers can refer to Chapter 3 of the survey book of Harispe et al. [2]. This work focuses on IC-related semantic similarity metrics; therefore, the feature-based and path-based families of approaches are only briefly introduced, while the IC-related category is reviewed in detail.
Feature-based and edge-counting approaches
Feature-based approaches have been introduced into the community of semantic similarity research since the feature model was proposed by Tversky [22]. Generally, the strategy of the feature model is to represent a concept as the set of its features. Thus, the semantic similarity between a pair of concepts is assessed using a function of the sets of their common features and non-common ones. If A and B are the feature sets of the concepts a and b respectively, the semantic similarity of a and b, \(si{m_{Tversky}}\), can be defined as a function (F) of A and B as follows:
In Eq. (1), the function F has two main concrete formulations which are the contrast model (\(si{m_{CM}}\)) and the ratio model (\(si{m_{RM}}\)).
In an ontology, one popular paradigm is to take the ancestors of a concept as its features. Maedche and Staab [46] as well as Sánchez et al. [41] constructed their feature-based metrics based on this intuitive strategy. Apart from the ancestors, the synsets, attributes (features) and neighbor concepts of a concept can also be considered its features. For example, Rodriguez and Egenhofer [47] computed the semantic similarity using the weighted sum of different types of feature similarities between two concepts; for each sort of the aforementioned features, one feature similarity can be calculated using the ratio model. In the work of Petrakis et al. [48], a piecewise function is proposed to assess the semantic similarity in which three different representations of the features, i.e., synsets, glosses and neighbor concepts, are utilized. In the work of D’Amato [49], the instances of a concept can be considered its features, if they are available. Currently, pure feature-based approaches are rarely proposed, but the idea of the feature model can be found in IC-based and hybrid semantic similarity metrics. For instance, from the standpoint of the feature model, the metrics proposed by Lin [28] and Jiang and Conrath [
A hybrid IC-related model IC+SP
The state-of-the-art hybrid ontology-based semantic similarity metrics are normally defined as a function whose variables contain the IC and the shortest path [26, 30, 31, 33, 34, 39]. These hybrid metrics consider that both the IC and the shortest path are valuable semantic evidences for assessing the semantic similarity between concepts. However, the ways of combining IC and the shortest path defined in the state-of-the-art hybrid IC-related metrics commonly lack rigorous theoretical support and mainly rely on the empirical judgment of the researchers. For instance, with regard to the metric of Zhu and Iglesias [33], the authors stated: “the IC of two concepts’ LCS is used to weight their shortest path length so that those concept pairs having same path length can have different semantic similarity score if they have different LCS”. Thus, the combination of IC and the shortest path in their metric is for avoiding identical semantic similarity scores for concept pairs having the same path length. But in theory, they did not explain why the exponential function of the IC of two concepts’ LCS can be used to weight their shortest path. There is also a lack of theoretical analysis of the way of combining IC and the shortest path in other hybrid IC-related metrics. Furthermore, the experimental survey demonstrated that the hybrid IC-related metrics did not statistically significantly outperform each other and most of the non-hybrid IC-related ones on well-accepted benchmarks. In other words, either the IC-related metrics cannot be improved by including the shortest path, or IC and the shortest path are combined in the wrong ways in the state-of-the-art hybrid IC-related metrics. Herein, a hybrid model, IC+SP, is proposed, linearly combining two semantic similarity components which are constructed based on the IC and the shortest path respectively.
This new hybrid model is based on the essential but simple hypothesis that the IC and the shortest path are two relatively independent semantic evidences with approximately equal influence on the semantic similarity metric. Thus, this work argues that linearly combining the IC-related metrics and the shortest path is a better strategy than the known nonlinear combinations of them.
Shortest path component
Rada et al. [23] and Pedersen et al. [59] respectively defined edge-counting semantic similarity metrics with simple structures based on only the shortest path between the concepts. Herein, the shortest path component of IC+SP is defined also only based on the shortest path but in a more intuitive way. Let a and b be two concepts in a taxonomic graph, the shortest path component of IC+SP, denoted as \(si{m_{sp}}\), is defined as follows:
where SP(a, b) is the shortest path length of the concepts a and b in a taxonomic graph.
It is not difficult to figure out that \(si{m_{sp}}\) has the same value when SP(a, b) is equal to 0 or 1 according to Eq. (4). Thus, with the taxonomy of WordNet, we can figure out that \(si{m_{sp}}(``cat",``cat")=si{m_{sp}}(``cat",``feline")=1\). To some extent, the results of \(si{m_{sp}}\) do not accord with human cognition, as the semantic similarity between “cat” and “cat” should not be equal to that between “cat” and “feline”. However, \(si{m_{sp}}\) is only one part of IC+SP. As the IC component of IC+SP, \(si{m_{ic}}\), ensures that \(si{m_{ic}}(``cat",``cat")>si{m_{ic}}(``cat",``feline")\), IC+SP as a whole is not inconsistent with human cognition.
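Equation (4) itself is not reproduced in this excerpt, but the stated property (\(si{m_{sp}}\) equals 1 when SP(a, b) is 0 or 1 and decreases as the path grows) is consistent with, e.g., an inverse-path form. The sketch below is an illustrative guess under that assumption, not necessarily the paper's exact Eq. (4).

```python
def sim_sp(shortest_path: int) -> float:
    """Hypothetical shortest-path component: equals 1 when the path
    length is 0 or 1, and decreases as the path grows, matching the
    property discussed in the text (the exact Eq. (4) may differ)."""
    return 1.0 / max(shortest_path, 1)

# SP("cat", "cat") = 0 and SP("cat", "feline") = 1 give the same value
assert sim_sp(0) == sim_sp(1) == 1.0
```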
Assessing the semantic similarity
The IC component of IC+SP could be any one of IC-related metrics. Thus, IC+SP model denoted as \(si{m_{ic + sp}}\), could be defined as follows:
where a and b are two concepts and \(si{m_{ic}}\) and \(si{m_{sp}}\) are the IC and shortest path components respectively. For instance, the improved Lin based on IC+SP can be denoted as Lin+SP. The semantic similarity between a and b is then calculated by the formula \(si{m_{Lin + sp}}(a,b) = si{m_{Lin}}(a,b) + si{m_{sp}}(a,b)\).
According to the psychological study [22], humans generally tend to pay attention to word similarities rather than their differences. Thus, the semantic similarity between two words can be defined as the maximal similarity score between the concepts that are the meanings of the two words respectively. Let \(C_1\) and \(C_2\) be the sets of the words \(w_1\) and \(w_2\)’s concepts respectively; thus, the semantic similarity between words \(w_1\) and \(w_2\), denoted as \(WordSi{m_{ic + sp}}({w_1},{w_2})\), is defined as follows:
where \(\mathop {\max }\) is a function that represents the maximum value of \(si{m_{ic}}+si{m_{sp}}\) calculated by different pairs of concepts, and \(c_1\) and \(c_2\) are the elements of the sets \(C_1\) and \(C_2\) respectively.
With regard to the similarity judgment of two words, as the IC and shortest path are regarded as two independent semantic evidences to evaluate the similarity between two words, herein, another transformation from the semantic similarity values of the concepts into that of the words could be defined as follows:
where the parameters have the same meanings as those in Eq. (6).
Equation (7) is the standard form of the IC+SP model. Let \(w_1\) and \(w_2\) be two words whose concept sets are \(C_1\) and \(C_2\) respectively. According to the IC+SP model, the semantic similarity value of \(w_1\) and \(w_2\) is calculated as the sum of two components, \(\mathop {\max }\nolimits _{\begin{array}{c} \scriptstyle {c_1} \in {C_1}\\ \scriptstyle {c_2} \in {C_2} \end{array}} (si{m_{ic}}({c_1},{c_2}))\) and \(\mathop {\max }\nolimits _{\begin{array}{c} \scriptstyle {c_1} \in {C_1}\\ \scriptstyle {c_2} \in {C_2} \end{array}} (si{m_{sp}}({c_1},{c_2}))\). The former is the maximal semantic similarity value of two concepts taken from \(C_1\) and \(C_2\) respectively, measured by the IC-related metric; the latter is the maximal semantic similarity value of two such concepts measured by the SP metric (Eq. (4)).
For most word pairs, the semantic similarities calculated by Eqs. (6) and (7) get the same values. However, on some occasions, Eq. (7) gets a higher similarity value than Eq. (6), since the sum of two maxima is never smaller than the maximum of the sums. Thus, Eq. (7) truly attains the maximum value, which better accords with the psychological finding [22] that humans generally tend to pay attention to word similarities rather than their differences. Herein, Eq. (7) is used to transform the semantic similarity between concepts into that between words.
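A minimal sketch of the two word-level transformations, using hypothetical precomputed concept-level scores; it also illustrates how Eq. (7) can exceed Eq. (6) when the two maxima come from different concept pairs.

```python
from itertools import product

def word_sim_eq6(C1, C2, sim_ic, sim_sp):
    """Eq. (6): maximize the sum sim_ic + sim_sp over all concept pairs."""
    return max(sim_ic[c1, c2] + sim_sp[c1, c2] for c1, c2 in product(C1, C2))

def word_sim_eq7(C1, C2, sim_ic, sim_sp):
    """Eq. (7): maximize each component separately, then add the maxima."""
    pairs = list(product(C1, C2))
    return max(sim_ic[p] for p in pairs) + max(sim_sp[p] for p in pairs)

# Hypothetical concept-level similarities for two words with two senses each
sim_ic = {("c1", "d1"): 0.9, ("c1", "d2"): 0.2,
          ("c2", "d1"): 0.3, ("c2", "d2"): 0.6}
sim_sp = {("c1", "d1"): 0.4, ("c1", "d2"): 0.8,
          ("c2", "d1"): 0.5, ("c2", "d2"): 0.7}

eq6 = word_sim_eq6(["c1", "c2"], ["d1", "d2"], sim_ic, sim_sp)  # 0.9 + 0.4 = 1.3
eq7 = word_sim_eq7(["c1", "c2"], ["d1", "d2"], sim_ic, sim_sp)  # 0.9 + 0.8 = 1.7
```

Here the best IC pair is ("c1", "d1") while the best SP pair is ("c1", "d2"), so Eq. (7) yields the larger value.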
Evaluation
The goal of the experiments in this section is to evaluate the proposed hybrid model, IC+SP. To guarantee the reliability of the experiments, HESML, developed by Lastra-Díaz and García-Serrano [1], is used to replicate their experiments. These experimental results are utilized as the baseline to evaluate IC+SP. Meanwhile, the improved IC-related metrics based on IC+SP are developed on the platform of HESML. Next, the experimental datasets, evaluation method and experimental results are presented.
Experimental platform and comparison baseline
To the best of my knowledge, the Java software library HESML, developed by Lastra-Díaz and García-Serrano, has most comprehensively implemented the known ontology-based semantic similarity metrics and IC models, especially iIC models. It is easy to replicate the state-of-the-art metrics based on HESML and to implement newly improved ones by extending the library. Therefore, HESML is used herein as the experimental platform. First, the original semantic similarities and their correlation coefficients on the experimental benchmarks are obtained by replicating the classic and state-of-the-art IC-related metrics in Tables 1 and 2, where each of the 25 IC models in Tables 3 and 4 is used in the replications of the 13 IC-related metrics. The raw correlation coefficients obtained in this work are consistent with those reported by Lastra-Díaz and García-Serrano [1]. Thus, these correlation coefficients of the replication experiments are used as the comparison baseline to evaluate the proposed hybrid model IC+SP. Secondly, 13 improved IC-related metrics based on the IC+SP model, each of which contains one of the IC-related metrics in Tables 1 and 2, are implemented by extending HESML. The correlation coefficients of the semantic similarities on the experimental benchmarks for the 13 improved IC-related metrics are obtained and reported herein. Moreover, Python 3 is used as the platform to conduct statistical tests of the correlation coefficients and to visualize the analysis results.
Confidence intervals of the T-Test on the differences of the paired correlation coefficients for all improved IC-related metrics and their corresponding original ones combining each IC on all benchmarks at the 0.95 confidence level. (The coordinates of the x-axis from 1 to 25 respectively refer to the IC of Resnik, Seco et al., Zhou et al., Blanchard et al., Sánchez et al., Sánchez and Batet, Meng et al., Yuan et al., Hadj Taieb et al., Adhikari et al. in Table 3 and CondProbHypo, CondProbUniform, CondProbLeaves, CondProbCosine, CondProbCorpus, CondProbRefHyponym, CondProbRefUniform, CondProbRefLeaves, CondProbRefLogistic, CondProbRefCosine, CondProbRefCorpus, CondProbRefLogisticLeaves, CondProbRefCosineLeaves, CondProbRefLeavesSubsumersRatio in Table 4.)
Confidence intervals of the Wilcoxon Signed-Rank Test on the differences of the paired correlation coefficients for all improved IC-related metrics and their corresponding original ones combining each IC on all benchmarks at the 0.95 confidence level. (The coordinates of the x-axis from 1 to 25 respectively refer to the IC of Resnik, Seco et al., Zhou et al., Blanchard et al., Sánchez et al., Sánchez and Batet, Meng et al., Yuan et al., Hadj Taieb et al., Adhikari et al. in Table 3 and CondProbHypo, CondProbUniform, CondProbLeaves, CondProbCosine, CondProbCorpus, CondProbRefHyponym, CondProbRefUniform, CondProbRefLeaves, CondProbRefLogistic, CondProbRefCosine, CondProbRefCorpus, CondProbRefLogisticLeaves, CondProbRefCosineLeaves, CondProbRefLeavesSubsumersRatio in Table 4.)
Experimental datasets
Several datasets, in which the semantic similarity of each word pair is judged by human beings, have been applied as benchmarks to evaluate computational algorithms. After comparison, the most widely used datasets, validated by many studies, are selected as the experimental datasets. In this work, five well-accepted benchmarks are collected and used to evaluate IC+SP, which are listed as follows:
-
1.
RG65. The first created benchmark, RG65, was designed by Rubenstein and Goodenough [42] in 1965. It might be the most used dataset for evaluating the measurements of semantic similarity of the word pairs. The benchmark consists of 65 noun pairs whose semantic similarities were judged on a scale from 0 (semantically unrelated) to 4 (highly synonymous) by a group of 51 students who are all native English speakers.
-
2.
MC28. In 1991, Miller and Charles [43] selected 30 word pairs, the MC30 benchmark, from RG65 and redesigned an experiment to assess the similarity of these word pairs by a group of 38 undergraduate students. The correlation between RG65 and this benchmark is up to 0.97. As the RG65 and MC30 benchmarks have been validated by many studies over the years, they have become the standard tests to evaluate the accuracy of semantic similarity results. However, to maintain consistency with the dataset of the comparative baseline used in Lastra-Díaz and García-Serrano [11], only 28 word pairs of MC30 are selected in this work, denoted as MC28.
-
3.
WS201. Developed by Gabrilovich [44], WordSim353, containing 353 word pairs, is a commonly used test dataset for measuring the similarity and relatedness of word pairs. Furthermore, the authors identified two different subsets of the benchmark, i.e., WS353sim containing 203 word pairs and WS353rel consisting of 252 word pairs, suitable for evaluating semantic similarity and semantic relatedness respectively. WS353sim includes 201 noun pairs and 2 verb pairs. To unify the evaluation standard to the noun collection in WordNet, I use WS201, the collection of the 201 noun pairs in WS353sim, as the benchmark for evaluating the measurement of semantic similarity.
-
4.
PS65. Pirró [40] replicated the human judgment experiment on the dataset RG65, and \( P \& {S_{full}}\) contains the modern semantic similarity results on RG65. The semantic similarities of all word pairs were judged by a group of 101 participants composed of 76 native and 25 non-native English speakers. Herein \( P \& {S_{full}}\) is used, denoted as PS65.
-
5.
SL665. SimLex-999 [45] is another gold standard dataset for measuring the semantic similarity of concept pairs rather than their relatedness. The benchmark has 999 word pairs, divided into 666 noun pairs, 222 verb pairs and 111 adjective pairs, whose semantic similarities were judged by at least 36 native speakers on a scale between 0.0 (low similarity) and 10.0 (high similarity). Herein, the collection of 665 noun pairs of the SL999 noun subset, denoted as SL665, is used in the experiments for the purpose of maintaining consistency with the dataset of the comparative baseline in Lastra-Díaz and García-Serrano [11].
Evaluation methods
The correlation coefficient between the semantic similarity scores obtained by computational metrics and those of human judgment is the well-accepted indicator used to evaluate a semantic similarity metric in the related literature. Both Pearson’s and Spearman’s correlation coefficients have been widely used, while most researchers apply only one of them. In most rank-based selection tasks such as NLP and IR, Spearman’s correlation coefficient is considered better for evaluating the metrics. Herein, the advantages and disadvantages of the two types of correlation coefficients are not analyzed; both of them, denoted as r and \(\rho \), are used to evaluate the metrics. As each of the 25 IC models can be applied to any IC-related metric, every IC-related metric yields 25 correlation coefficients (r or \(\rho \)) on each benchmark. The average and maximum values of these correlation coefficients fairly and comprehensively reflect the performance of a metric. Thus, they are used as the direct comparison indicators in this work.
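As a sketch, both coefficients can be computed with SciPy; the two score lists below are hypothetical, not data from the benchmarks.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical scores: human judgments vs. a metric's computed similarities
human  = [3.92, 3.84, 3.05, 1.18, 0.42]
metric = [0.95, 0.90, 0.71, 0.30, 0.12]

r, _ = pearsonr(human, metric)     # Pearson's r: degree of linear agreement
rho, _ = spearmanr(human, metric)  # Spearman's rho: degree of rank agreement
```

Because the metric here preserves the ranking of the human scores exactly, \(\rho \) reaches its maximum even though r stays below 1.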
In order to evaluate the improved IC-related metrics based on the IC+SP model more rigorously, two types of statistical tests, the T-Test and the Wilcoxon Signed-Rank Test [60], are conducted on the correlation results herein. The T-Test is widely utilized as a standard test in most of the related literature, so it is used in this work too. However, the T-Test on a sample of small size hypothesizes a normal distribution of the sample, which is not tested in most studies. Indeed, according to the Kolmogorov–Smirnov (KS) Test [61] on the correlation coefficients of each IC-related metric, they do not accord with a normal distribution in most cases. Therefore, the Wilcoxon Signed-Rank Test is introduced and conducted on the correlation results, as it does not require the normal distribution hypothesis. For each statistical test, the p-value and the confidence interval at the 0.95 confidence level are applied to reflect whether the performance of a metric is statistically significantly better or worse than the comparative one.
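A minimal sketch of this testing pipeline with SciPy, on a hypothetical pair of correlation samples (one value per IC model; the actual experiments use 25 ICs and 5 benchmarks, and the KS Test rather than the Shapiro-Wilk normality check shown here):

```python
from scipy.stats import ttest_rel, wilcoxon, shapiro

# Hypothetical Pearson correlations for one metric, paired by IC model
original = [0.71, 0.74, 0.69, 0.72, 0.70, 0.73, 0.68, 0.75]
improved = [0.74, 0.76, 0.72, 0.74, 0.73, 0.75, 0.71, 0.77]

# Paired T-Test: assumes the differences are normally distributed
t_stat, t_p = ttest_rel(improved, original)

# Wilcoxon Signed-Rank Test: nonparametric, no normality assumption
w_stat, w_p = wilcoxon(improved, original)

# The normality assumption itself can be checked on the differences
diffs = [i - o for i, o in zip(improved, original)]
s_stat, s_p = shapiro(diffs)
```

A p-value below 0.05, together with a confidence interval excluding zero, indicates a statistically significant improvement.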
Results
For each IC-related metric evaluated by Lastra-Díaz and García-Serrano [11] in Tables 1 and 2, a corresponding improved IC-related metric is generated based on IC+SP. For example, for the metric of Lin [28], denoted as Lin, its corresponding metric based on IC+SP, denoted as Lin+SP, is easily constructed as Lin plus SP according to formulas (5) and (7). Thus, 13 improved IC-related metrics based on IC+SP are built, each of which pairs with an original IC-related one. The reported results of the original IC-related metrics are used as the comparative baseline of this work.
Each improved IC-related metric can be combined with any IC to assess the semantic similarity of word pairs. Thus, for each benchmark dataset, the Pearson’s (r) and Spearman’s (\(\rho \)) correlation coefficients for each improved IC-related metric combined with different ICs are calculated. Each of Tables 5, 6, 7, 8, 9, 10, 11, 12, 13, and 14 lists the correlation coefficients (r or \(\rho \)) for the 13 improved IC-related metrics combined with different ICs on a specific benchmark dataset. The last two rows of each table show the average and maximum correlation values of each column respectively. These tables have the same structure as the correlation results for the original IC-related metrics reported by Lastra-Díaz and García-Serrano [11]. Readers are recommended to refer to those reported correlation results to verify the results reported in this work.
For each improved IC-related metric, combining it with different ICs yields a collection of correlation coefficients (r or \(\rho \)) on a benchmark dataset, which can be paired with those obtained by its corresponding original metric. In Table 15, each cell shows the p-value of the T-Test on the set of paired correlation coefficients for a single benchmark dataset or for all datasets (last two rows). The p-values of the Wilcoxon Signed-Rank Test on the same sets of paired correlation coefficients are listed in Table 16. In addition to the p-values, the confidence intervals of both tests on the differences between the paired correlation coefficients for every improved IC-related metric and its corresponding original one on all benchmark datasets, at the 0.95 confidence level, are plotted in Figs. 1 and 2. The confidence intervals of the differences of the paired correlation coefficients on each dataset are plotted in Figs. 7, 8, 9, 10, 11, 12, 13, 14, 15 and 16 in the appendix. Furthermore, the confidence intervals of both tests on the paired correlation coefficients for every improved and original IC-related metric combined with an IC on all benchmarks are plotted in Figs. 3 and 4.
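A 0.95 confidence interval for the mean difference of paired correlation coefficients can be computed from the t distribution as sketched below (the Wilcoxon-based interval is built from the signed-rank statistic instead and is not shown here). The difference values are illustrative, not taken from the paper's tables.

```python
import numpy as np
from scipy import stats

# Hypothetical differences: improved minus original correlation coefficient,
# one value per IC/benchmark configuration.
diffs = np.array([0.012, 0.020, 0.008, 0.015, 0.011, 0.017, 0.009, 0.014])

n = diffs.size
mean = diffs.mean()
se = diffs.std(ddof=1) / np.sqrt(n)          # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)        # two-sided 95% critical value
lower, upper = mean - t_crit * se, mean + t_crit * se

# A lower bound above zero indicates that the improved metric's correlations
# are significantly higher at the 0.95 confidence level.
print(lower, upper)
```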
Analysis and discussion
This work hypothesizes that the IC+SP model improves the performance of an IC-related metric by adding the SP component to it. Thus, the correlation coefficients for the improved IC-related metrics are compared with those for their corresponding original IC-related ones. Figure 5 shows the means of the correlation coefficients for every pair of IC-related metrics (original and improved) combined with all ICs on all benchmark datasets. Clearly, the average correlation coefficient, whether r or \(\rho \), for each improved IC-related metric is higher than that for the original one. For Pearson's correlation (r), Meng12 and coswJC achieve the best average values among the original and improved metrics, 0.767 and 0.785 respectively, while the average correlations of the improved Lin, FaITH, Meng12, cosJC, Lis9, Meng14, Gao and coswJC are all higher than that of the original Meng12. For Spearman's correlation (\(\rho \)), Meng14 and coswJC achieve the best average values among the original and improved metrics, 0.767 and 0.785 respectively, and the average correlations of the improved Lin, FaITH, Meng12, cosJC, Meng14 and coswJC are all higher than that of the original Meng14. Besides the average correlation coefficient, the maximum correlation is another significant indicator for evaluating the metrics. As five benchmark datasets are used in the evaluation, the average correlation coefficient over the five datasets characterizes the performance of an IC-related metric combined with a given IC. Thus, by combining different ICs, an original or improved IC-related metric obtains a collection of average correlations over the five datasets, the maximum of which is used for comparison. The maximum of the average correlation coefficients for every improved IC-related metric and its corresponding original one, each combined with one IC, on all benchmarks is plotted in Fig. 6.
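Each entry in these comparisons is obtained by correlating a metric's scores with the human gold-standard ratings of a benchmark, as sketched below with made-up values (the ratings and scores are illustrative, not from any of the actual datasets).

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical benchmark: human similarity ratings for six word pairs,
# alongside the similarity scores produced by some metric.
human_ratings = [3.92, 3.05, 1.18, 0.42, 2.69, 3.46]
metric_scores = [0.88, 0.71, 0.25, 0.10, 0.60, 0.79]

# Pearson's r measures linear agreement with the ratings; Spearman's rho
# measures agreement of the rankings only.
r, _ = pearsonr(human_ratings, metric_scores)
rho, _ = spearmanr(human_ratings, metric_scores)
print(r, rho)
```

Here the metric ranks all six pairs in the same order as the human ratings, so \(\rho = 1\), while r stays below 1 because the relationship is not perfectly linear.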
For the Pearson's correlation coefficient (r), most of the improved IC-related metrics outperform their corresponding original ones, with Garla and Zhou as the exceptions; for the Spearman's correlation coefficient (\(\rho \)), Zhou and Gao are the two exceptions. Furthermore, the maxima of the Pearson's correlation coefficients (r) of the improved Lin, JC, PS, FaITH, Meng12, cosJC and coswJC are all higher than that of the original Zhou, which is the best of the original metrics. For each benchmark dataset, the improved IC-related metrics also perform better than the original ones in the vast majority of cases in terms of both the mean and the maximum of the correlation coefficients. Meanwhile, the best correlation coefficients for the original metrics on each benchmark dataset are improved upon, except for SL665. Readers may make further analyses from the raw data in the appendix.
To some degree, these simple statistics show that the improved IC-related metrics outperform the original ones; however, such results lack rigorous statistical testing. For a more rigorous evaluation, the T-Test, as well as the Wilcoxon Signed-Rank Test, is conducted on the correlation results. In Tables 15 and 16, most of the p-values are less than 0.05, which means the differences of the paired correlation coefficients for the improved and original IC-related metrics are statistically significant at the 0.95 confidence level. From the last two rows of Table 15, the differences of the paired correlation coefficients on all benchmark datasets are statistically significant according to the T-Test. Meanwhile, for the Wilcoxon Signed-Rank Test on the differences of the paired correlations on all datasets, only the improved Meng14 and Gao do not statistically significantly outperform their original versions, according to the last two rows of Table 16. The improved Lin, PS, FaITH and Meng12 significantly outperform their originals on every single benchmark under either the T-Test or the Wilcoxon Signed-Rank Test, and the improved cosJC and coswJC do so as well under the Wilcoxon Signed-Rank Test. The improved JC and Zhou achieve this on most of the benchmarks, except for MC28 (\(\rho \)), under both tests. Moreover, according to the p-values of both tests on the differences of the paired correlations on all benchmarks, all improved IC-related metrics (on r), and all except Meng14 and Gao (on \(\rho \)), outperform the original ones at the 0.99 confidence level.
Compared with the p-value, a confidence interval at a given level provides more information. The confidence intervals illustrated in Figs. 1 and 2 also tell whether a test result is statistically significant, as the p-values do: a lower bound of a confidence interval above zero means the differences of the paired correlation coefficients for the improved IC-related metric and the original one are statistically significant. In general, comparing Figs. 1 and 2, there are few differences between the confidence intervals of the T-Test and the Wilcoxon Signed-Rank Test on the same type of correlation coefficient (r or \(\rho \)) for a metric. The reason both tests yield approximately the same results may be that the T-Test on a large sample does not require a strict normal distribution assumption, and the sample size for each metric on all benchmark datasets is 125, which is larger than 30. Thus, in this case, the confidence intervals of both the T-Test and the Wilcoxon Signed-Rank Test validly reflect that the improved IC-related metrics generated by IC+SP statistically significantly improve the correlation coefficients (r or \(\rho \)) of their corresponding original metrics. Furthermore, for most IC-related metrics, the improved versions raise more Pearson's correlation coefficients (r) than Spearman's correlations (\(\rho \)); Lin, FaITH, Meng12, cosJC and Lis9 are exceptions under the T-Test, and only Meng12, cosJC and Lis9 under the Wilcoxon Signed-Rank Test. Moreover, for each IC, the correlation coefficients for all IC-related metrics can be paired with those for the improved ones, and Figs. 3 and 4 plot the confidence intervals of both tests on these paired correlations grouped by IC; intuitively, the improved IC-related metrics have statistically significantly raised the correlation coefficients (r or \(\rho \)) no matter which test is performed.
In addition, the confidence intervals of both tests on the paired correlation coefficients for each single benchmark, grouped by metric, are shown in Figs. 7 to 16. Because the sample size (25) of the correlation coefficients on a single benchmark is smaller than that (125) over all benchmarks, the confidence intervals of both tests on a single benchmark are less stable than those over all benchmarks. Nevertheless, most of the confidence intervals show that the improved IC-related metrics yield statistically significantly higher correlation coefficients than their originals. Readers may compare Figs. 1, 2, 3 and 4 with Figs. 7, 8, 9, 10, 11, 12, 13, 14, 15 and 16 to observe the differences caused by the sample sizes.
Conclusion and future work
Assessing the semantic similarity of word pairs is a fundamental component of various applications in artificial intelligence and information science. This paper presents an improved model of IC-related metrics, IC+SP, based on the essential hypothesis that IC and the shortest path are two relatively independent pieces of semantic evidence with approximately equal influence on semantic similarity. Accordingly, the strategy of IC+SP is to linearly combine the IC-related metric with the shortest path. It was expected that the IC-related metrics would be improved by the IC+SP model, and the experimental results fulfill this expectation. In terms of simple indicators, the average correlation coefficient for every improved IC-related metric on all benchmark datasets is higher than that of the corresponding original one, and the maximum correlation coefficients yield approximately the same result, with only two exceptions. For a more rigorous evaluation, the T-Test as well as the Wilcoxon Signed-Rank Test is conducted on the differences of the paired correlation coefficients for every improved and original IC-related metric combined with different ICs. The results of the T-Test indicate that the correlation coefficients for every improved IC-related metric are statistically significantly higher than those for the corresponding original one. The results of the Wilcoxon Signed-Rank Test support the same conclusion, with only the Spearman's correlation coefficients for the improved Meng14 and Gao as exceptions. Thus, it can be concluded that the proposed model, IC+SP, statistically advances the IC-related semantic similarity metrics between words.
This work makes three main contributions to the field. First, a linear model combining an IC-related metric and the shortest path is proposed and proves to be valid. Second, the Wilcoxon Signed-Rank Test is introduced to test the differences of the paired correlation coefficients, since it requires no normal distribution assumption, an assumption with which the correlation coefficients commonly do not comply. Third, by maximizing every component of IC+SP, a transformation from the semantic similarity of concepts to that of words is presented.
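The third contribution can be sketched as follows: each component of IC+SP (the IC-related similarity and the shortest-path similarity) is maximized separately over all concept (sense) pairs of the two words before being combined. The sense labels, similarity tables and the equal weight alpha below are illustrative assumptions, not the paper's actual data.

```python
from itertools import product

def word_ic_sp(senses_w1, senses_w2, ic_sim, sp_sim, alpha=0.5):
    """Word-level IC+SP: maximize each component over all sense pairs,
    then combine the two maxima linearly."""
    pairs = list(product(senses_w1, senses_w2))
    best_ic = max(ic_sim(c1, c2) for c1, c2 in pairs)  # maximize IC component
    best_sp = max(sp_sim(c1, c2) for c1, c2 in pairs)  # maximize SP component
    return alpha * best_ic + (1.0 - alpha) * best_sp

# Toy precomputed concept-level similarities for two senses of "car"
# against one sense of "truck".
ic_table = {("car.1", "truck.1"): 0.80, ("car.2", "truck.1"): 0.30}
sp_table = {("car.1", "truck.1"): 0.70, ("car.2", "truck.1"): 0.90}

sim = word_ic_sp(["car.1", "car.2"], ["truck.1"],
                 lambda a, b: ic_table[(a, b)],
                 lambda a, b: sp_table[(a, b)])
```

Note that the two maxima may come from different sense pairs (here, best_ic from car.1 and best_sp from car.2), so this per-component maximization can exceed the best combined score of any single sense pair.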
The semantic similarities for Res, JC, PS, Meng12 and Zhou are not normalized, i.e., their values are not confined to the range [0, 1]. Thus, their improved versions are not truly linear combinations of the IC-related metric and the shortest path, and evaluating the linear combination of the normalized versions of these semantic similarity metrics with the shortest path will be a research goal. Furthermore, this work focuses only on semantic similarity metrics; however, semantic relatedness is needed to automate more complicated tasks in various applications, so extending this study of semantic similarity to the semantic relatedness domain is a future research direction. In addition, as more public Knowledge Graphs (KGs) have become available, containing more concepts and still constantly growing, applying IC+SP to Knowledge Graphs is another research direction.