1 Introduction

Knowledge graphs have become the important cornerstone of the research and development of artificial intelligence technology. In recent years, the scale of knowledge graphs has increased at an unprecedented rate, and data processing with millions of vertices (106) and hundreds of millions of edges (108) has become commonplace [1]. Therefore, it is necessary to consider how to perform distributed query processing to cope with the growing demand for knowledge graphs.

In the Semantic Web community, the Resource Description Framework (RDF) has become a de-facto standard format for knowledge graphs and has been extensively applied [1]. An RDF dataset consists of a set of triples 〈spo〉 and can be transformed into a graph where the resources denoted by s and o are vertices, and the attributes denoted by p are labeled edges. SPARQL [2] is the standard graph query language on RDF graphs. A SPARQL query can be regarded as the subgraph homomorphism problem [3], which is recognized as an NP-complete problem [4, 5].

The efficient processing of subgraph matching queries over large-scale RDF graphs in a distributed setting is a challenging problem. Recently, the partial evaluation technique [6] has been applied to solve the problem of regular path queries on distributed environment [7,8,9]. The queries Q are partially evaluated in parallel to obtain partial results on each fragment of data Fi on each site Si, then all the partial results are transmitted to a master site. Finally, assemble these partial results to get the final results of Q.

Based on partial evaluation technique, the partial evaluation and assembly framework has been proposed to answer SPARQL queries [10]. However, a large number of partial results can be generated during the partial evaluation phase, making the assembly phase a computational bottleneck. To improve the efficiency of assembly phase, Peng et al. [11] proposed the LEC feature-based optimization strategy to prune some unpromising intermediate results. However, existing works only focus on the assembly phase of query processing, while ignoring the partial evaluation phase, and efficient index-based methods are not effectively utilized by the partial evaluation phase to speed up the search process. The following example demonstrates the drawback of computing partial matching results by the method in [11].

Example 1

As shown in Fig. 1a, given a distributed RDF graph G0, and a query Q = (?a, spouse, ?s) ∧ (?a, director, ?b) ∧ (?b, country, ?c) ∧ (?c, capital, ?d), the query engine traverses the dataset on Fi according to the query graph to obtain all the candidate sets of query variables. In fragment Fi, the candidate sets of query is denoted as \(B_{g_{i}}\), where the subscript gi is used to distinguish among different fragments. Based on the candidate generation strategy, the candidates of internal query nodes of query Q are \(B_{g_{1}}\) = {〈?a, (v2, v5)〉, 〈?b, (v3, v4, v21)〉, 〈?c, (v8)〉} and \(B_{g_{2}}\) = {〈?a, (v11, v15, v22)〉, 〈?b, (v13, v14)〉, 〈?c, (v9)〉}. The internal query nodes denote the query nodes that have more than one edge (the nodes ?s and ?d are not internal query nodes because there only exists one edge connected with them in query graph Q). The candidate vertices are in yellow, green, and blue, respectively, in Fig. 1a. When computing local partial matches on each site, each candidate vertex will start a graph exploration, so that the search process will iterate six times on the first site and six times on the second. To further explore, we can find that a graph exploration from either v2 or from v3 can obtain the local partial match (〈?s, v1〉, 〈?a, v2〉, 〈?b, v3〉, 〈?c, v9〉, 〈?d, v10〉) on F1. As a result, this strategy can generate a large number of repeated local partial results, leading to a degradation in the performance of partial evaluation.

Fig. 1
figure 1

A distributed RDF graph G0 with candidates of query Q0 colored. (a) Before filtering candidates, (b) After filtering candidates

To handle this problem, we propose an effective optimization strategy to accelerate the partial evaluation phase by a constructed index named inner boundary node index which exploits characteristics of partial evaluation and assembly framework (the formal definition will be explained in detail in Section 4). For the example shown in Fig. 1, the following candidate sets are generated after filtered by IBN-Index: \(B_{g_{1}}\) = {〈?a, NULL〉, 〈?b, (v3, v4, v21)〉, 〈?c, (v8)〉}, \(B_{g_{2}}\) = {〈?a, (v22)〉, 〈?b, (v13, v14)〉, 〈?c, (v9)〉}. These candidate vertices of ?a, ?b and ?c are colored yellow, green and blue respectively in Fig. 1b. The size of the whole candidate sets is much smaller than the candidates mentioned in Example 1. Furthermore, the local partial match (〈?s, v1〉, 〈?a, v2〉, 〈?b, v3〉, 〈?c, v9〉, 〈?d, v10〉) will be only generated once which is started from v3 on fragment F1.

Since the growing number of partial matches heavily influence the assembly stage, filtering out part of the partial results becomes an effective way to speed up the query. Therefore, we propose another method that utilizes constructed boundary node index (the formal definition will be explained in detail in Section 5) to filter out part of the false local partial matches in advance, reducing the cost of communication and centralized computation.

We summarize the contributions of the paper with the following three aspects:

  • Based on partial evaluation and assembly framework, we propose an inner boundary node index (IBN-Index) and a partial evaluation algorithm that utilizes IBN-Index to filter the candidate sets of query nodes, which can significantly speed up partial evaluation phase.

  • To reduce the number of local partial matches, the boundary node index (BN-Index) is constructed by exploiting the characteristics of local partial matches. Furthermore, a BN-Index-based filter algorithm is proposed.

  • Extensive experiments on benchmark datasets have been conducted to verify the efficiency and scalability of our method. The experimental results show that our method outperforms the state-of-the-art method.

The rest of this paper is organized as follows. Section 2 reviews the related work. In Section 3, we present the preliminaries and problem definition. An overview of the methods is depicted in Section 4. In Section 5 and Section 6, we propose the inner boundary node index and the boundary node index, with their corresponding algorithm, respectively. Finally, experimental evaluations are presented in Section 7 and we draw conclusions in Section 8.

2 Related work

Due to performance, confidentiality, and security factors [12, 13], the cluster-based distributed data management architecture has become the inevitable research trend to deal with the knowledge graph. In this section, we will review several distributed subgraph matching research on large-scale RDF graphs, which can be classified into two categories, including MapReduce-based graph systems, and specialized RDF systems. Furthermore, existing works on partial evaluation and assembly and graph indexing methods are summarized.

2.1 MapReduce-based graph systems

SHARD [14], a MapReduce-based triple store for RDF graphs, is able to process SPARQL queries, which decomposes the query graph into a set of triples (a triple containing variables) and binds variables to the vertices of the data graph by iterating on the triple patterns. Meanwhile, it is necessary to satisfy all the constraints in the query. For example, each round of the MapReduce operation adds only one query clause through the join operation. Likewise, the smallest decomposition unit of the query graph in HadoopRDF [15] is also the triple pattern, and it also utilizes the MapReduce framework to divide the RDF triples into multiple small files based on the predicate. However, both methods mentioned above ignore the structural information of the query graph, require multiple MapReduce iterations, and require a large number of join transactions, resulting in high query cost.

2.2 Specialized RDF systems

Trinity.RDF [16], a distributed in-memory key-value store, stores RDF data in the native form, with vertex identifiers as keys and adjacent lists of vertices as values. Trinity.RDF finds the optimal exploration plan and reduces the number of intermediate results using the graph exploration instead of join operations, while the final results need to be obtained using a single thread on the master node. In addition, systems based on partial evaluation and assembly framework have also been extensively and deeply studied in recent years.

The method of partial evaluation and assembly is first applied in distributed XML data management by Peter et al. [17]. The key idea is to transmit the whole query graph to each site that is partially evaluated in parallel, and after each node computes the partial results of the query, the results are transmitted as compact Boolean functions to the master node, which are combined to obtain the result. Fan et al. [7, 18, 19] subsequently propose a series of algorithms based on partial evaluation and assembly framework to deal with XQuery on distributed XML data, reachability query and graph simulation on distributed graph. Peng et al. [10, 11] design a subgraph matching query algorithm based on partial evaluation and assembly framework to process SPARQL queries on distributed RDF data and propose relevant optimization strategies.

However, the huge overhead caused by repeated partial matches during partial evaluation is not handled in aforementioned methods. Furthermore, when the number of intermediate results obtained in the local computation phase is extensive, the aforementioned methods may suffer from a performance bottleneck in the assembly phase.

2.3 Graph indexing

As a classic space-for-time strategy, graph indexing has been researched extensively and deeply in the past years. Graph indexing methods can be classified into value-based indexing and structure-based indexing.

Through value-based methods, a graph index is usually constructed on one or more properties of an entity. SB-Tree [20] is a variant of B-Tree, which has a better performance on dynamic data. Hexastore [21] and RDF-3X [22] index the RDF data in six possible ways. Based on S-Tree [23], gStore [24] proposed VS-Tree to prune the search space efficiently.

Structure-based methods focus on mining the features of a graph, such as a path, subtree, or other substructure, indexing them to filter the search space and accelerate the query process. Closure-Tree [25] proposed graph closure, which is a generalized graph that represents several graphs, and based on it to organize graphs as a tree. Both subgraph queries and similarity queries can benefit from this method. SPath [26] takes the shortest paths around the vertex neighborhood as the basic process unit and decomposes the query into a set of shortest paths to exploit its indexing. K-path-bisimulation [27] is a path-based index, the path with identical length and the same edge label sequence is divided into the same catalog, and the query is decomposed into a set of paths.

In this paper, the features of partial evaluation and assembly framework are exploited profoundly, and two kinds of indexes are constructed to compensate for the shortcomings of previous methods. Although the proposed index-based methods are value-based, the structural information of the graph is also included in the indexes.

3 Preliminaries

Let U and L be the disjoint infinite sets of URIs and literals. Then, a tuple in the form of 〈spo〉 ∈ U × U × (UL) is called an RDF triple, where s is the subject, p the predicate, and o the object. Given an RDF dataset as a finite set of triples, it can be converted to its corresponding RDF graph. In this paper, we focus on the problem of subgraph matching query over a distributed RDF graph. This section will present preliminaries for distributed RDF graphs and subgraph matching queries. Table 1 lists the notations frequently used in this paper.

Table 1 Frequently used notations

Definition 1 (RDF Graph)

Given an RDF dataset as a finite set of triples in the form 〈spo〉, its corresponding RDF graph is G = (V, E, Σ), where the set of vertices V is the union of all s and o. For each 〈spo〉, there is a directed edge eE from the vertex s to the vertex o, where p is the label of that edge e. Here, Σ is the set of all labels, i.e., Σ = {p ∣ 〈spo〉 ∈ G}.

Definition 2 (Distributed RDF Graph)

RDF graph G is partitioned into n disjoint ‘entity sets’ {\(\mathcal {E}_{1}\), ..., \(\mathcal {E}_{n}\)}, where each \(\mathcal {E}_{i}\) = (Vi, Ei, Σi). Here, (1) for each i ∈{1,...,n}, \(\mathcal {E}_{i}\) is a subset of G, where \(V_{i} \subseteq V\), \(E_{i} \subseteq E\), and \({\Sigma }_{i} \subseteq {\Sigma }\); (2) for each i, j ∈ {1,...,n} ∧ ij, there is \(\mathcal {E}_{i}\)\(\mathcal {E}_{j}\) = ; and (3) \(\bigcup ^{n}_{i=1}\) \(\mathcal {E}_{i}\)G.

To ensure the integrity and consistency of the RDF graph when partitioned in a distributed system, each computing node needs to store some copies of the edges that cross between different entity sets. Let the copy of the associated edges with other partition be denoted as \({N_{i}^{c}}\).

Definition 3 (Fragment)

Graph G is partitioned into n fragments \(\mathcal {F}\) = {F1, ..., Fn}, such that Fi = \(\mathcal {E}_{i} \cup {N_{i}^{c}}\). In other words, G can be considered as a distributed RDF graph w.r.t. \(\mathcal {F}\), such that:

  1. 1)

    For each \(\mathcal {E}_{i}\) = (Vi, Ei, Σi), Vi, Ei, and Σi represent the set of internal vertices, the set of edges, and the set of labels in Fi, respectively. Formally, Vi = {\(s \mid \langle s,\;p,\;o \rangle \in \mathcal {E}_{i}\)} ∪ {\(o \mid \langle s,p,o \rangle \in \mathcal {E}_{i}\)}, \(E_{i} \subseteq V_{i} \times V_{i}\), and Σi = {\(p \mid \langle s,p,o \rangle \in \mathcal {E}_{i}\)};

  2. 2)

    \({N_{i}^{c}}\) = (\({V_{i}^{e}}\), \({E_{i}^{c}}\), \({{\Sigma }_{i}^{c}}\)), where \({E_{i}^{c}}\) is the set of crossing edges between Fi and other fragments. If an internal vertex of Fi has a direct edge with any vertex v in Fj, where ij, then \(v\in {V_{i}^{e}}\). Formally, \({E_{i}^{c}} \subseteq V_{i} \times V_{j}\), \({{\Sigma }_{i}^{c}}\) = {\(p \mid \langle s, p, o \rangle \in {E_{i}^{c}}\) }, the set of boundary vertices between Fi and Fj is \({V_{i}^{e}}\) = {\(s \mid \langle s, p, o \rangle \in {E_{i}^{c}} \wedge s \in V_{j}\)} ∪ {\(o \mid \langle s,\;p,\;o \rangle \in {E_{i}^{c}} \wedge o \in V_{j}\)}, i, j = 1,2,...,nij;

Fig. 2
figure 2

A distributed RDF graph G1

Let \(\mathcal {S}\) = {S0, S1, \(\dots\), Sn} be a set of n + 1 computing nodes, i.e., sites, in a cluster. Without loss of generality, each fragment Fi is stored at a slave site Si for i ∈ {1, \(\dots\), n}.

Example 2

As shown in Fig. 2, given a distributed RDF graph G1 extracted from the DBpedia dataset, G1 can be divided into four parts \(\mathcal {F}\) = {F1, F2, F3, F4}, which are respectively stored on the corresponding sites {S1, S2, S3, S4} in the cluster. For a fragment F2 = \(\mathcal {E}_{2}\)\({N_{2}^{c}}\), the partition \(\mathcal {E}_{2}\) = (V2, E2, Σ2), and V2 = {v4, v5, v9, v10, v11, v12, v13, v14, v15}, E2 = {(v4, v5), (v9, v10), (v9, v14), (v14, v15), (v13, v11), (v11, v12)}. The copy between F2 and other fragments \({N_{2}^{c}}\) = (\({V_{2}^{e}}\), \({E_{2}^{c}}\), \({{\Sigma }_{2}^{c}}\)), where \({V_{2}^{e}}\) = {v3, v6, v17}, \({E_{2}^{c}}\) = {(v3, v9), (v5, v6), (v13, v17)}. In particular, we colored the incoming and outgoing vertices of fragment F2 in blue.

Given an RDF graph G and a query graph Q as a set of triple patterns, a subgraph matching problem is to find the subgraphs over G that satisfy all the triple patterns in Q. Such a subgraph matching problem is a conjunctive query (CQ) on G, which is the focus of this paper. In the following, we formally present the query graphs and the other necessary definitions adapted from [28].

A query graph includes m triple patterns 〈srpror〉, where the value of each sr, or can either be a member of V, or ‘not labeled’. If a sr or or is ‘not labeled’, sr or or belongs to a special set Var, and the name of each element in Var starts with the character ‘?’. Similarly, the value of each pr can either be a member of Σ, or Var.

Definition 4 (Query Graph)

Given an RDF graph G, a CQ Q over G is defined as: Q(z1, \(\dots\), zt)\(\gets \bigwedge _{1\le r \le m} tp_{r}\), where tpr = 〈srpror〉 is a triple pattern. sr,orVV ar, zl is a variable and zl ∈ {sr∣1 ≤ rm} ∪ {or∣1 ≤ rm}. A CQ Q is also referred to as a query graph.

Before defining subgraph matching, we recapitulate certain definitions of the map**. For a map** μ, dom(μ) is its domain. Two map**s μ1 and μ2 are compatible, i.e., μ1 \(\sim\) μ2, iff for every element vdom(μ1) ∩ dom(μ2), it holds that μ1(v) = μ2(v). Furthermore, the set-union of two compatible map**s, i.e., μ1μ2, is also a map**.

Definition 5 (Subgraph Matching)

The semantics of a CQ Q over an RDF graph G is defined as:

  1. 1)

    μ is a map** from the vertices in Q to the vertices in V, i.e., map** from \(\overline {s}\) = {s1,...sm} and \(\overline {o}\) = {o1,...om} to the vertices in V;

  2. 2)

    \((G, \mu ) \vDash Q\) iff 〈μ(sr), μ(pr), μ(or)〉∈ E and the labels of srpr and or are the same as that of μ(sr), μ(pr), and μ(or), respectively, if sr, prorVar;

  3. 3)

    PQ is the set of all results, where each result satisfies the subgraph matching query Q over G.

Problem statement

Consider a distributed RDF graph G, w.r.t., a fragmentation \(\mathcal {F}\) = {F1,...,Fn}, and let Fi stored in the cluster \(\mathcal {S}\) = {S0S1,...Sn}. For simplicity, we assume that each site Si hosts one fragment Fi. Given a query graph Q, the problem is to find all subgraph matching results PQ of Q in G.

Example 3

Given a CQ, Q = (?a, spouse, ?s) ∧ (?a, director, ?b) ∧ (?b, country, ?c) ∧ (?c, capital, ?d). Q consists of five query vertices and its semantic is to find the films directed by a person with his spouse, the film’s country and the capital of the country. The corresponding query graph is shown in Fig. 3, with one of the query results being highlighted in purple in Fig. 2.

Fig. 3
figure 3

A query graph Q with 5 query variables

4 Overview

The partial evaluation and assembly framework is extended to answer SPARQL queries over a distributed RDF graph G, as shown in Fig. 4. In the execution model, there are two phases: the partial evaluation phase and the assembly phase. In addition, two optimization strategies are designed and embedded in this framework.

Fig. 4
figure 4

Overview of the optimized partial evaluation and assembly method

Before the query starts, the entire RDF graph G is divided into multiple fragments according to a certain partitioning strategy, and an index named BN-Index is built on each fragment. The fragments and corresponding BN-Index are then transmitted to each site, and an index named IBN-Index is further constructed on each site locally. When querying, the master node sends the entire query graph to all slave nodes, and the subsequent partial evaluation phase can be summarized into three processes. (1) Each site Si first receives the complete query graph Q and finds all candidate sets of the query graph variables. (2) The query engine uses the IBN-Index to filter the candidates, and executes the graph exploration algorithm according to the filtered candidates to find local matches. (3) Finally, BN-Index is utilized to filter the local partial matches after graph exploration.

The local partial matches are then sent to the master site to compute the complete SPARQL matches, which is called the assembly stage. Benefiting from the filtering effect of BN-Index, the number of partial matches is drastically reduced, which alleviates the assembly bottleneck problem to a certain extent.

To better illustrate the effect of IBN-Index, it is necessary to explain the partial evaluation process of gStoreD in detail. After each site Si receives the complete query graph, the candidate set of each query variable is obtained according to the predicates connected with the variables. Specifically, the vertices in the candidate set can be classified into internal candidate vertices and boundary candidate vertices. The internal candidate vertices denote the vertices contained in the subgraph Fi allocated on Si, while the boundary candidate vertices denote those vertices connected with Fi but do not belong to it. After obtaining the candidate sets, all the sites transmit the internal candidate vertices to the master site, and the collection of all internal candidates is resent to all sites.

In order to find partial results on Fi, graph exploration starts with each internal candidate vertex of the internal query variables. Specifically, for a query graph, the query variables can be classified into internal query variables and satellite variables, depending on the edges connected with the query node. If a query node only has one edge, it is denoted as a satellite variable.

The reasons why we choose these special candidate vertices as the starting vertices are considered from two aspects. (1) A boundary vertex is also an internal vertex on another site simultaneously so that it will be set as a starting vertex on that site, and the path connected with it will not be lost. (2) If a partial match on Si only matches a satellite node, it will also be found on other sites. Take the partial match in purple on S1 in Fig. 2 as an example, the partial match will be found on S2 and connected with v6 as a complete SPARQL match. Therefore, only starting with internal query variables will not lose any partial results.

At each expansion step of graph exploration, query engine judges whether the matching node belongs to the collection of all the internal candidates. The graph exploration process will not stop until all maximal partial matches are obtained.

5 Inner boundary node-based algorithm

Recall that in partial evaluation and assembly framework, given a distributed RDF graph G, each site Si receives a part of the graph Fi and constructs IBN-Index according to the subgraph. Then, when answering query Q, each site Si computes local partial matches utilizing the constructed index. In this section, the structure of the IBN-Index is defined, and the IBN-Index construction algorithm is introduced. Then we present the subgraph matching algorithm utilizing IBN-Index on each site and give the complexity analysis of the proposed method.

5.1 Inner boundary node

The local partial match computation algorithm based on partial evaluation and assembly framework have been proposed in [10]. First, an in-depth analysis of the performance problem of existing work in computing local matches during partial evaluation is carried out, and on this basis, an optimization using the inner boundary node index is proposed.

When computing local partial matches, each internal candidate vertex of the candidate vertices sets starts the graph exploration to find the local partial matches. It should be noted that it is not necessary to traverse all internal nodes as the starting point of graph exploration. The reasons can be considered from the following two aspects.

(1) Intuitively, a complete SPARQL match only needs to be traversed from one vertex so that the candidate vertices in other candidate sets can be discarded. (2) Besides, as mentioned in [10], a local partial match is the overlap** part of an unknown crossing match and a fragment Fi. There must be a crossing edge derived from an internal vertex connected with other fragments. Therefore, only searching from the vertices connected with other fragments can get all the partial matches. These kinds of vertices are defined as inner boundary nodes, and the formal definition is given as follows:

Definition 6 (Inner Boundary Node.)

Given a distributed RDF graph Q and a fragmentation \(\mathcal {F}\) = {F1,...,Fn}. In fragment Fi = \(\mathcal {E}_{i} \cup {N_{i}^{c}}\), if an internal vertex v of Fi has a direct edge with any vertex u in Fj, where ij, then v is an inner boundary node.

Based on inner boundary nodes, the internal entity set Vi of fragment Fi can be divided into two mutually exclusive subsets, pure internal node set Pi and inner boundary node set Di. Formally, Vi = PiDi, where Di = {\(s \mid \langle s ,p ,o \rangle \in {E_{i}^{c}} \wedge o \in V_{j}\)} ∪ {\(o \mid \langle s ,p ,o \rangle \in {E_{i}^{c}} \wedge s \in V_{j}\)}, i,j = 1,2,...,nij. The definition of the inner boundary node index is presented as follows:

Definition 7 (Inner Boundary Node Index.)

Given a fragment Fi of RDF graph G, the inner boundary node index, i,e., IBN-Index of Fi is a key-value map IIBN where

  1. 1)

    for any tuple (vtag) ∈ IIBN, the key is a vertex vVi, and the value tag is a Boolean value denoting if v is an inner boundary node or not; if a vertex v is an inner boundary node, its corresponding tag will be set to boolean True, otherwise it will be set to False;

  2. 2)

    for any vertex vVi, there exists a unique tuple in IIBN with v as the key and a Boolean tag as the value.

Example 4

As shown in Fig. 2, on site S1, the inner vertices (inner boundary nodes) connected with vertices on other sites (S2 and S3) are v3, v6, and v8. And on site S2, the inner boundary nodes are v9v5, and v13. As a result, the inner boundary node index on site s1 is \(I_{1}^{IBN}\) = {(v3True), (v6True), (v8True), (v1False), (v2False), (v7False), (v30False), (v31False), (v31False), (v32False), (v33False), (v34False)}. And the IBN-Indexes on site S2 is \(I_{2}^{IBN}\) = {(v5True), (v9True), (v13True), (v4False), (v10False), (v11False), (v12False), (v14False), (v15False)}.

figure a

To improve space efficiency of IBN-Index, the strategy of dictionary encoding is adopted, such that each vertex is encoded into an integer by hash operation.

The construction approach of the IBN-Index is shown in Algorithm 1. First, the inner boundary node set and IBN-Index are initialized by the fragment identifier Fi (line 1). Then, for each triple 〈spo〉, if it is a crossing edge and the subject s (or object o) is an internal vertex, the s (or o) is put into the inner boundary node set Di (lines 2-6). For each node v in the internal node set Vi, if it is also an inner boundary node in Di, a map** (vTrue) will be put into \(I_{i}^{IBN}\); otherwise, a map** (vFalse) will be put into \(I_{i}^{IBN}\) (lines 7-11).

5.2 IBN-index based partial evaluation

In order to answer query Q, each site Si computes the local partial matches based on the known fragment Fi. The formal definition of local partial match was defined in [10]. Intuitively, a local partial match PMi is an overlap** part between a crossing match M and fragment Fi.

figure b

Algorithm 2 describes the local partial match computation process utilizing the IBN-Index. The key idea is to use the inner boundary node index to filter the candidate sets, thereby reducing the search space when finding local partial results. Since the result of partial evaluation can be divided into complete SPARQL matches and local partial matches, the correctness of Algorithm 2 can be considered from the following two aspects. (1) Since some complete SPARQL matches contain only pure internal vertices (e.g., the dashed partial match on S1 in Fig. 2), i.e. they do not have any inner boundary nodes, in order to ensure that all complete SPARQL matches are obtained, we keep a complete candidate set in which all vertices will start graph exploration (line 8). Here a greedy strategy is applied, selecting the candidate set with the smallest size as the reserved set (line 6). (2) In other candidate sets, only the candidate vertex judged as an inner boundary node will start the graph exploration (lines 13-19).

figure c

Example 5

We take site S1 as an example. As shown in Fig. 2, for query Q, the candidate sets of each internal query variables on site S1 are Bg1 = {〈?a, (v2, v31, \(v_{v_{8}}\))〉, 〈?b, (v3, v32)〉, 〈?c, (v33)〉}. Firstly, the candidate sets are sorted to find the candidate set with the smallest size and the set 〈?c, (v33)〉 is reserved. In the candidate set of ?a, the nodes v2 and v31 are not inner boundary nodes according to the IBN-Index, so they are all filtered out. As for the candidate set of ?b, v3 will not be filtered, while v32 will be discarded. The filtered set of candidate sets is Bg1 = {〈?a, (v8)〉, 〈?b, (v3)〉, 〈?c, (v33)〉}. As a result, graph exploration can find all partial matches on S1 starting only from v8, v3, and v33. Likewise, the candidate sets on other sites are also filtered out of a large number of candidate vertices using the same method.

Space complexity of IBN-Index

For each fragment Fi, each vertex corresponds to a tag indicating whether it is an inner boundary node. The extra space is bounded with \(O(|V_{i}| + |{V_{i}^{e}}|)\), where Vi is the set of internal nodes of fragment Fi and \({V_{i}^{e}}\) is the set of boundary nodes of fragment Fi.

6 Filter local partial matches with boundary node index

As shown in Fig. 5, since the RDF data graph is distributed and stored on multiple sites, the boundary node on each site becomes a bridge connecting any two sites. Node 1 (in blue) on F1 is a boundary node for F2, while it is an internal node in the view of F1. As a result, it is named an inner boundary node on F1, while it is a boundary node for F2. In partial evaluation and assembly framework, after the local partial matches are obtained, all of them are sent to the master site uniformly. However, not all the partial results can continue to be joined to form a complete SPARQL match on the master site. Therefore, it is unnecessary to transmit local partial matches, which are not associated with partial matches on other sites, to the master site. Based on the above problem, this paper proposes an optimization strategy for pre-judging edge labels from boundary nodes to reduce the communication overhead between local sites and the master site.

Fig. 5
figure 5

Fragmentation of a data graph

The definition of boundary nodes (see the definition of \({V_{i}^{e}}\)) has been given in Section 3, which means vertices that belong to other fragments but are directly connected to internal vertices in Fi. For an RDF graph G, when dividing the data, we record each boundary node’s out-edge and in-edge information on the fragment Fi as boundary node index. The formal definition of boundary node index is as follows:

Definition 8 (Boundary Node Index)

. Given a fragment Fi of RDF graph G, the boundary node index, i,e., BN-Index of Fi is a key-value map \(I_{i}^{BN}\) where

  1. 1)

    \(I_{i}^{BN}\) = \(I_{i}^{Out} \cup I_{i}^{In}\);

  2. 2)

    for any tuple \((v,v.Out) \in I_{i}^{Out}\), the key is a vertex \(v \in {V_{i}^{e}}\), and the value v.Out = {(p1p2, ..., pn) ∣ij ∧〈vplu〉∈ Fjl ∈{1,...,n}};

  3. 3)

    for any tuple \((v,v.In) \in I_{i}^{In}\), the key is a vertex \(v \in {V_{i}^{e}}\), and the value v.In = {(p1p2, ..., pn) ∣ij ∧〈u,pl,v〉∈ Fjl ∈ {1,...,n}}.

figure d

The construction approach of boundary node index is shown in Algorithm 4. First, the BN-Index is initialized by the identifier Fi (line 1). Then, for each triple 〈spo〉 in RDF graph G, if s (or o) is a boundary node in Fi and the triple 〈s, p, o〉 ∉ F, the the (sp) (or (op)) will be put into \(I_{i}^{Out}\) (or \(I_{i}^{In}\)) (lines 2-8). Algorithm 4 will iterate over each triple until there is no triple left.

Example 6

As shown in Fig. 2, on site S3, the boundary nodes (with their predicates in orange) are \({V_{i}^{e}}\) = {v8, v13, v23, v26, v28}. And the corresponding boundary node index is \(I_{3}^{Out}\) = {〈v8, (spouse)〉, 〈v13, (director)〉, 〈v23, (director)〉, 〈v28, (prime_minister)〉, 〈v26, NULL〉}, \(I_{3}^{In}\) = {〈v26, (director)〉,〈v8NULL〉, 〈v13NULL〉, 〈v23NULL〉, 〈v28NULL〉}.

Example 7

As shown in Fig. 6, the local partial matches on S3 is PM3 = {(〈?snull〉, 〈?anull〉, 〈?bv13〉, 〈?cv17〉, 〈?dv16〉), ((〈?snull〉, 〈?anull〉, 〈?bv23〉, 〈?cv17〉, 〈?dv16〉), (〈?snull〉, 〈?av26〉, 〈?bv19〉, 〈?cv17〉, 〈?dv16〉), (〈?sv20〉, 〈?av21〉, 〈?bv22〉, 〈?cv28〉, 〈?dnull〉)}. According to the boundary node index, the last two local partial matches could not constitute any final results. Then these two matching pairs will not be sent to the master node as a message.

Fig. 6
figure 6

Filtering partial matches with BN-index on site S3 on graph G1

The filtering partial matches process is presented briefly in Algorithm 5. On site Si, each partial match is iterated to judge on the boundary nodes (line 3). Only when the value (also known as the predicates belong to other fragments) of BN-Index of the boundary node contains the unmatched predicates on the query graph, can the partial match be further matched and reserved to transmit to the master site.

figure e

Space complexity of BN-Index

For each fragment Fi, the number of the BN-Index is \(O(|{V_{i}^{e}}|)\) at most. As a result, the extra space of BN-Index is bounded with \(O(|{V_{i}^{e}}|)\), where \({V_{i}^{e}}\) is the set of crossing edges.

7 Experimental evaluation

In order to verify the effectiveness and efficiency of the IBN-Index method and BN-Index filtering method under the partial evaluation and assembly framework, a comparative experiment with gStoreD [10, 11] was performed over several benchmark RDF datasets. The proposed algorithm is implemented on top of gStoreD, and is deployed on a 3-node cluster, of which each node is in a Docker. The three dockers are all deployed on a machine with 16 cores Intel Xeon Silver 4216 2.10 GHz processors, 512 GB of RAM, and 1.92 TB SSD, running the 64-bit CentOS 7.7 operating system.

7.1 Datasets and queries

In this experiment, the proposed method and gStoreD are evaluated using the LUBM [29] synthetic benchmark dataset of different scales. The statistics of the datasets are shown in Table 2. We need to compare the query efficiency of the method based on IBN-Index, the method based on BN-Index and the combination of the two, and the original gStoreD on different queries. It is necessary to gradually change the number of intermediate results that the query satisfies while limiting the basic structure of the query. Therefore, we choose benchmark datasets rather than real-world datasets to keep the dataset size positively correlated with the number of intermediate results. In addition, to eliminate the impact of the partitioning strategy on query performance, we use a random partitioning method to divide each dataset into four fragments.

Table 2 Datesets
Table 3 Queries

As shown in Table 3, eight complex queries of different scales on the LUBM dataset are presented, i.e., \(Q_{1}\sim Q_{8}\). To exhibit the effect of IBN-Index and BN-Index, the proposed queries may generate large amount of intermediate results in the partial evaluation phase, as showed in Fig. 7a.

7.2 Experimental results

To verify the effectiveness of the IBN-Index and BN-Index based partial evaluation algorithm, extensive experiments were conducted.

Fig. 7
figure 7

Query evaluation on LUBM datasets. (a) Partial evaluation results on LUBM datasets, (b) Query evaluation time of Q1 ∼ Q8

Fig. 8
figure 8

The constructing time and size of IBN-Index and BN-Index. (a) The constructing time and size of IBN-Index, (b) The constructing time and size of BN-Index

Fig. 9
figure 9

The graph exploration in partial evaluation on LUBM datasets

Exp 1. Number of partial matching results. To make it more intuitive to observe and evaluate the performance of the partial evaluation algorithm with different queries and datasets, the number of partial matching results and complete SPARQL matches of \(Q_{1} \sim Q_{8}\) over LUBM10, LUBM20, and LUBM30 is recorded. The maximal total number of all partial evaluation results (including complete SPARQL matches and local partial matches) generated from each site are depicted in Fig. 7a, which determines the time consumption of the partial evaluation and influences assembly phases. As shown in Fig. 7a, for Q1, the number of partial evaluation results increases approximately linearly with the size of the datasets. For the query \(Q_{2} \sim Q_{8}\), their partial evaluation results on LUBM20 are the most, which is affected by the partitioning strategy.

Exp 2. The construction time and space occupied by IBN-Index and BN-Index. Figure 8a shows the largest time overhead and space occupation of the IBN-Index among all slave nodes on different datasets. It can be observed that the time to construct the IBN-Index and the size of the IBN-Index are positively correlated with the size of the graph. Figure 8b presents the construction time and space of the BN-Index, which have similar trends to that of the IBN-Index.

For IBN-Index, since the value of each node is only a Boolean value, the space required for the index is small, which guarantees the time and space complexity of the proposed method. As for BN-Index, the space required is correlated with the number of boundary nodes, which depends on the graph partitioning strategy. Although the random partitioning method we use will produce a large number of intermediate results, the experimental results prove the time and space complexity of BN-Index. Overall, the size of the BN-index is proportional to the size of the graph except that on LUBM20. The reason is that the number of crossing edges on LUBM20 is even more than that on LUBM30, which can be verified by the partial evaluation results depicted in Fig. 7a.

Exp 3. Efficiency of IBN-Index Based Optimization. A measurement of the statistical consequences of graph exploration time in the partial evaluation phase of \(Q_{1} \sim Q_{8}\) over different datasets is shown in Fig. 9. It can be observed that the partial evaluation method based on IBN-Index outperforms gStoreD on all queries and can improve query performance by 1.64 times in the best case. Furthermore, the method combining IBN-Index and BN-Index (the grey lines) improves the query efficiency by 1.79 times in the best case. As the size of the dataset increases, the proportion of time reduction also increases.

Interestingly, the optimization becomes more significant as the number of query nodes increases. The reasons can be summarized in two aspects. (1) When dealing with more query nodes, the number of candidate sets also rises, leading to more repetitive partial evaluation results. Therefore, the IBN-Index can be affected on more candidate sets resulting in a better pruning efficiency. (2) As the length of the result of partial evaluation expands, the candidate nodes are more likely to be pure internal nodes that can be filtered by IBN-Index.

Exp 4. Efficiency of BN-Index Based Optimization. Although gStoreD already has a partial match filtering strategy, the BN-Index-based method could have the same filtering effect and even higher efficiency than gStoreD. As shown in Fig. 9, the efficiency of graph exploration during the partial evaluation of the BN-Index-based method (the blue lines) exceeds gStoreD in most cases, and the strength is expanded as the scale of datasets grows. The reason is that when dealing with a large number of candidate vertices, gStoreD collects all the internal candidate vertices from all slave sites and transmits them back to all sites, which enlarges the searching space of graph exploration seriously. However, BN-Index will not enlarge the candidate sets.

Exp 5. Scalability. To prove the scalability of the IBN-Index and BN-Index based methods, the whole query times of the improved partial evaluation and assembly method over eight queries are presented in Fig. 7b. It is obvious that the query time of our method is nearly linear with the scale of the datasets. Unfortunately, all runs of query Q3, including gStoreD and the IBN-Index and BN-Index based methods, are failed on the LUBM30 dataset due to the limited memory.

8 Conclusion

In this paper, we proposed an inner boundary node index-based method and a boundary node index-based method to improve the computing efficiency of the subgraph matching queries in distributed settings based on the partial evaluation and assembly framework. Moreover, we also proved that the IBN-Index and BN-Index are both time-efficient and space-effective. The extensive experimental results on benchmark datasets verified the efficiency and scalability of the proposed method, which clearly outperforms gStoreD when large-scale intermediate results need to be processed.