Introduction

With the rapid growth of data volumes, skyline queries have seen widespread application and development. They are applied in numerous fields, including data mining, Geographic Information Systems (GIS), spatial databases, location-based services, and multi-criteria decision-making. The skyline query is a typical multi-objective optimization problem. Current research on skyline query technology, both domestically and internationally, focuses primarily on skyline queries over complete data. Examples include probabilistic skyline queries [1], skyline queries for massive datasets [2], skyline queries in mobile edge computing [3], techniques for skyline queries in obstacle spaces [4], clustering design in spatial databases [5], LSM index storage technology in databases [6], a k-dominant skyline query algorithm for dynamic datasets [7], privacy-preserving skyline queries [8], and k-dominant spatial skyline queries [9], among others. Skyline query technology continues to advance steadily.

Because incomplete data, characterized by missing attribute values, is prevalent in practical scenarios, skyline query methods for incomplete data are of significant importance in areas such as multi-objective optimization and location-based services. Existing approaches for skyline queries on incomplete data typically process the data directly. However, these methods classify data inefficiently, which reduces the accuracy of the results; in addition, dataset redundancy during the query process degrades the overall performance of the algorithm. To address these shortcomings, this paper proposes a skyline query method for multidimensional incomplete data based on classification trees. The primary contributions of this paper are as follows:

  1. To address issues in skyline queries with incomplete data, such as data redundancy, low data classification efficiency, and slow classification speed, this paper introduces a method based on incomplete data-weighted classification trees. Integrating the characteristics of incomplete data with traditional tree structures, the method efficiently splits dimensions and adds weighted labels. Compared to existing classifications, this method achieves efficient classification with just three layers of tree structure. It separates the multiple dimensions of incomplete data, storing them in intermediate nodes. Each dimension is assigned weight values based on missing values. Classification is completed through horizontal indexing of leaf nodes, guided by varying weight values. The proposed algorithm resolves challenges of low data classification efficiency and slow speed, enhancing overall classification and skyline query performance.

  2. To address challenges in existing skyline queries for multidimensional incomplete data, characterized by the presence of substantial useless data and data redundancy leading to low query efficiency, this paper introduces a multidimensional incomplete data skyline query algorithm. Building on an effective classification foundation, the algorithm introduces the concept of optimal virtual points for skyline queries in incomplete data. These optimal virtual points are composed of the maximum values for each dimension from local skyline points and are labeled with the source points for each dimension value. Compared to prior research, optimal virtual points rapidly identify dominating points, significantly reducing the number of comparisons and improving query efficiency. The algorithm incorporates optimal virtual points into different classifications, comparing them with local skyline points. If local skyline points dominated by optimal virtual points are still dominated by the source points of the optimal virtual points, these points form shadow points. Local skyline points not dominated by optimal virtual points constitute candidate skyline points. Further dominance comparisons identify global skyline points. Optimal virtual points efficiently filter out a substantial amount of useless data, reducing the number of tuple comparisons. The proposed algorithm significantly enhances the performance and efficiency of skyline queries, avoiding the issues of high time consumption and low efficiency associated with extensive redundant data.

The remainder of this paper is organized as follows. "Basic definition" provides the definitions underlying incomplete data skyline queries. "Multidimensional incomplete data classification algorithm based on incomplete data weighted classification tree" introduces the incomplete data-weighted classification tree, and its subsection "Incomplete data weighted classification tree classification algorithm" proposes a classification algorithm for multidimensional incomplete data based on this tree. "Multi-dimensional incomplete data skyline query" further presents a skyline query algorithm for multidimensional incomplete data. Experimental analysis is presented in "Experimental analysis".

Related work

Skyline queries are widely researched in various fields. Reference [10] proposes a top-k skyline query algorithm based on user preferences and data partitioning. Implemented in MapReduce, the algorithm addresses the low efficiency of top-k skyline queries on large datasets by leveraging user preferences and data partitioning; it effectively enhances query efficiency and exhibits good scalability. However, it is applicable only to static datasets and is not suitable for queries over dynamic datasets. Reference [11] introduces a distributed skyline query algorithm (DSQ) designed for handling skyline queries in a distributed environment. The algorithm employs data block filtering and dominance graph-based data point filtering to eliminate redundant data tuples. Through a rotation-based scheduling plan, skyline results can be obtained in parallel without creating bottleneck nodes, improving query processing efficiency. Its drawback is the need to maintain a considerable number of spatial structures, resulting in substantial space overhead. Reference [12] introduces the top-k Manhattan space skyline query problem for monotonic scoring functions, which quantify how well each point in a set P fits a given query under the L1 distance. Reference [13] presents effective algorithms for continuous skyline queries on large datasets using the MapReduce framework. The main idea is to compute the skyline only once at the initial position and then update the result as the query point moves, avoiding recomputation from scratch each time; this significantly improves the algorithm's efficiency. Motivated by the many applications that must analyze data evolving over time, Reference [14] proposes a novel algorithm, SLS, to evaluate skyline queries on data streams over a low-cardinality domain.
Reference [15] designs three efficient algorithms, IMSS, OIMSS, and PMSS, which combine the advantages of several techniques, including distance-based priority scanning, virtual point testing, and intelligent pruning heuristics; the PMSS algorithm additionally integrates parallel programming techniques. These algorithms can flexibly process queries within subsets of the available dimensions and provide a set of pruning rules capable of eliminating redundant data objects, thereby enhancing query performance. However, they are not suitable for distributed systems or data stream environments. Reference [16] introduces an effective algorithm (SSQ) for processing subspace skyline queries using MapReduce, which can derive meaningful subsets of points from the complete skyline point set of any subspace. Reference [17] introduces two new skyline queries, the minimal skyline query and the extended minimal skyline query, to identify information-rich yet concise skyline sets. The approximate set they produce offers a subset of a potentially large object set containing the best-matching or most interesting objects. Because the approximate set captures the primary distribution of the skyline, its semantics are user-friendly, providing a better basis for decision-making. However, a limitation of the approach lies in the difficulty of determining the distance threshold, and it has certain shortcomings when querying the minimal skyline set across various environments.

Research on the skyline query problem with incomplete data is limited both domestically and internationally. The concept of incomplete data skyline queries was first introduced by Khalefa [18]. Currently, there is extensive research on incomplete data skyline queries in various environments. Examples include probability-based incomplete data skyline queries [19, 20], skyline queries for incomplete data in cloud environments [21], skyline queries in incomplete dynamic databases [22], and skyline preference queries for incomplete data [23]. The technology for incomplete data skyline queries continues to advance. Reference [24] investigates the use of crowdsourcing for skyline queries on incomplete data. A novel query framework, termed Bayesian Clusters, is proposed, considering the data correlation using Bayesian networks. The algorithm utilizes a typical c-table model on incomplete data to represent objects. The paper introduces an effective task selection strategy that balances budget and latency constraints. Specifically, calculating the probability for each object to serve as an answer object is as challenging as the #SAT problem. To address this, an adaptive DPLL algorithm is proposed to expedite computations. However, this algorithm still has certain limitations in optimizing the quality of incomplete data queries. In Reference [25], a new table-scan-based TSI algorithm is introduced to handle incomplete data on massive datasets. The TSI algorithm addresses the issues of non-transitivity and cyclic dominance in two phases. In the first phase, TSI calculates candidates through continuous scanning of the table, directly discarding tuples dominated by others. In the second phase, TSI retrieves candidates through another continuous scan, incorporating pruning operations to reduce execution costs. Reference [26] introduces an algorithm for incomplete dynamic skyline queries, aiming to identify the skyline on dynamic and incomplete databases. 
The algorithm involves pruning and selecting superior local skylines. The pruning process attempts to recognize new skylines using derived skylines before performing insert or update operations on the database, and the algorithm accelerates queries by eliminating dominated tuples. In Reference [27], a novel definition of skyline is proposed. It leverages a probability model on incomplete data, where each point has a probability of appearing in the skyline; specifically, it returns the K points with the highest skyline probabilities. This algorithm can offer users valuable reference decisions, although it may have certain drawbacks in terms of accuracy. Reference [28] presents a novel privacy-preserving aggregate reverse skyline query (PPARS) scheme that ensures complete query privacy. Specifically, it transforms the ARS query problem into a combination of set-membership testing and logical expressions, and employs prefix encoding, Bloom filter techniques, and fully homomorphic encryption to operate on the transformed logical expressions, so that query requests, query results, and access patterns are not disclosed. In Reference [29], a crowdsourcing algorithm for a single dataset is proposed to filter datasets containing unknown attributes. A global hierarchy-preference-tree index is then established based on the known attributes of both incomplete and complete datasets. Query efficiency is enhanced by filtering linked tuples based on global preference scores and the results of each crowdsourcing round.

In the realm of incomplete data classification, Reference [30] introduces a general classification model tailored for incomplete data that effectively integrates existing classification methods. Initially, an attribute subset is selected based on information gain metrics, and a complete view is generated from the incomplete data. Subsequently, these selected views are used to obtain multiple base classifiers. Finally, the base classifiers are efficiently combined with decision trees to form the ultimate classifier. In Reference [31], a minority oversampling technique based on multiple inferences is proposed to address both imbalanced and incomplete data classification. Majority instances are computed only once, while minority instances undergo oversampling using multiple distinct computations without directly manipulating their observed values. Consequently, compared to traditional approaches, minority instances exhibit greater diversity with minimal data distortion.

Basic definition

The dataset in this paper is a D-dimensional incomplete dataset \(O= \{ o_{1},o_{2} ,\ldots,o_{n}\}\), where \(o_{i}\) denotes a tuple in the incomplete dataset.

Definition 1. Dominance relation [1]. Given a D-dimensional dataset O and any two objects o1 and o2 in O, o1 dominates o2 (o1 \(\prec\) o2) if and only if two conditions are satisfied: (1) o1 is no worse than o2 in every attribute: \(\forall i\), \(o_{1} .[i] \ge o_{2} .[i]\); (2) at least one attribute j exists in which o1 is strictly better than o2: \(\exists j\), \(o_{1} .[j] > o_{2} .[j]\).

Definition 2. Skyline query [1]. Given a dataset O, a skyline query returns a set of data objects R such that no object in R is dominated by any other object in O: \(R = \{ o_{i} \in O \mid \nexists\, o_{j} \in O,\; o_{j} \prec o_{i} \}\).
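Under the larger-is-better convention of Definitions 1 and 2, dominance and a brute-force skyline can be sketched as follows (a minimal illustration; the function names are ours, not from the paper):

```python
def dominates(o1, o2):
    """Definition 1: o1 dominates o2 iff o1 is no worse in every
    attribute and strictly better in at least one."""
    return (all(a >= b for a, b in zip(o1, o2))
            and any(a > b for a, b in zip(o1, o2)))

def skyline(points):
    """Definition 2 (brute force): keep every point that no other
    point in the dataset dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```

For example, in `[(5, 3), (4, 4), (3, 5), (2, 2)]` only `(2, 2)` is dominated, so the other three points form the skyline.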

Definition 3. Incomplete data dominance [26]. Given a D-dimensional incomplete dataset O and two points p and q in the D-dimensional space, let M be the set of dimensions on which both p and q have known values, with \(|M| \le D\). If \(\forall s_{i} \in M,\; p.s_{i} \ge q.s_{i}\) and \(\exists s_{i} \in M,\; p.s_{i} > q.s_{i}\), then p dominates q on their common dimensions.
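Definition 3 can be sketched as follows, assuming missing values are represented by None (the function name is ours):

```python
def dominates_incomplete(p, q):
    """Definition 3: compare p and q only on their common (non-missing)
    dimensions M; p dominates q iff p is no worse on every common
    dimension and strictly better on at least one."""
    common = [(a, b) for a, b in zip(p, q)
              if a is not None and b is not None]
    if not common:
        return False  # no common dimension: the points are incomparable
    return (all(a >= b for a, b in common)
            and any(a > b for a, b in common))
```

Note that tuples sharing no common dimension, such as (-, 7, 8, -) and (2, -, -, 6), are incomparable under this definition.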

Definition 4. Incomplete data skyline query [26]. Given a set of incomplete data points O, where each point \(p = \{ u_{1} ,u_{2} ,\ldots,u_{d} \}\) has at least one known dimension \(u_{i}\), the incomplete data skyline query returns the set \(O_{sky} \subseteq O\) such that each point \(p \in O_{sky}\) is not dominated by any other point in O, while every point \(q \in O - O_{sky}\) is dominated by some point in O.

Multidimensional incomplete data classification algorithm based on incomplete data weighted classification tree

In order to handle skyline queries over multi-dimensional incomplete data, the dataset must be classified before the skyline query is conducted. In this section, addressing the slow classification speed of existing methods and drawing on the characteristics of incomplete data, we propose the incomplete data weighted classification tree and further introduce the incomplete data weighted classification tree classification algorithm.

Basic definition

The incomplete data weighted classification tree is a tree-like structure similar to B+ trees. It can perform missing-dimension judgments and classification operations on multi-dimensional incomplete data. Based on the characteristics and properties of traditional binary trees and of multi-dimensional incomplete data, and in order to classify multi-dimensional incomplete data efficiently and accelerate multi-dimensional incomplete data skyline queries, a novel multi-dimensional incomplete data weighted classification tree is proposed for the first time. The definition is as follows.

Definition 5. The root node N of the incomplete data weighted classification tree T stores the dimension values of the multidimensional incomplete data tuples to be classified. Because multidimensional incomplete data has values in multiple dimensions, these dimensions are written sequentially, from the first dimension to the nth dimension and from left to right, into the internal nodes of the second layer. Weights are assigned based on the missing status of each dimension: 0 for missing values and 1 for non-missing values. The resulting data is stored in the leaf nodes of the third layer and exported through horizontal indexing, classified according to the differing dimension weights. After classification, the data is imported sequentially into the respective classes. Finally, clearing the data stored in the original tree completes the classification of the multidimensional incomplete data.

The incomplete data weighted classification tree T comprises leaf nodes, weights (w), internal nodes, and the root node N. After the incomplete dataset O is imported into the constructed weighted classification tree model, a classification operation is performed. The data tuples are stored sequentially in the root node N of the weighted classification tree, and classification proceeds from the top of the tree to the bottom. The root node holds the tuples of the dataset to be classified. Dimension attribute values are stored in internal nodes from left to right according to their dimensions, and the results of the missing-value judgments on these attribute values are stored in the leaf nodes. The judgment proceeds as follows: if the dimension attribute value in an internal node is missing, the left child node is entered with a weight of 0; otherwise, the right child node is entered with a weight of 1. After these operations, missing-value judgments are performed on the dimension attribute values of the internal nodes from left to right and the results are stored in the leaf nodes. Classification is then performed based on the differing weights of each dimension, and the data is exported to the corresponding bucket through the horizontal index. After one round of classification, the dimension attribute values in the leaf nodes are cleared, and the remaining data is classified in turn, completing the classification of the entire multidimensional incomplete dataset. The construction of the horizontal index is depicted in Fig. 1. Each leaf node contains five attributes: identity id, weight, the address of this node, the address of the next node, and the numerical value of its dimension. After construction, each leaf node has a known identity id and address, and an array maintains the mapping between leaf node ids and addresses. With this array, the address of the next node can be found for each leaf node in turn, and each leaf node sequentially completes the construction of the horizontal index. The incomplete data weighted classification tree model is shown in Fig. 1:

Fig. 1
figure 1

Incomplete data weighted classification tree

Incomplete data weighted classification tree classification algorithm

The classification algorithm for incomplete data weighted classification trees mainly consists of two stages. The first stage is the modeling stage of the incomplete data weighted classification tree. The second stage is the classification stage of multi-dimensional incomplete datasets in the incomplete data weighted classification tree model.

Stage1: incomplete data weighted classification tree modeling.

The generation of the incomplete data weighted classification tree is a divide-and-conquer, top-down process. In this study, the dataset is an incomplete dataset O, and a training set (samples) is selected from it to build the incomplete data weighted classification tree T. At this stage, the training set is selected randomly and comprises 70% of the incomplete dataset; it covers all classification scenarios of the data tuples to ensure the integrity and accuracy of the resulting tree. The tree's root node stores the data tuples and branches into intermediate nodes, each branch sequentially storing the dimension attributes of the tuples. Intermediate nodes of the classification tree branch based on the weight values: the left branch of an intermediate node is assigned a weight of 0, and the right branch a weight of 1. Following these steps, all the leaf nodes are formed. A horizontal index is then established in the leaf nodes, creating connections from the leftmost leaf node to the rightmost leaf node; this operation establishes the spatial relationships of each leaf node. When all dimension attribute values and training set samples have been traversed, the construction of the incomplete data weighted classification tree model is complete.

In this stage, the dimension missingness of the dataset is used as a test attribute. Building upon the modeling from the first stage, the final result is the incomplete data weighted classification tree, which can be employed to classify incomplete datasets.
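The stage-1 construction can be sketched as follows: one weight-0 leaf and one weight-1 leaf per dimension node, with the leaves chained left to right into the horizontal index (a simplified sketch under our own naming; the paper's leaf nodes additionally carry addresses and dimension values):

```python
class Leaf:
    """Third-layer leaf: holds a weight (0 = dimension missing,
    1 = dimension present) and a link to the next leaf, which
    together form the horizontal index."""
    def __init__(self, leaf_id, weight):
        self.id = leaf_id
        self.weight = weight
        self.next = None  # set when the horizontal index is linked

def build_tree(num_dims):
    """Build the leaf layer of a weighted classification tree for
    num_dims dimensions: each dimension node gets a weight-0 left
    leaf and a weight-1 right leaf, chained left to right."""
    leaves = []
    for dim in range(num_dims):
        leaves.append(Leaf(2 * dim, 0))      # left child: missing
        leaves.append(Leaf(2 * dim + 1, 1))  # right child: present
    for a, b in zip(leaves, leaves[1:]):     # horizontal index
        a.next = b
    return leaves
```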

Stage 2: Classify the data set.

Based on the constructed incomplete data weighted classification tree, the d-dimensional incomplete dataset O is input into the tree for classification. The root node of the tree forms internal nodes based on the different dimensions. Internal nodes branch according to whether the attribute value for that dimension is missing, assigning different weights to the left and right branches. Finally, classification is achieved through the horizontal index, which reflects the weights of each dimension. After a data tuple is classified, it is output to a bucket via the horizontal index, and all leaf nodes are then emptied. This process is repeated until all data is classified. The number of internal nodes in the incomplete data weighted classification tree, denoted n, equals the number of dataset dimensions, meaning the incomplete dataset can be divided into at most \(2^{n}\) classes.

Theorem 1. Given an incomplete data set O, input the dataset into the incomplete data weighted classification tree. Based on missing value judgments and weight assignment for each dimension of the tuples, tuples with equal dimension weights are then output to the same bucket using a horizontal index, thereby completing the data classification.

Proof: For an incomplete dataset O, assume the missing dimensions of a tuple are denoted by "-". A tuple with arbitrary missing dimensions has the form \(o = \{ o_{1} , - ,o_{2} ,\ldots,o_{i} , - ,\ldots,o_{d} \}\). Each dimension's value is first assigned to the second layer. Starting from the first dimension i, if dimension i is missing, then \(i \in node_{leftChild}\) with weight 0; otherwise \(i \in node_{rightChild}\) with weight 1, where \(node_{leftChild}\) and \(node_{rightChild}\) denote the left and right child nodes, respectively. The second dimension j is then judged: if dimension j is missing, then \(j \in node_{leftChild}\) with weight 0; otherwise \(j \in node_{rightChild}\) with weight 1. This continues until all dimensions have been evaluated, and the tuple is output to the corresponding bucket through the horizontal index based on its classification. The other tuples in the dataset are evaluated sequentially with the same classification operation, until the classification of all data is complete.
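The classification procedure of Theorem 1 amounts to bucketing tuples by their weight vectors, which can be sketched as follows (missing values assumed to be None; names are ours):

```python
from collections import defaultdict

def weight_vector(tup):
    """Assign each dimension weight 0 if missing (None), 1 otherwise."""
    return tuple(0 if v is None else 1 for v in tup)

def classify(dataset):
    """Theorem 1: tuples with identical per-dimension weights are
    output to the same bucket, completing the classification."""
    buckets = defaultdict(list)
    for t in dataset:
        buckets[weight_vector(t)].append(t)
    return dict(buckets)
```

A d-dimensional dataset therefore yields at most 2^d buckets, one per missingness pattern.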

The sample dataset is illustrated in Fig. 2, representing an incomplete data set with 5 dimensions. Using the weighted classification tree for classifying data tuples, the tuple data is entered into the tree and stored in the first-layer node. Each dimension's data is initially placed in the second layer of the classification tree, followed by evaluating if the dimension data is missing. If a dimension is missing, it is stored in a leaf node with a weight of 0; if not, it is stored in a leaf node with a weight of 1. This process continues until all dimensions of the data are classified. Subsequently, based on the differing weights of each dimension and utilizing a horizontal index, the data is exported to distinct buckets. Repeat the aforementioned steps to complete the classification of the entire sample dataset. The dataset is ultimately divided into five classes, labeled as C1 to C5.

Fig. 2
figure 2

Sample data set

Figure 3 shows the classified sample data sets.

Fig. 3
figure 3

Sample data set after classification

The basic idea of the algorithm for the incomplete data weighted classification tree, as proposed in Theorem 1, is as follows. Firstly, utilize the sample training set “samples” to construct a weighted classification tree model. Data tuples are stored in the root node N, and the dimension attributes of the data tuples are sequentially stored in intermediate nodes based on different dimensions. Then, considering the missing status of dimension attribute values, assign weight values to the values and store them in leaf nodes. If a dimension attribute value is missing, it is assigned to the left branch with a weight of 0. If the dimension attribute value is not missing, it is assigned to the right branch with a weight of 1. Subsequently, based on the different weights of dimensions, export the data to the corresponding buckets through a horizontal index. Repeat the process for each tuple until all tuples are classified. The resulting classes, C1, …, Cn, are returned upon the completion of the classification process.

Based on the above discussion, this section further gives the incomplete data weighted classification tree classification algorithm, as shown in Algorithm 1:

Algorithm 1
figure b

Weighted classification tree classification algorithm for incomplete data

The algorithm consists of two parts. The first part creates the weighted classification tree from the training set (lines 2–15); assuming the training set has n tuples and d dimensions, its time complexity is \(O(2^{d} n + d \cdot 2^{d} \log n)\). The second part classifies the incomplete dataset O based on the weighted classification tree (lines 16–20); assuming the incomplete dataset has m tuples with d dimensions, its time complexity is \(O(2^{d} m + 3d \cdot 2^{d} \log m)\). The algorithm thus completes the classification of all data.

Multi-dimensional incomplete data skyline query

This section builds on the classification of incomplete data with incomplete data weighted classification trees described above and performs a skyline query on the classified data.

Skyline query algorithm for multi-dimensional incomplete data

Definition 6. After incomplete data classification, any class can be represented by a 0–1 vector, where the dimension of the missing data value is represented by 0, and the dimension of the non-missing data value is represented by 1.

Suppose there are two tuples, P(-, 7, 8, -) and Q(2, -, -, 6), which are placed in classes Ci and Cj. According to Definition 6, two vectors are obtained: Ci(0, 1, 1, 0) and Cj(1, 0, 0, 1). A bitwise "AND" operation is performed on the two vectors. If the result is not all zeros, the tuples in the two classes can be compared; if the result is all zeros, the tuples in the two classes cannot be compared.
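The comparability test above can be sketched as follows (missing values assumed to be None; names are ours):

```python
def class_vector(tup):
    """Definition 6: the 0-1 vector of a class, with 1 where the
    dimension value is present and 0 where it is missing."""
    return tuple(0 if v is None else 1 for v in tup)

def comparable(ci, cj):
    """Two classes can be compared iff the bitwise AND of their 0-1
    vectors is not all zeros, i.e. they share a known dimension."""
    return any(a & b for a, b in zip(ci, cj))
```

For P(-, 7, 8, -) and Q(2, -, -, 6), the AND of (0, 1, 1, 0) and (1, 0, 0, 1) is all zeros, so their classes cannot be compared.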

Definition 7. Virtual point [18]. If the local skyline point p in bucket Ni dominates point q in bucket Nj, then point p can serve as a virtual point. The main idea is that the virtual point reduces the number of comparisons between tuples through its dominance relationships.

Definition 8. Optimal virtual point. The role of the optimal virtual point is to improve execution efficiency. The optimal virtual point \(E = (e_{1}, e_{2}, \ldots, e_{n})\) is composed of the optimal value of each dimension over the local skyline points in a bucket, where missing dimensions are represented by "-" and \(e_{i}\) is the optimal attribute value of dimension i among the local skyline tuples.

Definition 9. Shadow skyline [18]. When local skyline points are dominated by the optimal virtual point and simultaneously dominated by the source point of the optimal virtual point, the dominated points within the group are removed, forming shadow skyline points. During the selection of global skyline points, comparison filtering is applied to reduce the number of comparisons, preventing issues of redundant data comparison.

Theorem 2. If an optimal virtual point exists and the bitwise operation between the class vectors of the buckets is not all "0", then the optimal virtual point is introduced into the bucket. If the dominance relationship \(\exists V_{i}\) dominating \(\forall q_{j} \in N_{j}\) is satisfied, the number of tuples to be compared is reduced.

Proof: The optimal virtual point is generated by taking the maximum value for each dimension from the local skyline tuples in each bucket, forming a new set of data representing the optimal virtual point. Introducing the optimal virtual point into other buckets for dominance comparison, if dominance is observed, a comparison is made with the data tuples composing the optimal virtual point. If the dominance relationship persists, the dominated data is then inserted into shadow points. Through these operations, a significant reduction in data volume is achieved, consequently decreasing the number of comparisons and improving the efficiency of the algorithm.
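The construction described in the proof can be sketched as follows, assuming missing values are None and tracking, for each dimension, the indices of the local skyline points that supply the optimal value (names are ours):

```python
def optimal_virtual_point(local_skyline):
    """Definition 8: for each dimension, take the maximum value among
    the bucket's local skyline points and record which points supplied
    it. Dimensions missing in every point stay None."""
    dims = len(local_skyline[0])
    values, sources = [], []
    for d in range(dims):
        known = [(p[d], i) for i, p in enumerate(local_skyline)
                 if p[d] is not None]
        if not known:
            values.append(None)
            sources.append([])
        else:
            best = max(v for v, _ in known)
            values.append(best)
            # all points that attain the optimum are source points
            sources.append([i for v, i in known if v == best])
    return tuple(values), sources
```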

Figure 4 shows the calculation process of the optimal virtual point.

Fig. 4
figure 4

Optimal virtual point calculation

Figure 4 illustrates the process of calculating the optimal virtual point. First, a skyline query is conducted in each bucket, eliminating points dominated within the bucket; the remaining points constitute the bucket's local skyline. Following the dominance principle, tuples O1, O21, O28 were removed for class C1; tuples O14, O20, O27, O37 for class C2; tuples O3, O11, O18, O22 for class C3; tuples O4, O16 for class C4; and tuples O5, O12, O30, O40 for class C5. This yields the local skyline of each bucket. Subsequently, the optimal virtual point is selected for each bucket, with the dimension indices of each optimal virtual point indicating the source points of its values. The optimal virtual point is a tuple consisting of the optimal values of each dimension of the local skyline points in the bucket, with missing dimensions denoted by "-". The source points of the first dimension of point V1 are points O13 and O38 in the bucket, and these two points provide the value of the first dimension of V1. The source points of the second dimension of V1 are points O34 and O38, which provide the value of its second dimension. The source points of the third dimension of V1 are points O14 and O8, which provide the value of its third dimension. The source point of the fourth dimension of V1 is point O34, which provides the value of its fourth dimension. The other optimal virtual points and their source points are formed in turn.

Figure 5 shows the calculation process of skyline of classified data.

Fig. 5
figure 5

Classified data skyline calculation

Figure 5 shows the skyline query process for the sample dataset. The skyline calculation is first performed within each bucket. Tuples O5, O12, O30, O40; tuples O4, O16; tuples O3, O11, O18, O22; tuples O14, O20, O27, O37; and tuples O1, O21, O28 are removed by the skyline computation in classes C1, C2, C3, C4, and C5, respectively. The local skyline of each bucket is then obtained, and the optimal virtual point of each bucket is selected according to the definition of the optimal virtual point. The optimal virtual points are added to the local skyline of each bucket, yielding the shadow points and candidate skyline points. Take O13 as an example: the optimal virtual point V5 dominates O13, so the constituent points of V5 are compared with O13 for dominance. Since O13 is still dominated, O13 is written into the shadow points. Performing these operations on the local skyline points in turn separates the candidate skyline points from the shadow points. As shown in the figure, two candidate skyline points, O8 and O10, are selected. The candidate skyline points are then compared with the shadow points, and the global skyline point O8 is obtained because candidate skyline point O8 is not dominated by any shadow point.

The main idea of the algorithm proposed in this section is as follows. The data tuples in each bucket are filtered by dominance comparison to find the local skyline. The optimal virtual point of a bucket is a point consisting of the best value of each dimension of that bucket's local skyline, and it is introduced into the other buckets. If the optimal virtual point dominates a local skyline point, that point is further compared with the source points of the optimal virtual point; if it is still dominated, it is inserted into the shadow skyline. The candidate skyline points consist of the local skyline points in each bucket that are not dominated in this way. Since a candidate skyline point may still be dominated by a shadow skyline point, the shadow skyline points are compared with the candidate skyline points, and any dominated candidate is deleted. The remaining points form the final global skyline.
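The dominance comparison underlying these steps considers only the dimensions that both tuples observe. A minimal sketch, assuming larger-is-better semantics and `None` for missing values (function names are illustrative, not from the paper):

```python
def dominates(a, b):
    """a dominates b if, on every dimension both observe (non-None),
    a is no worse than b and strictly better on at least one.
    With no common dimension, neither tuple dominates the other.
    Larger-is-better is an assumption of this sketch."""
    better = common = False
    for x, y in zip(a, b):
        if x is None or y is None:
            continue  # skip dimensions missing in either tuple
        common = True
        if x < y:
            return False
        if x > y:
            better = True
    return common and better


def local_skyline(bucket):
    """Keep the tuples of one bucket that no other tuple in it dominates."""
    return {tid: t for tid, t in bucket.items()
            if not any(dominates(u, t)
                       for uid, u in bucket.items() if uid != tid)}
```

Note that with incomplete data this dominance relation is not transitive, which is why the algorithm re-checks dominated points against the virtual point's source points rather than discarding them outright.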

Theorem 3. Any global skyline point P will be output by the multi-dimensional incomplete data skyline query algorithm (Algorithm 2).

Proof: Suppose there exists a global skyline point P. However, point P is not output by Algorithm 2. In the algorithm, only if P is dominated will it be discarded. Therefore, we have three cases: (1) P is dominated by points of the same class. Since P is assumed to be a global skyline point, according to the definition of skyline points, points of the same class will not dominate P. Otherwise, it contradicts the assumption. Therefore, P will not be discarded, and P will proceed to the next step of the algorithm. (2) P is dominated by the optimal virtual point and its source point or by shadow points. According to the assumption, P is a global skyline point. According to the skyline definition, the result of comparing P with other source points and shadow points in the common dimensions is that P is not dominated. Otherwise, it contradicts the assumption. Therefore, P will not be discarded, and P will proceed to the next step of the algorithm. (3) P is dominated by other candidate skyline points. According to the dominance relation and skyline definition, P is a global skyline point. P will not be dominated by candidate skyline points. Otherwise, it contradicts the assumption. Therefore, P will not be discarded, and P will be output. In conclusion, if there exists a global skyline point, it will definitely be output by the algorithm.

Theorem 4. Any point derived from the multi-dimensional incomplete data skyline query algorithm (Algorithm 2) will be a global skyline point.

Proof: Suppose Algorithm 2 outputs a point P, but there exists a point Q that dominates P. There are three cases for Q: (1) Q and P belong to the same class. Because Q dominates P, P is discarded. Therefore, the algorithm will not output point P, contradicting the assumption. (2) Q exists in shadow points or Q exists in the source points of the optimal virtual point that dominates P. If the algorithm derives P, it means P enters the candidate skyline points. Because Q dominates P, P is discarded. Therefore, the algorithm will not output point P, contradicting the assumption. (3) Q exists in the candidate skyline points. Because the algorithm outputs P, P also exists in the candidate skyline. Since Q dominates P, P is discarded. Therefore, the algorithm will not output point P, contradicting the assumption. In conclusion, any point derived from the algorithm is a global skyline point.

In summary, by the above theorems every global skyline point is output by the algorithm, and every point the algorithm outputs is a global skyline point. Therefore, the algorithm correctly identifies the global skyline and is convergent.
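Since Theorems 3 and 4 together state that Algorithm 2 outputs exactly the global skyline, a naive all-pairs scan can serve as a reference oracle when testing an implementation. A sketch, assuming larger-is-better dominance restricted to the dimensions both tuples observe (`None` marks a missing value):

```python
def dominates(a, b):
    # Dominance on common non-missing dimensions; larger-is-better assumed.
    better = common = False
    for x, y in zip(a, b):
        if x is None or y is None:
            continue
        common = True
        if x < y:
            return False
        if x > y:
            better = True
    return common and better


def brute_force_skyline(points):
    """All-pairs reference: exactly the set of global skyline points
    that Theorems 3 and 4 say Algorithm 2 must output."""
    return {tid for tid, t in points.items()
            if not any(dominates(u, t)
                       for uid, u in points.items() if uid != tid)}
```

An optimized implementation can be validated by asserting that its result set equals `brute_force_skyline` on small random datasets.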

Based on the above research, we further give the multidimensional incomplete data skyline query algorithm, shown as Algorithm 2:

Algorithm 2

Multi-dimensional incomplete data skyline query algorithm

The multidimensional incomplete data skyline query algorithm first inserts each class into a newly created bucket and performs a skyline query within each bucket according to the dominance principle: tuples that are not dominated form the local skyline, and dominated tuples are deleted. The optimal virtual point of each bucket is then extracted, i.e., the best value of each dimension of the bucket's local skyline points is taken to form the optimal virtual point. Each optimal virtual point is then introduced into the buckets other than its own for skyline computation; this reduces the number of comparisons and hence the complexity of the algorithm. If a local skyline point within a bucket is dominated by an optimal virtual point, it is further compared with the source points of that virtual point. Points still dominated by a source point are inserted into the shadow skyline; points that are not dominated are inserted into the candidate skyline. Finally, the candidate skyline points are compared with the shadow skyline points, and dominated candidates are removed. The candidate skylines of the buckets are then compared with one another based on the dominance relationship, and dominated candidate skyline points are removed, which yields the global skyline points. Assume there are n tuples of d-dimensional data. The time complexity of lines 1–5 is O(2d*n); of lines 6–13, O(d*2d*logn); of lines 14–15, O(d*4d*logn); and of lines 16–27, O(d*2d*n). Assuming there are z shadow points and w candidate skyline points, the time complexity of lines 28–32 is O(z*d*w), and that of lines 33–36 is O(d*w*logw).
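The bucket-level flow just described can be sketched end to end. This is a reconstruction from the textual description, not the authors' code; it assumes larger-is-better dominance over the dimensions both tuples observe, with `None` marking missing values:

```python
def dominates(a, b):
    # Dominance on common non-missing dimensions; larger-is-better assumed.
    better = common = False
    for x, y in zip(a, b):
        if x is None or y is None:
            continue
        common = True
        if x < y:
            return False
        if x > y:
            better = True
    return common and better


def global_skyline(buckets):
    """buckets: {class_label: {tuple_id: value_list}}; None = missing."""
    local, virtual = {}, {}
    # Step 1: local skyline and optimal virtual point of each bucket.
    for c, b in buckets.items():
        local[c] = {tid: t for tid, t in b.items()
                    if not any(dominates(u, t)
                               for uid, u in b.items() if uid != tid)}
        dims = len(next(iter(b.values())))
        vp, src = [], []
        for d in range(dims):
            vals = [(t[d], tid) for tid, t in local[c].items()
                    if t[d] is not None]
            if vals:
                best = max(v for v, _ in vals)
                vp.append(best)
                src.append([tid for v, tid in vals if v == best])
            else:
                vp.append(None)  # the "-" entries in the figures
                src.append([])
        virtual[c] = (vp, src)
    # Step 2: split local skylines into candidates and shadow points.
    candidates, shadow = {}, {}
    for c, sky in local.items():
        for tid, t in sky.items():
            shaded = False
            for c2, (vp, src) in virtual.items():
                if c2 == c or not dominates(vp, t):
                    continue
                # Confirm with the virtual point's source tuples.
                sources = {s for per_dim in src for s in per_dim}
                if any(dominates(buckets[c2][s], t) for s in sources):
                    shaded = True
                    break
            (shadow if shaded else candidates)[tid] = t
    # Step 3: candidates vs. shadow points, then candidates vs. candidates.
    result = {}
    for tid, t in candidates.items():
        if any(dominates(s, t) for s in shadow.values()):
            continue
        if any(dominates(u, t)
               for uid, u in candidates.items() if uid != tid):
            continue
        result[tid] = t
    return result
```

The comparison against source points in step 2 mirrors the paper's two-stage check: the virtual point acts as a cheap filter, and only points it dominates are compared against real tuples.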

Based on the above discussion, integrating the two stages, the weighted classification tree algorithm for incomplete data and the skyline query algorithm, yields the complete multidimensional incomplete data skyline query algorithm (BTIS) proposed in this paper. This method can efficiently perform skyline queries on multidimensional incomplete data.

Experimental analysis

This paper proposes a skyline query algorithm for multidimensional incomplete data. The algorithm first classifies the incomplete data using the weighted classification tree and then performs a skyline query on the classified data. In this section, experiments are designed to evaluate the performance of the algorithm.

The experimental environment is as follows: operating system: Windows 10 (64-bit); CPU: i5-12500H; GPU: RTX 3080.

The experimental data comprise a real dataset, a synthetic dataset, and a road network dataset. The real dataset is derived from the MovieLens 1M dataset obtained from the GroupLens website, which features multidimensional attributes. Seventeen attributes were selected from the MovieLens dataset, with a data completeness of 95%. To introduce randomness, values in this dataset were randomly removed to ensure a 10% missing rate. The synthetic dataset was generated using standard data synthesis tools, creating a complete 1M-tuple, 17-dimensional dataset; attribute values were then randomly removed to achieve a 10% missing rate. The attributes of the synthetic dataset are independent and follow a normal distribution. The road network dataset is based on a partial road network dataset from Northern California, adjusted to 17 dimensions and a 10% missing rate.
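The synthetic setup described here (independent, normally distributed attributes with values removed uniformly at random) can be sketched as follows; the sizes are scaled down for illustration and the function name is ours, not from the paper:

```python
import random


def make_incomplete(n=1000, dims=17, missing_rate=0.10, seed=42):
    """Generate a synthetic incomplete dataset in the spirit of the text:
    independent N(0, 1) attributes, then a fraction of all cells removed
    uniformly at random (None = missing). The paper uses 1M tuples and
    17 dimensions; defaults here are small for illustration."""
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(n)]
    # Pick cells to blank out across the whole table, not per row,
    # so the overall missing rate is exact.
    cells = [(i, j) for i in range(n) for j in range(dims)]
    for i, j in rng.sample(cells, int(len(cells) * missing_rate)):
        data[i][j] = None
    return data
```

Sampling cells globally (rather than deleting per tuple) matches the stated goal of a fixed overall missing rate with randomly located gaps.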

We compared the proposed BTIS algorithm with the Sky-iDS algorithm [20], the SIDS algorithm [22], the △skyline algorithm [33], and the PTKD algorithm [32] in terms of result accuracy, dataset missing rate, dataset dimensionality, and dataset size.

Comparative analysis of algorithms

In this section, we present a comparative analysis of four experiments.

Experiment 1 compares the execution time of the BTIS, SIDS, Sky-iDS, △skyline, and PTKD algorithms under different dataset sizes. The dimensionality of the three datasets was fixed at 17, with dataset sizes ranging from 0 to 900K. The relationship between dataset size and CPU execution time is illustrated in Figs. 6, 7, and 8.

Fig. 6

Effect of MovieLens dataset size on execution time

Fig. 7

Effect of manual dataset size on execution time

Fig. 8

Effect of road network dataset size on execution time

Figures 6, 7, and 8 show that the BTIS algorithm takes the least execution time. As the data size grows, the number of sorting operations and tuple comparisons increases, resulting in a longer running time for SIDS. In the Sky-iDS algorithm, the RankList structure causes the number of tuples and the number of sorts within the structure to increase significantly as the dataset grows, which lengthens the algorithm's execution time. The △skyline algorithm performs one-by-one comparisons when removing useless data, and the PTKD algorithm involves a complex computation process in its skyline queries, so both run longer as the data scale grows. The BTIS algorithm removes a large amount of useless data through the incomplete-data-weighted classification tree and the optimal virtual points; it thus performs fewer comparisons and has a lower running time for the same data size. The CPU execution time of the BTIS algorithm increases slowly as the dataset grows, so changes in dataset size have little impact on its runtime. Comparing Figs. 6 to 8, the execution time on the road network dataset is slightly higher than on the other two datasets; the complexity of the road network data leads to longer processing times for the same data volume.

From Figs. 6 and 7, it is observed that in the MovieLens experiments, the SIDS, Sky-iDS, PTKD, and △skyline algorithms exhibit rapid changes in CPU execution time when the horizontal axis is between 300K and 400K. This is due to some non-uniformly distributed outliers in MovieLens. In contrast, the artificially generated dataset has uniformly distributed values, resulting in smaller runtime fluctuations for SIDS, △skyline, and BTIS. Additionally, the BTIS algorithm demonstrates minimal fluctuations in execution time across both datasets, indicating good algorithm performance.

Experiment 2 compares the execution time of the BTIS, SIDS, Sky-iDS, △skyline, and PTKD algorithms at different dimensionalities. The size of the three datasets in the experiment was 1M, and the dimensionality ranges from 3 to 17. The CPU execution times of the five algorithms vary with increasing data dimensionality, as shown in Figs. 9, 10, and 11:

Fig. 9

Impact of MovieLens dataset dimension on execution time

Fig. 10

Impact of manual dataset dimension on execution time

Fig. 11

Impact of road network dataset dimension on execution time

In Figs. 9, 10, and 11, as the dimensionality increases within the interval from 3 to 11, the CPU runtimes of the algorithms remain stable, with very little difference among them. As the dimensionality increases further, the execution time of the Sky-iDS algorithm rises significantly over the interval from 15 to 17, indicating that its performance deteriorates at higher dimensionality. Because of the RankList structure in the Sky-iDS algorithm, the number of tuples and the number of sorts within the structure increase significantly as the dataset grows, and with higher dimensionality Sky-iDS must traverse and sort multiple dimensions over all datasets; its performance therefore decreases as the dimensionality increases. The ∆skyline algorithm performs one-by-one comparisons when removing useless data, and its execution time increases significantly with the dimensionality. The BTIS algorithm, by contrast, performs well: it first classifies the data using the incomplete-data-weighted classification tree and then selects the optimal virtual points to massively reduce redundant data. With the data size reduced, the number of comparisons decreases, and the algorithm's running time grows smoothly.

In Experiment 3, the execution times of the BTIS, SIDS, Sky-iDS, △skyline, and PTKD algorithms were compared at different missing rates. The datasets used in the experiments have a size of 1M and a dimensionality of 17. To ensure random missingness and meet missing rates from 10% to 50%, 10% to 50% of the values in the datasets were randomly deleted. As the missing rate increases, the CPU execution times of the five algorithms vary as shown in Figs. 12, 13, and 14.

Fig. 12

Impact of MovieLens dataset missing rate on execution time

Fig. 13

Impact of manual dataset missing rate on execution time

Fig. 14

Impact of road network dataset missing rate on execution time

From Figs. 12, 13, and 14, it can be seen that as the missing rate increases, the execution time of the Sky-iDS algorithm decreases slowly, while the execution times of the BTIS, △skyline, PTKD, and SIDS algorithms decrease significantly. The BTIS algorithm uses the incomplete-data-weighted classification tree to improve classification efficiency and accuracy; after classification, it uses the optimal virtual points to screen out a large amount of redundant data, reducing the number of comparisons and hence the running time. The PTKD algorithm employs parallel processing in skyline queries, which reduces computation time. The △skyline algorithm extracts the dominant optimal value when identifying data pairs, greatly reducing the number of comparisons and thus the running time. The SIDS algorithm uses sorted arrays that store the IDs of tuples without missing values in the corresponding dimension; as the missing rate increases, fewer tuples are inserted into the arrays, and the execution time decreases. Because of the RankList structure in the Sky-iDS algorithm, its bucket structure does not shrink significantly as the number of missing dimensions increases, so the number of tuple comparisons, and hence the running time, is not significantly reduced. From Fig. 14, it can be observed that as the missing rate increases, the execution time decreases markedly: with more missing values, the complexity of the road network dataset drops sharply, noticeably accelerating the algorithm's processing.

Experiment 4 compares the accuracy of the BTIS, SIDS, Sky-iDS, △skyline, and PTKD algorithms at different dimensionalities. The size of the three datasets in the experiments is 1M, and the dimensionality ranges from 3 to 17. The variation in accuracy of the five algorithms is shown in Figs. 15, 16, and 17:

Fig. 15

Influence of MovieLens dataset dimension on accuracy

Fig. 16

Influence of dimension of manual dataset on accuracy

Fig. 17

Influence of dimension of road network dataset on accuracy

In Figs. 15, 16, and 17, the accuracy of the BTIS, SIDS, ∆skyline, Sky-iDS, and PTKD algorithms decreases as the dimensionality of the data increases. As can be seen from the graphs, the accuracy of the five algorithms decreases slowly over the interval 3–9. The decrease in accuracy of the Sky-iDS algorithm is significant over the interval 12–15: because of its RankList structure, the number of tuples and the number of sorts within the structure increase significantly at higher dimensionality, and the required traversal and sorting of multiple dimensions over all datasets yields poorer results and reduced accuracy. The SIDS algorithm also loses considerable accuracy over the interval 12–15, because it sorts tuples and deletes dominated ones, an operation that mistakenly removes a larger amount of data. The ∆skyline algorithm likewise deletes data when comparing global best points, so its accuracy also decreases as the dimensionality increases. The BTIS algorithm, which removes a large amount of redundant data, maintains a nearly flat accuracy as the dimensionality of the data increases.

Conclusion

With the rapid development of computer information technology, skyline queries play an important role in many real-life scenarios. Traditionally, skyline queries have been applied to complete data, i.e., data with no missing values. In production environments, however, multidimensional data is growing rapidly, and much of it is incomplete. Efficiently obtaining global skyline points from multidimensional incomplete data is a difficult problem, and existing research results have major limitations in this setting. This paper proposes a classification-tree-based skyline query method for multidimensional incomplete data to address the low query efficiency caused by redundant data in the query process. The proposed method first solves the problems of low classification efficiency and slow classification speed by using the incomplete-data-weighted classification tree, and then performs the skyline query on the classified data, using optimal virtual points to filter out redundant data and reduce the number of comparisons. The algorithm improves the performance and efficiency of skyline queries. Experimental results show that the proposed algorithm performs well, and the classification index space grows linearly with the data dimensionality. Future research could focus on further reducing the memory footprint of this structure.