1 Introduction

In today’s digital age, data plays a crucial role in driving innovation and decision-making, while the issue of privacy has become increasingly pertinent [1,2,3,4,5]. Privacy-Preserving Data Publishing (PPDP) has emerged as a paramount need, seeking to strike a delicate balance between sharing valuable information and safeguarding individuals’ sensitive data [6,7,8,9]. This practice entails the dissemination of datasets that have been carefully anonymized or transformed to protect the privacy of individuals while still allowing researchers and organizations to extract meaningful insights [10,11,12,13]. Due to a heightened awareness of privacy risks, stricter regulations related to data protection, and an understanding of the necessity of responsible data processing, the popularity of privacy-protected data is on the rise [14,15,16].

Preserving data privacy while maintaining data utility is a challenging task, and it remains one of the most pressing and complex problems in the field [2, 17,18,19]. Various approaches have emerged to address this challenge, including data anonymization, differential privacy, secure multiparty computation, and homomorphic encryption [20,21,22,23]. Among these techniques, data anonymization is widely adopted. It involves modifying or removing identifying information from a dataset to protect individual privacy. Techniques such as generalization, suppression, perturbation, or data synthesis are employed to mitigate the risk of re-identification while preserving the usefulness of the analyzed data [24, 25]. However, despite its widespread adoption and effectiveness in enhancing privacy protection, data anonymization faces a significant computational challenge: finding an optimal anonymization is NP-hard [26], which often makes it impractical to obtain an exact solution within a reasonable time frame.

Some studies have introduced Evolutionary Algorithms (EAs), a common solution to NP-hard problems [27], to optimize data anonymization schemes. Among EAs, Differential Evolution (DE) and the Genetic Algorithm (GA) are popular choices [28]. These algorithms find wide application in fields such as engineering, machine learning, and bioinformatics, providing flexible and powerful optimization methods [25, 29]. DE uses differential mutation and crossover to evolve populations, emphasizing exploration, whereas GA models natural selection and genetics, focusing more on exploitation. Both methods have shown effectiveness in solving complex optimization problems. This paper leverages an innovative framework that combines GA with DE and its variants for better performance and robustness in PPDP.

Previous academic studies have delved into using GA and DE algorithms for data anonymization. However, it is important to note that both GA and DE have multiple variants, and the efficacy of these variants, when applied to this problem, has yet to be thoroughly tested and evaluated. The performance of these different variants on data anonymization problems can vary significantly and needs to be explored experimentally. This exploration can help identify the most effective algorithm variants for optimizing data anonymization schemes. Additionally, the performance of different algorithms on different datasets can be uneven. Therefore, there is an urgent need for the development of more robust and effective algorithms.

This paper aims to enable a more practical application of EAs in data anonymization by improving the performance and robustness of the algorithm. We develop an effective adaptive strategy that dynamically combines the strengths of GA and DE mutation strategies, resulting in improved algorithm performance. Furthermore, we design a GA priority strategy and a random-based-DE priority strategy, which use GA and random-based-DE with greater probability in the early stage of evolution to enhance population diversity and explorative search. In the later stages of evolution, best-based-DE is predominantly utilized to expedite convergence and exploitative search. This paper contributes to PPDP in the following ways.

  • This paper introduces an innovative Hierarchical Adaptive Evolutionary Framework (HAEF), seamlessly integrating GA, DE, and variants of DE. In contrast to previous algorithms utilizing solely GA or both GA and DE with a single mutation strategy [30,31,32], HAEF integrates a richer array of evolutionary strategies. As a result, it offers an anonymization algorithm with better performance and stronger robustness for PPDP.

  • In addition, this paper develops a novel adaptive strategy that intelligently leverages the unique characteristics of GA and various DE mutation strategies. Compared with the previous approach that mechanically alternated between GA and DE strategies [32], our method selects the subsequent evolution strategy based on the historical success probability of each strategy.

  • Furthermore, we design a GA-prioritized strategy and a random-prioritized strategy. The GA-prioritized strategy utilizes GA and random-based-DE strategies more in the early to middle stages of the evolution process, enhancing the algorithm’s robustness. The best-based-DE strategies are utilized more in the middle to late stages, improving the algorithm’s convergence speed.

  • To validate the effectiveness of our proposed method, we conduct comprehensive experiments. Comparative experiments with previous methods confirm the superiority of our framework. Ablation experiments further demonstrate the effectiveness of this strategy.

The paper is structured in the following manner: Section 2 offers an overview of the current literature in the field. Section 3 delves into a detailed definition of the optimal anonymization problem of t-closeness. Section 4 elucidates the HAEF approach that has been proposed. The experimental setup is specified in Section 5, and the experimental results are analyzed in Section 6. Section 7 discusses the limitations of this work and its potential real-world applications. Lastly, Section 8 concludes the paper by summarizing the essential findings and contributions.

2 Related work

Data anonymization is a crucial method for PPDP. Various privacy assessment models have been proposed to ensure the privacy and confidentiality of sensitive data, with some of the most classic models including k-anonymity, l-diversity, and t-closeness [33,34,35]. k-anonymity ensures that each record in a dataset is not distinguishable from at least k-1 other records, thereby protecting individual identities [33]. On the other hand, l-diversity prevents attribute leakage by requiring that each equivalence class contains at least l distinct sensitive attribute values [34]. t-closeness, an important model, requires that the distribution of a sensitive attribute in any equivalence class be very similar to the distribution of that attribute in the whole dataset. This approach enhances privacy guarantees by ensuring that sensitive attributes are not overly concentrated in certain equivalence classes [35].

There has been considerable research on finding the optimal way to anonymize data for k-anonymity. However, more research is needed to find optimal techniques for t-closeness anonymization. A study by Liang et al. [26] has proven that finding the optimal t-closeness anonymization solution is NP-hard. The limited number of studies addressing this issue underscores the need for further research.

Traditional search methods for finding the best anonymization scheme rely on depth-first or breadth-first traversal methods, supplemented by search space optimization techniques. For instance, Kohlmayer et al. [36] proposed the Flash algorithm, which uses a depth-first traversal search to achieve the best k-anonymity. While this method can guarantee the optimal solution for simple datasets, it may not be effective for complex datasets.

EAs have emerged as powerful tools for solving the NP-hard problem of data anonymization, particularly for complex datasets [1]. These algorithms combine evolutionary principles with optimization techniques to search for high-quality solutions efficiently. Ge et al. proposed the Information-Driven Genetic Algorithm (IDGA) [30] and the Information-Driven Distributed Genetic Algorithm (ID-DGA) [31], which use GA-based approaches to optimize k-anonymity. The Two-Layer Evolutionary Framework (TLEF), introduced by You et al. [32], is a notable improvement to EAs for optimizing data anonymization schemes. By integrating GA and DE, TLEF delivers enhanced performance. However, the simplicity of TLEF’s hybrid strategy suggests that further refinement is possible.

In the domain of EAs, research has explored the adaptation of different strategies, particularly within DE [37, 38]. One such example is the Self-adaptive Differential Evolution algorithm (SaDE) [38], which introduces a novel approach where the learning strategy and control parameters need not be pre-specified. This adaptability enables SaDE to adjust its strategy dynamically during optimization, potentially resulting in even better performance.

3 Problem formulation

3.1 Data anonymization

When publishing data, it is common practice to anonymize it using various techniques (such as generalization and suppression) to convert the original dataset D into an anonymous dataset T. The original dataset, often in table format, typically contains explicit identifiers, quasi identifiers, sensitive attributes, and non-sensitive attributes.

  • Explicit Identifiers: These are attributes that directly identify record owners. Examples include names, social security numbers, and email addresses.

  • Quasi Identifiers (QID): These attributes can potentially reveal the identity of individuals when combined with other information. Examples include zip codes, birth dates, and gender.

  • Sensitive Attributes (SA): These attributes contain private or sensitive information that needs to be protected. Examples include medical conditions, sexual orientation, and financial status.

  • Non-Sensitive Attributes: These attributes include other information that is not considered sensitive. Examples include job titles, educational background, and hobbies.

When it comes to anonymizing data, the primary goal is to transform the QIDs within the table into anonymized \(QID^{'}s\) to eliminate the risk of data re-identification. Explicit identifiers, which directly identify individuals, are typically removed from the table altogether. Non-sensitive attributes, which lack sensitive information, do not require special handling and may remain unchanged. However, SAs contain essential information analysts need and are therefore retained.

The process of anonymizing data, known as \(\mathcal {M}\), typically involves implementing techniques such as generalization, suppression, perturbation, data synthesis, etc. This paper adheres to the framework of IDGA [30] and ID-DGA [31], employing the strategies of generalization and suppression to achieve data anonymization, denoted as \(\mathcal {M}\{G, S\}\). Generalization (G) entails substituting particular attribute values with broader or less precise ones, reducing the likelihood of re-identification, whereas Suppression (S) involves entirely removing specific records to prevent the disclosure of sensitive information. An example of an anonymized table is shown in Figure 3, which hides part of the information and makes privacy disclosure more difficult.

$$\begin{aligned} \mathcal {D}\lbrace QID,SA \rbrace \,\xrightarrow {\mathcal {M}\left\{ G,S \right\} }\,\mathcal {T}\lbrace QID^{'},SA\rbrace \end{aligned}$$
(1)
Fig. 1: An example of the original dataset with the generalization and suppression operations

Fig. 2: An example of generalization hierarchies

For example, the original table in Figure 1 initially contains six QIDs, one SA, and r records. To protect privacy, the table is anonymized using generalization (G) and suppression (S) techniques. As shown in Figure 3, the G operator is a sequence of natural numbers of length n (the number of QIDs), which designates the required level of generalization for each QID. Typically, G is used alongside a generalization hierarchy (illustrated in Figure 2), where each value corresponds to a generalization level in the taxonomy tree [11]. The other operator, S, is a sequence of 0s and 1s whose length equals the number of records in the original table (eight in this example), where a zero indicates a record to be deleted and a one indicates a record to be retained (Figure 3).
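For illustration, the following minimal Python sketch shows how the two operators could be encoded and applied; the record layout and the `hierarchies` lookup are assumptions made for this example rather than part of the proposed framework.

```python
# Minimal sketch of applying M{G, S} to a small table.
# 'hierarchies[j]' is a hypothetical callable mapping an original value of the
# j-th QID to its ancestor at a given level of the taxonomy tree (Figure 2).

def apply_anonymization(records, G, S, hierarchies):
    """records: list of QID tuples; G: generalization level per QID;
    S: 0/1 suppression mask with one entry per record."""
    anonymized = []
    for keep, record in zip(S, records):
        if keep == 0:                      # '0' means the record is suppressed
            continue
        anonymized.append(tuple(
            hierarchies[j](value, G[j])    # generalize value to level G[j]
            for j, value in enumerate(record)
        ))
    return anonymized

# Example shapes only: 6 QIDs -> G has 6 entries, 8 records -> S has 8 bits.
G = [1, 0, 2, 1, 0, 3]
S = [1, 1, 0, 1, 1, 0, 1, 1]
```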

3.2 Privacy model

In PPDP, data to be released must first undergo a privacy assessment to ensure proper anonymization before it can be released. Only anonymized data that meets privacy requirements is considered qualified for release. This paper employs the t-closeness privacy model for analysis, which aims to establish a certain level of similarity between the distributions of sensitive attributes in both the original and anonymized datasets. Please refer to Definition 1 for a more detailed explanation.

Fig. 3: An example of the anonymized dataset

Definition 1

(The t-closeness Principle). An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness.

Given the distribution of an equivalence class \( S=(s_1, s_2, \cdots , s_m) \) and the distribution of the whole table \( Q=(q_1,q_2,\ldots ,q_m) \), one well-known way to define the distance between the distributions is the Euclidean distance:

$$\begin{aligned} D[S,Q]=\sqrt{(s_1-q_1)^2+(s_2-q_2)^2+\cdots +(s_m-q_m)^2} \end{aligned}$$
(2)

where \((s_1, s_2,\cdots ,s_m)\) and \((q_1,q_2,\cdots ,q_m)\) represent the proportions of the sensitive attribute values in the corresponding class and in the whole table, respectively.

It is noteworthy that regardless of whether the original data contains one single or multiple sensitive attributes, the calculation for t-closeness of a dataset remains consistent. In the case of a single sensitive attribute, the proportional distribution of each value within the sensitive attribute can be directly computed. In scenarios involving multiple attributes, it becomes necessary to calculate the proportional distribution of combinations of values across all sensitive attributes.

For an anonymized dataset (T) containing l equivalence classes, its Anonymity Degree AD(T) is defined by the class whose distribution is farthest from that of the whole table, as shown in (3).

$$\begin{aligned} AD(T) = Max(D[S_1,Q],D[S_2,Q],\ldots ,D[S_l,Q]) \end{aligned}$$
(3)

Therefore, according to Definition 1, if an anonymized dataset has t-closeness, then:

$$\begin{aligned} AD(T)\le t \end{aligned}$$
(4)
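As a concrete illustration of (2)-(4), the sketch below computes the distribution distance of each equivalence class and the resulting AD(T); the function names and input format are assumptions, not the implementation used in this paper.

```python
import math
from collections import Counter

def distribution(values, domain):
    """Proportion of each sensitive value in 'values' over a fixed value domain."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in domain]

def anonymity_degree(classes, table_values, domain):
    """AD(T): the largest distance D[S_i, Q] over all equivalence classes, per (3)."""
    Q = distribution(table_values, domain)
    def dist(S):  # Euclidean distance between two distributions, per (2)
        return math.sqrt(sum((s - q) ** 2 for s, q in zip(S, Q)))
    return max(dist(distribution(c, domain)) for c in classes)

def satisfies_t_closeness(classes, table_values, domain, t):
    """Condition (4): every class must stay within distance t of the whole table."""
    return anonymity_degree(classes, table_values, domain) <= t
```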

3.3 Utility metrics

The utility of dataset T is calculated according to its Transparency Degree (TD) [11], which reflects how much useful information remains in the released data after suppression and generalization:

$$\begin{aligned} \text {{ TD}}(T)=\sum _{{r}'\in T}\sum _{v_g\in {r}'}\frac{1}{\left| v_g \right| } \end{aligned}$$
(5)

where \({r}'\) denotes a record that remains in T after the suppression process; \(v_g\) is a generalized value in record \({r}'\); \(\left| v_g \right| \) is the number of domain values that are descendants of \(v_g\).
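A direct transcription of (5) might look as follows, assuming a hypothetical `descendant_count` helper that returns \(\left| v_g \right| \) from the generalization hierarchy.

```python
def transparency_degree(retained_records, descendant_count):
    """TD(T) per (5): each generalized value v_g contributes 1 / |v_g|, where
    |v_g| is the number of domain values that are descendants of v_g."""
    td = 0.0
    for record in retained_records:        # records left after suppression
        for v_g in record:                 # generalized QID values in the record
            td += 1.0 / descendant_count(v_g)
    return td
```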

3.4 Optimal anonymization

Optimal anonymization describes the solution that results in minimal information loss according to a given metric [36, 39]. In the specific context of this paper, it is defined as follows.

Definition 2

(Optimal anonymization). For an anonymized dataset T, the optimal anonymization solution satisfies the privacy requirement \(AD(T) \le t\) while achieving the highest utility degree Max(TD(T)).

As the study [26] demonstrated, for every constant t, this is an NP-hard problem. In other words, finding the optimal solution is computationally expensive or even infeasible. A second-best approach is to find a relatively optimal solution within a limited computational budget (maxEvaluationTime).

Therefore, the optimization problem to be solved in this paper is to find the most efficient combination of suppression and generalization solutions \(\mathcal {M}\{G, S\}\) within the maxEvaluationTime to maximize the data utility metric TD(T) within a given t-closeness threshold t.

$$\begin{aligned} {\left\{ \begin{array}{ll} & \text {Max}(\text {{ TD}}(T))\\ & AD(T)\le t \\ & EvaluationTime \leqslant {\max }EvaluationTime \end{array}\right. } \end{aligned}$$
(6)

4 Hierarchical adaptive evolution framework

The HAEF algorithm utilizes a two-level hierarchical adaptive architecture that combines the benefits of GA and DE variants. GA relies on natural selection and genetics to generate promising solutions, while the DE variants offer potent search mechanisms and exploration capabilities. By harnessing both techniques, HAEF can effectively navigate the solution space and uncover high-quality solutions. Furthermore, the adaptive strategy of HAEF prioritizes GA and random-based-DE mutation strategies during the early stages of iteration to broaden the search, and later relies on best-based-DE strategies to approach the global optimum, resulting in a more efficient and effective optimization process.

4.1 Workflow of HAEF

The overall process of HAEF is illustrated in Figure 4. It anonymizes the original data using two techniques: generalization and suppression. As mentioned in Section 3.1, the generalization operator is a list of natural numbers signifying the levels of the generalization hierarchy (Figure 2) to which the values of each QID should be mapped. The suppression operator is represented as a binary sequence (0s and 1s) equal in length to the number of records in the dataset: ‘0’ indicates the removal of the corresponding record, while ‘1’ denotes its retention.

Fig. 4: The overall work process of HAEF

Given a G and an S, an anonymization scheme for table D is established. By calculating the AD and TD of the anonymized table T, the effectiveness of this anonymization scheme can be evaluated. Taking the \(M\{G,S\}\) in Figure 3 as an example, the anonymized table contains 6 records. After generalizing the QIDs, these records are grouped into three groups: \((R_1, R_2, R_5); (R_4); (R_7, R_8)\).

As long as the anonymized table T meets the privacy constraints (4), scheme M satisfies PPDP. However, the optimal anonymization not only needs to satisfy the privacy constraints but also requires finding a scheme that maximizes TD.

In order to seek the optimal solution, HAEF employs a hybrid approach combining GA and DE. Initially, the algorithm generates an initial population comprising two sub-populations: population a represents schemes for generalization (G), while population b represents schemes for suppression (S). The size of the population is denoted by NP, indicating the number of individuals within the population. Subsequently, the privacy and data transparency of the anonymized table generated by every individual in the population are computed and validated. After the fitness values are obtained, a selection process is conducted among the individuals within the population. This selection method adheres to the criteria proposed in IDGA [30] and ID-DGA [31]. The best-performing individual is then utilized for subsequent evolution or is directly output as the determined optimal anonymization scheme.

The algorithm iteratively searches for the optimal solution within the predefined maximum validation iteration range. HAEF incorporates a dual-layer adaptive scheme, intelligently leveraging the advantages of GA, DE, and the variants of DE. In the hierarchy of GA-DE adaptation, HAEF employs a probabilistic approach to randomly select either the GA or DE method for the evolution of the new generation. The probability of selecting GA or DE is determined by the ratio of successful offspring generated by GA and DE in previous generations, which proceed to the subsequent round of evolution. Additionally, HAEF incorporates a GA-prioritized strategy, wherein GA is given higher probability usage during the initial stages of evolution, and DE is favored with higher probability towards the later stages.

Upon selecting GA, two individuals are randomly chosen from the population to serve as parents. These parents then undergo crossover, wherein segments of their genetic material are exchanged, generating offspring. Subsequently, the mutation is applied to randomly alter portions of the genetic makeup of some newly generated individuals, forming a new generation of individuals. Through validation and selection of the new individuals, those that outperform their parent individuals are retained, while those that do not meet the criteria are eliminated. This process results in the formation of the next generation of the population.

When DE is selected, HAEF incorporates an adaptation layer, intelligently choosing the mutation strategy for the DE layer based on probabilities. The selection probabilities of all mutation strategies are determined by the ratio of successful offspring generated by each mutation strategy in past generations to the total number of offspring generated by DE. Furthermore, HAEF implements a random-prioritized strategy, favoring the use of random-based DE mutation strategies in the initial stages of evolution and shifting towards the use of best-based-DE mutation strategies in the later stages.

Once one of the six mutation strategies is selected, each individual in the original population will generate a new individual accordingly. Subsequently, this new individual will exchange portions of its elements with the original individual through crossover. The fitness of each new individual is then validated, and these individuals are compared with the original individuals. If a new individual outperforms the original individual, it replaces the original individual; otherwise, it is discarded. This process forms a new population generated by DE for the next generation.

When the maximum validation iteration is reached, the HAEF ends and outputs the current best anonymization scheme.

4.2 GA-DE adaptation hierarchy

As shown in Algorithm 1, the HAEF follows the general procedure of an EA. Firstly, in Step 2, a Population \(P_0\{X_{i,0}|i=1,2,\cdots ,NP\}\) containing NP Individuals is initialized. Each Individual includes two parts: \( X_{i,0}^{a} \) encodes the generalization method \(\mathcal {M}\{G\}\) and \( X_{i,0}^{b} \) encodes the suppression method \(\mathcal {M}\{S\}\).

$$\begin{aligned} X_{i,0}^{a}=(x_{1,i,0}^{a},x_{2,i,0}^{a},\cdots ,x_{j,i,0}^{a})|i=1,2,\cdots ,NP \end{aligned}$$
(7)

\(X_{i,0}^{a}\) is a list of natural numbers whose elements are randomly generated according to a uniform distribution with \(0\le x_{j,i,0}^{a}\le x_{j}^{up}\), where \(x_{j}^{up}\) is the highest generalization hierarchy level of the jth attribute. For \(\mathcal {M}\{G\}\), \( j=1,2,\cdots ,n \), where n is the number of QID attributes.

$$\begin{aligned} X_{i,0}^{b}=(x_{1,i,0}^{b},x_{2,i,0}^{b},\cdots ,x_{j,i,0}^{b})|i=1,2,\cdots ,NP \end{aligned}$$
(8)

\(X_{i,0}^{b}\) is a randomly generated list whose elements follow a uniform distribution over \(x_{j,i,0}^{b}\in \{0,1\}\). For \(\mathcal {M}\{S\}\), \( j=1,2,\cdots ,r \), where r is the number of records in table D.
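A minimal sketch of this initialization per (7) and (8), assuming `upper[j]` stores \(x_{j}^{up}\) for each QID; the variable names are illustrative only.

```python
import random

def init_population(NP, upper, r):
    """Create NP Individuals per (7)-(8): part 'a' holds generalization levels,
    part 'b' holds the 0/1 suppression mask."""
    population = []
    for _ in range(NP):
        Xa = [random.randint(0, upper[j]) for j in range(len(upper))]  # 0 <= x <= x_j^up
        Xb = [random.randint(0, 1) for _ in range(r)]                  # one bit per record
        population.append({"a": Xa, "b": Xb})
    return population
```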

In the following Step 3, HAEF calculates the AD(T) and TD(T) of each Individual in \(P_0\) with (3) and (5). Then, following the selection criteria introduced in [30], the best Individual in \(P_0\) (\(X_{best,0}\)) is selected in Step 4 for the later process.

From Steps 6 to 16, HAEF enters a while loop for evolutionary operations. In the loop, HAEF first performs the first adaptation hierarchy, deciding whether the GA or the DE evolutionary strategy will be applied to update the current population \(P_g\{X_{i,g}^{a}, X_{i,g}^{b}\}\): GA or DE is randomly selected according to the corresponding probabilities \( p_{ga} \) and \( p_{de} \) (Step 9). If GA is selected, HAEF performs Algorithm 2 (Steps 10-11); otherwise, DE is selected and Algorithm 3 is executed (Steps 12-13).

The value of g, representing the current generation, increases by 1 in each iteration of the loop. The loop terminates once g exceeds the pre-defined maximum generation \(\gamma \).

Algorithm 1: Outline of HAEF.

4.2.1 GA-prioritized adaptation

GA and DE are commonly utilized, each with distinct advantages and drawbacks. GA excels at preserving a varied population during optimization, efficiently exploring the search space and minimizing the risk of local optima. It achieves this through a crossover operation that blends information from multiple candidate solutions to generate novel solutions, thereby uncovering new areas of the search space and potentially discovering superior solutions. While DE may struggle to balance exploration and exploitation, GA tackles these challenges adeptly with its crossover operations. Additionally, DE can prematurely converge to suboptimal solutions in complex multimodal optimization problems.

Hence, this paper designs a GA-DE adaptation hierarchy to dynamically combine GA and DE: GA is predominantly employed in the early stages of evolution to preserve population diversity and avoid premature convergence to a local optimum, whereas DE dominates the middle and late stages of evolution to approach the global optimum quickly.

In the GA-DE adaptation hierarchy, HAEF adopts a GA-prioritized adaptation strategy. For the first evolution, the EA strategy is chosen according to the initial probability \(p_{ga}^{0}=1\) and \(p_{de}^{0}=0\). Then \(p_{ga}\) and \(p_{de}\) are updated every UpdateInterval generations according to (9) and (10) (shown in Step 7), where UpdateInterval is a pre-defined parameter.

$$\begin{aligned} p_{de}=\frac{1}{2}(\frac{ns_{de}\cdot (ns_{ga}+nf_{ga})}{ns_{ga}\cdot (ns_{de}+nf_{de})+ns_{de}\cdot (ns_{ga}+nf_{ga})}+\frac{g}{\gamma }) \end{aligned}$$
(9)
$$\begin{aligned} p_{ga}=\frac{1}{2}(\frac{ns_{ga}\cdot (ns_{de}+nf_{de})}{ns_{ga}\cdot (ns_{de}+nf_{de})+ns_{de}\cdot (ns_{ga}+nf_{ga})}+1-\frac{g}{\gamma }) \end{aligned}$$
(10)

where \(ns_{de}\) and \(ns_{ga}\) are, respectively, the counts of successful trial vectors generated by DE and GA during the last UpdateInterval generations. Correspondingly, \(nf_{de}\) and \(nf_{ga}\) are the counts of discarded trial vectors generated by DE and GA.
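The update in (9) and (10) blends each method's recent success rate with a schedule term \(g/\gamma \) that gradually shifts probability mass from GA to DE. A sketch of this update (not the authors' code) is shown below; the fallback for the case where no successful offspring have been recorded yet is an added assumption.

```python
def update_ga_de_probs(ns_ga, nf_ga, ns_de, nf_de, g, gamma):
    """Return (p_ga, p_de) per (9)-(10). The counters are the successes and
    failures accumulated by GA and DE over the last UpdateInterval generations."""
    denom = ns_ga * (ns_de + nf_de) + ns_de * (ns_ga + nf_ga)
    if denom == 0:            # no successful offspring yet: fall back to the schedule term
        succ_ga = succ_de = 0.5
    else:
        succ_de = ns_de * (ns_ga + nf_ga) / denom
        succ_ga = ns_ga * (ns_de + nf_de) / denom
    p_de = 0.5 * (succ_de + g / gamma)
    p_ga = 0.5 * (succ_ga + 1 - g / gamma)
    return p_ga, p_de
```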

The algorithm stops when it reaches the maximum number of generations. The best anonymization solution, labeled as \(X_{best,\gamma }\), is determined from the population \(P_\gamma \). This solution is the most effective configuration found by the HAEF and provides optimal anonymization for the given dataset.

4.2.2 GA

This paper presents an implementation of a basic GA, detailed in Algorithm 2. The algorithm iterates through each pair of parents in the parental population using a for loop (from Step 2 to Step 9). For each pair, the algorithm performs a crossover operation between the father and mother, producing an offspring that inherits traits from both parents (Step 3). The crossover rate is regulated by \(CR_{ga}\). Then, in Step 4, the offspring undergoes a mutation operator that introduces small, random changes to its genetic composition. The proportion of random changes is controlled by \(MR_{ga}\). In Step 5, the offspring’s AD and TD are evaluated based on (3) and (5). If an offspring is deemed more competitive than any parent individual, that parent is replaced by the offspring (Steps 6-8). This process repeats until all parent individuals are considered. Once GA evolution is completed, the updated population \(P_{g+1}\) and the best individual \(Individual_{best, g+1}\) of this round are returned to the mainstream.
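A high-level sketch of one GA generation as described above is given below; `crossover`, `mutate`, and `better` (the selection criterion of [30]) are placeholders for routines defined elsewhere in the framework.

```python
def ga_generation(population, CR_ga, MR_ga, crossover, mutate, better):
    """One GA round (Algorithm 2, sketched): pair parents, recombine, mutate,
    and keep offspring that outperform a parent."""
    for i in range(0, len(population) - 1, 2):
        father, mother = population[i], population[i + 1]
        child = crossover(father, mother, CR_ga)   # exchange segments of the G and S parts
        child = mutate(child, MR_ga)               # small random changes to some elements
        if better(child, father):                  # selection criterion of [30]
            population[i] = child
        elif better(child, mother):
            population[i + 1] = child
    return population
```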

Algorithm 2: Pseudo-code of GA.

4.3 DEs adaptation hierarchy

As shown in Algorithm 3, DE uses varying mutation strategies during execution. In Steps 2-20, through the for loop, each individual of the previous generation enters a cycle of DE operations: mutation, crossover, and selection.

Algorithm 3: DEs adaptation hierarchy.

4.3.1 Random-prioritized adaptation

In DE, the mutation strategy plays a crucial role in exploring the search space and reaching the optimal solution. Random-based mutation strategies (e.g., DE/rand/1, DE/rand/2, DE/current-to-rand/1) and best-related mutation strategies (e.g., DE/best/1, DE/best/2, DE/current-to-best/1) differ in how they select candidate solutions for mutation.

The random-based mutation strategies facilitate the search space exploration by selecting random solutions from the population for mutation. This can help to escape from local optima and find diversified solutions. This approach is generally more robust to noise and disturbances in the objective function since it relies on various mutation candidate solutions. However, the convergence rate of these strategies may be slower than that of best-based strategies, especially in optimization problems where exploiting the best solution is crucial for convergence.

Best-based strategies focus on exploiting the best solutions in the population, leading to faster convergence to the optimal solution. These strategies tend to produce higher-quality solutions since they prioritize the best solution for mutation. However, the best-based strategies may prematurely converge to a sub-optimal solution, especially in problems where search space exploration is crucial.

In summary, DE mutation strategies should be chosen based on the optimization problem’s requirements for exploration and exploitation. Random-based strategies are better for exploring extensive search spaces, while best-based strategies are better for converging toward the best solution. Therefore, this paper uses an adaptive hierarchy to select the DE mutation strategy dynamically. Random-based mutation strategies tend to be used more frequently in the early stage of the evolution process. In contrast, best-based mutation strategies tend to be used more regularly in the middle and late stages of evolution.

When updating the population with DE, each individual has the opportunity to select a predefined mutation strategy based on the corresponding probability (Step 3). In the first round, this probability is given by the initialization state. The subsequent probability \(p_{de}^{\delta }\) is renewed every UpdateInterval generations by (11), (12) and (13) in the main procedure.

$$\begin{aligned} sm_{de}^{\delta }=\frac{ns_{de}^{\delta }}{(ns_{de}^{\delta }+nf_{de}^{\delta })+\varepsilon }+\varepsilon \end{aligned}$$
(11)

where \( \delta = 1,2,\cdots ,6\) represents a corresponding mutation strategy in (14)-(19). \(ns_{de}^{\delta }\) and \(nf_{de}^{\delta }\) are, respectively, the counts of successful and failed trial vectors corresponding to \(\delta \) during UpdateInterval generations.

This paper presents different evolutionary probability updating methods for random-based and best-based mutation strategies. The aim is to increase the usage of the ‘random’ strategies in the early stages of evolution and the ‘best’ strategies in the middle and late stages. There are three random-based mutation strategies (refer to (14), (16), and (18)), and the corresponding evolution probabilities \(p_{de}^{\delta }|\delta =1,3,5\) are updated using (12). Similarly, there are three best-based mutation strategies (refer to (15), (17), and (19)), and the corresponding evolution probabilities \(p_{de}^{\delta }|\delta =2,4,6\) are updated using (13).

$$\begin{aligned} p_{de}^{\delta }=\frac{1}{4}(\frac{sm_{de}^{\delta }}{sm_{de}^{1}+sm_{de}^{2}+sm_{de}^{3}+sm_{de}^{4}+sm_{de}^{5}+sm_{de}^{6}}+1-\frac{g}{\gamma })|\delta =1,3,5 \end{aligned}$$
(12)
$$\begin{aligned} p_{de}^{\delta }=\frac{1}{4}(\frac{sm_{de}^{\delta }}{sm_{de}^{1}+sm_{de}^{2}+sm_{de}^{3}+sm_{de}^{4}+sm_{de}^{5}+sm_{de}^{6}}+\frac{g}{\gamma })|\delta =2,4,6 \end{aligned}$$
(13)
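The per-strategy update (11)-(13) can be sketched as follows: each of the six strategies receives a smoothed success rate that is normalized across strategies, with the random-based strategies (\(\delta =1,3,5\)) biased toward early generations and the best-based strategies (\(\delta =2,4,6\)) toward late generations. The default value of \(\varepsilon \) used here is an assumption.

```python
def update_de_strategy_probs(ns, nf, g, gamma, eps=0.01):
    """ns[d], nf[d]: success/failure counts of mutation strategy d (d = delta - 1)
    over the last UpdateInterval generations. Returns the six probabilities."""
    sm = [ns[d] / (ns[d] + nf[d] + eps) + eps for d in range(6)]        # (11)
    total = sum(sm)
    probs = []
    for d in range(6):
        if d % 2 == 0:   # delta = 1, 3, 5: random-based, favored early    (12)
            probs.append(0.25 * (sm[d] / total + 1 - g / gamma))
        else:            # delta = 2, 4, 6: best-based, favored late       (13)
            probs.append(0.25 * (sm[d] / total + g / gamma))
    return probs
```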

4.3.2 Mutation operation

At generation g, DE employs the mutation and crossover operations to produce a trial vector \(U_{i,g}\) for each individual vector \(X_{i,g}\), also called target vector, in the current population.

For each target vector \(X_{i,g}\) in generation g, an associated mutant vector \( V_{i,g}=\left\{ v_{1,i,g},v_{2,i,g},\ldots ,v_{n,i,g} \right\} \) can usually be generated by using one of the following 6 strategies:

“DE/rand/1”:

$$\begin{aligned} V_{i,g}=X_{r_{1},g}+F\cdot \left( X_{r_{2},g}-X_{r_{3},g} \right) . \end{aligned}$$
(14)

“DE/best/1”:

$$\begin{aligned} V_{i,g}=X_{best,g}+F\cdot \left( X_{r_{1},g}-X_{r_{2},g} \right) . \end{aligned}$$
(15)

“DE/rand/2”:

$$\begin{aligned} V_{i,g}=X_{r_{1},g}+F\cdot \left( X_{r_{2},g}-X_{r_{3},g} \right) +F\cdot \left( X_{r_{4},g}-X_{r_{5},g} \right) . \end{aligned}$$
(16)

“DE/best/2”:

$$\begin{aligned} V_{i,g}=X_{best,g}+F\cdot \left( X_{r_{1},g}-X_{r_{2},g} \right) +F\cdot \left( X_{r_{3},g}-X_{r_{4},g} \right) . \end{aligned}$$
(17)

“DE/current-to-rand/1”:

$$\begin{aligned} V_{i,g}=X_{i,g}+F\cdot \left( X_{r_{1},g}-X_{i,g} \right) +F\cdot \left( X_{r_{2},g}-X_{r_{3},g} \right) . \end{aligned}$$
(18)

“DE/current-to-best/1”:

$$\begin{aligned} V_{i,g}=X_{i,g}+F\cdot \left( X_{best,g}-X_{i,g} \right) +F\cdot \left( X_{r_{1},g}-X_{r_{2},g} \right) . \end{aligned}$$
(19)

where indices \(r_{1}\), \(r_{2}\), \(r_{3}\), \(r_{4}\), \(r_{5}\) are random and mutually different integers generated in the range [1, NP], which must also differ from the index i of the current target vector. F is a factor in (0, 2] for scaling differential vectors, and \(X_{best,g}\) is the individual vector with the best fitness value in the population at generation \(g\).
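Since the six strategies (14)-(19) differ only in the base vector and the number of difference terms, they can be sketched compactly; the NumPy-based function below is illustrative only and assumes the index list already satisfies the distinctness constraints.

```python
import numpy as np

def de_mutation(strategy, X, i, best, F, idx):
    """Generate a mutant vector per (14)-(19). X is the (NP, dim) population
    array, 'best' is the index of X_best,g, and idx holds distinct random
    indices different from i."""
    r1, r2, r3, r4, r5 = idx[:5]
    if strategy == "rand/1":             # (14)
        return X[r1] + F * (X[r2] - X[r3])
    if strategy == "best/1":             # (15)
        return X[best] + F * (X[r1] - X[r2])
    if strategy == "rand/2":             # (16)
        return X[r1] + F * (X[r2] - X[r3]) + F * (X[r4] - X[r5])
    if strategy == "best/2":             # (17)
        return X[best] + F * (X[r1] - X[r2]) + F * (X[r3] - X[r4])
    if strategy == "current-to-rand/1":  # (18)
        return X[i] + F * (X[r1] - X[i]) + F * (X[r2] - X[r3])
    if strategy == "current-to-best/1":  # (19)
        return X[i] + F * (X[best] - X[i]) + F * (X[r1] - X[r2])
    raise ValueError(f"unknown strategy: {strategy}")
```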

If the value of some element in the newly generated mutant vector exceeds the corresponding upper or lower bound, we handle it with the following rules. For population a, if \( v_{j,i,g}^a < 0\), replace its value with 0; if \( v_{j,i,g}^a > x_{j}^{up}\), replace its value with a random integer in \( (0,x_{j}^{up}] \). For population b, if \( v_{j,i,g}^b < 0\), randomly replace its value with 0 or 1; if \( v_{j,i,g}^b > 1 \), replace its value with 1; if \( 0 \le v_{j,i,g}^b \le 1 \), set its value to 1 with probability \(v_{j,i,g}^b\) and to 0 otherwise.
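A sketch of these repair rules for a single element of the mutant vector is given below; the rounding of in-range values in population a and the Bernoulli reading of "with probability \(v_{j,i,g}^b\)" are assumptions about details the text leaves implicit.

```python
import random

def repair_a(v, upper_j):
    """Generalization part: keep the element inside [0, x_j^up]."""
    if v < 0:
        return 0
    if v > upper_j:
        return random.randint(1, upper_j)   # random integer in (0, x_j^up]
    return int(round(v))                    # rounding of in-range values is an assumption

def repair_b(v):
    """Suppression part: map the real-valued mutant element back to {0, 1}."""
    if v < 0:
        return random.randint(0, 1)
    if v > 1:
        return 1
    return 1 if random.random() < v else 0  # retain (1) with probability v, else suppress (0)
```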

4.3.3 Crossover operation

Subsequently, the “binomial” crossover operation is performed between the generated mutant vector \(V_{i,g}\) and its corresponding target vector \(X_{i,g}\), resulting in a trial vector \(U_{i,g}=\left( u_{1,i,g},u_{2,i,g},...,u_{n,i,g} \right) \).

$$\begin{aligned} u_{j,i,g}={\left\{ \begin{array}{ll} v_{j,i,g}, & \text {if } rand_{j}[0,1]\le CR_{de} \text { or } j=j_{rand} \\ x_{j,i,g}, & \text {otherwise} \end{array}\right. } \end{aligned}$$
(20)

where \(CR_{de}\) is a user-specified crossover constant in the range [0, 1) and \(j_{rand}\) is a randomly chosen element index over the same range as j, which guarantees that the trial vector \(U_{i,g}\) differs from its corresponding target vector \(X_{i,g}\) by at least one element. For population a, j takes on the values of 1 to n, and for population b, j ranges from 1 to r.
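A sketch of the binomial crossover in (20); here `dim` equals n for population a and r for population b, and `j_rand` forces at least one element to be taken from the mutant vector.

```python
import random

def binomial_crossover(target, mutant, CR_de):
    """Build the trial vector U from target X and mutant V per (20)."""
    dim = len(target)                       # n for population a, r for population b
    j_rand = random.randrange(dim)          # forces at least one element from the mutant
    return [
        mutant[j] if (random.random() <= CR_de or j == j_rand) else target[j]
        for j in range(dim)
    ]
```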

4.3.4 Selection operation

Next, the trial vectors’ fitness values are evaluated using (3) and (5). Then, a selection operation similar to the one described in [30] is performed. If neither the trial vector nor its corresponding target vector meets the privacy preservation requirement, the Individual with the lower AD is considered more competitive. If only the trial vector meets the requirement, it is considered more competitive. Finally, if both the trial vector and its corresponding target vector meet the privacy preservation requirement, the Individual with the higher TD value replaces the target vector and enters the population of the next generation. The mathematical expression of this operation is as follows:

$$\begin{aligned} X_{i,g+1}= {\left\{ \begin{array}{ll} U_{i,g}, & \text { if}\,\, AD(T(X_{i,g}))>AD(T(U_{i,g}))>t \\ U_{i,g}, & \text { if}\,\, AD(T(U_{i,g}))\le t,\, AD(T(X_{i,g}))> t \\ U_{i,g}, & \text { if}\,\, AD(T(U_{i,g}))\le t,\, AD(T(X_{i,g}))\le t,\, TD(T(U_{i,g}))>TD(T(X_{i,g})) \\ X_{i,g}, & \text { otherwise} \end{array}\right. } \end{aligned}$$
(21)

This process continues until all target vectors in \(P_g\{X_{i,g}^a,X_{i,g}^b\}\) have been considered. Finally, the updated population \(P_{g+1}\) and \(X_{best,g+1}\) are output as the result of the algorithm, capturing the improvements made through the iteration process.
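The selection rule (21) can be written as a small comparison function, assuming AD and TD have already been computed for both the trial and the target vector:

```python
def select(target, trial, ad_target, ad_trial, td_target, td_trial, t):
    """Return the vector that enters generation g+1, per (21)."""
    if ad_trial > t and ad_target > t:
        return trial if ad_trial < ad_target else target   # neither feasible: lower AD wins
    if ad_trial <= t and ad_target > t:
        return trial                                        # only the trial is feasible
    if ad_trial <= t and ad_target <= t:
        return trial if td_trial > td_target else target    # both feasible: higher TD wins
    return target                                           # only the target is feasible
```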

5 Experimental setup

This section describes the experimental setup we used in our study, including the dataset, hardware, software configurations, and specific steps taken to ensure reproducible and reliable results.

5.1 Dataset and test cases

This study employs the Hospital Inpatient Discharges 2015 database, an official database provided by the New York State Department of Health. The database consists of 34 attributes, including Health Service Area, Facility Name, Age Group, Zip Code, and more, and contains over two million records.

This paper extracted four datasets from the original database to validate the proposed approach. Each dataset has different characteristics in QIDs and SA. These datasets involve sensitive attributes such as whether a patient is at high risk for emergency hospitalization, diagnosed with a mental disorder, an HIV-infected individual, or diagnosed with bronchial or lung cancer.

Considering the potential impact of varying the number of QIDs, record counts, and the balance of sensitive attribute values on model performance, we randomly construct four test cases based on each dataset. These test cases differ in terms of the number of QIDs, record counts, and the proportion of classes for the sensitive attributes. For specific details, please refer to Table 1, which lists the number of sensitive attributes (nSA), the number of QIDs (nQID), and the record count (nR) of each test case.

Table 1 Properties of 16 test cases

In Figure 5, we present an analysis of the balance or imbalance of the sensitive attributes across the test cases. We examine diverse proportions of Class0 and Class1 for the sensitive attributes: spanning from roughly balanced (\(D_{9}\), \(D_{10}\), \(D_{11}\), \(D_{12}\)) to extremely imbalanced situations (\(D_{13}\), \(D_{14}\), \(D_{15}\), \(D_{16}\)). Additionally, intermediate states are explored within the spectrum of \(D_{1}\) to \(D_{8}\).

Fig. 5: Proportion of Class0 and Class1 in the sensitive attribute of each test case

5.2 Algorithm implementation

The algorithms in our study, including HAEF and the compared algorithms, are implemented in Python and executed on a local workstation running Windows 10 Pro. The workstation features an AMD Ryzen Threadripper PRO 3995WX CPU with 64 cores and a clock speed of 2.70 GHz, along with 256 GB RAM, providing ample computational resources for the experiments.

6 Experimental result

6.1 Comparison with existing approaches

To assess the efficacy of the proposed HAEF algorithm in achieving t-closeness, we conducted a series of experiments comparing it to three existing algorithms: DFS [36], ID-DGA [31], and TLEF [32]. DFS uses a traditional depth-first traversal search method, while ID-DGA and TLEF are advanced EAs developed for data anonymization.

The experiments included three different t-closeness thresholds: 0.1, 0.2, and 0.3. By testing the performance of each algorithm under varying thresholds, we are able to gain valuable insights into the effectiveness of the proposed method and how it compares to existing techniques.

6.1.1 Parameter settings

The experiments set the maximum fitness evaluation number as ten times the product of the number of quasi-identifiers and the number of records (\( 10\times nQID\times nR\)) for the above four methods. Please note that we do not care whether DFS has traversed all solutions; we only define its maximum number of evaluations to facilitate fair comparison with EA-based methods (ID-DGA, TLEF and HAEF). Therefore, the solution finally obtained by DFS may or may not be optimal.

The population size (NP) for all three EAs is set to 30. For ID-DGA, TLEF and HAEF, the GA-related crossover rate \(CR_{ga}\) is set to 0.5 and the mutation rate \(MR_{ga}\) to 0.2. For TLEF and HAEF, the DE-related scaling factor (F) is set to 1.3 and the crossover rate (\(CR_{de}\)) to 0.3. In addition, the UpdateInterval parameter, which controls the update frequency of the HAEF adaptive strategy, is set to ten generations. These parameter settings provide a standardized framework for evaluating the performance and behavior of each algorithm in our experiments.

6.1.2 TD comparison

The experimental results are summarized in Tables 2, 3, and 4, which showcase the performance of the different algorithms under a fixed evaluation budget. The TD values are achieved at t-closeness thresholds of 0.1, 0.2, and 0.3, respectively. Each table displays the average (Avg) and standard deviation (Std) of TD over 25 independent runs for ID-DGA, TLEF and HAEF on the 16 test datasets. DFS only needs to be run once, and its single result is listed.

Table 2 TD comparison when t=0.1

In the tables, bold text highlights the maximum TD mean for each dataset. The \(^\dagger \) symbol indicates statistical significance according to the Wilcoxon rank-sum test at the 0.05 level. Based on the results presented in the three tables, it is evident that the proposed HAEF algorithm has significant advantages under various privacy constraints.

Table 3 TD comparison when t=0.2

Compared to the baseline algorithms DFS, ID-DGA, and TLEF, the proposed HAEF method demonstrates significant advantages under various privacy constraints. At \(t=0.1\), except for \(D_{5}\), \(D_{7}\), \(D_{9}\), and \(D_{11}\), HAEF achieves the highest average TD on 12 out of 16 test datasets. Additionally, HAEF statistically outperforms the control group in 9 of these test datasets (excluding \(D_{3}\), \(D_{6}\), and \(D_{15}\)). At \(t=0.2\), the HAEF algorithm achieves the highest average TD on 14 of 16 test datasets, except for \(D_{5}\) and \(D_{9}\). From a significance testing perspective, HAEF performs better than the baseline in 13 test datasets (excluding \(D_{3}\)). At \(t=0.3\), HAEF attains the highest TD mean on most test datasets (excluding \(D_{5}\) and \(D_{14}\)) and exhibits significant advantages in 11 datasets (excluding \(D_{1}\), \(D_{3}\), \(D_{9}\) and \(D_{11}\)). This underscores the substantial impact of HAEF’s performance in the majority of the test datasets.

Table 4 TD comparison when t=0.3
Fig. 6: The overall performance improvement of HAEF over the other three baseline models at different t values

Figure 6 shows the overall performance improvement (comparing the sum of TD values across the 16 test cases) of HAEF over the other three algorithms at different t values. When \(t=0.1\), HAEF outperforms DFS, ID-DGA, and TLEF by 55.41%, 48.31%, and 11.75%, respectively; when \(t=0.2\), the improvements are 43.14%, 37.38%, and 7.91%; and when \(t=0.3\), they are 44.79%, 37.59%, and 5.39%. Under the three privacy constraints, HAEF’s overall performance compared to DFS, ID-DGA, and TLEF improves by an average of 47.78%, 41.09%, and 8.35%, respectively.

Looking at the overall trend, as t increases, the advantage of HAEF is diminished, especially when compared to TLEF. This is because a larger t value implies looser privacy constraints and lower optimization complexity. Both HAEF and TLEF use the same evolutionary methods, namely GA and DE. However, TLEF combines GA and DE using a mechanical parity-interleaved approach, while HAEF intelligently selects the evolutionary path.

6.1.3 Convergence curves

This subsection presents convergence curves of DFS, ID-DGA, TLEF and HAEF at \(t=0.2\) in Figures 7 and 8. These curves offer valuable insights into the algorithms’ performance and visually illustrate how the algorithms advance iteratively, giving us a better understanding of their convergence behavior and optimization capabilities.

Fig. 7: Comparison of the convergence curves of HAEF and the comparison algorithms on the 9 test cases that achieve the maximum TD mean

As shown in Figures 7 and 8, each sub-figure features a legend in the lower right corner, showcasing the algorithms through distinct symbols and varying colors. The horizontal axis denotes the number of fitness evaluations (NFEs), while the vertical axis corresponds to the value of TD.

The convergence curves displayed in Figure 7 demonstrate multiple instances where HAEF outperforms the other algorithms, such as \(D_4\), \(D_6\), \(D_7\), \(D_8\), \(D_{10}\), \(D_{12}\), \(D_{13}\), \(D_{14}\) and \(D_{15}\). In most cases, the results of HAEF and TLEF are significantly superior to DFS and ID-DGA, with HAEF displaying exceptional performance overall, particularly on \(D_{6}\), \(D_{12}\), \(D_{13}\), and \(D_{15}\). We also observe that HAEF lags significantly behind the other algorithms in the initial iteration phase. For example, compared to TLEF, which uses GA and DE in a balanced manner at each stage, HAEF’s convergence speed is not dominant at the beginning of evolution. This is likely due to the intensive use of GA at the start of evolution: while this may slow the convergence rate, it makes the overall algorithm better suited to finding the global optimum, thanks to GA’s exploration capabilities.

In Figure 7, DFS shows fewer results in some subfigures (e.g., b, e, h). This is because DFS has traversed all solutions within the maximum number of evaluations to obtain the optimal solution.

Figure 8 shows the performance of HAEF on the datasets where it did not achieve the maximum TD mean (\(D_{5}\), \(D_{9}\)) or a statistically significant best result (\(D_{3}\)). Although HAEF did not achieve the maximum value, it provided final results comparable to the methods that did, which shows the robustness of HAEF.

Fig. 8: Comparison of the convergence curves of HAEF and the comparison algorithms on three test cases that did not achieve the maximum TD mean

6.1.4 Population diversity

To further analyze the changes in the population diversity of the HAEF method during the evolution process, this section uses Euclidean distance and Hamming distance to represent the diversity of the G population and S population, respectively. A larger Euclidean distance or Hamming distance between two individuals of the population means that the difference between two individuals is more significant, and vice versa [40]. The sum of the Euclidean distance or Hamming distance between each pair of individuals in the population indicates the diversity of the population.
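A sketch of the diversity measures used here, namely the sum of pairwise Euclidean distances for the G population and pairwise Hamming distances for the S population (helper names are illustrative):

```python
import math
from itertools import combinations

def euclidean_diversity(pop_a):
    """Sum of pairwise Euclidean distances over the generalization (G) population."""
    return sum(
        math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))
        for p, q in combinations(pop_a, 2)
    )

def hamming_diversity(pop_b):
    """Sum of pairwise Hamming distances over the suppression (S) population."""
    return sum(
        sum(x != y for x, y in zip(p, q))
        for p, q in combinations(pop_b, 2)
    )
```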

This paper calculates the changing trend of the diversity of the G and S populations of HAEF and the compared EAs (ID-DGA and TLEF) on each test set when t=0.2 as the number of fitness evaluations increases. As shown in Figure 9, three typical changing trends (\(D_{3}\), \(D_{6}\) and \(D_{16}\)) of each algorithm are listed. The subgraphs a, b, and c respectively represent the Euclidean distance of the G population of each algorithm on the test set \(D_{3}\), \(D_{6}\), and \(D_{16}\), and the subgraphs d, e, and f represent the Hamming distance of the corresponding S population. As can be seen from the figure, regardless of population G or population S, the HAEF proposed in this paper can maintain higher population diversity than the compared EAs in the early and middle stages of evolution. A higher population diversity can enhance exploratory search and help identify promising search regions.

Fig. 9: Comparison of the population diversity curves of HAEF and the compared algorithms on three typical test cases

6.2 Ablation test

This subsection examines the efficacy of various strategies employed in HAEF using ablation experiments. Three comparison algorithms were obtained by partially preventing the relevant strategies from taking effect, whose performance is shown in Table 5. In the table, the bold text highlights the maximum TD mean for datasets and the symbol \(^\dagger \) indicates the statistical significance test results using the Wilcoxon rank-sum test at the 0.05 level.

Among them, the ‘without priority strategy’ approach disables the GA-prioritized adaptation strategy in the GA-DE adaptation hierarchy and the random-prioritized adaptation in the DEs adaptation hierarchy. The ‘without GA-DE adaptation’ scheme employs only the six DE mutation strategies for adaptive evolution. Finally, the ‘without DEs adaptation’ approach discards the other five DE variants and uses only the ‘DE/best/1’ mutation strategy together with GA in the adaptive evolution.

In the ablation test, HAEF achieved the highest average TD on 13 out of 16 test sets and had a significant advantage on 11. When comparing ‘without GA-DE adaptation’ and ‘without DEs adaptation’, each had advantages and disadvantages, but the DE strategies are found to be more effective than GA in solving the optimal t-closeness problem. By combining GA and DEs with the two-hierarchy adaptive strategy, the ‘without priority strategy’ scheme performed better on more datasets than the ‘without GA-DE adaptation’ scheme (which only uses DEs adaptation) and the ‘without DEs adaptation’ scheme, achieving the best value on several datasets and ranking as the second most competitive scheme overall. When comparing the first two sets of data in Table 5, HAEF performed better than ‘without priority strategy’ on most sets, indicating that the GA-prioritized and random-prioritized adaptation strategies proposed in this paper further improve the performance of the algorithm.

Table 5 Comparison of TD values across HAEF ablation trials when \(t=0.2\)

7 Implication and limitation

The implication of this paper is significant from both academic and industry perspectives. From the academic standpoint, the paper proposes a high-performance hierarchical adaptive evolution framework. Given its advantages in optimization performance and versatility, it is worth utilizing in other data-driven scenarios. Additionally, the proposed adaptive strategies can be embedded into existing state-of-the-art evolutionary algorithms for further improvement. For the industry, this paper showcases a practical and effective application of evolutionary computation in data anonymization, which could be integrated into existing data publishing systems.

While the proposed HAEF method in this paper has made significant progress compared to the baseline algorithms, one limitation is that it only considers the static privacy release of data, making it inapplicable in dynamic data publishing scenarios. A potential future work could be to adapt the proposed HAEF method to dynamic data systems.

8 Conclusion

In conclusion, the increasing demand for data release and the simultaneous need for privacy protection have emphasized the importance of PPDP. This problem remains challenging, especially for complex datasets where traditional traversal search methods are ineffective. Although EAs show promise in tackling this challenge, they require refinement when applied to PPDP. This study introduces a new approach, HAEF, which optimizes the t-closeness anonymity method by utilizing attribute generalization and record suppression. HAEF’s innovative two-layered design, consisting of a GA-prioritized adaptive strategy in the first layer and a ‘random’-prioritized adaptive strategy in the second layer, enhances exploration and search capabilities. Benchmark tests demonstrate HAEF’s superiority over the traditional depth-first traversal search algorithm and its outperformance of existing algorithms such as ID-DGA and TLEF on most datasets. Ablation experiments further confirm the effectiveness of the individual strategies within the framework. The proposed framework significantly improves data release efficiency, ensuring privacy, security, and maximum availability. Future research could explore extending HAEF’s application to other privacy-preserving techniques and evaluating its scalability on larger datasets.