DFTMicroagg: a dual-level anonymization algorithm for smart grid data

Adewole, Kayode S.; Torra, Vicenç

doi:10.1007/s10207-022-00612-8

Regular contribution
Open access
Published: 07 September 2022

Volume 21, pages 1299–1321, (2022)

International Journal of Information Security


Abstract

The introduction of advanced metering infrastructure (AMI) smart meters has given rise to fine-grained electricity usage data at different levels of time granularity. AMI collects high-frequency daily energy consumption data that enable utility companies and data aggregators to perform a rich set of grid operations such as demand response, grid monitoring, load forecasting and many more. However, privacy concerns associated with daily energy consumption data have been raised. Existing studies on data anonymization for smart grid data focused on the direct application of perturbation algorithms, such as microaggregation, to protect the privacy of consumers. In this paper, we empirically show that reliance on microaggregation alone is not sufficient to protect smart grid data. Therefore, we propose the DFTMicroagg algorithm, which provides a dual level of perturbation to improve privacy. The algorithm combines the discrete Fourier transform (DFT) with microaggregation to provide an additional layer of protection. We evaluated our algorithm on two publicly available smart grid datasets with millions of smart meter readings. Experimental results based on clustering analysis using k-Means, classification via the k-nearest neighbor (kNN) algorithm and mean hourly energy consumption forecasting using the Seasonal Auto-Regressive Integrated Moving Average with eXogenous factors (SARIMAX) model further demonstrate the applicability of the proposed method. Our approach gives utility companies more flexibility to control the level of protection for their published energy data.


1 Introduction

For decades, the conventional electricity grid, which consists of power generation and distribution systems, has supplied electricity to consumers under monthly billing arrangements. This type of grid is characterized by one-way communication and a lack of interaction between customers and the utility company, which leads to issues such as loss of energy and poor peak load management [1, 2]. Advances in technology have since brought about the rollout of advanced metering infrastructure (AMI) smart meters, which improve on the traditional energy grid. AMI offers advantages such as effective communication between consumer and utility, increased reliability, resilience and better control of demand response load management [3, 4]. With the advancement in smart grid technology, the collection of fine-grained daily electricity usage data at different levels of time granularity has grown rapidly. Fine-grained electricity consumption data enable utility companies to perform robust grid operations such as demand response, grid monitoring, consumer profiling, customer segmentation, energy usage prediction, load forecasting and many more [5, 6].

Owing to the benefits offered by AMI smart meters, the European Union (EU) planned to install 225 million smart meters for electricity and 51 million for gas in 2024. In that year, almost 77% of European electricity consumers are expected to have smart meters [7]. Similarly, the UK government planned to install 53 million smart meters while the USA plans to roll out 90 million smart meters as of 2020 [1, 3]. As an additional benefit, the smart grid also enables consumers to actively manage their energy usage and control their energy bills. Moreover, besides their use by utility companies, electricity consumption data may be shared with third-party service providers and researchers to provide more insight into electricity consumption. However, fine-grained electricity consumption data capture privacy-sensitive consumer behaviors that can reveal the general habits and lifestyles of households [4, 8]. Consequently, sharing fine-grained electricity usage data in its original form has been shown to violate the security and privacy of electricity customers.

Fine-grained electricity usage data are valuable and may be sought by many entities, including attackers who want to deduce the type of device or appliance in use at any given time. A specific research field, non-intrusive load monitoring for appliance (NILMA), relies on electricity consumption data to extract detailed information about consumers from their domestic appliance usage patterns. The goal of NILMA research is to deduce the types of appliances used in a house, along with their energy consumption, from a detailed analysis of the current and voltage of the total load [9, 10]. The information obtained through this analysis is useful to third parties such as marketers, law enforcement, and criminals [11, 12]. For instance, electricity blackouts caused by hacking were reported in Ukraine in 2015, 2016 and January 2017, where hackers were able to shut down energy systems that supply heat and light to millions of households [1, 13].

As countermeasures against NILMA and re-identification or de-pseudonymization attacks, different solutions have been proposed, including cryptographic approaches, differential privacy, rechargeable batteries for obfuscating smart meter readings, data aggregation based on a trusted third party (TTP), and data anonymization and perturbation [1–3, 6, 10, 12, 14–16]. The cryptographic approach involves the development of encryption protocols that encrypt smart meter data at the point of generation so that it is difficult to determine the consumption of a specific household. It includes both traditional and homomorphic encryption schemes [17, 18]. By traditional encryption we refer to encryption schemes that do not allow computation on encrypted data. This method can provide a high level of security and privacy before the data are transmitted to the utility company. However, it is not an efficient method for publishing energy data needed for research purposes and complex data analytics, because no information is released in the published data for complex statistical analysis [19]. Differentially private (DP) algorithms have been used to publish electricity consumption data [6]. However, previous studies have observed that for high-dimensional time series data, DP often adds so much noise that data utility becomes unsatisfactory [12, 14].

Battery-based load hiding (BLH) has been proposed in [2, 20]. The goal of the BLH approach is to mask smart meter readings by utilizing a rechargeable battery. This approach has been mainly theoretical, and a successful real-world application is yet to be developed [2]. Data aggregation based on a TTP was proposed in [10]. This method relies on the TTP to aggregate smart meter readings. The aggregated readings are then transmitted to the utility company for workload balancing and statistical analysis. However, as stated by the authors, this approach traded security for privacy; hence, practical application of data anonymization should be extended to improve this method. To provide data anonymization and perturbation of smart meter readings, [12, 14] introduced the PAD system. PAD directly applies microaggregation using the k-ward algorithm to anonymize daily energy consumption data. However, in our study, we empirically show that reliance on microaggregation alone is not sufficient to protect smart grid data against disclosure risk.

In this paper, a dual-level anonymization algorithm, DFTMicroagg, is proposed to reduce the disclosure risk of the microaggregation algorithm when used to protect energy data. To achieve this goal, we first conducted an experiment to ascertain the privacy value offered by microaggregation when used to protect smart grid data. Based on our findings, we extended this model by combining the discrete Fourier transform (DFT) with microaggregation to improve privacy. We show that the proposed approach maintains promising data utility by experimenting with three major data mining tasks: clustering analysis using k-Means, classification via the kNN algorithm and mean hourly load forecasting using the SARIMAX model. In addition, we compute information loss (IL) to understand how much information is lost due to the dual-level perturbation process. To the best of our knowledge, this is the first paper to extensively investigate the application of DFT and microaggregation to smart grid data protection. Additionally, we investigate two record linkage attacks on the protected smart grid data, based on distance-based record linkage and interval disclosure risk. In summary, the contributions of this paper are as follows:

Investigate the actual privacy value offered by microaggregation for protecting smart grid data.
Propose a dual-level anonymization algorithm, which combines DFT with microaggregation.
Implement two adversarial models using distance-based record linkage and interval disclosure risk. Specifically, we propose a distance-based record linkage algorithm that considers not only the nearest record to the masked record being linked but also the second nearest record.
Conduct extensive experiments on smart grid data with millions of smart meter readings.

The remaining parts of this paper are organized as follows: Sect. 2 discusses related work on smart grid data protection. Section 3 provides detailed information on k-anonymity and the attack model assumed in the previous work for protecting smart grid data. Section 4 presents the proposed approach as well as the adversarial models considered in our study. Section 5 focuses on the experimental setup, and Sect. 6 presents results and discussion. Finally, Sect. 7 concludes the paper and highlights future research directions.

2 Related work

The literature on privacy-preserving data publishing is vast and different research domains have been extensively studied. [21] presented an algorithm to publish dynamic datasets and compared their results with maximum distance to average vector (MDAV) microaggregation algorithm. Microaggregation procedure has also been extended to time series data in [22] where the authors evaluated the performance of two distance metrics: Euclidean distance and Short Time Series (STS) distance. An empirical comparison of disclosure risk control methods for microdata has been extensively studied [19]. [23] presented the foundation, new development and challenges of data privacy preserving. Nevertheless, in the domain of smart grid, privacy-preserving energy data has been studied from different dimensions. These include methods based on cryptography, differential privacy, BLH, data aggregation based on TTP, data compression, and data anonymization and perturbation [1].

Cryptographic methods involve the development of encryption protocols that encrypt smart meter data at the point of generation so that it is difficult to determine the consumption of a specific household from the data. This method can provide some level of security and privacy before transmission to the utility company. For instance, [16, 17] proposed similar approaches based on symmetric encryption algorithms and hashing. In these methods, lightweight cryptographic protocols encrypt smart meter data before transmission to the utility company. Similarly, cryptographic approaches that allow computation on encrypted data based on homomorphic schemes have also been studied [18, 24]. The major challenge with cryptographic methods when used for privacy-preserving data publishing is that no information is released in the published data for research purposes [19]. Therefore, they are not suitable for publishing smart grid data that require complex statistical analysis.

Differentially private (DP) algorithms have been studied for smart grid data [6, 9]. However, previous studies have observed that for high-dimensional time series data, DP often adds too much noise that can lead to unsatisfactory data utility [12, 14]. BLH has been proposed in [2, 20]. The goal of BLH is to install a battery at the consumer end, which can be charged or discharged to make the electricity meter incapable of precisely obtaining the consumption data of electric appliances and to obfuscate the actual consumption of the electric appliances [25]. This masking method is mainly theoretic and its empirical validation for real-world application is still a major concern [2].

[10] proposed data aggregation method that relies on TTP aggregation of smart meter reading. This approach assumed that utility companies only need to protect data that is collected at high-frequency (HF) without attributing to specific consumers while the low-frequency (LF) smart meter data are transmitted to TTP for aggregation. However, as stated by the authors, this approach traded security for privacy; hence, practical application of data anonymization should be extended to improve this method. A similar assumption was made to evaluate the performance of de-anonymization algorithms in [8, 26].

Data compression of smart meter reading has been investigated. The idea is that storage requirement and transmission overhead can be greatly reduced using data compression algorithms. [27] conducted an extensive study of the effect of applying different compression algorithms on smart meter data. The algorithms investigated are wavelet transform, symbolic aggregate approximation (SAX), principal component analysis (PCA), singular value decomposition (SVD), dimensionality reduction via linear regression, Huffman coding and Lempel–Ziv (LZ) algorithm. Nevertheless, this study established that finding an appropriate balance between efficiency and loss ratio is not a trivial issue when applying compression algorithms on smart meter data. Similar findings have also been presented in [28, 29] based on smart meter data compression.

Generative adversarial network (GAN) and additive correlated noise have been studied to protect smart meter consumption data [30, 31]. One of the benefits of GAN is its ability to model the uncertainties of original data and based on this model a new data is generated, which can be used for grid operations such as planning and scheduling. Two deep neural networks are usually trained: one to capture the distribution of the data and the other to estimate the probability that the input originates from the real data. This approach is promising to protect energy consumption data; however, its capability to prevent disclosure risk attacks is missing in the literature.

A smart grid and building occupancy data publishing system (PAD) was proposed in [12, 14]. This approach follows k-anonymity, which is assumed to guarantee some level of privacy. k-anonymity has received wide attention as one of the suitable conditions that data protection algorithms must satisfy to prevent record linkage. In PAD [14], a linear distance metric was learned to capture the data user's specific task. A modified version of this approach was presented in [12], where nonlinear distance metric learning was formulated based on a deep neural network. The goal of PAD is to learn the user's specific task by asking the data analyst to manually annotate energy data, in order to determine the specific data utility that satisfies the analyst's objective. The annotated data are then passed to the k-ward microaggregation algorithm for privacy protection. However, asking data users to manually annotate large time series energy data is not a trivial task. In this study, we show that reliance on microaggregation alone is not sufficient to protect daily energy consumption data against disclosure risk.

3 k-anonymity and attack model assumption

In this section, we briefly present the concept of k-anonymity as well as the attack model that was assumed in the previous work [12] for protecting energy data, which forms the basis for conducting our investigative study.

3.1 k-anonymity and microaggregation

k-anonymity is not a protection method on its own but a condition that protected data should satisfy to guarantee the privacy of the individuals in the masked data. The k-anonymity concept was originally proposed in the context of privacy protection for relational databases [32–34]. The goal of k-anonymity is to ensure that each individual in a protected dataset cannot be identified within a set of k individuals. This means that the dataset is partitioned into groups of at least k indistinguishable records. One way to enforce k-anonymity on the protected data is to use a microaggregation algorithm [35].

Generally, microaggregation protects a dataset in two steps: k-partition and aggregation. Suppose X represents the input data to be protected and $\hat{X}$ is the protected data after applying microaggregation. The two steps are described as follows:

Step 1 (k-partition): All records in X are partitioned into g clusters, each consisting of k or more records.

Step 2 (aggregation): Compute a representative (i.e., centroid) for each of the clusters in g and use this centroid to replace the original records in the cluster. This means that all the k records in the cluster are replaced with the same value; hence, k-anonymity is guaranteed.

At the k-partition step, it is important to ensure that the in-group distance between each cluster element and its centroid is minimized. This enforces homogeneity and thereby minimizes information loss. To achieve this, the sum of squared error (SSE) criterion in Eq. (1) is minimized. Formally, let $u_{ij}$ describe the clustering of records in X such that $u_{ij} = 1$ if record j is assigned to the ith cluster and $u_{ij} = 0$ otherwise. Suppose $v_i$ is the centroid of the ith cluster; then homogeneity is enforced by,

$$\begin{aligned} \textit{Minimize SSE} = \sum _{i=1}^{g} \sum _{j=1}^{n} u_{ij}{(d(x_j,v_i))}^2 \end{aligned}$$

(1)

$$\begin{aligned} \textit{Subject to:} \sum _{i=1}^{g} u_{ij} = 1 \, \forall \, j = 1,2,\dots ,n \\ 2k \ge \sum _{j=1}^{n} u_{ij} \ge k \, \forall \, i = 1,2,\dots ,g \\ u_{ij} \in \{ {0,1} \} \end{aligned}$$

If X is numerical, Euclidean distance is usually chosen as the distance metric d(x, v) in Eq. (1). Several versions of the microaggregation algorithm have been studied in the literature, including maximum distance (MD), maximum distance to average vector (MDAV), variable-size maximum distance to average vector (V-MDAV) and k-ward [12, 35–37]. In this study, we implemented MDAV as an additional layer to DFT due to its performance and wide adoption in the literature [36]. MDAV is described in Algorithm 1 as adapted from [37].
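As a small illustration of Eq. (1), the sketch below evaluates the SSE of a given partition with Euclidean distance and checks the cluster-size constraint. The function names `sse` and `satisfies_size_constraint` and the toy data are assumptions made for this example; they do not come from the paper.

```python
import numpy as np

def sse(X, labels):
    """Sum of squared Euclidean distances from each record to its cluster centroid."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for g in np.unique(labels):
        members = X[labels == g]
        centroid = members.mean(axis=0)          # v_i in Eq. (1)
        total += ((members - centroid) ** 2).sum()
    return total

def satisfies_size_constraint(labels, k):
    """Check that every cluster holds between k and 2k records."""
    _, sizes = np.unique(labels, return_counts=True)
    return bool(np.all((sizes >= k) & (sizes <= 2 * k)))

# Hypothetical toy data: four records, two clusters of size two.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
labels = np.array([0, 0, 1, 1])
toy_sse = sse(X, labels)
toy_ok = satisfies_size_constraint(labels, k=2)
```

Here each record lies at distance 1 from its centroid, so the partition has an SSE of 4 and meets the size constraint for k = 2.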

3.2 Attack model assumption

This section presents the attack model assumed in the previous study [12] for protecting energy data. It forms the basis for our investigative study to ascertain the actual privacy value offered by k-anonymity and microaggregation when used to protect energy data. For the sake of clarity, suppose we have energy data in which each record (row) is the daily energy consumption of a particular household or consumer, sampled at a specific time interval (e.g., 1 second, 5 minutes, 1 hour, etc.). Each column corresponds to the timestamp of the day at which the energy was consumed. A household will have multiple records, depending on the coverage of the dataset under consideration. As discussed earlier, this data can reveal the general habits and lifestyle of a household if published in its original form. By assumption, applying k-anonymity to this data guarantees indistinguishability among k households and hence stronger privacy. This attack scenario is presented in Fig. 1. In Fig. 1a, an attacker can infer private information about each household simply by studying the unprotected data, because each household's consumption pattern is distinct. In Fig. 1b, where 2-anonymity is applied to protect the data, it is difficult for an attacker to distinguish the consumption traces, since two households share the same trace in the protected data.

However, the same household can have very similar energy consumption traces from day to day, so the 2-anonymous traces in Fig. 1b may point to the same household, thereby leading to successful record linkage. Therefore, it is worth investigating the actual privacy value offered by k-anonymity and microaggregation for protecting this type of data. In our study, we empirically show the actual privacy value provided by this protection procedure by considering two types of disclosure risk attacks, using distance-based record linkage and interval disclosure. Our findings show that the disclosure risk of k-anonymous energy consumption data with direct application of microaggregation is high, and that it can be reduced further using the approach proposed in this paper without compromising the utility of the data for research and analytical purposes.

4 Proposed approach

As shown in Fig. 2, this paper presents two ways in which energy data can be protected. The time series data are first converted to the form described in Sect. 3.2. This form is termed interval-based representation in Fig. 2 for standard representation. The first protection method directly applies microaggregation to the data to produce the masked data. The second approach first applies DFT to the data and then applies the microaggregation algorithm. For each protection procedure, we check the utility and privacy values it offers. Based on the outcomes, the utility company decides whether to publish the protected data for research and analytical purposes. Section 4.2 presents an overview of the MDAV algorithm and Sect. 4.3 details the components of the proposed DFTMicroagg algorithm. In Sect. 6, we show how the proposed DFTMicroagg algorithm reduces disclosure risk while maintaining a high level of data utility.

4.1 Discrete Fourier transform

The discrete Fourier transform (DFT) converts a finite sequence of equally spaced samples of a function into a sequence of the same length containing the coefficients of a finite combination of complex sinusoids, which is a complex-valued function of frequency [38, 39]. This property of DFT enables us to efficiently determine the loss and gain of the DFT approach by comparing the microaggregated version of the DFT-anonymized data with the original input data.

An inverse DFT (IDFT) is a Fourier series that uses the DFT samples as coefficients of complex sinusoids at the corresponding DFT frequencies. To provide an additional level of masking, instead of reproducing the original input sequence through the IDFT, we modified the DFT coefficients as described in Sect. 4.3. The fast Fourier transform (FFT) is a fast algorithm for computing the DFT and has been widely used in different domains [38]. In this study, we implemented FFT as an additional layer to the microaggregation algorithm to provide dual-level masking of the energy data.

Formally, a one-dimensional DFT converts a sequence of N complex numbers $\{x_n\} = x_0,x_1,x_2,\dots ,x_{N-1}$ to another sequence of complex numbers $\hat{\{x _k\} } = \hat{x_0},\hat{x_1},\hat{x_2},\dots ,\hat{x}_{N-1}$ such that,

$$\begin{aligned} \hat{x}_{k} = \sum _{n=0}^{N-1} {x_n} . e^{- \frac{i2\pi }{N}{kn}} \end{aligned}$$

(2)

The transformation to the complex-valued function of frequency is also denoted as $\hat{x} = F(x)$. The inverse of one-dimensional DFT for a sequence of N complex numbers is given by,

$$\begin{aligned} x_n = \frac{1}{N} \sum _{k=0}^{N-1} {\hat{x}_k} . e^{ \frac{i2\pi }{N}{kn}} \end{aligned}$$

(3)
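Equations (2) and (3) can be checked numerically. The sketch below evaluates both sums directly as matrix products and compares the forward transform against NumPy's FFT routine; the function names `dft` and `idft` and the sample sequence are assumptions chosen for illustration.

```python
import numpy as np

def dft(x):
    """Direct evaluation of Eq. (2): x_hat[k] = sum_n x[n] * exp(-2*pi*i*k*n/N)."""
    x = np.asarray(x, dtype=complex)
    N = x.size
    n = np.arange(N)
    k = n.reshape(-1, 1)
    W = np.exp(-2j * np.pi * k * n / N)   # N x N matrix of complex sinusoids
    return W @ x

def idft(x_hat):
    """Direct evaluation of Eq. (3), including the 1/N factor."""
    x_hat = np.asarray(x_hat, dtype=complex)
    N = x_hat.size
    k = np.arange(N)
    n = k.reshape(-1, 1)
    W = np.exp(2j * np.pi * n * k / N)
    return (W @ x_hat) / N

# Hypothetical six-sample sequence.
x = np.array([1.0, 2.0, 0.5, 3.0, 2.5, 1.0])
x_hat = dft(x)
x_back = idft(x_hat)
```

The direct sums cost O(N^2) operations, which is why an FFT is preferred in practice; `numpy.fft.fft` uses the same sign and scaling convention as Eq. (2).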

Suppose n is split into even and odd indexed terms such that $n=2r$ for even and $n=2r+1$ for odd, where $r = 0,1,\dots ,\frac{N}{2}-1$. Then Eq. (2) can be computed concurrently in terms of even and odd terms such that,

$$\begin{aligned} \hat{x}_{k} = \sum _{r=0}^{\frac{N}{2}-1} {x_{(2r)}} . e^{- \frac{i2\pi }{N}{k}{(2r)}} + \sum _{r=0}^{\frac{N}{2}-1} {x_{[2r+1]}} . e^{- \frac{i2\pi }{N}{k}{(2r+1)}}\nonumber \\ \end{aligned}$$

(4)

$$\begin{aligned} \hat{x}_{k} = \sum _{r=0}^{\frac{N}{2}-1} {x_{(2r)}} . e^{- \frac{i2\pi }{N}{k}{(2r)}} + e^{- \frac{i2\pi }{N}{k}}\sum _{r=0}^{\frac{N}{2}-1} {x_{[2r+1]}} . e^{- \frac{i2\pi }{N}{k}{(2r)}}\nonumber \\ \end{aligned}$$

(5)

$$\begin{aligned} \hat{x}_{k} = \sum _{r=0}^{\frac{N}{2}-1} {x_{(2r)}} . e^{- \frac{i2\pi }{N/2}{k}{(r)}} + e^{- \frac{i2\pi }{N}{k}}\sum _{r=0}^{\frac{N}{2}-1} {x_{[2r+1]}} . e^{- \frac{i2\pi }{N/2}{k}{(r)}}\nonumber \\ \end{aligned}$$

(6)
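The even/odd split in Eqs. (4) to (6) is the basis of the radix-2 FFT: each half-length sum is itself a DFT of length N/2. A minimal recursive sketch follows, assuming the input length is a power of two; the function name `fft_radix2` is hypothetical, and production FFT libraries use more elaborate iterative variants.

```python
import numpy as np

def fft_radix2(x):
    """Recursive radix-2 FFT following the even/odd decomposition of Eq. (6)."""
    x = np.asarray(x, dtype=complex)
    N = x.size
    if N == 1:
        return x
    if N % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2(x[0::2])   # DFT of the even-indexed samples (n = 2r)
    odd = fft_radix2(x[1::2])    # DFT of the odd-indexed samples (n = 2r + 1)
    k = np.arange(N // 2)
    twiddle = np.exp(-2j * np.pi * k / N)   # factor exp(-i*2*pi*k/N) in Eq. (6)
    # The half-length DFTs are periodic in k with period N/2, and the twiddle
    # factor changes sign for k + N/2, which yields the second half of the output.
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])

# Hypothetical eight-sample sequence.
samples = np.array([0.3, 1.2, 0.8, 2.1, 1.7, 0.4, 0.9, 1.1])
spectrum = fft_radix2(samples)
```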

Similarly, a two-dimensional DFT of discrete sequence f(x, y) of size $M \times N$ is given by,

$$\begin{aligned} F(u,v) = \frac{1}{MN} \sum _{x=0}^{M-1} \sum _{y=0}^{N-1} f(x,y) . e^{-i2\pi (ux/M + vy/N)} \end{aligned}$$

(7)

where F(u, v) is the frequency component of the discrete function f(x, y), u and v are the frequency variables in DFT, and x and y are the spatial variables in the input space. The inverse of Eq. (7) is given by,

$$\begin{aligned} f(x,y) = \sum _{u=0}^{M-1} \sum _{v=0}^{N-1} F(u,v) . e^{i2\pi (ux/M + vy/N)} \end{aligned}$$

(8)
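Note that Eqs. (7) and (8) place the 1/(MN) factor on the forward transform, whereas `numpy.fft.fft2` leaves the forward transform unscaled and puts the factor on the inverse. The sketch below adapts NumPy's routines to the convention used here; the wrapper names and the toy array are assumptions for illustration.

```python
import numpy as np

def dft2_normalized(f):
    """Eq. (7): forward 2-D DFT scaled by 1/(M*N)."""
    f = np.asarray(f, dtype=complex)
    M, N = f.shape
    return np.fft.fft2(f) / (M * N)

def idft2_unnormalized(F):
    """Eq. (8): inverse 2-D DFT without the 1/(M*N) factor."""
    F = np.asarray(F, dtype=complex)
    M, N = F.shape
    return np.fft.ifft2(F) * (M * N)

# Hypothetical 3 x 4 input.
f = np.arange(12, dtype=float).reshape(3, 4)
F = dft2_normalized(f)
f_back = idft2_unnormalized(F)
```

With this scaling, the zero-frequency component F(0, 0) equals the mean of f.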

4.2 MDAV microaggregation

As discussed in Sect. 3, there are several algorithms for microaggregation. This study adapts MDAV [37] as an additional layer to DFT due to its performance and wide adoption in the literature [36]. Algorithm 1 describes the stages involved in MDAV.
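For orientation, the following is a minimal sketch of the MDAV heuristic as it is commonly described in the microaggregation literature: repeatedly build one group of k records around the record farthest from the current centroid and another around the record farthest from that one, then handle the leftovers. It is not a transcription of Algorithm 1; the function name `mdav`, the squared Euclidean distance and the tie-breaking are assumptions.

```python
import numpy as np

def mdav(X, k):
    """Return (masked data, cluster labels) using a simple MDAV-style heuristic."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if k < 1 or n < k:
        raise ValueError("need 1 <= k <= number of records")
    remaining = np.arange(n)
    labels = np.empty(n, dtype=int)
    cluster_id = 0

    def farthest_from(point, pool):
        d = ((X[pool] - point) ** 2).sum(axis=1)
        return pool[np.argmax(d)]

    def take_group(seed, pool):
        # The k records in the pool that are nearest to record `seed`.
        d = ((X[pool] - X[seed]) ** 2).sum(axis=1)
        return pool[np.argsort(d, kind="stable")[:k]]

    while remaining.size >= 3 * k:
        centroid = X[remaining].mean(axis=0)
        r = farthest_from(centroid, remaining)
        s = farthest_from(X[r], remaining)
        for seed in (r, s):
            group = take_group(seed, remaining)
            labels[group] = cluster_id
            cluster_id += 1
            remaining = np.setdiff1d(remaining, group)

    if remaining.size >= 2 * k:
        centroid = X[remaining].mean(axis=0)
        r = farthest_from(centroid, remaining)
        group = take_group(r, remaining)
        labels[group] = cluster_id
        cluster_id += 1
        remaining = np.setdiff1d(remaining, group)

    labels[remaining] = cluster_id   # last group holds between k and 2k - 1 records

    # Aggregation step: replace every record by its cluster centroid.
    masked = np.empty_like(X)
    for g in np.unique(labels):
        masked[labels == g] = X[labels == g].mean(axis=0)
    return masked, labels

# Hypothetical data: 23 records with 4 timestamps each.
rng = np.random.default_rng(0)
X_demo = rng.random((23, 4))
masked_demo, labels_demo = mdav(X_demo, k=3)
```

Because every record is replaced by the mean of its group, the column means of the masked data equal those of the original, while each masked row is shared by at least k records.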

4.3 DFTMicroagg

4.3.1 Overview of DFTMicroagg

In this study, we propose DFTMicroagg (see Algorithm 2) to improve the privacy guarantees of the microaggregation algorithm without sacrificing the utility of the protected data. The proposed algorithm aims to improve the privacy value offered by the protection method presented in Fig. 1b. The algorithm takes as input the original energy data X to be masked, an integer k representing the anonymity level, and the desired coefficient value coeff, which is computed according to Eq. (9). X is a matrix representing daily energy consumption time series data as described in Sect. 3.2. The algorithm produces as output the masked dataset with k-anonymity guaranteed. The parameter coeff controls the degree of compression. The proposed algorithm applies low-pass filtering as an anonymization step before the microaggregation algorithm (see Algorithm 2). This provides a two-level anonymization for the protected energy data and stronger privacy guarantees.

First, the variable no_timestamps in the algorithm represents the total number of columns, which correspond to the timestamps of the day at which the energy was consumed. The algorithm tests whether the parameter coeff is even or odd. Based on the outcome of the test, the indices of the real and imaginary components to be used during FFT are computed using the function sequence. This function takes three parameters: the start position of the sequence to be generated, the stop position that marks the end of the interval, and the step value that sets the spacing between values in the generated sequence. The function sequence can therefore be seen as equivalent to the numpy.arange() function in Python. The generated real and imaginary indices are used for the FFT computation. The inverse FFT takes as input the computed DFT and no_timestamps to produce the transformed data. This is passed as input to MDAV along with the value of k to generate the final masked dataset $\hat{X}$.
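The low-pass filtering step can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of Algorithm 2: it uses `numpy.fft.rfft` and counts retained complex coefficients, so the even/odd handling of real and imaginary indices described above is not replicated, and the function name `dft_lowpass` is hypothetical.

```python
import numpy as np

def dft_lowpass(X, coeff):
    """Keep the first `coeff` DFT coefficients of each row and invert the transform."""
    X = np.asarray(X, dtype=float)
    no_timestamps = X.shape[1]
    spectrum = np.fft.rfft(X, axis=1)     # one-sided spectrum of every daily trace
    spectrum[:, coeff:] = 0               # discard the high-frequency coefficients
    return np.fft.irfft(spectrum, n=no_timestamps, axis=1)

# Hypothetical data: two daily traces with eight timestamps each.
X_day = np.array([
    [5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
])
smooth_1 = dft_lowpass(X_day, coeff=1)    # only the mean of each trace survives
smooth_2 = dft_lowpass(X_day, coeff=2)    # mean plus the slowest oscillation
```

A smaller `coeff` gives a smoother, more heavily perturbed trace; in the dual-level scheme the filtered matrix would then be passed to a microaggregation routine such as MDAV together with k.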

4.3.2 Use case of the proposed approach

Suppose we have a time series dataset $D = \{ SM_{cid},timestamp,value\} $ that the utility company collects daily from AMI smart meters. In this dataset, ${SM_{cid}}$ denotes the identifier of a household based on the smart meter used. The high-frequency (HF) data (i.e., value) from the smart meters denote the energy consumption of a household at a particular timestamp of the day. As discussed earlier, the HF data can reveal the consumption patterns of the households, and attackers can exploit this even if ${SM_{cid}}$ is pseudonymized. The utility company wants to protect the privacy of the households in this data so that it is difficult for an attacker to re-identify a particular household record. At the same time, the protected data should remain useful for research and analytical purposes. To protect D via microaggregation, the data first need to be converted to what we term the interval-based representation, or standard format, X, where $t_1,t_2,\dots ,t_n \in T$ are the timestamp attributes in X, alongside the attributes Date and ${SM_{cid}}$. Each row in X is a daily energy consumption time series recorded as ${SM_{cid}}$, Date, and T. Each $t_i \in T$ is a numeric attribute holding the actual energy consumption value at time $t_i$, and its value needs to be masked to protect the privacy of the households in X. In addition, the utility company pseudonymizes ${SM_{cid}}$ before publishing the data to hide the true identities of the households. Each $t_i \in T$ is a quasi-identifier, and a combination of $t_i$ values can be used to re-identify a specific household. A specific $t_i$, or a subset of the $t_i$, that is in the possession of an attacker is assumed to be a confidential attribute. Therefore, before publishing X, each $t_i \in T$ must be masked to avoid privacy leakage.
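The conversion from the long format D to the interval-based representation X can be sketched with pandas. The column names follow the notation above, but the helper name `to_interval_representation`, the `HH:MM` slot labels and the toy readings are assumptions made for this example.

```python
import pandas as pd

def to_interval_representation(D):
    """Pivot long-format readings into one row per (SM_cid, Date), one column per time slot."""
    D = D.copy()
    D["timestamp"] = pd.to_datetime(D["timestamp"])
    D["Date"] = D["timestamp"].dt.date
    D["slot"] = D["timestamp"].dt.strftime("%H:%M")   # time of day becomes the column label
    X = D.pivot_table(index=["SM_cid", "Date"], columns="slot",
                      values="value", aggfunc="mean")
    return X.sort_index()

# Hypothetical readings: two meters, two days, three readings per day.
records = []
for cid in ("A", "B"):
    for day in (1, 2):
        for hour in (0, 8, 16):
            records.append({
                "SM_cid": cid,
                "timestamp": f"2013-01-0{day} {hour:02d}:00:00",
                "value": float(hour + day),
            })
D = pd.DataFrame(records)
X_wide = to_interval_representation(D)
```

Each row of `X_wide` is then one daily trace, and its numeric columns are the quasi-identifiers that the masking step operates on.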

To achieve this goal, as stated in Sect. 4, we provide two ways in which X can be protected. The first is to directly apply microaggregation on X to obtain the masked data $\hat{X}$. The second approach is to apply the proposed DFTMicroagg algorithm to protect X. For the sake of clarity, the number of coefficients used for each test case of DFTMicroagg is given by,

$$\begin{aligned} \textit{coeff} = \frac{T}{i} \end{aligned}$$

(9)

where T is the total number of timestamps in X and i is a constant chosen by the utility company for privacy control. We evaluated different values of i as presented in Sect. 5.2.5. The motivation is that, instead of continuously increasing the value of k to a large number during microaggregation, which can lead to significant information loss, we provide an additional layer to microaggregation that offers suitable masking with specific consideration of the shape of the time series. We empirically show that this approach reduces disclosure risk without compromising the utility of the protected data for research and analytical purposes.

4.4 Adversarial model

In this paper, we consider an adversary whose goal is to launch two types of record linkage attacks (distance-based record linkage and interval disclosure) to link the records in the masked dataset with external data that the intruder has obtained through external knowledge. The external data usually contain key attributes such as those in the masked data. When testing a record linkage model, the original dataset is used to represent the intruder's external data. For each attack model, we check the privacy values of microaggregation and DFTMicroagg for protecting energy data.

4.4.1 Distance-based record linkage

The goal of an attacker using distance-based record linkage is to use a distance metric to link each record in the masked dataset with its corresponding record in the original. [19] gives a brief description of how a robust distance-based record linkage algorithm for a typical case of microaggregation protection should be developed. For each record in the masked dataset, the distance to every record in the original dataset is computed. Thereafter, the ‘closest’ and ‘second closest’ records in the original dataset are considered. A record in the masked dataset is labeled as ‘linked’ when the closest record in the original dataset is the corresponding original record. Similarly, a record in the masked dataset is labeled as ‘linked to 2nd closest’ when the second closest record in the original dataset turns out to be the corresponding original record. In all other cases, a record in the masked dataset is labeled as ‘not linked.’ The percentage of disclosure risk is computed as the number of ‘linked’ and ‘linked to 2nd closest’ records relative to the total number of records in the masked dataset. Based on this description, we propose a robust distance-based record linkage algorithm in Algorithm 3, which considers not only the closest record but also the second closest to the masked record being linked. This algorithm can also be generalized to evaluate the privacy value of other anonymization methods. Algorithm 3 uses a list comprehension to compute the distances from each record in the masked dataset to every record in the original dataset. The closest and second closest distances are then derived from the computed distances. The algorithm assumes an attacker with maximum knowledge of the original data.
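The linkage procedure described above can be sketched as follows. This is an illustrative reading of the description rather than Algorithm 3 itself; the function name, the Euclidean metric and the stable tie-breaking are assumptions, and row i of the masked data is taken to correspond to row i of the original.

```python
import numpy as np

def distance_based_record_linkage(X, X_masked):
    """Count masked records whose true original is the closest or second closest record."""
    X = np.asarray(X, dtype=float)
    X_masked = np.asarray(X_masked, dtype=float)
    n = X.shape[0]
    linked = 0
    linked_2nd = 0
    for i in range(n):
        d = np.linalg.norm(X - X_masked[i], axis=1)   # distance to every original record
        order = np.argsort(d, kind="stable")
        if order[0] == i:
            linked += 1
        elif n > 1 and order[1] == i:
            linked_2nd += 1
    return {
        "linked": linked,
        "linked_2nd": linked_2nd,
        "not_linked": n - linked - linked_2nd,
        "risk_percent": 100.0 * (linked + linked_2nd) / n,
    }

# Hypothetical example: four records masked with 2-anonymous centroids.
X_orig = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
X_mask = np.array([[0.5, 0.0], [0.5, 0.0], [10.5, 10.0], [10.5, 10.0]])
result = distance_based_record_linkage(X_orig, X_mask)
```

In this toy case each centroid is equidistant from its two group members, so half of the records are linked through the closest record and the other half through the second closest, which illustrates why the second closest record matters when k = 2.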

4.4.2 Interval disclosure risk

The second adversarial model considered in this study is interval disclosure risk [19], which is an attribute inference attack that tries to infer the smart meter values. Formally, for each record r in the masked dataset $\hat{X}$, an attacker computes a rank interval as follows. First, each attribute in $\hat{X}$ is ranked independently to define a rank interval around the value the attribute takes on each record. Second, the ranks of the values within the interval for an attribute around record r should differ by less than p percent of the total number of records, and the rank in the center of the interval should correspond to the value of the attribute in record r. The proportion of original values that fall into the interval centered around their corresponding masked value is then taken as the disclosure risk measure. A proportion of 100 percent indicates that an attacker is completely certain that the original value falls in the interval around the masked value, which leads to interval disclosure of the record in the original data. In the case of the daily energy consumption dataset, each attribute corresponds to a particular timestamp of the day. A quantitative measure is then computed to quantify the interval disclosure risk for the protected data $\hat{X}$. We implemented interval disclosure using the sdcMicro package. Algorithm 4 provides the procedural steps to achieve this goal. In this algorithm, n is the total number of records in $\hat{X}$, and the parameter p can be used to enlarge or shrink the interval.
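The rank-interval idea can be sketched in Python as follows. This is an illustrative reading of the description above, not the sdcMicro implementation and not Algorithm 4. The function name `interval_disclosure_risk` is hypothetical, and the sketch assumes that the interval extends ceil(p · n / 100) rank positions on each side of the masked value, clipped at the ends of the ranking.

```python
import numpy as np

def interval_disclosure_risk(original, masked, p=1.0):
    """Percentage of original values inside the rank interval of their masked value.

    Assumes row i of `masked` is the protected version of row i of `original`,
    and that columns are attributes (e.g. timestamps of the day).
    """
    original = np.asarray(original, dtype=float)
    masked = np.asarray(masked, dtype=float)
    n, m = masked.shape
    # Assumed interval half-width, expressed in rank positions
    half_width = int(np.ceil(p * n / 100.0))
    hits = 0
    for j in range(m):
        column = masked[:, j]
        sorted_vals = np.sort(column)
        # 0-based rank of each record's masked value within attribute j
        ranks = np.argsort(np.argsort(column, kind="stable"), kind="stable")
        lower = sorted_vals[np.clip(ranks - half_width, 0, n - 1)]
        upper = sorted_vals[np.clip(ranks + half_width, 0, n - 1)]
        # An original value is disclosed when it lies inside the interval
        hits += np.sum((original[:, j] >= lower) & (original[:, j] <= upper))
    return 100.0 * hits / (n * m)

# Toy data (hypothetical): one attribute, masked values are the originals reversed
original = np.arange(10, dtype=float).reshape(-1, 1)
masked = original[::-1].copy()
risk_identity = interval_disclosure_risk(original, original, p=10)
risk_reversed = interval_disclosure_risk(original, masked, p=10)
```

When the masked data equal the original data, every value falls inside its own interval and the risk is 100 percent. Increasing p widens the intervals, so the measured risk can only stay the same or grow.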

5 Experimental setup

All experiments were implemented in the Python programming language on a Dell laptop running the Windows operating system with a 1 TB HDD and 32 GB of RAM. As stated in Sect. 4.4.2, we implemented interval disclosure risk using sdcMicro, an R package of statistical disclosure control methods for data anonymization and risk estimation. We used the rpy2 package to access sdcMicro from Python.

5.1 Datasets description

We evaluated the efficacy of the proposed approach on two publicly available datasets. The first dataset, ‘EnerNOC GreenButton Data,’ hereafter referred to as Dataset 1, is time series energy usage data collected at 5-minute resolution for 100 commercial/industrial sites in the year 2012. The data is available for download at https://open-enernoc-data.s3.amazonaws.com/anon/index.html. The second dataset, ‘Low Carbon London Electric Vehicle Load Profiles Data,’ hereafter referred to as Dataset 2, is time series data relating to load profiles for electric vehicle charging. It is part of the Low Carbon London (LCL) project delivered by UK Power Networks. The dataset spans two years, from 2013 to 2014, with 53 commercial and 70 residential trials. The data is available for download at https://data.london.gov.uk/dataset/low-carbon-london-electric-vehicle-load-profiles. Table 1 summarizes the datasets.

Table 1 Datasets description
