Keywords

1 Introduction

The Internet has undergone dramatic changes in the last two decades, evolving from a mere communication network to a global multimedia platform in which billions of users not only actively exchange information, but also increasingly carry out their daily personal activities. While this transformation has brought tremendous benefits to society, it has also created new threats to online privacy that existing technology is failing to keep pace with. In fact, protecting privacy on the Internet remains a widely unsolved challenge for users, providers, and legislators alike. Users tend to reveal personal information without considering the widespread, easy accessibility, potential linkage and permanent nature of online data. Many cases reported in the press indicate the resulting risks, which range from public embarrassment and loss of prospective opportunities (e.g., when applying for jobs or insurance), to personal safety and property risks (e.g., when stalkers, sexual offenders or burglars learn users’ whereabouts).

Legislators have responded by tightening privacy regulations. The European Court of Justice recently ruled in Google Spain v. Mario Costeja González [9] that EU citizens have a fundamental right to be forgotten for digital content on the Internet, in the sense that indexing systems such as Google (or other search engines, as well as systems that make data easily discoverable, such as Facebook and Twitter) must offer users technical means to request removal of links in search results that point to sources containing their personal information and violating their data protection rightsFootnote 1. While a comprehensive expiration mechanism for digital data has often been postulated by privacy advocates in the past, this court decision, for the first time, imposes a legal constraint for indexing systems that operate in the EU to develop and deploy suitable enforcement techniques. As of now, the solution deployed by leading search engines, such as Google, Microsoft and Yahoo, consists of a simple web form that requires a user to manually identify all relevant links herself upfront and to insert them into the web form, followed by a manual evaluation by the search engine’s employees to assess whether the author of the request is eligible and the request itself is lawful, i.e., the data subject’s right to privacy overrides the interests of the indexing operator and the freedom of speech and information.

According to the Google transparency report [16], the number of removal requests that have been submitted to Google since the court decision in May 2014 has already exceeded 1/5 of a million and the number of URLs that Google has evaluated for removal are approximately 3/4 of a million. Clearly, in order to enable efficient enforcement, it is essential to develop techniques that at least partly automate this process and are scalable to Internet size, while being censorship-resistant by ensuring that malicious users cannot effectively blacklist links to Internet sources that do not affect them.

Our Contribution. We propose a universal framework, called Oblivion, providing the foundation to support the enforcement of the right to be forgotten in a scalable and automated manner. Technically, Oblivion provides means for a user to prove her eligibilityFootnote 2 to request the removal of a link from search results based on trusted third party-issued digital credentials, such as her passport or electronic ID card.

Oblivion then leverages the trust imposed by these credentials to generate eligible removal requests. More specifically, the officially-generated signatures contained in such credentials comprise personally-identifiable information of the card owner, such as her signed passport picture, address, etc. These so-called signed attributes are subsequently automatically compared with publicly available data whose removal should be requested, in order to determine if a source indeed contains information about a given entity. In Oblivion, we use state-of-the-art natural language processing (NLP) and image recognition techniques, in order to cover textual and visually identifiable information about a user, respectively. Further modalities can be seamlessly integrated into Oblivion. These techniques in particular automate the task for a user to determine if she is actually affected by an online source in the first place. The outcome of these comparisons, based on the signed attributes, is then used to provide proof to the indexing system that a user is eligibly affected by a source. To avoid creating further privacy concerns, Oblivion lets the user prove her eligibility to request data removal without disclosing any further personal information beyond what is already available at the link. This approach applies to a variety of different indexing systems, and in particular goes beyond the concept of search engines that we refer to throughout the paper for reasons of concreteness. Moreover, Oblivion exploits the homomorphic properties of RSA [29] in order to verify the eligibility of an arbitrarily large set of user credentials using only a single exponentiation, and is thus capable of handling 278 requests per second on a standard notebook (2.5 GHz dual core and 8 GB RAM). We consider this suitable for large-scale deployment.

Outline. This paper is structured as follows. We review related work in Sect. 2. The conceptual overview of Oblivion and its detailed realization are presented in Sects. 3 and 4, respectively. Section 5 provides performance analysis of Oblivion. Section 6 discusses various aspects of Oblivion. Next, we conclude and outline future work in Sect. 7. Appendix A formally states and proves the security properties of Oblivion.

2 Related Work

The most common way to prevent web robots (or web crawlers) [24] from indexing web content is the Robots Exclusion Protocol (a.k.a.  robots.txt protocol) [2], a standard for controlling how web pages are indexed. Basically, robots.txt is a simple text file that allows site owners to specify and define whether and how indexing services access their web sites. The use of this protocol for privacy enforcement is limited, since the file that defines the protocol can only be placed and modified by the administrator of the web site. The individual whose personal data is being published is hardly capable of contacting and persuading all administrators of these sources to remove the data or modify the robots.txt file. There are many attempts to approach this privacy enforcement problem in an orthogonal fashion, by adding an expiration date to information at the time of its first dissemination [5, 7, 15, 22, 27, 28]. The basic idea is to encrypt images and make the corresponding decryption key unavailable after a certain period of time. This requires the decryption key to be stored on a trusted server, which takes care of deleting the key after the expiration date has been reached. Although some of the approaches utilize CAPTCHAs to prevent crawling the images easily, there is no fundamental protection against archiving images and corresponding keys while they are still openly available, even though first successes using trusted hardware to mitigate this data duplication problem have been achieved [5]. Another approach in this direction is the concept of sticky policies [6, 8, 21, 26]. The concept was originally introduced by Mont et al. [21] and requires a machine-readable access policy to be bound to the data before it is disseminated. The policy then ensures that the recipient of the data acts in accordance with the policy definition. However, enforcement of such policies has to be backed by additional underlying hardware and software infrastructure. In addition to these shortcomings, a user needs to take care to augment data with expiration dates before the data is disseminated in all these approaches. Thus these approaches are inherently unsuited to cope with data that is already openly available on the Internet or gets published by third parties. Finally, to implement the European Court of Justice’s decision, Google, Microsoft and Yahoo recently launched dedicated web forms [17, 20, 34] for submitting removal requests. Users have to manually identify all relevant links and insert them into this form. Subsequently, the request is evaluated manually by the employees of the indexing system to assess first, weather the author is eligible to file that request and second, whether the link to the source needs to be deleted for a specific search. To this end, users have to additionally hand over a legible copy of an ID document. The necessity of handing over a user’s full identity to use the service comes with additional privacy implications that one would like to avoid. Oblivion constitutes a technical follow-up to this solution, with a dedicated focus on censorship-resistance, while additionally avoiding the detrimental effect of having to disseminate further personal information.

Fig. 1.
figure 1

Conceptual Overview of Oblivion.

3 Conceptual Overview of Oblivion

In this paper, we propose a framework laying the foundation for a privacy-preserving automation of the right to be forgotten in a scalable manner. The basic idea is that users automatically identify online sources that contain their personal data and can automatically request its removal from indexing systems, if it violates their data protection rights. Upon receiving the request, we enable the indexing service to automatically verify if the author of the request is provably affected by the source in question. Our framework is sufficiently generic to incorporate any type of data, such as text, pictures, voice and video. For brevity reasons, in this paper, we mainly focus on two data types: text and pictures.

3.1 Motivating Scenario and System Model

We start with a motivating scenario to explain the required functionality of the framework and the different parties involved. We assume that a user, Alice, discovers that an indexing service, say Google, returns certain query requests with links pointing to a document that contains her personal information and violates her privacy. In the next step, Alice contacts an Ownership Certification Party (OCP) in order to receive validation that this source indeed contains her personal information. Such an OCP could be a third party or the Google helpdesk. Along with the relevant links, she hands over publicly verifiable ID documents such as driver’s license, passport or national ID card to the OCP. If the provided documents and the content of the article in question indeed match (which will be automatically checked by Oblivion), the OCP hands back a corresponding certificate. Alice then contacts Google to request removal of these links, providing an additional explanation, and proves her eligibility to do so based on the certificate of the OCP. Upon receiving this information, Google checks if the considered document is indeed indexed by Google, and if the OCP certificate is valid for this specific document and user. In this case, the requested article will be removed from the indexing system.

Based on this use case scenario, we consider the following entities in our proposed framework designed for automating the process of handling removal requests.

  • User: An authorized user who issues the request to remove her personal data.

  • Indexing System: This system is capable of removing links to sources containing a user’s personal data from its indexing system, based on a removal request of the user.

  • Ownership Certification Party (OCP): It is responsible for verifying if the user is the eligible data subject of the source under considerationFootnote 3.

  • Certification Authority (CA): It issues publicly verifiable credentials to the users.

3.2 Threat Model and Security Objectives

We assume that all entities in the system fully trust the CA. However, a CA does not need to be online because the issuance of credentials to the users takes place out of band, typically for a longer period of time, say a couple of years.

Unlike the CA, the OCP is an entity that is not fully trusted from the user’s perspective because it can try to learn the user’s keying material and additional user credentials not required for the ownership verification; moreover, it might want to forge removal requests. The OCP is the only entity that is not part of the traditional system. The OCP can be run by the organization (e.g., Google) that manages the indexing system, or it can be a third-party service. The OCP is assumed to be online during the execution of a request.

The indexing system is an entity inherently present in the traditional system. The indexing system and the OCP mutually trust each other; in practice, this is often trivially the case since the OCP and the indexing system are often managed by the same organization. If the OCP is an independent third party, this trust would typically be established via the CA using appropriate certificates.

We assume that users protect their private keys or at least, if their private keys are lost or stolen, a key revocation mechanism is installed and the user generates new keys. During the ownership verification, we do not assume any interaction between the users and the OCP. A user can present the OCP-signed proof to remove links to the data from multiple indexing systems, such as Google and Yahoo. We also consider an external adversary that could harm credibility of the user through replay attacks with the intention to make the service unavailable. For providing confidentiality over the communication network, we assume the presence of secure channels (such as SSL/TLS [11]) between a user, the OCP and the indexing system.

Based on these assumptions, we intend to achieve the following security objectives:

  • Minimal Disclosure: An indexing system should not learn anything beyond what is required for eligibility checking and assessment of lawfulness. The court decision ruled that the right has to be judged on a case-by-case decision. Hereby, the right of the individual has to be balanced with the public right of information. Our system handles removal requests that prove eligibility but do not reveal any further information beyond what can be found in the online source in questionFootnote 4.

  • Request Unforgeability: The system should be designed such that an indexing system can only verify user requests without any possibility of forging existing or generating new requests on behalf of the user.

  • Censorship-Resistance: The system should prevent censorship in the sense that only requests from provably affected users should be taken into account.

In addition to ensuring these security properties, the system should satisfy the following system properties in order to be suitable for large-scale deployment. It should be scalable in order to be able to process a large amount of queries simultaneously, while at the same time ensuring a thorough treatment of each individual query. It should blend seamlessly into existing infrastructures, to enable adoption by current indexing systems and certification authorities; moreover, the solution should be conceptually independent of the device and the operating system used. Finally, it should be easy to understand and use even for the general public.

3.3 Key Ideas of the Protocol

Oblivion is built on top of already available infrastructure (as explained in Sect. 3.1) that includes users, an indexing system and a CA. For the automatic verification of ownership, we introduce only a single new entity, the OCP, thus making our framework deployable in practice. In the framework, we distinguish three main phases: registration, ownership claim and reporting phases. Figure 1 presents the overall architecture for achieving the goals defined in Sect. 3.2.

Registration Phase. During the registration phase, each user registers with the CA as shown in Fig. 1. For the registration, a user presents (in Step 1) her attributes (along with evidence) and her verification key. The verification key should, for privacy reasons, be generated by the user herself before contacting the CA, but the generation of the key is orthogonal to our framework. The CA checks the validity of the attributes presented, certifies them and returns (in Step 2) a list of signed attributes, where each signed attribute is bound with the user’s verification key. Typical examples of attributes are the date of birth, name or a user’s profile picture.

Ownership Claim Phase. Once a registered user finds leakage of her personal data through the indexing system, she can contact the OCP claiming eligibility (in Step 3). This is the core phase in which the OCP expects justification of why the given piece of data affects the user. To make such a justification, the user can put tags on the given data that consist of her attributes which were signed by the CA. In order to improve usability, we automate the tagging and verification. One trivial automation method is to simply check if any user attribute appears anywhere in the article; if this happens, the matched item could be tagged with that attribute. The name attribute, say Alice, could be matched in this way.

The exact matching can semi-automate the tagging process but it cannot work in general because it may not return the correct results for all user attributes. Let us consider a user attribute in the form of a tuple: \(\langle \) Nationality, German \(\rangle \) (as explained in Sect. 4.1). In order to match this attribute, the OCP has to check if the user attribute or its synonym has appeared in the article. This includes semantically linkable formulations, such as being a citizen of Germany and having German nationality.

Letting the user manually deploy this solution, i.e., forcing the user to find synonyms of each possible word in the article, is an exhaustive task. Therefore, we employ an NLP-based technique — the named entity recognizer (NER) [14] in our case — for efficiently collecting all possible candidates in the article. The NER detects and classifies the data into various categories, such as person, organization, location, date and time, and it thus helps to identify if a user has attributes belonging to the category identified by the NER. If yes, we can perform exact matching or run a synonym checker [23] on identified categories. Articles containing a user’s picture are tagged in a corresponding manner.

After the attributes are matched, the user has to generate a proof by preparing a message that contains a list of signed attributes that are required for the verification, the tagged article and her verification key. The user signs this message and sends it to the OCP (in Step 3) as an eligibility claim.

The OCP first verifies the message signature and the signed attributes used in the tagging. If the claim relates to text attributes, the OCP runs an entity disambiguator to identify whether the article is about the user. If the claim includes a picture, the OCP runs a corresponding face recognition algorithm. Upon successful evaluations of all steps, the OCP presents to the user an ownership token (in Step 4).

Reporting Phase. After receiving the ownership token from the OCP, the user sends a request for removal to the indexing system (in Step 5). The indexing system automatically validates the ownership token and then assesses whether to remove the links pointing to the user’s personal information from its system. Finally, it sends (in Step 6) an acknowledgment to the user, which could be a success or failure message.

4 Realization Details of Oblivion

In this section, we provide details of each phase of our framework and explain the communication protocol to show interaction between different components. An indexing system and a user are denoted with IS and U, respectively. The communication protocol steps, described in this section, correspond to the flow illustrated in Fig. 1. After that, we provide details on how to securely and efficiently realize the proposed protocols using cryptographic primitives.

4.1 Registration Phase

As we can see in the communication protocol, a user sends (in Step 1) her attributes, \(A = \{a_1, a_2, \ldots , a_n \}\), which characterize her, with supporting proofs and the verification key \(\mathsf {vk}_U\) to the CA. Each user attribute \(a_i \in A\) is a name and value key pair \(\langle \textsf {NAME, VALUE}\rangle \), representing name of the attribute and value specific to each user, respectively. For instance, an attribute name could be National, and if say, a user is national of Germany, then the value will be German. Some general user attribute names include, but are not limited to, Full Name, Date of Birth, Place of Birth, Current Residence and ID Picture.

Upon a successful verification of the provided data, the CA issues a list of signed attributes \(\sigma _{U_A} = \{ \sigma _{U_{a_1}}, \sigma _{U_{a_2}}, \ldots , \sigma _{U_{a_n}} \}\) and sends it back to the user (in Step 2). Our attribute signing scheme binds every user’s attribute with her verification key. Note that one of the attributes \(a_i\) is a profile picture that uniquely identifies the user.

Steps 1 and 2 constitute the registration phase that takes place securely and out of the band. The concept of digital signature together with user attributes (signed by the government) is already present in some EU countries [12, 13, 32].

4.2 Ownership Claim Phase

In order to make an ownership claim to the OCP, we consider a user client, say a browser plugin. The plugin sends the claim to the OCP and receives an ownership token from the OCP in the case the claim can be verified, cf. Figure 1. In order to do so, the first step is that the user client has to formulate the claim, then it has to identify personal information and finally the actual removal request has to be generated. In the next step, the OCP has to verify the request. This is done by first verifying the authenticity of the request and second verifying the relationship to the data. The latter verification depends on the type of data, e.g., face recognition can be used for pictures. The last step is to generate the ownership token that is then transferred from the OCP to the user. In the following, we present the details of all these tasks.

Fig. 2.
figure 2

An article illustrating personal information of Alice Schmidt who has an ID card with digital credentials issued by the German government.

Identifying Personal Information. For identifying user’s personal information in an article (as illustrated in Fig. 2), a user client may run the NER algorithm locally (assuming it is delivered as a part of the user client) to extract all possible candidates. The NER algorithm could also be run as a third-party service (e.g., a web service), called by the user client. After running the NER algorithm, a user client picks each of the candidates and matches them with the user attributes (see Fig. 2).

If the match is not successful, a user client runs a synonym checker. If both words are synonyms then they are considered matched; otherwise, the next candidate is picked from the queue for the comparison. The synonym checker could be delivered as a part of the user client. To make the user client lightweight, we can assume a third party service (e.g., a web service). In either case, the synonym checker should be very specific to the attributes issued by the CAFootnote 5.

Face Detection. Besides the textual description, an article could also contain a user’s picture, either as a solo or a group picture. Like textual attributes, the user client can run the face detection algorithm to automatically detect the user’s face. On successful detection, a user client can automatically include the CA-signed user picture in the removal request, which is explained next.

Generating Removal Request. After identifying personal information, a user client prepares a removal request. During the preparation, it chooses all signed attributes required for the ownership claim. Next, it packs them as \(P_{\sigma _{U_{A^*}}}\) so that the OCP can verify the signed user attributes using a single exponentiation operation using the CA verification key. This would also require a user client to include in the message a subset of her attributes \(A^*\) corresponding to the packed ones, i.e., \(P_{\sigma _{U_{A^*}}}\). Since a user client signs the message using the user’s signing key, the user’s verification key \(\mathsf {vk}_U\) is also included in the message to let the OCP verify the message. For preventing replay attacks, a timestamp TS is also included in the message. The user client sends to the OCP (in Step 3) the message \(M=(TS, \mathsf {vk}_U, A^*, P_{\sigma _{U_{A^*}}}, D)\) along with the signature \(\sigma _M\).

Verifying Removal Request. Upon receiving a removal request, an OCP verifies it before issuing any ownership token. As a first step, the signature \(\sigma _M\) over the message M is verified. Next, the OCP checks the timestamp and verifies the packed version of the user attributes signed by the CA. Then, the OCP checks if all tagged attributes are valid. This step comprises the exact matching and/or synonym checking.

Face Recognition. Optionally, the face recognition algorithm could be run provided there is a user picture in the article. As we explained earlier in this section, faces are pre-identified by the user client, in order to ease the job of the OCP. The OCP compares the user-tagged face with one provided as a signed user attribute in the request (see Fig. 2). If the face recognition algorithm discovers similarity with a certain confidence, the user’s picture in the article is considered matched with her profile picture.

Entity Disambiguation. When the given article contains text, the OCP can execute the disambiguation algorithm (e.g., AIDA [18]) for ensuring the eligibility goal, i.e., checking whether the article is about the user. The outcome of this algorithm is the relation between the user attributes, her name in particular, and the context of the text. The outcome, say satisfying the predefined threshold value, would help the OCP to mark the user as being affected by the data in the article. Figure 2 illustrates an example article about Alice Schmidt.

Issuing Ownership Token. On successful evaluations of all the steps performed by the OCP, the user is issued an ownership token. This is accomplished by the OCP by sending (in Step 4) an ownership token \(D_U\) to the user. It is important to note that the OCP verification protocol is non-interactive.

4.3 Reporting Phase

Once the user receives the ownership token, she can report to the indexing system. In this phase, a user reports by sending (in Step 5) the ownership token \(D_U\) (corresponding to D) to the indexing system. The indexing system verifies the token, fires the incident and sends (in Step 6) an acknowledgment Ack to the user. If the OCP is a third-party service, the ownership token is signed by the OCP and could be sent to multiple indexing systems simultaneously.

4.4 Efficient Cryptographic Realization

The cryptographic instantiation relies on RSA-full-domain hashing as the underlying signature scheme. We briefly recall the definition of this signature scheme. The scheme assumes a given collision-resistant family of hash functions \(\mathcal {H}_k: \{0,1\}^* \rightarrow \{0,1\}^k\). In the following, we omit the security parameter k for readability. The key generation KeyGen computes a key pair \((\mathsf {sk}, \mathsf {vk})\) by first computing an RSA modulus \(N = p \cdot q\), where p and q are two random primes, and then computing e and d such that \(e \cdot d = 1 \mod (p-1)(q-1)\). The keys are \(\mathsf {sk}= (d, \mathsf {vk})\) and \(\mathsf {vk}=(e,N)\). The signing function Sign(\(\mathsf {sk}\), M) computes \(\sigma _M := \mathcal {H}(M)^d \mod N\). Finally, the verification function Verify(\(\mathsf {vk}\),\(\sigma ,M\)) outputs \(\mathsf {accept}\) if \(\mathcal {H}(M) = \sigma ^e \mod N\) and \(\mathsf {reject}\) otherwise.

Using this cryptographic primitive, we finally describe the construction that we propose for our framework to achieve goals defined in Sect. 3.2. The censorship-resistance and eligibility checking goals could be achieved using X.509 based schemes [19]; however, those schemes are not able to achieve goals including minimal disclosure (i.e., disclosing only those attributes required for the ownership claim) and scalability (i.e., reducing computational overhead on the OCP end). Using our construction, the user can provide a minimal set of her attributes required for the ownership claim, and we are able to delegate some computation to the user client so that the OCP could be offloaded. Our construction allows an OCP to verify all user attributes with just a single exponentiation.

Fig. 3.
figure 3

Details on the algorithms of the data ownership scheme.

Definition 1

(Data Ownership Scheme). The data ownership scheme DATA-OWN is a tuple of algorithms \(\langle \textsf {Init, KeyGen, CA.SignA, U.SignM,}\) \(\textsf {OCP.VerifyM, U.PackA,OCP.VerifyA}\rangle \). The definition of the algorithms can be found in Fig. 3.

Lemma 1

(Correctness). Informally speaking, OCP.VerifyA(\(\mathsf {vk}_ CA , \mathsf {vk}_U,\) \( P_{\sigma _{U_{A^*}}}, A^*\)) will always return \(\mathsf {accept}\) if the list of signed attributes that are packed by the user are the same as the list of attributes \(A^*\) provided by the user to the OCP. More formally,

$$\begin{aligned} \mathbf Pr [\textsf {OCP.VerifyA}(\mathsf {vk}_ CA , \mathsf {vk}_U, P_{\sigma _{U_{A^*}}, A^*)} = \mathsf {accept}] = 1 \end{aligned}$$

The claim easily follows from the homomorphic property of exponentiation modulo N. We analyze the security properties of Oblivion in Appendix A.

5 Performance Analysis

In this section, we provide implementation details for all components that we newly developed for Oblivion and name libraries that this implementation relies on. We subsequently evaluate the performance overhead of this implementation for each involved component (CA, user client, and OCP).

5.1 Implementation Details and Evaluation Parameters

Components of the Implementation. The implementation prototype is written in Java. To reflect the different involved participants, the implementation consists of three components: a module for the CA (CA-module), a module for the OCP (OCP-module) and a module for the user client (user-module). For the sake of simplicity, the prototypical implementation assumes that the OCP and the indexing system are managed by the same organization; this avoids an additional trust level between these institutions and allows us to concentrate on the performance measurements. The size of each of these modules (without included libraries; see below) is below 5 KB.

Libraries Used. Our prototypical implementation relies on several existing open source libraries. First, we include the Stanford NER library [1] for identifying personal information in the textual article. The NER library is of size 3.2 MB and the NER classifier, for covering seven distinct classes of data, requires 16.6 MB. Second, we rely on OpenCV (Open Source Computer Vision Library), an open source computer vision and machine learning library [25], for face detection and recognition. Finally, we include the AIDA (Accurate Online Disambiguation of Named Entities) framework [18] to achieve ownership disambiguation. In our experiments, we used the AIDA framework itself and its corresponding web service, which works with entities registered in the DBpedia [3] or YAGO [30] knowledge base.

Evaluation Parameters. We have evaluated the performance of the implementation on a dataset of 150 news articles that we randomly crawled from the international news agency ReutersFootnote 6, using the Java-based web crawler crawler4j [10]. These articles cover different topics and range from 1 K to about 10 K words; the average length is 1.9 K words per article. The actual experiments were run on a standard notebook with 2.5 GHz dual-core processor and 8 GB RAM. The experimental results described below constitute the average over 100 independent executions. Network latency was not considered in the experiments.

5.2 Evaluating the CA-Module

Evaluating the performance of the CA-module consists of measuring the overhead of attribute certification.

Fig. 4.
figure 4

Evaluation of the CA-module: Performance overhead for certifying user attributes.

Attribute Certification. Figure 4 illustrates the computational overhead for certifying user attributes. In our experiment, we generated up to 50 attributes and considered CA’s signing keys of varying size, ranging from 512 to 4096 bits. As we expected, certification time grows linearly in the number of attributes. For the most complex cases under consideration — the CA signing 50 attributes, and thus far more than what a user would typically maintain, using a signing key of size 4096 bits — the attribute certification took 7.5 s. For smaller numbers of attributes, or for all smaller key sizes, this certification takes less than a second. Since attributes are typically certified only once per user, this computational overhead should be acceptable as a one-time upfront effort.

5.3 Evaluating the User-Module

Evaluating the user-module is performed in two steps: identifying suitable attributes in the given sample texts, and pre-processing these attributes for the subsequent ownership-proof phase.

Fig. 5.
figure 5

Evaluation of the user-module: Performance overhead of (a) identifying personal information and (b) for packing user attributes.

Identifying Attributes. As explained in Sect. 3.3, the user-module pre-processes the article using NER techniques and appropriately selects all entities that are necessary for the identification process. We evaluate the performance of the user-module on the aforementioned 150 news articles from Reuters, and measure the time required to identify and extract all entities. The results are depicted in Fig. 5(a). The performance overhead varies from 77 to 814 millisecond (ms), with an average of 174 ms per article. The number of unique entities in the articles ranges from 43 to 590, where the average number of unique entities per article is 135.

Attribute Packing. After identifying all personal attributes in a given news article, the user-module pre-processes a set of signed attributes as required for the ownership proof. This pre-processing in particular reduces the number of exponentiations that are required to verify the attributes for the OCP, and thereby avoids a potential bottleneck. In the performance measurement, we again considered up to 50 attributes and varying key sizes. As shown in Fig. 5(b), the time for this pre-processing increases linearly in the number of attributes, with an additional overhead for larger key sizes. For the maximum of 50 attributes, the pre-processing only took between 0.1 ms (for a 512-bit key) and 4.1 ms (for a 4096-bit key).

Message Signing. The user client signs the message using her signing key. For this experiment, we considered the aforementioned 150 news articles. Consider the overhead of signing a message with a signing key of size 1024 bits. Depending on the size of the article, the signing took between 2.8 and 3.8 ms, with an average of 2.9 ms per article.

5.4 Evaluating the OCP-Module

We split the performance evaluation of the OCP-module into two parts: First, we evaluate the time required to verify the validity of requests for varying parameters: for varying numbers of articles, for varying number of attributes, and for varying verification requests. Second, we evaluate the time required to decide whether the request is legitimate, i.e., whether the document under consideration affects the user’s data, either by means of entity disambiguation or face recognition.

Fig. 6.
figure 6

Evaluation of the OCP-module: Performance overhead of (a) verifying the messages, (b) verifying user attributes signed by the CA, (c) verifying user requests and (d) running entity disambiguation.

Validating the User Request. Upon receiving a signed message from a user, the OCP verifies the validity of the signature using the user’s verification key. This verification time (with a 1024-bit key) ranges from 2.9 to 4.3 ms with an average of 3.2 ms per article. Figure 6(a) illustrates the cumulative verification time to verify up to 150 articles. It grows linearly, so verifying message validity for 150 articles takes the OCP less than 0.72 s.

Similarly, Fig. 6(b) displays the time required to verify a certain number of signed user attributes. Recall that the user sends a packed version of her signed attributes to ease the verification task of the OCP. Still, the OCP needs to calculate the hash of each individual attribute and multiply all hashes together before being able to verify the signature based on the packed version. Verifying 50 user attributes takes 0.37 ms (for a 512-bit key) and 10 ms (for a 4096-bit key), respectively. For l attributes, the packed version is at least \(l-2\) exponentiations faster than verifying each attribute individually.

Finally, Fig. 6(c) shows the performance overhead for verifying a certain number of user requests. In our experimentation, we assumed that every request requires the verification of 20 attributes, each one signed with a key of size 1024 bits. To measure the performance overhead, we gradually increased the number of user requests from 2000 to 20,000 and observed an (essentially linearly-growing) overhead from 0.824 to 7.96 s. Processing a single verification request with 20 attributes took less than 0.4 ms on average.

The overall computational overhead of the OCP-module is a combination of the message verification and the attribute request verification, each one incurring on average 3.2 ms and 0.4 ms, respectively. Therefore, our implementation manages to process a removal request within 3.6 ms. In summary, it allows the OCP to handle 278 requests per second (using the standard laptop that we based these experiments on).

Eligibility of the User Request. Identifying whether the requested article indeed contains personal data of the requesting user relies on appropriate entity disambiguation. Figure 6(d) illustrates the performance overhead for entity disambiguation with up to 10 entities.

Recall that we require the user client to run the face detection algorithm and select the appropriate face and send it the OCP along with the standard request. The performance overhead of the face recognition algorithm depends on multiple factors such as the picture resolution and the face position in the picture. In our experiments, we have chosen pictures with well-defined frontal faces. The resolution of the pictures is up to 3072\(\,\times \,\)4608 pixels with an average size of 4 MB. Having all these predefined conditions, the runtime of the face recognition algorithm stays in the range of 150 to 300 ms.

The overall performance overhead, comprising both entity disambiguation and image recognition, currently constitutes the bottleneck for verifying the validity of removal requests in the OCP-module. Currently, we are exploring further optimization here.

6 Discussion

Deployability and Usability. In order to deploy our solution, Oblivion requires a national or local government-wide CA that issues credentials to citizens. We argue that this requirement does not limit practicality of our approach because the issuance of such credentials is already part of an EU standard [12], implemented by some member states and meant to be adopted by all the EU member states [13, 32]. The European EID standards also enable the use of digital credentials for Internet communication (e.g., for online shop**) [13] which also strengthens usability for Oblivion ’s developers as well as end-users.

Scope of Eligibility. First, it is a hard problem to decide on the eligibility of an ownership claim if two persons have the same attributes, e.g., name. Oblivion addresses this issue by using attributes that in combination should be sufficiently unique for most people. Second, our framework cannot decide whether a piece of content is of public interest (such information falls into the category of freedom of the press) and outweighs the privacy interest of an individual. This decision is a legal assessment. This is outside of the scope of Oblivion and subject to ongoing research about the automation of legal assessments [4].

Privacy and Availability. The OCP could be a third-party service or managed by the search engine provider. From a privacy point of view, the latter setup may reveal personal information about citizens. However, we argue that a search engine provider does not learn more than what is already available in the article. This is because Oblivion follows a principle of least privilege, where only those particular attributes that are present in the article are sent to the OCP. The collection of information and verification makes the OCP a key component of Oblivion. The availability of the OCP becomes essential in the long-run success of Oblivion. Therefore, to prevent a single point of failure, we can consider deploying multiple instances of the OCP.

Robustness. Oblivion relies on NLP and image recognition techniques. The NLP technique we use in our framework is simple and sufficiently robust in practice. Concerning robustness of the image recognition technique, recent research has shown that automated face recognition is almost comparable to human face recognition accuracy [31]. Therefore, when the removal request includes a picture that uniquely identifies the user with a certain confidence (part of the deployed policy), our framework can easily approve the removal request.

7 Conclusion

In this work, we have introduced a universal framework, called Oblivion, providing the foundation to support the enforcement of the right to be forgotten in a scalable and automated manner both for users and indexing systems. The framework enables a user to automatically identify personal information in a given article and the indexing system to automatically verify the user’s eligibility. The framework also achieves censorship-resistance, i.e., users cannot blacklist a piece of data unless it affects them personally. This is accomplished using the government-issued digital credentials as well as applying the entity disambiguator technique. We have conducted comprehensive evaluations of Oblivion on existing articles, showing that the framework incurs only minimal overhead and is capable of handling 278 removal requests per second on a standard notebook (2.5 GHz dual core). In these evaluations, we have observed that the remaining performance bottleneck on the OCP is caused by the entity disambiguator (i.e., AIDA) and the face recognition (i.e., OpenCV) algorithms. We believe that optimized versions of both could help in significantly improving the performance.

For future work, we plan to improve Oblivion’s accuracy and overall coverage for proving affectedness. Following the principle of reCAPTCHA digitizing books [33], improving the accuracy of NER by taking into account the user client tagging constitutes a promising approach. Another promising research direction is to analyze the assessment of lawfulness and automate the application of future guidelines for the right to be forgotten. Staying close to the precedent, this would also require semantically analyzing the article to determine if its content violates privacy rights, e.g., by being outdated or by containing sensitive information for the entity requesting removal.