1 Introduction

With the development of the intelligent network era, mobile device usage continues to grow, and the number of Android and iOS applications in application stores at home and abroad keeps increasing (Ardito et al. 2020; Feichtner and Gruber 2020). Mobile applications providing functions such as mobile payment and online shopping have steadily become part of people's daily lives. While enjoying the unprecedented convenience of feature-rich applications, however, users constantly face attacks and privacy risks caused by insecure applications. Malware targeting the Android platform is increasing day by day, and nearly half of malicious applications actually use permissions beyond those specified in their privacy policies (Wang et al. 2019; Elluri et al. 2020; Frenklach et al. 2021). As leaks of users' personal information increase, incidents of illegally buying and selling personal information for profit have intensified (Heid and Heider 2021).

Since sensitive personal information is closely tied to the basic rights of natural persons, such as personal dignity, major personal interests, and property interests, improper processing of such information poses major risks to those rights and to personal and property safety. Jurisdictions such as China, the United States, and the European Union have enacted special regulations governing the handling of sensitive personal information.

The analysis and detection of malicious behavior in Android applications has long been a hot research topic, and researchers at home and abroad have proposed many detection systems for Android malicious applications based on static and dynamic analysis (Wang et al. 2019; Elluri et al. 2020; Heid and Heider 2021; Vyas et al. 2021; Fan et al. 2020; Elahi et al. 2021). Static analyses of privacy leaks include LeakMiner (Yang and Yang 2012) proposed by Yang et al., the static analysis tool TrustDroid (Zhao and Osono 2012) proposed by Zhao et al., and the analysis tool FlowDroid (Arzt et al. 2014) proposed by Arzt et al. For dynamic analysis of privacy leakage, there are the TaintDroid (Enck et al. 2014) dynamic analysis tool proposed by Enck et al., AppFence (Hornyack et al. 2011) proposed by Hornyack et al., and Kynoid (Schreckling et al. 2013), which improves on the theory and practice of TaintDroid. Although dynamic and static analysis rest on different principles, both focus on detecting the App's behavior rather than the permissions the App declares, so neither can capture the functional intent the App expresses. For example, if an App applies for location permission in order to send the user's geolocation to a third-party server integrated with the App, conventional dynamic and static analysis will judge this behavior as a privacy leak even when it has been declared.

The purpose of a privacy policy is to inform the user, before using the App, what personal information may be collected while the App runs. For example, a map App needs to continuously obtain the user's address and location information in order to provide better service. Since the developer has informed users in advance, in the privacy policy, what personal information will be collected, such collection of sensitive information does not count as a privacy breach. But if the privacy policy fails to inform users, or deliberately conceals the collection of certain sensitive information, that is malicious behavior. To collect sensitive user information, an App must apply for the permission in advance and invoke the corresponding API in its source code. It is therefore essential to analyze the consistency between the App's permission declarations and its actual permission use. Gibler et al. (2012) found that most applications apply for sensitive permissions in order to transmit the user's sensitive information to a remote server over the network. Although Android has provided a runtime permission system since version 6.0, ordinary users know little about security and cannot effectively protect their personal information. Therefore, from the user's point of view, studying the consistency between an Android application's permission declarations and its actual permission use can help users discover and avoid installing malicious Android applications.

To our knowledge, little work has focused on the consistency analysis of privacy policies and permission applications in Chinese. Chinese is more semantically rich and ambiguous than English, which poses an additional challenge. To analyze the Android privacy leakage problem in the Chinese domain from a consistency perspective, this paper proceeds from two angles: first, it extracts the set of permissions affirmatively declared in the Android privacy policy; second, it identifies the sensitive APIs used in the APK source code through Android static analysis, which yields the set of permissions the App actually uses. This paper makes the following contributions:

  1. To address the lack of a Chinese dataset, this paper automates dataset construction with a similarity model (BERT + cosine similarity), expanding the corpus to 4800 sentences.

  2. Through extensive analysis of Chinese privacy policies, two different types of Chinese privacy policy are identified according to how they describe third-party SDK permission information.

  3. Based on our dataset, we chose mainstream algorithms such as CNN, GRU, LSTM, and BERT to implement the consistency analysis engine and conducted comparative experiments. The BERT model performed best.

2 Background

2.1 Android security mechanisms and privacy policies

Android, being an open-source system, requires specific security mechanisms for protection. These mechanisms rely primarily on the sandbox and the permission mechanism (Enck et al. 2009; Gargenta 2011; Felt et al. 2011). The sandbox mechanism isolates the running space of each App; each App runs as a separate process. The permission mechanism subdivides access rights into individual Android permissions, which makes managing the user's private information more convenient and secure (Davi et al. 2010; Bläsing et al. 2010). It restricts applications' access to device resources and user data through authorization. Android permissions fall into two types: normal permissions and dangerous permissions. Normal permissions do not involve user privacy, such as accessing the internet or reading device status. Dangerous permissions do involve user privacy, such as accessing contacts, location information, or SMS messages. When users install an application, the system displays the required permissions and requests user authorization. Once authorized, the application can access the corresponding resources and data. However, malicious applications can exploit the permission mechanism to invade user privacy; for example, a malicious application may request access to the user's contacts or location information and then transmit it to a remote server. Dangerous permissions are of utmost importance because they are closely tied to user information, and their improper use can result in the malicious dissemination of the user's private data.

The privacy policy is a document that informs users of how information and data are collected and used by a website or mobile software (Yu et al. 2015). The development of a privacy policy also enables network operators to better understand user habits and concerns and avoid certain legal issues. The following is a list of common content related to user privacy in Chinese privacy policies:

  1. How users' personal information is collected and used.

  2. How user information is shared with and used by third parties, and how it is publicly disclosed.

  3. How users' personal information is protected and preserved.

2.2 Commonly used text classification models

2.2.1 CNN

A convolutional neural network (CNN) is a model built around a special network structure, the convolutional layer (Kattenborn et al. 2021). During forward propagation, the convolution operation helps the model learn local features, which is especially effective in image applications. Convolutional layers can also be stacked deeply to further improve performance. Prediction with a convolutional neural network involves several steps. First, the input layer processes the input samples, for example training word vectors for input sentences via an embedding layer. Feature learning is then performed through convolution operations within the network. Since adjacent feature points often carry redundant information, a pooling layer typically follows the convolutional layer. Finally, fully connected layers adapt the learned features to the downstream task.
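These steps map directly onto code. The following minimal TextCNN sketch assumes PyTorch; the class name and hyperparameters are illustrative, not a configuration from the paper:

```python
# Minimal TextCNN sketch: embedding -> convolution -> pooling -> FC layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, num_classes=2,
                 kernel_sizes=(2, 3, 4), num_filters=64):
        super().__init__()
        # Input layer: train word vectors with an embedding layer.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Convolutions learn local n-gram features over the sentence.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes)
        # Fully connected layer adapts the features to the downstream task.
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, dim, seq)
        # Global max pooling removes redundancy among adjacent features.
        feats = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(feats, dim=1))        # (batch, classes)

# Example: a batch of 8 sentences, each 20 token ids long.
logits = TextCNN(vocab_size=5000)(torch.randint(0, 5000, (8, 20)))
```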

2.2.2 LSTM

LSTM is structurally distinct from CNNs (Greff et al. 2016). Its gate mechanism makes training on long inputs tractable, and it alleviates, to some extent, the vanishing and exploding gradient problems that afflict plain recurrent networks.

The function of each gate and its computation are introduced in turn below:

  1. Forget Gate: The forget gate screens the long-term stored information, retaining what is most beneficial to the model (Ghaeini et al. 2018). Using the gate mechanism, the previous cell state \(C_{t-1}\) is multiplied by a value between 0 and 1, which controls how much of it is carried forward into the current cell state \(C_{t}\).

    $$\begin{aligned} f_{t}= \sigma (W_{f} \cdot [h_{t-1},X_{t}] + b_{f}) \end{aligned}$$
    (1)
  2. Input Gate: Considering the degree of influence of the current input on the model, the current input information must be filtered (Ghaeini et al. 2018).

    $$\begin{aligned} i_{t} = \sigma (W_{i} \cdot [h_{t-1},X_{t}] + b_{i}) \end{aligned}$$
    (2)
  3. Output Gate: Determines the final output. The tanh function maps the cell state into \((-1, 1)\), weighting its significance, and the result is multiplied by the sigmoid-activated gate output.

    $$\begin{aligned} O_{t} = \sigma (W_{o} \cdot [h_{t-1},X_{t}] + b_{o}) \end{aligned}$$
    (3)
    $$\begin{aligned} C_{t} = f_{t} \cdot C_{t-1} + i_{t}\cdot \tilde{C}_{t} \end{aligned}$$
    (4)

    The cell state \(C_{t}\) at the current moment is computed as shown in Eq. (4), where \(\tilde{C}_{t} = \tanh (W_{C} \cdot [h_{t-1},X_{t}] + b_{C})\) is the candidate state built from the current input. The final output of the LSTM at the current moment is given by Eq. (5); a minimal code sketch of these computations follows the list.

    $$\begin{aligned} h_{t} = O_{t}\cdot \tanh (C_{t}) \end{aligned}$$
    (5)
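As a concrete illustration, the following sketch implements one LSTM time step following Eqs. (1)-(5) in NumPy; the function names and weight shapes are our own assumptions, not the paper's implementation:

```python
# One LSTM time step following Eqs. (1)-(5); names/shapes are assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev,
              W_f, b_f, W_i, b_i, W_o, b_o, W_c, b_c):
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, X_t]
    f_t = sigmoid(W_f @ z + b_f)          # Eq. (1): forget gate
    i_t = sigmoid(W_i @ z + b_i)          # Eq. (2): input gate
    o_t = sigmoid(W_o @ z + b_o)          # Eq. (3): output gate
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde    # Eq. (4): new cell state
    h_t = o_t * np.tanh(c_t)              # Eq. (5): hidden-state output
    return h_t, c_t
```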

2.2.3 GRU

GRU is a highly effective variant of the LSTM network (Fu et al. 2016). It has a simpler structure than LSTM yet achieves excellent results, and is therefore widely regarded as a versatile architecture. As a variant of LSTM, GRU also solves the long-term dependency problem of RNNs. GRU has fewer parameters, so it trains faster, and it reduces the risk of overfitting. A typical GRU cell consists of two gates: a reset gate r and an update gate z. As in an LSTM cell, the hidden state at time t is computed from the hidden state at time t-1 and the input value at time t; the standard formulation is given below.
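For reference, a standard GRU cell (this formulation is not spelled out in the original text; it follows the common convention and reuses the notation of the LSTM equations above) computes:

$$\begin{aligned} z_{t}&= \sigma (W_{z} \cdot [h_{t-1},X_{t}]) \\ r_{t}&= \sigma (W_{r} \cdot [h_{t-1},X_{t}]) \\ \tilde{h}_{t}&= \tanh (W_{h} \cdot [r_{t} \cdot h_{t-1},X_{t}]) \\ h_{t}&= (1-z_{t}) \cdot h_{t-1} + z_{t} \cdot \tilde{h}_{t} \end{aligned}$$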

2.2.4 Attention

The attention mechanism in deep learning distinguishes the importance of feature vectors within a sample, allowing the model to assign greater weight to more important feature points and greatly enhancing its feature-extraction ability (Vaswani et al. 2017). Since its introduction, the attention mechanism has demonstrated its importance both theoretically and practically, and has achieved significant results in many areas. At its core, attention operates on a query vector Query (Q), a key vector Key (K), and a value vector Value (V).
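Concretely, the widely used scaled dot-product form of Q/K/V attention (Vaswani et al. 2017), where \(d_{k}\) is the key dimension, is:

$$\begin{aligned} \mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left( \frac{QK^{T}}{\sqrt{d_{k}}}\right) V \end{aligned}$$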

2.2.5 BERT

BERT is a natural language processing pre-training model released by Google in November 2018, used to characterize text data (Devlin et al. 2018; Sanh et al. 2019; Liu et al. 2019). The BERT model uses the encoder of the Transformer (Vaswani et al. 2017) architecture as a feature extractor for context learning on top of static word vectors. As shown in Fig. 1, the Transformer consists of two parts: the encoder on the left and the decoder on the right, both built around a multi-head attention mechanism. Multiple heads allow word information to be considered from different dimensions and improve the model's generalization ability. Both parts also contain a fully connected layer, which integrates global features, and a normalization function that speeds up training. BERT's input combines WordPiece embeddings, sentence embeddings, and positional embeddings, with both the sentence and positional embeddings being trainable. By splitting words into stems and suffixes, WordPiece embeddings significantly reduce the vocabulary size.

Fig. 1 The structure of the BERT Transformer


BERT has two pre-training tasks:

  1. Masked language model (Devlin et al. 2018): 15% of the tokens are randomly covered with [MASK], and the masked words are then predicted from context, in the style of fill-in-the-blank completion (see the sketch after this list).

  2. Next sentence prediction (Devlin et al. 2018): mainly used to characterize the relationship between two sentences, judging whether they stand in a contextual (adjacent) relationship.
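To make the fill-in-the-blank idea of task 1 concrete, here is a minimal sketch using a masked-language-model head; the HuggingFace transformers pipeline and the bert-base-chinese checkpoint are our assumptions, not tools named by the paper:

```python
# Sketch: masked-word prediction ("fill-in-the-blank") with a Chinese BERT.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-chinese")
# "We will collect your location in[MASK]ormation." (Chinese example)
for cand in fill("我们会收集您的位置[MASK]息。"):
    print(cand["token_str"], round(cand["score"], 3))
```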

3 Case study

3.1 Privacy policy type definition

After analyzing a large number of Chinese privacy policies, we conclude that the Android privacy policies in most application markets fall into two types with respect to third-party SDKs:

  1. The first type, which we call “native”, does not specify the information collected by third-party SDKs; users are instead given the option to view the third party's own privacy policy. For a “native” privacy policy, extraction combines the permissions found in the App privacy policy with those extracted from the third-party SDK privacy policy to form the App's declared permission list.

  2. The second type is the “hybrid” privacy policy, which explicitly mentions third-party SDK permissions. An App of this type states in its privacy policy which third-party SDKs it integrates, describes their functions, and gives the corresponding permission declarations. The “hybrid” extraction process separates the App's own declared permissions from the third-party SDK permissions within the App privacy policy, avoiding the redundancy introduced by analyzing third-party SDK privacy policies.

Figure 2 shows the extraction processes used in this article for “native” and “hybrid” privacy policies.

Fig. 2 “Native” and “hybrid” privacy policy extraction process

3.2 Permission abuse problem

For the “hybrid” privacy policy, the traditional approach of additionally extracting permissions from the third-party SDK privacy policy over-privileges the declared set, which leads the consistency analysis framework to identify risky apps as normal apps, i.e., to underreport. Since a “hybrid” privacy policy already states the third-party SDK's permission information explicitly, this issue can be avoided by analyzing the App privacy policy alone.

For example, a “Snow**” app itself declares four permissions: camera/album, system notification, calendar read/write, and recording. The integrated third-party SDK additionally declares networking and device-information permissions. The sensitive permissions collected are therefore the camera, calendar, and recording information requested by the app itself, plus the device information and phone number requested by the third-party SDK. Define A1 as the set of declared permissions extracted from the application privacy policy following the “hybrid” analysis process; the corresponding Android permission names in A1 are CAMERA, READ_CALENDAR, RECORD_AUDIO, READ_PHONE_STATE, and READ_PHONE_NUMBERS. A1 thus consists of the permissions the application will use when strictly following its privacy policy. Now consider the treatment under a “native” privacy policy.

According to the traditional analysis process for “native” privacy policies, the third-party SDK privacy policy must also be analyzed. The privacy policy of the third-party “You**” SDK integrated into the “Snow**” app is partially stated there; among other things, the “You**” SDK obtains geographic location information and user device information. Define A2 as the set of declared permissions extracted from the third-party SDK privacy policy. Taking the union of A1 and A2 gives A = {CAMERA, READ_CALENDAR, RECORD_AUDIO, READ_PHONE_STATE, READ_PHONE_NUMBERS, ACCESS_FINE_LOCATION}. However, since the “hybrid” privacy policy already states the third-party SDK's permission information, the ACCESS_FINE_LOCATION permission added by the traditional scheme is redundant in A. This leads to false negatives, where the consistency analysis framework identifies risky applications as normal.
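The redundancy is easy to see with plain set operations. The sketch below uses the permission names from the example above (the `used` set is illustrative) to show how the traditional union-based “native” analysis masks an undeclared location access:

```python
# Declared-permission sets from the "Snow**" example (illustrative values).
A1 = {"CAMERA", "READ_CALENDAR", "RECORD_AUDIO",
      "READ_PHONE_STATE", "READ_PHONE_NUMBERS"}    # "hybrid" extraction
A2 = {"READ_PHONE_STATE", "ACCESS_FINE_LOCATION"}  # third-party SDK policy

A = A1 | A2   # traditional "native" analysis: union of both policies
used = A1 | {"ACCESS_FINE_LOCATION"}  # permissions the APK actually uses

print(used - A)   # set() -> judged consistent: a false negative
print(used - A1)  # {'ACCESS_FINE_LOCATION'} -> flagged by "hybrid" analysis
```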

4 Methodology

4.1 Android usage permission extraction process

By decompiling the bytecode, sensitive API functions can be matched to Android permissions; Table 1 lists part of this mapping. This paper uses Androguard (Mercaldo et al. 2016), a static analysis tool, to identify the APIs and permissions of APKs.

Table 1 Android permissions and their corresponding API functions (partial)

In this paper, we use the find_methods function provided by Androguard (Wang et al. 2015) to check whether a given API appears in the APK source code. The function accepts three parameters: first the package and class name, second the method name, and last the method descriptor, generally “.”. As shown in Listing 1, location permission, access to phone numbers, and access to the SD card are used as examples.

Listing 1 Using find_methods to probe for sensitive APIs
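The original listing does not survive extraction; the following is a hedged reconstruction of the queries it describes, assuming Androguard's AnalyzeAPK entry point and the Analysis.find_methods API (the APK path and the chosen Android framework methods are illustrative):

```python
# Reconstruction of Listing 1: probing an APK for sensitive API calls.
from androguard.misc import AnalyzeAPK

a, d, dx = AnalyzeAPK("app.apk")  # APK object, Dalvik code, Analysis object

def api_used(classname, methodname):
    # find_methods takes regex filters: class name, method name, descriptor
    # (the paper passes "." as the last argument to match any signature).
    return any(True for _ in dx.find_methods(classname, methodname, "."))

# ACCESS_FINE_LOCATION
print(api_used("Landroid/location/LocationManager;", "getLastKnownLocation"))
# READ_PHONE_NUMBERS / READ_PHONE_STATE
print(api_used("Landroid/telephony/TelephonyManager;", "getLine1Number"))
# SD card access (WRITE_EXTERNAL_STORAGE)
print(api_used("Landroid/os/Environment;", "getExternalStorageDirectory"))
```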

When find_methods is used to search for a specific API in Androguard, a positive result proves that the API exists in the code, and therefore that the permission corresponding to that API is actually used.

4.2 Chinese dataset construction process

Fig. 3 The process of constructing the Chinese dataset

Figure 3 shows the flow of Chinese dataset construction. First, the privacy policy text is preprocessed to obtain word segmentation results, which are combined by 3-Gram into permission phrases. Next, the permission phrases are semantically characterized with BERT and compared, by similarity, against the permission phrase expressions initially collected from the official Android documentation.

4.2.1 Data collection

The data samples in this article fall into two parts: Android application privacy policy texts and APK files. The privacy policy texts come from two sources. The first is the Google Play application market, collected with Python crawlers; domestic application markets do not require developers to upload privacy policies, so privacy policy data cannot be obtained from them directly. The second is the Baidu search API: for a given App name, the first item of the result returned by the HTTP request is taken as its privacy policy. In total, 413 privacy policies have been collected. The privacy policy text is used as raw data and is preprocessed before the training set is constructed; preprocessing includes sentence splitting, word segmentation, and stop-word removal.

4.2.2 Build permission phrases

In this paper, we use N-Grams (Cavnar et al. 1994; Bergsma et al. 2009; Lapata and Keller 2005) to combine words into phrases, avoiding the noise introduced by individual permission words. Permission phrase expressions are constructed with Jieba word segmentation followed by a 3-Gram operation, as sketched below.
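A minimal sketch of this step, assuming the jieba package (the example sentence is illustrative):

```python
# Build candidate permission phrases: Jieba segmentation + 3-Gram window.
import jieba

sentence = "我们可能会收集您的位置信息"  # "we may collect your location information"
tokens = jieba.lcut(sentence)             # word segmentation
# Slide a window of three consecutive words over the token sequence.
phrases = ["".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
print(phrases)
```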

4.2.3 Amplifying permission phrases

To automate the mining of additional permission phrase expressions, this paper uses a semantic representation (BERT pre-training model) + cosine similarity model and selects permission phrases whose similarity exceeds a threshold as positive samples for the training set. In this way, permission phrase expressions can be augmented automatically and at scale, and highly similar phrases can be mapped directly to specific permission names, greatly reducing the manual annotation workload.

First, we obtain the most basic permission phrase expressions from the short official descriptions of Android permissions, and then derive semantic vectors by characterizing the permission phrases. For this we use the bert-as-service tool developed by Dr. Han **.
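A sketch of the amplification step, assuming the bert-as-service client API (a server must be started separately) and a hypothetical 0.85 similarity threshold:

```python
# Amplify permission phrases by semantic similarity (threshold is assumed).
import numpy as np
from bert_serving.client import BertClient

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

bc = BertClient()  # connects to a running bert-as-service server
seed = bc.encode(["获取您的位置信息"])[0]          # official-style seed phrase
candidates = ["收集您的地理位置", "展示广告内容"]  # candidate 3-Gram phrases
vecs = bc.encode(candidates)

# Keep phrases whose similarity to the seed exceeds the threshold.
positives = [p for p, v in zip(candidates, vecs) if cosine(seed, v) >= 0.85]
print(positives)
```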

The permission mapper maps permission information to permission names: the BERT model, in its text multiclassification form, maps phrases to specific permission names.

The SDK permission detector identifies whether the requesting subject of a piece of permission information is an integrated third-party SDK: the BERT model, in its text binary-classification form, decides whether the requesting subject of the filtered permission information is a third-party SDK.

In the training phase, the training datasets of the permission filter, permission mapper, and SDK permission detector are constructed from the collected Chinese privacy policies, and the three components are trained as separate BERT models, as sketched below.
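Each of the three models can be fine-tuned in the same way. The following minimal sketch assumes the HuggingFace transformers library and the bert-base-chinese checkpoint; the toy data, label scheme, and hyperparameters are illustrative, not the paper's actual configuration:

```python
# Fine-tuning sketch for one of the three BERT classifiers.
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
# num_labels=2 for the filter/detector, 10 for the permission mapper.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=2)

texts = ["我们会收集您的位置信息", "感谢您使用本应用"]  # toy phrases
labels = torch.tensor([1, 0])            # 1 = permission-related, 0 = not
enc = tokenizer(texts, padding=True, truncation=True,
                max_length=64, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                       # a few epochs over the toy batch
    out = model(**enc, labels=labels)    # [CLS] pooled -> classifier head
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```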

In the prediction phase, data preprocessing is the same as in the training phase: privacy policies are text-processed and permission phrases are formed.

5 Implementation

5.1 Permission filter model design

This section uses the dynamic word-vector BERT model, which takes words as processing units and handles contextual relationships through the attention mechanism. The purpose of permission filtering is to divide the phrases in a privacy policy into permission-related and permission-independent phrases.

Fig. 7 The framework of the permission filter model. w1, ..., w4 represent the tokens that make up the sentence; [CLS] is the classification token for the sequence

In this section, the classification task of the BERT model is used to filter permission-independent information, where the BERT model chosen is the BERT-base Chinese pre-training model (Cui et al.).

The comparative results are shown in Fig. 8. Among the 10 categories of permissions in this section, READ_PHONE_STATE has the highest accuracy and recall, at 0.968 and 0.988, and CALL_PHONE has the highest AUC, at 0.978. Since permission mapping is essentially text multiclassification, the TextCNN model is less effective than the TextRNN and BERT models because it does not consider sequence features. The BERT model, unlike the TextRNN model, dynamically takes contextual information into account and achieves the best permission-mapping results, which is why it is used to build the permission mapper in this section.

Fig. 8 Comparative experiment results

The experiments in this section extract embedding information from different layers of the Transformer. The higher the Transformer layer, the better the evaluation metrics, indicating that higher layers carry more of the semantic features on which the classification of permission-mapping sentences relies.

6.3 SDK permission detector model evaluation

Text-mining the phrases in permission-related sentences shows that permission sentences about third-party SDKs take two main forms: one introduces the declaration directly under a generic name such as “third-party software”; the other specifies the third-party SDK explicitly. In this section, we use the BERT model to experimentally validate SDK permission detection. To verify the effectiveness of third-party phrase information for SDK permission detection, several groups of experiments were set up, using CNN-series models and the pre-trained BERT model for text classification. The experimental results are shown in Table 4. The accuracy is above 95%, showing that this scheme can correctly identify third-party SDKs from third-party phrase information; moreover, the BERT model achieves the highest AUC of 98.1% on the test set, showing that it performs the SDK permission detection task most effectively.

Table 4 The result of different models in SDK permission detection

7 Related work

Android application privacy policy analysis has long been a research hotspot (Yu et al. 2015; Benats et al. 2011; Yu et al. 2016, 2017; Wang et al. 2018). Wang et al. (2019) proposed SmartPi, which uses NLP to extract privacy-permission-related information from user reviews. Yu et al. (2016) proposed PPChecker, which uses NLP to analyze privacy policies and models three issues: incomplete, incorrect, and inconsistent privacy policies; it also employs program analysis techniques to examine applications and identify potential privacy leakage. Slavin et al. (2016) proposed a semi-automated framework for detecting violations based on a privacy-policy-phrase ontology and a collection of mappings from API calls to policy phrases. The most closely related work is by Wang et al. (2018), who automatically detect privacy leaks of user-entered data in a given Android app and determine whether such leakage may violate the app's privacy policy claims. However, the privacy policy analysis in these works applies only to English privacy policy texts and may not carry over to Chinese.

A few studies focus on GUI analysis (Azim and Neamtiu 2013; Huang et al. 2014; Mulliner et al. 2014; Huang et al. 2015; Nan et al. 2015). The most related works are UIPICKER (Nan et al. 2015), SUPOR (Huang et al. 2015), UIREF (Andow et al. 2017) and GUILEAK (Wang et al. 2018), whose goal is to identify sensitive user input information on the GUI. UIPICKER and GUILEAK use sibling relationships in layouts to find the labels associated with input widgets; in practice, however, sibling relationships do not accurately gauge proximity. SUPOR and UIREF both select the optimal label by computing the distances between labels and input widgets from their on-screen positions.

GDPR compliance checking. Several recent works focus on GDPR compliance checking (Torre et al. 2019; Palmirani and Governatori 2018; Ayala-Rivera and Pasquale 2018; Tom et al. 2018), but their methodologies differ markedly from ours. Torre et al. (2019) proposed a model-based GDPR compliance analysis solution using the unified modeling language (UML) and the object constraint language (OCL). Palmirani and Governatori (2018) presented a proof of concept applied to the GDPR domain that detects infringements of compulsory privacy norms, or prevents possible violations, using BPMN and the Regorous engine. These existing approaches conduct compliance checking from a modeling perspective.

8 Conclusion

In this paper, we extract declared permissions from Android privacy policies with a text classification task, extract used permissions through static analysis of Android applications, and build a detection engine that performs consistency analysis between the two. By analyzing a large number of Chinese privacy policies, we classify them into “hybrid” and “native” privacy policies according to whether they contain third-party SDK permission declarations. To address the lack of a Chinese-domain dataset, we use the BERT model to characterize sentence vectors and cosine similarity to measure the association between permission phrases, reducing the time cost of manual labeling. To address the underreporting of existing schemes, we propose a detection scheme that uses text classification to separately identify the permission information declared by the application itself and by third parties. On this basis, we design and implement an Android privacy policy and permission usage consistency analysis engine, and conduct functional and speed tests to establish its correctness and usefulness.

By comparing the extracted permissions with the permissions specified in the privacy policy, we identify inconsistencies and discrepancies. This analysis is a crucial step in ensuring that the privacy permissions an application actually uses align with those stated in its privacy policy. Beyond identifying potential privacy violations, the approach also helps prevent privacy infringement: by exposing inconsistencies between declared and actual permissions, our work empowers users to make informed decisions about the privacy implications of the applications they use, and to take proactive measures to protect their personal information and mitigate the risks of privacy breaches.

However, due to its larger size and more complex structure, the BERT model requires more computational resources and time for training and inference. BERT encodes text sequences with a Transformer containing multiple layers of self-attention, making computation relatively expensive; in our trained models, BERT's average inference time is roughly 3 to 5 times that of the TextRNN-series models. Hardware and software optimizations to improve BERT's efficiency are worth exploring.