1 Introduction

With the development of the intelligent network era, mobile device usage continues to grow, and the number of Android and iOS applications in application stores at home and abroad keeps increasing (Ardito et al. 2020; Feichtner and Gruber 2020). Mobile applications providing functions such as mobile payment and online shopping have steadily become part of people's daily lives. While enjoying the unprecedented convenience of feature-rich applications, however, users constantly face attacks and privacy risks caused by insecure applications. Malware targeting the Android platform is increasing day by day, and nearly half of malicious applications actually use permissions beyond those specified in their privacy policies (Wang et al. 2019; Elluri et al. 2020; Frenklach et al. 2021). As leaks of users' personal information increase, incidents of illegally buying and selling personal information for profit have intensified (Heid and Heider 2021).

Since sensitive personal information is closely tied to the basic rights of natural persons, such as personal dignity, major personal interests, and property interests, improper processing of such information poses major risks to those rights and to personal and property safety. Jurisdictions such as China, the United States, and the European Union have enacted special regulations governing the handling of sensitive personal information.

The analysis and detection of malicious behavior in Android applications has long been a hot research topic, and researchers at home and abroad have proposed many detection systems for Android malicious applications based on static and dynamic analysis (Wang et al. 2019; Elluri et al. 2020; Heid and Heider 2021; Vyas et al. 2021; Fan et al. 2020; Elahi et al. 2021). Static analyses of privacy leaks include LeakMiner (Yang and Yang 2012) proposed by Yang et al., the static analysis tool TrustDroid (Zhao and Osono 2012) proposed by Zhao et al., and the analysis tool FlowDroid (Arzt et al. 2014) proposed by Arzt et al. For dynamic analysis of privacy leakage, there are the TaintDroid (Enck et al. 2014) dynamic analysis tool proposed by Enck et al., AppFence (Hornyack et al. 2011) proposed by Hornyack et al., and Kynoid (Schreckling et al. 2013), which improves on the theory and practice of TaintDroid. Although dynamic and static analysis rest on different principles, both focus on detecting the App's behavior rather than the permissions the App declares, so neither can capture the functional intent the App expresses. For example, if an App applies for location permission in order to send the user's geolocation to a third-party server integrated with the App, conventional dynamic and static analysis will judge this behavior as a privacy leak even when it has been declared.

The purpose of a privacy policy is to inform the user, before using the App, what personal information may be collected while the App runs. For example, a map App needs to continuously obtain the user's address and location information in order to provide better service. Since the developer has informed users in advance, in the privacy policy, what personal information will be collected, such collection of sensitive information does not count as a privacy breach. But if the privacy policy fails to inform users, or deliberately conceals the collection of certain sensitive information, that is malicious behavior. To collect sensitive user information, an App must apply for the permission in advance and invoke the corresponding API in its source code. It is therefore essential to analyze the consistency between the App's permission declarations and its actual permission use. Gibler et al. (2012) found that most applications apply for sensitive permissions in order to transmit the user's sensitive information to a remote server over the network. Although Android has provided a runtime permission system since version 6.0, ordinary users know little about security and cannot effectively protect their personal information. Therefore, from the user's point of view, studying the consistency between an Android application's permission declarations and its actual permission use can help users discover and avoid installing malicious Android applications.

To our knowledge, little work has focused on the consistency analysis of privacy policies and permission applications in Chinese. Chinese is more semantically rich and ambiguous than English, which poses an additional challenge. To analyze the Android privacy leakage problem in the Chinese domain from a consistency perspective, this paper proceeds from two angles: first, it extracts the set of permissions affirmatively declared in the Android privacy policy; second, it identifies the sensitive APIs used in the APK source code through Android static analysis, which yields the set of permissions the App actually uses. This paper makes the following contributions:

  1. To address the lack of a Chinese dataset, this paper automates dataset construction with a similarity model (BERT + cosine similarity), expanding the corpus to 4800 sentences.

  2. Through extensive analysis of Chinese privacy policies, two different types of Chinese privacy policy are identified according to how they describe third-party SDK permission information.

  3. Based on our dataset, we chose mainstream algorithms such as CNN, GRU, LSTM, and BERT to implement the consistency analysis engine and conducted comparative experiments. The BERT model performed best.

2 Background

2.1 Android security mechanisms and privacy policies

Android, being an open-source system, requires specific security mechanisms for protection. These mechanisms rely primarily on the sandbox and the permission mechanism (Enck et al. 2009; Gargenta 2011; Felt et al. 2011). The sandbox mechanism isolates the running space of each App; each App runs as a separate process. The permission mechanism subdivides access rights into individual Android permissions, which makes managing the user's private information more convenient and secure (Davi et al. 2010; Bläsing et al. 2010). It restricts applications' access to device resources and user data through authorization. Android permissions fall into two types: normal permissions and dangerous permissions. Normal permissions do not involve user privacy, such as accessing the internet or reading device status. Dangerous permissions do involve user privacy, such as accessing contacts, location information, or SMS messages. When users install an application, the system displays the required permissions and requests user authorization. Once authorized, the application can access the corresponding resources and data. However, malicious applications can exploit the permission mechanism to invade user privacy; for example, a malicious application may request access to the user's contacts or location information and then transmit it to a remote server. Dangerous permissions are of utmost importance because they are closely tied to user information, and their improper use can result in the malicious dissemination of the user's private data.

The privacy policy is a document that informs users of how information and data are collected and used by a website or mobile software (Yu et al. 2015). The development of a privacy policy also enables network operators to better understand user habits and concerns and avoid certain legal issues. The following is a list of common content related to user privacy in Chinese privacy policies:

  1. How users' personal information is collected and used.

  2. How user information is shared with and used by third parties, and how it is publicly disclosed.

  3. How users' personal information is protected and preserved.

2.2 Commonly used text classification models

2.2.1 CNN

A convolutional neural network (CNN) is a model built around a special network structure, the convolutional layer (Kattenborn et al. 2021). During forward propagation, the convolution operation helps the model learn local features, which is especially effective in image applications. Convolutional layers can also be stacked deeply to further improve performance. Prediction with a convolutional neural network involves several steps. First, the input layer processes the input samples, for example training word vectors for input sentences via an embedding layer. Feature learning is then performed through convolution operations within the network. Since adjacent feature points often carry redundant information, a pooling layer typically follows the convolutional layer. Finally, fully connected layers adapt the learned features to the downstream task.
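These steps map directly onto code. The following minimal TextCNN sketch assumes PyTorch; the class name and hyperparameters are illustrative, not a configuration from the paper:

```python
# Minimal TextCNN sketch: embedding -> convolution -> pooling -> FC layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, num_classes=2,
                 kernel_sizes=(2, 3, 4), num_filters=64):
        super().__init__()
        # Input layer: train word vectors with an embedding layer.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Convolutions learn local n-gram features over the sentence.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes)
        # Fully connected layer adapts the features to the downstream task.
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, dim, seq)
        # Global max pooling removes redundancy among adjacent features.
        feats = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(feats, dim=1))        # (batch, classes)

# Example: a batch of 8 sentences, each 20 token ids long.
logits = TextCNN(vocab_size=5000)(torch.randint(0, 5000, (8, 20)))
```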

2.2.2 LSTM

LSTM is structurally distinct from CNNs (Greff et al. 2016). Its gate mechanism makes training on long inputs tractable, and it alleviates, to some extent, the vanishing and exploding gradient problems that afflict plain recurrent networks.

The function of each gate and its computation are introduced in turn below:

  1. Forget Gate: The forget gate screens the long-term stored information, retaining what is most beneficial to the model (Ghaeini et al. 2018). Using the gate mechanism, the previous cell state \(C_{t-1}\) is multiplied by a value between 0 and 1, which controls how much of it is carried forward into the current cell state \(C_{t}\).

    $$\begin{aligned} f_{t}= \sigma (W_{f} \cdot [h_{t-1},X_{t}] + b_{f}) \end{aligned}$$
    (1)
  2. Input Gate: Considering the degree of influence of the current input on the model, the current input information must be filtered (Ghaeini et al. 2018).

    $$\begin{aligned} i_{t} = \sigma (W_{i} \cdot [h_{t-1},X_{t}] + b_{i}) \end{aligned}$$
    (2)
  3. Output Gate: Determines the final output. The tanh function maps the cell state into \((-1, 1)\), weighting its significance, and the result is multiplied by the sigmoid-activated gate output.

    $$\begin{aligned} O_{t} = \sigma (W_{o} \cdot [h_{t-1},X_{t}] + b_{o}) \end{aligned}$$
    (3)
    $$\begin{aligned} C_{t} = f_{t} \cdot C_{t-1} + i_{t}\cdot \tilde{C}_{t} \end{aligned}$$
    (4)

    The cell state \(C_{t}\) at the current moment is computed as shown in Eq. (4), where \(\tilde{C}_{t} = \tanh (W_{C} \cdot [h_{t-1},X_{t}] + b_{C})\) is the candidate state built from the current input. The final output of the LSTM at the current moment is given by Eq. (5); a minimal code sketch of these computations follows the list.

    $$\begin{aligned} h_{t} = O_{t}\cdot \tanh (C_{t}) \end{aligned}$$
    (5)
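As a concrete illustration, the following sketch implements one LSTM time step following Eqs. (1)-(5) in NumPy; the function names and weight shapes are our own assumptions, not the paper's implementation:

```python
# One LSTM time step following Eqs. (1)-(5); names/shapes are assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev,
              W_f, b_f, W_i, b_i, W_o, b_o, W_c, b_c):
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, X_t]
    f_t = sigmoid(W_f @ z + b_f)          # Eq. (1): forget gate
    i_t = sigmoid(W_i @ z + b_i)          # Eq. (2): input gate
    o_t = sigmoid(W_o @ z + b_o)          # Eq. (3): output gate
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde    # Eq. (4): new cell state
    h_t = o_t * np.tanh(c_t)              # Eq. (5): hidden-state output
    return h_t, c_t
```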

2.2.3 GRU

GRU is a highly effective variant of the LSTM network (Fu et al. 2016). It has a simpler structure than LSTM yet achieves excellent results, and is therefore widely regarded as a versatile architecture. As a variant of LSTM, GRU also solves the long-term dependency problem of RNNs. GRU has fewer parameters, so it trains faster, and it reduces the risk of overfitting. A typical GRU cell consists of two gates: a reset gate r and an update gate z. As in an LSTM cell, the hidden state at time t is computed from the hidden state at time t-1 and the input value at time t; the standard formulation is given below.
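For reference, a standard GRU cell (this formulation is not spelled out in the original text; it follows the common convention and reuses the notation of the LSTM equations above) computes:

$$\begin{aligned} z_{t}&= \sigma (W_{z} \cdot [h_{t-1},X_{t}]) \\ r_{t}&= \sigma (W_{r} \cdot [h_{t-1},X_{t}]) \\ \tilde{h}_{t}&= \tanh (W_{h} \cdot [r_{t} \cdot h_{t-1},X_{t}]) \\ h_{t}&= (1-z_{t}) \cdot h_{t-1} + z_{t} \cdot \tilde{h}_{t} \end{aligned}$$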

2.2.4 Attention

The attention mechanism in deep learning distinguishes the importance of feature vectors within a sample, allowing the model to assign greater weight to more important feature points and greatly enhancing its feature-extraction ability (Vaswani et al. 2017). Since its introduction, the attention mechanism has demonstrated its importance both theoretically and practically, and has achieved significant results in many areas. At its core, attention operates on a query vector Query (Q), a key vector Key (K), and a value vector Value (V).
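Concretely, the widely used scaled dot-product form of Q/K/V attention (Vaswani et al. 2017), where \(d_{k}\) is the key dimension, is:

$$\begin{aligned} \mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left( \frac{QK^{T}}{\sqrt{d_{k}}}\right) V \end{aligned}$$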

2.2.5 BERT

BERT is a natural language processing pre-training model released by Google in November 2018, used to characterize text data (Devlin et al. 2018; Sanh et al. 2019; Liu et al. 2019). The BERT model uses the encoder of the Transformer (Vaswani et al. 2017) architecture as a feature extractor for context learning on top of static word vectors. As shown in Fig. 1, the Transformer consists of two parts: the encoder on the left and the decoder on the right, both built around a multi-head attention mechanism. Multiple heads allow word information to be considered from different dimensions and improve the model's generalization ability. Both parts also contain a fully connected layer, which integrates global features, and a normalization function that speeds up training. BERT's input combines WordPiece embeddings, sentence embeddings, and positional embeddings, with both the sentence and positional embeddings being trainable. By splitting words into stems and suffixes, WordPiece embeddings significantly reduce the vocabulary size.

Fig. 1 The structure of the BERT Transformer


BERT has two pre-training tasks:

  1. Masked language model (Devlin et al. 2018): 15% of the tokens are randomly covered with [MASK], and the masked words are then predicted from context, in the style of fill-in-the-blank completion (see the sketch after this list).

  2. Next sentence prediction (Devlin et al. 2018): mainly used to characterize the relationship between two sentences, judging whether they stand in a contextual (adjacent) relationship.
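To make the fill-in-the-blank idea of task 1 concrete, here is a minimal sketch using a masked-language-model head; the HuggingFace transformers pipeline and the bert-base-chinese checkpoint are our assumptions, not tools named by the paper:

```python
# Sketch: masked-word prediction ("fill-in-the-blank") with a Chinese BERT.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-chinese")
# "We will collect your location in[MASK]ormation." (Chinese example)
for cand in fill("我们会收集您的位置[MASK]息。"):
    print(cand["token_str"], round(cand["score"], 3))
```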

3 Case study

3.1 Privacy policy type definition

After analyzing a large number of Chinese privacy policies, we conclude that the Android privacy policies in most application markets fall into two types with respect to third-party SDKs:

  1. The first type, which we call “native”, does not specify the information collected by third-party SDKs; users are instead given the option to view the third party's own privacy policy. For a “native” privacy policy, extraction combines the permissions found in the App privacy policy with those extracted from the third-party SDK privacy policy to form the App's declared permission list.

  2. The second type is the “hybrid” privacy policy, which explicitly mentions third-party SDK permissions. An App of this type states in its privacy policy which third-party SDKs it integrates, describes their functions, and gives the corresponding permission declarations. The “hybrid” extraction process separates the App's own declared permissions from the third-party SDK permissions within the App privacy policy, avoiding the redundancy introduced by analyzing third-party SDK privacy policies.

Figure 2 shows the extraction processes used in this article for “native” and “hybrid” privacy policies.

Fig. 2 “Native” and “hybrid” privacy policy extraction process

3.2 Permission abuse problem

For the “hybrid” privacy policy, the traditional approach of additionally extracting permissions from the third-party SDK privacy policy over-privileges the declared set, which leads the consistency analysis framework to identify risky apps as normal apps, i.e., to underreport. Since a “hybrid” privacy policy already states the third-party SDK's permission information explicitly, this issue can be avoided by analyzing the App privacy policy alone.

For example, a “Snow**” app itself declares four permissions: camera/album, system notification, calendar read/write, and recording. The integrated third-party SDK additionally declares networking and device-information permissions. The sensitive permissions collected are therefore the camera, calendar, and recording information requested by the app itself, plus the device information and phone number requested by the third-party SDK. Define A1 as the set of declared permissions extracted from the application privacy policy following the “hybrid” analysis process; the corresponding Android permission names in A1 are CAMERA, READ_CALENDAR, RECORD_AUDIO, READ_PHONE_STATE, and READ_PHONE_NUMBERS. A1 thus consists of the permissions the application will use when strictly following its privacy policy. Now consider the treatment under a “native” privacy policy.

According to the traditional analysis process for “native” privacy policies, the third-party SDK privacy policy must also be analyzed. The privacy policy of the third-party “You**” SDK integrated into the “Snow**” app is partially stated there; among other things, the “You**” SDK obtains geographic location information and user device information. Define A2 as the set of declared permissions extracted from the third-party SDK privacy policy. Taking the union of A1 and A2 gives A = {CAMERA, READ_CALENDAR, RECORD_AUDIO, READ_PHONE_STATE, READ_PHONE_NUMBERS, ACCESS_FINE_LOCATION}. However, since the “hybrid” privacy policy already states the third-party SDK's permission information, the ACCESS_FINE_LOCATION permission added by the traditional scheme is redundant in A. This leads to false negatives, where the consistency analysis framework identifies risky applications as normal.
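The redundancy is easy to see with plain set operations. The sketch below uses the permission names from the example above (the `used` set is illustrative) to show how the traditional union-based “native” analysis masks an undeclared location access:

```python
# Declared-permission sets from the "Snow**" example (illustrative values).
A1 = {"CAMERA", "READ_CALENDAR", "RECORD_AUDIO",
      "READ_PHONE_STATE", "READ_PHONE_NUMBERS"}    # "hybrid" extraction
A2 = {"READ_PHONE_STATE", "ACCESS_FINE_LOCATION"}  # third-party SDK policy

A = A1 | A2   # traditional "native" analysis: union of both policies
used = A1 | {"ACCESS_FINE_LOCATION"}  # permissions the APK actually uses

print(used - A)   # set() -> judged consistent: a false negative
print(used - A1)  # {'ACCESS_FINE_LOCATION'} -> flagged by "hybrid" analysis
```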

4 Methodology

4.1 Android usage permission extraction process

By decompiling the bytecode, sensitive API functions can be matched to Android permissions; Table 1 lists part of this mapping. This paper uses Androguard (Mercaldo et al. 2016), a static analysis tool, to identify the APIs and permissions of APKs.

Table 1 Android permissions and their corresponding API functions (partial)

In this paper, we use the find_methods function provided by Androguard (Wang et al. 2015) to check whether a given API appears in the APK source code. The function accepts three parameters: first the package and class name, second the method name, and last the method descriptor, generally “.”. As shown in Listing 1, location permission, access to phone numbers, and access to the SD card are used as examples.

Listing 1 Using find_methods to probe for sensitive APIs
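The original listing does not survive extraction; the following is a hedged reconstruction of the queries it describes, assuming Androguard's AnalyzeAPK entry point and the Analysis.find_methods API (the APK path and the chosen Android framework methods are illustrative):

```python
# Reconstruction of Listing 1: probing an APK for sensitive API calls.
from androguard.misc import AnalyzeAPK

a, d, dx = AnalyzeAPK("app.apk")  # APK object, Dalvik code, Analysis object

def api_used(classname, methodname):
    # find_methods takes regex filters: class name, method name, descriptor
    # (the paper passes "." as the last argument to match any signature).
    return any(True for _ in dx.find_methods(classname, methodname, "."))

# ACCESS_FINE_LOCATION
print(api_used("Landroid/location/LocationManager;", "getLastKnownLocation"))
# READ_PHONE_NUMBERS / READ_PHONE_STATE
print(api_used("Landroid/telephony/TelephonyManager;", "getLine1Number"))
# SD card access (WRITE_EXTERNAL_STORAGE)
print(api_used("Landroid/os/Environment;", "getExternalStorageDirectory"))
```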

When find_methods is used to search for a specific API in Androguard, a positive result proves that the API exists in the code, and therefore that the permission corresponding to that API is actually used.

4.2 Chinese dataset construction process

Fig. 3 The process of constructing the Chinese dataset

Figure 3 shows the flow of Chinese dataset construction. First, the privacy policy text is preprocessed to obtain word segmentation results, which are combined by 3-Gram into permission phrases. Next, the permission phrases are semantically characterized with BERT and compared, by similarity, against the permission phrase expressions initially collected from the official Android documentation.

4.2.1 Data collection

The data samples in this article fall into two parts: Android application privacy policy texts and APK files. The privacy policy texts come from two sources. The first is the Google Play application market, collected with Python crawlers; domestic application markets do not require developers to upload privacy policies, so privacy policy data cannot be obtained from them directly. The second is the Baidu search API: for a given App name, the first item of the result returned by the HTTP request is taken as its privacy policy. In total, 413 privacy policies have been collected. The privacy policy text is used as raw data and is preprocessed before the training set is constructed; preprocessing includes sentence splitting, word segmentation, and stop-word removal.

4.2.2 Build permission phrases

In this paper, we use N-Grams (Cavnar et al. 1994; Bergsma et al. 2009; Lapata and Keller 2005) to combine words into phrases, avoiding the noise introduced by individual permission words. Permission phrase expressions are constructed with Jieba word segmentation followed by a 3-Gram operation, as sketched below.
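A minimal sketch of this step, assuming the jieba package (the example sentence is illustrative):

```python
# Build candidate permission phrases: Jieba segmentation + 3-Gram window.
import jieba

sentence = "我们可能会收集您的位置信息"  # "we may collect your location information"
tokens = jieba.lcut(sentence)             # word segmentation
# Slide a window of three consecutive words over the token sequence.
phrases = ["".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
print(phrases)
```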

4.2.3 Amplifying permission phrases

To automate the mining of additional permission phrase expressions, this paper uses a semantic representation (BERT pre-training model) + cosine similarity model and selects permission phrases whose similarity exceeds a threshold as positive samples for the training set. In this way, permission phrase expressions can be augmented automatically and at scale, and highly similar phrases can be mapped directly to specific permission names, greatly reducing the manual annotation workload.

First, we obtain the most basic permission phrase expressions from the short official descriptions of Android permissions, and then derive semantic vectors by characterizing the permission phrases. For this we use the bert-as-service tool developed by Dr. Han **.
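A sketch of the amplification step, assuming the bert-as-service client API (a server must be started separately) and a hypothetical 0.85 similarity threshold:

```python
# Amplify permission phrases by semantic similarity (threshold is assumed).
import numpy as np
from bert_serving.client import BertClient

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

bc = BertClient()  # connects to a running bert-as-service server
seed = bc.encode(["获取您的位置信息"])[0]          # official-style seed phrase
candidates = ["收集您的地理位置", "展示广告内容"]  # candidate 3-Gram phrases
vecs = bc.encode(candidates)

# Keep phrases whose similarity to the seed exceeds the threshold.
positives = [p for p, v in zip(candidates, vecs) if cosine(seed, v) >= 0.85]
print(positives)
```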

The permission mapper maps permission information to permission names: the BERT model, in its text multiclassification form, maps phrases to specific permission names.

The SDK permission detector identifies whether the requesting subject of a piece of permission information is an integrated third-party SDK: the BERT model, in its text binary-classification form, decides whether the requesting subject of the filtered permission information is a third-party SDK.

In the training phase, the training datasets of the permission filter, permission mapper, and SDK permission detector are constructed from the collected Chinese privacy policies, and the three components are trained as separate BERT models, as sketched below.
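Each of the three models can be fine-tuned in the same way. The following minimal sketch assumes the HuggingFace transformers library and the bert-base-chinese checkpoint; the toy data, label scheme, and hyperparameters are illustrative, not the paper's actual configuration:

```python
# Fine-tuning sketch for one of the three BERT classifiers.
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
# num_labels=2 for the filter/detector, 10 for the permission mapper.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=2)

texts = ["我们会收集您的位置信息", "感谢您使用本应用"]  # toy phrases
labels = torch.tensor([1, 0])            # 1 = permission-related, 0 = not
enc = tokenizer(texts, padding=True, truncation=True,
                max_length=64, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                       # a few epochs over the toy batch
    out = model(**enc, labels=labels)    # [CLS] pooled -> classifier head
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```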

In the prediction phase, data preprocessing is the same as in the training phase: privacy policies are text-processed and permission phrases are formed.

5 Implementation

5.1 Permission filter model design

This section uses the dynamic word-vector BERT model, which takes words as processing units and handles contextual relationships through the attention mechanism. The purpose of permission filtering is to divide the phrases in a privacy policy into permission-related and permission-independent phrases.

Fig. 7 The framework of the permission filter model. w1, ..., w4 represent the tokens that make up the sentence; [CLS] is the classification token for the sequence

In this section, the classification task of the BERT model is used to filter permission-independent information, where the BERT model chosen is the BERT-base Chinese pre-training model (Cui et al.).

The comparative results are shown in Fig. 8. Among the 10 categories of permissions in this section, READ_PHONE_STATE has the highest accuracy and recall, at 0.968 and 0.988, and CALL_PHONE has the highest AUC, at 0.978. Since permission mapping is essentially text multiclassification, the TextCNN model is less effective than the TextRNN and BERT models because it does not consider sequence features. The BERT model, unlike the TextRNN model, dynamically takes contextual information into account and achieves the best permission-mapping results, which is why it is used to build the permission mapper in this section.

Fig. 8 Comparative experiment results

The experiments in this section extract embedding information from different layers of the Transformer. The higher the Transformer layer, the better the evaluation metrics, indicating that higher layers carry more of the semantic features on which the classification of permission-mapping sentences relies.

6.3 SDK permission detector model evaluation

Text-mining the phrases in permission-related sentences shows that permission sentences about third-party SDKs take two main forms: one introduces the declaration directly under a generic name such as “third-party software”; the other specifies the third-party SDK explicitly. In this section, we use the BERT model to experimentally validate SDK permission detection. To verify the effectiveness of third-party phrase information for SDK permission detection, several groups of experiments were set up, using CNN-series models and the pre-trained BERT model for text classification. The experimental results are shown in Table 4. The accuracy is above 95%, showing that this scheme can correctly identify third-party SDKs from third-party phrase information; moreover, the BERT model achieves the highest AUC of 98.1% on the test set, showing that it performs the SDK permission detection task most effectively.

Table 4 The result of different models in SDK permission detection

7 Related work

Android application privacy policy analysis has long been a research hotspot (Yu et al. 2015; Benats et al. 2011; Yu et al. 2016, 2017; Wang et al. 2018). Wang et al. (2019) proposed SmartPi, which uses NLP to extract privacy-permission-related information from user reviews. Yu et al. (2016) proposed PPChecker, which uses NLP to analyze privacy policies and models three issues: incomplete, incorrect, and inconsistent privacy policies; it also employs program analysis techniques to examine applications and identify potential privacy leakage. Slavin et al. (2016) proposed a semi-automated framework for detecting violations based on a privacy-policy-phrase ontology and a collection of mappings from API calls to policy phrases. The most closely related work is by Wang et al. (2018), who automatically detect privacy leaks of user-entered data in a given Android app and determine whether such leakage may violate the app's privacy policy claims. However, the privacy policy analysis in these works applies only to English privacy policy texts and may not carry over to Chinese.

A few studies focus on GUI analysis (Azim and Neamtiu 2013; Huang et al. 2014; Mulliner et al. 2014; Huang et al. 2015; Nan et al. 2015). The most related works are UIPICKER (Nan et al. 2015), SUPOR (Huang et al. 2015), UIREF (Andow et al. 2017) and GUILEAK (Wang et al. 2018), whose goal is to identify sensitive user input information on the GUI. UIPICKER and GUILEAK use sibling relationships in layouts to find the labels associated with input widgets; in practice, however, sibling relationships do not accurately gauge proximity. SUPOR and UIREF both select the optimal label by computing the distances between labels and input widgets from their on-screen positions.

GDPR compliance checking. Several recent works focus on GDPR compliance checking (Torre et al. 2019; Palmirani and Governatori 2018; Ayala-Rivera and Pasquale 2018; Tom et al. 2018), but their methodologies differ markedly from ours. Torre et al. (2019) proposed a model-based GDPR compliance analysis solution using the unified modeling language (UML) and the object constraint language (OCL). Palmirani and Governatori (2018) presented a proof of concept applied to the GDPR domain that detects infringements of compulsory privacy norms, or prevents possible violations, using BPMN and the Regorous engine. These existing approaches conduct compliance checking from a modeling perspective.

8 Conclusion

In this paper, we extract declared permissions from Android privacy policies with a text classification task, extract used permissions through static analysis of Android applications, and build a detection engine that performs consistency analysis between the two. By analyzing a large number of Chinese privacy policies, we classify them into “hybrid” and “native” privacy policies according to whether they contain third-party SDK permission declarations. To address the lack of a Chinese-domain dataset, we use the BERT model to characterize sentence vectors and cosine similarity to measure the association between permission phrases, reducing the time cost of manual labeling. To address the underreporting of existing schemes, we propose a detection scheme that uses text classification to separately identify the permission information declared by the application itself and by third parties. On this basis, we design and implement an Android privacy policy and permission usage consistency analysis engine, and conduct functional and speed tests to establish its correctness and usefulness.

By comparing the extracted permissions with the permissions specified in the privacy policy, we identify inconsistencies and discrepancies. This analysis is a crucial step in ensuring that the privacy permissions an application actually uses align with those stated in its privacy policy. Beyond identifying potential privacy violations, the approach also helps prevent privacy infringement: by exposing inconsistencies between declared and actual permissions, our work empowers users to make informed decisions about the privacy implications of the applications they use, and to take proactive measures to protect their personal information and mitigate the risks of privacy breaches.

However, due to its larger size and more complex structure, the BERT model requires more computational resources and time for training and inference. BERT encodes text sequences with a Transformer containing multiple layers of self-attention, making computation relatively expensive; in our trained models, BERT's average inference time is roughly 3 to 5 times that of the TextRNN-series models. Hardware and software optimizations to improve BERT's efficiency are worth exploring.