1 Introduction

The onion router (Tor) network, one of the best-known darknet networks, gives end users a high level of privacy and anonymity. The Tor project was started in the mid-1990s by US military researchers to secure intelligence communications. However, a few years later, as part of their strategy, they made the Tor project available to the public [1]. Currently, onion domains are proliferating rapidly, and the latest statistics reported by the onion metrics website show a significant increase in the number of domains, which now exceeds 500,000.

Fig. 1: Overview of the Tor network monitoring tool. After crawling onion domains from the Tor network, the classification module [12] classifies them according to their crime category, and the proposed ranking module sorts the domains per category following their influence level

There are many legal uses for the Tor network, such as personal blogs, news domains, and discussion forums [2, 3]. However, due to its level of anonymity, the Tor darknet is also exploited by service traders, allowing them to promote their products freely, including but not limited to child sexual abuse (CSA) material [4], drug trading [4,5,6,7,8,9], and counterfeit personal identifications [10,11,12]. Moreover, the high level of privacy and anonymity provided by the Tor network prevents the authorities' monitoring tools from controlling the content or even identifying the IP address of the host behind a suspicious service. To address this problem, we collaborate with the Spanish National Cybersecurity Institute (INCIBE) to develop tools that can ease the task of monitoring the Tor darknet and detecting existing or new suspicious content. The proposed Tor monitoring framework is summarized in Fig. 1.

The first module of our Tor monitoring tool is an onion domain classifier, which detects and isolates categories of suspicious onion domains. For this task, we used the supervised text classifier already presented in [12], which categorizes hidden services (HS) into eight classes: pornography, cryptocurrency, counterfeit credit cards, drugs, violence, hacking, counterfeit money, and counterfeit personal identification, including driving-licence, identification, and passport.

The second module, which is the focus of this study, addresses the problem of ranking the HS that were classified as suspicious. Once they are ranked, a police officer can prioritize the work by focusing on the most influential onion domains. In our previous work [2], we presented ToRank, a ranking algorithm that sorts onion domains by analysing the connectivity of their hyperlinks, i.e., a link-based approach. In this work, we propose a content-based approach for ranking, including features extracted from the text, named entities, HTML code, domain position, and visual content, as explained in the following sections.

One of the difficulties we faced was defining the influence of a given onion domain. The literature is rich with definitions of the term influencers. In the social network analysis (SNA) field, it denotes highly participating members [13], key members [14], members who encourage others to participate [15], or members who can change the perspective of others using a sentiment analysis algorithm [16]. In the terrorist network analysis field, influencers refer to people who have connectivity with the majority of the network members, such as financial managers [17]. Furthermore, in the viral marketing discipline, it stands for opinion leaders who can persuade their audience to purchase or subscribe to a product or a service [18]. In this paper, we borrow this definition of influencers to refer to onion domains that can attract customers to visit their websites and potentially buy their products. The attractiveness of the onion domain website is subjective, and it can be determined through its public reputation among buyers, its confidentiality and reliability, or even the service quality it offers [19]. However, ranking onion domains considering these subjective factors is difficult because they depend heavily on customers’ opinions and impressions [20].

This work overcomes this difficulty by presenting a supervised ranking approach to sort onion domains based on various features extracted from the content and structure of the domains. The ranking function learns how to map between human opinion, i.e., the ground-truth order, and the features extracted from the domains. It therefore assigns each domain a score that reflects its influence, where a higher score indicates greater influence. Thanks to the text classification module [12], the proposed ranking framework works at the activity level of the domains and detects the influential HS in each category. Hence, this paper aims to answer the following question: What are the most influential onion domains in a determined area of activity?

Answering this question can improve the capability of law enforcement agencies (LEAs) to keep a close eye on the most influential suspicious domains by concentrating their monitoring efforts on them. Moreover, if an LEA takes a suspicious domain down and it reappears under a new address, the proposed ranking module can still recognize it as long as it hosts the same content. Additionally, when a new domain is released hosting suspicious content similar to that of a previously recognized influential domain, our ranking module can capture it before it becomes popular among Tor users. Therefore, LEAs can strike suspicious domains preemptively.

A straightforward strategy for detecting influential onion domains is to sort them by the number of client requests, i.e., by analysing the network traffic. However, the design of the Tor network is oriented to preventing this behaviour [21]. Chaabane et al. [22] conducted a deep analysis of Tor network traffic by establishing six exit nodes distributed worldwide with the default exit policy. Nonetheless, this approach cannot assess the traffic of onion domains that are not reachable through these exit nodes. Furthermore, operating such nodes can be risky, because Tor users can reach any onion domain, regardless of its legality, through the IP addresses of the machines dedicated to that purpose. Biryukov et al. [23] attempted to exploit the concept of entry guard nodes [24] to deanonymize clients of a Tor hidden service. However, this proposal will no longer be feasible once the vulnerability is fixed.

Another strategy reported in the literature to detect influential onion domains is using a link-based ranking algorithm such as ToRank [2], PageRank [25], hyperlink-induced topic search (HITS) [26], or Katz [27]. We explored link-based ranking algorithms in our previous work [2] and concluded that the main drawback of this approach lies in its dependency on hyperlink connectivity between onion domains [28]. Hence, if an influential but isolated domain exists in the network, this technique cannot recognize it as an essential item.

This paper presents an alternative approach for detecting influential onion domains by extracting features from domain content to train a learning-to-rank (LtR) algorithm [29,30,31]. In particular, given a list of HSs, our model ranks onion domains based on two key steps: content feature extraction and onion domain ranking. First, we represent each onion domain by a forty-element feature vector extracted from five different resources: 1) the textual content of the domain, 2) the textual named entities (NEs) in the user-visible text, such as product names and organization names, 3) the HTML markup code, taking advantage of specific HTML tags, 4) the visual content, such as the images exposed in the domain, and finally, 5) the position of the targeted onion domain in the Tor network topology. Second, the extracted features are cleaned and normalized to train a ranking function using the LtR approach to rank the domains and to propose the top-k domains as the most influential.

The ranking problem addressed in this work is close to the information retrieval (IR) field but with a significant difference. Both retrieve a ranked list of elements similar to how search engines work. For example, the Google search engine considers more than 200 factors to generate a ranked list of websites concerning a query [32]. However, in the context of our problem, we do not have a search term to order the results accordingly. Instead, our objective is to rank the domains based on a virtual query: What are the most attractive onion domains in a determined area of activities? Therefore, this model adopts IR to solve the problem of ranking and detecting the most influential onion domain in the Tor network without having an available search term.

Nevertheless, the proposed framework is not restricted to ranking the onion domains of the Tor network. It can be generalized and adapted to different areas with slight modifications in the feature vector, such as document ranking, web pages of the surface web, or users in a social network, among others.

The main contributions of this work are as follows:

  • We propose a novel framework to rank the onion domains of the Tor network and detect the most influential ones. Our strategy exploits five groups of features extracted from the Tor network via a hidden service modelling unit (HSMU). We use the extracted features to train a supervised learning-to-rank unit (SLRU). Our approach outperforms link-based ranking techniques, such as ToRank, PageRank, HITS, and Katz, when tested on samples of onion domains related to drug marketing (Fig. 2).

  • We propose 40 features extracted from five resources: 1) the user-visible text, 2) textual NEs, 3) the HTML markup code, 4) the visual content, and 5) the Tor network topology. In particular, we study how representing an onion domain by different subsets of these features affects the ranking framework, and we identify the most efficient combination of features relative to their extraction cost in terms of prediction time and the resources needed to build the feature extraction models.

  • We evaluated our approach on a manually ranked dataset of 290 domains extracted from the Tor network and dedicated to trading illegal drugs. Each onion domain was judged by three annotators and received its influence score based on a majority voting strategy.

Fig. 2: A general view of the proposed framework for ranking and detecting the influential onion domains in the Tor network. The dashed orange arrows indicate the training pipeline of the system, while the solid blue arrows indicate the testing/production phase

The rest of the paper is organized as follows. Section 2 summarizes the related work. Next, in Sect. 3, we present the procedure followed to build the dataset. Section 4 introduces the proposed ranking framework, including its main components. Section 5 describes the experimental settings and the configuration of the framework units. Section 6 addresses a case study to test the effectiveness of the proposed framework in a real-case scenario. Finally, Sect. 7 presents the main conclusions of this work and introduces other approaches that we are planning to explore in the future.

2 Related work

Several researchers have analysed suspicious activities on the darknet, including illicit drug markets [33,34,35], terrorist activities [36, 37], arms smuggling, violence, and cybercrime [6, 12, 38]. However, only a few have focused on detecting the most influential domains.

Some have used social network analysis (SNA) techniques to mine networks. Chen et al. [39] conducted a comprehensive exploration of terrorist organizations to examine the robustness of their networks against attacks. They simulated the attacks by removing the items with the highest in-degree or betweenness scores [40]. Al-Nabki et al. [2] proposed an algorithm called ToRank to rank and detect the most influential domains in the Tor network. ToRank represents the Tor network by a directed graph of nodes and edges; the most influential nodes are those whose removal would reduce the nodes’ connectivity. However, link-based approaches fail to evaluate isolated nodes that do not connect to the rest of the community.

Choi et al. [41] built hand-crafted features to identify key cyberbullies in social networks. They collected features from various network centrality measures, including degree centrality, betweenness centrality, closeness centrality, and PageRank, to analyse the connectivity of community members. Additionally, they used the Losada ratio, a ratio of positive-to-negative text sentiment, and a cyberbullying index, a ratio of insulting words that appear in the text. Similarly, [42] addressed the Twitter social network to identify key actors using the same network centrality measures along with sentiment analysis.

Anwar et al. [43] presented a hybrid algorithm to detect the influential leaders of radical groups in darknet forums. Their proposal is based on mining the content of the user’s profiles and their historical posts to extract textual features representing their radicalness. Then, they incorporated the obtained features in a customized link-based ranking algorithm based on PageRank [25] to build a ranked list of radically influential users.

A different perspective was taken by Biryukov et al. [23], who exploited the entry guard node concept [24] to deanonymize clients of an onion domain in the Tor network. The popularity of an onion domain is estimated by measuring its incoming traffic; nevertheless, this approach will no longer be feasible once the vulnerability is fixed.

The LtR framework has been used widely in the IR domain [44,45,46,47]. Li et al. [48] proposed an algorithm to help software developers deal with unfamiliar application programming interfaces (APIs) by offering software documentation recommendations, training an LtR model with 22 features extracted from four resources. Agichtein et al. [49] employed the RankNet algorithm to improve search engine results by incorporating features from user behaviour. Wang et al. [50] presented an LtR-based framework to rank the input parameter values of online forms, using six categories of features extracted from user contexts and patterns of user inputs. Moreover, LtR has been used for mining social networks [51,52,53] and to detect and rank critical events in the Twitter social network [54].

Table 1 Binary questionnaire used to build a ground-truth rank for the drug onion domains

3 Dataset construction

Darknet Usage Text Addresses 10K (DUTA-10K) is a publicly available dataset proposed by Al-Nabki et al. [2] that contains 10,367 onion domains from the Tor network distributed into 25 categories. In this paper, we consider the domains of the category Drugs as a case study for the proposed ranking framework. This category contains drug manufacturing, cultivation, and marketing topics, as well as drug forums and discussion groups. Out of the 465 drug domains in DUTA-10K, we selected only the English-language domains, which totalled 290. The ranking approach could be adapted to any collection of web domains, but we selected the drug-related domains owing to their high popularity in the Tor network. In addition, our approach is more general than HS ranking: it can be extended to document ranking or influence detection in social networks.

To annotate the dataset, thirteen people, including the authors, manually ranked the 290 drug-related domains. To ensure consistent ranking criteria among the annotators, we created a unified questionnaire of 23 subjective binary questions (Table 1) that the annotators answered for each domain. The ground-truth is built in a pointwise manner: for each domain, an annotator answers every question with 1 or 0, corresponding to Yes or No, respectively.

We repeated the process three times, assigning each annotator a new batch of approximately 23 domains each time. Thus, each onion domain was judged by three different annotators and, as a result, was represented by three binary vectors of answers. Following a majority voting approach, we unified the three answer vectors of every domain into a single vector of 23 dimensions, corresponding to the number of questions. Finally, we summed the answers of each domain to obtain a score value representing its ground-truth rank during training. In this context, a higher score means a more significant influence.
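As an illustration, the following minimal sketch (with hypothetical example data) shows how the three answer vectors of one domain could be unified by majority voting and reduced to a ground-truth score:

```python
import numpy as np

# Hypothetical binary answer vectors (23 questions) from the three annotators
# of one domain: 1 = "Yes", 0 = "No".
answers = np.array([
    [1, 0, 1] + [0] * 20,  # annotator 1
    [1, 1, 1] + [0] * 20,  # annotator 2
    [0, 0, 1] + [0] * 20,  # annotator 3
])

# Majority voting per question: 1 if at least two of the three answered "Yes".
majority = (answers.sum(axis=0) >= 2).astype(int)

# The domain's ground-truth influence score is the sum of the unified answers.
score = int(majority.sum())
print(score)  # 2 for this example
```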

4 Proposed ranking framework

This work presents a ranking framework for automatically ranking hidden services (HSs), i.e., the Tor network websites, according to user-defined criteria captured from a training set (Fig. 2). Our design has two components: 1) the hidden service modelling unit (HSMU) for extracting features from a given website domain in the Tor network and 2) a supervised learning-to-rank unit (SLRU) that trains a supervised ranking model.

4.1 Hidden service modelling unit

Given a hidden service domain \(d_i \in D\), where D is the set of domains collected from the Tor network, the HSMU represents \(d_i\) by a feature vector extracted from five sources: 1) the text, 2) the NEs, 3) the HTML code, 4) the visual content, and 5) the topology of D and the position of \(d_i\) in it.

4.1.1 Text features

Given the text of \(d_i\), we extract nine features from the following four sources.

Date and Time: a binary feature indicating whether \(d_i\) has been updated recently; if the most recent date found in the domain is close to today's date, the domain is marked as "updated" and as "obsolete" otherwise. Additionally, we count the date patterns within a date window to measure the number of recent changes in \(d_i\). We refer to these two features as recently_updated and update_counts, respectively.

Website URL: the URL address of an onion domain consists of 16 characters derived from a 1024-bit RSA key pair: the public key is hashed using the SHA-1 algorithm, the first 80 bits of the hash are encoded with a Base32 encoder, and the suffix ".onion" is added. Therefore, most generated onion domain URLs do not contain readable or meaningful words and can be seen as a random sequence of 16 characters. However, there are open-source tools capable of generating customized addresses, such as Shallot. These tools allow the onion domain address to include attractive, catchy words, such as cocaine or LSD, for a hidden service selling illegal drugs. The main obstacle is the exponential time required to customize domain names; for example, customizing seven characters takes one day of machine time, while customizing 10 characters requires 40 years of processing. To extract the URL features, we used a probabilistic model based on English Wikipedia unigram frequencies that splits concatenated letters into potential words, thanks to the Wordninja tool. From the URL words, we obtain two features: (i) the number of human-readable words, identified using the Nostril tool [55], and (ii) the number of their letters. We name these features URL_word_count and URL_letter_count, respectively.
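A minimal sketch of how these two features could be computed, assuming the wordninja and nostril Python packages (nostril exposes a nonsense() predicate; the fallback length heuristic below is an illustrative assumption, not the paper's exact rule):

```python
import wordninja                # splits concatenated letters into likely words
from nostril import nonsense    # heuristic nonsense-word detector

def url_features(onion_url: str) -> tuple:
    """URL_word_count and URL_letter_count for a 16-character onion address."""
    name = onion_url.lower().replace(".onion", "")
    readable = []
    for word in wordninja.split(name):
        try:
            ok = not nonsense(word)   # nostril raises ValueError on short input
        except ValueError:
            ok = len(word) >= 3       # fall back to a simple length heuristic
        if ok:
            readable.append(word)
    return len(readable), sum(len(w) for w in readable)

# A vanity address embedding real words scores higher than a random one.
print(url_features("silkroad7rn2puhj.onion"))
```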

Clone rate: refers to the number of HS that host the same content under different addresses. In our previous work [2], we recognized that some onion domains have identical or semi-identical text hosted under different URLs, particularly those with suspicious content. To detect duplication, we calculate the MD5 hash [56] of the domain text after preprocessing it by removing numbers, special characters, date and time formats, and the PGP signature. The clone_rate of \(d_i\) reflects the frequency of its MD5 hash code in the corpus.
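A minimal sketch of this fingerprinting step, assuming corpus_texts holds the extracted text of every domain; the preprocessing regexes are rough approximations of the cleaning rules described above:

```python
import hashlib
import re
from collections import Counter

def content_fingerprint(text: str) -> str:
    """MD5 fingerprint of a domain's text after stripping volatile content."""
    t = re.sub(r"-----BEGIN PGP.*?-----END PGP[^-]*-----", " ", text, flags=re.S)
    t = re.sub(r"\d{1,4}[-/:.]\d{1,2}[-/:.]\d{1,4}", " ", t)  # crude date/time patterns
    t = re.sub(r"[^a-z\s]", " ", t.lower())                   # drop digits and specials
    t = re.sub(r"\s+", " ", t).strip()
    return hashlib.md5(t.encode("utf-8")).hexdigest()

# clone_rate of each domain = frequency of its fingerprint across the corpus;
# corpus_texts is an assumed list with the extracted text of every domain.
fingerprints = [content_fingerprint(txt) for txt in corpus_texts]
counts = Counter(fingerprints)
clone_rate = [counts[fp] for fp in fingerprints]
```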

Term frequency-inverse document frequency (TF-IDF) vectorizer: an algorithm comprising two components, the term frequency (TF) and the inverse document frequency (IDF). The TF counts the number of times a word is used in a domain, while the IDF measures how important a word is across the list of onion domains; it is calculated by dividing the number of onion domains by the number of domains that contain that word. Finally, the TF-IDF weight is computed as shown in Eq. 1:

$$\begin{aligned} w_{(i,d)} = TF_{(i,d)} \times log_{2}\frac{N}{DF_i}, \end{aligned}$$
(1)

where \(w_{(i,d)}\) is the weight of word i in domain d, N is the size of domain set D, \(TF_{(i,d)}\) is the term frequency of word i in d, and \({DF_i}\) is the document frequency of word i in D.

Typically, it is good practice to filter out infrequent words by adjusting the max_features parameter of the TF-IDF algorithm so that only a specific number of features is considered. Following our previous work [12], we set the \(max\_features\) parameter to 10,000, sorted by the TF-IDF weight. The TF-IDF algorithm represents the text of each onion domain by a feature vector of 10,000 dimensions. In addition, it returns a dictionary (TF-IDF_dict) of length \(max\_features\) that holds the keywords and their weights. Applying the TF-IDF algorithm to a dataset of drug HS, we obtained the following top-10 words: cannabis, cocaine, quantity, kush, gram, crystal, heroin, psychedelic, drug, and strain. We consider the words common to the TF-IDF_dict and domain \(d_i\) as the domain keywords. Consequently, we define the following four features: 1) keyword_num: the number of keywords identified in \(d_i\), 2) keyword_TF-IDF_Acc: the accumulated TF-IDF weight of the keywords, 3) keyword_avg_weight: the average keyword weight, and 4) keyword_to_total: the number of the domain's keywords divided by the number of its words.
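A minimal sketch of these four keyword features with scikit-learn, assuming domain_texts holds one preprocessed string per domain; note that scikit-learn's TfidfVectorizer uses a smoothed, natural-log IDF, a close variant of Eq. 1:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# domain_texts is an assumed list holding one preprocessed string per domain.
vectorizer = TfidfVectorizer(max_features=10_000, min_df=3)
tfidf = vectorizer.fit_transform(domain_texts)

def keyword_features(i: int) -> tuple:
    """The four keyword features for domain i."""
    weights = tfidf[i].data                    # nonzero TF-IDF keyword weights
    keyword_num = len(weights)
    keyword_tfidf_acc = float(weights.sum())
    keyword_avg_weight = keyword_tfidf_acc / keyword_num if keyword_num else 0.0
    total_words = len(domain_texts[i].split())
    keyword_to_total = keyword_num / total_words if total_words else 0.0
    return keyword_num, keyword_tfidf_acc, keyword_avg_weight, keyword_to_total
```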

4.1.2 Named entities features

A named entity (NE) refers to the proper name of a real-world object, including but not limited to persons, organizations, or locations. In the Tor network, most entities come from sparse text without context, such as the product entity names mentioned under a product image in a marketplace. Therefore, it is vital to use a named entity recognition model that does not depend heavily on context. Hence, we used our previous work [57], which was designed especially for this case, rather than contextualized models such as the bidirectional encoder representations from transformers (BERT) [58]. The named entity recognition (NER) model recognizes six categories of named entities: persons (PER), locations (LOC), products (PRD), creative work (CRTV), corporations (COR), and groups (GRP). We map the extracted NEs into the following five groups of features:

NE number: counts the total number of entities in \(d_i\), regardless of category; we name this feature NE_counter.

NE popularity: an entity is popular if its frequency is greater than or equal to a threshold that we set to five, as explained in Sect. 5.2.2. For every category identified by the NER model, we use a binary feature that is 1 if \(d_i\) contains a popular entity of that category and 0 otherwise. We refer to this feature as popular_NEX, where X is the corresponding NER category.

NE TF-IDF: accumulates the TF-IDF weights of all the NEs detected in \(d_i\). This feature is denoted by NE_TF-IDF.

TF-IDF popular NE: accumulates the TF-IDF weights of the popular NEs, and it is named popular_NE_TF-IDF.

Emerging NE: the frequency of emerging product entities in \(d_i\). We used our previous work [5], based on the K-Shell algorithm [59] and graph theory, to detect emerging entities in HS. We denote this feature by emerging_NE.

4.1.3 HTML markup features

Among the available HTML parsing techniques, we used a regular expression pattern to detect hyperlinks because we realized that some onion domain pages reference other domains by mentioning their addresses within the text flow, without an \(<a\ href>\) HTML tag; libraries such as Beautiful Soup cannot detect them. For the rest of the HTML markup code of \(d_i\), we used the Beautiful Soup library to extract the following features:
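As an illustration of the first step, a sketch of a regular expression that catches (v2) onion addresses anywhere in the raw HTML, including those mentioned in plain text; the exact pattern used in the paper is not given, so this one is an assumption:

```python
import re

# v2 onion addresses are 16 Base32 characters (a-z, 2-7) followed by ".onion";
# matching the raw HTML also catches addresses mentioned outside <a href=...> tags.
ONION_RE = re.compile(r"\b[a-z2-7]{16}\.onion\b")

def find_onion_links(html: str) -> set:
    """Return every onion domain referenced anywhere in the page source."""
    return set(ONION_RE.findall(html.lower()))

html = 'Mirror: <a href="http://abcdefghij234567.onion">here</a> or qrstuvwxyz765432.onion'
print(find_onion_links(html))   # both addresses are found, hyperlinked or not
```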

Internal hyperlinks: counts the number of unique hyperlinks that share the same domain name as \(d_i\). We denote it by internal_links.

External hyperlinks: counts the number of pages referenced by \(d_i\) outside its own domain, whether on the Tor network or the surface Web. We refer to this feature as external_links.

Image tag count: corresponds to the number of images referenced in \(d_i\), calculated by counting the \(<img>\) tags in the HTML code of \(d_i\). We denote it by img_count.

Login and password: a binary feature indicating whether the domain requires login and password credentials. We used a regular expression pattern to detect such input fields. This feature is called needs_credential.

Domain title: a binary feature checking whether the \(<title>\) HTML tag has a textual value. We call it has_title.

Domain header: a binary feature checking whether the \(<h1>\) HTML tag contains text; we name it has_H1.

Title and header TF-IDF: the accumulated TF-IDF weight of the text in the title and header of \(d_i\). It is denoted by TF-IDF_title_H1.

TF-IDF image alternatives: some websites use an optional attribute called alt inside the image tag \(<img>\) to hold a textual description of the image; this text becomes visible to the end user as a substitute when the image is not loaded properly. This feature accumulates the TF-IDF weights of the alternative text and is denoted by TF-IDF_alt.

4.1.4 Visual content features

Visual content can be more attractive than text in drawing a customer's attention, and a suspicious service trader might incorporate authentic product images to create an impression of credibility. However, the images of interest to LEAs can be confused with noisy images, such as banners and logos. To isolate the interesting images, we built a supervised image classifier that categorizes the visual content into nine categories, of which eight are suspicious and one is others. The definition of these categories is based on our previous works [2, 12]. For the image classifier, we fine-tuned the Inception-ResNet V2 model [60]. The following features represent the visual content:

Image counts: three features corresponding to the total number of images in \(d_i\), the number of suspicious images, and the number of nonsuspicious (noise) images, where suspicious stands for images that can contain illicit content. We denote these features by total_count, suspicious_count, and noise_count, respectively.

Average classification confidence: the classifier confidence scores averaged over the suspicious and the nonsuspicious images of \(d_i\). These features are named avg_suspicious_conf and avg_normal_conf, respectively.

Majority class: a binary flag indicating whether the majority of the images published in \(d_i\) are suspicious. This flag is denoted by suspicious_majority.

4.1.5 Network structure features

We modelled the Tor network as a directed graph of nodes and edges. The nodes refer to onion domains, and the edges capture the hyperlinks between domains. This representation allowed us to build the following features:

In-degree: the number of onion domains pointing to domain \(d_i\); it is called in-degree.

Out-degree: the number of HS referenced by \(d_i\); it is named out-degree.

Centrality measures: for each domain \(d_i\) in the Tor network graph, we evaluate three node centrality measures: closeness, betweenness, and eigenvector [61, 62]. The closeness metric computes the length of the shortest paths from \(d_i\) to the other domains of the network. The betweenness measures the extent to which \(d_i\) lies on paths between other domains. Finally, the eigenvector centrality reflects the importance of \(d_i\) based on the centrality of its neighbours. Formally, given a graph \(G= (V, E)\) with a set of nodes V and edges E, the closeness centrality is the inverse of the sum of the shortest path distances between a domain \(d_i\) and the remaining \(|V|-1\) domains in G, as defined in Eq. 2:

$$\begin{aligned} cls(d_i) = \frac{|V|-1}{\sum _{v=1}^{|V|-1} dis(d_i, d_v)}, \end{aligned}$$
(2)

where \(cls(d_i)\) is the closeness of \(d_i\) and \(dis(d_i, d_v)\) is the shortest path distance between domains \(d_i\) and \(d_v\).

The betweenness of domain \(d_i\) is the sum of the fraction of all-pairs shortest paths that pass through \(d_i\); it is given by Eq. 3.

$$\begin{aligned} btwn(d_i) = \sum _{d_j, d_k \in V,\; d_j \ne d_i \ne d_k} \frac{\sigma (d_j,d_k \mid d_i) }{\sigma (d_j, d_k)}, \end{aligned}$$
(3)

where \(btwn(d_i)\) is the betweenness of \(d_i\), \(\sigma (d_j,d_k \mid d_i)\) corresponds to the number of shortest paths between domains \(d_j\) and \(d_k\) that pass through node \(d_i\), and \(\sigma (d_j, d_k)\) is the total number of shortest paths between domains \(d_j\) and \(d_k\).

The eigenvector centrality score of domain \(d_i\), denoted by \(eigvec(d_i)\), is proportional to the sum of the eigenvector scores of all connected domains. Therefore, the relative score of domain \(d_i\) is defined by Eq. 4.

$$\begin{aligned} eigvec(d_i)= \frac{1}{\lambda } \sum _{d_j\in V,\, d_j \ne d_i} a_{(d_i,d_j)}\, eigvec(d_j), \end{aligned}$$
(4)

This can be rewritten in matrix form as \(Ax=\lambda x\), where \(\lambda \) is an eigenvalue and A is the adjacency matrix of graph G: \(a_{(d_i,d_j)} = 1\) if there is a hyperlink between domains \(d_i\) and \(d_j\), and \(a_{(d_i,d_j)} = 0\) otherwise. Matrix A has multiple eigenvalues, but since all its components are nonnegative, the Perron-Frobenius theorem [63] guarantees a unique largest eigenvalue whose eigenvector can be chosen with all positive components. The eigenvector centrality is calculated iteratively: all node centralities are initialized to one and multiplied by A, the resulting vector is normalized, and the process is repeated until convergence [64].
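The paper builds these network features with NetworkX (Sect. 5.2.2); a minimal sketch, assuming hyperlinks is a list of (source, target) domain pairs:

```python
import networkx as nx

# Nodes are onion domains; directed edges are the hyperlinks between them.
# `hyperlinks` is an assumed list of (source_domain, target_domain) pairs.
G = nx.DiGraph()
G.add_edges_from(hyperlinks)

in_degree = dict(G.in_degree())              # in-degree feature
out_degree = dict(G.out_degree())            # out-degree feature
closeness = nx.closeness_centrality(G)       # Eq. 2
betweenness = nx.betweenness_centrality(G)   # Eq. 3
# Eq. 4; power iteration may need extra iterations on sparse directed graphs
eigenvector = nx.eigenvector_centrality(G, max_iter=1000)
```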

ToRank value: ToRank is a link-based ranking algorithm that orders the items of a given network following their centrality [2]. We applied ToRank to the Tor network to rank the onion domains and used the assigned rank as a node feature. Moreover, we use a binary flag to indicate whether \(d_i\) is among the top-X domains of ToRank. We refer to these features as ToRank_rank and ToRank_top-X, respectively.

After computing the described features (Table 2), we concatenate them to form a single feature vector. However, given the variety of feature scales, we normalize them by removing the mean and scaling to unit variance.
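A minimal sketch of this step with scikit-learn; the group arrays are hypothetical names for the per-source feature matrices:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One matrix per feature source (text, NEs, HTML, visual, network structure),
# each of shape (n_domains, n_group_features); stacking yields 40 columns.
# train_feature_groups / test_feature_groups are assumed names.
X_train = np.hstack(train_feature_groups)
X_test = np.hstack(test_feature_groups)

# Remove the mean and scale to unit variance, fitting on the training split only.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```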

Table 2 Summary of the HSMU feature vector

4.2 Supervised learning-to-rank unit

We adopt the LtR approach widely used in the information retrieval (IR) field. In a traditional IR problem, a training sample has three components: the query ID, a ranked list of answers to the query, and their relevance scores, which can be either binary [50] or multilevel [65]. However, our ranking problem differs in two significant ways. First, we do not have queries; we have a single abstract question: What are the most attractive onion domains in a determined area of activities? Second, all candidate domains are relevant, i.e., they practise the same activity, thanks to the classification component of the Tor monitoring pipeline (see Fig. 1). At the same time, the relevance score is not multilevel, because each domain received a numerical score calculated and assigned manually by human annotators, as described in Sect. 3. These scores represent the ground-truth while training the LtR models. Therefore, a training sample \(d_i\) has a feature vector and a score \(r_i \in R\), where R refers to the ground-truth set. The feature vector of each sample \(d_i\) can be modelled as \(V= \langle r_i, d_{i,1}, d_{i,2}, ..., d_{i,n}\rangle \), \(n \in N\), where \(d_{i,n}\) is the \(n_{th}\) feature of domain \(d_i\) and N is the total number of ranking features, i.e., \(N = 40\).

Our LtR schema aims to learn a function f that projects a feature vector into a rank value \((d_{i,1}, d_{i,2}..., d_{i,n})\xrightarrow {f} r_i\). Therefore, the goal of an LtR scheme is to obtain the optimal ranking function f that ranks D in a similar way to R, i.e., \(D \xrightarrow {f} R\). The learning loss function depends on the LtR architecture and is explained in the following three subsections.

4.2.1 Pointwise

The loss function of the pointwise approach considers only a single onion domain at a time [66]. It is a supervised classifier/regressor that independently predicts a relevance score for each domain, and the ranking is obtained by sorting the onion domains according to the predicted scores. For this LtR schema, we explore the multilayer perceptron (MLP) regressor [67]. This approach estimates the loss function based on a single item, i.e., an onion domain, as shown in Eq. 5.

$$\begin{aligned} L(f; D, R) = \sum _{i=1}^{|R|} (f(d_i)-r_i)^2 \end{aligned}$$
(5)
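A minimal pointwise sketch with scikit-learn; it approximates the paper's setup (the paper's own network uses dropout and NDCG-based early stopping, which scikit-learn does not replicate exactly), with X_train, y_train, and X_test as assumed names for the normalized features and ground-truth scores:

```python
from sklearn.neural_network import MLPRegressor

# Pointwise LtR: regress each domain's ground-truth score independently and
# rank by the predicted scores. Layer sizes follow Sect. 5.2.3.
mlp = MLPRegressor(hidden_layer_sizes=(128, 32), activation="relu",
                   early_stopping=True, random_state=0)
mlp.fit(X_train, y_train)

scores = mlp.predict(X_test)
ranking = scores.argsort()[::-1]   # domain indices, most influential first
```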

4.2.2 Pairwise

Pairwise transforms the ranking task into a pairwise classification task. In particular, the loss function takes a pair of items at a time and attempts to optimize their relative positions by minimizing the number of inversions compared to the ground-truth [68]. We use the RankNet algorithm [68], which is one of the most popular pairwise LtR schemes. The loss function of RankNet is given by Eq. 6, as:

$$\begin{aligned} L(f; D, R) = \sum _{i=1}^{|R|-1} \sum _{j=i+1}^{|R|} \theta (f(d_i)-f(d_j)), \end{aligned}$$
(6)

where \(\theta \) is the logistic function \(\theta (z) = \log (1+e^{-z})\) and the domains are indexed by decreasing ground-truth score, so that \(d_i\) should be ranked above \(d_j\).
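To make the objective concrete, a minimal sketch of the pairwise logistic loss of Eq. 6 over all ordered pairs (the trainable RankNet network itself is omitted; only the loss is shown):

```python
import numpy as np

def ranknet_loss(pred: np.ndarray, truth: np.ndarray) -> float:
    """Pairwise logistic loss of Eq. 6, summed over all ordered pairs.

    For every pair where truth[i] > truth[j], theta(z) = log(1 + exp(-z))
    penalizes the model unless pred[i] comfortably exceeds pred[j].
    """
    loss = 0.0
    for i in range(len(pred)):
        for j in range(len(pred)):
            if truth[i] > truth[j]:          # d_i should outrank d_j
                loss += np.log1p(np.exp(-(pred[i] - pred[j])))
    return loss

truth = np.array([3.0, 2.0, 1.0])
print(ranknet_loss(np.array([0.9, 0.5, 0.1]), truth))  # low: order agrees
print(ranknet_loss(np.array([0.1, 0.5, 0.9]), truth))  # high: order inverted
```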

4.2.3 Listwise

This approach extends the pairwise schema by looking at the entire list of samples at once [69]. One of the most well-known listwise schemes is the ListNet algorithm [70]. Given two ranked lists, the human-labelled scores and the predicted scores, the loss function minimizes the cross-entropy between their permutation probability distributions. The ListNet loss function is defined over all onion domains in R by Eq. 7, as:

$$\begin{aligned} L(f; D, R) = - \sum _{j=1}^{|R|} P_{R}(j)\log P_{f(D)}(j), \end{aligned}$$
(7)

where \(P_{s}(j)\) is the top-one probability of domain j under a Plackett-Luce model [71] with scores s, which is given by Eq. 8, as:

$$\begin{aligned} P_{s}(j) = \frac{\exp (s_j)}{\sum _{k=1}^{|R|}\exp (s_k)} \end{aligned}$$
(8)
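A minimal sketch of this listwise objective (the top-one variant of Eqs. 7-8), again showing only the loss:

```python
import numpy as np

def listnet_loss(pred: np.ndarray, truth: np.ndarray) -> float:
    """Cross-entropy of Eq. 7 between top-one Plackett-Luce distributions (Eq. 8)."""
    def top_one(s: np.ndarray) -> np.ndarray:
        e = np.exp(s - s.max())      # softmax, shifted for numerical stability
        return e / e.sum()
    p_truth, p_pred = top_one(truth), top_one(pred)
    return float(-(p_truth * np.log(p_pred)).sum())

truth = np.array([3.0, 2.0, 1.0])
print(listnet_loss(np.array([2.9, 2.1, 0.9]), truth))  # small: orders agree
print(listnet_loss(np.array([0.9, 2.1, 2.9]), truth))  # larger: orders disagree
```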

5 Experimental settings

To evaluate the proposed ranking framework, we tailored the experiments to answer three research questions:

  • What is the most suitable LtR schema for ranking the onion domains in the Tor network and detecting the influential domains?

  • When should each ranking approach, content-based or link-based, be used?

  • What is the best combination of features for the LtR model performance?

In the following, we discuss these questions, describe the analytical approach we conducted in detail, and present our findings.

5.1 Evaluation measure

The two most popular metrics for evaluating ranking in information retrieval systems are the mean average precision (MAP) and the normalized discounted cumulative gain (NDCG) [72, 73]. The main difference between the two is that MAP assumes binary relevance of an item with respect to a given query, i.e., an item can be either relevant or nonrelevant, whereas NDCG allows a numerical relevance score. NDCG is therefore better suited to our problem for two reasons. First, thanks to the onion domain classification component (see Fig. 1), all domains are relevant, i.e., they all belong to the same category, drug-related domains in this case. Second, both the ground-truth and the predicted rank are numerical scores produced by the LtR schemes.

To obtain NDCG@K, we first calculate DCG@K using the following formula (Eq. 9):

$$\begin{aligned} DCG@K = G_{1} + \sum _{i=2}^{K} \frac{G_i}{log_2(i)} \end{aligned}$$
(9)

where \(G_1\) is the gain score at the first position in the obtained ranked list, \(G_i\) is the gain score of item i in that list, and K refers to the first K items to calculate the DCG. To obtain a normalized version of DCG@K, it is necessary to divide it by IDCG@K, which is the ideal DCG@K sorted by the gain scores in descending order (Eq. 10).

$$\begin{aligned} NDCG@K = \frac{DCG@K}{IDCG@K} \end{aligned}$$
(10)
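A minimal sketch of Eqs. 9-10, using the ground-truth scores of Sect. 3 as gains:

```python
import numpy as np

def dcg_at_k(gains: np.ndarray, k: int) -> float:
    """Eq. 9: `gains` must already be ordered by the ranking under evaluation."""
    g = gains[:k].astype(float)
    discounts = np.concatenate(([1.0], np.log2(np.arange(2, len(g) + 1))))
    return float((g / discounts).sum())

def ndcg_at_k(truth: np.ndarray, pred: np.ndarray, k: int) -> float:
    """Eq. 10: DCG of the predicted order divided by the ideal DCG."""
    order = np.argsort(pred)[::-1]   # predicted ranking, best first
    ideal = np.sort(truth)[::-1]     # ideal ranking by ground-truth score
    return dcg_at_k(truth[order], k) / dcg_at_k(ideal, k)

truth = np.array([3, 2, 3, 0, 1])            # ground-truth influence scores
pred = np.array([0.9, 0.8, 0.7, 0.2, 0.1])   # model scores
print(ndcg_at_k(truth, pred, k=3))           # ~0.95: one misplaced item costs a little
```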

5.2 Module configuration

5.2.1 Hardware configurations

Our experiments were conducted on a PC with a 2.8 GHz Intel i7 CPU and 16 GB of RAM, running the Windows 10 OS. We implemented the ranking models using Python 3.

Table 3 The image classification performance using the F1 score over a test set of nine classes

5.2.2 HSMU configurations

We set the feature vector length of the TF-IDF text vectorizer to 10,000 with a minimum frequency of 3, following our previous work [12]. We used an NER model trained on the WNUT-2017 dataset. To set the popularity threshold of the popular_NEX feature, we examined four values (3, 5, 10, 15) and experimentally set it to 5. Additionally, we set the threshold of the recently_updated feature to three months before the dataset scraping date. To extract features from the HTML code, we used the BeautifulSoup library, and to construct the Tor network graph, we used the NetworkX library.

For the image classifier, we fine-tuned the Inception-ResNet V2 model [60] on a dataset of 11,700 images, split into 9,000 for training and 2,700 for testing and equally distributed over nine categories, as shown in Table 3. We collected the images from Google Images using a Chrome plugin called Bulk Image Downloader.

5.2.3 SLRU configurations

We used the dataset described in Sect. 3 to train and test the three LtR models. Due to the small number of samples in the drug domain, only 290 onion domains, we conducted 5-fold cross-validation following recommendations from previous works [70]. In each iteration, three folds were used for training the ranking model, one for validation, and one for testing. For the three LtR models, the number of iterations is controlled by an early stopping criterion, which is triggered when NDCG@10 no longer improves on the validation set [74].

The three LtR schemes described in Sect. 4.2 share the same network structure but differ in their loss functions. The neural network has two layers, with 128 and 32 neurons. For nonlinearity, a rectified linear unit (ReLU) activation function is used [75], and each ReLU layer is followed by a dropout layer with a rate of 0.5 [76] to avoid overfitting.
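For concreteness, a sketch of one way to define this shared scoring network; the paper does not name its deep learning framework, so PyTorch is assumed here:

```python
import torch.nn as nn

# The shared scoring network: it maps a 40-feature domain vector to a single
# influence score; only the attached loss function changes between the
# pointwise, pairwise, and listwise schemes.
class ScoringNet(nn.Sequential):
    def __init__(self, n_features: int = 40):
        super().__init__(
            nn.Linear(n_features, 128), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(128, 32), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(32, 1),
        )

model = ScoringNet()
```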

6 Results and discussion: drug case study

Fig. 3: A comparison between three LtR algorithms against multiple values of averaged NDCG@K over the five-fold cross-validation. The horizontal axis refers to the K value, and the vertical axis indicates the NDCG scores of the algorithms obtained at each value of K

6.1 Learning-to-rank schema selection

In Sect. 4.2, we explored three well-known LtR schemes, namely, pointwise, pairwise, and listwise, and for each one, we explored a supervised ranking algorithm: MLP, RankNet, and ListNet, respectively. We wanted to know the most suitable LtR schema for ranking the onion domains in the Tor network and detecting the influential ones. Figure 3 compares the three LtR algorithms using the NDCG@K metric for 10 different values of \(K= \{1, 3, 5, 7, 9, 15, 25, 35, 45, 55\}\), where 55 covers the complete test set. The values of K are not equally spaced: we selected five values below ten and five above, because a correct rank at the head of a ranked list is more important than at its tail [50, 77]. Figure 3 shows the superiority of the listwise approach over the other methods. The same figure shows that the NDCG@1 of ListNet is equal to one, which means that during the five folds of cross-validation, the algorithm always ranked the first domain of the test set correctly, exactly as in the ground-truth. It obtained NDCG@5 and NDCG@10 values of 0.97 and 0.93, respectively, and its lowest value, 0.88, was at NDCG@25. Additionally, as shown in Fig. 3, the pointwise approach, i.e., the MLP in our case, obtained the worst performance, which agrees with the conclusions of other researchers [65].

The superiority of the ListNet scheme comes from its ability to map a list of scores to a probability distribution, where the loss is calculated as the cross-entropy between the predicted and the target probability distributions. Therefore, ListNet considers the complete list of ranked items, while the pointwise and pairwise approaches ignore this structure.

Table 4 presents the top-10 drug domains nominated by each ranking algorithm.

Table 4 An example of the top-10 outputs of each ranking algorithm, sorted from the highest to the lowest influence. The rank is estimated based only on the output of the first fold of the cross-validation. ListNet has the highest NDCG@10

In addition to comparing the performance using NDCG@K, we registered the total time required to train and test each LtR model. More precisely, we measured the time from when the model receives a list of domains encoded by the HSMU (Sect. 4.1) until it produces the rank. On average, over the five folds, the ListNet model took 8.30 seconds for training and 0.08 seconds for testing, RankNet took 7.35 seconds for training and 0.007 seconds for testing, and the MLP model was the fastest, requiring 3.34 seconds for training and 0.0009 seconds for testing. This comparison shows that the ListNet model is the slowest due to the complexity of its loss function compared to the RankNet and MLP algorithms.

6.2 Link-based versus content-based ranking

Having two distinct ranking strategies raises a question: What is the most suitable ranking approach, content-based or link-based? To answer this question, we explore four link-based algorithms: ToRank [2], PageRank [25], hyperlink-induced topic search (HITS) [26], and Katz [27]. Ranking the onion domains of the Tor network using a link-based approach requires a directed graph representation. The graph nodes represent onion domains, and the directed edges capture the hyperlinks between domains. We compare these four link-based algorithms against the best LtR model, i.e., ListNet, which depends on the 40 features described in Sect. 4.1.

6.2.1 Comparison configuration

Unlike our supervised ranking approach, a link-based approach such as ToRank [2] does not require training data; it can be seen as unsupervised ranking. In contrast, LtR uses a portion of the data for training and another for testing. Therefore, to perform a fair comparison between the two approaches, we use five-fold cross-validation: we split the dataset into five parts, and each time, one fold is held out for testing while the remaining four folds are used to train the LtR model. Hence, both approaches are tested on the same test set, and we report the average NDCG of both. We evaluated several configuration parameters for the link-based algorithms and selected those that obtained the highest NDCG (Table 5).

Figure 4 shows that ListNet surpasses all the link-based ranking algorithms. We observe that even the weakest LtR approach, i.e., MLP, which obtained an NDCG@10 of 0.71, outperforms the best link-based ranking algorithm, ToRank, which scored an NDCG@10 of 0.69. This result emphasizes the importance of considering the content of domains rather than only their hyperlink connectivity. Nonetheless, a link-based approach such as ToRank remains a valid option, reaching an NDCG@10 of 0.69 without any labelling cost.

Fig. 4: A comparison between the content-based and link-based ranking algorithms for multiple values of K. The horizontal axis refers to the K value, and the vertical axis indicates the NDCG scores of the algorithms obtained at each value of K

6.3 Feature selection

In the previous sections, we concluded that ListNet outperformed the benchmarked techniques when each hidden service was represented by a feature vector of forty dimensions. However, the computational cost of these features varies: some, such as the visual content features, require building a dedicated image classification model, while others can be extracted merely with a regular expression. This cost is reflected in the time necessary to extract the features, build the ranking model, and run inference. On average, per domain, the prediction of the image classification model was the most expensive, taking 109 seconds, followed by the NER model with 22 seconds and the text features, which required 12 seconds. Finally, the HTML and graph features were the fastest to extract, requiring 3 and 2 seconds, respectively.

Table 5 The evaluated parameters for the link-based ranking algorithms. Bold values correspond to the selected configuration with the highest NDCG
Fig. 5: The effect of using different types of features, along with their combinations, on the ListNet ranking model. The vertical axis refers to the NDCG value, while the horizontal axis denotes the value of K. Each curve refers to a source of features: textual (text), features produced by named entity recognition (NER), HTML markup features (HTML), visual features (visual), graph features (graph), and all the features fused (All)

Furthermore, we used asymptotic notation to generalize the processing time of the features and compare their time complexities. In particular, the textual features have a time complexity of O(nL log(nL)), i.e., the complexity of computing the TF-IDF feature vector, where n is the total number of text sequences and L is the average length of these sequences [78]. In contrast, the HTML features have a time complexity of O(n). Regarding the network structure features, ToRank has a complexity of O(2n), and the remaining features have a time complexity of O(n). Lastly, the complexity of neural network-based models, such as the Inception-ResNet V2 image classifier or the NER model, depends on the structure of the neural network, i.e., the number of convolution layers and kernels [79]. Because of this, the visual content features are the most time-consuming.

To answer the question of what feature or combination of features produces the best LtR model performance, we compared ListNet rankers trained on different collections of features, as shown in Fig. 5.

We found that the features extracted only from text, denoted by text, achieved the highest NDCG@5 of 0.90. The features extracted from the NEs came in the second position, which obtained an NDCG@5 of 0.85. After that, using only features extracted from HTML, the ListNet model obtained an NDCG@5 of 0.81. In contrast, the graph features obtained the lowest NDCG@5 of 0.65, which indicates their weakness in ranking onion domains, unlike the features extracted from the text, which showed a significant and positive impact on the NDCG metric. Hence, the features extracted from the user-visible text are more representative than those from the visual content or the graph structure.

Furthermore, we examined the impact of aggregating the user-visible text features. Figure 5 shows an increase in the NDCG when the text, NER, and HTML features were combined: they scored an NDCG@5 of 0.95, compared to 0.97 when all the features were used. Hence, the graph and visual features can be ignored at the cost of a 0.02 decrease in NDCG. However, at NDCG@10, the user-visible text features scored 0.88 versus 0.93 for all the features, a decrease of 0.05. This result emphasizes the ability of the proposed ranking framework to rank onion domains regardless of whether they are isolated in the network or carry visual content. Therefore, further exploration of textual features, mainly textual semantic representations such as BERT [80], could significantly boost the ranking results.

6.4 Limitations of the content-based ranking

The content-based ranking approach has some limitations. As it falls under the supervised learning umbrella, it requires preranked data, which can be labour-intensive to obtain. Moreover, building a training set requires answering subjective questions that depend on the annotators' opinions, such as "Do you feel that this domain is trustable?"; if the answers are not normalized to a standard, more noise is introduced into the dataset. Furthermore, when a domain blocks the crawler from exploring its content by requesting login credentials, a content-based ranker cannot analyse the content and produce the expected output.

7 Conclusions and future work

The Tor network hosts suspicious activities that LEAs might be interested in monitoring. Ranking the onion domains according to their influence inside the Tor network helps LEAs prioritize domains and streamline the monitoring process.

In this paper, we benchmarked three supervised learning-to-rank (LtR) algorithms, MLP, RankNet, and ListNet, to detect and rank the most influential onion domains. The proposed framework consists of two components: 1) a hidden service modelling unit (HSMU), which represents an onion domain by 40 features extracted from the domain user-visible text, the HTML markup of the web page, the NEs in the domain text, the visual content, and the Tor network structure; and 2) a supervised learning-to-rank unit (SLRU), which builds a ranking model.

We tested the effectiveness of our framework on a manually ranked dataset of 290 onion domains related to drug trading. We found that the ListNet algorithm outperforms the alternatives, achieving an NDCG@10 of 0.93.

Furthermore, we analysed the impact of the feature collections on ranker performance. We found that using only the user-visible textual features extracted from the text, NEs, and HTML markup code, the model achieves 0.95 at NDCG@5 and 0.88 at NDCG@10, compared to 0.97 and 0.93, respectively, when all the features are used. Hence, using only features from the user-visible text allows the model to perform comparably with less complexity.

In the future, we plan to boost the explored features with a contextualized language model, such as BERT, to extract semantic features from the onion domain text [80].