1 Introduction

“Dark web” is a generic term for the subset of the Web that, besides not being indexed by popular search engines, is accessible only through specific privacy-preserving browsers and overlay networks. These networks, often called darknets, implement suitable cryptographic protocols to keep anonymous both the identity of the services offering content and that of the users accessing them. The best known and most widespread of them is probably Tor, which takes its name from The Onion Routing protocol it is based upon. Tor guarantees privacy and anonymity by redirecting traffic through a set of relays, each adding a layer of encryption to the data packets it forwards. The equivalent of a domain on the surface Web is called a Hidden Service (HS) in Tor.

Past research on the Tor network has evaluated its security [8], evolution [22], and thematic organization [36]. Nevertheless, an in-depth study of Tor’s characteristics is difficult due to the limited number of Tor entry points on the surface web. In this paper, building on and extending previous results on the topic [5, 6], we aim at better characterizing the Tor Web by analyzing three crawling datasets collected over a five-month time frame. In line with previous work on the WWW [11] and with a recent trend for criminal networks and dark/deep web [5, 13, 20, 34], we investigate Tor as a complex system, shedding new light on usage patterns as well as dynamics and resilience of the Tor Web. We consider the Tor Web graph aggregated by HS, i.e., the network of Tor HSs connected by hyperlinks – not to be confused with the network of Tor relays. We analyze the topology of two different graph representations of the Tor Web – directed and undirected – also using local properties of the graphs to characterize the role that different services play in the network. Relying on a large dataset of manually tagged HSs [2], we relate a few structural properties to the thematic organization of Tor’s web content.

Along with the three snapshot graphs induced by the three crawling datasets, we also consider an intersection graph and a union graph, in an effort to discriminate intrinsic features from noise. As a side effect, the present paper also addresses several open questions about the persistence of the Tor Web, showing the actual changes that took place in the quality, quantity and shape of available services and in their interconnections over the considered time span.

Overall, Tor turns out to have significant structural differences with respect to the WWW. Our main findings may be summarized as follows:

  • The Tor Web resembles a small-world network but is rather inefficient, consisting of a tiny strongly connected component (SCC) surrounded by a multitude of services that can be reached from the SCC but do not allow getting back to it.

  • The stable core of the Tor Web is mostly composed of in- and out-hubs, whereas the periphery is highly volatile. The in- and out-hubs are generally separate services in Tor.

  • The (relatively small) undirected subgraph of the Tor Web, obtained by only considering mutual connections, is quite efficient despite lacking most of the features of a small-world network. As a matter of fact, the undirected graph better preserves the social organization of the network, such as its community structure, which appears to be generally stable and, as such, meaningful.

  • Both the volatility of Tor’s HSs and the tendency of the HSs to cluster together are unrelated to the services’ content.

  • With a few exceptions, the topological metrics are scarcely informative of the activity occurring on a service; however, the “hubbiness” of a HS may be of some help in detecting “suspicious” activities (as defined in [1]).

To the best of our knowledge, the amount of data we collected for the study of the Tor Web exceeds previous efforts reported in the literature [6, 7, 16].

1.1 Related work

Duxbury and Haynie [16] examine the global and local network structure of an encrypted online drug distribution network, aiming to identify vendor characteristics that can help explain variations in the network structure. Their study leverages structural measures and community detection analysis to characterize the network structure. Norbutas [31] made use of publicly available crawls of a single cryptomarket (Abraxas) during 2015 and leveraged descriptive social network analysis and Exponential Random Graph Models (ERGMs) to analyze the structure of the trade network. He found the structure of the online drug trade network to be primarily shaped by geographical boundaries, leading to strong geographic clustering – especially strong between continents and weaker among countries within Europe. As such, he suggests that cryptomarkets might be more localized and less international than previously thought. Christin [13] collected crawling data on specific Tor hidden services over an 8-month lifespan, evaluated the evolution/persistence of such services over time, and performed a study of the contents and the topology of the explored network. The main difference with our work is that the Tor graph we explore is much larger, not being limited to a single marketplace. In addition, we present here a more in-depth evaluation of the graph topology.

De Domenico and Arenas [15] used the data collected in [4] to study the topology of the Tor network. They gave a characterization of the topology of this darknet and proposed a generative model for the Tor network to study its resilience. Their viewpoint is quite different from ours, as they consider the network at the autonomous system (AS) level. Griffith et al. [20] studied the graph-theoretic properties of the dark web. Bernaschi et al. [5] aimed at relating semantic content similarity with Tor topology, searching for smaller connected components that exhibit a larger semantic uniformity. Their results show that the Tor Web is very topic-oriented, with most pages focusing on a specific topic and only a few pages dealing with several different topics. Further work [6] by the same authors features a very detailed network topology study, investigating similarities and differences with the surface Web and applying a novel set of measures to the data collected by automated exploration. They show that no simple graph model fully explains Tor’s structure and that out-hubs govern the structure of the Tor Web.

1.2 Roadmap

The rest of the paper is organized as follows. In Section 2 we describe: (i) our dataset, including statistics about the organization of the hidden services as websites (tree structure, number of characters and links); (ii) the DUTA dataset we used for content analysis. In Section 3 we describe how we extracted our graph representations from the available data and we recall the definitions of all graph-related notation and metrics used throughout the paper. In Section 4 we present and discuss the results of our in-depth analysis of the Tor Web, carried out through a set of structural measures and statistics. We study properties such as the bow-tie decomposition, global and local (i.e., vertex-level) metrics, degree distributions, community structure, and content-related distributions and metrics. Finally, we draw conclusions in Section 5.

2 Data

The present paper analyzes a dataset that is the result of three independent six-week runs of our customized crawler, resulting in three “snapshots” of the Tor Web: SNP1, SNP2 and SNP3. The design of the crawler and the outcome of the scraping procedures are reported in Appendix 1 and more extensively discussed in [6, 12].

It is quite common to analyze a dataset obtained by crawling the web. Yet, it must be kept in mind that the analysis may be susceptible to fluctuations due to the order in which pages have been first visited – and, hence, not revisited thereafter [26]. In the case of the Tor Web, the issue is exacerbated by the renowned volatility of Tor hidden services [7, 8, 32]. By executing three independent scraping attempts over five months, we aimed at making our analysis more robust and at telling apart “stable” and “temporary” features of the Tor Web.

In total, we reached millions of onion pages (more than 3 million in the second run alone) and almost 30 thousand distinct hidden services. The distribution of these hidden services across the three snapshots is reported in Table 1. Although active services may temporarily appear offline to the crawler (e.g., due to all paths to those services being unavailable), these statistics are quite informative about the volatility of the Tor web. Just 10685 onion URLs were successfully reached by all three crawling runs. It is quite likely that those hidden services were durably present over the considered five-month time frame; they account for, respectively, 83.3% of SNP1, 42.2% of SNP2 and 61.2% of SNP3. Among the hidden services that are absent in just one of the three datasets, especially notable are the 76 hidden services that reappeared in SNP3 after disappearing during SNP2.

Table 1 Services persistence over time; in total we reached almost 30000 different hidden services

To provide a better picture of the complexity of Tor websites, for each and every hidden service we proceeded as follows: i) we reconstructed the whole tree structure of sub-domains and pages; ii) we computed the total number of characters and the total number of hyperlinks (i.e., the number of hrefs in the HTML source). Figure 1 shows the statistical distribution of tree heights for the three snapshots and the distribution of tree height variations across different snapshots (for hidden services present in at least two snapshots). The trees are generally very short and do not vary remarkably over time, yet exceptions exist, with variations comparable to the maximum “size” of a hidden service. The char count is highly variable, whereas services with 0 hyperlinks are predominant. A significant number of hidden services have one hyperlink every 20 to 200 chars (i.e., from \(\approx 3\) words up to \(\approx 2\) sentences). In the following sections we rely on the ratio of the number of hyperlinks to the number of characters (links-to-char ratio, or LCRatio) to assess whether hidden services that are central in the Tor Web graph are indeed just link directories. It is worth noting that, of the 10685 hidden services reached in all three snapshots, only \(\approx 65\%\) had a constant tree height and only \(\approx 43\%\) had a constant char count across all snapshots. Detecting hidden services that stay durably online but under different names (e.g., to prevent being tracked down) thus requires manual work that lies beyond the scope of the present paper.
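As a concrete illustration, the LCRatio of a page can be computed from its HTML source with the standard library alone. The parser and the sample page below are our own illustrative stand-ins, not part of the paper's actual pipeline:

```python
from html.parser import HTMLParser

class LinkAndCharCounter(HTMLParser):
    """Counts href attributes and visible text characters in an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = 0
        self.chars = 0

    def handle_starttag(self, tag, attrs):
        # Each href attribute counts as one hyperlink.
        self.links += sum(1 for name, _ in attrs if name == "href")

    def handle_data(self, data):
        self.chars += len(data)

def lc_ratio(html_source):
    """Links-to-char ratio (LCRatio): number of hrefs over number of characters."""
    parser = LinkAndCharCounter()
    parser.feed(html_source)
    return parser.links / parser.chars if parser.chars else 0.0

# One link in 19 characters of text, i.e. roughly one hyperlink
# every ~20 chars -- the "link directory" end of the scale.
page = '<a href="http://example.onion">link</a> some text here'
```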

Fig. 1

Distribution of tree heights in the three snapshots (a) and distribution of tree height variations across different snapshots for hidden services present in at least two snapshots (b)

For content analysis we rely on the DUTA dataset, the widest publicly available thematic dataset for Tor, consisting of a three-layer classification of 10250 hidden services [1, 2]. Although the DUTA classification does not cover our dataset entirely, the percentage of HSs of our snapshots contained in the DUTA dataset is significant: for instance, \(\approx 49.5\%\) of the fully persistent HSs found in all three snapshots, and \(\approx 85\%\) of the 200 HSs having most hyperlinks to other HSs, have a DUTA tag. In addition, the DUTA dataset has the undeniable advantage of being manually tagged – by choosing it rather than carrying out a fresh classification of our dataset, we trade coverage for accuracy.

The DUTA dataset provides a two-layer thematic classification plus a language tag for each service. The thematic classes are further categorized as “Normal”, “Suspicious” or “Unknown”. The “Unknown” category only includes classes that correspond to services whose nature could not be established: “Empty”, “Locked” or “Down”. Due to the limited information provided by these tags, we ignore all “Unknown” services in the following. For certain first-layer classes (e.g., “Marketplace”) that can be both “Suspicious” and “Normal”, the second layer is used precisely to tell apart “Legal” and “Illegal” content. We consider the second layer for this purpose only, thus obtaining the customized version of the DUTA thematic classification reported in Table 2.

Table 2 The content-based classification used in this paper

3 Methods

3.1 Graph construction

From each of the three WARC (Footnote 1) files obtained from the scraping procedures we extracted two graphs: a Directed Service Graph (DSG) and an Undirected Service Graph (USG). As detailed in [12], a vertex of these graphs represents the set of pages belonging to a hidden service. In the DSG, a directed edge is drawn from hidden service HS1 to HS2 if any page in HS1 contains at least a hypertextual link to any page in HS2 (Footnote 2). The directed graphs obtained from the three snapshots are denoted DSG1, DSG2, and DSG3, respectively. In the USG, instead, an undirected edge connects hidden services HS1 and HS2 if they are mutually connected in the corresponding DSG, that is, if there exists at least one page in HS1 linking any page in HS2 and at least one page in HS2 linking any page in HS1. More formally, an edge \((u,v) \in E_{USG}\) iff \((u,v) \in E_{DSG}\) and \((v,u) \in E_{DSG}\). Figure 2 shows an example of the construction of a DSG and a USG. When we consider just mutual connections, the vast majority of vertices remain isolated. These are ignored in the following since they convey no structural information; in other words, we consider edge-induced graphs. The undirected graphs obtained from the three snapshots are denoted USG1, USG2, and USG3, respectively.
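The USG construction can be sketched in a few lines; the edge list below is toy data, assuming each hidden service has already been collapsed to a single vertex:

```python
def usg_from_dsg(dsg_edges):
    """Build the USG edge set: keep a pair {u, v} only if both directed
    edges (u, v) and (v, u) appear in the DSG."""
    edge_set = set(dsg_edges)
    usg = set()
    for u, v in edge_set:
        if u != v and (v, u) in edge_set:
            # Canonical order, so each mutual pair is stored once.
            usg.add((min(u, v), max(u, v)))
    # Vertices left without any mutual connection are dropped implicitly:
    # the USG is the graph induced by this edge set.
    return usg

# A<->B are mutually connected, A->C is one-way, C<->D are mutual.
dsg = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "D"), ("D", "C")]
```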

Fig. 2

A toy example showing a Directed Service Graph (DSG) and an Undirected Service Graph (USG). The USG is built from the DSG by keeping only mutually connected hidden services. We consider edge-induced graphs, thus isolated vertices are ignored

Since the snapshot graphs are inevitably conditioned by the effect of scra** a reputedly volatile network, we also consider the edge-induced intersection and union of the aforementioned graphs. Precisely, we denote DSGI the graph induced by the edge set \(E_{DSGI}= E_{DSG1} \cap E_{DSG2} \cap E_{DSG3}\) and DSGU the graph induced by the edge set \(E_{DSGU}= E_{DSG1} \cup E_{DSG2} \cup E_{DSG3}.\) Analogously, USGI is induced by the edge set \(E_{USGI}= E_{USG1} \cap E_{USG2} \cap E_{USG3}\) and USGU is induced by the edge set \(E_{USGU}= E_{USG1} \cup E_{USG2} \cup E_{USG3}.\)

We do not preserve multi-edges in order to allow a direct comparison with most previous work on other web and social/complex networks. However, in both directed and undirected graphs, we store the information about the number of links that have been “flattened” onto an edge as a weight attribute assigned to that edge – taking the minimum available weight for edges of our intersection graph and the maximum for the union. We interpret the edge weight as a measure of connection strength that does not alter distances but expresses endorsement/trust and quantifies the likelihood that a random web surfer [33] travels on that edge.
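The two paragraphs above can be combined into one sketch: snapshot graphs as edge-to-weight maps (the weight being the number of hyperlinks flattened onto an edge), with the minimum weight kept for the intersection and the maximum for the union. The snapshot dicts are toy data:

```python
def intersect_and_union(snapshots):
    """Edge-induced intersection and union of snapshot graphs.
    Each snapshot maps an edge to its weight, i.e. the number of
    hyperlinks flattened onto that edge. The intersection keeps edges
    present in every snapshot with the minimum weight; the union keeps
    edges present in any snapshot with the maximum weight."""
    common = set.intersection(*(set(s) for s in snapshots))
    intersection = {e: min(s[e] for s in snapshots) for e in common}
    union = {}
    for s in snapshots:
        for e, w in s.items():
            union[e] = max(union.get(e, 0), w)
    return intersection, union

# Toy snapshots: only the A->B hyperlink survives in all three runs.
snap1 = {("A", "B"): 3, ("B", "C"): 1}
snap2 = {("A", "B"): 5}
snap3 = {("A", "B"): 2, ("C", "D"): 4}
```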

3.2 Graph analysis

In line with previous work on Web and social graphs [11, 19, 21, 26], we analyze the Tor Web graph through a set of structural measures and statistics, including a bow-tie decomposition of the directed graphs, global and local (i.e., vertex-level) metrics, and modularity-based clustering. The main graph-based notions and definitions are reported in the following, while graph-related symbols used throughout the paper are reported in Table 3.

Table 3 Basic graph notations and definitions used throughout the paper

Bow-Tie decomposition In a directed graph, two vertices u and v are strongly connected if there exists a path from u to v and a path from v to u. Strong connectedness defines equivalence classes called strongly connected components. A common way to characterize a directed graph consists in partitioning its vertices based on whether and how they are connected to the largest strongly connected component of the graph. This “bow-tie” decomposition [26] consists of six mutually disjoint classes, defined as follows: (i) a vertex v is in LSCC if v belongs to the largest strongly connected component; (ii) v is in IN if v is not in LSCC and there is a path from v to LSCC; (iii) v is in OUT if v is not in LSCC and there is a path from LSCC to v; (iv) v is in TUBES if v is not in any of the previous sets and there is a path from IN to v and a path from v to OUT; (v) v is in TENDRILS if v is not in any of the previous sets and there is either a path from IN to v or a path from v to OUT, but not both; otherwise, (vi) v is in DISCONNECTED.
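The six classes above can be computed directly from forward and backward reachability; a minimal sketch on a toy directed graph (the quadratic LSCC search is for illustration only, a linear-time SCC algorithm such as Tarjan's would be used at scale):

```python
from collections import defaultdict, deque

def reachable(adj, sources):
    """Vertices reachable from `sources` via BFS over an adjacency map."""
    seen, queue = set(sources), deque(sources)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bow_tie(vertices, edges):
    """Assign each vertex to one of the six bow-tie classes."""
    fwd, bwd = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        bwd[v].append(u)
    # The SCC of v is the set of vertices both reachable from v and
    # reaching v; keep the largest one found (quadratic worst case).
    lscc, done = set(), set()
    for v in vertices:
        if v in done:
            continue
        scc = reachable(fwd, {v}) & reachable(bwd, {v})
        done |= scc
        if len(scc) > len(lscc):
            lscc = scc
    out_side = reachable(fwd, lscc) - lscc   # reachable from the LSCC
    in_side = reachable(bwd, lscc) - lscc    # can reach the LSCC
    from_in = reachable(fwd, in_side)        # has a path from IN
    to_out = reachable(bwd, out_side)        # has a path to OUT
    classes = {}
    for v in vertices:
        if v in lscc:
            classes[v] = "LSCC"
        elif v in in_side:
            classes[v] = "IN"
        elif v in out_side:
            classes[v] = "OUT"
        elif v in from_in and v in to_out:
            classes[v] = "TUBES"
        elif v in from_in or v in to_out:
            classes[v] = "TENDRILS"
        else:
            classes[v] = "DISCONNECTED"
    return classes

# Toy graph: A<->B form the LSCC, C feeds it, D drains it,
# E is a tube from C to D, G a tendril from C, F is isolated.
verts = ["A", "B", "C", "D", "E", "F", "G"]
arcs = [("A", "B"), ("B", "A"), ("C", "A"), ("B", "D"),
        ("C", "E"), ("E", "D"), ("C", "G")]
```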

Global metrics To characterize our ten graphs we resort to well-known metrics, summarized in Table 4. Most of these metrics have a straightforward definition. Let us just mention that: in directed graphs, following Newman’s original definition [30], the assortativity \(\rho\) measures the correlation between a node’s out-degree and the adjacent nodes’ respective in-degrees; in undirected graphs, \(\rho\) measures the correlation between a node’s degree and the degrees of its adjacent nodes; the global efficiency \(E_{glo}\) is the average of inverse path lengths; in directed graphs, the transitivity T measures how often vertices that are adjacent to the same vertex are connected in at least one direction; the clustering coefficient C is the undirected counterpart of transitivity, defined as the ratio of closed triplets to the total number of triplets. Many of the metrics from Table 4 are undefined for disconnected graphs, or may provide misleading results when evaluated over multiple isolated components. To compensate and allow for a fair comparison, we only consider the giant (weakly) connected component of all disconnected graphs. It is worth mentioning that the three Directed Service Graphs (DSGs), and therefore their union DSGU, consist of a single weakly connected component. On the contrary, all Undirected Service Graphs (USGs) are disconnected. DSGI is also disconnected, albeit only two hidden services – violet77pvqdmsiy.onion and typefacew3ijwkgg.onion – are isolated from the rest and connected to each other by an edge. We instead consider the graphs in their entirety for other types of analysis.

Table 4 Global metrics notations and definitions
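As an example of the definitions above, the global efficiency \(E_{glo}\) can be computed by breadth-first search, with unreachable pairs contributing an inverse distance of zero; the adjacency map below is toy data:

```python
from collections import deque

def global_efficiency(vertices, adj):
    """E_glo: average inverse shortest-path length over all ordered pairs
    of distinct vertices; unreachable pairs contribute 0."""
    n = len(vertices)
    total = 0.0
    for source in vertices:
        # Unweighted shortest distances from `source` via BFS.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(1.0 / d for node, d in dist.items() if node != source)
    return total / (n * (n - 1)) if n > 1 else 0.0

# Undirected path A - B - C: four ordered pairs at distance 1,
# two at distance 2, so E_glo = (4*1 + 2*0.5) / 6 = 5/6.
path_adj = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
```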

Correlation analysis of centrality metrics We perform a correlation analysis of several local structural properties for the purpose of sorting out the possible roles of a service in the network. We rely on Spearman’s rank correlation coefficient – rather than the widely used Pearson’s – for a number of reasons: (i) we are neither especially interested in verifying linear dependence, nor do we expect to find it; (ii) we argue that not all the considered metrics yield a clearly defined interval scale, while they apparently provide an ordinal scale; (iii) when either of the two distributions of interest has a long tail, Spearman’s is usually preferable because the rank transformation compensates for asymmetries in the data; and (iv) recent work [27] showed that Pearson’s may have pathological behaviors in large scale-free networks. The considered metrics (Footnote 3) are shown in Table 5. In words: the betweenness of v measures the ratio of shortest paths that pass through v; the closeness of v is the inverse of the average distance of v from all other vertices; the pagerank of v measures the likelihood that a random web surfer ultimately lands on v; the authscore and hubscore of v, jointly computed by the HITS algorithm [25], respectively measure how easy it is to reach v from a central vertex or to reach a central vertex from v; the efficiency of v is the average inverse distance of v from all other vertices; the transitivity of v is the ratio of pairs of neighbors of v that are themselves adjacent; the eccentricity of v is the maximum distance of any other vertex from v; the LCRatio of v is not a graph-based metric, but we define it as the ratio of the number of hyperlinks to the number of characters in the text extracted from the HS associated with v.

Table 5 Local metrics notations and definitions
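Spearman's coefficient is simply Pearson's correlation applied to rank-transformed data, which is why it captures any monotone (not only linear) dependence; a minimal stdlib sketch, with average ranks for ties:

```python
def ranks(values):
    """Average ranks (1-based), with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

A perfectly monotone but non-linear relation (e.g. exponential growth) yields rho = 1, whereas Pearson's coefficient would fall short of 1.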

Degree distribution We perform a log-normal and a power-law fit of the degree distribution of all graphs using the statistical methods developed in [14], relying on the implementation provided by the powerlaw python package [3]. A log-normal distribution may be a better fit of degree distributions in many complex networks [28], and recent work suggests that a log-normal distribution may emerge from the combination of preferential attachment and growth [35]. Nevertheless, using a power-law fit is standard practice in the study of long-tailed distributions and allows a direct comparison with previous works. It is worth specifying that powerlaw autonomously finds a lower bound \(k_{\min }\) for degrees to be fitted. In our case, even if \(k_{\min }\) is much less than the maximum degree, all values greater than \(k_{\min }\) account for just a small percentage of the whole graph. However, we believe this should not prevent us from taking these fits seriously into consideration: the tail of the distribution de facto describes the central part of the graph that actually has a meaningful structure – as opposed to the bulk of the distribution, which mostly depicts vertices with out-degree 0 (83% to 95% of the graph depending on the specific DSG considered) and/or in-degree 1 (17% to 43%). The procedure by which we calculate the reach of the most important hubs of each network is the following: taking into account just the giant component, we i) sort the hidden services by degree (out-degree in the DSGs); ii) compute the cumulative percentage of the giant component that is at distance one from one of the first i hubs, for \(i\in \{1,\ldots ,25\}\).
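The hub-reach procedure described at the end of the paragraph can be sketched as follows (toy adjacency map and hypothetical names; in the paper the computation is additionally restricted to the giant component):

```python
def hub_reach(vertices, out_adj, top=25):
    """Cumulative fraction of vertices at distance one from at least one
    of the top-i hubs, for i = 1..top (hubs ranked by out-degree)."""
    hubs = sorted(vertices, key=lambda v: len(out_adj.get(v, ())), reverse=True)
    covered, fractions = set(), []
    for hub in hubs[:top]:
        # Union of the out-neighborhoods of the first i hubs.
        covered |= set(out_adj.get(hub, ()))
        fractions.append(len(covered) / len(vertices))
    return fractions

# Toy component: H1 reaches half of it directly, H2 adds one more vertex.
nodes = ["H1", "H2", "a", "b", "c", "d"]
out_links = {"H1": ["a", "b", "c"], "H2": ["c", "d"]}
```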

Community structure To extract a community structure for our graphs we rely on the well-known Louvain algorithm [9], based on modularity maximization. As often done in the literature [19], we consider edge weights to make it harder to break an edge corresponding to a hyperlink that appears several times in the dataset. To compare the clusters that emerged across different graphs, we consider how common vertices are grouped in each graph, using the well-known Adjusted Mutual Information (AMI) to measure the similarity of two partitions. The AMI of two partitions is 1 if the two partitions are identical, it is 0 if the mutual information of the two partitions equals the expected mutual information of two random partitions, and it is negative if the mutual information of the two partitions is worse than the expected one. Since a single label from Table 2 is assigned to each service, the DUTA classification naturally induces three hard partitions, denoted “duta” (the individual classes), “duta type” (the macro categories “Normal” and “Suspicious”) and “lang” (the language) in the following. For the set of hidden services that our graphs share with the DUTA dataset, we can assess the coherence of topic-based and modularity-based clustering by computing the AMI of “duta”, “duta type” and “lang” with respect to Louvain’s clusters.
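The unadjusted mutual information at the core of AMI can be computed directly from the contingency counts of the two labelings; AMI then subtracts the expected value under random partitions and normalizes (available, for instance, as scikit-learn's adjusted_mutual_info_score). A stdlib sketch of the unadjusted quantity only:

```python
from collections import Counter
from math import log

def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two hard partitions of the
    same vertex set, computed from the joint label contingency counts."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log( p(a,b) / (p(a) * p(b)) )
        mi += (c / n) * log((c * n) / (count_a[a] * count_b[b]))
    return mi
```

Identical partitions give MI equal to the partition entropy, while independent labelings give MI = 0, matching the anchor points of the AMI scale described above.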

3.3 Topological features for content-based classification

To measure the information gain provided by topological vertex properties with respect to content-based classification, we proceed as follows:

  • For each DUTA category C, we consider the dummy variable \(X_C\) that indicates whether a randomly picked service belongs to the considered category.

  • We let each metric m induce a probability distribution \(P_m\) over the set of all services, in such a way that the probability of selecting a HS is proportional to the value of that metric for that service.

  • To measure the importance of knowing a metric m with respect to a specific category C, we compare the distribution of \(X_C\) under two different assumptions: that the HSs are drawn based on \(P_m\) and that they are drawn uniformly at random – the latter meaning that \(\Pr [X_C=1]\) is the overall prevalence of C in the graph.

  • As a measure of information gain, we use the Kullback-Leibler divergence. The KL divergence lies in \([0,+\infty ]\), and it is 0 if the two distributions coincide.
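For each (metric, category) pair, the steps above reduce to a KL divergence between two Bernoulli distributions; a sketch with toy metric values and membership flags (our own names, not the paper's code):

```python
from math import log

def info_gain(metric, in_category):
    """KL divergence (nats) between the Bernoulli distribution of X_C when
    services are drawn proportionally to a metric and when drawn uniformly.
    metric: non-negative metric values, one per service.
    in_category: parallel booleans (membership in DUTA category C)."""
    # Pr[X_C = 1] when drawing proportionally to the metric (P_m).
    p = sum(m for m, c in zip(metric, in_category) if c) / sum(metric)
    # Pr[X_C = 1] under uniform sampling: the prevalence of C.
    q = sum(in_category) / len(in_category)
    def term(a, b):
        # Convention 0 * log(0/b) = 0.
        return 0.0 if a == 0 else a * log(a / b)
    return term(p, q) + term(1 - p, 1 - q)
```

A metric spread uniformly over the services yields zero gain, while a metric concentrated entirely inside the category yields a strictly positive divergence.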

4 Results and discussion

Hereafter, we summarize and discuss our main findings; additional explanations, statistics and figures are available in the Appendices. Since we monitored Tor over a sufficient time span, our analysis is robust under fluctuations of the results obtained for different snapshots. The union and intersection graphs, in particular, capture most of the features of the snapshots, reflecting in different ways some of their specific characteristics. We will therefore often focus on such graphs to provide a clear and synthetic overview of the results.

The bow-tie decomposition of the DSGs is reported and compared with previous work in Table 6. In general agreement with [

Availability of data and material

The dataset used for the analysis is available at the following address https://www.cranic.it/data/supporting_material.tar.gz. Readers interested in additional information about the dataset are welcome to contact the authors.

Code availability

To explore Tor we used a set of open source tools, namely tor, tinyproxy and bubing. To extract metrics and analyze data we used a set of software libraries, mainly python libraries such as igraph, numpy and scipy. To build the graphs we developed custom software in C language. Readers interested in getting our tools are welcome to contact the authors.

Notes

  1. Web ARChive https://www.iso.org/obp/ui/#iso:std:iso:28500:ed-2:v1:en

  2. Edges from/to the surface web have been ignored.

  3. Beware that some of these metrics are only defined for directed graphs.

  4. Here, the meaning of “random” depends on the choice of a distribution over the set of all possible partitions [38].

  5. wikitjerrta4qgz4.onion

  6. https://metrics.torproject.org/hidserv-dir-onions-seen.html?start=2017-01-01&end=2017-05-01

  7. http://nutch.apache.org

  8. https://webarchive.jira.com/wiki/display/Heritrix

  9. https://metrics.torproject.org/hidserv-dir-onions-seen.html?start=2017-01-01&end=2017-05-01

References

  1. Al-Nabki, M.W., Fidalgo, E., Alegre, E., Fernández-Robles, L.: ToRank: Identifying the most influential suspicious domains in the Tor network. Expert Systems with Applications 123, 212–226 (2019)


  2. Al Nabki, M.W., Fidalgo, E., Alegre, E., de Paz, I.: Classifying illegal activities on tor network based on web textual contents. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 35–43 (2017)

  3. Alstott, J., Bullmore, E., Plenz, D.: Powerlaw: a Python package for analysis of heavy-tailed distributions. PloS One 9(1), e85777 (2014)


  4. Annessi, R., Schmiedecker, M.: Navigator: Finding faster paths to anonymity. In: IEEE European Symposium on Security and Privacy (Euro S&P). IEEE (2016)

  5. Bernaschi, M., Celestini, A., Guarino, S., Lombardi, F.: Exploring and analyzing the tor hidden services graph. ACM Trans. Web 11(4), 24:1-24:26 (2017). https://doi.org/10.1145/3008662


  6. Bernaschi, M., Celestini, A., Guarino, S., Lombardi, F., Mastrostefano, E.: Spiders like onions: On the network of tor hidden services. In: The World Wide Web Conference, WWW ’19, pp. 105–115. ACM, New York, NY, USA (2019). https://doi.org/10.1145/3308558.3313687

  7. Biryukov, A., Pustogarov, I., Thill, F., Weinmann, R.P.: Content and popularity analysis of tor hidden services. In: Distributed Computing Systems Workshops (ICDCSW), 2014 IEEE 34th International Conference on, pp. 188–193 (2014). https://doi.org/10.1109/ICDCSW.2014.20

  8. Biryukov, A., Pustogarov, I., Weinmann, R.P.: Trawling for tor hidden services: Detection, measurement, deanonymization. In: Proceedings of the 2013 IEEE Symposium on Security and Privacy, SP ’13, pp. 80–94. IEEE Computer Society, Washington, DC, USA (2013). https://doi.org/10.1109/SP.2013.15

  9. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10), P10008 (2008)


  10. Boldi, P., Marino, A., Santini, M., Vigna, S.: Bubing: Massive crawling for the masses. In: Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web Companion, pp. 227–228 (2014)

  11. Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., Wiener, J.: Graph structure in the web. Computer Networks 33(1–6), 309–320 (2000). https://doi.org/10.1016/S1389-1286(00)00083-9


  12. Celestini, A., Guarino, S.: Design, implementation and test of a flexible tor-oriented web mining toolkit. In: Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics, WIMS ’17, pp. 19:1–19:10. ACM, New York, NY, USA (2017). https://doi.org/10.1145/3102254.3102266

  13. Christin, N.: Traveling the silk road: A measurement analysis of a large anonymous online marketplace. In: Proceedings of the 22Nd International Conference on World Wide Web, WWW ’13, pp. 213–224. ACM, New York, NY, USA (2013). https://doi.org/10.1145/2488388.2488408

  14. Clauset, A., Shalizi, C.R., Newman, M.E.: Power-law distributions in empirical data. SIAM Review 51(4), 661–703 (2009)


  15. De Domenico, M., Arenas, A.: Modeling structure and resilience of the dark network. Phys. Rev. E 95, 022313 (2017). https://doi.org/10.1103/PhysRevE.95.022313


  16. Duxbury, S.W., Haynie, D.L.: The network structure of opioid distribution on a darknet cryptomarket. Journal of Quantitative Criminology 34(4), 921–941 (2018)


  17. Franceschet, M.: Pagerank: Standing on the shoulders of giants. Commun. ACM 54(6), 92–101 (2011). https://doi.org/10.1145/1953122.1953146


  18. Ghosh, S., Das, A., Porras, P., Yegneswaran, V., Gehani, A.: Automated categorization of onion sites for analyzing the darkweb ecosystem. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, pp. 1793–1802. ACM, New York, NY, USA (2017). https://doi.org/10.1145/3097983.3098193

  19. Girvan, M., Newman, M.E.: Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99(12), 7821–7826 (2002)


  20. Griffith, V., Xu, Y., Ratti, C.: Graph theoretic properties of the darkweb. arXiv:1704.07525 (2017)

  21. Guarino, S., Trino, N., Celestini, A., Chessa, A., Riotta, G.: Characterizing networks of propaganda on twitter: a case study. Applied Network Science 5(1) (2020). https://doi.org/10.1007/s41109-020-00286-y

  22. Jansen, R., Bauer, K., Hopper, N., Dingledine, R.: Methodically modeling the tor network. In: Proceedings of the 5th USENIX Conference on Cyber Security Experimentation and Test, CSET’12, pp. 8–8. USENIX Association, Berkeley, CA, USA (2012). http://dl.acm.org/citation.cfm?id=2372336.2372347

  23. Khare, R., Cutting, D., Sitaker, K., Rifkin, A.: Nutch: A flexible and scalable open-source web search engine. Oregon State University 1, 32–32 (2004)


  24. Kleinberg, J., Kumar, R., Raghavan, P., Rajagopalan, S., Tomkins, A.: The web as a graph: Measurements, models, and methods. In: Asano, T., Imai, H., Lee, D., Nakano, S.i., Tokuyama, T. (eds.) Computing and Combinatorics, Lecture Notes in Computer Science, vol. 1627, pp. 1–17. Springer Berlin Heidelberg (1999). https://doi.org/10.1007/3-540-48686-0_1

  25. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM) 46(5), 604–632 (1999)

  26. Lehmberg, O., Meusel, R., Bizer, C.: Graph structure in the web: Aggregated by pay-level domain. In: Proceedings of the 2014 ACM Conference on Web Science, WebSci ’14, pp. 119–128. ACM, New York, NY, USA (2014). https://doi.org/10.1145/2615569.2615674

  27. Litvak, N., Van Der Hofstad, R.: Uncovering disassortativity in large scale-free networks. Physical Review E 87(2), 022801 (2013)

  28. Mitzenmacher, M.: A brief history of generative models for power law and lognormal distributions. Internet mathematics 1(2), 226–251 (2004)

  29. Mohr, G., Stack, M., Ranitovic, I., Avery, D., Kimpton, M.: An introduction to Heritrix, an open source archival quality web crawler. In: IWAW’04, 4th International Web Archiving Workshop. Citeseer (2004)

  30. Newman, M.E.J.: Mixing patterns in networks. Phys. Rev. E 67(2), 026126 (2003). https://doi.org/10.1103/PhysRevE.67.026126

  31. Norbutas, L.: Offline constraints in online drug marketplaces: An exploratory analysis of a cryptomarket trade network. International Journal of Drug Policy 56, 92–100 (2018)

  32. Owen, G., Savage, N.: Empirical analysis of tor hidden services. IET Information Security 10(3), 113–118 (2016)

  33. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: Bringing order to the web. Tech. rep., Stanford InfoLab (1999)

  34. Sanchez-Rola, I., Balzarotti, D., Santos, I.: The onions have eyes: A comprehensive structure and privacy analysis of tor hidden services. In: Proceedings of the 26th International Conference on World Wide Web, WWW ’17, pp. 1251–1260. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland (2017). https://doi.org/10.1145/3038912.3052657

  35. Sheridan, P., Onodera, T.: A preferential attachment paradox: How preferential attachment combines with growth to produce networks with log-normal in-degree distributions. Scientific Reports 8(1), 2811 (2018)

  36. Spitters, M., Verbruggen, S., van Staalduinen, M.: Towards a comprehensive insight into the thematic organization of the tor hidden services. In: Intelligence and Security Informatics Conference (JISIC), 2014 IEEE Joint, pp. 220–223 (2014). https://doi.org/10.1109/JISIC.2014.40

  37. Takaaki, S., Atsuo, I.: Dark web content analysis and visualization. In: Proceedings of the ACM International Workshop on Security and Privacy Analytics, pp. 53–59. ACM (2019)

  38. Vinh, N.X., Epps, J., Bailey, J.: Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: Proceedings of the 26th annual international conference on machine learning, pp. 1073–1080 (2009)

  39. Zabihimayvan, M., Sadeghi, R., Doran, D., Allahyari, M.: A broad evaluation of the tor english content ecosystem. arXiv:1902.06680 (2019)

Author information

Corresponding author

Correspondence to Alessandro Celestini.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

Appendices

Appendix A: Data collection

To collect data from the Tor Web we used a customized crawler fed with a list of seeds. Specifically, we assembled a large root set by merging onion URLs advertised on well-known Tor wikis and link directories (e.g., “The Hidden Wiki”Footnote 4), or obtained from standard (e.g., Google) and Tor-specific (e.g., Ahmia) search engines. Then, in the 5-month time frame between January 2017 and May 2017, we launched our customized crawler three times and let each execution run for about six weeks. As a result, we obtained three different “snapshots” of the Tor Web, denoted SNP1, SNP2, and SNP3, respectively. Table 7 describes our datasets, whose composition is comparable to that of similar studies in the literature [36]. Yet, if we refer to the statistics provided by the Tor Project for the corresponding time windowFootnote 5, our crawls only reached 25% to 35% of the total number of daily published hidden services. It is not clear to what extent those estimates are inflated by the existence of Tor-specific messaging services in which each user is identified by a unique onion domain [
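As a minimal sketch of the link-extraction step such a crawler performs (the parser class and function names below are ours for illustration, not the actual crawler's code), the following collects the hidden-service host of every .onion hyperlink found in a fetched page, so that the Web graph can later be aggregated by HS:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class OnionLinkExtractor(HTMLParser):
    """Collects the hidden-service (host) part of every .onion hyperlink.

    Aggregating by host approximates the paper's "aggregated by HS" view,
    since all pages of a hidden service share the same onion domain.
    """

    def __init__(self):
        super().__init__()
        self.services = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        host = urlparse(href).hostname or ""
        if host.endswith(".onion"):
            self.services.add(host)


def extract_services(html):
    """Return the set of onion domains hyperlinked from an HTML page."""
    parser = OnionLinkExtractor()
    parser.feed(html)
    return parser.services
```

In a breadth-first crawl, the returned set would be merged into the frontier of hidden services still to visit, while surface-web links (as in the example usage) are discarded.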

Fig. 11

Cumulative percentage of the graph linked by the top hubs

Appendix B: Community structure

Figure 12 shows the distribution of cluster sizes for the DSGs (a) and the USGs (b). In Figure 13 we use the well-known Adjusted Mutual Information (AMI) to compare the clusters that emerged across different graphs, based on how common vertices are grouped in each graph. We recall that the AMI of two partitions is 1 if the two partitions are identical, 0 if the mutual information of the two partitions equals the expected mutual information of two random partitionsFootnote 9, and negative if the mutual information of the two partitions is worse than the expected one. All DSGs have a very similar structure in terms of number and size of the clusters, and the pairwise AMI of the obtained clusters lies around 0.5. While the USGs have a more heterogeneous structure, their communities are more similar to one another, in line with the intuition that the existence of a mutual link is a stronger indicator of the similarity between two services. The only case in which a directed graph and the corresponding undirected graph have an AMI \(>0.5\) is that of the union graphs DSGU and USGU, i.e., the graphs based on all collected data.
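The pairwise comparison described above can be reproduced with scikit-learn's `adjusted_mutual_info_score`; the dictionary-based partition format and the helper name below are illustrative assumptions, not the paper's actual code:

```python
from sklearn.metrics import adjusted_mutual_info_score


def partition_ami(clusters_a, clusters_b):
    """AMI of two clusterings, restricted to their common vertices.

    clusters_a / clusters_b map vertex -> community id (hypothetical format);
    vertices present in only one graph are ignored, as in the text, which
    compares partitions based on how *common* vertices are grouped.
    """
    common = sorted(set(clusters_a) & set(clusters_b))
    labels_a = [clusters_a[v] for v in common]
    labels_b = [clusters_b[v] for v in common]
    return adjusted_mutual_info_score(labels_a, labels_b)
```

Note that AMI is invariant under relabeling of community ids, so two partitions that group the common vertices identically score 1 even if the ids differ.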

Fig. 12

The community size distribution for our Tor Web graphs

Fig. 13

The comparison of the partitions obtained for our Tor Web graphs

To assess the coherence of topic-based and modularity-based clustering, we focused on the set of hidden services that our graphs share with the DUTA dataset and measured the AMI of the partitions induced by the “duta”, “duta type” and “lang” classes with respect to the Louvain clusters. From Figure 14 it emerges very clearly that modularity-based clusters are not thematically uniform, since the mutual information of the two partitions is always barely greater than the mutual information of two random partitions. Thus, the apparent significance of the obtained Louvain clusters cannot be explained by a thematic homogeneity of the clusters.
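The comparison pipeline can be sketched on a toy graph: run Louvain community detection, then score the resulting partition against an external ("topic") labelling with AMI. This uses networkx's `louvain_communities` (available in recent networkx versions) rather than the original implementation, and the toy graph and topic tags are invented for illustration:

```python
import networkx as nx
from sklearn.metrics import adjusted_mutual_info_score

# Toy graph: two dense groups of hidden services joined by a single hyperlink.
G = nx.Graph()
G.add_edges_from((a, b) for a in range(4) for b in range(a + 1, 4))      # clique 0-3
G.add_edges_from((a, b) for a in range(4, 8) for b in range(a + 1, 8))   # clique 4-7
G.add_edge(3, 4)

# Modularity-based clustering (list of vertex sets), made deterministic via seed.
communities = nx.community.louvain_communities(G, seed=0)
louvain_label = {v: i for i, c in enumerate(communities) for v in c}

# Hypothetical topic tags standing in for the DUTA classes.
topic_label = {v: (0 if v < 4 else 1) for v in G}

nodes = sorted(G)
ami = adjusted_mutual_info_score([topic_label[v] for v in nodes],
                                 [louvain_label[v] for v in nodes])
```

Here the topic tags align perfectly with the graph's community structure, so the AMI is high; in the paper's data the analogous score stays near 0, which is precisely the evidence that Louvain clusters are not thematically uniform.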

Fig. 14

The comparison of the topic-based partition induced by the DUTA dataset and the modularity-based partitions obtained through Louvain’s algorithm on our graphs