1 Main text

The definition of meaningful maps of the research space is a fundamental step in the study of the emergence of scientific areas and in the characterization of the drivers of knowledge production and consumption. Mapping the relations and structure of scientific knowledge is indeed one of the key elements towards the understanding of the dynamics of science and has practical applications in the information retrieval and classification of the ever-growing output of the research community. The recent abundance of large-scale bibliographic datasets has provided momentum to the study of the dynamics and structure of science [1, 2]. Studies have shown that it is possible to characterize the evolution of entire disciplinary areas [3, 4], identify general trends in science [5,6,7], characterize the effect of memory and attention [8,9,10], and measure the emergence and relevance of interdisciplinary efforts [11,12,13,14]. Considerable progress has also been made in the study of the mobility of researchers, both in space and among research topics [15,16,17,18,19,20,3, 4, 42,43,44,45,46,47].

One of the hurdles in defining large-scale knowledge maps is that the approaches proposed in the literature typically rely on well-defined scientific taxonomies. Indeed, it has been shown that it takes a non-trivial effort to analyze the evolution of trending topics in science when only keywords are used rather than well-established classification schemes [5]. Here, we propose a new methodology that uses recent developments stemming from the Natural Language Processing (NLP) machine learning literature on word embeddings [48, 49] to map research topics into a vector space in which similar topics are placed close to each other.

2 Results

The core assumption behind our approach is that research topics can be characterized simply by sets of keywords extracted from individual publications, patents, and other scientific artifacts. The relations among scientific areas are generally provided by measures of similarity among the research topics, inferred from co-occurrences of keywords in papers, citations, or other bibliographic indicators. Common approaches range from co-word similarity to citation linkages and more sophisticated vector space models [42, 55, 56].

In our case, to extract the labels that we use to identify the research topics, we consider all the articles published in American Physical Society’s (APS) journals in the period 1986–2009 and we associate to each article: (a) a set of authors; and (b) a set of research topics identified using the Physics and Astronomy Classification Scheme (PACS) codes reported in each publication. Given this data, we select the scientific output of each author by keeping track of all the research topics, identified by the PACS codes, in which she/he has published in a given time window. Then, we represent each scientist as a bag-of-topics and we use this information to train our embedding model and recover the vector embeddings of each PACS code (as shown in Fig. 1). In the machine learning literature, supervised and unsupervised vector space models have been used to perform exactly this task: embed words in a high-dimensional space in which semantically similar words are mapped into neighboring points. Here, we use the general-purpose embedding approach proposed by [54] to map research topics (i.e. PACS codes) into a research space in which scientifically similar topics are placed close to each other. The motivation behind this methodology lies in the principle of relatedness [57, 58]: it is easier to specialize and work in related research areas requiring a set of common skills/knowledge. Each individual is thus assumed to have a given set of skills/knowledge which allows her/him to successfully publish in a specific set of topics. The embedding vector space model is then trained to learn the similarity of topics by analyzing the bags-of-topics of all the authors in our dataset.

Figure 1

Embedding of research topics. We consider the articles published in American Physical Society’s (APS) journals and we associate to each manuscript: a set of authors and a set of research topics (PACS codes). Then, we represent each scientist as a bag-of-topics and we use this information to train our embedding model and recover the vector embeddings for each topic

From a technical standpoint, the model embeds each research topic into an N-dimensional space where related research topics are placed close to each other. Each PACS code is thus identified by a vector \(\mathit{vec}_{i}^{t}\), the N-dimensional embedding for topic i learned with the StarSpace model [54] by observing scientists’ publication patterns in time window t. In this model, entity embeddings are learned using discrete feature representations describing the relations between the selected entities (in our case, authors and PACS codes). In practice, the model is used in its collaborative filtering-based recommendation training mode, where collections of labels—the bag-of-topics of each author—are used to predict/suggest other PACS codes in which an author might be active. This is achieved by first defining a dictionary of \(\mathcal{D}\) features as a \(\mathcal{D} \times N\) matrix, where the ith row represents the N-dimensional PACS code/research topic embedding. In our case, \(\mathcal{D}\) is set equal to 854, the number of PACS codes considered in our analysis, while N is set equal to 200 (this choice is discussed in the Methods section). The embeddings are learned by minimizing a loss function that depends on the pairwise cosine similarities between the different topics. Further details are provided in the Methods section, but the basic intuition behind this approach is that research topic co-occurrences (at the author level) are exploited to tune the embeddings so that frequently co-occurring pairs of topics are also close in the N-dimensional embedding space.
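
To make the training pipeline concrete, the Python sketch below assembles a per-author bag-of-topics file and launches StarSpace in the collaborative-filtering training mode described above. This is a minimal illustration and not our exact pipeline: the author–topic pairs are hypothetical, and the command-line flags follow the public StarSpace documentation and should be verified against the installed version.

```python
import subprocess
from collections import defaultdict

# Hypothetical input: (author, PACS code) pairs observed in one time window.
author_topic_pairs = [
    ("A. Rossi", "03.67.-a"), ("A. Rossi", "05.30.-d"),
    ("J. Smith", "74.25.-q"), ("J. Smith", "74.72.-h"),
]

# Build the bag-of-topics for each author.
bags = defaultdict(set)
for author, pacs in author_topic_pairs:
    bags[author].add(pacs)

# One line per author; each PACS code is written as a StarSpace label.
with open("bag_of_topics.txt", "w") as fh:
    for author, topics in bags.items():
        fh.write(" ".join(f"__label__{t}" for t in sorted(topics)) + "\n")

# Train the embeddings (trainMode 1: predict one label of a bag from the others).
# Flag names are taken from the StarSpace README; check them for your build.
subprocess.run([
    "starspace", "train",
    "-trainFile", "bag_of_topics.txt",
    "-model", "pacs_embeddings",
    "-trainMode", "1",
    "-dim", "200",            # N = 200 as in the main text
    "-similarity", "cosine",
    "-negSearchLimit", "50",  # kappa = 50 negative samples per update
], check=True)
```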

2.1 From the embedding space to the research space network

The embedding of topics into a high-dimensional research space allows us to use their spatial positions to infer their pairwise similarities by measuring the topics’ relative closeness. In particular, the topic vector space can be used to compute the similarity between two research topics as the cosine similarity between their vectors:

$$ \phi _{i,j}^{t} = \frac{\mathit{vec}_{i}^{t} \cdot \mathit{vec}_{j} ^{t}}{ \Vert \mathit{vec}_{i}^{t} \Vert \, \Vert \mathit{vec}_{j}^{t} \Vert } , $$
(1)

where \(\mathit{vec}_{i}^{t}\) and \(\mathit{vec}_{j}^{t}\) are the 200-dimensional embeddings for topics i and j, respectively.
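
As a minimal illustration, Eq. (1) can be computed for all topic pairs at once; the sketch below assumes the learned embeddings are stored as a NumPy array with one 200-dimensional row per PACS code (the variable names are ours, and the random embeddings are stand-ins).

```python
import numpy as np

def pairwise_topic_similarity(vecs: np.ndarray) -> np.ndarray:
    """Cosine similarity phi_{i,j} between all pairs of topic embeddings.

    vecs: array of shape (n_topics, n_dims), one row per PACS code.
    """
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    unit = vecs / norms
    return unit @ unit.T  # phi[i, j] = cos(vec_i, vec_j)

# Example with random stand-in embeddings (854 topics, 200 dimensions).
rng = np.random.default_rng(0)
phi = pairwise_topic_similarity(rng.normal(size=(854, 200)))
```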

The similarity measure can be used to generate the research space network (RSN), which uses the similarity as the weight of the connections and preserves only the most important links by removing the ones associated with negative or small values of the cosine similarity. The resulting RSN is visualized in Fig. 2, where we show that our methodology successfully groups together research topics belonging to the same section of the PACS 2010 Regular Edition of the Physics and Astronomy Classification Scheme [59]. Although our methodology is completely general and does not make use of the PACS hierarchical classification, we can use the latter as an external validation of the quality of the obtained embedding and classification. Indeed, our approach treats each 6-digit PACS code as a mere keyword, and the information regarding the hierarchical structure of the classification scheme is not used to train the vector embeddings. In other words, our algorithm is unaware of the existence of the ten Sections of the PACS classification. However, when we look at the resulting research space by coloring the nodes according to their PACS Section, we notice that the General section is correctly placed at the center of the research space network, along with the Interdisciplinary Physics section, as one would expect. On the other hand, we note that Physics of Gases, Plasmas, and Electric Discharges, Condensed Matter, and Nuclear Physics populate three different boundary areas of the research space (as also observed in previous studies [3]). Overall, the position of the topics is consistent with the information codified in the PACS codes but, in addition, our approach also allows us to understand the relative position of each topic and PACS section with respect to each other, therefore enabling us to quantitatively measure their degree of relatedness.
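
A sketch of the RSN construction is given below, assuming a pairwise similarity matrix `phi` as computed above; the 0.3 threshold is purely illustrative, since the text only specifies that negative and small similarities are discarded.

```python
import networkx as nx
import numpy as np

def research_space_network(phi: np.ndarray, pacs_codes: list, threshold: float = 0.3):
    """Build the RSN, keeping only edges with cosine similarity above `threshold`."""
    G = nx.Graph()
    G.add_nodes_from(pacs_codes)
    n = len(pacs_codes)
    for i in range(n):
        for j in range(i + 1, n):
            if phi[i, j] > threshold:  # drop negative and weak similarities
                G.add_edge(pacs_codes[i], pacs_codes[j], weight=float(phi[i, j]))
    return G
```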

Figure 2

Research space network. Each node represents a research topic in Physics, identified by a PACS code, and each edge is weighted by the level of similarity between research topics. Nodes are colored according to their membership to the ten macro-sections of the PACS 2010 classification scheme. Only the most relevant connections are shown

2.2 Fingerprinting scientific expertise

Research activities in the context of the research space can be analyzed at different geographical scales. More precisely, we can fingerprint scientific production at the level of individual authors, institutions, cities, or countries by geolocating scientific publications (Fig. 3). This can be achieved by considering all the articles published in American Physical Society’s (APS) journals in the period 1986–2009 and by associating to each publication: (a) the information contained in the authors’ affiliations; and (b) the set of research topics (i.e. the PACS codes) used in the paper. In the following, we focus on geographical units constructed by first parsing the city names from the affiliation strings of each article, and then clustering together neighboring cities to obtain distinct urban areas. More specifically, we follow the same procedure used in [60]: first, we infer the country in which each affiliation-city pair is located; second, for each country, we compute a geographic distance matrix (using Vincenty’s formula) connecting each pair of cities; and lastly, we use hierarchical clustering to define the different urban areas, with the additional constraint that the maximum distance within each cluster has to be less than 50 km. Once the geographical units are defined, we count how many publications have been produced in each PACS code by each distinct urban area.
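
The geographic aggregation step can be sketched as follows. The city coordinates are hypothetical, geopy’s geodesic distance is used here as a stand-in for Vincenty’s formula, and complete linkage is our assumption for enforcing the 50 km maximum within-cluster distance.

```python
import numpy as np
from geopy.distance import geodesic
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical (lat, lon) coordinates of affiliation cities within one country.
cities = {
    "Cambridge": (42.3736, -71.1097),
    "Boston": (42.3601, -71.0589),
    "Pittsburgh": (40.4406, -79.9959),
}
names = list(cities)

# Pairwise geodesic distance matrix in kilometres.
n = len(names)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = geodesic(cities[names[i]], cities[names[j]]).km
        dist[i, j] = dist[j, i] = d

# Complete-linkage clustering keeps the maximum within-cluster distance below 50 km.
Z = linkage(squareform(dist), method="complete")
labels = fcluster(Z, t=50.0, criterion="distance")
urban_area = dict(zip(names, labels))  # Cambridge and Boston merge; Pittsburgh stays apart
```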

Figure 3

Data representation. Papers are geo-localized using authors’ affiliation information, while research topics correspond to the PACS codes reported in each article. For each geographical unit and research topic pair, the revealed comparative advantage (RCA) is computed to generate location-specific specialization profiles that allow us to fingerprint the structure of the scientific production system of each urban area

In order to provide a specific fingerprint of the degree of specialization of each geographical unit, we extend the concept of Revealed Comparative Advantage (RCA) [61] to scientific production, following [47]. RCA has a long history in the economics literature, where it has been used to study the level of specialization of nations and regions in terms of industrial production, technological production, and trade exports (see for example [62,63,64,65,66,67,68,69,70,71,72,73,74,75]). The RCA [61] is defined as:

$$ \mathit{RCA}_{c,k}^{t} = \frac{X_{c,k}^{t}/ \sum_{k} X_{c,k}^{t}}{\sum_{c} X _{c,k}^{t} / \sum_{c,k} X_{c,k}^{t}}, $$
(2)

where \(X_{c,k}^{t}\) denotes the number of publications produced in urban area c in PACS code k in the time window t. In practice, the numerator is the share of the publications of location c that fall in PACS code k, while the denominator is the share of the world’s publications that fall in PACS code k. By comparing these two shares, we can assess whether a given geographical unit is relatively more specialized in a certain research topic than the world as a whole.

By using the above definition, we consider a geographical unit c to be a specialized scientific producer in PACS code k at time t if \(\mathit{RCA}_{c,k}^{t}>1\). Using the RCA, we can generate location-specific specialization profiles that allow us to fingerprint the structure of the scientific production system of each geographical area. In particular, we can create a (time varying) fingerprint matrix \(F_{ck}\), where c is a geographical unit and k is a topic, and assign non-zero entries only if \(\mathit{RCA}_{c,k}^{t}>1\). We can visualize the matrix \(F_{ck}\) to get a general understanding of the different specialization patterns, a sort of research DNAs, that characterize the knowledge production of geographical units, as shown in Fig. 4. As an example, we also show the scientific fingerprints of three different urban areas: Darmstadt (Germany), Cambridge (MA, USA), and Pittsburgh (PA, USA). This lets us appreciate how different locations might specialize in different parts of the research space. For instance, Darmstadt has a relative comparative advantage in \({\sim}70\%\) of all the PACS codes in Nuclear Physics (PACS section 20). On the other hand, Pittsburgh is specialized in Physics of Elementary Particles and Fields (PACS section 10), while its specialization in Nuclear Physics is particularly low. Lastly, we observe that Cambridge is the only city among the three with a more homogeneous pattern of specialization (with the exception of Nuclear Physics). Overall, by looking at Fig. 4, we can start to appreciate how different cities might cluster their scientific expertise around distinct areas of the research space, even though exceptions—as in the case of Cambridge—do exist.
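
A minimal sketch of Eq. (2) and of the fingerprint matrix \(F_{ck}\), assuming a matrix of publication counts with one row per urban area and one column per PACS code (the toy counts below are purely illustrative):

```python
import numpy as np

def revealed_comparative_advantage(X: np.ndarray) -> np.ndarray:
    """RCA_{c,k} of Eq. (2): rows of X are locations c, columns are PACS codes k."""
    location_share = X / X.sum(axis=1, keepdims=True)  # X_{c,k} / sum_k X_{c,k}
    world_share = X.sum(axis=0) / X.sum()               # sum_c X_{c,k} / sum_{c,k} X_{c,k}
    return location_share / world_share

# Toy counts for two urban areas and three PACS codes.
X = np.array([[30.0, 5.0, 0.0],
              [10.0, 20.0, 15.0]])
rca = revealed_comparative_advantage(X)
F = (rca > 1).astype(int)  # fingerprint matrix F_{ck}: 1 where the location is specialized
```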

Figure 4

Fingerprinting knowledge production profiles. Each row represents an urban area, each column a PACS code, and each colored dot indicates that a given location has a revealed comparative advantage in the production of papers in a given PACS code in the time window 2007–2009. Colors identify the ten different PACS Sections. Separate bar charts are reported for three cities, showing the fraction of PACS codes in which the city has a revealed comparative advantage over the total number of PACS codes in that section

2.3 Knowledge density and the prediction of scientific specialization

The RCA in the context of the research space has been introduced by Guevara et al. [47] to explore the principle of relatedness [57, 58, 76] in the process of scientific production: i.e. it is easier to specialize and work in related research areas requiring a set of common skills/knowledge. Indeed, relatedness has been found to play an important role in explaining the patterns of future development of industries and research production at the level of cities, regions, and nations [47, 57, 77,78,79,80,81,82,83]. This is due to the fact that different sets of capabilities and skills might be needed to grasp the depth and complexity of different research topics, therefore affecting the ability of researchers to move and develop a competitive edge across different disciplines. This observation builds upon the idea that cognitive proximity [84, 85] is required to successfully absorb and use new knowledge [86].

Our analysis provides support for the principle of relatedness, and the fingerprint matrix shows patterns of specialization that are indeed not random. Some of these patterns can be appreciated in Fig. 5, where we plot each PACS code in which a city has a comparative advantage using the spatial coordinates identified by the research space mapping. Also in this case, we can appreciate how spatially and ontologically consistent clusters of competences emerge. In other words, urban areas appear to develop around their current domain of expertise, implying that scientific relatedness does play a role in explaining the structure of the knowledge production of a city. In order to quantify the relatedness of a specific PACS code to the overall domain of expertise of a given geographical unit, we use the knowledge density (as proposed in [57]). The knowledge density \(\omega _{i,c}^{t}\) around PACS code i in urban area c at time t is defined as:

$$ \omega _{i,c}^{t} = \frac{ \sum_{\{ k \text{ s.t. }\mathit{RCA}_{c,k}^{t}>1 \text{ and } \phi _{i,k}^{t}>0 \}} \phi _{i,k}^{t}}{\sum_{\{j \text{ s.t. } \phi _{i,j}^{t}>0 \}} \phi _{i,j}^{t}}, $$
(3)

where \(\phi _{i,j}^{t}\) is the level of knowledge similarity between PACS codes i and j, which in our case is the cosine similarity between the two research topics. Given this definition, for a location c and time window t, the closer topic i is to other topics in which c has a relative comparative advantage, the higher its knowledge density. To understand how this metric works, let us consider what happens when the index—which varies between zero and one—takes its extreme values. For a given PACS code i and urban area c, the value of \(\omega _{i,c}^{t}\) is equal to zero if c has no comparative advantage in any of the topics related to i, while it is equal to one if c has an advantage in all the topics related to i. In other words, the closer i is to the current domain of expertise of c, the denser the knowledge space will be around PACS code i.
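
Equation (3) can be sketched as follows for a single location, assuming the similarity matrix `phi` and the location’s RCA vector are available; excluding the diagonal \(\phi_{i,i}\) is our assumption, as the text does not specify it.

```python
import numpy as np

def knowledge_density(phi: np.ndarray, rca_row: np.ndarray) -> np.ndarray:
    """Knowledge density omega_{i,c} for every topic i in one location c (Eq. (3)).

    phi:     (n_topics, n_topics) cosine similarities between PACS codes.
    rca_row: (n_topics,) RCA values of location c for every PACS code.
    """
    pos = np.where(phi > 0, phi, 0.0)        # keep only positive similarities
    np.fill_diagonal(pos, 0.0)               # assumption: a topic is not related to itself
    specialized = (rca_row > 1).astype(float)  # topics where c has a comparative advantage
    numerator = pos @ specialized            # sum of phi_{i,k} over specialized, related k
    denominator = pos.sum(axis=1)            # sum of phi_{i,j} over all related j
    return np.divide(numerator, denominator,
                     out=np.zeros_like(numerator), where=denominator > 0)
```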

Figure 5

Fingerprints and research space. Specialization in the Physics research space network of four different cities, showing only the PACS codes in which the urban areas exhibit a revealed comparative advantage in the period 2007–2009. Colors identify the ten different PACS sections as in Fig. 4

For each geographical unit, we can associate the knowledge density \(\omega _{i,c}^{t}\) with four different types of transitions that characterize the time evolution of the comparative advantage of a PACS code. We look at the distributions of \(\omega _{i,c}^{t}\) when: (1) a PACS code that is inactive (i.e. \(\mathit{RCA}=0\)) at time \(t-1\) remains inactive at time t; (2) a PACS code that is inactive at time \(t-1\) becomes active in the city at time t, but without a comparative advantage (i.e. \(0<\mathit{RCA}\leq 1\)); (3) a PACS code is active in the city, but without a comparative advantage, at both \(t-1\) and t (i.e. it remains not specialized); and (4) a PACS code is active in the city without a comparative advantage at time \(t-1\), and a comparative advantage (i.e. \(\mathit{RCA}>1\)) emerges at time t (i.e. the city goes from not specialized to specialized). Looking at these distributions, we observe that the PACS codes that remain inactive are typically the ones in which the knowledge density at the previous time step was lowest, while the opposite holds for the codes in which urban areas become specialized (a visualization of the results is reported in the swarm plot in Fig. 6). In other words, it is easier to develop a stronger comparative advantage in research topics that are related—in the research space—to the ones in which a location is already specialized.
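
The four transition types can be encoded directly from the RCA values of a (city, PACS code) pair at consecutive time windows, as in the following sketch (the labels are ours).

```python
def transition_type(rca_prev: float, rca_curr: float) -> str:
    """Classify the evolution of a (city, PACS code) pair between t-1 and t."""
    if rca_prev == 0 and rca_curr == 0:
        return "inactive -> inactive"
    if rca_prev == 0 and 0 < rca_curr <= 1:
        return "inactive -> active, not specialized"
    if 0 < rca_prev <= 1 and 0 < rca_curr <= 1:
        return "remained not specialized"
    if 0 < rca_prev <= 1 and rca_curr > 1:
        return "not specialized -> specialized"
    return "other"  # transitions not analyzed in the text (e.g. already specialized)
```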

Figure 6

Predicting specialization. Top panel: in the swarm plot, every dot represents a PACS code-city pair, where the time period is reported on the x-axis, the observed level of knowledge density on the y-axis, and the transition type is identified by the color of the dot. From this representation we observe that urban areas tend to develop more quickly in activities around which their knowledge density was higher. Bottom panel: the boxplot represents the distribution of the values of the area under the ROC curve when knowledge density is used to predict urban scientific specialization. Values of accuracy greater than 50% imply a better predictive power than a model assuming random specialization patterns

It is interesting to explore the possibility of using the knowledge density as a predictor of the future emergence of a comparative advantage of a city in a specific PACS code. Operationally, we follow the same methodology proposed in [47] and postulate that the order in which each urban area becomes specialized should closely follow the ranking of PACS codes by their associated knowledge density. We can test this hypothesis against a null one assuming that, instead, specialization occurs independently of the current level of knowledge density. In other words, the null hypothesis would suggest that an urban area develops a comparative advantage at random, regardless of its previous level of expertise and specialization. The predictive performance of the knowledge density \(\omega _{i,c}^{t}\) can be evaluated using a statistic commonly used in the machine-learning community to measure the accuracy of a model: the area under the Receiver Operating Characteristic (ROC) curve. The ROC curve plots the true positive rate of a model (for example, a classifier) against its false positive rate, that is, the share of correctly classified positive cases against the share of negative cases incorrectly classified as positive. If the area under the ROC curve is greater than 50%, then the prediction based on the knowledge density is more accurate than a random prediction in which PACS codes are not ranked by their knowledge density. In our case, we actually have a distribution of such values since, for a given time period, we can compute the accuracy of our model for each geographical unit. In other words, we test our ability to predict the research trajectories of each city in each distinct time window. The results, reported in Fig. 6, show that the accuracy is higher than 50%, confirming that the structure of the research space can be used to predict how research trajectories evolve over time. While beyond the scope of the present work, it is possible to envision the use of current estimates of the knowledge density to forecast the physics research areas in which specific urban areas will be able to specialize in future years.
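
A sketch of the per-city evaluation: PACS codes in which the city is not yet specialized at time t are ranked by knowledge density and scored against the specializations observed in the next window, using scikit-learn’s ROC AUC. The function and variable names are ours, and the choice of candidate set is our reading of the procedure.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def city_prediction_auc(omega_t: np.ndarray, rca_t: np.ndarray,
                        rca_next: np.ndarray) -> float:
    """AUC of knowledge density as a predictor of future specialization for one city.

    omega_t:  knowledge density of every PACS code at time t.
    rca_t:    RCA of every PACS code at time t.
    rca_next: RCA of every PACS code at time t+1.
    """
    candidates = rca_t <= 1                          # codes not yet specialized at time t
    y_true = (rca_next[candidates] > 1).astype(int)  # did the city specialize by t+1?
    y_score = omega_t[candidates]                    # ranking by knowledge density
    if y_true.min() == y_true.max():                 # AUC undefined without both classes
        return float("nan")
    return roc_auc_score(y_true, y_score)
```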

3 Discussion

The construction of the physics research space by embedding topics in a high-dimensional space allows the fingerprinting of the patterns of specialization of urban areas, and the prediction of the evolution of cities’ patterns of specialization across different research topics, providing additional support for the principle of relatedness [57, 58, 76]. However, the observed level of scientific capacity, as characterized by the value of the knowledge density, varies considerably in relation to the socio-economic status of each specific geographical area. To highlight this aspect, we zoom out and repeat at the level of countries the same exercise we performed for urban areas, computing the average knowledge density of each nation both across all PACS sections and for each PACS section separately. Then, we use the measured average knowledge density to study the association of this measure of national scientific competence with several World Development Indicators (WDI) [87] that quantify the socio-economic status of the countries under analysis. In Fig. 7 we report an example of the associations found for 67 countries. This set of countries represents approximately 99% of the total publications in our dataset. In Table 1 we show a summary of the results for all WDI considered, and in Fig. 8 we report the average correlation for each indicator category broken down by PACS section. This correlation analysis suggests that the most advanced countries—in terms of scientific expertise in Physics—are also the ones with higher shares of production and export of high-tech goods, higher levels of investment in R&D, higher levels of measurable innovation outcomes (e.g. patent, industrial design, and trademark applications), higher levels of educational attainment and—at the same time—lower levels of unemployment of skilled labor. Overall, this picture shows that economic development goes hand in hand with a high value of (average) knowledge density, thus supporting the key role of scientific production in the economic growth of nations.
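
Operationally, this country-level exercise reduces to averaging knowledge densities and correlating them with the indicators. A hedged pandas sketch is shown below; the indicator names and all values are placeholders, not the WDI data used in the analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
countries = ["US", "DE", "IT"]

# Hypothetical per-country knowledge densities (columns: PACS codes) and WDI values.
density = pd.DataFrame(rng.uniform(size=(3, 854)), index=countries)
wdi = pd.DataFrame({"rd_expenditure_pct_gdp": [2.8, 3.1, 1.4],
                    "hightech_exports_pct": [19.0, 16.0, 7.5]}, index=countries)

# Average knowledge density per country and its correlation with each indicator.
avg_density = density.mean(axis=1).rename("avg_knowledge_density")
correlations = wdi.corrwith(avg_density)  # Pearson correlation per indicator
```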

Figure 7

Knowledge density, R&D, and exports. Relation of the average knowledge density of a country in the last time window of our sample (2007–2009) with its level of expenditure in research and development, reported as a fraction of the gross domestic product (GDP), and with its level of trade in medium and high-tech products, reported as a fraction of its manufactured exports

Figure 8

Knowledge density and development indicators. Correlation between the average knowledge density of a country in the period 2007–2009 computed for each one of the ten PACS sections and each development indicator category as introduced in Table 1

Table 1 Knowledge density and world development indicators

It is worth remarking that the study presented here considers only the Physics literature published in APS journals, thus missing out on more complex dynamics that could explain the (co)evolution of scientific expertise in different scientific domains in both time and space. Furthermore, we relied on the Physics and Astronomy Classification Scheme to assign topics to articles, thus constraining the research space to a pre-defined taxonomy. The PACS scheme was, however, used for the sake of comparability with previous results in the literature, and the proposed approach is not limited to research in Physics: it can be extended to other disciplines. In order to overcome the above limitations, the embeddings can be produced by simply analyzing the text of paper titles and abstracts, without any a-priori knowledge of a scientific topic classification, and by extending the analysis to databases covering a wider range of scientific disciplines, ranging from Physics and Engineering to Economics and Philosophy. The proposed approach might also help address the problem of dealing with the bursty behavior [5] of author-defined keywords. Indeed, even short-lived labels can be put in relation to more stable scientific topics, since both sets of keywords will live in the same N-dimensional embedding space.

Another potential application of the framework presented in this paper concerns the study of how scientific concepts change and move over time across the embedding space. This could provide us with a methodology to study “where science is going”, i.e. to understand how scientists or research topics move over time. Indeed, in the NLP literature, several approaches have been proposed to study how word analogies and semantic meaning change over time.

4 Methods

4.1 Data

We consider all the articles published in American Physical Society’s (APS) journals in the period 1986–2009. Urban areas are defined following the clustering procedure described in [60], authors are disambiguated following the procedure detailed in [94], and research topics are assigned considering the first 6 digits of the PACS classification scheme [59]. Overall, our dataset includes 2307 urban areas and 5800+ PACS codes. However, in our analysis, we restrict our attention to cities that have at least 6 publications in each time window. This restricts our original sample to 402 urban areas and 854 PACS codes.

4.2 Embedding model

In order to produce the PACS code embeddings, we employ the StarSpace model proposed by [54]. StarSpace is a general-purpose embedding model that creates embeddings for a variety of entity types (e.g. words, sentences, documents, images, etc.) by associating to each entity an N-dimensional vector. In our case, the vector size is set to \(N=200\) and the vectors are obtained by minimizing a loss function that simultaneously maximizes the (cosine) similarity between embeddings of PACS codes used by the same author and minimizes the (cosine) similarity between embeddings of PACS codes that do not appear together in the careers of scientists. In other words, once PACS codes are mapped into this new 200-dimensional space, PACS codes that frequently appear together in the list of publications of a scientist will tend to be close, while PACS codes that rarely appear together will belong to different areas of the embedding space. More specifically, the model minimizes the following loss function:

$$\begin{aligned} \sum_{\substack{(a,b)\in E^{+} \\ b^{-} \in E^{-}}} L^{\mathrm{batch}} \bigl( \operatorname{sim}(a,b),\operatorname{sim}\bigl(a,b^{-}_{1} \bigr), \ldots ,\operatorname{sim}\bigl(a,b^{-}_{k}\bigr) \bigr), \end{aligned}$$
(4)

where \(E^{+}\) denotes the set of positive entity pairs (i.e. PACS codes that often appear together), \(E^{-}\) denotes the set of negative entity pairs (i.e. PACS codes that rarely appear together), \(\kappa =50\) is the number of negative pairs used for each batch update (i.e., the model uses a K-negative sampling strategy as in [50]), \(\operatorname{sim}(\cdot)\) denotes the cosine similarity between two embeddings, and \(L^{\mathrm{batch}}\) denotes the batch-specific loss function that compares the positive pair \((a,b)\) with the negative pairs \((a,b^{-}_{i})\) using a margin ranking loss of the form \(\max (0,\mu -\operatorname{sim}(a,b)+\sum_{i\in [1,\kappa ]}\operatorname{sim}(a,b^{-}_{i}))\). The loss function is minimized using stochastic gradient descent [95]. The value of N has been chosen after examining the prediction performance of our model when trying to reconstruct the bag-of-topics of the authors. In particular, we computed the percentage of correctly predicted PACS codes among the top k predictions made by the algorithm. This metric is commonly denoted as hits@k [96] and is the same performance metric used in [54]. In Fig. 9 we show how its value varies with the size of the embedding dimension N for \(k=50\). In light of this analysis, we set N equal to 200, since it provides a good compromise between the training time required to fit the model and its overall prediction quality.
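
For reference, the batch loss and the hits@k metric described above can be sketched as follows; the margin value is illustrative, and the scoring of candidate PACS codes (e.g. by similarity to the rest of an author’s bag-of-topics) is left abstract.

```python
import numpy as np

def margin_ranking_loss(sim_pos: float, sim_negs: np.ndarray, mu: float = 0.05) -> float:
    """Loss max(0, mu - sim(a, b) + sum_i sim(a, b_i^-)) for one positive pair."""
    return max(0.0, mu - sim_pos + float(np.sum(sim_negs)))

def hits_at_k(scores: np.ndarray, true_index: int, k: int = 50) -> bool:
    """hits@k: is the held-out PACS code among the k highest-scoring candidates?"""
    top_k = np.argsort(scores)[::-1][:k]
    return bool(true_index in top_k)

# The reported metric is the fraction of held-out PACS codes recovered in the top k.
```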

Figure 9

Embedding model performance. Values of the performance metric hits@k [96] for \(k=50\). This shows how the performance of the embedding model varies with the size of the embedding dimension N