1 Introduction

The social structures behind scientific production may profoundly affect the growth and dissemination of knowledge, the well-being of our societies, and the evolution of academic research [1]. Many studies have shown how socially influenced behaviours impact different aspects of the scientific enterprise. Examples include the selection of co-authors, citation rates, and peer review processes, all of which show biases linked to authors’ attributes such as prestige [2], gender [3], and country of affiliation [4, 5].

Co-authorship networks, where nodes represent researchers and links represent co-authorship relations between them, have been shown to be key to the understanding and mapping of scientific production [6–22]. We obtained data from the Semantic Scholar Open Research Corpus [4, and (ii) researchers grouped by the core position of their communities for two categories: non-segregated and highly segregated in Fig. 5. We did not analyse our results by different ranges of internal papers due to the low correlation with the citation variables.

Figure 4

Citation metrics for all researchers in communities of different segregation categories. Each panel shows the cumulative distribution function (CDF) of one metric: the total citations (TC), the citations per paper (CP), the proportion of citations from the same community (CC), and the proportion of citations from the same year’s co-authors (CN). The colour code is: dark red for researchers in completely segregated (CS), grey for moderately segregated (M), light red for highly segregated (S), and blue for non-segregated (NS) communities. The letters KS or MW appear when the Kolmogorov-Smirnov test (different distribution shapes) or the Mann-Whitney test (different distribution medians) gives a significant p-value for the comparison between the CDFs of non-segregated and highly segregated communities. Significance levels are denoted as follows: * p < 0.1, ** p < 0.05, and *** p < 0.01.

Figure 5

Citation metrics for researchers in communities of different segregation categories and core positions. Each row shows the cumulative distribution function (CDF) of one metric: the total citations (TC), the citations per paper (CP), the proportion of citations from the same community (CC), and the proportion of citations from the same year’s co-authors (CN). The colour code is: light red for highly segregated (S) and blue for non-segregated (NS) communities. The letters KS or MW appear when the Kolmogorov-Smirnov test (different distribution shapes) or the Mann-Whitney test (different distribution medians) gives a significant p-value for the comparison between the CDFs of non-segregated and highly segregated communities. Significance levels are denoted as follows: * p < 0.1, ** p < 0.05, and *** p < 0.01. Here, we show 7 out of 11 cores to guide the reader; Figure S15 shows the results for all 11 cores of 2010.

We use two statistical tests to compare the CDFs of non-segregated and highly segregated communities: Kolmogorov-Smirnov (KS) and Mann-Whitney (MW). The first test compares the shapes of the two distributions, and the second compares their medians.
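As an illustration of the two tests, the following is a minimal standard-library sketch, not the implementation used in this study. It assumes asymptotic p-values (a series approximation for KS and a normal approximation with tie and continuity corrections for MW); the function names `ks_two_sample`, `mann_whitney_u`, and `stars` are hypothetical. In practice, a library routine such as `scipy.stats.ks_2samp` or `scipy.stats.mannwhitneyu` would likely be preferred.

```python
import math
from bisect import bisect_right


def ks_two_sample(x, y):
    """Two-sample KS statistic D and an asymptotic (approximate) p-value."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    # D is the largest vertical gap between the two empirical CDFs.
    d = max(abs(bisect_right(xs, v) / n - bisect_right(ys, v) / m)
            for v in xs + ys)
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:  # the series below is numerically 1 in this range
        return d, 1.0
    # Alternating Kolmogorov series for the tail probability.
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


def mann_whitney_u(x, y):
    """U statistic of x and a two-sided normal-approximation p-value."""
    n, m = len(x), len(y)
    total = n + m
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < total:
        j = i
        while j < total and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of ranks i+1..j for tied values
        ties = j - i
        tie_term += ties ** 3 - ties
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n * (n + 1) / 2
    var = n * m / 12 * ((total + 1) - tie_term / (total * (total - 1)))
    if var == 0:
        return u, 1.0
    # Continuity-corrected z-score, floored at zero.
    z = max((abs(u - n * m / 2) - 0.5) / math.sqrt(var), 0.0)
    return u, math.erfc(z / math.sqrt(2))


def stars(p):
    """Map a p-value to the significance markers used in the figure captions."""
    return "***" if p < 0.01 else "**" if p < 0.05 else "*" if p < 0.1 else ""
```

For example, `ks_two_sample(tc_segregated, tc_non_segregated)` and `mann_whitney_u(tc_segregated, tc_non_segregated)` could be applied to two hypothetical lists of total citations, with `stars` converting each p-value to the caption notation.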

We first analyse the CDFs for (i) the total citations (TC) and (ii) the citations per paper (CP). At the aggregate level (Fig. 4, top row), our results indicate no differences between highly segregated and non-segregated researchers in terms of TC or CP. Completely segregated researchers (darker red in the plot) have smaller values than other researchers, although the differences are not significant. However, these aggregate results hide some information because they average over all network cores. In Fig. 5, we therefore group the researchers by the core position of their communities and split the results into the nucleus, middle, and periphery. In middle and periphery cores, highly segregated researchers have more TC than non-segregated ones, with the opposite result in the nucleus (top row). For the CP (second row), there are no differences in the middle or periphery cores, but non-segregated researchers have more CP in the nucleus.

Then, we analyse the CDFs for (iii) the proportion of citations from the same community (CC) and (iv) the proportion of citations from the same year’s co-authors (CN). To compute these proportions, we count the citing publications in which at least one author is in the same community as the cited researcher (for CC) or is one of the researcher’s co-authors in that year (for CN, regardless of the community). We then divide these counts by the researcher’s total number of citations.
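The counting rule above can be sketched as follows. This is an illustrative reading of the description, not the code used in the study: the data layout (one set of author identifiers per citing publication, a dictionary `community_of`, and a set `coauthors`) and the function name are assumptions, and self-citations are not treated specially.

```python
def citation_proportions(researcher, citing_papers, community_of, coauthors):
    """Return (CC, CN) for one researcher.

    citing_papers: list of sets, each holding the author ids of one citing paper
    community_of:  dict mapping author id -> community id
    coauthors:     set of the researcher's co-authors in the year analysed
    """
    total = len(citing_papers)
    if total == 0:
        return 0.0, 0.0
    own = community_of[researcher]
    # CC: citing papers with at least one author in the researcher's community.
    same_community = sum(
        any(a in community_of and community_of[a] == own for a in authors)
        for authors in citing_papers
    )
    # CN: citing papers with at least one co-author, whatever their community.
    from_coauthors = sum(
        any(a in coauthors for a in authors) for authors in citing_papers
    )
    return same_community / total, from_coauthors / total
```

A citing publication with several qualifying authors is counted once, which follows the "at least one author" rule in the text.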

At the aggregate level (Fig. 4, second row), our results show that highly segregated researchers have more CC than non-segregated ones, while there is no difference for CN. In addition, completely segregated researchers (darker red) receive lower CC and CN than others. When we group by core position (Fig. 5, third and fourth rows), there are no differences in the periphery. However, in middle cores, highly segregated researchers have more CC and CN, whereas in the nucleus non-segregated researchers have larger values.

We compare the results of 2010 with those in 2006 and 2014 in Section S8. For TC, highly segregated researchers outperform non-segregated in the periphery and middle cores, but there are no significant differences for CP. In the nucleus, non-segregated researchers do better for both TC and CP. There are no differences in CC and CN for non-segregated and highly segregated researchers, but for 2014 the trends are similar to those in 2010.

In summary, highly segregated researchers tend to have more total citations when they are located in peripheral cores and more citations from their own communities in middle cores. At the same time, non-segregated researchers show higher values for the four metrics when they are in cores near the nucleus.

7 Discussion

Due to a range of social mechanisms, processes, and biases, co-authorship networks are organised in communities [9]. Within-group dynamics might lead to the emergence of segregation and polarisation, hampering innovation, social learning, and problem-solving [12–14, 16]. Nevertheless, cohesive groups allow for the development of common narratives and language, offer support, and share knowledge. As such, they have been identified as a locus for exploitation (when large and in central locations) and exploration (when small and in the periphery) of ideas, results, and methods [19, 42]. Still, our understanding of segregated groups in co-authorship networks and their possible effects is limited. Here, we tackle this problem by quantifying segregation categories of communities in co-authorship networks and characterising their topological properties and position in the network.

For our case study, we analyse the co-authorship network of Computer Science in the Semantic Scholar Open Research Corpus [23]. We detect communities with the Label-propagation algorithm and compute a structural segregation metric considering the community’s links: the Spectral Segregation Index (SSI). Based on the distribution of the SSI, we identify three main categories and focus on the two opposite limits: non-segregated and highly segregated communities. Then, we compare the communities’ size, density, clustering, and core position between categories. Furthermore, we study the relationship between segregation and impact using citations from the community’s publications.
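To make the segregation metric more concrete, the sketch below computes one common formulation of a spectral segregation index for a single community: the largest eigenvalue of the row-normalised adjacency matrix restricted to the community's members, where each row is divided by the member's total degree (links inside and outside the community). This is an assumption about the formulation, and the normalisation used in this study may differ. The function name, the unweighted adjacency format (a dictionary of neighbour sets), and the requirement that the community's internal subgraph is connected are all illustrative choices.

```python
def spectral_segregation_index(adj, community, iters=1000, tol=1e-12):
    """Largest eigenvalue of the within-community interaction matrix B.

    adj:       dict mapping node -> set of neighbours (undirected, unweighted)
    community: set of nodes, assumed to induce a connected subgraph
    B[u][v] = 1 / degree(u) if v is a neighbour of u inside the community.
    """
    members = list(community)
    degree = {u: len(adj[u]) for u in members}
    x = {u: 1.0 for u in members}
    lam = 0.0
    for _ in range(iters):
        # Power iteration on B + I: the shift avoids oscillation on
        # bipartite-like communities and raises the eigenvalue by exactly 1.
        y = {}
        for u in members:
            inside = sum(x[v] for v in adj[u] if v in community)
            y[u] = x[u] + (inside / degree[u] if degree[u] else 0.0)
        new_lam = max(y.values())
        x = {u: y[u] / new_lam for u in members}
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    return lam - 1.0
```

Under this formulation, a community with no links to the rest of the network has every row of B summing to one and therefore an index of exactly 1, while outside links lower the value.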

Our results indicate that highly segregated communities tend to be located more on the periphery, with some differences in density and clustering compared with non-segregated communities. When we analyse the total number of citations, researchers in highly segregated communities receive more citations than those in non-segregated ones in middle and peripheral cores. In addition, when we analyse the sources of those citations, researchers in highly segregated communities in middle cores receive up to 5% more of their citations from their own community than researchers in non-segregated communities. Combining both results and based on previous literature, we speculate that, in terms of spreading ideas and knowledge in the co-authorship network: (i) researchers in highly segregated communities attract more citations in the periphery of the network because the most cited papers are not the internal ones but rather those written across communities with diverse disciplines and co-authors [43], and (ii) researchers in non-segregated communities in the nucleus are citing themselves more and are exploiting/echoing scientific research [18].

Both effects need further analysis because, as expected, highly segregated communities located on the periphery have a larger impact. Individual success correlates with the exploitation of ideas [18]. Still, the most innovative research (exploration of new concepts and persistent citations) also comes from the periphery of networks [19] and is done by smaller groups of researchers [42]. Here, our results align with previous evidence showing that nodes in the periphery are less active [38] (i.e. they publish less in our case) but have more impact. In addition, researchers in those communities form a large population that could become a collective power able to mobilise and spread information [39] (such as scientific theories).

Researchers in larger and non-segregated communities in the nucleus also increase their impact. These results need further exploration because their central positions in the network’s nucleus increase their chance of outside interactions with highly segregated communities, which can accelerate the propagation of echoed information (ranging from biased theories to new paradigms) from local groups to reach the entire network [44]. The inner impact of highly segregated communities and their impact on the whole network should be measured to intervene, if necessary, and tackle or boost the spread of echoed information to different groups [17].

7.1 Limitations

First, our analysis does not generalise to all the years of Computer Science papers available in the Semantic Scholar database because we study just three years (2006, 2010, and 2014). We have developed a repeatable methodology and replicated our findings across these years. Still, further analysis is needed to understand how the transitions of researchers between different segregation categories affect their research impact over time.

Second, our analyses only generalise to some co-authorship networks because the publications of Computer Science in the Semantic Scholar Open Research Corpus represent a vast amount of literature in a discipline prone to working in small teams [29]. Further analysis of other fields is needed to understand how these patterns apply to different co-authorship structures.

Third, we did not classify the core-periphery type of our network. Recent work has highlighted the importance of understanding if the network is prone to be divided into cores as layers (as we did with the k-core decomposition algorithm) or if a hub/spoke core division is a better descriptor [45]. However, their results show that authorship networks are the most prone to have a core-layered typology, as we used in the current work. In further analyses, the definition of segregated communities should also consider the co-authorship network’s core typology.

Finally, our fourth limitation lies in using the extreme values of the SSI’s PDF from the co-authorship networks to define the segregation categories of communities. A more precise analysis could consider continuous values of the SSI, as well as other features and data, to better represent the consumption and production of scientific knowledge [6]. Future work could consider a continuous comparison of the metrics used in this analysis, publications’ content, researchers’ demographic diversity, and interdisciplinary citations.
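For reference, the layered core structure discussed in the third limitation can be obtained with a k-core decomposition. The sketch below is a simple quadratic-time version of the standard peeling procedure, given only as an illustration; the function name and the adjacency format (a dictionary of neighbour sets) are assumptions, and for networks of realistic size a bucket-based implementation such as `networkx.core_number` would be more appropriate.

```python
def core_numbers(adj):
    """Return the core number of every node.

    adj: dict mapping node -> set of neighbours (undirected, no self-loops)
    """
    degree = {u: len(neighbours) for u, neighbours in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel the node with the smallest remaining degree.
        u = min(remaining, key=degree.__getitem__)
        # The core number is the largest minimum degree seen so far.
        k = max(k, degree[u])
        core[u] = k
        remaining.remove(u)
        for v in adj[u]:
            if v in remaining:
                degree[v] -= 1
    return core
```

In this representation, nodes with the highest core number form the nucleus and nodes with the lowest core numbers form the periphery. How node-level cores are aggregated into a community's core position is not specified here and would need to follow the definition in the Methods.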

7.2 Future research

Future research on this topic could consider: (i) the temporal analysis of segregated communities and their relation to gaining more or fewer citations over time, (ii) the analysis of the diversity of the scientific publications inside the communities using opinion distance [13], together with their demographic diversity, to understand whether segregated and isolated communities lack diversity and echo research to the point of becoming polarised, (iii) the definition of lead researchers (using the hub/spoke core or author position in the publications) and the understanding of their relationship to segregated communities [46], and (iv) the measurement of the impact of segregated communities on the topology of the network formation and the spreading processes of scientific theories [47].