The structure of segregation in co-authorship networks and its impact on scientific production

Jaramillo, Ana Maria; Williams, Hywel T. P.; Perra, Nicola; Menezes, Ronaldo

doi:10.1140/epjds/s13688-023-00411-8

Regular article
Open access
Published: 09 October 2023

Volume 12, article number 47, (2023)
EPJ Data Science

Abstract

Co-authorship networks, where nodes represent authors and edges represent co-authorship relations, are key to understanding the production and diffusion of knowledge in academia. Social constructs, biases (implicit and explicit), and constraints (e.g. spatial, temporal) affect who works with whom and cause co-authorship networks to organise into tight communities with different levels of segregation. We aim to examine aspects of the co-authorship network structure that lead to segregation and its impact on scientific production. We measure segregation using the Spectral Segregation Index (SSI) and find four ordered categories: completely segregated, highly segregated, moderately segregated and non-segregated communities. We direct our attention to the non-segregated and highly segregated communities, quantifying and comparing their structural topologies and k-core positions. When considering communities of both categories (controlling for size), our results show no differences in density and clustering but substantial variability in the core position. Larger non-segregated communities are more likely to occupy cores near the network nucleus, while the highly segregated ones tend to be closer to the network periphery. Finally, we analyse differences in citations gained by researchers within communities of different segregation categories. Researchers in highly segregated communities get more citations from their community members in middle cores and gain more citations per publication in middle/periphery cores. Those in non-segregated communities get more citations per publication in the nucleus. To our knowledge, this work is the first to characterise community segregation in co-authorship networks and investigate the relationship between community segregation and author citations. 
Our results help study highly segregated communities of scientific co-authors and can pave the way for intervention strategies to improve the growth and dissemination of scientific knowledge.

1 Introduction

The social structures behind scientific production may profoundly affect the growth and dissemination of knowledge, the well-being of our societies, and the evolution of academic research [1]. Many studies have shown how socially influenced behaviours impact different aspects of the scientific enterprise. Examples include the selection of co-authors, citation rates, and peer review processes, which are subject to biases linked to authors’ attributes such as prestige [2], gender [3], and country of affiliation [4, 5].

Co-authorship networks, where nodes represent researchers and links represent co-authorship relations between them, have been shown to be key to the understanding and mapping of scientific production [6–22]. We obtained data from the Semantic Scholar Open Research Corpus [4, and (ii) researchers grouped by the core position of their communities for two categories: non-segregated and highly segregated in Fig. 5. We did not analyse our results by different ranges of internal papers due to the low correlation with the citation variables.

We use two statistical tests to compare the CDFs of non-segregated and highly segregated communities: Kolmogorov-Smirnov (KS) and Mann-Whitney (MW). The first test compares the shape of the distributions, and the second compares the differences between medians.
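As a rough illustration of how these two tests can be applied to a citation metric, the sketch below uses SciPy's `ks_2samp` and `mannwhitneyu` on two synthetic samples. The helper `compare_groups`, the variable names, and the log-normal sample values are hypothetical and are not taken from the study's data or code.

```python
import numpy as np
from scipy.stats import ks_2samp, mannwhitneyu


def compare_groups(a, b):
    """Return (KS p-value, MW p-value) for two samples of one metric."""
    # Two-sample Kolmogorov-Smirnov: compares the two empirical CDFs.
    ks = ks_2samp(a, b)
    # Mann-Whitney U: rank-based test for a shift between the two samples.
    mw = mannwhitneyu(a, b, alternative="two-sided")
    return ks.pvalue, mw.pvalue


# Synthetic stand-ins for a metric such as total citations (assumption).
rng = np.random.default_rng(0)
non_segregated = rng.lognormal(mean=1.0, sigma=1.0, size=500)
highly_segregated = rng.lognormal(mean=1.5, sigma=1.0, size=500)

ks_p, mw_p = compare_groups(non_segregated, highly_segregated)
same_ks_p, same_mw_p = compare_groups(non_segregated, non_segregated)
```

Small p-values from both tests suggest that the two groups differ in both distribution shape and typical value; comparing a sample with itself gives a KS p-value of 1.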

We first analyse the CDFs for (i) Total citations (TC) and (ii) Citations per paper (CP). On an aggregated level (Fig. 4, top row), our results indicate no differences between highly segregated and non-segregated researchers in terms of either TC or CP. Completely segregated researchers (darker red in the plot) have smaller values than other researchers, but the differences are not significant. However, these aggregated results hide some information because they average over all network cores. In Fig. 5, we therefore group the researchers by the core position of their communities and split the results into the nucleus, middle, and periphery. In middle and periphery cores, highly segregated researchers have more TC than non-segregated ones, with opposite results in the nucleus (top row). For CP (second row), there are no differences in the middle or periphery cores, but non-segregated researchers have more CP in the nucleus.

Then, we analyse the CDFs for (iii) the proportion of Citations from the same community (CC) and (iv) the proportion of citations from the same year’s co-authors (CN). To compute these proportions, we count the citing publications in which at least one author satisfies the rule of being in the same community (for CC) or being a co-author (for CN, regardless of the community). Then, we divide these counts by the total number of citations.
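The counting rule can be sketched as follows. This is a minimal illustration, assuming that each citing publication is available as a set of author identifiers; the function name `citation_proportions`, the toy identifiers, and the choice to return 0.0 for uncited researchers are assumptions. Whether the focal researcher is included in the community or co-author sets is left to the caller.

```python
def citation_proportions(citing_papers, community, coauthors):
    """Return (CC, CN) for one researcher.

    citing_papers: list of sets, the author IDs of each citing publication.
    community: set of author IDs in the researcher's community.
    coauthors: set of author IDs of the researcher's same-year co-authors.
    """
    total = len(citing_papers)
    if total == 0:
        # Assumption: researchers without citations get proportion 0.
        return 0.0, 0.0
    # A citing paper counts once if at least one of its authors matches.
    cc = sum(1 for authors in citing_papers if authors & community)
    cn = sum(1 for authors in citing_papers if authors & coauthors)
    return cc / total, cn / total


# Toy example with four citing publications (hypothetical author IDs).
citing = [{"a", "x"}, {"y"}, {"b", "c"}, {"z"}]
cc, cn = citation_proportions(citing, community={"a", "b"}, coauthors={"a", "q"})
```

In this toy case, two of the four citing publications include a community member and one includes a co-author, so CC is 0.5 and CN is 0.25.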

On an aggregated level (Fig. 4 second row), our results show that highly segregated researchers have more CC than non-segregated ones while there is no difference for CN. In addition, completely segregated researchers (darker red) receive lower CC and CN than others. There are no differences in the periphery when we group by the core position (Fig. 5 third and fourth rows). However, in middle cores, highly segregated researchers have more CC and CN; in the nucleus, non-segregated researchers have larger values.

We compare the results of 2010 with those in 2006 and 2014 in Section S8. For TC, highly segregated researchers outperform non-segregated in the periphery and middle cores, but there are no significant differences for CP. In the nucleus, non-segregated researchers do better for both TC and CP. There are no differences in CC and CN for non-segregated and highly segregated researchers, but for 2014 the trends are similar to those in 2010.

In summary, highly segregated researchers tend to have more citations per paper when they are located in peripheral cores and more citations from their communities in middle cores. At the same time, non-segregated researchers show higher values for the four metrics when they are in cores near the nucleus.

7 Discussion

Due to a range of social mechanisms, processes, and biases, co-authorship networks are organised in communities [9]. Within-group dynamics might lead to the emergence of segregation and polarisation, hampering innovation, social learning, and problem-solving [12–14, 16]. Nevertheless, cohesive groups allow for the development of common narratives and language, offer support and share knowledge. As such, they have been identified as a locus for exploitation (when large in central locations) and exploration (when small in the periphery) of ideas, results, and methods [19, 42]. Still, understanding segregated groups in co-authorship networks and their possible effects is limited. Here, we tackle this problem by quantifying segregation categories of communities in co-authorship networks and characterising their topological properties and position in the network.

For our case study, we analyse the co-authorship network of Computer Science in the Semantic Scholar Open Research Corpus [23]. We detect communities with the Label-propagation algorithm and compute a structural segregation metric considering the community’s links: the Spectral Segregation Index (SSI). Based on the distribution of the SSI, we identify three main categories and focus on the two opposite limits: non-segregated and highly segregated communities. Then, we compare the communities’ size, density, clustering, and core position between categories. Furthermore, we study the relationship between segregation and impact using citations from the community’s publications.
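The pipeline described above can be sketched on a toy graph. This is a simplified illustration and not the authors' implementation: it assumes an unweighted network, uses NetworkX's `asyn_lpa_communities` for label propagation, and takes the SSI of a community to be the largest eigenvalue of its within-community interaction matrix, with each row normalised by the node's total degree, following one common reading of the index of Echenique and Fryer [32]. The helper `community_ssi` and the toy edges are hypothetical.

```python
import networkx as nx
import numpy as np
from networkx.algorithms.community import asyn_lpa_communities


def community_ssi(G, members):
    """Largest eigenvalue of the row-normalised within-community matrix."""
    members = list(members)
    index = {node: i for i, node in enumerate(members)}
    B = np.zeros((len(members), len(members)))
    for u in members:
        degree = G.degree(u)  # all links of u, inside and outside the community
        for v in G.neighbors(u):
            if v in index:  # keep only links that stay inside the community
                B[index[u], index[v]] = 1.0 / degree
    return float(np.max(np.linalg.eigvals(B).real))


# Toy co-authorship graph: two triangles joined by one bridge (2-3),
# plus a third triangle with no outside links.
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
G.add_edges_from([(6, 7), (7, 8), (6, 8)])

# Community detection by label propagation (seed fixed for repeatability).
communities = [set(c) for c in asyn_lpa_communities(G, seed=0)]

# k-core index of every node, used to place communities in the network.
core = nx.core_number(G)

ssi_bridged = community_ssi(G, {0, 1, 2})   # has one outside link
ssi_isolated = community_ssi(G, {6, 7, 8})  # has no outside links
ssi_detected = [community_ssi(G, c) for c in communities]
```

Under this definition, a connected community with no outside links reaches the maximum value of 1, while outside links pull the value below 1.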

Our results indicate that highly segregated communities tend to be more on the periphery, with some differences in density and clustering with non-segregated communities. When we analyse the total number of citations, researchers in highly segregated communities receive more citations than non-segregated ones in middle and peripheral cores. In addition, when we analyse the sources of those citations, for researchers in highly segregated communities, up to 5% more of those citations come from the same community than for researchers in non-segregated communities in middle cores. Combining both results and based on previous literature, we speculate that in terms of spreading ideas and knowledge in the co-authorship network: (i) researchers in highly segregated communities attract more citations in the periphery of the network because most cited papers are not the internal ones but rather those across communities with diverse disciplines and co-authors [43]; and (ii) researchers in non-segregated communities in the nucleus are citing themselves more and are exploiting/echoing scientific research [18].

Both effects need further analysis because, as expected, highly segregated communities located on the periphery have a larger impact. Individual success correlates with the exploitation of ideas [18]. Still, the most innovative research (exploration of new concepts and persistent citations) also comes from the periphery of networks [19] and is done by smaller groups of researchers [42]. Here, our results align with previous evidence showing nodes in the periphery being less active [38] (i.e. publishing less in our case) but having more impact. In addition, researchers in those communities are a large population that could become a collective power that can mobilise and spread information [39] (such as scientific theories).

Researchers in larger and non-segregated communities in the nucleus also increase their impact. These results need further exploration because their central positions in the network’s nucleus increase their chance of outside interactions with highly segregated communities, which can accelerate the propagation of echoed information (ranging from biased theories to new paradigms) from local groups to reach the entire network [44]. The inner impact of highly segregated communities and their impact on the whole network should be measured to intervene, if necessary, and tackle or boost the spread of echoed information to different groups [17].

7.1 Limitations

First, our analysis does not generalise for all the years of Computer Science papers available in the Semantic Scholar database because we study just three years. We have developed a repeatable methodology and replicated our findings over several years. Still, further analysis is needed to understand how the transitions of researchers between different segregation categories affect their research impact over time.

Second, our analyses only generalise to some co-authorship networks because the publications of Computer Science in the Semantic Scholar Open Research Corpus represent a vast amount of literature in a discipline prone to working in small teams [29]. Further analysis of other fields is needed to understand how these patterns apply to different co-authorship structures.

Third, we did not classify the core-periphery type of our network. Recent work has highlighted the importance of understanding if the network is prone to be divided into cores as layers (as we did with the k-core decomposition algorithm) or if a hub/spoke core division is a better descriptor [45]. However, their results show that authorship networks are the most prone to have a core-layered typology, as we used in the current work. In further analyses, the definition of segregated communities should also consider the co-authorship network’s core typology.

Finally, our fourth limitation stems from using the extreme values of the SSI’s PDF from the co-authorship networks to define segregation categories of communities. A more precise analysis could consider continuous values of the SSI, as well as other features and data, to better represent the consumption and production of scientific knowledge [6]. Future work could consider a continuous comparison of the metrics used in this analysis, publications’ content, researchers’ demographic diversity, and interdisciplinary citations.

7.2 Future research

Future research on this topic could consider: (i) the temporal analysis of segregated communities and their relation to gaining more or fewer citations over time, (ii) the analysis of the diversity of the scientific publications inside the communities using opinion distance [13] and their demographic diversity to understand if the segregated and isolated communities are not diverse and echoing research to the point of becoming polarised, (iii) the definition of lead researchers (using the hub/spoke core or author position in the publications) and the understanding of their relationship to segregated communities [46], and (iv) the measurement of the impact of segregated communities on the topology of the network formation and the spreading processes of scientific theories [47].

Availability of data and materials

The datasets generated and analysed during the current study are available in the Semantic Scholar repository, https://www.semanticscholar.org/product/api

Notes

A fixed social group into which an individual is born within a particular system of social stratification, particularly used in Hinduism.

Abbreviations

SSI: Spectral Segregation Index
LCC: Largest Connected Component
PDF: Probability density function
CDF: Cumulative distribution function
TC: Total citations
CP: Citations per paper
CC: Citations from the same community
CN: Proportion of citations from the same year’s co-authors

References

Fortunato S, Bergstrom CT, Börner K, Evans JA, Helbing D, Milojević S, Petersen AM, Radicchi F, Sinatra R, Uzzi B, Vespignani A, Waltman L, Wang D, Barabási AL (2018) Science of science. Science 359(6379). https://doi.org/10.1126/science.aao0185
Lynn FB (2014) Diffusing through disciplines: insiders, outsiders, and socially influenced citation behavior. Soc Forces 93(1):355–382. https://doi.org/10.1093/sf/sou069
Sugimoto CR, Lariviere V, Ni C, Gingras Y, Cronin B (2013) Global gender disparities in science. Nature 504:211–213
Smith MJ, Weinberger C, Bruna EM, Allesina S (2014) The scientific impact of nations: journal placement and citation performance. PLoS ONE 9(10):1–6. https://doi.org/10.1371/journal.pone.0109195
Opthof T, Coronel R, Janse MJ (2002) The significance of the peer review process against the background of bias: priority ratings of reviewers and editors and the prediction of citation, the role of geographical bias. Cardiovasc Res 56(3):339–346. https://doi.org/10.1016/S0008-6363(02)00712-5
Zeng A, Shen Z, Zhou J, Wu J, Fan Y, Wang Y, Stanley HE (2017) The science of science: From the perspective of complex systems. https://doi.org/10.1016/j.physrep.2017.10.001
Pan RK, Kaski K, Fortunato S (2012) World citation and collaboration networks: uncovering the role of geography in science. Sci Rep 2(1):902. https://doi.org/10.1038/srep00902
Pan RK, Petersen AM, Pammolli F, Fortunato S (2018) The memory of science: inflation, myopia, and the knowledge network. J Informetr 12(3):656–678. https://doi.org/10.1016/j.joi.2018.06.005. arXiv:1607.05606
Newman MEJ (2006) Finding community structure in networks using the eigenvectors of matrices. Phys Rev E 74:036104. https://doi.org/10.1103/PhysRevE.74.036104
Bettencourt LMA, Kaiser DI, Kaur J (2009) Scientific discovery and topological transitions in collaboration networks. J Informetr 3(3):210–221. https://doi.org/10.1016/j.joi.2009.03.001
Sunstein CR (2018) #Republic: Divided Democracy in the Age of Social Media, Ned - new edition edn. Princeton University Press, Princeton, pp 59–97. https://doi.org/10.2307/j.ctv8xnhtd
Kim S (2019) Directionality of information flow and echoes without chambers. PLoS ONE 14(5):1–22. https://doi.org/10.1371/journal.pone.0215949
Sasahara K, Chen W, Peng H, Ciampaglia GL, Flammini A, Menczer F (2021) Social influence and unfollowing accelerate the emergence of echo chambers. J Comput Soc Sci. https://doi.org/10.1007/s42001-020-00084-7. arXiv:1905.03919
Perra N, Rocha LEC (2019) Modelling opinion dynamics in the age of algorithmic personalisation. Sci Rep 9(1):1–11. https://doi.org/10.1038/s41598-019-43830-2. arXiv:1811.03341
Del Vicario M, Vivaldo G, Bessi A, Zollo F, Scala A, Caldarelli G, Quattrociocchi W (2016) Echo chambers: emotional contagion and group polarization on Facebook. Sci Rep 6:1–12. https://doi.org/10.1038/srep37825. arXiv:1607.01032
Henry AD, Prałat P, Zhang CQ (2011) Emergence of segregation in evolving social networks. Proc Natl Acad Sci USA 108(21):8605–8610. https://doi.org/10.1073/pnas.1014486108
Jalali ZS, Wang W, Kim M, Raghavan H, Soundarajan S (2020) On the information unfairness of social networks. In: Proceedings of the 2020 SIAM international conference on data mining, SDM 2020, pp 613–621. https://doi.org/10.1137/1.9781611976236.69
Mason W, Watts DJ (2012) Collaborative learning in networks. Proc Natl Acad Sci USA 109(3):764–769. https://doi.org/10.1073/pnas.1110069108
Painter DT, Daniels BC, Laubichler MD (2021) Innovations are disproportionately likely in the periphery of a scientific network. Theory Biosci 140(4):391–399. https://doi.org/10.1007/s12064-021-00359-1
Nielsen MW, Bloch CW, Schiebinger L (2018) Making gender diversity work for scientific discovery and innovation. Nat Hum Behav 2(10):726–734. https://doi.org/10.1038/s41562-018-0433-1
Sonnenwald DH (2008) Scientific collaboration. Annu Rev Inf Sci Technol 41(1):643–681
Tedre M (2017) The science of computing: shaping a discipline. CRC Press, Boca Raton
Lo K, Wang LL, Neumann M, Kinney R, Weld D (2020) S2ORC: the semantic scholar open research corpus. https://doi.org/10.18653/v1/2020.acl-main.447. arXiv:1911.02782
Raghavan UN, Albert R, Kumara S (2007) Near linear time algorithm to detect community structures in large-scale networks. Phys Rev E 76(3). https://doi.org/10.1103/physreve.76.036106
Newman MEJ (2004) Who is the best connected scientist? A study of scientific coauthorship networks. J Complex Netw, 337–370. https://doi.org/10.1007/978-3-540-44485-5_16
Cann TJB, Weaver IS, Williams HTP (2018) Is it correct to project and detect? Assessing performance of community detection on unipartite projections of bipartite networks. In: Complex networks and their applications VII. Springer, Cham, pp 267–279. https://doi.org/10.1007/978-3-030-05411-3_22
Barrat A, Barthélemy M, Pastor-Satorras R, Vespignani A (2004) The architecture of complex weighted networks. Proc Natl Acad Sci 101(11):3747–3752. https://doi.org/10.1073/pnas.0400087101
Fortunato S, Hric D (2016) Community detection in networks: a user guide. Phys Rep 659:1–44. https://doi.org/10.1016/j.physrep.2016.09.002
Newman MEJ (2001) The structure of scientific collaboration networks. In: PNAS
Lancichinetti A, Saramäki J, Kivelä M, Fortunato S (2010) Characterizing the community structure of complex networks. PLoS ONE. https://doi.org/10.1371/journal.pone.0011976. arXiv:1005.4376
Fanelli D, Larivière V (2016) Researchers’ individual publication rate has not increased in a century. PLoS ONE 11(3):1–12. https://doi.org/10.1371/journal.pone.0149504
Echenique F, Fryer RG (2007) A measure of segregation based on social interactions. Q J Econ. https://doi.org/10.1162/qjec.122.2.441
Montes F, Jimenez RC, Onnela J-P (2017) Connected but segregated: social networks in rural villages. J Complex Netw 6(5):693–705. https://doi.org/10.1093/comnet/cnx054. https://academic.oup.com/comnet/article-pdf/6/5/693/26058916/cnx054.pdf
Girvan M, Newman MEJ (2002) Community structure in social and biological networks. Proc Natl Acad Sci 99(12):7821–7826. https://doi.org/10.1073/pnas.122653799
Bojanowski M, Corten R (2014) Measuring segregation in social networks. Soc Netw. https://doi.org/10.1016/j.socnet.2014.04.001
Scott DW (1992) Multivariate density estimation. Wiley, Huston. https://doi.org/10.1002/9780470316849
Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69:026113. https://doi.org/10.1103/PhysRevE.69.026113
Williams HTP, McMurray JRJR, Kurz T, Hugo Lambert F (2015) Network analysis reveals open forums and echo chambers in social media discussions of climate change. Glob Environ Change 32:126–138. https://doi.org/10.1016/j.gloenvcha.2015.03.006
Barberá P, Wang N, Bonneau R, Jost JT, Nagler J, Tucker J, González-Bailón S (2015) The critical periphery in the growth of social protests. PLoS ONE. https://doi.org/10.1371/journal.pone.0143611
Batagelj V, Zaversnik M (2003) An O(m) algorithm for cores decomposition of networks. arXiv:cs/0310049
Cronin B, Sugimoto CR (2015) Scholarly metrics under the microscope: from citation analysis to academic auditing. ASIST monograph series, Medford, NJ. https://doi.org/10.5596/c15-025
Wu L, Wang D, Evans JA (2019) Large teams develop and small teams disrupt science and technology. Nature 566(7744):378–382. https://doi.org/10.1038/s41586-019-0941-9
Zingg C, Nanumyan V, Schweitzer F (2020) Citations driven by social connections? A multi-layer representation of coauthorship networks. Quant. Sci. Stud. 1(4):1493–1509. https://doi.org/10.1162/qss_a_00092. arXiv:1909.13507
Davis JT, Perra N, Zhang Q, Moreno Y, Vespignani A (2020) Phase transitions in information spreading on structured populations. Nat Phys. https://doi.org/10.1038/s41567-020-0810-3
Gallagher RJ, Young JG, Welles BF (2021) A clarified typology of core-periphery structure in networks. Sci Adv. https://doi.org/10.1126/sciadv.abc9800. arXiv:2005.10191
Guo L, Rohde JA, Wu HD (2020) Who is responsible for Twitter’s echo chamber problem? Evidence from 2016 U.S. election networks. Inf Commun Soc 23(2):234–251. https://doi.org/10.1080/1369118X.2018.1499793
Törnberg P (2018) Echo chambers and viral misinformation: Modeling fake news as complex contagion. PLoS ONE. https://doi.org/10.1371/journal.pone.0203958

Acknowledgements

The authors would like to thank the US Army Research Office for the partial support provided to RM under grant number W911NF-18-1-0421. AMJ is funded by a PhD studentship from the UK Engineering and Physical Sciences Research Council. No funding bodies had any influence over the content of this report.

Author information

Authors and Affiliations

BioComplex Laboratory, Department of Computer Science, University of Exeter, Exeter, UK
Ana Maria Jaramillo & Ronaldo Menezes
Complexity Science Hub, Vienna, Austria
Ana Maria Jaramillo
SEDA Lab, Department of Computer Science, University of Exeter, Exeter, UK
Hywel T. P. Williams
School of Mathematical Sciences, Queen Mary University of London, London, UK
Nicola Perra
Department of Computer Science, Federal University of Ceará, Fortaleza, Brazil
Ronaldo Menezes

Contributions

All authors conceived and designed the research. AMJ acquired the data. AMJ, HTPW, NP and RM analysed the data. All authors discussed the research and wrote and approved the final version of the manuscript.

Corresponding author

Correspondence to Ana Maria Jaramillo.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

(PDF 29.8 MB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Jaramillo, A.M., Williams, H.T.P., Perra, N. et al. The structure of segregation in co-authorship networks and its impact on scientific production. EPJ Data Sci. 12, 47 (2023). https://doi.org/10.1140/epjds/s13688-023-00411-8

Received: 08 November 2022
Accepted: 07 August 2023
Published: 09 October 2023
DOI: https://doi.org/10.1140/epjds/s13688-023-00411-8
