Introduction

Open Source Software (OSS) is a key actor in the current software market, and a major factor in the sustained growth of the software economy. According to the Open Source Initiative1, the promise of OSS is better quality, higher reliability, more flexibility, lower cost, and an end to predatory vendor lock-in. These goals are achieved thanks to the active participation of the community2: indeed, OSS projects depend on contributors to progress3,4.

The emergence of GitHub and other platforms as prominent public repositories, together with the availability of APIs to access comprehensive datasets on most projects’ history, has opened up opportunities for more systematic and inclusive analyses of how OSS communities operate. In recent years, research on OSS has produced a rich body of empirical findings. For example, we now know that code contributions are highly skewed towards a small subset of projects5,6, with many projects quickly losing community interest and being abandoned at very early stages7. Moreover, most projects have a low truck factor, meaning that a small group of developers is responsible for a large share of code contributions8,9,10. This pushes projects to depend more and more on their ability to attract and retain occasional contributors (whose contributions are also known as “drive-by” commits11) who can complement the few core developers and help them move the project forward. Along these lines, several works have focused on strategies to increase the on-boarding and engagement of such contributors (e.g., by using simple contribution processes12, extensive documentation13, gamification techniques14 or ad hoc on-boarding portals15, among others16). Other social, economic, and geographical factors affecting the development of OSS have been scrutinised as well; see Cosentino et al.17 for a thorough review.

Parallel to these macroscopic observations and statistical analyses, social scientists and complex network researchers have focused, in comparatively few papers, on analysing how a diverse group of (distributed) contributors work together, i.e. the structural features of projects. Most often, these works pivot on the interactions between developers, building explicit or implicit collaborative networks, e.g. from email exchanges18,19 or from unipartite projections of the contributor-file bipartite network20, respectively. These developer social networks have been analysed to better understand the hierarchies that emerge among contributors, as well as to identify topical clusters, i.e. cohesive subgroups that manifest strongly in technical discussions. However, the behaviour of OSS communities cannot be fully understood by accounting only for the relations between project contributors, since their interactions are mostly mediated through the editing of project files (no direct communication is present between group members). To overcome this limitation, here we focus on studying the structural organisation of OSS projects as contributor-file bipartite graphs. Beyond the technical and methodological adaptations it requires, considering the two elements composing the OSS system retains valuable information (as opposed to collapsing it onto a unipartite network) and, above all, recognises both classes as co-evolutionary units that place mutual constraints on each other.

Our interest in the structural features of OSS projects starts from some obvious, but worth highlighting, observations. First, public collaborative repositories place, in principle, no limits on the number of developers (and files) that a project can host. In this sense, platforms like GitHub resemble online social networks (e.g. Twitter or Facebook), in which the number of allowed connections is virtually unbounded. However, we know that other factors –biological, cognitive– set well-defined limits on the number of active social connections an individual can have21, also online22. But do these limits apply in collaborative networks, where contributors work remotely and asynchronously? Does a division of labour arise, even when interaction among developers is mostly indirect (that is, via the files that they edit in common)? And, even if specialised subgroups emerge (as some evidence already suggests, at least in developer social networks20), do they exhibit some sort of internal organisation?

To answer these questions, we will look at three structural arrangements which have been identified as signatures of self-organisation in both natural and artificial systems: nestedness23,24, modularity25,26,27, and in-block nestedness28,29,30. The first one, nestedness, is a suitable measure to quantify and visualise how the mentioned low truck factor, and the existence of core/drive-by developers31, translates into a project’s network structure. As for modularity, it provides a natural way to check whether OSS projects split into identifiable compartments, suggesting specialisation, and whether such compartments are subject to size limitations, in line with the mentioned bio-cognitive limits. Finally, since modularity and nestedness are, to some extent, incompatible in the same network32,17, in-block nestedness allows us to test whether both arrangements coexist at the mesoscale, i.e. whether the emerging compartments are themselves internally nested.

These findings open up a rich scenario, with many questions lying ahead. On the OSS environment side, our results contribute to an understanding of how successful projects self-organise towards a modular architecture: large and complex tasks, involving hundreds (and even thousands) of files, appear to be broken down, presumably for the sake of efficiency and task specialisation (division of labour). Within this compartmentalisation, mature projects exhibit even further organisation, arranging the internal structure of subgroups in a nested way –something that is not captured by modularity optimisation alone. More broadly, our results demand further investigation, to understand their connection with the general topic of work team assembly (size, composition, and formation), and with the (urgent) issue of software sustainability38. OSS is a prominent example of the “tragedy of the commons”: companies and services benefit from the software, but there is a grossly disproportionate imbalance between those consuming the software and those building and maintaining it. Indeed, by being more aware of the internal self-organisation of their projects, owners and administrators may design strategies to optimise the collaborative efforts of the limited number (and availability) of project contributors. For instance, they can direct efforts to drive the project’s actual block decomposition towards a pre-defined software architectural pattern, or ensure that, despite the nested organisation within blocks, all files in a block receive some minimal attention. More research on deriving effective project management and leadership strategies from the current division of labour in a project is clearly needed, and would be impactful.

Closer to the complex networks and data analysis tradition, our results leave room to widen the scope of this research. First, the present analysis could be complemented with weighted information. At first thought, this is within reach –one should just adapt the techniques and measurements to a weighted scenario. However, the problem is not so much methodological as semantic: the number of times that a contributor interacts with a file (commits, in Git jargon) is not necessarily an accurate measure of the amount of information allocated in the file. Second, future research should tackle a larger and more heterogeneous set of projects, even across different platforms such as Bitbucket. Admittedly, this work has focused on successful projects, inasmuch as we only consider a few dozen of the most popular ones. Other sampling criteria could be discussed and considered in the future, to ensure a richer and more diverse project collection. Beyond the richness of the analysed dataset, the relationship between maturity and structural arrangement (especially in regard to the internal organisation of subgroups) clearly demands further work. Two obvious –and intimately related– lines of research concern time-resolved datasets, and the design of a suitable model that can mimic the growth and evolution of OSS projects. Regarding a temporal account of OSS projects, some challenges emerge due to the bursty development of projects in git-like environments. For example, a fixed sliding-window scheme would probably harm, rather than improve, possible insights into software development. On the modelling side, further empirical knowledge is needed to better grasp the cooperative-competitive interactions within this type of project, which in turn determine the dynamical rules for contributors and files –rules that, presumably, differ largely between the two classes.

Material and Methods

Data

Our open source projects dataset was collected from GitHub39, a social coding platform which provides source code management and collaboration features such as bug tracking, feature requests, task management and wikis for every project. Given that GitHub users can star a project (to show interest in its development and follow its advances), we chose to measure the popularity of a GitHub project in terms of its number of stars (i.e. the more stars, the more popular the project is considered) and selected the 100 most popular projects. Other possible criteria –number of forks, open issues, watchers, commits and branches– are positively correlated with stars17, and so our proxy to mature, successful and active projects probably overlaps with other sampling procedures. The construction of the dataset involved three phases, namely: (1) cloning, (2) import, and (3) enrichment.

Cloning and import

After collecting the list of the 100 most popular projects on GitHub (at the moment of collecting the data) via its API40, we cloned them to obtain 100 Git repositories. We analysed the cloned repositories and discarded those not involving the development of a software artifact (e.g. collections of links or questions), rejecting 15 projects out of the initial 100. We then imported the remaining Git repositories into a relational database using the Gitana41 tool to facilitate the querying and exploration of the projects for further analysis. In the Gitana database, Git repositories are represented in terms of users (i.e. contributors with a name and an email); files; commits (i.e. changes performed to the files); references (i.e. branches and tags); and file modifications. For two projects, the import process failed to complete due to missing or corrupted information in the source GitHub repository.

Enrichment

Our analysis needs a clear identification of the author of each commit, so that we can properly link contributors to the files they have modified. Unfortunately, Git does not verify the name and email that contributors indicate when pushing commits, which results in clashing and duplicity problems in the data. Clashing appears when two or more contributors have set the same name value (in Git the contributor name is manually configured), so that commits actually coming from different contributors appear under the same commit name (e.g., often when using common names such as “mike”). In addition, duplicity appears when a contributor uses several emails: commits that come from the same person are then linked to different emails, suggesting different contributors. We found that, on average, around 60% of the commits in each project were affected by a clashing/duplicity problem (with a similar share of files affected). To address this problem, we relied on data provided by GitHub for each project (in particular, GitHub usernames, which are unique). By linking commits to unique usernames, we could disambiguate the contributors behind the commits. Thus, we enriched our repository data by querying the GitHub API to discover the actual username behind each commit in our repository, and relied on those usernames instead of the information provided as part of the Git commit metadata. This method only failed for commits without an associated GitHub username (e.g. when the user that made the commit no longer exists on GitHub). In those cases, we kept the email in the Git commit as the contributor identifier. This considerably reduced the clashing/duplicity problem in our dataset: the percentage of commits affected by a potential clashing/duplicity problem dropped to 0.004% on average (σ = 0.011), and the percentage of files affected dropped to 0.020% (σ = 0.042).
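
As an illustration of this enrichment step, the sketch below queries the GitHub commits API for a single commit and falls back to the Git email when no username is attached. The repository name, commit SHA and access token are hypothetical placeholders, not values from our dataset.

```python
import requests

# Hypothetical repository, commit SHA and access token (placeholders only);
# a personal access token is used to raise the GitHub API rate limit.
OWNER, REPO, SHA = "some-owner", "some-repo", "0123abc"
TOKEN = "<personal-access-token>"

url = f"https://api.github.com/repos/{OWNER}/{REPO}/commits/{SHA}"
data = requests.get(url, headers={"Authorization": f"token {TOKEN}"}).json()

# GitHub attaches the unique platform username ("login") when it can match the
# commit author; otherwise we fall back to the email in the Git commit metadata.
author = data.get("author")
contributor_id = author["login"] if author else data["commit"]["author"]["email"]
print(contributor_id)
```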

At the end of this process, we had successfully collected a total of 83 projects, adding up to 48,015 contributors, 668,283 files and 912,766 commits. Eighteen more projects were later rejected due to other limitations, leaving the total of 65 reported in this work. On the one hand, we discarded some projects that presented a very strong divergence between the number of nodes in the two sets, e.g. projects with a very large number of files but very few contributors. In these cases, although \({\mathscr{N}}\), Q and \( {\mathcal I} \) can be quantified, the outcome is hardly interpretable. An example of this is the project material-designs-icons, with 15 contributors involved in the development of 12,651 files. As mentioned above, we also discarded projects that are not devoted to software development, but are rather collections of useful resources (free programming books, coding courses, etc.). Finally, we considered only projects with a bipartite network size within the range \({10}^{1}\le S\le {10}^{4}\), as the computational costs to optimise in-block nestedness and modularity for larger sizes were too severe. The complete dataset with the final 65 projects is available at http://cosin3.rdi.uoc.edu, under the Resources section.

Matrix generation

We build a bipartite unweighted network as a rectangular N × M matrix, where rows and columns refer to the contributors and source files of an OSS project, respectively. Cells represent links in the bipartite network: the cell \({a}_{ij}\) is set to 1 if contributor i has modified file j at least once, and to 0 otherwise.
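
As a minimal sketch of this construction, the snippet below builds the unweighted contributor-file matrix from a hypothetical list of (contributor, file) pairs extracted from the commit history; the names and files are illustrative only.

```python
import numpy as np

# Hypothetical list of (contributor, file) pairs extracted from the commit
# history: one pair per file modification, duplicates allowed.
edits = [("alice", "core.py"), ("alice", "README.md"),
         ("bob", "core.py"), ("bob", "utils.py"),
         ("carol", "core.py")]

contributors = sorted({c for c, _ in edits})
files = sorted({f for _, f in edits})
row = {c: i for i, c in enumerate(contributors)}
col = {f: j for j, f in enumerate(files)}

# N x M unweighted bipartite matrix: a_ij = 1 if contributor i has modified
# file j at least once, 0 otherwise.
A = np.zeros((len(contributors), len(files)), dtype=int)
for c, f in edits:
    A[row[c], col[f]] = 1

print(A)
```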

We are aware that an unweighted scheme may discard important information, i.e. the heterogeneity of time and effort that developers devote to files. We stress, however, that including weights in our analysis could introduce ambiguities in our results. In the GitHub environment, the size of a contribution could be regarded either as the number of times a developer commits to a file, or as the number of lines of code (LOC) that a developer modified when updating the file. Indeed, both could represent additional dimensions of our study. Furthermore, at least the first (number of commits) is readily available from the data collection methods. However, weighting the links of the network by the number of commits is risky. Consider, for example, a contributor who, after hours or days of coding and testing, performs a single commit that substantially changes a file in a project. On the other hand, consider a contributor who is simply documenting some code, thus committing small comments many times to existing software, without changing its internal logic. There is no simple way to distinguish these cases. The second option (number of LOC modified) could be a proxy for such a distinction, but this information is not realistically accessible given the current limitations to data collection. Getting a precise number of LOC requires a deeper analysis of the Git repository associated with the GitHub project, parsing the commit change information one by one –an unfeasible task if we aim to analyse a large set of projects. The same scalability issue would appear if we relied on the GitHub API to get this information, which would additionally involve quota problems with such an API. One might consider even a third argument: not every programming language “weighs” contributions in the same way. Many lines of HTML code may have a small effect on the actual advancement of a project, while two brief lines in C may completely change a whole algorithm. In conclusion, we believe there is no generic solution that allows us to assess the importance of a LOC variation in a contribution. This will depend first on the kind of file, then on the programming style of each project, and finally on an individual analysis of each change. Thus, adding informative and reliable weights to the network is semantically unclear (how should we interpret those weights?) and operationally out of reach.

Nestedness

The concept of nestedness appeared, in the context of complex networks, over a decade ago in Systems Ecology42. In structural terms, a perfectly nested pattern is observed when specialists (nodes with low connectivity) interact with proper subsets of the partners of generalists (nodes with high connectivity), see Fig. 2 (left). Several works have shown that a nested configuration is a signature feature of cooperative environments –those in which interacting species obtain some benefit42,43,44. Following this example from natural systems, scholars have sought (and found) this pattern in other kinds of systems32,45,46,47. In particular, measuring nestedness in OSS contributor-file bipartite networks helps to uncover patterns of file development. For instance, in a perfectly nested bipartite network the most generalist developer has contributed to every file in the project, i.e. a core developer. Other contributors exhibit decreasing numbers of edited files. On top of this hierarchical arrangement, we find asymmetry: specialist contributors (those working on a single file) work precisely on the most generalist file, i.e. the file that every other developer also works on. Here, we quantify the amount of nestedness in our OSS networks by employing the global nestedness fitness \({\mathscr{N}}\) introduced by Solé-Ribalta et al.30:

$${\mathscr{N}}=\frac{2}{N+M}\{\mathop{\sum }\limits_{i,j}^{N}\,[\frac{{O}_{i,j}-\langle {O}_{i,j}\rangle }{{k}_{j}(N-1)}\Theta ({k}_{i}-{k}_{j})]+\mathop{\sum }\limits_{l,m}^{M}\,[\frac{{O}_{l,m}-\langle {O}_{l,m}\rangle }{{k}_{m}(M-1)}\Theta ({k}_{l}-{k}_{m})]\},$$
(1)

where Oi,j (or Ol,m) measures the degree of link overlap between pairs of row (or column) nodes; ki and kj correspond to the degrees of nodes i and j; Θ(·) is a Heaviside step function that guarantees that we only compute the overlap between a pair of nodes when ki ≥ kj. Finally, 〈Oi,j〉 represents the expected overlap between row nodes i and j under the null model, and is equal to \(\langle {O}_{i,j}\rangle =\frac{{k}_{i}{k}_{j}}{M}\). This measure is in the tradition of other overlap measures, such as NODF48,49.
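
As an illustration, the following sketch computes \({\mathscr{N}}\) directly from Eq. (1) for a small binary matrix. It is a minimal implementation under the definitions above; the treatment of degree ties may differ slightly from the reference code distributed with Solé-Ribalta et al.30.

```python
import numpy as np

def nestedness_fitness(A):
    """Global nestedness fitness N of Eq. (1) for a binary bipartite matrix
    A (rows = contributors, columns = files). Minimal sketch: degree-tie
    handling may differ from the reference implementation."""
    A = np.asarray(A, dtype=float)
    N, M = A.shape

    def one_side(B, n_other):
        # B has the node class of interest on its rows
        k = B.sum(axis=1)          # degrees
        O = B @ B.T                # pairwise overlaps O_ij
        n = B.shape[0]
        total = 0.0
        for i in range(n):
            for j in range(n):
                if i == j or k[i] < k[j] or k[j] == 0:
                    continue
                null = k[i] * k[j] / n_other          # <O_ij> under the null model
                total += (O[i, j] - null) / (k[j] * (n - 1))
        return total

    return 2.0 / (N + M) * (one_side(A, M) + one_side(A.T, N))

# Toy example: a perfectly nested 4x4 matrix yields a high positive value.
A = np.array([[1, 1, 1, 1],
              [1, 1, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]])
print(nestedness_fitness(A))
```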

Modularity

A modular network structure (Fig. 2, centre) implies the existence of well-connected subgroups, which can be identified given the right heuristics. Unlike nestedness (which apparently emerges only in very specific circumstances), modularity has been reported in almost any kind of system: from food-webs50 to lexical networks51, the Internet27 and social networks52. Applied to OSS developer-file networks, modularity helps to identify blocks of developers working together on a set of files. High Q values in OSS projects would reveal some level of specialisation (division of labour) in the development of the project. However, if an OSS project is only modular (i.e., any trace of nestedness is missing), it may reveal that, beyond compartmentalisation, no further organisational refinement is at work. Here, we search for a (sub)optimal modular partition of the nodes through a community detection analysis26,27. To this end, we apply the extremal optimisation algorithm53 (along with a Kernighan-Lin54 refinement procedure) to maximise Barber’s26 modularity Q,

$$Q=\frac{1}{L}\mathop{\sum }\limits_{i=1}^{N}\,\mathop{\sum }\limits_{j=N+1}^{N+M}\,({\tilde{a}}_{ij}-{\tilde{p}}_{ij})\,\delta ({\alpha }_{i},{\alpha }_{j})$$
(2)

where L is the number of interactions (links) in the network, \({\tilde{a}}_{ij}\) denotes the existence of a link between nodes i and j, \({\tilde{p}}_{ij}={k}_{i}{k}_{j}/L\) is the probability that a link exists by chance, and δ(αi,αj) is the Kronecker delta function, which takes the value 1 if nodes i and j are in the same community, and 0 otherwise.
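
As an illustration, the sketch below evaluates Q from Eq. (2) for a given block assignment (the optimisation itself, via extremal optimisation and Kernighan-Lin refinement, is not shown); the partition labels are hypothetical inputs.

```python
import numpy as np

def barber_modularity(A, row_labels, col_labels):
    """Barber's bipartite modularity Q of Eq. (2) for a binary contributor-file
    matrix A and a given block assignment (one label per contributor and one
    per file)."""
    A = np.asarray(A, dtype=float)
    L = A.sum()                               # number of links
    k_rows = A.sum(axis=1)                    # contributor degrees
    k_cols = A.sum(axis=0)                    # file degrees
    P = np.outer(k_rows, k_cols) / L          # null-model term p_ij = k_i k_j / L
    same = np.asarray(row_labels)[:, None] == np.asarray(col_labels)[None, :]
    return float(((A - P) * same).sum() / L)

# Toy example: two disjoint contributor-file blocks give Q = 0.5.
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
print(barber_modularity(A, [0, 0, 1, 1], [0, 0, 1, 1]))
```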

In-block nestedness

Nestedness and modularity are emergent properties in many systems, but it is rare to find them in the same system. This apparent incompatibility has been noticed and studied, and it can be explained by different evolutionary pressures: certain mechanisms favour the emergence of blocks, while others favour the emergence of nested patterns. Following this logic, if two such mechanisms are concurrent, then hybrid (nested-modular) arrangements may appear. Hence, the third architectural organisation that we consider in our work refers to a mesoscale hybrid pattern, in which the network presents a modular structure, but the interactions within each module are nested, i.e. an in-block nested structure, see Fig. 2 (right). This type of hybrid or “compound” architecture was first described in Lewinsohn et al.28. Although the literature covering these types of patterns is still scarce, the existence of such hybrid structures in empirical networks has been recently explored29,30,55, and the results from these works seem to indicate that combined structures are, in fact, a common feature of many systems from different contexts.

In order to compute the amount of in-block nestedness present in networks, in this work we have adopted a recently introduced objective function30, capable of detecting these hybrid architectures, and employed the same optimisation algorithms used to maximise modularity. The in-block nestedness objective function can be written as

$$ {\mathcal I} =\frac{2}{N+M}\{\mathop{\sum }\limits_{i,j}^{N}\,[\frac{{O}_{i,j}-\langle {O}_{i,j}\rangle }{{k}_{j}({C}_{i}-1)}\Theta ({k}_{i}-{k}_{j})\,\delta ({\alpha }_{i},{\alpha }_{j})]+\mathop{\sum }\limits_{l,m}^{M}\,[\frac{{O}_{l,m}-\langle {O}_{l,m}\rangle }{{k}_{m}({C}_{l}-1)}\Theta ({k}_{l}-{k}_{m})\,\delta ({\alpha }_{l},{\alpha }_{m})]\},$$
(3)

where the symbols are as in Eq. (1), \({C}_{i}\) (\({C}_{l}\)) denotes the number of row (column) nodes in the block to which node i (l) belongs, and the Kronecker delta δ(·,·) restricts the sums to pairs of nodes in the same block. Note that, by definition, \( {\mathcal I} \) reduces to \({\mathscr{N}}\) when the number of blocks is 1. This explains why the right half of the ternary plot (Fig. 6) is necessarily empty: \( {\mathcal I} \ge {\mathscr{N}}\), and therefore \({f}_{ {\mathcal I} }\ge {f}_{{\mathscr{N}}}\). On the other hand, an in-block nested structure necessarily exhibits some level of modularity, but not the other way around. This explains why the lower-left area of the simplex in Fig. 6 is empty as well (see Palazzi et al.33 for details).
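
For completeness, a minimal sketch of this computation is given below: it evaluates \( {\mathcal I} \) from Eq. (3) for a given block assignment (again, the optimisation step is not shown), reusing the same overlap and null-model terms as the nestedness sketch above. The block labels are hypothetical inputs and tie handling may differ from the reference implementation.

```python
import numpy as np

def in_block_nestedness(A, row_labels, col_labels):
    """In-block nestedness fitness I of Eq. (3): nestedness restricted to node
    pairs sharing a block, normalised by the block size C. Minimal sketch with
    hypothetical block labels."""
    A = np.asarray(A, dtype=float)
    N, M = A.shape

    def one_side(B, labels, n_other):
        labels = np.asarray(labels)
        k = B.sum(axis=1)                     # degrees
        O = B @ B.T                           # pairwise overlaps
        C = np.array([(labels == lab).sum() for lab in labels])  # block sizes C_i
        total = 0.0
        n = B.shape[0]
        for i in range(n):
            for j in range(n):
                if i == j or labels[i] != labels[j]:
                    continue
                if k[i] < k[j] or k[j] == 0 or C[i] <= 1:
                    continue
                null = k[i] * k[j] / n_other  # <O_ij> under the null model
                total += (O[i, j] - null) / (k[j] * (C[i] - 1))
        return total

    return 2.0 / (N + M) * (one_side(A, row_labels, M)
                            + one_side(A.T, col_labels, N))

# With a single block, I reduces to the global nestedness fitness N.
A = np.array([[1, 1, 1, 1],
              [1, 1, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]])
print(in_block_nestedness(A, [0, 0, 0, 0], [0, 0, 0, 0]))
```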

The corresponding software codes for nestedness measurement, and modularity and in-block nestedness optimisation (both for uni- and bipartite cases), can be downloaded from the web page http://cosin3.rdi.uoc.edu/, under the Resources section.

Stationarity test

Figures 1 and 5 visually suggest that some quantities do not vary as a function of project size –or vary very slowly. As convincing as this visual hint may be, a statistical test is necessary to confirm that there is indeed a limit on the quantity at stake. Stationarity of a time series implies that summary statistics of the data, like the mean or variance, are approximately constant when measured from any two starting points in the series (different project sizes, in our case). Typically, statistical stationarity tests check for the presence (or absence) of a unit root in the time series (null hypothesis). A time series is said to have a unit root if we can write it as

$${y}_{t}={a}^{n}{y}_{t-n}+\sum _{i}\,{\varepsilon }_{t-i}{a}^{i}$$
(4)

where ε is an error term. If a = 1, the null hypothesis of non-stationarity cannot be rejected. On the contrary, if a < 1 there is no unit root, and the process is deemed stationary. In this work, we have employed the Augmented Dickey-Fuller (ADF) test56, as implemented in the statsmodels.tsa.stattools Python package. Under this test, if the test statistic is smaller than the critical values at the different significance levels, the null hypothesis of a unit root is rejected, and we can conclude that the data series is stationary.
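
As a minimal sketch of this test (run here on a synthetic stationary series, not on our data), the snippet below calls adfuller and compares the statistic against the tabulated critical values.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

# Synthetic series fluctuating around a constant mean; in the analysis this
# would be the measured quantity ordered by project size.
rng = np.random.default_rng(0)
series = 5.0 + rng.normal(scale=0.5, size=200)

adf_stat, pvalue, usedlag, nobs, crit, icbest = adfuller(series)
print(f"ADF statistic = {adf_stat:.2f}, p-value = {pvalue:.4f}")
print("critical values:", crit)

# Reject the unit-root null (i.e. deem the series stationary) when the
# statistic lies below the critical value at the chosen significance level.
print("stationary at the 5% level:", adf_stat < crit["5%"])
```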