Introduction

Since the advent of high-throughput dispensing and automated microscopy, methods and software tools for large-scale single-cell image analysis have blossomed and enabled profiling and comparison of large ranges of perturbations on cell cultures1,2,3,4. Variance estimation and statistical testing in these approaches are achieved by simply producing several replicates per condition. This is made possible by the fact that cell culture permits robust, standardized, and sometimes fully automated replication of a sample condition. In contrast, while it is still possible to detect and quantify a single-cell event in large slides of tissue, the comparison and statistical analysis of such events have remained hampered, apart from stereotypic exceptions, by the imprecision of microdissection, the spatial inhomogeneity of samples, and notorious replicate variability5. In fact, spatial inhomogeneity is what makes tissues interesting to study in comparison to cell culture, which often spreads a single or a few cell types uniformly but barely matches the spatial organization of these cells in an organism. In tissue samples, the state of a cell within its local context is observed once, and this exact context can in general barely be reproduced with precision. Therefore, as heterogeneity across a sample is the rule, obtaining robust standardized replicates is difficult for a single-cell event and impossible for a small cell patch or cell organization pattern observed locally. The unavailability of reliable replicates of such an event makes comparison of between-versus-within group variance irrelevant and statistical evidence unworkable.

However, what is sought in studies that rely on tissue sample observation are the factors underlying cell organization, developmental process, or disease progression, independent of the variability between observations. From this point, how can one deal with the impossibility of obtaining robust standardized replicates of an event? Is it possible to statistically test for the existence of a local cell-to-cell relationship from a single replicate? How can one assess the existence and detect the heterogeneity of a local phenotype across one tissue sample? All these questions can be summarized in one: is the cell organization, observed at a specific location, driven by molecular or mechanical factors, or is it likely to be expected by chance given the distribution of cell shape and size? Being able to systematically answer this question is of growing importance as bridging the gap between profiles of single-cell gene expression and spatial cell relationships and morphology in tissue samples is within reach2). Further details on the design of the parametric distance, the identification of the parameters by fitting to a cell contour, and the reconstruction of the tessellation are provided in the sections below. We also describe how the content of each cell can be preserved in the synthetic images and how the statistical significance of any observed cell pattern can be obtained using SET.

Design of a flexible parametric distance

Let y be a bivariate random vector following a centered, standardized, uncorrelated (not necessarily normal) joint distribution such that E(y) = 0 and cov(y) = I, S a diagonal scaling matrix, R a rotation matrix, and μ a translation vector. Scaling, rotating, and translating y yields x = RSy + μ. Conversely, y can be retrieved from x by inverting the transformation: y = (RS)−1(x − μ). It is then straightforward to show that E(x) = μ and cov(x) = RSSR′, and that the Euclidean distance between the origin and y equals the Mahalanobis distance (parameterized by μ, R, and S) of x, the scaled, rotated, and translated version of y:

$$d_{L_2}(0,{\mathbf{y}}) = \sqrt {{\mathbf{y}}^{\prime} {\mathbf{y}}} = \sqrt {({\mathbf{x}} - {\mathbf{\mu }})^{\prime} ({\mathbf{RSSR}}^{\prime} )^{ - 1}({\mathbf{x}} - {\mathbf{\mu }})} = d_{M({\mathbf{\mu }},{\mathbf{R}},{\mathbf{S}})}({\mathbf{x}}).$$
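The identity above can be checked numerically. The following is a minimal sketch, not the code used in this work: the angle, scales, translation, and random seed are arbitrary illustrative values, and the helper `rotation` is a hypothetical name that simply follows the sign convention of R used in this section.

```python
import numpy as np

def rotation(alpha):
    """Rotation matrix R with the sign convention used in this section."""
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[c, s], [-s, c]])

rng = np.random.default_rng(0)       # arbitrary seed
R = rotation(0.7)                    # arbitrary angle
S = np.diag([3.0, 1.5])              # arbitrary scales
mu = np.array([10.0, -4.0])          # arbitrary translation

y = rng.standard_normal((5, 2))      # each row is one sample vector y
x = y @ (R @ S).T + mu               # x = R S y + mu, applied row-wise

# Euclidean norm of y versus the quadratic-form Mahalanobis distance of x
d_euclid = np.sqrt(np.sum(y ** 2, axis=1))
cov_inv = np.linalg.inv(R @ S @ S @ R.T)
diff = x - mu
d_mahal = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

identity_holds = bool(np.allclose(d_euclid, d_mahal))
```

Under these assumptions, `identity_holds` evaluates to true because RSSR′ = (RS)(RS)′ when S is diagonal.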

Independently, the Euclidean distance dL2 can be rewritten in the following uncommon way:

$$d_{L_2}(0,{\mathbf{y}}) = \sqrt {{\boldsymbol{y}}_1^2 + {\boldsymbol{y}}_2^2} = (1^{\prime} {\mathbf{y}}^{ \circ 2})^{\frac{1}{2}},$$

where the symbol ◦ means that the exponent 2 is applied to the vector y elementwise. By simply replacing y, the Mahalanobis distance dM can then also be rewritten this way:

$$d_{M({\mathbf{\mu }},{\mathbf{R}},{\mathbf{S}})}({\mathbf{x}}) = d_{L_2}(0,{\mathbf{y}}) = \left( {1^{\prime} \left| {({\mathbf{RS}})^{ - 1}({\mathbf{x}} - {\mathbf{\mu }})} \right|^{ \circ 2}} \right)^{\frac{1}{2}}.$$

Unlike the usual quadratic form of the Mahalanobis distance shown earlier, this form has the compelling advantage of allowing the Mahalanobis and Minkowski distances (such as Euclidean, Manhattan, and Chebyshev) to be generalized under a single parametric function, which we name the Minkovski Affine Transform (MAT) distance, by introducing a parameter p that denotes the Minkovski order:

$$d_{{\mathrm{{MAT}}}({\mathbf{\mu }},{\mathbf{R}},{\mathbf{S}},p)}({\mathbf{x}}) = d_{L_p}(0,{\mathbf{y}}) = \left( {1^{\prime} \left| {({\mathbf{RS}})^{ - 1}({\mathbf{x}} - {\mathbf{\mu }})} \right|^{ \circ p}} \right)^{\frac{1}{p}}.$$

Note that if p ≥ 1, dMAT is a metric; in particular, if p = 2, dMAT is the Mahalanobis distance. If p < 1, the triangle inequality is lost and dMAT is only a semi-metric. This formulation offers the possibility to design flexible distance functions based on the affine transformation of any Minkovski metric. The relationship between the original Minkovski metrics and their transformation to a parameterized MAT distance function is illustrated in Supplementary Fig. 11.
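As an illustration, the MAT distance can be written in a few lines of NumPy. This is a sketch under our own naming: the function `mat_distance` and its argument order are assumptions made for the example, not the interface of the released code.

```python
import numpy as np

def mat_distance(x, mu, alpha, s1, s2, p):
    """MAT distance of each row of x, with y = (RS)^-1 (x - mu)."""
    c, s = np.cos(alpha), np.sin(alpha)
    R = np.array([[c, s], [-s, c]])
    S = np.diag([s1, s2])
    # Solve (RS) y = (x - mu) instead of forming the inverse explicitly
    y = np.linalg.solve(R @ S, (np.atleast_2d(x) - mu).T).T
    return np.sum(np.abs(y) ** p, axis=1) ** (1.0 / p)

mu = np.array([0.0, 0.0])
point = np.array([[2.0, 1.0]])       # maps to y = (1, 1) for s1 = 2, s2 = 1
d_manhattan_like = mat_distance(point, mu, 0.0, 2.0, 1.0, p=1.0)
d_mahalanobis = mat_distance(point, mu, 0.0, 2.0, 1.0, p=2.0)
```

With α = 0, s1 = 2, and s2 = 1, the point (2, 1) maps to y = (1, 1), so the two calls differ only through the order p.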

The level sets of MAT are superellipses, which are more flexible than the standard ellipses provided by the Mahalanobis distance. They offer the possibility of modeling rounded rectangular shapes, such as some plant cells, or diamond-like cells. However, they are all symmetric about their center and about the two principal axes. In order to obtain a distance function whose level sets can match asymmetric cell shapes, we generalized the MAT distance further by introducing two asymmetry terms a1 and a2. These terms weight how much the value along one axis influences the value along the other axis and vice versa, considering the not yet rotated and scaled vector y. The Asymmetric Minkovski Affine Transform (AMAT) distance dAMAT we propose reads:

$$d_{{\mathrm{{AMAT}}}({\mathbf{\mu }},{\mathbf{R}},{\mathbf{S}},{\mathbf{A}},p)}({\mathbf{x}}) = \left( {1^{\prime} e^{ - {\mathbf{AJ}}{\mathrm{{diag}}}({\mathbf{y}}){\mathbf{J}}}\left| {\mathbf{y}} \right|^{ \circ p}} \right)^{\frac{1}{p}}$$

with

$${\mathbf{A}} = \left[ {\begin{array}{*{20}{c}} {{\mathrm{a}}_1} & 0 \\ 0 & {{\mathrm{a}}_2} \end{array}} \right]\quad {\mathbf{J}} = \left[ {\begin{array}{*{20}{c}} 0 & 1 \\ 1 & 0 \end{array}} \right]\quad {\mathrm{and}}\quad {\mathbf{AJ}}{\mathrm{{diag}}}\left( {\mathbf{y}} \right){\mathbf{J}} = \left[ {\begin{array}{*{20}{c}} {{\mathrm{a}}_1{\mathrm{y}}_2} & 0 \\ 0 & {{\mathrm{a}}_2{\mathrm{y}}_1} \end{array}} \right]$$

For the sake of clarity, we recall here that

$${\mathbf{\mu }} = \left[ {\begin{array}{*{20}{c}} {\mu _1} \\ {\mu _2} \end{array}} \right]\quad {\mathbf{S}} = \left[ {\begin{array}{*{20}{c}} {{\mathrm{s}}_1} & 0 \\ 0 & {{\mathrm{s}}_2} \end{array}} \right]\quad {\mathbf{R}} = \left[ {\begin{array}{*{20}{c}} {{\mathrm{cos}}(\alpha )} & {{\mathrm{sin}}(\alpha )} \\ { - {\mathrm{sin}}(\alpha )} & {{\mathrm{cos}}(\alpha )} \end{array}} \right]\quad {\mathrm{and}}\quad {\mathbf{y}} = \left( {{\mathbf{RS}}} \right)^{ - 1}({\mathbf{x}} - {\mathbf{\mu }}).$$

Altogether, the AMAT distance comprises eight parameters: μ1 and μ2, the coordinates of the cell; s1, the length of the longest axis containing μ; s2, the length of the shortest orthogonal axis containing μ; α, the angle of the longest axis containing μ with the x-axis; a1, the degree of asymmetry about the longest axis; a2, the degree of asymmetry about the shortest orthogonal axis; and p, the Minkovski order. This function offers a distance map whose short-range level sets can model a large panel of closed shapes such as those cells can display (Supplementary Fig. 12). Note that, unlike the MAT distance, some combinations of parameters a1, a2, and p may in theory lead to an AMAT map that contains critical points at places other than the origin, possibly leading to non-closed or disconnected level sets at long range. In short, AMAT is not guaranteed to be a distance for all combinations of parameters. However, we will see that such a situation can easily be handled, as for our modeling purposes we are only interested in short distances defined locally about the cell membrane, that is, at about distance 1 from the origin, where AMAT behaves as expected.
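Because AJdiag(y)J is diagonal, its matrix exponential reduces to elementwise exponentials, and the AMAT distance expands to (exp(−a1 y2)|y1|^p + exp(−a2 y1)|y2|^p)^(1/p). The sketch below implements this expanded form; the function name `amat_distance` and its signature are our own illustrative choices, not those of the released code.

```python
import numpy as np

def amat_distance(x, mu, alpha, s1, s2, a1, a2, p):
    """AMAT distance of each row of x (expanded form of the matrix expression)."""
    c, s = np.cos(alpha), np.sin(alpha)
    R = np.array([[c, s], [-s, c]])
    S = np.diag([s1, s2])
    y = np.linalg.solve(R @ S, (np.atleast_2d(x) - mu).T).T
    w1 = np.exp(-a1 * y[:, 1])       # weight on axis 1 depends on y2
    w2 = np.exp(-a2 * y[:, 0])       # weight on axis 2 depends on y1
    return (w1 * np.abs(y[:, 0]) ** p + w2 * np.abs(y[:, 1]) ** p) ** (1.0 / p)

mu = np.zeros(2)
upper = np.array([[1.0, 1.0]])
lower = np.array([[1.0, -1.0]])
# With a1 = a2 = 0 the two mirrored points are equidistant (MAT case)
d_sym = amat_distance(np.vstack([upper, lower]), mu, 0.0, 1.0, 1.0, 0.0, 0.0, 2.0)
# With a1 = 1 the mirror symmetry about the first axis is broken
d_asym = amat_distance(np.vstack([upper, lower]), mu, 0.0, 1.0, 1.0, 1.0, 0.0, 2.0)
```

Setting a1 = a2 = 0 recovers the MAT distance, whereas a nonzero a1 makes the points (1, 1) and (1, −1) lie at different distances from the origin.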

Fitting dAMAT = 1 to a cell contour

The segmented contour of each cell is subsampled to an arbitrary resolution of N regularly spaced points xi (typically N = 100). The following sum of squared errors is then minimized for each cell:

$$\mathop {{\min }}\limits_{{\mathbf{\mu }},{\mathbf{R}},{\mathbf{S}},{\mathbf{A}},p} \mathop {\sum }\limits_{i = 1}^N \left( {d_{{\mathrm{{AMAT}}}\left( {{\mathbf{\mu }},{\mathbf{R}},{\mathbf{S}},{\mathbf{A}},p} \right)}\left( {{\mathbf{x}}_{\boldsymbol{i}}} \right) - 1} \right)^2.$$

Note that if a1 = 0, a2 = 0, and p = 2 are fixed, then no minimization is needed: the AMAT distance is the Mahalanobis distance, the location is the centroid of the cell, and the scale and rotation parameters can be obtained by diagonalization of the covariance matrix of the pixels of the cell. If any of a1, a2, or p is left free to evolve, then the minimization is needed for all parameters and these values are instead used for initialization. The eight parameters of the AMAT distance are then initialized to the centroid of the cell for μ1 and μ2, the lengths of the principal axes of the cell for s1 and s2, the angle of the principal axis with the x-axis for α, a1 = 0, a2 = 0, and p = 2. The parameter p enables modeling of squarish cells, the parameters a1 and a2 enable modeling of triangular or egg-like cells, and most importantly, combinations of all eight parameters enable a large set of complex cell shapes to be modeled. Whether or not some known parameters are arbitrarily fixed, after this fitting each cell is represented by a vector of eight parameters that describes a specific parametrization of the AMAT distance function. The level set 1 of this two-dimensional (2D) function then closely matches the contour of the cell (Supplementary Movies 1–5). For numerical optimization, the L-BFGS-B algorithm available in scipy.optimize was used, as it enabled us to make the process more robust by introducing constraints on the range of values the parameters can take.
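The fitting step can be sketched as follows with `scipy.optimize.minimize` and `method="L-BFGS-B"`. Everything specific here is an assumption made for illustration: the synthetic superellipse contour, the parameter bounds, the ordering of the parameter vector, and the initialization from the covariance of the contour points (the text above uses the pixels of the cell). The factor 2 in the initial scales is a heuristic valid for points evenly spread in angle along an ellipse.

```python
import numpy as np
from scipy.optimize import minimize

def amat(points, theta):
    """AMAT distance; theta = (mu1, mu2, s1, s2, alpha, a1, a2, p) (our ordering)."""
    mu1, mu2, s1, s2, alpha, a1, a2, p = theta
    c, s = np.cos(alpha), np.sin(alpha)
    RS = np.array([[c, s], [-s, c]]) @ np.diag([s1, s2])
    y = np.linalg.solve(RS, (points - np.array([mu1, mu2])).T).T
    return (np.exp(-a1 * y[:, 1]) * np.abs(y[:, 0]) ** p
            + np.exp(-a2 * y[:, 0]) * np.abs(y[:, 1]) ** p) ** (1.0 / p)

def loss(theta, contour):
    """Sum of squared errors between the AMAT distance of contour points and 1."""
    return float(np.sum((amat(contour, theta) - 1.0) ** 2))

def initial_guess(contour):
    """Centroid, principal axes, and angle, with a1 = a2 = 0 and p = 2."""
    mu = contour.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(contour, rowvar=False))  # ascending order
    v = evecs[:, 1] if evecs[0, 1] >= 0 else -evecs[:, 1]         # major axis
    s1, s2 = np.sqrt(2.0 * evals[1]), np.sqrt(2.0 * evals[0])     # heuristic scale
    alpha = np.arctan2(-v[1], v[0])                               # matches R's convention
    return np.array([mu[0], mu[1], s1, s2, alpha, 0.0, 0.0, 2.0])

# Synthetic contour (assumption): a rotated, shifted superellipse of order 4
t = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
q = 4.0
y = np.column_stack([np.sign(np.cos(t)) * np.abs(np.cos(t)) ** (2.0 / q),
                     np.sign(np.sin(t)) * np.abs(np.sin(t)) ** (2.0 / q)])
c, s = np.cos(0.4), np.sin(0.4)
contour = y @ (np.array([[c, s], [-s, c]]) @ np.diag([4.0, 2.0])).T + np.array([5.0, 3.0])

theta0 = initial_guess(contour)
bounds = [(None, None), (None, None), (0.5, 20.0), (0.5, 20.0),   # hypothetical bounds
          (-np.pi, np.pi), (-1.0, 1.0), (-1.0, 1.0), (0.5, 10.0)]
fit = minimize(loss, theta0, args=(contour,), method="L-BFGS-B", bounds=bounds)
```

The fitted parameter vector is available as `fit.x` and the residual sum of squared errors as `fit.fun`; the bounds play the role of the range constraints mentioned above.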

Generation of a tessellation from individual cell metrics

While, for instance, five parameters enable us to model an elliptical shape and eight parameters enable us to model triangular or rectangular shapes, this does not mean that the cell shape will end up being reconstructed exactly as an ellipse, a triangle, or a rectangle. In fact, the competition for space between cells, each equipped with its own distance, permits accurate reconstruction of the cell pavement, without any holes. To reconstruct the original image tessellation of K cells, we aim at performing the following minimization:

$$\mathop {{\min }}\limits_{{{C}}_1, \ldots ,{{C}}_K} \mathop {\sum }\limits_{j = 1}^K \mathop {\sum }\limits_{{\mathbf{x}}_i \in C_j} d_{\mathrm{{AMAT}}\left( {{\mathbf{\mu }}_{{j}},{\mathbf{R}}_{{j}},{\mathbf{S}}_{{j}},{\mathbf{A}}_{{j}},p_j} \right)}\left( {{\mathbf{x}}_{{i}}} \right),$$

where Cj denotes the set of pixels xi that belong to cell j, with μj and Rj (the location and orientation parameters) left free to evolve while Sj, Aj, and pj (the shape parameters) are fixed. This problem can be solved using a modified Lloyd algorithm. The Lloyd algorithm is usually employed to obtain a Voronoi tessellation of a 2D or 3D space with the standard Euclidean distance8,36,37. In that case, all distances computed along the process are of the same kind and do not depend on parameters. Here, we also aim at performing a tessellation, but each compartment uses its own parameterized distance function, as described in the previous section, so as to impose the shape of an actual cell. To our knowledge, the Lloyd algorithm with a different metric per cell has not previously been used for modeling cells. Furthermore, the idea of having each of these metrics match the properties of a real cell is, to our knowledge, novel. To reconstruct the original image, the first step is similar to Lloyd and consists of computing the distance of all pixels to all cells (using the dedicated AMAT distances) and labeling each pixel with the label of its closest cell. In the second step, Lloyd was modified such that the location and orientation parameters μ1, μ2, and α of each cell are updated by minimizing the sum of squared errors previously described. Only those three parameters are left free to evolve (except for the incomplete cells at the border of the image, for which all parameters are fixed). The five other parameters s1, s2, a1, a2, and p, describing the cell shape, are estimated once from the original cell segmentations and remain constant for all cells throughout the iterations so as to maintain their shape. These two steps are repeated until no pixel changes label. At the end of this process, a tessellation that is an accurate approximation of the original cell tessellation is obtained with only eight parameters per cell, provided at initialization by fitting (Fig. 1b–d).
To synthesize a random tessellation based on all the cells of a given image, the exact same process is used over the same image dimension but the location and orientation parameters are initialized randomly, still keeping the shape parameters of all cells constant. This process, applied to the image in Supplementary Fig. 1 to produce its reconstruction by SET and three random SET, can be visualized in Supplementary Movie 6. Figure 2 shows that random SET preserves single-cell properties of various tissues while, as expected, breaking cell relationships.
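The two-step iteration can be sketched in a deliberately simplified form. The sketch below makes several assumptions that depart from the procedure described above: each cell uses a Mahalanobis (five-parameter) distance instead of AMAT, the update step only moves each location to the centroid of its pixels rather than refitting μ and α, and border cells are not treated separately. The function names, image size, and cell shapes are hypothetical.

```python
import numpy as np

def assign_labels(pixels, mus, inv_covs):
    """Step 1: label each pixel with the cell whose own metric puts it closest."""
    diff = pixels[None, :, :] - mus[:, None, :]                 # shape (K, P, 2)
    d2 = np.einsum("kpi,kij,kpj->kp", diff, inv_covs, diff)     # squared distances
    return np.argmin(d2, axis=0)

def lloyd_like(shape, mus, inv_covs, max_iter=100):
    """Alternate labeling and location updates until no pixel changes label."""
    h, w = shape
    rows, cols = np.mgrid[0:h, 0:w]
    pixels = np.column_stack([cols.ravel(), rows.ravel()]).astype(float)
    mus = np.array(mus, dtype=float)
    labels = assign_labels(pixels, mus, inv_covs)
    for _ in range(max_iter):
        # Step 2 (simplified): move each cell to the centroid of its pixels
        for k in range(len(mus)):
            members = pixels[labels == k]
            if len(members):
                mus[k] = members.mean(axis=0)
        new_labels = assign_labels(pixels, mus, inv_covs)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels.reshape(h, w), mus

# Random tessellation: fixed shapes, random initial locations (illustrative values)
rng = np.random.default_rng(1)
K, shape = 5, (30, 40)
inv_covs = np.array([np.diag([1.0 / a ** 2, 1.0 / b ** 2])
                     for a, b in rng.uniform(3.0, 8.0, size=(K, 2))])
start = np.column_stack([rng.uniform(0, shape[1], K), rng.uniform(0, shape[0], K)])
labels, final_mus = lloyd_like(shape, start, inv_covs)
```

Because every pixel is assigned to its closest cell, the resulting label image has no holes by construction; shape matrices stay fixed while locations evolve.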

Null distribution and associated p value

It is important to notice that the reconstruction by SET of the original image could possibly be one sample of the random SET, as the construction process is exactly the same. Only the initialization of the positional parameters (location and orientation) deliberately differs: for the reconstruction, these parameters are the original ones, while in synthetic images they are randomly sampled. This is the foundation of the statistical approach we present: a thousand pictures representing alternative random tessellations of the real image are generated and compared to a reconstruction of that real image obtained using the same process. The statistical significance of any quantitative feature computed from a local group of cells can then be obtained in the following way. The considered feature is computed on each random tessellation. Altogether, the sample distribution of these values approximates the null distribution of that feature. Then, the same feature is also computed on the reconstruction of the original image. If the value computed from the reconstruction falls within the null distribution, then by definition the null hypothesis cannot be rejected. If the value obtained lies outside the null distribution, then a p value can directly be obtained as the fraction of random tessellations that display the same or a more extreme value than the value computed from the reconstruction of the original image. Note that if the computed feature is a sum or the mean of independent and identically distributed events over the image, as for Fig. 5e, the null distribution can be approximated by a Gaussian under the central limit theorem (CLT). The latter approach combines the advantages of obtaining a more precise p value while necessitating in principle the generation of only one random SET.
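Both ways of obtaining a p value can be sketched as follows. The function names are hypothetical, the tests are one-sided by assumption, and the Gaussian variant reflects our reading of the CLT argument: the feature is a mean over cells, so its null distribution is approximated from the per-cell values of a single random SET.

```python
import numpy as np
from scipy.stats import norm

def empirical_p_value(null_values, observed, tail="greater"):
    """Fraction of random tessellations with the same or a more extreme value."""
    null_values = np.asarray(null_values, dtype=float)
    if tail == "greater":
        return float(np.mean(null_values >= observed))
    return float(np.mean(null_values <= observed))

def clt_p_value(per_cell_null, observed_mean):
    """One-sided Gaussian p value for a mean of i.i.d. per-cell events."""
    per_cell_null = np.asarray(per_cell_null, dtype=float)
    std_err = per_cell_null.std(ddof=1) / np.sqrt(len(per_cell_null))
    z = (observed_mean - per_cell_null.mean()) / std_err
    return float(norm.sf(z))         # upper-tail probability

# Toy values, for illustration only
p_emp = empirical_p_value([1, 2, 3, 4], observed=3)
p_clt = clt_p_value([0, 1, 0, 1], observed_mean=0.5)
```

With a thousand random tessellations, the empirical p value has a resolution of 1/1000, which is one reason the Gaussian approximation can be more precise when its assumptions hold.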

Cell texture mapping

Independent of the reconstruction of the cell tessellation, we additionally transport the texture content of each cell so as to enable the possible statistical analysis of organelle positioning within the context of its cell neighborhood. To this aim we used a particular weighting of barycentric coordinates called the mean value coordinates, developed by Michael S. Floater38. The mean value coordinate method offers a way to smoothly morph the content of an arbitrary polygon to the content of another arbitrary polygon with the same number of vertices. As the synthetic cell contour is about the same shape and size as the original cell contour, we do not expect significant distortion of the content if the orientation of the reference coordinates is similar. Therefore, the segmented contour of each cell and its synthetic counterpart were respectively subsampled to an arbitrary resolution of n ordered points pi and pi′ (typically 100). The first points p0 and p0′ of both contours correspond respectively to the orientation of their major axis so as to align the two shapes. For each pixel of the synthetic shape we then computed the n mean value coordinates relative to the n points of the contour pi′ and applied the same n weights to the n points of the contour pi to compute a floating point location in the original cell image. A bilinear interpolation of the four closest pixels from that location enabled recovery of a color value that was then used in the synthetic cell (Supplementary Fig. 2). Using this approach, all pixel values of all synthetic cells could be recovered (Figs. 2a, 5b and Supplementary Movie 6).
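The core of the texture transport is the computation of mean value coordinates. The sketch below follows the standard formula, in which the weight of vertex i is (tan(α_{i−1}/2) + tan(α_i/2)) / ‖p_i − v‖, with α_i the signed angle at v between consecutive vertices. The function name and the toy polygons are our own, and the point v is assumed to lie strictly inside the polygon, away from vertices and edges.

```python
import numpy as np

def mean_value_coordinates(v, polygon):
    """Mean value coordinates of point v with respect to ordered polygon vertices."""
    d = polygon - v                                   # vectors from v to each vertex
    r = np.linalg.norm(d, axis=1)
    d_next = np.roll(d, -1, axis=0)                   # vectors to the next vertex
    cross = d[:, 0] * d_next[:, 1] - d[:, 1] * d_next[:, 0]
    dot = np.sum(d * d_next, axis=1)
    half_tan = np.tan(np.arctan2(cross, dot) / 2.0)   # tan(alpha_i / 2)
    weights = (np.roll(half_tan, 1) + half_tan) / r   # uses alpha_{i-1} and alpha_i
    return weights / weights.sum()

# Synthetic contour (square) and its original counterpart (scaled, shifted square)
synthetic = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
original = 2.0 * synthetic + np.array([5.0, 7.0])

pixel = np.array([0.3, 0.6])                          # a pixel inside the synthetic cell
lam = mean_value_coordinates(pixel, synthetic)
source_location = lam @ original                      # floating point location to sample
```

The same weights applied to the original contour give the floating point location at which the original image would be interpolated. For the affine pair of polygons used here, that location is simply the affine image of the pixel.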

Alternative approaches

Cell compartments constrain the cell centers to be spread apart from one another, such that they do not behave as freely as a point process approach would essentially model. Therefore, regular point process statistics would hardly be relevant for this type of spatial analysis. We thus chose to compare our method to two other approaches that could be considered for such an analysis (Fig. 3). These two other approaches, like ours, seek to compare the image observation to a null distribution that should capture the variation about the null hypothesis stating that cells are organized randomly. The difference between the three methods essentially lies in how that null distribution is built by computational means and how relevant it is.

Alternative approach—Shuffle on a hexagonal grid

The first approach (red distribution Fig. 3) uses a honeycomb grid containing as many hexagonal cells as in the original image39,40. For each run, cell identities were assigned randomly with respect to the observed cell type ratio (83 stem cells from a total cell count of 190 cells for Fig. 3) and the number of contacts between two stem cells was retrieved (Supplementary Fig. 6A).

Alternative approach—Shuffle on the segmentation

The second approach (gray distribution Fig. 3) uses the segmentation of the original cell pattern, and cell identities are shuffled in order to preserve the distribution of the cell shapes while producing a realistic graph of cell adjacency (Supplementary Fig. 6B). In practice, this model produces a null distribution that is close to the one obtained with the honeycomb method (Fig. 3c) and leads to similar conclusions.
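The identity-shuffling baseline can be sketched as follows for a label image. The 4-connected definition of a contact, the function names, and the toy image are assumptions made for the example; the same counting routine would apply to a honeycomb grid once its adjacency pairs are listed.

```python
import numpy as np

def adjacency_pairs(labels):
    """Unordered pairs of cell labels that touch horizontally or vertically."""
    pairs = set()
    for a, b in ((labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])):
        touching = a != b
        lo = np.minimum(a[touching], b[touching])
        hi = np.maximum(a[touching], b[touching])
        pairs.update(zip(lo.tolist(), hi.tolist()))
    return pairs

def shuffled_contact_counts(pairs, is_stem, n_runs, rng):
    """Count stem-stem contacts after shuffling identities, keeping the type ratio."""
    cells = sorted(is_stem)
    index = {cell: k for k, cell in enumerate(cells)}
    flags = np.array([is_stem[cell] for cell in cells], dtype=bool)
    edges = np.array([[index[a], index[b]] for a, b in sorted(pairs)])
    counts = np.empty(n_runs, dtype=int)
    for run in range(n_runs):
        shuffled = rng.permutation(flags)             # same number of stem cells
        counts[run] = np.sum(shuffled[edges[:, 0]] & shuffled[edges[:, 1]])
    return counts

# Toy segmentation with four cells, two of which are stem cells
labels = np.array([[1, 1, 2, 2],
                   [1, 1, 2, 2],
                   [3, 3, 4, 4],
                   [3, 3, 4, 4]])
pairs = adjacency_pairs(labels)
counts = shuffled_contact_counts(pairs, {1: True, 2: True, 3: False, 4: False},
                                 n_runs=200, rng=np.random.default_rng(0))
```

The array `counts` is then a sample of the null distribution of stem-stem contacts under this shuffling scheme.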

Raw image information—Mice

Ependymal images were acquired from E18, P1, and P30 mice. The experiments were performed in conformity with French and European Union regulations and the recommendations of the local ethics committee (Comité d’éthique en experimentation animale no. 005). The date of the vaginal plug was recorded as embryonic day (E) 0.5 and the date of birth as postnatal day (P) 0. Healthy, immunocompetent animals were kept in a 12 h light/12 h dark cycle at 22 °C and fed ad libitum. The mice used in this study include OF1 (Charles River Laboratories) and Centrin2-GFP (CB6-Tg(CAG-EGFP/CETN2)3-4Jgg/J; The Jackson Laboratory).

Raw image information—Immunostainings

Wholemounts of the lateral walls of the lateral LV were dissected27 from animals sacrificed by cervical dislocation and fixed for 15 min in pure methanol at −20 °C. The samples were incubated for 1 h in blocking solution (1× PBS with 0.1% Triton X-100 and 10% fetal bovine serum) at room temperature followed by overnight incubation at 4 °C in the primary antibodies diluted in blocking solution. The primary antibodies used targeted ZO1 (1:100, cell junction marker; Thermo Fisher Scientific), FOP (1:600, centriole marker; Abnova Corporation), Sas6 (1:500, pro-centriole marker; Santa Cruz), and -Catenin (1:500, cell junction marker; Millipore). The following day, the samples were stained with species-specific AlexaFluor fluorophore-conjugated secondary antibodies (1:400, Thermo Fisher Scientific or Jackson ImmunoResearch Labs). Nuclei were counterstained with a 1:1500 Hoechst solution (from a 20 mg/ml stock, Sigma-Aldrich), containing the secondary antibodies, for 2 h at room temperature.

Finally, the wholemounts were redissected to keep only the thin lateral walls of the LV20 which were mounted with Fluoromount-G mounting medium (Southern Biotech, 0100-01).

Raw image information—Others

The root image is a Col-0 Arabidopsis thaliana sample that was stained with propidium iodide to label cell walls41 and imaged with a Zeiss 710 confocal. The root image is a slice from a 3D stack. The shoot apical meristem image is an FM4-64 staining of a Col-0 Arabidopsis thaliana and was acquired with a Leica SP2 confocal as described in ref. 42. The 3D stack was flattened with merryproj43. The Drosophila image is originally from ref. 44. The membranes are visualized with antibodies against E-Cadherin. The image of chick basilar papilla is originally from ref. 45 and was recently used in ref. 46. The samples were treated with anti-cingulin and anti-hair cell antigen to visualize membrane junctions and cell identity. The Xenopus epidermis image was acquired from a stage 33 larva. The visualization of membranes and cell identities was made possible by phalloidin labeling of actin, acetylated alpha-tubulin-488 for cilia, and lectin-PNA-594 to label mucins in goblet cells and SSCs47. The image was extracted from a 3D stack using the SME algorithm48.

Cell segmentation

The minimal input to the SET model is an image of segmented cells in which each pixel takes as its value the integer label shared by all the pixels of the same cell. All 2D cell segmentations presented in this manuscript were performed using a modified version of the “Morphological Segmentation” plugin of the MorphoLibJ package of ImageJ/Fiji49. However, this preprocessing step can be performed with the numerous other software packages that exist to segment images of cells. In practice, as only one image is needed, full automation of the detection process is not required and the segmentation can be manually corrected if necessary. Prior to segmentation, 2D images of Xenopus larva epidermis and mouse ependyma were extracted from 3D stacks using the SME algorithm48.
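Starting from such a label image, the five-parameter description of a cell (location, two scales, and angle) can be sketched by diagonalizing the covariance matrix of its pixels. The function name and the toy image are hypothetical. The returned scales are standard deviations along the principal axes; for a uniformly filled ellipse the semi-axis is twice the standard deviation, and how the scales are normalized in practice is an assumption left open here.

```python
import numpy as np

def five_parameters(labels, cell_id):
    """Centroid, principal scales, and angle of one cell from its pixel covariance."""
    rows, cols = np.nonzero(labels == cell_id)
    pts = np.column_stack([cols, rows]).astype(float)   # (x, y) pixel coordinates
    mu = pts.mean(axis=0)
    cov = np.cov(pts, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)                  # eigenvalues in ascending order
    s2, s1 = np.sqrt(evals)                             # s1 is the longest axis
    v = evecs[:, 1] if evecs[0, 1] >= 0 else -evecs[:, 1]
    alpha = np.arctan2(-v[1], v[0])                     # matches R = [[c, s], [-s, c]]
    return mu, s1, s2, alpha, cov

# Toy label image: background 0 and one rectangular cell labeled 1
labels = np.zeros((7, 10), dtype=int)
labels[2:5, 1:9] = 1
mu, s1, s2, alpha, cov = five_parameters(labels, 1)

# The parameters reproduce the covariance as R S S R'
c, s = np.cos(alpha), np.sin(alpha)
R = np.array([[c, s], [-s, c]])
cov_rebuilt = R @ np.diag([s1 ** 2, s2 ** 2]) @ R.T
```

Since a common scale factor applied to all cells does not change which cell is closest to a pixel, the choice of normalization mainly matters when the level set 1 is meant to coincide with the cell contour.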

Computational resources

Lloyd relaxation with hundreds to thousands of cells, each equipped with its own metric, to iteratively redistribute labels over millions of pixels can be a demanding process. Our approach is faster when using only five parameters per cell (xy position, rotation angle, and main axes lengths), as only the covariance matrix of the pixels of each cell needs to be computed and the Mahalanobis distance to each cell equipped with its own matrix can be used. The computation of the latter is made very efficient by the cdist function from the scipy Python package (see the code for implementation details). Therefore, when the cells could correctly reproduce the observed image using five parameters (e.g. E18 and adult ependymal cells), we chose this option. In this case, the approximate computation time was between 1 h 30 min and 10 h for 1000 SET simulations on 200 Intel Xeon 2400 MHz CPUs, depending on the image size. When cells were so highly asymmetric that five parameters did not properly reproduce the observation (e.g. Xenopus), we used eight parameters, and it took between 9 and 16 days to compute about 300 simulations on the same computer configuration. All calculations were submitted in parallel on the IBENS computing cluster. Note that the code made available offers the possibility to parallelize computation on all CPUs of a single computer. Furthermore, we anticipate that this type of computation would benefit significantly from being ported to GPU computing, as it can be highly parallelized per cell; however, we have not investigated this possibility.
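The five-parameter labeling step can be sketched with `scipy.spatial.distance.cdist`, whose `mahalanobis` metric accepts the inverse covariance matrix through the `VI` argument. The wrapper `label_pixels` and the toy values are our own; the released code may organize this computation differently.

```python
import numpy as np
from scipy.spatial.distance import cdist

def label_pixels(pixels, mus, covs):
    """Label each pixel with its closest cell under that cell's own Mahalanobis metric."""
    dists = np.column_stack([
        cdist(pixels, mu[None, :], metric="mahalanobis", VI=np.linalg.inv(cov))[:, 0]
        for mu, cov in zip(mus, covs)
    ])                                                  # shape (n_pixels, n_cells)
    return np.argmin(dists, axis=1), dists

# Two toy cells: a small one at the origin and a larger one at (10, 0)
pixels = np.array([[0.0, 0.0], [10.0, 0.0], [4.0, 0.0]])
mus = np.array([[0.0, 0.0], [10.0, 0.0]])
covs = np.array([np.eye(2), 4.0 * np.eye(2)])
labels, dists = label_pixels(pixels, mus, covs)
```

In this toy case the pixel at (4, 0) is geometrically closer to the first cell but is assigned to the second, larger one (distance 3 versus 4), which illustrates how per-cell metrics let cells compete for space according to their size.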

Ethical aspect

The experiments using mice were performed in conformity with French and European Union regulations and the recommendations of the local ethics committee (Comité d’éthique en experimentation animale no. 005). Mice were bred and maintained in the animal facility of IBENS (Agreement 5502 from the French Ministry of Research and Agreement OGM2014 from the Préfecture de Paris-French ministry of interior). The minimal number of animals was used for the project and the procedures implemented ensured their welfare during their lives.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.