5.1 Introduction

Data in the form of counts arise regularly in studies that investigate the number of occurrences of an event: the number of insects, birds, or weeds in agricultural or agroecological studies; the number of plants transformed or regenerated using modern breeding techniques; the number of individuals with a certain disease in a medical study; or the number of defective products in a quality improvement study, among others. Counts may be recorded per unit of time, area, or volume. When a generalized linear model (GLM) with a Poisson distribution is fitted to such data, it is often found that there is excess dispersion (extra variation) not captured by the Poisson model. In these cases, the data can be modeled with a negative binomial distribution, which has the same mean as the Poisson distribution but a variance greater than the mean. Most experiments also have some form of structure due to the experimental design (completely randomized design (CRD), randomized complete block design (RCBD), incomplete block, or split-plot design) or the sampling design, and this structure must be incorporated into the linear predictor to adequately model the data.

5.2 The Poisson Model

A Poisson distribution with parameter λ belongs to the exponential family and is a discrete random variable, whose probability function is equal to

$$ f(y)=\frac{e^{-\lambda }{\lambda}^y}{y!};\lambda >0,y=0,1,2,\cdots . $$

The mean and variance of a Poisson random variable are equal, i.e., E(y) = Var(y) = λ. A Poisson distribution is often used to model responses that are “counts.” As λ increases, the Poisson distribution becomes more symmetric and eventually it can be reasonably approximated by a normal distribution.
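To make these properties concrete, the following stdlib-only Python sketch (with an illustrative value of λ, unrelated to any example in this chapter) evaluates the Poisson probability function over a truncated support and recovers E(y) = Var(y) = λ numerically:

```python
import math

def poisson_pmf(y, lam):
    """P(Y = y) for a Poisson random variable with rate lam."""
    return math.exp(-lam) * lam ** y / math.factorial(y)

lam = 4.0
support = range(60)  # the probability mass beyond y = 59 is negligible here
probs = [poisson_pmf(y, lam) for y in support]

total = sum(probs)                                  # ~1.0
mean = sum(y * p for y, p in zip(support, probs))   # E(y)  = lam
var = sum((y - mean) ** 2 * p
          for y, p in zip(support, probs))          # Var(y) = lam
```

Both moments recover λ, which is the defining mean–variance relationship exploited (and sometimes violated) throughout this chapter.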

Let yij be the value of the count variable associated with unit i at level one and with unit j at level two, given a set of explanatory variables. Therefore, we can express this as

$$ f\left({y}_{ij}\right)=\frac{e^{-{\lambda}_{ij}}{\lambda}_{ij}^{y_{ij}}}{y_{ij}!},{y}_{ij}=0,1,2,\cdots $$

and the logarithm of the likelihood is given by:

$$ \log f\left({y}_{ij}\right)=\log \left(\frac{e^{-{\lambda}_{ij}}{\lambda}_{ij}^{y_{ij}}}{y_{ij}!}\right)=-{\lambda}_{ij}+{y}_{ij}\log \left({\lambda}_{ij}\right)-\log \left({y}_{ij}!\right). $$
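As a quick check of this identity, the snippet below (plain Python, arbitrary illustrative values of λ and y) evaluates both sides, using `math.lgamma(y + 1)` for log(y!):

```python
import math

lam, y = 3.5, 6  # arbitrary illustrative values

# Logarithm of the Poisson probability function taken directly
log_pmf = math.log(math.exp(-lam) * lam ** y / math.factorial(y))

# Expanded form from the text, with lgamma(y + 1) playing the role of log(y!)
log_lik = -lam + y * math.log(lam) - math.lgamma(y + 1)
```

The two quantities agree to floating-point precision for any valid pair (λ, y).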

A Poisson distribution has very particular mathematical properties that are used when we model “counts.” For example, the expected value of y is equal to the variance of y, such that

$$ E\left({y}_{ij}\right)=\mathrm{Var}\left({y}_{ij}\right)={\lambda}_{ij}. $$

Thus, λij is necessarily a nonnegative number, which could lead to difficulties if we were to use the identity link function in this context. The natural logarithm is therefore the link function most commonly used for expected “counts.” For single-factor data, a Poisson regression model is used, in which we work with the natural logarithm of the mean count, log(λi), whereas for multilevel data (more than two factors), mixed models with a Poisson response, again modeling the logarithm of λij, are a better choice.

Suppose that, given the random effects bj, the counts y1, y2, ⋯, yn are conditionally independent with yij ∣ bj~Poisson(λij), where

$$ \log \left({\lambda}_{ij}\right)=\eta +{\tau}_i+{b}_j. $$

This is a special case of a generalized linear mixed model (GLMM) in which the link function of this family of distributions is g(λij) =  log (λij). The dispersion parameter ϕ, in this case, is equal to 1.

Sometimes, when the counts are extremely large, their distribution can be approximated by a continuous distribution. If all the counts are reasonably large, taking the square root of the counts stabilizes the variance and makes a normal-theory model viable. However, as mentioned in previous chapters, estimation under normality can be problematic, as it can produce negative fitted values and predictions, which is illogical for counts.
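The variance-stabilizing effect of the square root can be checked by simulation. The sketch below (plain Python, with a simple Poisson sampler based on Knuth's method; the values of λ are illustrative) shows that Var(√y) stays near 1/4 regardless of λ:

```python
import math
import random

random.seed(1)

def rpoisson(lam):
    """Poisson sampler via Knuth's method (fine for moderate lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Var(sqrt(y)) is roughly 1/4 whatever the value of lambda, which is why
# the square root stabilizes the variance of large counts
variances = {}
for lam in (10, 25, 50):
    roots = [math.sqrt(rpoisson(lam)) for _ in range(20000)]
    variances[lam] = variance(roots)  # each value is close to 0.25
```

By the delta method, Var(√y) ≈ (1/(2√λ))² · λ = 1/4, independent of λ, which the simulated values confirm.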

5.2.1 CRD with a Poisson Response

A CRD is a design in which each of t treatments is randomly assigned to r experimental units. The linear predictor describing the mean structure of this GLM is

$$ {\eta}_{ij}=\eta +{\tau}_i $$

where ηij denotes the ijth link function value for the ith treatment in the jth observation, η is the intercept, and τi is the fixed effect due to treatment i (i = 1, 2, ⋯, t; j = 1, 2, ⋯, ri), with t treatments and ri replicates of each treatment i.

Example

Effect of a subculture on the number of shoots during micropropagation of sugarcane.

The objective of micropropagation in sugarcane is to produce vegetative material identical to the donor so that its genetic integrity is preserved. Despite this, somaclonal variation has been observed in plants derived from in vitro culture regardless of explant, variety, ploidy level, number of subcultures, and generation route used, among others. A total of 8 explants were planted in temporary immersion bioreactors (explant/bioreactor) to determine whether the number of subcultures (10 subcultures) influences the number of shoots observed per explant. In this example, we have ri observations (j = 1, 2, …, ri) on each of the 10 subcultures (i = 1, 2, …, 10) in a completely randomized design (Appendix 1: Data: Subcultures). The analysis of variance (ANOVA) table (Table 5.1) for this model is given below:

Table 5.1 Analysis of variance

The components of the GLM are set out below:

$$ \mathrm{Distribution}:{y}_{ij}\sim \mathrm{Poisson}\left({\lambda}_{ij}\right) $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_{ij}=\eta +{\tau}_i $$
$$ \mathrm{Link}\ \mathrm{function}:\log \left({\lambda}_{ij}\right)={\eta}_{ij} $$

where yij denotes the number of shoots observed on explant j in subculture i (i = 1, 2, ⋯, 10; j = 1, 2, ⋯, 8), ηij is the ijth link function, η is the intercept, and τi is the fixed effect of subculture i.

The following Statistical Analysis Software (SAS) code fits a CRD with a Poisson response.

proc glimmix data=sugar method=laplace;
  class rep1 sub1;
  model nb=sub / dist=poisson s link=log;
  lsmeans sub / lines ilink;
run; quit;

While most of the commands used have been explained before, the options “dist,” “s,” and “link” in the model statement tell SAS the distribution of the data, request the fixed effects solution, and specify the link function, respectively. In addition, the “lines” option asks the GLIMMIX procedure, through the “lsmeans” (least squares means) statement, for mean comparisons, and the “ilink” option applies the inverse link function.

Part of the output is shown in Table 5.2, where part (a) shows the model and the methods used to fit the statistical model, whereas part (b) lists the dimensions of the relevant matrices in the model specification.

Table 5.2 Model information and estimation methods

Due to the absence of random effects in this model, there are no columns in matrix Z. The 11 columns in matrix X comprise an intercept and 10 columns for the effect of subcultures.

The goodness-of-fit statistics of the model are shown in part (a) of Table 5.3. The value of the generalized chi-square statistic over its degrees of freedom (DFs) is less than 1 (Pearson’s chi-square/DF = 0.79). This indicates that there is no overdispersion and that the variability in the data has been adequately modeled with the Poisson distribution.
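The diagnostic used here, Pearson's chi-square divided by its degrees of freedom, is simple to compute by hand. A minimal Python sketch (with invented counts and fitted group means, not the sugarcane data):

```python
def pearson_chisq_ratio(y, fitted, n_params):
    """Pearson chi-square over its degrees of freedom for a Poisson fit.

    Under a Poisson model Var(y) = mean, so values near 1 support the
    assumed mean-variance relation; values well above 1 signal
    overdispersion.
    """
    chisq = sum((yi - mu) ** 2 / mu for yi, mu in zip(y, fitted))
    df = len(y) - n_params
    return chisq / df

# Invented counts for two groups, with fitted values equal to group means
y = [3, 5, 4, 4, 12, 9, 11, 8]
fitted = [4, 4, 4, 4, 10, 10, 10, 10]
ratio = pearson_chisq_ratio(y, fitted, n_params=2)  # 1.5 / 6 = 0.25
```

Each squared residual is scaled by the fitted mean because, under the Poisson assumption, the fitted mean is also the fitted variance.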

Subsection (b) of Table 5.3 shows the maximum likelihood (ML) parameter estimates (“Estimate”), standard errors, and t-tests for the hypotheses about the parameters.

Table 5.3 Fit statistics and estimated parameters

Table 5.4 (part (a)) shows significance tests for the fixed effects in the model (“Type III tests of fixed effects”). These tests are Wald tests, not likelihood ratio tests. The effect of subculture on the number of shoots is highly significant (P < 0.0001), indicating that the 10 subcultures do not produce the same number of shoots; that is, subculture affects the average shoot production of the explant.

The least squares means obtained with “lsmeans” (part (b) in Table 5.4) are the values under the column “Estimate,” which, along with the standard errors, were calculated with the linear predictor \( {\hat{\eta}}_i=\hat{\eta}+{\hat{\tau}}_i \). These estimates are on the model scale, whereas the “Mean” column values and their respective standard errors are on the data scale; the latter were obtained by applying the inverse link to obtain the \( {\hat{\lambda}}_i \) values, i.e., \( {\hat{\lambda}}_i=\exp \left({\hat{\eta}}_i\right) \), with their respective standard errors.
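The two scales can be illustrated with a short stdlib-only Python sketch (the shoot counts below are hypothetical, not the sugarcane data). For a single-factor Poisson GLM with a log link, the ML estimate of a treatment mean on the data scale is simply the sample mean of that group, and the model-scale estimate is its logarithm:

```python
import math

# Hypothetical shoot counts for one treatment group (not the actual data)
counts = [14, 18, 15, 17, 16, 14, 19, 15]

# For a single-factor Poisson GLM with a log link, the ML estimate of a
# group mean on the data scale is the sample mean of that group
lam_hat = sum(counts) / len(counts)   # "Mean" column (data scale)
eta_hat = math.log(lam_hat)           # "Estimate" column (model scale)

# Applying the inverse link, exp(eta_hat), recovers the data-scale mean
recovered = math.exp(eta_hat)
```

This round trip (log to the model scale, exp back to the data scale) is exactly what the “ilink” option automates.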

Table 5.4 Type III tests of fixed effects and least squares means (means)

A comparison of means, using the option “lines,” is presented in Fig. 5.1. In this figure, we can see that the average production is minimal in the first subcultures, increases from subcultures 5 to 8, and, in subculture 9, the average number of shoots per explant begins to decrease.

Fig. 5.1 Average number of shoots per subculture. Bars with different letters are statistically different using α = 0.05

5.2.2 Example 2: CRDs with Poisson Response

Researchers want to determine whether the application of a new growth compound to walnut trees changes the number of nuts produced per tree. The compound was applied at three different times (pre-flowering = 1, flowering = 2, and post-flowering = 3) and in two formulations (A and B), plus a control (C) in which no compound was applied. In total, 7 treatments (Trt) were randomly assigned to the experimental units (trees), i.e., 35 trees, in a rectangular arrangement. The number of nuts yij observed for each combination of formulation and time of application is provided in Table 5.5.

Table 5.5 Number of nuts per tree (yij) in each of the combinations of the two factors

The components of the GLMM are listed below:

$$ {\displaystyle \begin{array}{c}\mathrm{Distribution}:{y}_{ij}\mid {r}_j\sim \mathrm{Poisson}\left({\lambda}_{ij}\right)\\ {}{r}_j\sim N\left(0,{\upsigma}_{\mathrm{tree}}^2\right)\end{array}} $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_{ij}=\eta +{\tau}_i+{r}_j $$
$$ \mathrm{Link}\ \mathrm{function}:\log \left({\lambda}_{ij}\right)={\eta}_{ij} $$

where yij denotes the number of nuts in treatment i on tree j (i = 1, 2, ⋯, 7; j = 1, 2, ⋯, 5), ηij is the linear predictor, η is the intercept, τi is the fixed effect due to treatment i, and rj is the random effect due to tree j.

The following SAS statements fit a GLMM in a completely randomized design with a Poisson response variable.

proc glimmix data=crd_nuez nobound method=laplace;
  class trt rep;
  model count = trt / dist=Poi link=log;
  random rep;
  lsmeans trt / lines ilink;
run;

The options “dist” and “link” in the model statement tell SAS the distribution of the data and the link function to use, respectively. In addition, the “lines” option asks the GLIMMIX procedure, through the “lsmeans” (least squares means) statement, for mean comparisons, and the “ilink” option applies the inverse link function.

Part of the results is presented in Table 5.6. The value of the fit statistic for the conditional distribution (part (a)) indicates strong overdispersion (Pearson’s chi-square/DF = 3.62), and the variance component estimate due to sampling in the experimental units (trees) is \( {\hat{\sigma}}_{\mathrm{tree}}^2=0.035 \) (part (b)).

Table 5.6 Results of the analysis of variance

In addition, Table 5.6 (part (c)) shows the type III tests of fixed effects, indicating a significant difference between treatments in the average number of nuts per tree (P = 0.0001). However, it is not recommended to continue with the inference and analysis of the experiment due to the presence of extra variation (commonly known as overdispersion; Pearson’s chi-square/DF = 3.62) in the data, which strongly affects the F-test and the standard errors of the means.

A highly effective alternative for dealing with overdispersion in the data is to use a distribution other than the Poisson distribution. The negative binomial distribution is an excellent option for count data with overdispersion. Assume that the conditional distribution of the observations is given by:

$$ {y}_{ij}\mid {r}_j\sim \mathrm{Poisson}\left({\lambda}_{ij}\right), $$

where λij ~ Gamma(1/ϕ, ϕ), with ϕ the scale parameter, and \( {r}_j\sim N\left(0,{\upsigma}_{\mathrm{tree}}^2\right) \). The resulting new GLMM is:

$$ {\displaystyle \begin{array}{c}\mathrm{Distribution}:{y}_{ij}\mid {r}_j\sim \mathrm{Negative}\kern0.50em \mathrm{Binomial}\left({\lambda}_{ij},\phi \right),\\ {}{r}_j\sim N\left(0,{\upsigma}_{\mathrm{tree}}^2\right)\end{array}} $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_{ij}=\eta +{\tau}_i+{r}_j $$
$$ \mathrm{Link}\ \mathrm{function}:\log \left({\lambda}_{ij}\right)={\eta}_{ij} $$
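The Gamma–Poisson mixture underlying this model can be checked by simulation. The stdlib-only Python sketch below uses illustrative values λ = 8 and ϕ = 0.5 (unrelated to the walnut data) and one common parameterization, shape 1/ϕ and scale ϕλ, so that the mixing distribution has mean λ; the mixed counts then show the negative binomial mean λ and variance λ + ϕλ²:

```python
import math
import random

random.seed(2)

def rpoisson(lam):
    """Poisson sampler via Knuth's method (fine for moderate lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

lam, phi, n = 8.0, 0.5, 60000  # illustrative values only

# Gamma with shape 1/phi and scale phi*lam has mean lam and variance
# phi*lam**2; mixing Poisson rates over it yields a negative binomial
draws = [rpoisson(random.gammavariate(1.0 / phi, phi * lam))
         for _ in range(n)]

mean = sum(draws) / n                          # ~ lam = 8
var = sum((d - mean) ** 2 for d in draws) / n  # ~ lam + phi*lam**2 = 40
```

The simulated variance (about 40) is five times the mean (about 8), which is precisely the kind of extra-Poisson variation the negative binomial accommodates.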

The following GLIMMIX statements fit this model under a negative binomial distribution in a CRD.

proc glimmix data=crd_nuez nobound method=laplace;
  class trt rep;
  model count = trt / dist=Negbin link=log;
  random rep;
  lsmeans trt / lines ilink;
run;

Part of the results is listed below. The information criteria in part (a) of Table 5.7 are helpful in choosing which model best fits the dataset; clearly, the negative binomial distribution provides the better fit to these data. In the conditional fit statistics (part (b)), we observe that the Poisson model had strong overdispersion (Pearson’s chi-square/DF = 3.62) and that fitting the data under a negative binomial distribution removed it (Pearson’s chi-square/DF = 0.91).

Table 5.7 Poisson and negative binomial model fit statistics

Table 5.8 shows the variance component estimates (part (a)) and the type III tests of fixed effects (part (b)). The estimated variance parameter, due to trees, is \( {\hat{\sigma}}_{\mathrm{tree}}^2=0.04288 \), and the estimated scale parameter (Scale) is \( \hat{\phi}=0.06141 \). The type III tests of fixed effects (part (b)) show that there is a highly significant effect of treatments on the average number of nuts (P < 0.0001).

Table 5.8 Variance component estimates and fixed effects tests

The values under the “Estimate” column are the estimates of the linear predictor \( {\hat{\eta}}_i \) (the model scale), and the values under “Mean” are the means \( {\hat{\lambda}}_i \) (the data scale) with their respective standard errors, obtained with the “lsmeans” statement and the “ilink” option (Table 5.9). The results show that all treatments produced a higher average number of walnuts than the control treatment C. In general, formulation B applied to the walnut trees at the full-flowering stage produced the highest nut yield.

Table 5.9 Estimates on the model scale (“Estimate”) and means on the data scale (“Mean”)

Interest often arises in the agricultural and biological sciences in experiments that involve random effects (blocks, locations, etc.) and response variables that are not normally distributed. For example, suppose that a certain number of treatments are tested at several locations selected at random from a sufficiently large number of locations. At each location, the experimental units are randomly assigned to the treatments. Let yij be the number of (observed) individuals possessing the characteristic of interest under the ith treatment in the jth block. The model for the mean structure of this experiment is

$$ {\eta}_{ij}=\eta +{\tau}_i+{b}_j $$

where η is the intercept, τi is the fixed effect due to treatment i, and bj is the random effect of block j with \( {b}_j\sim N\left(0,{\upsigma}_{\mathrm{block}}^2\right) \).

5.2.3 Example 3: Control of Weeds in Cereal Crops in an RCBD

One of the main problems when growing cereal crops is the competition between weeds and seedlings. A field supervisor is interested in testing five weed-control treatments plus a control in cereal crops, using a randomized complete block design with four blocks. Table 5.10 shows the number of weed plants observed under each treatment (yij), with the treatment number given in parentheses.

Table 5.10 Number of weeds in each treatment (the number in parentheses corresponds to the treatment number)

Table 5.11 shows the sources of variation and the degrees of freedom of a randomized complete block design used in this experiment.

Table 5.11 Analysis of variance

Since the response is a count, it is modeled using a GLMM with a Poisson response variable, as stated below:

$$ {\displaystyle \begin{array}{c}\mathrm{Distribution}:{y}_{ij}\mid {b}_j\sim \mathrm{Poisson}\left({\lambda}_{ij}\right)\\ {}{b}_j\sim N\left(0,{\upsigma}_{\mathrm{block}}^2\right)\end{array}} $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_{ij}=\eta +{\tau}_i+{b}_j $$
$$ \mathrm{Link}\ \mathrm{function}:\log \left({\lambda}_{ij}\right)={\eta}_{ij} $$

where yij denotes the number of weed plants observed in treatment i and block j (i = 1, 2, ⋯, 6; j = 1, 2, 3, 4), ηij is the linear predictor, η is the intercept, τi is the fixed effect due to treatment i, and bj is the random block effect \( \left({b}_j\sim N\left(0,{\sigma}_{\mathrm{block}}^2\right)\right) \).

Using the GLIMMIX procedure, the following syntax specifies the analysis of a GLMM with a Poisson response.

proc glimmix nobound method=laplace;
  class Block Trt;
  model Count = Trt / dist=Poisson s;
  random block;
  lsmeans Trt / diff lines ilink;
run; quit;

Note that in the above syntax, we use “method=laplace” (alternatively, “method=quadrature”) to fit the mixed model and obtain the chi-square/DF fit statistic for the conditional distribution. If no integration method is specified, only the generalized chi-square/DF statistic is obtained. The auxiliary options after the “lsmeans” statement are as follows: “diff” provides pairwise comparisons between treatments, “lines” summarizes those comparisons using letter groupings, and “ilink” provides the value of the inverse link function. Some of the outputs are listed below.

Table 5.12 (a) presents the basic information about the model and estimation procedure used.

Table 5.12 Basic model information

Subsection (b) of Table 5.12 lists the “Dimensions” of the relevant matrices used in the model. The random effects matrix Z has four columns due to blocks, and the fixed effects matrix X has one column for the intercept plus six columns due to treatments.

The “Fit statistics” and “Fit statistics for conditional distribution” (parts (a) and (b) of Table 5.13, respectively) provide information about the fit of the GLMM. The generalized chi-square statistic measures the residual sum of squares of the final model, and its ratio to the degrees of freedom is a measure of the variability of the observations about the fitted mean.

The value of Pearson’s chi-square/DF for the conditional distribution is 11.8, well above 1. This value gives strong evidence of overdispersion in the dataset. In other words, it calls our distribution and linear predictor assumptions into question, suggesting that the variance function was not adequately specified.

Table 5.13 Model fit statistics

The F-test of H0: τ1 = τ2 = ⋯ = τ6, or equivalently μ1 = μ2 = ⋯ = μ6, indicates a highly significant difference (P < 0.0001) in the average number of weeds for at least one treatment (part (c) of Table 5.14).

Table 5.14 Variance component estimates, parameter estimates, and type III tests of fixed effects

The estimates of the linear predictor on the model scale for each treatment \( \left({\hat{\eta}}_i\right) \) and their inverse-linked values \( \left({\hat{\lambda}}_i\right) \) on the data scale (with their respective standard errors) are calculated as \( {\hat{\eta}}_i=\hat{\eta}+{\hat{\tau}}_i \) and \( {\hat{\lambda}}_i=\exp \left({\hat{\eta}}_i\right) \), respectively. These values are listed in Table 5.15.

Table 5.15 Estimated least squares means (“Mean”)

The “plots” option in the “proc GLIMMIX” statement creates a set of plots for the raw residuals, Pearson residuals, and studentized residuals.

The panel consists of a plot of the studentized residuals versus the linear predictor \( \left({\hat{\eta}}_i\right) \), a histogram of the residuals with a normal density superimposed, a quantile–quantile plot of the residuals, and a box plot of the residuals. The panel of studentized residuals suggests a slightly skewed distribution (Fig. 5.2). In this figure, we can see that the spread of the residuals changes with the value of the linear predictor, indicating that the assumption of constant variance is not met. The quantile plot confirms this violation. A nonconstant variance may also suggest an incorrect choice of the response distribution or variance function.

Fig. 5.2 Studentized conditional residuals

5.2.4 Overdispersion in Poisson Data

Linear mixed models assume that the observations have a normal distribution conditional on the model effects. In addition, the mean μ is independent of the variance σ2, whereas in most GLMMs that assume a binomial or Poisson distribution, the dispersion is fixed at 1; that is, once the mean is known, the variance is assumed known as well. Overdispersion is the extra variability not predicted by the generalized linear model’s random component. It arises because the mean and variance of a GLM are related, both depending on the same parameter that is predicted through the linear predictor. If overdispersion is present in a dataset, the estimated standard errors and goodness-of-fit test statistics are distorted and adjustments must be made. In other words, when there is overdispersion, the standard errors of the estimated parameters are too small, which leads to test statistics for the model parameters that are too large (i.e., the type I error rate increases).

Overdispersion can be caused by several factors: omission of predictor variables from the model, high correlation among the observations due to nested effects, misspecification of the systematic component, or an incorrect choice of the distribution of the data. Systematic deviations or overdispersion may thus result from incorrect assumptions about the stochastic and/or systematic component of the model. The model may also fit the dataset poorly because of an incorrect choice of link function, or because random effects or the dependence among observations were omitted; adding the appropriate random factors generally addresses such violations.

Fig. 5.3 Conditional residuals versus predicted values on the data scale

According to Stroup (2013), overdispersion occurs when the variance exceeds the theoretical variance under the assumed distribution of the data. Overdispersion is theoretically possible for any distribution with a nontrivial variance function that belongs to the one-parameter exponential family, because such distributions lack a scale parameter to moderate the mean–variance relationship; models based on the Poisson distribution are therefore particularly vulnerable. In summary, overdispersion occurs when:

  (a) The variance is larger than expected, which leads to incorrect standard errors.

  (b) The mean structure is not well specified.

  (c) The linear predictor η is not well specified.

  (d) The chosen distribution of the data is not appropriate.

  (e) Predictor variables are omitted.

  (f) Observations are significantly correlated.

If we do not account for overdispersion, we underestimate the standard errors and inflate the test statistics, which inflates the type I error rate and makes the confidence intervals unreliable. Figure 5.3 shows that as the predicted mean \( \hat{\mu} \) increases, the residuals have a larger spread, indicating that the variance may increase as a function of the mean, whereas Fig. 5.4 shows a nonconstant variance.
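The underestimation of standard errors can be seen in a short simulation (plain Python; the values λ = 10 and ϕ = 0.4 are illustrative, and the counts are generated from a Gamma–Poisson mixture so that they are overdispersed by construction):

```python
import math
import random

random.seed(3)

def rpoisson(lam):
    """Poisson sampler via Knuth's method (fine for moderate lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

# Overdispersed counts: Gamma-Poisson mixture with mean 10 and variance
# 10 + 0.4 * 10**2 = 50 (lam and phi are illustrative values)
lam, phi, n = 10.0, 0.4, 5000
y = [rpoisson(random.gammavariate(1.0 / phi, phi * lam)) for _ in range(n)]

ybar = sum(y) / n
s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)

se_poisson = math.sqrt(ybar / n)  # SE of the mean if Var(y) = mean held
se_actual = math.sqrt(s2 / n)     # SE of the mean from the observed variance

# se_poisson understates se_actual by about sqrt(Var/mean) = sqrt(5) here,
# which is why Poisson-based tests become too liberal under overdispersion
```

The naive Poisson standard error is too small by a factor of roughly √(Var/mean), so every test statistic built from it is correspondingly too large.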

Fig. 5.4 Residuals on the model scale

In the fit statistics obtained under the GLMM with the Poisson distribution (part (b), Table 5.13), the value of Pearson’s chi-square/DF = 11.8 indicates strong overdispersion in the dataset. Another relevant value in the output is the test statistic F (F = 523.57) tabulated in part (c) of Table 5.14; a value this large may indicate that the fit is incorrect. Once overdispersion has been detected, the researcher must choose a strategy to remedy it. There are three possible alternatives for testing and eliminating overdispersion, which we review below.

5.2.4.1 Using the Scale Parameter

The first alternative is to add a scale parameter, replacing Var(yij| bj) = λij by Var(yij| bj) = ϕλij. This amounts to replacing the conditional log-likelihood yij log (λij) − λij −  log (yij!) by the quasi-likelihood (yij log (λij) − λij)/ϕ, assuming that a ϕ > 1 can adequately model the observed variance.

The following GLIMMIX syntax invokes this alternative of adding a scale parameter under a Poisson response variable.

proc glimmix;
  class Block Trt;
  model Count = Trt / dist=Poisson;
  random intercept / subject=block;
  random _residual_;
  lsmeans Trt / ilink;
run;

The SAS code is very similar to that used previously, with the addition of the “random _residual_” statement. Note that the Laplace integration method (“method=laplace”) has been removed, so estimation is performed by the pseudo-likelihood (PL) method; the scale parameter is estimated and used to adjust the standard errors and test statistics. The GLIMMIX procedure uses the generalized chi-square divided by its degrees of freedom \( \left(\mathrm{Gener}.\mathrm{chi}-\mathrm{square}/\mathrm{DF}=\hat{\phi}\right) \) as the estimate of the scale parameter. All standard errors are multiplied by \( \sqrt{\hat{\phi}} \), and all F-values are divided by \( \hat{\phi} \). Table 5.16 shows part of the results.

In Table 5.16, we observe the fit statistics (part (a)), the covariance parameter estimates (part (b)), and the value of the scale parameter, \( \hat{\phi}=19.4848 \) (Residual (VC)). The value of the F-statistic under the Poisson distribution is now 26.87 (part (c)), obtained by dividing the F-value from the previous analysis by the scale estimate \( \left(523.57/\hat{\phi}\right) \). The results indicate that even under this adjustment overdispersion persists, with the estimate increasing from 11.8 to 19.4848 (part (a)). The inclusion of the scale parameter affects the variance estimate due to blocks \( {\sigma}_{\mathrm{block}}^2 \) as well as the estimates of the treatment means (part (d)), but the main impact is on the standard errors.
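The arithmetic of this adjustment can be verified directly from the values reported for this example (no assumptions beyond the reported numbers):

```python
import math

# Values reported for this example: the F statistic from the Poisson fit
# and the generalized chi-square / DF used as the scale estimate
f_poisson = 523.57
phi_hat = 19.4848

# GLIMMIX divides F statistics by phi-hat and multiplies every standard
# error by sqrt(phi-hat) under the "random _residual_" adjustment
f_adjusted = f_poisson / phi_hat   # about 26.87, as reported in the output
se_inflation = math.sqrt(phi_hat)  # each SE grows by this factor
```

Every standard error in the adjusted output is thus about 4.4 times its unadjusted counterpart, which is why the inference changes so substantially.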

The inclusion of the scale parameter implies working with a quasi-likelihood rather than a true likelihood; therefore, there is no true likelihood-based process yielding an expected value of λ and a variance of ϕλ.

Table 5.16 Results of the adjustment by adding the scale parameter

5.2.4.2 Linear Predictor Review

For count and binomial response variables, it is important to check whether the linear predictor is correctly specified, that is, whether λij is randomly affected by the experimental units within blocks. If it is, then the ANOVA table should include the block × treatment source of variation, and this term must be specified in the linear predictor of the GLMM. Thus, the linear predictor is specified as

$$ {\eta}_{ij}=\eta +{\tau}_i+{b}_j+{\left( b\tau \right)}_{ij} $$
$$ {\displaystyle \begin{array}{c}\mathrm{Distribution}:{y}_{ij}\mid {b}_j,{\left( b\tau \right)}_{ij}\sim \mathrm{Poisson}\left({\lambda}_{ij}\right)\\ {}{b}_j\sim N\left(0,{\upsigma}_{\mathrm{block}}^2\right)\\ {}{\left( b\tau \right)}_{ij}\sim N\left(0,{\upsigma}_{\mathrm{block}\times \tau}^2\right)\end{array}} $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_{ij}=\eta +{\tau}_i+{b}_j+{\left( b\tau \right)}_{ij} $$
$$ \mathrm{Link}\ \mathrm{function}:\log \left({\lambda}_{ij}\right)={\eta}_{ij}. $$

The following GLIMMIX program fits the above model:

proc glimmix method=laplace;
  class Block Trt;
  model Count = Trt / dist=Poisson;
  random intercept Trt / subject=block;
  lsmeans Trt / ilink;
run;

Part of the output is shown in Table 5.17. The results tabulated in part (a) indicate that the overdispersion has been eliminated \( \left(\hat{\phi}=0.11\right) \), although a value this far below 1 carries a risk of underestimating the variance; ideally, \( \hat{\phi} \) should be close to 1. The estimated variance components (part (b)) for blocks and block × treatments are \( {\sigma}_{\mathrm{block}}^2=0.05969 \) and \( {\sigma}_{\mathrm{block}\times \mathrm{Trt}}^2=0.1152 \), respectively.

The type III tests of fixed effects are highly significant (P = 0.0001), indicating that the six treatments are not equally effective for weed control (part (c)). The values in part (d) under the “Mean” column are the means on the original scale of the data for each treatment with their respective standard errors. Compared with the previous analysis (using the scale parameter), the means do not vary much, but the standard errors differ more markedly.

Table 5.17 Results of the fit by redefining the predictor of the model

5.2.4.3 Using a Different Distribution

Another way to address overdispersion when using a Poisson distribution is to change the assumed distribution of the response variable. A Poisson variable has equal mean and variance, but for biological count data this assumption is not always true. A negative binomial distribution is a good alternative (see Example 5.2), as previously discussed. A negative binomial variable has mean λ > 0 and variance λ + ϕλ2, where ϕ > 0 is the scale parameter; that is, E(y) = λ and Var(y) = λ + ϕλ2. The components of this model are shown below:

Given that yij ∣ bj ~ Poisson(λij), it is assumed that λij ~ Gamma(1/ϕ, ϕ), with ϕ the scale parameter and \( {b}_j\sim N\left(0,{\sigma}_{\mathrm{block}}^2\right) \). The new specification of the resulting GLMM is as follows:

$$ {\displaystyle \begin{array}{c}\mathrm{Distribution}:{y}_{ij}\mid {b}_j\sim \mathrm{Negative}\ \mathrm{Binomial}\left({\lambda}_{ij},\phi \right)\\ {}{b}_j\sim N\left(0,{\upsigma}_{\mathrm{block}}^2\right)\end{array}} $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_{ij}=\eta +{\tau}_i+{b}_j $$
$$ \mathrm{Link}\ \mathrm{function}:\log \left({\lambda}_{ij}\right)={\eta}_{ij}. $$
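The mean–variance relationship above can be checked numerically. The sketch below (illustrative Python, not part of the SAS analysis) evaluates the negative binomial probability mass function in the (λ, ϕ) parameterization used in this chapter and confirms that E(y) = λ and Var(y) = λ + ϕλ2; the values λ = 4 and ϕ = 0.5 are arbitrary.

```python
import math

def negbin_pmf(k, lam, phi):
    """P(Y = k) for a negative binomial with mean lam and variance
    lam + phi*lam**2 (Poisson-Gamma mixture parameterization)."""
    n = 1.0 / phi                      # Gamma shape parameter
    p = 1.0 / (1.0 + phi * lam)        # "success" probability
    log_pmf = (math.lgamma(k + n) - math.lgamma(n) - math.lgamma(k + 1)
               + n * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)

lam, phi = 4.0, 0.5                    # arbitrary illustrative values
probs = [negbin_pmf(k, lam, phi) for k in range(2000)]
mean = sum(k * q for k, q in enumerate(probs))
var = sum((k - mean) ** 2 * q for k, q in enumerate(probs))
# mean recovers lam (4.0) and var recovers lam + phi*lam**2 (12.0)
```

Setting ϕ → 0 collapses the variance back to λ, which is why the Poisson model can be viewed as the limiting case of the negative binomial.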

The following GLIMMIX statements fit the model with a negative binomial distribution.

proc glimmix method=laplace;
 class block Trt;
 model count = Trt / dist=NegBin;
 random block;
 lsmeans Trt / ilink;
run;

Some of the most relevant outputs from GLIMMIX are presented in Table 5.18. The Pearson chi-square/DF value of 0.88 (part (a)) shows that the overdispersion in the dataset has been removed. The estimated scale parameter tabulated in part (b) (Scale) is \( \hat{\phi}=0.1080 \). This is not the same as the scale parameter estimated under the Poisson model with the “random _residual_” statement, since the two are calculated differently. However, as mentioned above, both scale parameters govern the relationship between the mean and variance in the Poisson and negative binomial distributions.

Table 5.18 Fitting results by redefining the model structure

The value of the test statistic shown in part (c) of Table 5.18, under the negative binomial distribution for the effect of treatments, is highly similar to the value obtained with the Poisson distribution when the block × treatment interaction was added to the linear predictor. The values under “Estimate” are estimates of the linear predictor on the model scale (part (d)), whereas those under the “Mean” column are the treatment means on the data scale, using the negative binomial distribution. Of the three alternatives proposed for fitting these data, the last two (including the block × treatment interaction in the predictor and assuming a negative binomial distribution) provide a better fit.

5.2.5 Factorial Designs

Many experiments involve studying the effects of two or more factors. Factorial designs are the most efficient for these types of experiments. In a factorial design, all possible combinations of factor levels are investigated in each replicate. If there are a levels of factor A and b levels of factor B, then each replicate contains all ab treatment combinations.
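The enumeration of all ab combinations can be made concrete with a short sketch (illustrative Python; the factor labels are hypothetical placeholders): for a = 2 and b = 4, each replicate contains all eight combinations.

```python
from itertools import product

# Hypothetical labels for a 2 x 4 factorial (a = 2 levels of A, b = 4 of B)
factor_a = ["A1", "A2"]
factor_b = ["B1", "B2", "B3", "B4"]

# Every replicate contains all a*b treatment combinations
combinations = list(product(factor_a, factor_b))
```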

5.2.5.1 Example: A 2 × 4 Factorial with a Poisson Response

This application refers to a factorial experiment involving explants from cotyledons of cucumber (Cucumis sativus L.) with two factors, i.e., genotype (two levels) and culture medium (four levels). Each of the eight combinations of genotype and culture medium levels was applied to four Petri dishes, each containing six leaf explants. The response variable was the number of buds in each of the leaf explants, i.e., a count. There are two sources of variation in this application, namely, variation between Petri dishes and variation between the explants within the Petri dishes (Table 5.19).

Table 5.19 Number of buds counted in the cucumber experiment

The sources of variation and degrees of freedom for this experiment are shown in Table 5.20.

Table 5.20 Sources of variation and degrees of freedom

The components that define this model are shown below:

$$ {\displaystyle \begin{array}{c}\mathrm{Distribution}:{y}_{ijkl}\mid \mathrm{petri}.{\mathrm{dish}}_k,\mathrm{explant}{\left(\mathrm{petri}.\mathrm{dish}\right)}_{l(k)}\sim \mathrm{Poisson}\left({\lambda}_{ijkl}\right),\\ {}\mathrm{petri}.{\mathrm{dish}}_k\sim N\left(0,{\upsigma}_{\mathrm{petri}.\mathrm{dish}}^2\right),\\ {}\mathrm{explant}{\left(\mathrm{petri}.\mathrm{dish}\right)}_{l(k)}\sim N\left(0,{\upsigma}_{\mathrm{explant}\left(\mathrm{petri}.\mathrm{dish}\right)}^2\right)\end{array}} $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_{ij kl}=\eta +{\alpha}_i+{\beta}_j+{\left(\alpha \beta \right)}_{ij}+\mathrm{petri}.{\mathrm{dish}}_k+\mathrm{explant}{\left(\mathrm{petri}.\mathrm{dish}\right)}_{l(k)} $$
$$ \mathrm{Link}\ \mathrm{function}:\mathit{\log}\left({\lambda}_{ijkl}\right)={\eta}_{ijkl} $$

where ηijkl is the linear predictor in genotype i (i = 1, 2), culture medium j (j = 1, 2, 3, 4), Petri.dish k (k = 1, 2, 3, 4), and explant l (l = 1, 2, 3, 4, 5, 6), η is the intercept, αi is the fixed effect due to genotype i, βj is the fixed effect due to culture medium j, (αβ)ij is the effect of the interaction between genotype i and culture medium j, Petri.dishk is the random effect of the Petri.dish, and explant(Petri.dish)l(k) is the random effect of the explant within the Petri.dish, assuming \( \mathrm{Petri}.{\mathrm{dish}}_k\sim N\left(0,{\upsigma}_{\mathrm{Petri}.\mathrm{dish}}^2\right) \) and \( \mathrm{explant}{\left(\mathrm{Petri}.\mathrm{dish}\right)}_{l(k)}\sim N\left(0,{\upsigma}_{\mathrm{explant}\left(\mathrm{Petri}.\mathrm{dish}\right)}^2\right) \).

The following GLIMMIX procedure fits a factorial experiment with a Poisson response.

proc glimmix method=laplace;
 class genotype culture petri_dish explant;
 model y = genotype|culture / dist=Poisson;
 random petri_dish explant(petri_dish);
 lsmeans genotype|culture / ilink lines;
run;

Some of the SAS output is shown in Table 5.21, including the fit statistics for this dataset in part (a). Note that “method=laplace” was used for the estimation process and to obtain Pearson’s fit statistic χ2/DF. The result indicates that there is evidence of overdispersion (Pearson chi-square/DF = 1.84).

Overdispersion, as discussed before, implies more variability in the data than the model expects, which explains the lack of fit of the Poisson model. Part (b) shows the variance component estimate due to Petri dishes, \( {\hat{\upsigma}}_{\mathrm{Petri}.\mathrm{dish}}^2=0.003616 \), and that for explants within Petri dishes, \( {\hat{\upsigma}}_{\mathrm{explant}\left(\mathrm{Petri}.\mathrm{dish}\right)}^2=0.01462 \). In addition, the type III tests of fixed effects indicate a statistically significant effect of genotype, culture medium, and the interaction of both factors (part (c)).
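The diagnostic quoted here, Pearson’s chi-square divided by its degrees of freedom, is easy to compute from the observed counts and the fitted means. A minimal sketch (illustrative Python with made-up numbers, not the cucumber data):

```python
def pearson_chisq_over_df(y, mu, n_params):
    """Pearson chi-square / DF for a Poisson fit, where Var(y) = mu.
    Values well above 1 signal overdispersion; values near 1 are
    consistent with the Poisson mean-variance assumption."""
    chisq = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chisq / (len(y) - n_params)

# Made-up counts and fitted means for illustration only
y = [0, 2, 1, 3, 5, 1]
mu = [2.0] * 6
ratio = pearson_chisq_over_df(y, mu, n_params=2)
# ratio of 2.0 here would indicate overdispersion
```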

Table 5.21 Conditional fit statistics, variance component estimates, and type III tests of fixed effects under the Poisson distribution

The plot of residuals against the linear predictor in Fig. 5.5 provides further evidence of possible overdispersion.

Fig. 5.5
Four panels of conditional studentized residuals: residuals versus the linear predictor, a normal quantile plot, a histogram of the residuals, and a boxplot.

Studentized conditional residuals

The least squares means on the model scale for the genotype (part (a)), the culture medium (part (b)), and the interaction between both factors (part (c)) are listed under the “Estimate” column of Table 5.22, whereas under the “Mean” column are the means of these factors but in terms of the data.

Table 5.22 Estimates on the model scale and means on the data scale under the Poisson distribution

Since there is overdispersion in the data, we will fit the GLMM again using the negative binomial distribution. That is, under the following GLMM:

$$ {\displaystyle \begin{array}{c}\mathrm{Distribution}:{y}_{ijkl}\mid \mathrm{Petri}.{\mathrm{dish}}_k,\mathrm{explant}{\left(\mathrm{Petri}.\mathrm{dish}\right)}_{l(k)}\sim \mathrm{Negative}\ \mathrm{Binomial}\left({\lambda}_{ijkl},\phi \right),\\ {}\mathrm{Petri}.{\mathrm{dish}}_k\sim N\left(0,{\upsigma}_{\mathrm{Petri}.\mathrm{dish}}^2\right),\\ {}\mathrm{explant}\ {\left(\mathrm{Petri}.\mathrm{dish}\right)}_{l(k)}\sim N\left(0,{\upsigma}_{\mathrm{explant}\left(\mathrm{Petri}.\mathrm{dish}\right)}^2\right),\end{array}} $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_{ij kl}=\eta +{\alpha}_i+{\beta}_j+{\left(\alpha \beta \right)}_{ij}+\mathrm{Petri}.{\mathrm{dish}}_k+\mathrm{explant}{\left(\mathrm{Petri}.\mathrm{dish}\right)}_{l(k)} $$
$$ \mathrm{Link}\ \mathrm{function}:\log \left({\lambda}_{ijkl}\right)={\eta}_{ijkl} $$

and the scale parameter ϕ.

The following GLIMMIX program allows us to fit a GLMM with a negative binomial response variable.

proc glimmix method=laplace;
 class genotype culture petri_dish explant;
 model y = genotype|culture / dist=NegBin link=log;
 random petri_dish explant(petri_dish);
 lsmeans genotype|culture / lines ilink;
run;

It should be noted that this program is very similar to the previous one; the only difference is that a negative binomial distribution is now used (“dist=NegBin”). Part of the results is presented in Table 5.23. As already mentioned, the negative binomial distribution is another model for count variables when there is overdispersion in the dataset. If Pearson’s chi-squared value divided by its degrees of freedom is less than or equal to 1, then the overdispersion is zero or close to zero, meaning that the model adequately captures the extra variation. Based on the conditional distribution, the Pearson chi-square fit statistic (χ2/DF = 0.83) indicates no evidence of overdispersion, which justifies the negative binomial distribution over the Poisson distribution implemented above. Part (b) shows that the estimated scale parameter is \( \hat{\phi}=0.1712 \). This value is not the same as the parameter for the quasi-Poisson model obtained with the “random _residual_” statement. Note that the variance components were slightly affected. Additionally, the type III tests of fixed effects in part (c) of Table 5.23 show a significant effect of genotype, culture medium, and their interaction (genotype*culture) on the number of buds per leaf explant.
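Why the negative binomial accommodates overdispersed counts better can be seen directly from the log-likelihoods. In the sketch below (illustrative Python; the counts and the value ϕ = 1 are arbitrary assumptions, not the bud data), the negative binomial log-likelihood exceeds the Poisson one because the data are far more variable than their mean:

```python
import math

def poisson_logpmf(k, mu):
    return k * math.log(mu) - mu - math.lgamma(k + 1)

def negbin_logpmf(k, lam, phi):
    """Log pmf with mean lam and variance lam + phi*lam**2."""
    n, p = 1.0 / phi, 1.0 / (1.0 + phi * lam)
    return (math.lgamma(k + n) - math.lgamma(n) - math.lgamma(k + 1)
            + n * math.log(p) + k * math.log(1.0 - p))

# Arbitrary overdispersed counts: variance far exceeds the mean of 8
y = [0, 0, 1, 2, 15, 30]
mu = sum(y) / len(y)
pois_ll = sum(poisson_logpmf(k, mu) for k in y)
nb_ll = sum(negbin_logpmf(k, mu, phi=1.0) for k in y)
# nb_ll is much larger (less negative) than pois_ll for these counts
```

In practice, ϕ is estimated from the data (as GLIMMIX does) rather than fixed in advance.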

Table 5.23 Conditional fit statistics, variance component estimates, and type III tests of fixed effects under the negative binomial distribution

The “lines” option in the “lsmeans” statement is used to obtain Fisher’s least significant difference (LSD) comparisons of means for both factors and their interaction. The means and their respective standard errors, on the model scale (“Estimate” column) and on the data scale (“Mean” column), are tabulated in Table 5.24 for the genotype, in Table 5.25 for the culture medium, and in Table 5.26 for the interaction between both factors. The estimated values in the mean comparison for genotype (Table 5.24) correspond to the values of the linear predictor \( {\hat{\eta}}_i \) on the model scale, whereas the means on the data scale are \( {\hat{\lambda}}_i \) (part (a)); the comparison of means (on the model scale) is tabulated in part (b).

Table 5.24 Estimates on the model scale and means on the data scale under the negative binomial distribution

For the culture medium (Table 5.25), the estimated values in this comparison of means correspond to the values of the linear predictor \( {\hat{\eta}}_j \) (on the model scale), but, by applying the inverse link to \( {\hat{\eta}}_j, \) we obtain the values under the “Mean” column that provide the means on the data scale (part (a)). The mean comparisons on the model scale are shown in part (b).

Table 5.25 Means estimates on the model scale and data scale for the culture medium

The results indicate that culture media 2 and 3 produced a statistically similar average number of buds, higher than that obtained in culture media 1 and 4 (see Fig. 5.6).

Fig. 5.6
A bar graph with error bars of the average number of buds by culture medium: medium 2 has the highest number of buds (about 14), whereas medium 4 has the lowest (about 5).

Comparison of the average number of buds as a function of the type of culture medium (LSD, α = 0.05)

For the interaction between both factors, the average number of buds and the mean comparisons are shown in Table 5.26.

Table 5.26 Estimates on the model scale and means on the data scale for the interaction between genotype and culture medium

The values under “Estimates” (Table 5.26) correspond to those of the linear predictor \( {\hat{\eta}}_{ij} \) (model scale), but the values under “Mean” correspond to the means \( {\hat{\lambda}}_{ij} \) on the data scale.

Graphically, Fig. 5.7 shows that genotype 1 in culture medium 2 provides the highest number of buds, whereas the lowest number of buds was observed in culture medium 4. For genotype 2, the highest number of buds was observed in culture media 2 and 3. Finally, culture medium 4 is less suitable for both genotypes.

Fig. 5.7
A grouped bar graph with error bars of the average number of buds by culture medium and genotype: medium 2 has the highest number of buds for genotypes 1 and 2 (about 17 and 12.5, respectively), whereas medium 4 has the lowest (about 4.5) for both.

Effect of the cultivar × culture medium interaction on the average number of buds (LSD, α = 0.05)

5.2.6 Latin Square (LS) Design

A Latin square (LS) design is used where heterogeneity is associated with the crossing of two factors, generally both with the same number of levels. This design was originally used in agricultural experimentation, with plots placed in a square arrangement and heterogeneity expected along the rows and columns of the square; the design therefore blocks in two directions, across rows and columns. When blocking in two directions is appropriate, an LS design is a good option. Some examples are provided below to illustrate the use of this experimental design:

  • Field experiments on plots set in a square arrangement with rows and columns that contribute to the heterogeneity between plots. For example, gradients of fertility, moisture, management practices, and so on.

  • Experiments in greenhouses, rooms with a controlled environment, or growth chambers where the placement of shelves, trays, etc. with respect to walls or light sources can introduce systematic variability related to temperature, humidity, or light in different directions (e.g., left to right, back to front, or top to bottom).

  • Laboratory experiments in which there are two potential sources of variability (e.g., technicians, machines, etc.) and researchers are aware of the possible impact of variation from both sources.

For an LS layout, the number of rows (r) and columns (c) should be equal to the number of treatments (t) and the number of replicates of each treatment. The assignment of treatments is such that each treatment appears exactly once in each row and column, with each row and column containing a full set of treatments. Thus, the treatment effect estimates are independent of the differences between rows or columns, and the rows, columns, and treatments are orthogonal to each other.
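The row/column constraint is easy to state programmatically. The sketch below (illustrative Python) builds a cyclic t × t Latin square and verifies that every treatment occurs exactly once per row and per column; in a real experiment, the rows, columns, and treatment labels would then be randomized.

```python
def cyclic_latin_square(t):
    """t x t Latin square via cyclic shifts: cell (i, j) gets treatment (i + j) mod t."""
    return [[(i + j) % t for j in range(t)] for i in range(t)]

def is_latin_square(square):
    """True if every symbol appears exactly once in each row and each column."""
    t = len(square)
    symbols = set(range(t))
    columns = list(zip(*square))
    return (all(set(row) == symbols for row in square)
            and all(set(col) == symbols for col in columns))

square = cyclic_latin_square(6)   # same size as a 6 x 6 layout
```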

The analysis of variance for this experimental design, assuming that there are r rows, c columns, and t treatments, with r = c = t, contains the following sources of variability (Table 5.27).

Table 5.27 Sources of variation and degrees of freedom of a Latin square design

From the analysis of variance table, the linear model for an LS design with t treatments is as follows:

$$ {y}_{ijk}=\mu +{f}_j+{c}_k+{\tau}_i+{\varepsilon}_{ijk} $$

where yijk is the response observed for treatment i in row j and column k, μ is the overall mean, fj is the random effect of row j assuming \( {f}_j\sim N\left(0,{\sigma}_f^2\right) \), ck is the random effect of column k with \( {c}_k\sim N\left(0,{\sigma}_c^2\right) \), τi is the fixed effect of treatment i, and εijk is the random error term, distributed N(0, σ2). Note that treatment i is allocated to the jkth cell (row j and column k).

5.2.6.1 Latin Square Design with a Poisson Response

In a series of field experiments, several “inducer-attractant” strategies were tested to control insect pests in oilseed rape. In one experiment, the use of wild turnip rape (turnip rape) as an earlier flowering trap crop (TR) (the “attractor”) was tested together with the use of a repellent (an antifeedant) applied to oilseed rape in spring (S, the “inducer”). Untreated oilseed rape (U) was included as a control. The experiment was set up as a 6 × 6 Latin square with two replicates of each of the three treatments per row and column. An assessment of the number of mature pollen beetles was made on 10 plants per plot in early April, 1 day after spraying the repellent (antifeedant). The average number of adult beetles sampled on 10 plants per plot was recorded (Appendix 1: Data: Beatles). The question is: Is there evidence that the attractor or inducer works? That is, are fewer beetles present in the proposed treatments compared to the control?

The model components that define this GLMM are as described below:

$$ {\displaystyle \begin{array}{c}\mathrm{Distribution}:{y}_{ijkl}\mid {f}_j,{c}_k\sim \mathrm{Poisson}\left({\lambda}_{ijkl}\right)\\ {}{f}_j\sim N\left(0,{\sigma}_f^2\right),{c}_k\sim N\left(0,{\sigma}_c^2\right)\end{array}} $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_{ijkl}=\eta +{f}_j+{c}_k+{\tau}_i $$
$$ \mathrm{Link}\ \mathrm{function}:\log \left({\lambda}_{ijkl}\right)={\eta}_{ijkl} $$

where ηijkl is the linear predictor that relates the effect of replicate l (l = 1, 2) in row j (j = 1, 2, ⋯, 6) and column k (k = 1, 2, ⋯, 6) when treatment i (i = 1, 2, 3) is applied; η is the intercept; τi is the fixed effect of treatment i; fj is the random effect of row j; and ck is the random effect due to column k, assuming that there is no interaction between rows and columns or between treatments and rows or columns. The assumed distributions for rows and columns are \( {f}_j\sim N\left(0,{\sigma}_f^2\right) \) and \( {c}_k\sim N\left(0,{\sigma}_c^2\right) \), respectively. The model uses the linear predictor (ηijkl) to estimate the treatment means (λijkl = μijkl).

The following GLIMMIX program fits a Latin square design with a Poisson response:

proc glimmix nobound method=laplace;
 class row column treatment;
 model count = treatment / dist=Poisson link=log;
 random row column;
 lsmeans treatment / lines ilink;
run;

Part of the output is shown in Table 5.28. In the values of the fit statistics (part (a)), we observe that the value of Pearson’s chi-square divided by the degrees of freedom is less than 1 \( \left(\frac{\chi^2}{DF}=0.55\right) \), indicating that there is no overdispersion in the data and that the Poisson distribution adequately models the dataset.

The type III tests of fixed effects in part (b) indicate that there is no significant evidence of differences between the treatments (P = 0.0621).

Table 5.28 Results of the analysis of variance

Part (c) of Table 5.28 shows the estimates of treatments on the model scale (“Estimate”) and on the data scale (“Mean”) with their respective standard errors. The values 4.6191, 6.9396, and 5.1561 (under the “Mean” column) correspond to the treatment means for S, TR, and U, respectively.
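The relationship between the “Estimate” and “Mean” columns is simply the inverse link: with a log link, the data-scale mean is the exponential of the model-scale estimate, and its standard error can be approximated by the delta method. A sketch (illustrative Python; only the mean 4.6191 for treatment S comes from Table 5.28, and the standard error 0.18 is an assumed value):

```python
import math

# Model-scale estimate backed out from the reported S mean; assumed SE
eta, se_eta = math.log(4.6191), 0.18

mean = math.exp(eta)        # inverse link: data-scale mean
se_mean = mean * se_eta     # delta-method SE on the data scale
```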

5.2.6.2 Randomized Complete Block Design in a Split Plot

Sometimes the researcher is interested in testing multiple factors using different sizes of experimental units, and, in most cases, the treatment combinations cannot be completely randomized. Suppose that one wishes to test two factors, A and B, with a and b levels, respectively. The levels of the first factor (A) are randomly applied to the primary experimental units. Then, the levels of the second factor (B) are applied to the secondary units formed within each primary unit. In other words, the primary experimental unit (whole plot) receives a level of the first factor and is then divided into secondary experimental units (subplots) that receive the levels of the second factor. Since a split-plot design has two sizes of experimental units, the whole plots (primary units) and subplots (secondary units) have different experimental errors. Split-plot experiments originated in agriculture with Fisher (1925), and their importance in industrial experimentation has been widely recognized (Yates 1935).

As a simple illustration, consider a study of the effect of three pulp preparation methods (factor A) and four temperature levels (factor B) on paper tensile strength (paper quality). A batch of pulp is produced by one of the three methods and is then divided into four equal portions (samples), each of which is cooked at a specific temperature. The assignment of treatments to whole plots and subplots is shown in Table 5.29.

Table 5.29 Assigning treatments to whole plots and subplots

The standard ANOVA model for two factors in a split-plot design, in which there are three levels of factor A and four levels of factor B nested within factor A, is described below:

$$ {y}_{ij k}=\mu +{\alpha}_i+{r}_k+\alpha {(r)}_{ik}+{\beta}_j+{\left(\alpha \beta \right)}_{ij}+{\varepsilon}_{ij k} $$

where yijk is the observed response at level i (i = 1, 2, 3) of factor A and at level j (j = 1, 2, 3, 4) of factor B in block k (k = 1, 2, 3), μ is the overall mean, αi is the effect at level i of factor A, rk is the random effect of blocks assuming \( {r}_k\sim N\left(0,{\sigma}_r^2\right) \), α(r)ik is the random effect of the whole-plot error assuming \( \alpha {(r)}_{ik}\sim N\left(0,{\sigma}_{\alpha (r)}^2\right) \), βj is the effect at level j of factor B, (αβ)ij is the fixed interaction effect at level i of factor A and level j of factor B, and εijk is the normal random experimental error {εijk ~ iid N(0, σ2)}. The ANOVA table with sources of variation for this experimental design is shown in Table 5.30.

Table 5.30 Sources of variation and degrees of freedom for a randomized block design with a split-plot treatment arrangement
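The degrees-of-freedom bookkeeping in Table 5.30 follows directly from a, b, and r. The sketch below (illustrative Python) computes the standard decomposition for hypothetical sizes a = 3, b = 4, and r = 3:

```python
def split_plot_df(a, b, r):
    """Degrees of freedom for a split plot in an RCBD: a whole-plot levels,
    b subplot levels, r blocks (standard ANOVA decomposition)."""
    return {
        "blocks": r - 1,
        "A": a - 1,
        "whole-plot error (A x blocks)": (a - 1) * (r - 1),
        "B": b - 1,
        "A x B": (a - 1) * (b - 1),
        "residual": a * (b - 1) * (r - 1),
        "total": a * b * r - 1,
    }

df = split_plot_df(a=3, b=4, r=3)   # hypothetical sizes
# the component degrees of freedom sum to the total, abr - 1
```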

Example 5.1

A split-plot design in randomized complete block arrangement with a Poisson response

A split plot is probably the most common design structure in plant and soil research. Such experiments involve two or more treatment factors. Typically, large units called whole plots are grouped into blocks. The levels of the first factor are randomly assigned to whole plots. Each whole plot is divided into smaller units, called subplots (split plots). Next, the levels of the second factor are randomly assigned to units of split plots within each whole plot.

In this example, four blocks were implemented, which were divided into seven parts for the seven levels of the first factor (A1, A2, A3, A4, A5, A6, and A7), as whole plots. Then, each whole plot was divided into four units for randomly assigning the four levels of factor B, known as subplots (B1, B2, B3, and B4). Both factors were used to control the growth of a particular weed. Both factors were randomly allocated in each block, as shown below:

Block 1                                ⋯   Block 4

A1  A7  A3  A2  A5  A4  A6                 A6  A3  A7  A2  A1  A5  A4
B3  B3  B4  B1  B2  B1  B3                 B3  B3  B4  B1  B2  B1  B3
B1  B2  B3  B3  B1  B2  B2                 B1  B2  B3  B3  B1  B2  B2
B2  B4  B1  B4  B3  B3  B4                 B2  B4  B1  B4  B3  B3  B4
B4  B1  B2  B2  B4  B4  B1                 B4  B1  B2  B2  B4  B4  B1

(Each column lists the four subplots of the whole plot headed by the A level above it; blocks 2 and 3 are not shown.)

The sources of variation and degrees of freedom for this experiment are shown below in Table 5.31:

Table 5.31 Sources of variation and degrees of freedom for a randomized block design with a split-plot treatment arrangement

In this experiment, the response variable was the number of weeds in each of the plots (Appendix 1: Weed counts). The components that define this GLMM are as shown below:

$$ {\displaystyle \begin{array}{c}\mathrm{Distribution}:{y}_{ijk}\mid {r}_k,\alpha {(r)}_{ik}\sim \mathrm{Poisson}\left({\lambda}_{ijk}\right)\\ {}{r}_k\sim N\left(0,{\sigma}_r^2\right),\alpha {(r)}_{ik}\sim N\left(0,{\sigma}_{ar}^2\right)\end{array}} $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_{ij k}=\eta +{\alpha}_i+{r}_k+\alpha {(r)}_{ik}+{\beta}_j+{\left(\alpha \beta \right)}_{ij} $$
$$ \mathrm{Link}\ \mathrm{function}:\log \left({\lambda}_{ijk}\right)={\eta}_{ijk} $$

where ηijk is the linear predictor that relates the effect of factor A with i levels (i = 1, 2, ⋯, 7) and factor B with j levels (j = 1, 2, 3, 4) in block k (k = 1, 2, 3, 4); η is the intercept; αi is the fixed effect at level i of factor A; βj is the fixed effect at level j of factor B; (αβ)ij is the fixed effect of the interaction between level i of factor A and level j of factor B; rk is the random effect due to blocks; and α(r)ik is the random error effect of the whole plot, assuming \( {r}_k\sim N\left(0,{\sigma}_r^2\right) \) and \( \alpha {(r)}_{ik}\sim N\left(0,{\sigma}_{AR}^2\right) \), respectively. The model uses the aforementioned linear predictor (ηijk) to estimate the treatment means (λijk = μijk).

The following GLIMMIX program fits a split-plot block design with a Poisson response variable:

proc glimmix method=laplace;
 class block a b;
 model count = a|b / dist=Poisson link=log;
 random block block*a;
 lsmeans a|b / lines ilink;
run;

Part of the output is shown below.

Table 5.32 Results of the analysis of variance

As in the previous examples, the Poisson model was found to be inadequate because the value of Pearson’s chi-squared statistic divided by the degrees of freedom is greater than 1 \( \left(\frac{\chi^2}{df}=4.50\right) \). This indicates that we have probably misspecified either the conditional distribution of y ∣ b or the linear predictor; in this case, there is evidence that we need to consider another distribution for this dataset (part (a), Table 5.32). In addition, in part (b), the variance component estimates due to blocks and blocks × A are tabulated \( \left({\hat{\sigma}}_r^2=0.01526;{\hat{\sigma}}_{ra}^2=0.2454\right) \). On the other hand, the type III tests of fixed effects (part (c)) show a significant effect of factor B and of the interaction between both factors.

An alternative for reducing the overdispersion is to keep the same linear predictor but replace the Poisson distribution of the response variable with the negative binomial distribution, that is:

$$ {\displaystyle \begin{array}{c}\mathrm{Distribution}:{y}_{ijk}\mid {r}_k,\alpha {(r)}_{ik}\sim \mathrm{Negative}\ \mathrm{binomial}\left({\lambda}_{ijk},\phi \right)\\ {}{r}_k\sim iid\ N\left(0,{\sigma}_r^2\right),\alpha {(r)}_{ik}\sim iid\ N\left(0,{\sigma}_{AR}^2\right)\end{array}} $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_{ij k}=\eta +{\alpha}_i+{r}_k+\alpha {(r)}_{ik}+{\beta}_j+{\left(\alpha \beta \right)}_{ij} $$
$$ \mathrm{Link}\ \mathrm{function}:\log \left({\lambda}_{ijk}\right)={\eta}_{ijk} $$

The following syntax fits a GLMM under a negative binomial distribution.

proc glimmix method=laplace;
 class block a b;
 model count = a|b / dist=NegBin link=log;
 random intercept a / subject=block;
 lsmeans a|b / lines ilink;
run;

Part of the output is shown in Table 5.33. The results tabulated in part (a) indicate that the overdispersion has been removed \( \left(\frac{\chi^2}{df}=0.71\right) \). The variance component estimates, tabulated in part (b), are \( {\sigma}_r^2=0.0024 \) and \( {\sigma}_{AR}^2=0.1222 \) for blocks and blocks × A, respectively, and the estimated scale parameter is \( \hat{\phi}=0.3458 \). Note that the results under the negative binomial distribution differ from those obtained under the Poisson distribution, which is due, of course, to the fact that the negative binomial distribution better captures the overdispersion. The F-test for the fixed effect of factor A is significant at the 5% level (part (c)), whereas factor B and the interaction do not significantly influence the response variable.

Table 5.33 Results of the analysis of variance

Example 5.2

A split-split plot in time in a randomized complete block design with a Poisson response.

The propagation of coffee seedlings through grafting in nurseries depends on several factors such as the type of substrate, the rootstock of the plant that will host the graft, type of graft, light intensity, type and size of the container, humidity, temperature, and so forth. The objective of this experiment was to evaluate the effect of shade cloth (light intensity), type of container, and clone on the number of leaves produced by the Coffea canephora P. clones grafted with the Coffea arabica L. variety Oro azteca.

The factors studied were the color of the shade cloth (black, pearl, and red), the container type (tray, 0.5-kg tube, and 1-kg tube), and five coffee clones of the variety Coffea canephora P. plus an own-rooted “franc foot” plant (Pf; Coffea arabica L. var. Oro azteca) (Appendix 1: Coffee data). The clones used in the experiment are listed below (Table 5.34). Different physiological parameters were evaluated over a period of 11 months.

Table 5.34 Clones of Coffea canephora P

This work was implemented in four randomized complete blocks. The following table exemplifies how a block was constructed.

Shade cloth red                 Shade cloth pearl               Shade cloth black
Tray  0.5 kg  1 kg              Tray  0.5 kg  1 kg              Tray  0.5 kg  1 kg
C2    C5      C4                C4    C5      C2                C5    C2      C4
C4    Pf      C3                C3    Pf      C4                Pf    C3      C2
C3    C1      C5                C5    C1      C3                C1    C5      C3
C5    C2      Pf                Pf    C2      C5                C2    Pf      C5
Pf    C4      C1                C1    C4      Pf                C4    C1      Pf
C1    C3      C2                C2    C3      C1                C3    C4      C1

The statistical model describing a split-split plot in time design is described below:

$$ {\displaystyle \begin{array}{c}{y}_{ijklm}=\mu +{\alpha}_i+{r}_m+{(ar)}_{im}+{\beta}_j+{\left(\alpha \beta \right)}_{ij}+{\gamma}_k+{\left(\alpha \gamma \right)}_{ik}+{\left(\beta \gamma \right)}_{jk}+{\left(\alpha \beta \gamma \right)}_{ijk}\\ {}+{\left( rab\gamma \right)}_{ijkm}+{\tau}_l+{\left(\alpha \tau \right)}_{il}+{\left(\beta \tau \right)}_{jl}+{\left(\alpha \beta \tau \right)}_{ijl}+{\left(\gamma \tau \right)}_{kl}+{\left(\alpha \gamma \tau \right)}_{ikl}\\ {}+{\left(\beta \gamma \tau \right)}_{jkl}+{\left(\alpha \beta \gamma \tau \right)}_{ijkl}+{\varepsilon}_{ijklm}\end{array}} $$
$$ i=1,2,3;j=1,2,3,4,5;k=1,2,3;l=1,\cdots, 11;m=1,2,3,4 $$

where yijklm is the response variable in block m, shade cloth i, clone j, and tray k at time l; μ is the overall mean; αi is the fixed effect of the type of shade cloth; βj, γk, and τl are the fixed effects of clone type, tray, and sampling time, respectively; (αβ)ij, (αγ)ik, (βγ)jk, (ατ)il, (βτ)jl, and (γτ)kl are the two-way interaction effects among shade cloth, clone, tray, and sampling time; (αβγ)ijk, (αβτ)ijl, (αγτ)ikl, (βγτ)jkl, and (αβγτ)ijkl are the three- and four-way interaction effects of the factors under study; rm, (ar)im, and (rabγ)ijkm are the random effects due to blocks, block × shade cloth (the whole-plot error), and block × shade cloth × clone × tray, assuming \( {r}_m\sim N\left(0,{\sigma}_r^2\right) \), \( {(ar)}_{im}\sim N\left(0,{\sigma}_{r\alpha}^2\right) \), and \( {\left( rab\gamma \right)}_{ijkm}\sim N\left(0,{\sigma}_{\alpha \beta \gamma \left(\mathrm{rep}\right)}^2\right) \); and εijklm is the random error {εijklm~N(0, σ2)}.

The following SAS program fits a GLMM in a split-split plot in time under a randomized complete block design with a Poisson response.

proc glimmix data=work.Nhojas_cafe nobound method=laplace;
 class shade clone tray rep time;
 model y = shade|clone|tray|time / dist=Poisson link=log;
 random intercept shade shade*clone*tray / subject=rep type=ar(1);
 lsmeans shade|clone|tray|time / lines ilink;
run;

Some of the results are listed below. To determine which correlation structure best fits this experimental design, five variance–covariance structures were tested (Table 5.35): compound symmetry (“CS”), first-order autoregressive (“AR(1)”), unstructured (“UN”), Toeplitz (“TOEP(1)”), and first-order ante-dependence (“ANTE(1)”). The structure to be tested is specified with the “type” option of the “random” statement; this is where the variance–covariance structure must be changed. The fit statistics indicate that the variance–covariance structure that best fits the model is the first-order autoregressive structure, AR(1), as shown by the goodness-of-fit statistics reported in Table 5.35.
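The AR(1) structure selected here assumes that the correlation between repeated measurements decays geometrically with their separation in time, corr(yt, yt+h) = ρ^h. A minimal sketch (illustrative Python; ρ = 0.5 and four time points are arbitrary values):

```python
def ar1_corr(n_times, rho):
    """n x n AR(1) correlation matrix: entry (i, j) is rho**|i - j|,
    so adjacent time points are most correlated and distant ones least."""
    return [[rho ** abs(i - j) for j in range(n_times)]
            for i in range(n_times)]

R = ar1_corr(4, 0.5)   # arbitrary illustrative values
```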

Table 5.35 Fit statistics for choosing the correlation structure

Table 5.36 shows the conditional fit statistics and variance component estimates. The fit statistic Pearson’s chi-square/DF = 0.57 in part (a) indicates that, in the conditional model, there is no evidence of misspecification of the distribution or the linear predictor. In other words, there is no overdispersion in the dataset, and it is therefore reasonable to base the analysis and inference on the Poisson model.
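The overdispersion diagnostic reported by GLIMMIX can be reproduced by hand: the generalized Pearson chi-square divided by its degrees of freedom should be close to 1 under a correctly specified Poisson model. A sketch with toy counts and fitted means (not the coffee-graft data):

```python
def pearson_chisq_ratio(y, mu, n_params):
    """Generalized Pearson chi-square divided by its degrees of freedom.

    For a Poisson model, Var(y) = mu, so each squared residual is scaled
    by the fitted mean.  A ratio well above 1 suggests overdispersion;
    a ratio near (or below) 1 supports the Poisson assumption.
    """
    chisq = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    df = len(y) - n_params
    return chisq / df

# Toy counts and fitted Poisson means, for illustration only:
y = [3, 5, 2, 7, 4, 6]
mu = [4.0, 4.5, 3.0, 6.0, 4.5, 5.0]
ratio = pearson_chisq_ratio(y, mu, n_params=2)
```

A value such as the 0.57 in Table 5.36 would, by this criterion, give no reason to abandon the Poisson model.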

Table 5.36 Conditional fit statistics and variance component estimates

The type III tests of fixed effects (Table 5.37) indicate highly significant main effects of shade cloth type (P = 0.0001), clone (P = 0.0001), and tray (P = 0.0001), as well as of most of the interactions, except for shade_cloth*clone (P = 0.3846), shade_cloth*tray*time (P = 0.9289), clone*tray*time (P = 0.9760), and shade_cloth*clone*tray*time (P = 0.2484).

Table 5.37 Type III fixed effects tests

The means and standard errors of each of the main effects, on the data scale, for shade_cloth, tray, and clone are shown in the “Mean” column in part (a) of Table 5.38, whereas in part (b), the mean comparisons for the type of shade cloth are shown.
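The data-scale means in the “Mean” column are obtained from the model-scale estimates by applying the inverse link, exp(η) for the log link (the `ilink` option in the `lsmeans` statement), with standard errors carried over by the delta method. A sketch with placeholder numbers, not the estimates in Table 5.38:

```python
import math

# Model-scale estimate (log link) and its standard error.
# Placeholder values for illustration, not Table 5.38.
eta, se_eta = 1.85, 0.07

# Data-scale mean: inverse link exp(eta).
mean = math.exp(eta)

# Delta method: SE(exp(eta)) is approximately exp(eta) * SE(eta).
se_mean = mean * se_eta
```

This is exactly the relationship between the “Estimate” and “Mean” columns (and their standard errors) in the lsmeans output.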

Table 5.38 Estimated means on the model scale and on the data scale for the shade cloth

Table 5.39 presents the estimates of the linear predictor (“Estimates” column) in terms of the model scale and treatment means in terms of the data scale (“Mean” column) for the type of clone (part (a)). In addition, in Table 5.39 (part (b)), the mean comparisons are presented for the type of clone.

Table 5.39 Estimated means on the model scale and on the data scale for the type of clone

Table 5.40 presents the estimates for the levels of the tray on both scales (part (a)). Similarly, in this table (part (b)), the treatment mean comparisons are presented for the levels of the tray.

Table 5.40 Estimated means on the model scale and on the data scale for the tray factor

Tables 5.41, 5.42, 5.43, and 5.44 show the means and standard errors on both scales of the two-factor and three-factor interactions.

Interaction type of shade cloth*clone

Table 5.41 Estimated means on the model scale and on the data scale for the type of shade cloth*clone

Interaction type of shade cloth*tray

Table 5.42 Estimated means on the model scale and on the data scale for the interaction type of shade cloth*tray

Interaction clone*tray

Table 5.43 Estimated means on the model scale and on the data scale for the clone–tray interaction

Interaction shade*clone*tray

Table 5.44 Estimated means on the model scale and on the data scale for the shade–clone–tray interaction

Although it is not the objective of this book, part of the results is discussed below. Figure 5.8 shows that the red shade cloth significantly stimulates leaf production in coffee grafts, followed by the black and pearl shade cloths. Leaf production in coffee grafts shows a bimodal pattern that may be due to factors such as humidity and temperature; extreme conditions of both factors cause stress at the growing points and, therefore, affect the emergence of leaves.

Fig. 5.8
Effect of shade cloth type on the average number of leaves. The highest averages occur at month 6 for the red cloth (about 8 leaves) and at month 5 for the black and pearl cloths (7.3 and 7.5, respectively); the lowest averages occur at month 2 (about 1 for black and pearl and 1.3 for red).

Regarding the type of clone used as rootstock, the clones showed their best average leaf production in months 5 and 6, whereas the lowest production was observed in months 1, 2, 8, and 9. The franc foot (own-rooted plant) showed a higher average number of leaves compared to the rest of the clones (Fig. 5.9).

Fig. 5.9
Effect of clone type on the average number of leaves. The highest averages occur at month 11 for the franc foot (about 9 leaves), at month 6 for clones 2 and 5 (6 and 8.1, respectively), and at month 5 for clones 1, 3, and 4 (about 8); the lowest averages occur at month 2 for all clones.

5.3 Exercises

Exercise 5.3.1

A researcher in the area of plant sciences wants to know how the number of shoots produced by an explant (yij) in an in vitro plant culture responds to different concentrations (ppm) of a chemical compound. The data for this experiment are given below (Table 5.45):

Table 5.45 In vitro culture (Conc = concentration in ppm)
  (a) Write down the analysis of variance table (sources of variation and degrees of freedom).

  (b) Write down the components of the GLMM.

  (c) Analyze the dataset with the model proposed in (b).

  (d) Compare and contrast the results of these analyses. If necessary, reanalyze the dataset using the same model as above, but now assume that the data have a negative binomial distribution.

  (e) Summarize the relevant results.

Exercise 5.3.2

Earthworms (Lumbricus terrestris L.) were counted in four replicates of a factorial experiment at the W.K. Kellogg Biological Station in Battle Creek, Michigan, in 1995. A 2⁴ factorial experiment was conducted. The factors and treatment levels were plowing (chiseled and unplowed), input level (conventional and low), manure application (yes/no), and crop (corn and soybean). The question of interest was whether L. terrestris density varies according to these management protocols and how the various factors act and interact. The table shows the total worm counts (per square foot, juvenile and adult worms, not pooled) for the 64 experimental units (2⁴ × 4) of the factorial design. The numbers in each cell of the table correspond to the counts in the replicates (Table 5.46).
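The treatment structure of a 2⁴ factorial can be enumerated directly; the sketch below (factor names paraphrased from the text) confirms the 16 treatment combinations and 64 experimental units:

```python
from itertools import product

# The four two-level management factors of the earthworm experiment.
factors = {
    "plowing": ["chiseled", "unplowed"],
    "input_level": ["conventional", "low"],
    "manure": ["yes", "no"],
    "crop": ["corn", "soybean"],
}

# All treatment combinations of the 2^4 factorial: 2*2*2*2 = 16.
treatments = list(product(*factors.values()))

# With 4 replicates per combination, the experiment has 64 units.
n_units = len(treatments) * 4
```

Listing the combinations this way is also a convenient starting point for building the design matrix or the CLASS-variable layout of the analysis.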

Table 5.46 Results of the experiment with earthworms
  (a) Write down the analysis of variance table (sources of variation and degrees of freedom).

  (b) Write down the components of the GLMM.

  (c) Analyze the dataset with the model proposed in (b).

  (d) Summarize the relevant results.

Exercise 5.3.3

This experiment involves an investigation of genotypic variation within cultivars of leek (Allium porrum L.) with respect to adventitious shoot formation in callus tissue. The data in Table 5.47 refer to 20 genotypes of 1 cultivar. Each genotype is represented by six calluses, and the observations are the number of shoots per callus. The data are subject to two sources of variation, i.e., variation between genotypes and variation between calluses within genotypes.

Table 5.47 Results of the callus tissue experiment
  (a) Write down the analysis of variance table (sources of variation and degrees of freedom).

  (b) Write down the components of the GLMM.

  (c) Analyze the dataset with the model proposed in (b).

  (d) Reanalyze the dataset using the same model as above, but now assume that the data have a negative binomial distribution.

  (e) Compare and contrast the results of these analyses.

  (f) Summarize the relevant results.

Exercise 5.3.4

In an experiment at the Research Institute for Animal Production “Schoonoord” in the Netherlands, the effects of active immunization against androstenedione on the fertility of Texel ewes were studied (Engel and te Brake 1993). The number of fetuses per ewe can be considered the net result of a process that determines the number of ovulations and a probability process by which these ovulations produce fetuses. The goals of this study are to model and analyze (a) the number of ovulations and the number of fetuses in relation to Fecundin (androstenedione-7a-carboxyethylthioether) treatment, animal age, and mating period, and (b) the number of fetuses in relation to treatment, animal age, and the number of ovulations observed. A summary of the experiment and of the data is shown below (Table 5.48).

Table 5.48 Factors T (treatment: 1: Fecundin; 2: control), A (age class: 1: A ≤ 0.5; 2: 0.5 < A ≤ 1.5; 3: 1.5 < A ≤ 2.5; 4: A ≥ 2.5 years), M (mating period: 1: October 1; 2: October 22), n (number of ovulations), and x (number of fetuses)

Of the 125 Texel ewes, 63 were treated with Fecundin, whereas the remaining 62 served as a control group. The ewes were sorted into four age classes (≤0.5, 0.5–1.5, 1.5–2.5, and >2.5 years) and two mating periods (starting on October 1 and October 22, 1986, respectively). Age was entered as a factor rather than as a covariate because the interactions with age are of interest and a factor is easier to handle. The number of animals in the four age classes was 25, 44, 24, and 32, respectively. Age class was evenly distributed across the combinations of mating period and treatment group. Ewes were slaughtered 75–80 days after the last mating, and the number of ovulations and the number of fetuses were determined. Ovulation numbers ranged from 1 to 5. For six animals, the number of ovulations was not known, so these ewes were excluded from the dataset.

  (a) Analyze the dataset using a GLMM with the linear predictor ηijkl = η + τi + αj + βk + (τα)ij + (τβ)ik + (ταβ)ijk + bl, where τ, α, and β are the fixed effects of treatment, age, and mating period, respectively, and bl is the random effect due to animal, assuming that each bl has a normal distribution with a zero mean and variance \( {\sigma}_b^2 \) and that the number of ovulations and the number of fetuses each follow a Poisson distribution.

  (b) From the analyses performed, do you observe the presence of overdispersion in the dataset? If so, propose an alternative distribution for the analysis of this dataset.

  (c) Reanalyze the dataset using the same model as before with the new data distribution.

  (d) Compare and contrast the results of these analyses.

  (e) Summarize the relevant results.
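Under the log link implicit in part (a), the linear predictor maps to the conditional Poisson mean as λ = exp(η + b). A sketch with hypothetical effect values (none of these numbers come from the ewe data):

```python
import math

# Hypothetical sum of fixed effects for one treatment-age-period cell:
# eta + tau_i + alpha_j + beta_k + the interaction terms.
eta_fixed = 0.9

# Hypothetical realized random animal effect, b_l ~ N(0, sigma_b^2).
b_l = 0.15

# Conditional mean number of ovulations for this animal under the log link.
lam = math.exp(eta_fixed + b_l)
```

The same mapping applies to the fetus counts, with its own set of effect estimates.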

Exercise 5.3.5

The following example deals with one of the most harmful insects of the root system of major crops, commonly known as the “blind hen” (white grub). The experiment consisted of six treatments formulated for larval control (A, B, C, D, E, and F) in a randomized block arrangement. The count per area gives the number of larvae in two age groups (a and b) (Table 5.49).

Table 5.49 Results of the blind hen experiment
  (a) Write down the analysis of variance table (sources of variation and degrees of freedom).

  (b) Write down the components of the GLMM.

  (c) Analyze the dataset with the model proposed in (b).

  (d) Does the model proposed in (b) adequately describe the variation observed in the dataset? Summarize the relevant results.