Introduction

The modern role of an analytical chemist in the field of mass spectrometry (MS) broadly falls across three categories: (1) understanding fundamental mechanisms, (2) innovating new preparatory, ionization, and analyzer technologies, and (3) validating emerging technologies. Precise and accurate measurements, that is, measurements made with minimal random error and no systematic bias, are paramount to achieving these goals. Statistical considerations therefore naturally enter mass spectrometry research at both the experimental setup and data analysis stages, and peer-reviewed journals are increasingly demanding rigorous statistics [1].

A natural component of MS technology development and validation is optimization. Design of experiments (DOE), the focus of this tutorial review, broadly encompasses the use of statistics to select the levels and combinations of experimental parameters on which response variables may be modeled and subsequently mathematically optimized. The adoption of DOE practices represents an emerging trend in mass spectrometry. This review frames its discussion of design selection and construction for a researcher with an understanding of statistical principles and a rudimentary understanding of modeling. For convenience, a glossary is provided in the supplemental material (ESM 1) that defines statistics terms, which are indicated by italics the first time they are introduced. The intent of this review is to enable a postdoctoral mass spectrometrist to utilize DOE to design and execute an optimization study independently, without a statistician consultant, and to introduce the concepts of DOE to graduate students who, with assistance, can incorporate these principles into thesis- and publication-level work. These concepts are emphasized in a decision tree (Figure 1) and detailed in a step-by-step procedure (ESM 2) using common GUI software.

Figure 1

Decision tree to select the appropriate DOE on the basis of the initial number of factors and level of information needed in the response model. The designs highlighted in blue are explored in more detail in the case studies

Historical Framework and Statistical Principles of DOE

DOE originated to solve rudimentary experimental problems, but its principles are directly transferable to advanced technologies such as mass spectrometry. Developed in 1926 by Ronald A. Fisher, DOE was first used to arrange agricultural field experiments in a geometry such that the results would be independent of environmental biases [2]. The principles used in this first design construction have emerged as pillars of statistics. The importance of replication in understanding variation was defined by many of his predecessors, including William “Student” Gosset. Other concepts were seeded by Fisher and expanded upon by his contemporaries, including Stuart Chapin (1950, randomization) [3], and Wald and Tukey (1943 and 1947, respectively, blocking) [4, 5].

In complex experimental designs involving more than one factor, the order in which factor settings are evaluated is chosen to adhere to the following three principles, which are defined in greater detail in the context of a previous mass spectrometry review [6].

  1. Blocking is introduced to account for known experimental biases. In mass spectrometry, drift can occur on a day-to-day basis and thus may serve as a natural demarcation for blocks.

  2. Randomization is performed within blocks and protects against unknown or uncontrollable sources of error. The common sources of error in mass spectrometry experiments are outlined in Table 1.

  3. Replication can be used to calculate the pure error derived from measurements. If measurement error is the primary source of variability, as in optimization studies, duplicating individual data points, rather than whole studies, is sufficient.

Table 1 Origins of variation and error in mass spectrometry experiments. DOE may be used to optimize conditions and control for known sources of error

In traditional optimization experiments, factors are sequentially tested one-factor-at-a-time (OFAT). Conclusions are drawn against the null hypothesis that there is no difference between two parameter settings, yet this type of testing remains susceptible to type I and type II errors. Even with careful attention to adequate sample size, the OFAT approach ignores potential synergies between factors and risks selecting parameter settings falling on local maxima rather than identifying the true optimum. The alternative to OFAT is DOE, which is based on a full factorial design strategy. In this strategy, all factors are combinatorially tested at specific levels simultaneously. More specifically, for each factor (k), a testing range (bounds) is selected based on experience, and two or more levels (X) within that range are tested; the total number of experiments equals \(X^k\). These data points can be modeled by standard linear regression techniques, and when the true optimum lies between the bounds, it will be identified.
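
To make the run count concrete, the following minimal sketch in base R enumerates a three-level, three-factor full factorial design; the factor names and bounds are hypothetical stand-ins chosen purely for illustration.

    ## Minimal sketch (base R): a full factorial enumerates every combination
    ## of factor levels, so the run count grows as X^k. The factor names and
    ## bounds below are hypothetical.
    design <- expand.grid(pH     = c(2, 7, 12),
                          temp_C = c(20, 45, 70),
                          flow   = c(0.1, 0.3, 0.5))
    nrow(design)  # 3^3 = 27 runs, which can then be modeled by linear regression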

The power of DOE is its ability to choose a subset of the full factorial data points, produce models with similar statistical power, and more efficiently locate the true optimum. It accomplishes this by estimating the experimental error from replication of a subset of points rather than the entire design, and by using randomization and blocking to reduce biases. Table 2 collates the applications of DOE in MS published from 2005 to 2015. The case studies presented below highlight designs that are most applicable to mass spectrometry optimization, provide insight into design selection, explain the statistical framework of each design, and discuss the type of information that may be obtained.

Table 2 A 2005–2015 tabulated review of mass spectrometry publications that employed DOE to optimize off-line preparations and extractions, on-line parameters related to MS ionization and detection of species, and post-acquisition analysis parameters. Publications were queried in Web of Science (search date: 10/30/2015) using the following terms: TOPIC: ("design of experiment*" or "fractional design" or "factorial design" or "screening design" or "central composite design" or "Taguchi") AND TOPIC:("mass spectrom*" or "electrospray" or "MALDI" or "GC-MS" or "LC-MS" or "TOF") Refined by: DOCUMENT TYPES: (ARTICLE). For a review of LC optimizations, please see Hibbert et al. (2012) [7]

MS Case Studies to Define DOE Tools

The focus of this manuscript is experimental design with the goal of optimizing a response that may be observed or measured. Depending on the level of familiarity with the system, the significant factors may be known but not optimized, or may need to be discovered. Consequently, the starting number of factors may be roughly broken into three classes: large (>14), mid-size (5–14), or small (2–4). Using this starting point, and by making certain assumptions about the level of detail needed in the response model, the appropriate design may be selected (Figure 1). The case studies below highlight key differences between the designs, provide insights into utilizing DOE, and explore their use for mass spectrometrists.

Case Study #1 (Zheng et al. 2013) [212]: Screening of a Large Number of Factors

Screening of Factors to Determine the Most Influential Variables

Design Utility:

Often a new or unfamiliar system needs to be optimized and it is not clear which of a large number of factors (>14) are relevant. Mass spectrometer control software (e.g., XCalibur) has up to 40 adjustable parameters relevant to ionization, detection, and accurate mass determination, and analysis software (e.g., Proteome Discoverer) is equally complicated, with over 50 adjustable settings. Screening designs are well suited for selecting a subset of factors for subsequent response surface analysis. Options for these DOEs include Plackett-Burman designs [363], described in the following case study, Taguchi arrays, and resolution III fractional factorial designs. As a rule of thumb, optimization cannot be implemented directly from these designs but should be performed in a subsequent study.

Study Summary:

Mass spectrometry analysis software is as important to ensuring accurate and robust results as the experimental preparation or ionization parameters. Though certain cut-offs, such as a 1% false discovery rate in proteomics, are accepted, many software choices are left to the discretion of analysts. Thus, optimization of search parameters represents an under-appreciated area in mass spectrometry. In a metabolomics study, Zheng and coworkers tackled the problem of minimizing the number of “unreliable” peaks due to low signal-to-noise and maximizing the number of reliable peaks identified by the widely used software XCMS [212], in two studies conducted on metabolite standards and human plasma.

Factor and Bound Selection:

Screening of 17 factors, of which six were qualitative variables, was desired. It is up to the experimenter to select bounds, or limits, that are wide enough that they do not bias the selection of or exclude the optimum, but narrow enough that they are experimentally feasible and do not expand the design space unnecessarily. In this study, two parallel DOEs (Design I and Design II), which shared a lower/upper bound limit, were executed to expand the range of the bounds that could be tested in the context of a two-level screening design study. This reflected an understanding on the part of the researchers that they did not have enough experience with the system to narrow the bounds sufficiently. An alternative approach would be to conduct a design with wide bounds, followed by a second screen centered on the preferred levels. Even with the additional design, however, the 17 factors were screened in a total of 39 runs, a 99.97% reduction in the number of experiments compared with the full factorial design of \(2^{17} = 131{,}072\) runs.

When experience and literature review are not sufficient to determine even a wide range of bounds, preliminary testing should be conducted to ensure that the factor limits produce real and accurate data. Because, in mass spectrometry, the lack of detection of an analyte indicates low abundance rather than absence, including this type of “zero” data arising from incompatible settings would improperly bias the response models; such data must instead be treated as “missing” or, less ideally, substituted with the limit of detection. The removal of these data points is particularly detrimental for DOE experiments and subsequent modeling because the number of experiments has already been statistically minimized. When these situations arise, as in the case of incompatible MALDI matrix compositions in Brandt et al.’s optimization study [338], alternative DOEs that do not sample those points must be used, or the bounds must be changed.

DOE Construction:

Plackett-Burman designs allow the main effects of up to n − 1 factors to be estimated, where n is the number of runs. Screening designs generally test factors at two levels and cannot be used to estimate any interactions. A description of the mathematics used to construct a screening design array [364] is beyond the scope of this review, but such arrays are readily generated by all statistical software.
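
As an illustration of how simple the construction is in practice, the sketch below builds the classic 12-run Plackett-Burman array in base R from a commonly published cyclic generator row; in routine use, statistical software generates (and randomizes) such arrays automatically.

    ## Sketch (base R): the 12-run Plackett-Burman array, which screens the
    ## main effects of up to 11 two-level factors. Rows 1-11 are cyclic
    ## shifts of a published generator row; run 12 sets all factors low.
    gen  <- c(1, 1, -1, 1, 1, 1, -1, -1, -1, 1, -1)
    rows <- t(sapply(0:10, function(s) gen[(seq_along(gen) + s - 1) %% 11 + 1]))
    pb12 <- rbind(rows, rep(-1, 11))
    dim(pb12)        # 12 runs x 11 factor columns; run order should be randomized
    crossprod(pb12)  # off-diagonal entries are 0: the factor columns are orthogonal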

Results:

In this XCMS analysis optimization study, significant changes to the mean response value were caused by 10 of 17 factors. Six of these factors had a negative correlation with the number of reliable peaks, and thus were set to their lowest level, which corresponded to the default value. The remaining four factors were followed up with a central composite design (CCD) study, which produces detailed response surface models and is described in depth in case study #3. The combination of these two designs resulted in a 10% increase in reliable peaks in the standard solution, and approximately a 30% increase in reliable peaks in plasma, thus demonstrating the power of this approach for complex samples and a large number of variables.

Case Study #2 (Zhang et al. 2014) [194]: Approaches for a Mid-Range Number of Factors

Higher Level Fractional Designs to Estimate Main and Interaction Effects

Design Utility:

DOE for a mid-size study (5–14 factors) provides enough data points (degrees of freedom) to model parameters with a quadratic equation that includes two-way interactions, or synergies; however, it excludes three-way and higher interactions between factors. Optimization may be executed directly when it is appropriate to make a statistical assumption known as the “sparsity of effects principle” or the “heredity principle.” These principles were first discussed by Wu and Hamada (1992), who detailed that the probability of an interaction being both real and statistically significant becomes much lower as the order of the interaction increases [365]. Resolution IV fractional factorial designs [366] and definitive screening designs [367, 368] (ESM 2) directly facilitate optimization of a mid-range number of factors and may be constructed using free or proprietary software packages described in a recent review [7].

Study Summary:

The goal of the study performed by Zhang et al. was to optimize spectrum-to-spectrum reproducibility by adjusting the input parameters used in the matrix-assisted laser desorption ionization (MALDI) control software for automated data acquisition (Bruker FlexControl software, v3.0, Bruker Daltonics) [194]. The parameters in the design were selected according to a resolution IV fractional factorial design, which sacrifices the modeling of three-way and higher interactions that a full response surface model would provide. The justification was based on the authors’ understanding of the sparsity principle and their desire to minimize the number of experiments: 19 runs for the resolution IV fractional factorial design versus 48 for a full response surface design and 32 for a two-level full factorial design. They detected metabolites from a Pseudomonas aeruginosa cell culture suspension mixed with sinapinic acid matrix and then calculated the Pearson product–moment correlation coefficient between spectra as the mathematical definition of reproducibility.

Response Variable Selection:

Ideally, continuous response variables, such as a correlation coefficient, should be chosen unless the researcher is familiar with generalized linear modeling techniques. For example, the number of peaks observed in a mass spectrum is a Poisson count because the value must be a non-negative integer; a better alternative would be to model the abundances of selected analytes.

DOE Construction: Aliasing:

The use of any fractional factorial design [366], regardless of its resolution, requires an understanding of how the design matrix affects the number of estimable interactions. As summarized in this case study, for a resolution IV design, all two-way interactions/synergies may be modeled; however, the model estimates for the interactions carry more error than those for the main effects because the interactions are confounded, or aliased, with each other. Design software will automatically choose the combinations of parameter levels in the design matrix to minimize errors across the entire model. If preexisting insights into the response surface are available, the researcher may choose to override aspects of this construction, usually when an interaction is already known and very accurate model estimates are desired for that term. Thus, aliasing is a concept that should be well grasped by any experimenter employing DOE.

Mathematically, when models are built on response variables, the factor effects are estimated as the average change in the response between the bounds, averaged over the levels of all other factors. For example, in a full factorial design with five factors named A, B, C, D, and E (Yates notation), the effect of A (Equation 1) is the average of all responses obtained at the high (+) level of A minus the average of all responses obtained at the low (−) level of A [364]. These estimates may also be obtained through standard least squares analysis (Equation 2) as regression coefficients. Though the values from the two methods will differ, their relative magnitudes and directions will stay approximately the same. Because these effects are computed as sums and differences of the total responses, they are formally named contrast values; in the literature, however, the terms “effect” and “contrast” have become interchangeable.

Equation 1. Method to calculate the effect of A directly as an average estimate. The y-responses obtained at runs at the high level of A minus those at the low level are averaged. Run parameters are indicated by a lower case letter representing the high level of a factor or the absence of a letter representing a low level setting; the run with all factors low is denoted (1). In a five-factor full factorial design, there are \(2^5 = 32\) total runs, with 50% of the runs \( \left(\frac{1}{2}\times {2}^5 = 16\right) \) taken at each of the high and low levels of A.

$$ A=\frac{a+ab+ac+ad+ae+abc+abd+\cdots +abcde}{\frac{1}{2}\times {2}^5}-\frac{(1)+b+c+d+e+bc+bd+\cdots +bcde}{\frac{1}{2}\times {2}^5}, $$
(1)

where lower case letters indicate the testing of the (+) level of a factor.

Equation 2. Method of least squares regression to obtain regression coefficients.

$$ \begin{array}{l} y=\left({\beta}_a{x}_a+\cdots +{\beta}_e{x}_e\right)+\left({\beta}_{ab}{x}_a{x}_b+\cdots +{\beta}_{de}{x}_d{x}_e\right)+\left({\beta}_{abc}{x}_a{x}_b{x}_c+\cdots +{\beta}_{cde}{x}_c{x}_d{x}_e\right)+\\ {}\left({\beta}_{abcd}{x}_a{x}_b{x}_c{x}_d+\cdots +{\beta}_{bcde}{x}_b{x}_c{x}_d{x}_e\right)+{\beta}_{abcde}{x}_a{x}_b{x}_c{x}_d{x}_e, \end{array} $$
(2)

where \(y\) is the response variable, \(x_a\)–\(x_e\) are the parameter settings for factors a–e, \(\beta_a\)–\(\beta_e\) are the main effect regression coefficients, \(\beta_{ab}\)–\(\beta_{de}\) are the secondary (two-way) effect regression coefficients, and so forth.
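
A minimal numerical sketch in R may help connect the two equations. With a simulated response on a two-level, five-factor full factorial (the underlying model and its coefficients are invented for illustration), the effect of A from Equation 1 equals twice the least squares coefficient from Equation 2, because with −1/+1 coding the regression coefficient measures change per unit of the factor while the effect measures change across the full −1 to +1 range.

    ## Sketch (base R): effect of A via the Equation 1 contrast versus the
    ## least squares coefficient of Equation 2. The response is simulated
    ## from an invented model purely for illustration.
    set.seed(1)
    design <- expand.grid(A = c(-1, 1), B = c(-1, 1), C = c(-1, 1),
                          D = c(-1, 1), E = c(-1, 1))         # 2^5 = 32 runs
    y <- with(design, 10 + 3 * A + 1.5 * A * B + rnorm(32, sd = 0.5))

    ## Equation 1: mean response at the high level of A minus at the low level
    effect_A <- mean(y[design$A == 1]) - mean(y[design$A == -1])

    ## Equation 2: regression; with -1/+1 coding, effect = 2 x coefficient
    fit <- lm(y ~ (A + B + C + D + E)^2, data = design)
    c(effect_A, 2 * coef(fit)["A"])   # approximately equal, both near 6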

Fractional factorial designs are a subset of factorial designs and result in \(X^{k-p}\) runs, where p is defined by the resolution. A consequence of producing experimental designs with a reduced number of runs is a loss of information because effects become confounded, or aliased, with each other. Mathematically, this means that the linear combination of factor levels, as noted in Equation 1, is identical for two or more effects. Practically, this means that, depending on the resolution of the fractional factorial design or the DOE chosen, certain effects may carry higher error or may not be estimable at all.
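
Aliasing can be seen directly in the design matrix. In the hypothetical half-fraction sketched below (a \(2^{4-1}\) resolution IV design built with the generator D = ABC), the AB and CD interaction columns are identical, so the two effects cannot be separated.

    ## Sketch (base R): complete aliasing in a 2^(4-1) resolution IV
    ## fraction built with the generator D = ABC. The AB and CD columns
    ## of the model matrix coincide, so their effects are indistinguishable.
    frac   <- expand.grid(A = c(-1, 1), B = c(-1, 1), C = c(-1, 1))
    frac$D <- frac$A * frac$B * frac$C          # generator: D = ABC
    all(frac$A * frac$B == frac$C * frac$D)     # TRUE: AB is aliased with CD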

Results:

In the two-level, five-factor, resolution IV design employed in this study, 16 runs, or one-half of the full factorial design, were statistically selected such that all main effects could be estimated with excellent certainty. All two-way interactions could be estimated, but they are partially confounded with each other, and higher order effects could not be estimated because they are completely aliased. The design was further augmented with a center point, or “0” level, to allow quadratic terms to fit a curved surface. The application of nonlinear functions is well established in mass spectrometry; for example, the effect of fluence on ion signal in MALDI is logarithmic [369], and the effect of hydrophobicity on ionization efficiency is approximately quadratic [370]. Using this modeling approach, three of the main effects, two interactions, and a quadratic term were determined to be significant.

The DOE models were mathematically optimized to achieve real, statistically significant gains in reproducibility, up to 98%, demonstrating the efficiency and power of DOE. A second advantage of constructing mathematical models is that they can reveal empirical relationships between the response and each factor that would not otherwise be apparent in OFAT studies. In this study, the negative correlation between reproducibility and base peak resolution was surprising and informative, as the two were formerly assumed to be positively correlated. Such findings can spur hypothesis generation and future research that would otherwise not receive attention.

Case Study #3 (Switzar et al. 2011) [8]: Response Surface Studies

Full Models May Be Generated to Characterize the Response Surface and Most Accurately Find the Optimum

Design Utility:

The most precise optimization studies are performed on experiments with a small number (2–4) of factors because the full response surface may be characterized in a reasonable number of runs (case study #3, Figure 2). These designs test the greatest number of levels per factor, generate models with high resolution, and require a good understanding of the system to constrain the bounds to a narrow-to-moderate range.

Figure 2

Construction of linear, interaction, quadratic, and response surface models (Y) based on sampling points for three continuous variables (A, B, C). As the models grow in complexity, the predicted optimum better approximates the true optimum, shown as a star. The data on which the models were fit were obtained from Gao et al. (2014) [372]. Models were fit in R using the “rsm” package [373] and its associated dependencies.

Study Summary:

In the study performed by Switzar et al., the goal was to optimize protein digestion conditions for small molecule (drug)–protein complexes. Optimization of digestions had been performed previously to maximize protein coverage [9], but protein complexes pose unique challenges and potentially called for the establishment of new parameters. As the factors affecting tryptic enzymatic activity had been established (pH, time, temperature), this offered an opportunity to perform a focused response surface study to determine optimal settings. The best options for this type of DOE include a central composite design, detailed below, a Box-Behnken design, or a fractional factorial design of resolution V.

DOE Construction: Orthogonality and Rotatability:

A central composite design (CCD) is effectively an augmented fractional design. Extra design points, called axial points, are added to the design matrix such that five levels of each factor are tested. Experiments may be designed to be orthogonal and/or rotatable [371]. Rotatability ensures that the variance associated with the predicted response is uniform over the entire design space. Orthogonality mathematically ensures that the parameters in the model are estimated independently, despite the addition of blocking and replicates to the design; practically, this results in model estimates with greater precision. The axial points are placed at a distance (α) from the center point, and the value of α directly affects the orthogonality and rotatability of a design.
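
As a concrete illustration, the sketch below uses the rsm package (the same package cited in the Figure 2 caption) to generate a rotatable three-factor CCD; the factor names and codings are hypothetical stand-ins for the digestion parameters, not the settings used by Switzar et al.

    ## Sketch using the rsm package: a rotatable three-factor CCD. For k
    ## factors, rotatability requires alpha = (2^k)^(1/4); for k = 3 this
    ## is about 1.682, giving five levels per factor (-a, -1, 0, +1, +a).
    ## The factor names and codings below are hypothetical.
    library(rsm)
    des <- ccd(3, n0 = c(4, 2), alpha = "rotatable",
               coding = list(x1 ~ (pH - 8) / 1,
                             x2 ~ (time_h - 12) / 6,
                             x3 ~ (temp_C - 45) / 15))
    ## After running the design and recording a response y:
    ## fit <- rsm(y ~ SO(x1, x2, x3), data = des)  # full second-order model
    ## summary(fit)  # coefficients, lack of fit, stationary point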

Results:

In this study, global protein digestion coverage and production of the specific peptide complexed to the drug were equally valued responses. Computational software can iteratively vary parameters, determine the response range, and select levels that are, on average, the best across multiple models. This is important for any study looking simultaneously at the production of multiple peptides whose abundances are equally important. Optimization accomplished through this method resulted in over 90% tryptic coverage and a good peak area (\(7.7 \times 10^5\)) for the drug-complexed peptide. Importantly, the authors validated their gains in tryptic and thermolysin digestion by translating the method from a model drug–HSA (human serum albumin) system to a “clinically relevant HSA adduct.” Significant gains in protein coverage were observed with thermolysin only, yet significant increases in the abundance of the targeted drug–peptide complex were observed with both protocols (up to 7-fold) over established literature parameters.
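
A minimal sketch of this idea in base R: two hypothetical fitted response surfaces (invented stand-ins for models fit to the CCD data) are rescaled to a common range, and a grid search picks the coded settings that maximize their average, a crude version of the formal desirability-function approach.

    ## Sketch (base R) of multi-response optimization by grid search. The
    ## two response functions are invented stand-ins for fitted CCD models;
    ## each prediction is rescaled to [0, 1] and the average is maximized.
    f_cov  <- function(x1, x2, x3) 90 - 2*(x1 - 0.5)^2 - x2^2 - (x3 + 0.3)^2
    f_area <- function(x1, x2, x3) 7e5 - 1e5*((x1 - 0.2)^2 + (x2 - 0.4)^2 + x3^2)
    grid <- expand.grid(x1 = seq(-1.68, 1.68, length.out = 21),
                        x2 = seq(-1.68, 1.68, length.out = 21),
                        x3 = seq(-1.68, 1.68, length.out = 21))
    p1 <- with(grid, f_cov(x1, x2, x3))
    p2 <- with(grid, f_area(x1, x2, x3))
    score <- (p1 - min(p1)) / diff(range(p1)) + (p2 - min(p2)) / diff(range(p2))
    grid[which.max(score), ]   # coded factor settings balancing both responses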

Additional Considerations in Constructing Designs

The starting point in DOE is to consider what type of information is needed and therefore what model or design is desired (Figure 1). A good rule of thumb for larger studies is to spend about a quarter of the laboratory group’s time and financial resources on a preliminary screening DOE, followed by an in-depth response surface study (or studies) of the statistically important variables [374].

As detailed in each case study, multiple designs are available for each class of study. Software accessibility may be a driving factor in choosing the design, since computer-aided design programs vary somewhat in the preformulated DOEs available. The development of these GUI software packages over the last decade is primarily responsible for enabling scientists to responsibly design and execute DOEs without requiring great statistical expertise [7]. In the supplement (ESM 2), we provide a step-by-step example, complete with software screenshots, demonstrating how to use one such package to construct a designed experiment.

Choosing the complexity of a response surface design involves weighing the trade-off between how precisely the optimal point is estimated and the complexity of the design (ESM 2, Figure S1). As shown in Figure 2, increasing the number or types of regression coefficients allows the optimum to be located more precisely, especially in the case of quadratic functions. However, the cost in efficiency, that is, the number of additional experimental runs required for a more robust DOE that adds three-way or higher interactions, may not be worth a small gain, for example 5%, in response. In the three-factor example of Figure 2, the linear, interaction, quadratic, and response surface models require 4, 8, 11, and 20 runs to estimate 3, 6, 6, and 9 regression coefficients, respectively. Visually, the implications of using lower resolution models can be seen by looking at the value of the predicted response, shown as a heat map, at the optimized factor settings. Only in the quadratic and response surface models is the response maximized at the optimal conditions.
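
The coefficient counts quoted above can be verified directly; the sketch below (base R, with a random placeholder response) fits the four model types to a three-factor design and counts the non-intercept terms.

    ## Sketch (base R): coefficient counts for the four model types in the
    ## three-factor example; the response here is a random placeholder.
    d   <- expand.grid(A = c(-1, 0, 1), B = c(-1, 0, 1), C = c(-1, 0, 1))
    d$y <- rnorm(nrow(d))
    models <- list(linear      = y ~ A + B + C,
                   interaction = y ~ A + B + C + A:B + A:C + B:C,
                   quadratic   = y ~ A + B + C + I(A^2) + I(B^2) + I(C^2),
                   full_rsm    = y ~ (A + B + C)^2 + I(A^2) + I(B^2) + I(C^2))
    sapply(models, function(f) length(coef(lm(f, data = d))) - 1)  # 3, 6, 6, 9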

Power calculations should be used to minimize type II (false negative) errors and to determine the sample size needed to compare parameter settings. Because the random variation in an optimization study is small relative to that in biological studies, the sample sizes required for sufficient power are generally much lower. DOE software packages allow one to evaluate the power of various designs.
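
For a rough sense of scale, base R’s power.t.test can be used; in the hypothetical comparison below, an effect twice the replicate standard deviation requires only about six to seven replicates per setting for 90% power.

    ## Sketch (base R): replicates needed per parameter setting to detect a
    ## difference of 2 units when the replicate SD is 1 (both values are
    ## hypothetical), at a 5% significance level with 90% power.
    power.t.test(delta = 2, sd = 1, sig.level = 0.05, power = 0.9)
    ## n comes out near 6.4, so 7 replicates per setting would suffice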

Conclusions

The upfront planning to design an experiment using well-defined statistical tools ultimately saves time and resources. Post-hoc modeling yields concrete results, produces optimized conditions over multiple responses, and elucidates interactions that may be governed by novel mechanisms. This type of analysis is well suited to the needs of the modern mass spectrometrist and may be executed by any well-equipped laboratory with access to statistical software.