Quality of Analytical Measurements: Statistical Methods for Internal Validation☆
M Cruz Ortiz, Departamento de Química, Facultad de Ciencias, Universidad de Burgos, Burgos, Spain
Luis A Sarabia and M Sagrario Sánchez, Departamento de Matemáticas y Computación, Facultad de Ciencias, Universidad de Burgos,
Burgos, Spain
Ana Herrero, Departamento de Química, Facultad de Ciencias, Universidad de Burgos, Burgos, Spain
© 2019 Elsevier Inc. All rights reserved.
This article is an update of M.C. Ortiz, L.A. Sarabia, M.S. Sánchez, A. Herrero, 1.02—Quality of Analytical Measurements: Statistical Methods for Internal
Validation, Editor(s): Steven D. Brown, Romá Tauler, Beata Walczak, Comprehensive Chemometrics, Elsevier, 2009, pp. 17–76.
Introduction 3
Confidence and Tolerance Intervals 7
Confidence Interval 8
Confidence Interval on the Mean of a Normal Distribution 8
Case 1: Known variance 8
Case 2: Unknown variance 10
Confidence Interval on the Variance of a Normal Distribution 10
Confidence Interval on the Difference in Two Means 11
Case 1: Known variances 11
Case 2: Unknown variances 11
Case 3: Confidence interval for paired samples 12
Confidence Interval on the Ratio of Variances of Two Normal Distributions 12
Confidence Interval on the Median 12
Joint Confidence Intervals 13
Tolerance Intervals 13
Case 1: β-content tolerance interval 13
Case 2: β-expectation tolerance interval 14
Case 3: Distribution free intervals 14
Hypothesis Tests 15
Elements of a Hypothesis Test 15
Hypothesis Test on the Mean of a Normal Distribution 18
Case 1: Known variance 18
Case 2: Unknown variance 18
Case 3: The paired t-test 19
Hypothesis Test on the Variance of a Normal Distribution 19
Hypothesis Test on the Difference in Two Means 20
Case 1: Known variances 20
Case 2: Unknown variances 21
Test Based on Intervals 22
Hypothesis Test on the Variances of Two Normal Distributions 22
Hypothesis Test on the Comparison of Several Independent Variances 24
Case 1: Cochran’s test 24
Case 2: Bartlett’s test 25
Case 3: Levene’s test 25
Goodness-of-Fit Tests: Normality Tests 26
Case 1: Chi-square test 26
Case 2: D’Agostino normality test 26
One-Way Analysis of Variance 27
The Fixed Effects Model 28
Power of the Fixed Effects ANOVA model 30
Uncertainty and Testing of the Estimated Parameters in the Fixed Effects Model 31
Case 1: Orthogonal contrasts 32
Case 2: Comparison of several means 32
The Random Effects Model 34
Power of the Random Effects ANOVA model 35
Confidence Intervals for the Estimated Parameters in the Random Effects Model 35
☆Change History: October 2019: M. Cruz Ortiz, Luis A. Sarabia, M. Sagrario Sánchez, Ana Herrero added MATLAB live-scripts for the computations; re-written
introduction to tolerance intervals; corrected estimates in Table 13; updated texts; corrected mistakes and updated references.
Comprehensive Chemometrics 2nd edition: Chemical and Biochemical Data Analysis https://doi.org/10.1016/B978-0-12-409547-2.14746-8
Statistical Inference and Validation 36
Trueness 36
Precision 37
Statistical Aspects of the Experiments to Determine Precision 39
Consistency Analysis and Incompatibility of Data 39
Case 1: Elimination of data 39
Case 2: Robust methods 41
Accuracy 43
Ruggedness 43
Appendix 45
Some Basic Elements of Statistics 45
The Normal Distribution 46
Student’s t Distribution 46
The χ2 (Chi-square) Distribution 47
The F Distribution 48
Convergence of Random Variables 48
Some Computational Aspects 48
Normal distribution 49
Student’s t distribution with ν degrees of freedom 49
χ2 distribution with ν degrees of freedom 49
F distribution with ν1 and ν2 degrees of freedom 49
Power for the z-test, Eq. 49
Power for the t-test, Eq. 50
Power for the chi-square test, Eq. 50
Power for the F-test, Eq. 50
Power for fixed effects ANOVA, Eq. 50
Power for random effects ANOVA, Eq. 50
References 50
Nomenclature
1 − α Confidence level
1 − β Power
CCα Limit of decision
CCβ Capability of detection
Fν1,ν2 F distribution with ν1 and ν2 degrees of freedom (d.f.)
H0 Null hypothesis
H1 Alternative hypothesis
N(μ,σ) Normal distribution with mean μ and standard deviation σ
NID(μ,σ) (Normally and Independently Distributed) independent random variables equally distributed as normal with mean μ and standard deviation σ
s Sample standard deviation
s2 Sample variance
tν Student’s t distribution with ν degrees of freedom (d.f.)
x̄ Sample mean
V(X) Variance of the random variable X
α Significance level, probability of type I error
β Probability of type II error
Δ Bias (systematic error)
ε Random error
μ Mean
ν Degree(s) of freedom, d.f.
σ Standard deviation
σ2 Variance
σR Reproducibility (as standard deviation)
σr Repeatability (as standard deviation)
χ2ν χ2 (chi-square) distribution with ν degrees of freedom
Introduction
Every day millions of analytical determinations are made in thousands of laboratories all around the world. These measurements
are necessary for assessment of merchandise in the commercial interchanges, supporting health care, maintaining security, for
quality control of water and environment, for characterization of raw materials and manufactured products, and for forensic
analyses. Practically every aspect of contemporary social activity relies in some way on analytical measurements. The cost
of these measurements is high, but the cost of decisions made on the basis of incorrect results is much greater. For example, a test that
wrongly shows the presence of a forbidden substance in a food destined for human consumption can result in an expensive claim,
the confirmation of the presence of a drug of abuse can lead to serious judicial sentences, or doping in sport may result in
severe sanctions. The importance of providing a correct result is evident, but it is equally important to be able to prove that the result
is correct.
Once an analytical problem is posed to a laboratory and the analytical method is selected, the next step is the in-house validation
of the method. This is the process of defining the analytical requirements to respond to the problem and to confirm that the
considered method has performance characteristics consistent with those required. The results of the validation experiments must
be evaluated in order to ensure that the method meets the measurement required specification.
The set of operations to determine the value of an amount (measurand) suitably defined is called the measurement. The method
of measurement is the sequence of operations that is used when conducting the measurements. It is documented with enough
details so that the measurement may be done without additional information.
Once a method is designed or selected, it is necessary to evaluate its performance characteristics and to identify the factors that
can change these characteristics and to what extent they can change. If, in addition, the method is developed to solve a particular
analytical problem, it is necessary to verify that the method is fit for purpose.1 This process of evaluation is called validation of the
method. It implies the determination of several parameters that characterize the method performance: decision limit, capability of
detection, selectivity, specificity, ruggedness, and accuracy (trueness and precision). In any case, it is the measurements
themselves that allow evaluation of the performance characteristics of the method and its fit for purpose. In addition, when
using the method, the obtained measurements are also the ones that will be used to make decisions on the analyzed sample, for
example, whether the amount of an analyte fulfills a legal specification. Therefore, it is necessary to suitably model the data that a
method provides. In what follows we will consider that the data provided by the analytical method are real numbers; other
possibilities exist: for example, counts of bacteria or of impacts in a detector take only (discrete) natural values, and sometimes
the data resulting from an analysis are qualitative, for example, the identification of an analyte through its m/z ratios in a mass
spectrometry-chromatography analysis.
With regard to the analytical measurement, it is admitted that the value, x, provided by the method of analysis consists of three
terms, the true value of the parameter μ, a systematic error (bias) Δ, and a random error ε with zero mean, in an additive way as
expressed in Eq. (1):

x = μ + Δ + ε  (1)

All the possible measurements that a method can provide when analyzing a sample constitute the population of the measurements.
This is indeed a theoretical situation because it is being assumed that there are infinitely many samples and that the method of
analysis remains unalterable. In these conditions, the model of the analytical method, Eq. (1), is mathematically a random variable,
X, with mathematical expectation μ + Δ and variance equal to the variance of ε; in statistical notation, E(X) = μ + Δ and
V(X) = V(ε), respectively.
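The model of Eq. (1) can be illustrated by simulation. The article's own supplementary scripts are in MATLAB; the sketch below uses Python instead, with invented values μ = 6, Δ = 0.2 and σ = 0.5 that are not taken from the text.

```python
import numpy as np

# Simulate Eq. (1): x = mu + Delta + eps, with eps ~ N(0, sigma).
# mu, Delta and sigma here are illustrative values, not the article's data.
rng = np.random.default_rng(seed=1)
mu, delta, sigma = 6.0, 0.2, 0.5
n = 100_000

x = mu + delta + rng.normal(0.0, sigma, size=n)

# The sample mean approaches E(X) = mu + Delta and the sample
# variance approaches V(X) = V(eps) = sigma**2.
print(round(x.mean(), 2))       # close to mu + Delta = 6.2
print(round(x.var(ddof=1), 2))  # close to sigma**2 = 0.25
```

Note that the simulation cannot separate μ from Δ: only their sum is observable, which is why estimating bias requires samples of known concentration.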
A random variable, and thus the analytical method, is described by its cumulative distribution function FX(x), that is, the
probability that the method provides measurements less than or equal to x for any value x. Symbolically, this is written as
FX(x) = pr{X ≤ x} for any real value x. In most of the applications, it is assumed that FX(x) is differentiable, which implies, among other
things, that the probability of obtaining exactly a specific value is zero. In the case of a differentiable cumulative distribution
function, the derivative of FX(x) is the probability density function (pdf) fX(x). Any function f(x) such that it is positive, f(x) ≥ 0, and
the area under the function is 1, ∫R f(x)dx = 1, is the pdf of a random variable. The probability that the random variable X takes
values in the interval [a, b] is the area under the pdf over the interval [a, b], that is,

pr{X ∈ [a, b]} = ∫ab f(x) dx  (2)

and the mean and variance of X are written as in Eqs. (3), (4), respectively

E(X) = ∫R x f(x)dx  (3)

V(X) = ∫R (x − E(X))2 f(x)dx  (4)
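Eqs. (3) and (4) can be checked numerically for, say, the rectangular pdf of Fig. 1A. A Python/SciPy sketch (the limits [4.94, 7.06] are taken from the figure caption):

```python
from scipy.integrate import quad

# Rectangular pdf on [l, u]: f(x) = 1/(u - l) inside the interval, 0 outside.
l, u = 4.94, 7.06
f = lambda x: 1.0 / (u - l)

# Eq. (3): E(X) = integral of x*f(x); Eq. (4): V(X) = integral of (x - E(X))^2 * f(x).
mean, _ = quad(lambda x: x * f(x), l, u)
var, _ = quad(lambda x: (x - mean) ** 2 * f(x), l, u)

print(round(mean, 3), round(var, 3))  # mean 6, variance (u - l)**2 / 12
```

The numerical result reproduces the known closed form for the uniform distribution, variance (u − l)²/12 ≈ 0.375, matching the caption of Fig. 1.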
In general, mean and variance do not characterize in a unique way a random variable and therefore neither the method of analysis.
Fig. 1 shows the pdf of four random variables with the same mean 6.00 and standard deviation 0.61.
Fig. 1 Probability density functions of four random variables with mean 6 and variance 0.375. (A) Uniform in [4.94, 7.06]; (B) Symmetric triangular in [4.5, 7.5];
(C) Normal N(6, 0.61); (D) Weibull with shape 1.103 and scale 0.7 shifted to give a mean of 6. Dotted vertical lines mark the interval [5.0, 7.0].
These four distributions, uniform or rectangular (Fig. 1A), triangular (Fig. 1B), normal (Fig. 1C), and Weibull (Fig. 1D), are
frequent in the scope of analytical determinations, and they appear in Appendix E of the EURACHEM/CITAC Guide1 and also they
are used in metrology.2
If the only available information regarding a quantity X is the lower limit, l, and the upper limit, u, but the quantity could be
anywhere in between, with no idea of whether any part of the range is more likely, then a rectangular distribution in the interval [l, u]
would be assigned to X. This is so because it is the pdf that maximizes the “information entropy” of Shannon, in other words the pdf
that adequately characterizes the incomplete knowledge about X. Frequently, in reference materials, the certified concentration is
expressed in terms of a number and unqualified limits (e.g., 1000 ± 2 mg L−1). In this case, a rectangular distribution should be
used (Fig. 1A).
When the available information concerning X includes the knowledge that values close to c (between l and u) are more likely
than those near the bounds, the adequate distribution is a triangular one (Fig. 1B), with the maximum of its pdf in c.
If a good location estimate, m, and a scale estimate, s, are the only information available regarding X, then, according to the
principle of maximum entropy, a normal probability distribution N(m,s) (Fig. 1C) would be assigned to X (remember that m and s
may have been obtained from repeated applications of a measurement method).
Finally, the Weibull distribution (Fig. 1D) is very versatile; it can mimic the behavior of other distributions such as the normal or
exponential. It is adequate for the analysis of reliability of processes, and in chemical analysis it is useful in describing the behavior
of the figures of merit of a long-term procedure. For example, the distribution of the capability of detection CCβ3 is a Weibull one, as is
the distribution of the determinations of ammonia in water by UV-vis spectroscopy during 350 different days in Aldama.4
In the four cases given in Fig. 1, the probability of obtaining values between 5 and 7 has been computed with Eq. (2). For the
uniform distribution (Fig. 1A) this probability is 0.94, whereas for the triangular distribution (Fig. 1B) it is 0.88, for the normal
distribution (Fig. 1C) 0.90, and for the Weibull distribution (Fig. 1D) 0.93. Sorting in decreasing order of the proportion of values
that each distribution accumulates in the interval [5.0, 7.0], we have uniform, Weibull, normal, and triangular, although the triangular
and normal distributions tend to give values symmetrically around the mean and the Weibull distribution does not. If another interval
is considered, say [5.4, 6.6], the distributions accumulate probabilities of 0.57, 0.64, 0.67, and 0.54, respectively; now the
differences among the values are larger than before and, in addition, the distributions sort as normal, triangular, uniform, and Weibull.
Table 1 Values of b such that p = pr{X < b}, where X is each one of the random variables defined in the caption of Fig. 1.

p     Uniform      Triangular   Normal       Weibull
0.01  4.96         4.71         4.58 (min)   5.34 (max)
0.05  5.05         4.97 (min)   5.00         5.37 (max)
0.50  6.00 (max)   6.00 (max)   6.00 (max)   5.83 (min)
0.95  6.95 (min)   7.03         7.01         7.22 (max)
0.99  7.04 (min)   7.29         7.42         8.12 (max)

(min), minimum b among the four distributions; (max), maximum b among the four distributions.
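The interval probabilities of Eq. (2) and the quantiles b of Table 1 can be reproduced with standard distribution functions. The article's supplementary code is MATLAB; the sketch below uses Python/SciPy, with parameters read from the caption of Fig. 1 (the shift that brings the Weibull to mean 6 is our reconstruction, not stated in the text).

```python
from scipy import stats

# Four distributions with mean 6, parameters taken from the Fig. 1 caption.
dists = {
    "uniform": stats.uniform(loc=4.94, scale=7.06 - 4.94),
    "triangular": stats.triang(c=0.5, loc=4.5, scale=7.5 - 4.5),
    "normal": stats.norm(loc=6.0, scale=0.61),
}
w = stats.weibull_min(c=1.103, scale=0.7)
# Shift the Weibull so that its mean becomes 6 (our reading of "shifted to give a mean of 6").
dists["weibull"] = stats.weibull_min(c=1.103, scale=0.7, loc=6.0 - w.mean())

# Eq. (2): probability of the interval [5, 7] is F(7) - F(5).
for name, d in dists.items():
    print(name, round(d.cdf(7.0) - d.cdf(5.0), 2))

# Quantiles b with p = pr{X < b}, as in Table 1 (here p = 0.05 and p = 0.50).
for name, d in dists.items():
    print(name, round(d.ppf(0.05), 2), round(d.ppf(0.50), 2))
```

The computed values agree with the text (0.94, 0.88, 0.90, 0.93 for [5.0, 7.0]) and with Table 1 to rounding.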
If for each of those variables a value b is determined so that there is a fixed probability, p, of obtaining values below b (i.e., the
value b such that p = pr{X < b} for each distribution X), the results of Table 1 are obtained. For example, in the second row, 5% of the
times the uniform distribution at hand gives values less than b = 5.05, less than 4.97 if it is the triangular distribution, and so on.
In the table, the extreme values among the four distributions for each probability p have been identified, and large differences are
observed caused by the form in which the values far from 6 are distributed (notice the differences in Fig. 1 for the normal, the
triangular, or the uniform distribution) and also due to the asymmetry of the Weibull distribution.
Therefore, the mean and variance of a random variable give very limited information on the values provided by the random
variable, unless additional information is at hand about the form of its density (pdf ). For example, if one knows that the
distribution is uniform or symmetrical triangular or normal, the random variable is completely characterized by its mean and
variance.
In practice, the pdf of a method of analysis is unknown. We only have a finite number, n, of measurements, which are the
outcomes obtained when applying repeatedly (n times) the same method to the same sample. These n measurements constitute a
statistical sample of the random variable X defined by the method of analysis.
Fig. 2 shows histograms of 100 results obtained when applying four methods of analysis, named A, B, C, and D, to aliquot parts
of a sample to determine an analyte. Clearly, the four methods behave differently.
From the experimental data, the (sample) mean and variance are computed as

x̄ = Σi=1n xi / n  (5)

s2 = Σi=1n (xi − x̄)2 / (n − 1)  (6)

x̄ and s2 are estimates of the mean and variance of the distribution of X. These estimates with the data in Fig. 2 are shown in Table 2.
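Eqs. (5) and (6) translate directly into code; a Python/NumPy sketch on an invented set of replicates (note ddof=1 in the built-in, which selects the n − 1 denominator of Eq. (6)):

```python
import numpy as np

# Hypothetical replicate determinations (mg L-1); any small data set works here.
x = np.array([6.1, 5.9, 6.3, 6.0, 5.8, 6.2])

n = x.size
x_bar = x.sum() / n                       # Eq. (5), sample mean
s2 = ((x - x_bar) ** 2).sum() / (n - 1)   # Eq. (6), note the n - 1 denominator

# Same results with the built-ins (ddof=1 selects the n - 1 divisor).
assert np.isclose(x_bar, x.mean())
assert np.isclose(s2, x.var(ddof=1))
print(round(x_bar, 3), round(s2, 4))
```

The n − 1 divisor makes s2 an unbiased estimate of the population variance, which is why ddof=1 (not the default ddof=0) must be used.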
According to the model of Eq. (1), E(X) = μ + Δ ≈ x̄, that is, the sample mean estimates the true value μ plus the bias Δ. Assuming
that the true value is μ = 6 and subtracting it from the sample means in the first row of Table 2, the estimated bias would be 0.66 for
methods A and B and 0.16 for methods C and D. The bias of a method is one of its performance characteristics and must be evaluated
during the validation of the method. In fact, technical guides, for example, the one by the International Organization for
Standardization (ISO), state that, for a method, better trueness means less bias. To estimate the bias, it is necessary to have samples
with known concentration μ (e.g., certified material, spiked samples).
The value of the variance is independent of the true content, μ, of the sample. For this reason, to estimate the variance, it is only
necessary to have replicated measurements on aliquot parts of the same sample. The second row of Table 2 shows that methods
B and C have the same variance, 1.26, which is five times greater than that of methods A and D, 0.25. The dispersion of the data
obtained with a method is the precision of the method and constitutes another performance characteristic to be determined in the
validation of the method. In agreement with the model in Eq. (1), a measure of the dispersion is the variance V(X), which is estimated
by means of s2.
In some occasions, for evaluating trueness and precision, it is more descriptive to use statistics other than mean and variance. For
example, when the distribution is rather asymmetric, as in Fig. 1D, it is more reasonable to use the median than the mean. The
median is the value in which the distribution accumulates 50% of the probability, 5.83 for the pdf in Fig. 1D and 6.00 for the other
three distributions, which are symmetric around their mean. In practice, it is frequent to see the presence of anomalous data
(outliers) that influence the mean and above all the variance, which is improperly increased; in these cases, it is advisable to use
robust estimates of central tendency and spread (dispersion).5–7 Details can be found in the chapter of the present book devoted to
robust procedures.
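Robust alternatives of this kind can be sketched with the median and the MAD (median absolute deviation, scaled to be consistent with the normal σ). The replicate values and the outlier below are invented for illustration:

```python
import numpy as np
from scipy import stats

# A hypothetical set of replicates with one outlier (8.9).
x = np.array([6.0, 6.1, 5.9, 6.2, 6.0, 8.9])

# Classical estimates are pulled up / inflated by the outlier.
print(round(x.mean(), 2), round(x.std(ddof=1), 2))

# Robust location: the median is barely affected by the outlier.
print(round(float(np.median(x)), 2))

# Robust spread: MAD scaled by 1.4826 so it estimates sigma under normality.
print(round(stats.median_abs_deviation(x, scale="normal"), 2))
```

Replacing 8.9 by any larger value leaves the median and the MAD unchanged, which is exactly the breakdown behavior that makes these estimators useful for outlier-prone data.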
Fig. 2 and Table 2 show that the two characteristics of a measurement method, trueness and precision, are independent of one
another, in the sense that a method with better trueness (less bias), methods C and D, can be more, case D, or less, case C, precise.
Analogously, methods A and B have an appreciable bias, but A is more precise than B. A method is said to be accurate when it is
precise and fulfills trueness.
Histograms are estimates of the pdf and allow evaluation of the performance of each method in a more detailed way than when
only considering trueness and precision. For example, the probability of obtaining values in any interval can be estimated with the
Fig. 2 Frequency histograms of 100 measures obtained with four different analytical methods, named (A), (B), (C), and (D), on aliquot parts of a sample. Dotted
vertical lines mark the interval [5.0, 7.0].
Table 2 Some characteristics of the distributions in Fig. 2.

                       Method
                       A      B      C      D
Mean, x̄               6.66   6.66   6.16   6.16
Variance, s2           0.25   1.26   1.26   0.25
fr{5 < X < 7}          0.70   0.56   0.58   0.98
fr{X < 6}              0.08   0.29   0.49   0.39
pr{5 < N(x̄, s) < 7}   0.75   0.55   0.62   0.94
pr{N(x̄, s) < 6}       0.09   0.28   0.44   0.37

fr, frequencies; pr, probabilities.
histogram. The third row in Table 2 shows the frequencies for the interval [5.0, 7.0]. Method D (best trueness and precision among
the four) provides 98% of the values in the interval, whereas method B (worst trueness and precision) provides only 56% of the
values in the interval. Nonetheless, trueness and precision should be jointly considered. Note that, according to the data in Table 2, the
effect of increasing the precision (using method A instead of B) when the bias is “high” is an increase of 14% in the results of the
measurement method lying in the interval [5.0, 7.0], whereas when the bias is small (C and D), the increase is of 40%. This
behavior should be taken into account when optimizing a method and also in the ruggedness analysis, which is another
performance characteristic to be validated according to most of the guides. As can be seen in the fourth row of Table 2, if the
method that provides more results below 6 is needed, C would be the method selected.
The previous explanations show the usefulness of knowing the pdf of the results of a method of analysis. As in practice we have
only a limited number of results, two basic strategies are possible to estimate the pdf: (1) to assess that the experimental data are
compatible with a known distribution (e.g., normal) and then use the corresponding pdf; (2) to estimate the pdf by a data-driven
technique based on a computer-intensive method such as the kernel method8 or by using other methods such as adaptive or
penalized likelihood.9,10 The data of Fig. 2 can be adequately modeled by a normal distribution, according to normality hypothesis
tests whose details are explained later in Section “Goodness-of-Fit Tests: Normality Tests”. The fitted normal distributions are used
to compute the probabilities of obtaining values in the interval [5.0, 7.0] or less than 6, last two rows in Table 2. When comparing
these values with those computed with the empirical histograms (compare rows 3 and 5, and rows 4 and 6), there are no appreciable
differences and the normal pdf can be used instead.
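The check made in the last two rows of Table 2 can be sketched as follows: fit a normal N(x̄, s) to the data, test normality, and compare the fitted probabilities with the empirical frequencies. The original 100 measurements are not available, so the data below are simulated with made-up parameters close to method D; the article's scripts are MATLAB, this sketch is Python/SciPy.

```python
import numpy as np
from scipy import stats

# Simulated stand-in for a method like D in Table 2 (mean 6.16, s2 = 0.25);
# these parameters and the seed are illustrative, not the article's data.
rng = np.random.default_rng(seed=7)
x = rng.normal(6.16, 0.5, size=100)

# D'Agostino's K2 normality test (see Section "Goodness-of-Fit Tests: Normality Tests").
stat, pval = stats.normaltest(x)

# Fit N(x_bar, s) and compare its probability with the empirical frequency.
x_bar, s = x.mean(), x.std(ddof=1)
fitted = stats.norm(loc=x_bar, scale=s)

freq = np.mean((x > 5.0) & (x < 7.0))      # empirical, like row 3 of Table 2
prob = fitted.cdf(7.0) - fitted.cdf(5.0)   # fitted normal, like row 5 of Table 2

print(round(freq, 2), round(prob, 2))
```

When the normality test does not reject, the fitted pdf can stand in for the histogram, which is the substitution the text describes.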
In the validation of an analytical method and during its later use, statistical methodological strategies are needed to make
decisions from the available experimental data. The knowledge of these strategies supposes a way of thinking and acting that,
subordinated to the chemical knowledge, makes objective both the analytical results and their comparison with those of other
researchers and/or other analytical methods.
Ultimately, a good method of analysis is a serious attempt to come close to the true value of the measurement, always unknown.
For this reason, the result of a measurement has to be accompanied by an evaluation of uncertainty or its degree of reliability. This is
done by means of a confidence interval. When the requirement is to establish the quality of an analytical method, then its capability
of detection, precision, etc. must be compared with those corresponding to other methods. This is formalized with a hypothesis test.
Confidence intervals and test of hypothesis are the basic tools in the validation of analytical methods.
In this introduction, the word sample has been used with two different meanings. Usually, there is no confusion because the
context allows one to distinguish whether it is a sample in the statistical or chemical sense.
In Chemistry, according to the International Union of Pure and Applied Chemistry (IUPAC) (Page 50 in Section 18.3.2 of
Inczédy et al.11), “sample” should be used only when it is a part selected from a greater amount of material. This meaning
coincides with that of a statistical sample and implies the existence of sampling error, that is, error caused by the fact that the sample
can be more or less representative of the amount in the material. For example, suppose that we want to measure the amount of
pesticide that remains in the ground of an arable land after a certain time. We take several samples “representative” of the ground of
the parcel (statistical sampling), and this introduces an uncertainty in the results characterized by a (theoretical) variance σs2.
Afterward, the quantity of pesticide in each chemical sample is determined by an analytical method, which has its own uncertainty,
characterized by σm2, in such a way that the uncertainty in the quantity of pesticide in the parcel is σs2 + σm2, provided that the method
gives results independent of the location of the sample. Sometimes, when evaluating whether a method is adequate for a task, the
sampling error can be an important part of the uncertainty in the result and, of course, should be taken into account to plan the
experimentation.
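The variance addition above can be sketched in a couple of lines; the sampling and measurement standard deviations are invented for illustration:

```python
import math

# Hypothetical standard deviations: sampling (s_s) and measurement (s_m).
s_s, s_m = 0.30, 0.40  # e.g., mg kg-1

# Independent contributions add in variance, never in standard deviation:
# s_total^2 = s_s^2 + s_m^2
s_total = math.sqrt(s_s**2 + s_m**2)
print(round(s_total, 3))
```

Adding the standard deviations directly (0.30 + 0.40 = 0.70) would overstate the combined uncertainty; the quadrature sum gives 0.50.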
In summary, there is a clear link between a measurement method and a random variable, which is why probability is the
natural form of expressing experimental uncertainty. This is thus the focus of the present article, which is organized as follows:
Section “Confidence and Tolerance Intervals” describes confidence intervals to measure bias and precision under the normality
hypothesis and tolerance intervals, useful in evaluating the fit for purpose of a method. Also, a nonparametric interval on the
median is described.
Section “Hypothesis Tests” is devoted to making decisions based on experimental data that, as such, are affected by uncertainty.
In this section, the computation of the power of a test is systematically proposed as a key element to evaluate the quality of the
decision at the desired significance level. A brief incursion into tests based on intervals is also made as they solve the problem of
deciding whether an interval of values is acceptable, for example, a relative error less than 10% in absolute value. The section ends
with some goodness-of-fit tests to evaluate the compatibility of a theoretical probability distribution with some experimental data.
Section “One-Way Analysis of Variance” is dedicated to the analysis of variance (ANOVA) for both fixed and random effects, and
in Section “Statistical Inference and Validation” some more specific questions related to the usual parameters of the analytical
method validation and their relation with the developed statistical methodologies are analyzed.
Mathematical proofs are not covered in this article and, to be operative from a practical point of view, several examples have
been included so that the reader can verify the understanding of the formulas and the argumentation for their thoughtful use. This
aspect is completed with the inclusion of an Appendix where some essential aspects related to the effectiveness of the statistical
models and the limit laws are described. The Appendix also contains the necessary statements, in MATLAB code, to repeat all the
calculations proposed along the article. The same statements are also available as supplementary material in the form of MATLAB
.mlx live scripts (at least release R2016a is needed to read and execute them).
Confidence and Tolerance Intervals
There are some important questions when evaluating a method, for example, “in a given sample, what is the maximum value that it
provides?” that, due to the random character of the results, cannot be answered with just a number.
In order to include the degree of certainty in the answer, the question should be reformulated as: What is the maximum value, U,
that will be obtained 95% of the times that the method is used in the sample? The answer to the question thus posed would be a
tolerance interval, and to build it the probability distribution must be known. For instance, let us suppose that it is a N(μ,σ) and we
denote by z0.05 the critical value of a N(0,1) = Z distribution, the one that accumulates probability 0.95. Then, a possible answer is
U = μ + z0.05σ, because then the probability that the analytical method gives values greater than U is pr{method > U} =
pr{N(μ,σ) > μ + z0.05σ}, which, according to the result in the Appendix, is equal to pr{Z > z0.05} = 0.05. In general, for any percentage
of results 100(1 − α)%, the maximum value provided by the method would be

U = μ + zα σ  (7)

with a probability α that the aforementioned assertion is false.
If, instead, the interest was in the value L so that 100(1 − α)% of the results are greater than L, then the answer would be

L = μ − zα σ  (8)

Finally, the interval [L, U] that contains 100(1 − α)% of the values obtained with the method would be

[L, U] = [μ − zα/2 σ, μ + zα/2 σ]  (9)
An analytical example where one of these tolerance intervals with a normal distribution N(μ,σ) needs to be computed would be: An
analytical method gives values (mg L−1) that follow a N(9, 0.5) distribution when measuring a standard with 9 mg L−1. To assess
whether the method is still properly working, ten standards are included in the daily sequence of determinations. The probability
distribution of the mean of these ten values is a N(9, 0.5/√10). Following Eq. (9), the tolerance interval at the 95% level is
9 ± 1.96 × 0.5/√10 = 9 ± 0.31 mg L−1. Consequently, if one day a mean of, say, 9.5 mg L−1 is obtained, the method does not
work properly because 9.5 does not belong to the tolerance interval and the method should be revised, at the risk of doing
this revision uselessly 5% of the times. Notice that the tolerance interval is always the same, built at the desired confidence level
100(1 − α)% with the distribution N(9, 0.5/√10), and it is not updated daily with the new samples.
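The worked example can be reproduced numerically (a Python/SciPy sketch; the article's own supplementary scripts are MATLAB):

```python
import math
from scipy import stats

# Daily-check example from the text: standards with true value 9 mg L-1,
# method sd 0.5 mg L-1, mean of n = 10 standards ~ N(9, 0.5/sqrt(10)).
mu, sigma, n = 9.0, 0.5, 10
alpha = 0.05

z = stats.norm.ppf(1 - alpha / 2)   # z_{alpha/2}, about 1.96
half = z * sigma / math.sqrt(n)
L, U = mu - half, mu + half         # Eq. (9) applied to the mean of ten standards

print(round(L, 2), round(U, 2))     # about [8.69, 9.31]
print(not (L <= 9.5 <= U))          # True: a daily mean of 9.5 falls outside
```

Since 9.5 lies outside [8.69, 9.31], the check flags the method for revision, exactly as argued in the text.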
Unlike Eq. (9), two variants of tolerance intervals, namely the β-content and the β-expectation tolerance intervals, are
explained in Section “Tolerance Intervals” due to their relevance in the context of validation of analytical methods. In any case, any
of them is completely different from the confidence intervals introduced and developed in the following sections (from
Section “Confidence Interval” to Section “Joint Confidence Intervals”).
After explaining all the studied cases, the section finishes with a comparative analysis of both concepts (tolerance and confidence
intervals).
Confidence Interval
We have already remarked that estimation of solely the mean, x̄, and variance, s2, from n independent results provides very limited
information on the method performance. The objective now is to make affirmations of the type “in the sample, the amount of the
analyte μ, estimated by x̄, is between L and U (μ ∈ [L, U])” with a certain probability that the statement is true. Following this
particular example, we should consider that x̄ is a value taken by the random variable X̄ (sample mean) and use its distribution to
answer the new question. Its distribution function is obtained mathematically from the one of X, FX(x), and thus depends on the
information we have about FX(x) (e.g., if the variance is known or should be also estimated, etc.).
In the general case, with a random variable X, obtaining a confidence interval for X from a sample x1, x2, . . ., xn consists of
obtaining two functions l(x1, x2, . . ., xn) and u(x1, x2, . . ., xn) such that

pr{X ∈ [l, u]} = pr{l ≤ X ≤ u} = 1 − α  (10)

1 − α is the confidence level and α is the significance level, meaning that the statement that the value of X is between l and u will
be false 100α% of the times.
In the next sections this idea will be particularized for some different cases, according to the random variable X of interest. Fig. 3
is a diagram that summarizes the cases studied in the following sections. All the examples are written in the MATLAB live-script file
Intervals_section1022_live.mlx, in the supplementary material, so that they can be easily repeated or adapted for the reader’s own data.
Confidence Interval on the Mean of a Normal Distribution
Case 1: Known variance
Suppose that we have a random variable that follows a normal distribution with known variance. This will be the case, for example,
of using an already validated method of analysis. The assumption means that we know that E in Eq. (1) is normally distributed and
also its variance. If we are using samples of size n and taking into account the properties of the normal distribution (see Appendix),
the sample mean, X̄, is a random variable N(μ, σ/√n); thus, the particular expression of Eq. (10) for this random variable is

pr{μ − z_{α/2} σ/√n ≤ X̄ ≤ μ + z_{α/2} σ/√n} = 1 − α (11)
that is, 100(1 − α)% of the values of the sample mean are in the interval in Eq. (11). A simple algebraic manipulation (subtract μ
and X̄, multiply by −1) gives
Fig. 3 Diagram summarizing the different cases for computing confidence intervals under normal distribution(s): for one sample, the mean μ0 (known or unknown variance) and the standard deviation σ0; for two independent samples, the difference in means μ1 − μ2 (known or unknown variances, equal or unequal) and the ratio of standard deviations σ1/σ2.
pr{X̄ − z_{α/2} σ/√n ≤ μ ≤ X̄ + z_{α/2} σ/√n} = 1 − α (12)

Therefore, according to Eq. (10), the confidence interval on the mean that is obtained from Eq. (12) is

[X̄ − z_{α/2} σ/√n, X̄ + z_{α/2} σ/√n] (13)
Analogously, the confidence intervals at confidence level 100(1 − α)% for the maximum and minimum values of the mean are
computed from Eqs. (14), (15), respectively

pr{μ ≤ X̄ + z_α σ/√n} = 1 − α (14)

pr{X̄ − z_α σ/√n ≤ μ} = 1 − α (15)

and, thus, the corresponding one-sided intervals would be (−∞, X̄ + z_α σ/√n] and [X̄ − z_α σ/√n, +∞).
In an experimental context, when measuring n aliquot parts of a test sample, we obtain n values x1, x2, ..., xn. Their sample mean x̄
is the particular value taken by the random variable X̄ and is also an estimate of the true value μ.
Example 1: Suppose that an analytical method follows a N(μ, 4) and we have a sample of size 10 with values 98.87, 92.54, 99.42,
105.66, 98.70, 97.23, 98.44, 103.73, 94.45 and 101.08. With this sample, the mean is 99.01 and, using Eq. (13), the interval at 95%
confidence level is [99.01 − 1.96 × 4/√10, 99.01 + 1.96 × 4/√10] = [96.53, 101.49].
For the interpretation of this interval, notice that with different samples of size 10 (same analytical method), different intervals
will be obtained at the same 95% confidence level. The endpoints of these intervals are nonrandom values, and the unknown mean
value, which is also a specific value, will or will not belong to the interval. Therefore, the affirmation “the interval contains the
mean” is a deterministic assertion that is true or false for each of the intervals. What one knows is that it is true for 100(1 − a)% of
those intervals. In our case, as 95% of the constructed intervals will contain the true value, we say, at 95% confidence level, that the
interval [96.53, 101.49] contains m.
This is the interpretation with the frequentist approach adopted in this article, that is to say that the information on random
variables is obtained by means of samples of them and that the parameters to be estimated are not known but are fixed amounts
(e.g., the amount of analyte in a sample, μ, is estimated by the measurement results obtained by analyzing it n times). With a
Bayesian approach to the problem, a probability distribution is attributed to the amount of analyte μ and, once an interval
of interest [a, b] is fixed, the "a priori" distribution of μ, the experimental results, and Bayes' theorem are used to calculate the
"a posteriori" probability that μ belongs to [a, b]. It has been shown that, although in most practical cases the uncertainty
intervals obtained from repeated measurements using either theory may be similar, their interpretation is completely different.
The works by Lira and Wöger12 and Zech13 are devoted to compare both approaches from the point of view of the experimental
data and their uncertainty. Also, an introduction to Bayesian methods for analyzing chemical data can be seen in Armstrong and
Hibbert.14,15
Case 2: Unknown variance
Suppose a normally distributed random variable with unknown variance that must be estimated, together with the mean, from n
experimental data. The confidence interval is computed as in Case 1, but now the statistic (X̄ − μ)/(S/√n) follows (see Appendix) a
Student's t distribution with n − 1 degrees of freedom (d.f.); thus, the interval at the 100(1 − α)% confidence level is obtained from

pr{X̄ − t_{α/2,ν} s/√n ≤ μ ≤ X̄ + t_{α/2,ν} s/√n} = 1 − α (16)

where t_{α/2,ν} is the upper percentage point (100α/2%) of the Student's t distribution with ν = n − 1 d.f. and s is the sample standard
deviation. Analogously, the one-sided intervals at the 100(1 − α)% confidence level come from

pr{μ ≤ X̄ + t_{α,ν} s/√n} = 1 − α (17)

pr{X̄ − t_{α,ν} s/√n ≤ μ} = 1 − α (18)
Example 2: Suppose that the probability distribution of an analytical method is normal, but its standard deviation is unknown.
With the data of Example 1, the sample standard deviation, s, is computed as 3.90. As t_{0.025,9} = 2.262 (see Appendix), the
confidence interval at 95% level is [99.01 − 2.26 × 1.24, 99.01 + 2.26 × 1.24] = [96.21, 101.81]. The 95% confidence interval on
the minimum of the mean (i.e., the 95% lower confidence bound) is made up, according to Eq. (18), of all the values greater than
96.74 = 99.01 − 1.83 × 1.24. The corresponding interval on the maximum (upper confidence bound for the mean), Eq. (17), will be
made up of the values less than 101.28 = 99.01 + 1.83 × 1.24.
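The same computation with the t distribution can be sketched in Python; scipy is assumed for the critical value (small differences from the text's [96.21, 101.81] come from the text rounding s/√10 to 1.24):

```python
from statistics import mean, stdev
from scipy import stats

data = [98.87, 92.54, 99.42, 105.66, 98.70,
        97.23, 98.44, 103.73, 94.45, 101.08]
n, xbar, s = len(data), mean(data), stdev(data)   # s = 3.90
t = stats.t.ppf(1 - 0.05/2, df=n - 1)             # t_{0.025,9} = 2.262
half = t * s / n**0.5
print(round(xbar - half, 2), round(xbar + half, 2))   # approx. 96.22 101.81
```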
The length of the confidence intervals from Eqs. (12)–(15) is a function of the sample size and tends towards zero when the
sample size tends to infinity. This functional relation permits the computation of the sample size needed to obtain an interval of
given length, d. It suffices to consider d/2 = z_{α/2} σ/√n and take as n the nearest integer greater than (2 z_{α/2} σ/d)². For example, if we want
a 95% confidence interval with length d less than 2, in the hypothesis of Example 1, we will need a sample size greater than or
equal to 62.
The same argument can be applied when the standard deviation is unknown. However, in this case, to compute n by (2 t_{α/2,ν} s/d)²
it is necessary to have an initial estimate of s, which, in general, is obtained in a pilot study of size n0, in such a way that
in the previous expression the d.f. are ν = n0 − 1. An alternative is to define the desired length of the interval in standard deviation
units (remember that the standard deviation is unknown). For instance, in Example 2, if we want d = 0.5s, we will need a sample
size greater than (4 z_{α/2})² = 61.5; note the substitution of t_{α/2,ν} by z_{α/2}, which is mandatory because we do not have the sample size
needed to compute t_{α/2,ν}, which is precisely what we want to estimate.
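The sample-size formula for the known-variance case translates directly into code; a short sketch (scipy assumed for the normal quantile):

```python
from math import ceil
from scipy import stats

def n_for_length(d, sigma, conf=0.95):
    """Smallest n giving a two-sided z-interval with total length <= d."""
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    return ceil((2 * z * sigma / d) ** 2)

print(n_for_length(d=2, sigma=4))   # 62, as in the text
```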
Confidence Interval on the Variance of a Normal Distribution
In this case, the data come from a N(μ, σ) distribution with μ and σ unknown, and we have a sample with values x1, x2, ..., xn. The
distribution of the random variable "sample variance" S² is related to the chi-square distribution, χ² (see Appendix). As a
consequence, the 100(1 − α)% confidence interval for the variance σ² is obtained from

pr{(n − 1)S²/χ²_{α/2,ν} ≤ σ² ≤ (n − 1)S²/χ²_{1−α/2,ν}} = 1 − α (19)

where χ²_{α/2,ν} is the critical value of a χ² distribution with ν = n − 1 d.f. at significance level α/2. As in the previous case for the
sample mean, we should distinguish between the random variable sample variance S² and one of its values, s², computed with
Eq. (6) from sample x1, x2, ..., xn.
The intervals for the maximum and minimum of the variance at 100(1 − α)% confidence level are obtained from Eqs. (20), (21),
respectively.

pr{σ² ≤ (n − 1)S²/χ²_{1−α,ν}} = 1 − α (20)

pr{(n − 1)S²/χ²_{α,ν} ≤ σ²} = 1 − α (21)
Example 3: Knowing that the n = 10 data of Example 2 come from a normal distribution with both mean and variance unknown,
the 95% confidence interval on σ² is found from Eq. (19) as [7.21, 50.81] because s² = 15.25, χ²_{0.025,9} = 19.02, and χ²_{0.975,9} = 2.70.
If the analyst is interested in obtaining a confidence interval for the maximum variance, the 95% upper confidence interval is
found from Eq. (20) as [0, 41.27] because χ²_{0.95,9} = 3.33; that is, the upper bound for the variance is 41.27 with 95% confidence.
Notice the lower bound at 0. To obtain confidence intervals on the standard deviation, it suffices to take the square root of the
aforementioned intervals because this operation is a monotonically increasing transformation; therefore, the intervals at 95%
confidence level on the standard deviation are [2.69, 7.13] and [0, 6.42], respectively.
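Example 3 can be checked with a short sketch (scipy assumed); tiny differences from the rounded χ² values used in the text may appear in the second decimal:

```python
from scipy import stats

def var_ci(s2, n, conf=0.95):
    """Two-sided CI on sigma^2 from the sample variance s2 (Eq. 19)."""
    a, df = 1 - conf, n - 1
    return (df * s2 / stats.chi2.ppf(1 - a/2, df),
            df * s2 / stats.chi2.ppf(a/2, df))

lo, up = var_ci(15.25, 10)
print(round(lo, 1), round(up, 1))   # approx. 7.2 50.8
```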
The sample size, n, needed so that s²/σ² is between 1 − k and 1 + k is given by the nearest integer greater than
1 + (1/2)[z_{α/2}(√(1 + k) + 1)/k]². For example, for k = 0.5, such that the length of the confidence interval verifies 0.5 < s²/σ² < 1.5,
we would need n = 40 data (at least). Just for comparative purposes, we will admit in the example that with the sample of size 40 we
obtain the same variance s² = 15.25. As χ²_{0.025,39} = 58.12 and χ²_{0.975,39} = 23.65, the two-sided interval at 95% confidence level is
now [10.23, 25.15], which verifies the required specifications.
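The sample-size expression, as reconstructed above, can be checked numerically (scipy assumed for z_{α/2}):

```python
from math import ceil, sqrt
from scipy import stats

def n_for_var_ratio(k, conf=0.95):
    """Approximate n such that 1-k < s^2/sigma^2 < 1+k at the given level."""
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    return ceil(1 + 0.5 * (z * (sqrt(1 + k) + 1) / k) ** 2)

print(n_for_var_ratio(0.5))   # 40, as in the text
```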
Confidence Interval on the Difference in Two Means
Case 1: Known variances
Consider two independent random variables, N1 and N2, distributed as N(μ1, σ1) and N(μ2, σ2) with unknown means and known
variances σ1² and σ2². We wish to find a 100(1 − α)% confidence interval on the difference in means μ1 − μ2. With a random sample
of n1 observations from the first distribution, x11, x12, ..., x1n1, and n2 observations from the second one, x21, x22, ..., x2n2, the
100(1 − α)% confidence interval on μ1 − μ2 is obtained from the equation

pr{(X̄1 − X̄2) − z_{α/2} √(σ1²/n1 + σ2²/n2) ≤ μ1 − μ2 ≤ (X̄1 − X̄2) + z_{α/2} √(σ1²/n1 + σ2²/n2)} = 1 − α (22)

where X̄1 and X̄2 are the random variables of the sample mean, which take the values x̄1 and x̄2. The reader can easily write the
expressions analogous to Eqs. (14), (15) for the one-sided intervals.
Case 2: Unknown variances
The approach to this topic is similar to the previous case, but here even the variances σ1² and σ2² are unknown. However, it can be
reasonable to assume that they are equal, σ1² = σ2² = σ², and that the differences observed in their estimates from the samples, s1² and
s2², are not significant. The methodology to decide whether this can be assumed, or not, is explained later, in
Section "Hypothesis Tests".
An estimate of the common variance σ² is given by the pooled sample variance in Eq. (23), which is an arithmetic average of both
variances weighted by the corresponding d.f.,

s_p² = [(n1 − 1)s1² + (n2 − 1)s2²]/(n1 + n2 − 2) (23)

The 100(1 − α)% confidence interval is obtained from the following equation:

pr{(X̄1 − X̄2) − t_{α/2,ν} s_p √(1/n1 + 1/n2) ≤ μ1 − μ2 ≤ (X̄1 − X̄2) + t_{α/2,ν} s_p √(1/n1 + 1/n2)} = 1 − α (24)

where ν = n1 + n2 − 2 are the d.f. of the Student's t distribution. The one-sided intervals at 100(1 − α)% confidence level have the
analogous expressions deduced from Eq. (24) by substituting t_{α/2,ν} for t_{α,ν}. If a fixed length is desired for the confidence interval, the
computation explained in Section "Confidence Interval on the Mean of a Normal Distribution" can be immediately adapted to
obtain the needed sample size.
Example 4: We want to study the stability of a substance after being stored for a month. Here stability means that the content of the
substance remains unchanged. Two series of measurements (n1 = n2 = 8) were carried out before and after the storage period and
we will estimate the difference in means by a 95% confidence interval. The results were x̄1 = 90.8, s1² = 3.89 and
x̄2 = 92.7, s2² = 4.02, respectively. Therefore, the two-sided interval when assuming equal variances (s_p² = 3.96, Eq. (23)) is
(90.8 − 92.7) ± 2.1448 × √3.96 × √(1/8 + 1/8), that is, [−4.03, 0.23]. At 95% confidence level, the difference of the means
belongs to an interval that includes the null difference; that is, the substance is stable.
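A sketch of the pooled-variance interval of Example 4 (scipy assumed for the t quantile):

```python
from math import sqrt
from scipy import stats

def pooled_ci(x1, s2_1, n1, x2, s2_2, n2, conf=0.95):
    """CI on mu1 - mu2 assuming equal variances (Eqs. 23 and 24)."""
    sp2 = ((n1 - 1) * s2_1 + (n2 - 1) * s2_2) / (n1 + n2 - 2)
    t = stats.t.ppf(1 - (1 - conf) / 2, df=n1 + n2 - 2)
    half = t * sqrt(sp2) * sqrt(1/n1 + 1/n2)
    return (x1 - x2) - half, (x1 - x2) + half

lo, up = pooled_ci(90.8, 3.89, 8, 92.7, 4.02, 8)
print(round(lo, 2), round(up, 2))   # -4.03 0.23
```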
When the assumption σ1² = σ2² is not reasonable, we can still obtain an interval on the difference μ1 − μ2 by using the fact that the
statistic

(X̄1 − X̄2 − (μ1 − μ2))/√(S1²/n1 + S2²/n2)

is distributed approximately as a t with d.f. given by

ν = (s1²/n1 + s2²/n2)² / [(s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1)] (25)

The 100(1 − α)% confidence interval is obtained from the following equation:

pr{(X̄1 − X̄2) − t_{α/2,ν} √(s1²/n1 + s2²/n2) ≤ μ1 − μ2 ≤ (X̄1 − X̄2) + t_{α/2,ν} √(s1²/n1 + s2²/n2)} = 1 − α (26)
Example 5: We want to compute a confidence interval on the difference of two means with unknown and unequal variances, with
the results that come from an experiment carried out with four aliquot samples by two different analysts. The first analyst obtains
x̄1 = 3.285, and the second x̄2 = 3.257. The variances were s1² = 3.33 × 10⁻⁵ and s2² = 9.17 × 10⁻⁵, respectively. Assuming that
σ1² ≠ σ2², Eq. (25) gives ν = 4.9, so the d.f. to apply Eq. (26) are 5 and t_{0.025,5} = 2.571. Thus, the 95% confidence interval is
(3.285 − 3.257) ± 2.571 × √(3.33 × 10⁻⁵/4 + 9.17 × 10⁻⁵/4), that is, [0.014, 0.042]. So, at 95% confidence, the two analysts provide unequal
measurements because zero is not in the interval.
The confidence intervals for the maximum and the minimum are obtained by considering the last or the first term, respectively, in
Eq. (26) and replacing t_{α/2,ν} by t_{α,ν}.
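Example 5's Welch-type interval, sketched in Python (scipy assumed; the d.f. from Eq. (25) are rounded to the nearest integer, as in the text):

```python
from scipy import stats

def welch_ci(x1, s2_1, n1, x2, s2_2, n2, conf=0.95):
    """CI on mu1 - mu2 without assuming equal variances (Eqs. 25 and 26)."""
    se2 = s2_1 / n1 + s2_2 / n2
    df = se2**2 / ((s2_1/n1)**2 / (n1 - 1) + (s2_2/n2)**2 / (n2 - 1))
    t = stats.t.ppf(1 - (1 - conf) / 2, df=round(df))
    half = t * se2**0.5
    return (x1 - x2) - half, (x1 - x2) + half, df

lo, up, df = welch_ci(3.285, 3.33e-5, 4, 3.257, 9.17e-5, 4)
print(round(df, 1), round(lo, 3), round(up, 3))   # 4.9 0.014 0.042
```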
Case 3: Confidence interval for paired samples
Sometimes we are interested in evaluating an effect (e.g., the reduction of a polluting agent in an industrial spill by means of a
catalyst) but it is impossible to have two homogeneous populations of samples without and with treatment to obtain the two
means of the recoveries, because the amount of polluting agent may change, for example, over time. In these cases, the solution is to
determine the polluting agent before and after applying the procedure to the same spill. The difference between both determina-
tions is a measure of the effect of the catalyst. The (statistical) samples obtained in this way are known as paired samples. Formally,
with the two paired samples of size n, x11, x12, ..., x1n and x21, x22, ..., x2n, we compute the differences between each pair of
data, di = x1i − x2i, i = 1, 2, ..., n. If these differences follow a normal distribution, the 100(1 − α)% confidence interval is obtained
from

pr{d̄ − t_{α/2,ν} s_d/√n ≤ μ ≤ d̄ + t_{α/2,ν} s_d/√n} = 1 − α (27)

where d̄ and s_d are the mean and standard deviation of the differences di and ν = n − 1 are the d.f. of the t distribution.
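A sketch of the paired-sample interval; the before/after values below are invented purely for illustration (scipy assumed):

```python
from statistics import mean, stdev
from scipy import stats

def paired_ci(x1, x2, conf=0.95):
    """CI on the mean difference of paired samples (Eq. 27)."""
    d = [a - b for a, b in zip(x1, x2)]
    n, dbar, sd = len(d), mean(d), stdev(d)
    t = stats.t.ppf(1 - (1 - conf) / 2, df=n - 1)
    half = t * sd / n**0.5
    return dbar - half, dbar + half

before = [10.2, 9.8, 11.1, 10.5, 9.9]   # hypothetical determinations
after  = [ 9.6, 9.5, 10.4, 10.1, 9.4]
lo, up = paired_ci(before, after)
print(0 < lo)   # True: zero is outside the interval, so the effect is significant
```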
Confidence Interval on the Ratioof Variances of Two Normal Distributions
This section approaches the question of giving a confidence interval on the ratio σ1²/σ2² of the variances of two distributions
N1 ~ N(μ1, σ1) and N2 ~ N(μ2, σ2) with unknown means and variances. Let x11, x12, ..., x1n1 be a random sample of n1 observations from N1
and x21, x22, ..., x2n2 be a random sample of n2 observations from N2. The sample variances obtained with these two samples, s1² and
s2², are the particular values of the random variables S1² and S2², and the 100(1 − α)% confidence interval on the ratio of variances is
computed from the following equation:

pr{F_{1−α/2,ν1,ν2} S1²/S2² ≤ σ1²/σ2² ≤ F_{α/2,ν1,ν2} S1²/S2²} = 1 − α (28)

where F_{1−α/2,ν1,ν2} and F_{α/2,ν1,ν2} are the critical values (upper tail) of an F distribution with ν1 = n2 − 1 d.f. in the numerator and
ν2 = n1 − 1 d.f. in the denominator. The Appendix contains a description of some relevant properties of the F distribution.
We can also compute one-sided confidence intervals. The 100(1 − α)% upper or lower confidence bound on σ1²/σ2² is obtained
from Eqs. (29), (30), respectively. Remember that, when computing the intervals by using Eq. (29), the lower bound is always 0.

pr{σ1²/σ2² ≤ F_{α,ν1,ν2} S1²/S2²} = 1 − α (29)

pr{F_{1−α,ν1,ν2} S1²/S2² ≤ σ1²/σ2²} = 1 − α (30)
Example 6: In this example, we compute a two-sided 95% confidence interval for the ratio of the variances in Example 4
(n1 = n2 = 8, s1² = 3.89, s2² = 4.02). The resulting interval is [0.20 × (3.89/4.02), 4.99 × (3.89/4.02)] = [0.19, 4.83]. As 1 belongs
to this interval, we can admit that both variances are equal.
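Example 6, sketched with scipy; note that the text's upper-tail critical value F_p corresponds to scipy's `f.ppf(1 - p, ...)`:

```python
from scipy import stats

def var_ratio_ci(s2_1, n1, s2_2, n2, conf=0.95):
    """CI on sigma1^2/sigma2^2 (Eq. 28) with the text's d.f. convention."""
    a, r = 1 - conf, s2_1 / s2_2
    lo = stats.f.ppf(a / 2, n2 - 1, n1 - 1) * r       # F_{1-a/2}, upper tail
    up = stats.f.ppf(1 - a / 2, n2 - 1, n1 - 1) * r   # F_{a/2}, upper tail
    return lo, up

lo, up = var_ratio_ci(3.89, 8, 4.02, 8)
print(round(lo, 2), round(up, 2))   # 0.19 4.83
```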
Confidence Interval on the Median
This case is different from the previous ones, because the confidence interval is a "distribution-free" interval, that is, no
distribution is assumed for the data. As is known, a percentile (pct) is the value x_pct such that 100·pct% of the values are less than or
equal to x_pct. It is possible to compute confidence intervals on any pct, but for values of pct near one or zero we need very large
sample sizes, n, because the values n·pct and n·(1 − pct) must be greater than 5. For the median (pct = 0.5), it suffices to
consider samples of size 10 or more.
The fundamentals of these confidence intervals are based on the binomial distribution, whose details are outside the scope of
this article and can be found in Sprent.16 We use the data of Example 1 to show step by step how a 100(1 − α)% confidence
interval on the median is computed (the guided example is for α = 0.05 with z_{α/2} = 1.96). The procedure consists of three steps:
1. To sort the data in ascending order. In our case, 92.54, 94.45, 97.23, 98.44, 98.70, 98.87, 99.42, 101.08, 103.73, and 105.66. The
rank of each datum is the position that it occupies in the sorted list; for example, the rank of 98.44 is four.
2. To calculate the rank, rl, of the value that will be the lower endpoint of the interval. It is the nearest integer less than
(1/2)(n − z_{α/2}√n + 1). In our case, this value is 0.5(10 − 1.96√10 + 1) = 2.40, thus rl = 2.
3. To calculate the rank, ru, of the value that will be the upper endpoint of the interval, which is the nearest integer greater than
(1/2)(n + z_{α/2}√n − 1). In our case, this value is 0.5(10 + 1.96√10 − 1) = 7.60, then ru = 8.
Hence, the 95% confidence interval on the median is made by the values that are between the values in position 2 and 8, that is,
[94.45, 101.08].
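The three steps translate into a short function (standard library only; z_{α/2} = 1.96 hardcoded):

```python
from math import ceil, floor, sqrt

def median_ci(data, z=1.96):
    """Distribution-free CI on the median from ranks (steps 1-3 above)."""
    x = sorted(data)                          # step 1
    n = len(x)
    rl = floor(0.5 * (n - z * sqrt(n) + 1))   # step 2: 2.40 -> 2
    ru = ceil(0.5 * (n + z * sqrt(n) - 1))    # step 3: 7.60 -> 8
    return x[rl - 1], x[ru - 1]               # ranks are 1-based

data = [98.87, 92.54, 99.42, 105.66, 98.70,
        97.23, 98.44, 103.73, 94.45, 101.08]
print(median_ci(data))   # (94.45, 101.08)
```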
Joint Confidence Intervals
Sometimes it is necessary to compute confidence intervals for several parameters while maintaining a 100(1 − α)% confidence that all
of them contain the true value of the corresponding parameter. For example, for two statistically independent parameters, we can
assure a 100(1 − α)% joint confidence level by taking separately the corresponding 100(1 − α)^{1/2}% confidence intervals because
(1 − α)^{1/2} × (1 − α)^{1/2} = (1 − α). In general, if there are k parameters, we will compute the 100(1 − α)^{1/k}% confidence interval for each
of them.
However, if the sample statistics used are not independent of one another, the above computation is not valid. The Bonferroni
inequality states that the probability that all the affirmations are true at 100(1 − α)% confidence level is greater than or equal to
1 − Σ_{i=1}^{k} α_i, where 1 − α_i is the confidence level of the i-th interval (usually α_i = α/k). For example, if a joint 90% confidence
interval is needed for the mean of two distributions, according to the Bonferroni inequality α_i = α/2 = 0.10/2 = 0.05; thus, each
individual interval should be the corresponding 95% confidence interval.
Tolerance Intervals
In the introduction to the present Section "Confidence and Tolerance Intervals", the tolerance intervals of a normal distribution were
calculated knowing its mean and variance. Remember that the tolerance interval [l, u] contains 100(1 − α)% of the values of
the distribution of X or, equivalently, pr{X ∉ [l, u]} = α. Actually, the values of the parameters that define the probability
distribution are unknown, and this uncertainty should be transferred into the endpoints of the interval. There are several types of
tolerance regions, but in this article we will restrict ourselves to two common cases.
Case 1: β-content tolerance interval
Given a random variable X, an interval [l, u] is a β-content tolerance interval at γ confidence level if the following holds:

pr{pr{X ∈ [l, u]} ≥ β} ≥ γ (31)

Expressed in words, [l, u] contains at least 100β% of the values of X with γ confidence level. For the case of an analytical method, this
is to say that we have to determine, based on a sample of size n, for instance, the interval that will contain 95% (β = 0.95) of the
results, and this assertion must be true 90% of the times (γ = 0.90). Evidently, β-content tolerance intervals can be one-sided, which
means that the procedure will provide 95% of its results above l (respectively, below u) 90% of the times. We leave to the reader the
corresponding formal definitions.
One-sided and two-sided β-content tolerance intervals can be computed either by controlling the center or by controlling the
tails, and for both continuous and discrete random variables (a review can be seen in Patel17 and applications in Analytical
Chemistry in Meléndez et al.18 and Reguera et al.19).
Here we will only describe the case of a normally distributed X with unknown mean and variance. From this distribution, we
have a sample of size n that is used to compute the mean x̄ and standard deviation s. We want to obtain a two-sided β-content
tolerance interval controlling the center, that is, an interval such that

pr{pr{X ∈ [x̄ − ks, x̄ + ks]} ≥ β} ≥ γ (32)

To determine k, several approximations have been reported; consult Patel17 for a discussion of them. The approach by Wald and
Wolfowitz20 is based on determining k1 such that

pr{N(0,1) ≤ 1/√n + k1} − pr{N(0,1) ≤ 1/√n − k1} = β (33)

Therefore

k = k1 √((n − 1)/χ²_{γ,n−1}) (34)

where χ²_{γ,n−1} is the point exceeded with probability γ when using the χ² distribution with n − 1 d.f.
Example 7: With the data in Example 1, and β = γ = 0.95, we have x̄ = 99.01, s = 3.91, k1 = 2.054, and χ²_{0.95,9} = 3.33; thus,
according to Eq. (34), k = 3.379 and, as a consequence, the interval [99.01 − 3.38 × 3.91, 99.01 + 3.38 × 3.91] = [85.79, 112.23]
contains 95% of the results of the method 95% of the times that the procedure is repeated with a sample of size 10.
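The Wald-Wolfowitz factor of Eqs. (33) and (34) can be computed numerically; a sketch with scipy (the root bracket [0.1, 10] is an assumption that covers practical β values):

```python
from math import sqrt
from scipy import stats
from scipy.optimize import brentq

def beta_content_k(n, beta=0.95, gamma=0.95):
    """Two-sided beta-content tolerance factor k (Eqs. 33 and 34)."""
    c = 1 / sqrt(n)
    # k1 solves pr{N(0,1) <= c + k1} - pr{N(0,1) <= c - k1} = beta
    k1 = brentq(lambda k: stats.norm.cdf(c + k) - stats.norm.cdf(c - k) - beta,
                0.1, 10)
    chi2 = stats.chi2.ppf(1 - gamma, df=n - 1)  # exceeded with probability gamma
    return k1 * sqrt((n - 1) / chi2)

print(round(beta_content_k(10), 2))   # 3.38, as in Example 7
```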
Case 2: β-expectation tolerance interval
The interval [l, u] is called a β-expectation tolerance interval if

E(pr{X ∈ [l, u]}) = β (35)

Unlike the β-content tolerance interval, the condition in Eq. (35) only demands that, on average, the probability that the random
variable takes values between l and u is β.
As in the previous case, we limit ourselves to intervals of the form [x̄ − ks, x̄ + ks]. When the distribution of the random
variable is normal and we have a sample of size n, the solution was obtained for the first time by Wilks21 and is

k = t_{(1−β)/2,ν} √((n + 1)/n) (36)

where t_{(1−β)/2,ν} is the upper (1 − β)/2 point of the t distribution with ν = n − 1 d.f.
Example 7 (continuation): With the same data, the 95% expectation tolerance interval would be [99.01 − 2.37 × 3.91, 99.01 +
2.37 × 3.91] = [89.74, 108.28], as now k is directly computed with the critical value t_{0.025,9} = 2.262.
This interval is shorter than the β-content tolerance interval because it only assures the expected value (the mean) of the
probabilities that the individual values belong to the interval. In fact, the interval [89.74, 108.28] contains 95% of the values of X
only 64% of the times, a conclusion drawn by applying Eq. (32) with k = 2.37. Also, note that when the sample size tends to infinity,
the value of k in Eq. (36) tends towards z_{(1−β)/2}, which gives the length of the theoretical interval that, in our example, would be
[91.35, 106.67], obtained by substituting k by z_{0.025} = 1.96.
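The Wilks factor of Eq. (36), with the resulting interval for Example 7 (scipy assumed; small second-decimal differences from the text come from rounding k to 2.37 there):

```python
from math import sqrt
from scipy import stats

def beta_expectation_k(n, beta=0.95):
    """beta-expectation tolerance factor k (Eq. 36, Wilks)."""
    return stats.t.ppf(1 - (1 - beta) / 2, df=n - 1) * sqrt((n + 1) / n)

k = beta_expectation_k(10)                        # approx. 2.37
xbar, s = 99.01, 3.91
lo, up = xbar - k * s, xbar + k * s
print(round(k, 2), round(lo, 1), round(up, 1))    # 2.37 89.7 108.3
```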
Case 3: Distribution free intervals
It is also possible to obtain tolerance intervals independent of the distribution (provided it is continuous) of variable X. These
intervals are based on the rank of the observations, but they demand very large sample sizes, which makes them quite useless in
practice. For example, the sample size n needed to guarantee that the b-content tolerance interval [l, u] is [x(1),x(n)] (i.e., the
endpoints are the smallest and the greatest values in the sample), it is necessary that n fulfills approximately the equation log(n) +
(n − 1) log (g) ¼ log (1 − b) − log (1 − g).
22 If we need, as in Example 7, b ¼ g ¼ 0.95, the value of n has to be 89. Nevertheless,
Willinks23 used the Monte Carlo method to compute shorter “distribution-free” uncertainty intervals proposed in Draft Supple-
ment2 but it still requires sample sizes that are rather large in the scope of chemical analysis. A complete theoretical development on
tolerance intervals (including their estimation by means of Bayesian methods) is in the book by Guttman.24
Tolerance intervals are of interest to show that a method is fit for purpose because, when establishing that the interval [x̄ − ks,
x̄ + ks] will contain, on average, 100β% of the values provided by the method (or 100β% of the values with γ confidence level), we
are including precision and trueness. To assess that the method is "fit for purpose" it suffices that the tolerance interval [x̄ − ks, x̄ + ks]
is included in the specifications that the method should fulfill. Note that a method with high precision (small value of s) but with a
significant bias can still fulfill the specifications in the sense that a high proportion of its values are within the specifications.
In addition, in the estimation of s, the repeatability can be introduced, as can the intermediate precision or the reproducibility, to
consider the scope of application of the method. The use of a tolerance interval solves the problem of introducing the bias as a
component of the uncertainty.
With the aim of developing analytical fit-for-purpose methods, the Société Française des Sciences et Techniques Pharmaceutiques
(SFSTP) proposed25–28 the use of β-expectation tolerance intervals in the validation of quantitative methods. In four case studies, it
has shown the validity of β-expectation tolerance intervals as an adequate way to reconcile the objectives of the analytical
method in routine analysis with those of the validation step, and it proposes them29 as a criterion to select the calibration curve. It
has also analyzed30 their adequacy to the guides that establish the performance criteria that should be validated and their usefulness31
in the problem of the transfer of an analytical method. González and Herrador32 have proposed their computation for the
estimation of the uncertainty of the analytical assay. In all these cases, β-expectation tolerance intervals based on the normality of data
are used, that is, using Eq. (36). To avoid dependence on the underlying distribution and the use of the classic distribution-free
methods, Rebafka et al.33 proposed the use of a bootstrap technique to calculate β-expectation tolerance intervals, whereas Fernholz
and Gillespie34 studied the estimation of β-content tolerance intervals by using bootstrap.
To summarize this whole section about tolerance and confidence intervals, it is worth pointing out some comparative aspects,
because there is a tendency to confuse two concepts that have nothing in common but the word "interval". The difference between
them is clear: the confidence interval is the set that is supposed to contain (with 100(1 − α)% confidence) the true value of the
unknown parameter; the tolerance interval is the set that contains a proportion β of the values taken by the random variable,
with a given confidence γ.
In particular, confidence intervals must be used in the process of evaluating trueness and precision of a method when there is no
need to fulfill external requirements but just to compare with other methods or to quantify uncertainty and bias of the results
obtained with it.
A usual error is to mistakenly consider a confidence interval as a tolerance interval, when the difference between them is
important. For instance, with the data of Example 7, notice that to compute the confidence interval the standard deviation of the
mean is estimated as s/√n = 1.24, whereas the standard deviation of the individual results of the method is estimated as s = 3.91,
which is very different.
Also, it is important to remember that when the sample size n tends to infinity, the length of a confidence interval tends towards
zero, independently of the chosen confidence level. For example, with the confidence intervals for the mean, in the limit we will
have x̄ = μ; thus the estimator and the true parameter will be equal for sure (1 − α = 1). On the contrary, the length of a β-content
tolerance interval does not tend towards zero when the sample size increases but towards that of the interval that contains for sure
(1 − γ = 1) 100β% of the values.
There are other aspects of the determination of the uncertainty that are of practical interest, for example, the problem that arises
by the fact that any uncertainty interval, particularly an expanded uncertainty interval, should be restricted to the range of feasible
values of the measurand. Cowen and Ellison35 analyzed how to modify the interval when the data are close to a natural limit in a
feasible range such as 0 or 100% mass or mole fraction.
Hypothesis Tests
This section is devoted to the introduction of a statistical methodology to decide whether an affirmation is false, for example, the
affirmation “this method of analysis applied to this sample of reference provides the certified value”. If, on the basis of the
experimental results, it is decided that it is false, we will conclude that the method has bias. The affirmation is customarily called
hypothesis, and the procedure of decision making is called hypothesis testing. A statistical hypothesis is an assertion about the
probability distribution that a random variable follows. Sometimes one has to decide on a parameter, for example, whether the
mean of a normal distribution is a specific value. On other occasions it may be required to decide on other characteristics of the
distribution, for example, whether the experimental data are compatible with the hypothesis that they come from a normal or a
uniform distribution.
Elements of a Hypothesis Test
As the results obtained with analytical methods are modeled by a probability distribution, it is evident that both thevalidation of a
method and its routine use involve making decisions that are naturally formulated as problems of hypothesis testing. In order to
describe the elements of a hypothesis test, we will use a concrete case. Like in the case of intervals, all the examples can be followed
with live-script in the supplementary material entitled Tests_section1023_live.mlx.
Example 8: For an experimental procedure, we need solutions with pH values less than 2. The preparation of these solutions
provides pH values that follow a normal distribution with σ = 0.55. The pH values obtained from 10 measurements were 2.09, 1.53,
1.70, 1.65, 2.00, 1.68, 1.52, 1.71, 1.62, and 1.58. The question to be answered is whether the pH of the resulting solution is
adequate to proceed with the experiment.
We express this formally as

H0: μ = 2.00 (inadequate solution)
H1: μ < 2.00 (valid solution)   (37)

The statement "μ = 2.00" in Eq. (37) is called the null hypothesis, denoted as H0, and the statement "μ < 2.00" is called the
alternative hypothesis, H1. As the alternative hypothesis specifies values of μ that are less than 2.00, it is called a one-sided alternative.
In some situations, we may wish to formulate a two-sided alternative hypothesis to specify values of μ that could be either greater or
less than 2.00, as in

H0: μ = 2.00
H1: μ ≠ 2.00   (38)
The hypotheses are not affirmations about the sample but about the distribution from which those values come; that is to say, μ is
the unknown value of the pH of the solution, which will be the same as the value provided by the procedure if the bias is zero (see the
model of Eq. (1)). In general, to test a hypothesis, the analyst must consider the experimental goal and define, accordingly, the null
hypothesis for the test, as in Eq. (37). Hypothesis-testing procedures rely on using the information in a random sample; if this
information is inconsistent with the null hypothesis, we would conclude that the hypothesis is false. If there is not enough evidence
to prove falseness, the test defaults to the decision of not rejecting the null hypothesis though this does not actually prove that it is
correct. It is therefore critical to choose carefully the null hypothesis in each problem.
Table 3 Decisions in hypothesis testing.

Researcher’s decision   The unknown truth
                        H0 is true      H0 is false
Accept H0               No error        Type II error
Reject H0               Type I error    No error
Table 4 Some parametric hypothesis tests.

     Null hypothesis   Alternative hypothesis   Statistic                                   Critical region
1    μ = μ0            μ ≠ μ0                   Zcalc = (x̄ − μ0)/(σ/√n)                    {Zcalc < −zα/2} ∪ {Zcalc > zα/2}
2    μ = μ0            μ < μ0                                                               {Zcalc < −zα}
3    μ = μ0            μ > μ0                                                               {Zcalc > zα}
4    μ = μ0            μ ≠ μ0                   tcalc = (x̄ − μ0)/(s/√n)                    {tcalc < −tα/2,n−1} ∪ {tcalc > tα/2,n−1}
5    μ = μ0            μ < μ0                                                               {tcalc < −tα,n−1}
6    μ = μ0            μ > μ0                                                               {tcalc > tα,n−1}
7    μ1 = μ2           μ1 ≠ μ2                  Zcalc = (x̄1 − x̄2)/√(σ1²/n1 + σ2²/n2)      {Zcalc < −zα/2} ∪ {Zcalc > zα/2}
8    μ1 = μ2           μ1 > μ2                                                               {Zcalc > zα}
9    μ1 = μ2           μ1 ≠ μ2                  tcalc = (x̄1 − x̄2)/(sp√(1/n1 + 1/n2))      {tcalc < −tα/2,n1+n2−2} ∪ {tcalc > tα/2,n1+n2−2}
10   μ1 = μ2           μ1 > μ2                                                               {tcalc > tα,n1+n2−2}
11   μd = 0            μd ≠ 0                   tcalc = d̄/(sd/√n)                           {tcalc < −tα/2,n−1} ∪ {tcalc > tα/2,n−1}
12   μd = 0            μd > 0                                                               {tcalc > tα,n−1}
13   σ² = σ0²          σ² ≠ σ0²                 χ²calc = (n − 1)s²/σ0²                      {χ²calc < χ²1−α/2,n−1} ∪ {χ²calc > χ²α/2,n−1}
14   σ² = σ0²          σ² > σ0²                                                             {χ²calc > χ²α,n−1}
15   σ1² = σ2²         σ1² ≠ σ2²                Fcalc = s1²/s2²                             {Fcalc < F1−α/2,n1−1,n2−1} ∪ {Fcalc > Fα/2,n1−1,n2−1}
16   σ1² = σ2²         σ1² > σ2²                                                            {Fcalc > Fα,n1−1,n2−1}

The values zα are the percentiles of a standard normal distribution such that α = pr{N(0,1) > zα}. The values tα,ν are the percentiles of a Student’s t distribution with ν degrees of freedom
such that α = pr{t > tα,ν}. The values χ²α,ν are the percentiles of a χ² distribution with ν degrees of freedom such that α = pr{χ² > χ²α,ν}. The values Fα,ν1,ν2 are the percentiles of an F
distribution with ν1 degrees of freedom for the numerator and ν2 degrees of freedom for the denominator, such that α = pr{F > Fα,ν1,ν2}. sp² is the pooled variance defined in Eq. (23).
d̄ is the mean of the differences di = x1i − x2i between the paired samples; sd is their standard deviation.
16 Quality of Analytical Measurements: Statistical Methods for Internal Validation
In practice, to test a hypothesis, we must take a random sample, compute an appropriate test statistic from the sample data, and
then use the information contained in this statistic to make a decision. However, as the decision is based on a random sample, it is
subject to error. Two kinds of potential errors may be made when testing hypotheses. If the null hypothesis is rejected when it is true,
then a type I error has been made. A type II error occurs when the researcher accepts the null hypothesis when it is false. The situation
is described in Table 3.
In Example 8, if the experimental data lead to rejection of the null hypothesis H0 being true, our (wrong) conclusion is that the
pH of the solution is less than 2. A type I error has been made and the analyst will use the solution in the procedure when in fact it is
not chemically valid. If, on the contrary, the experimental data lead to acceptance of the null hypothesis when it is false, the analyst
will not use the solution when in fact the pH is less than 2 and a type II error has been made. Note that both types of error have to be
considered because their consequences are very different. In the case of type I error, an unsuitable solution is accepted, the procedure
will be inadequate, and the analytical result will be wrong with the subsequent damages that it may cause (e.g., the loss of a client, or
a mistaken environmental diagnosis). On the contrary, the type II error implies that a valid solution is not used with the
corresponding extra cost of the analysis. It is clear that the analyst has to specify the assumable risk of making these errors, and
this is done in terms of the probability that they will occur.
The probabilities of occurrence of type I and II errors are denoted by specific symbols, defined in Eq. (39). The probability α of
the test is called the significance level, and the power of the test is 1 − β, which measures the probability of correctly rejecting the null
hypothesis.

α = pr{type I error} = pr{reject H0 | H0 is true}
β = pr{type II error} = pr{accept H0 | H0 is false}   (39)

In Eq. (39), the symbol “|” indicates that the probability is calculated under that condition. In the example we are following, α will
be calculated with the normal distribution of mean 2 and standard deviation 0.55.
Statistically expressed, with the n = 10 results in Example 8 (sample mean x̄ = 1.708), one wants to decide about the value of the
mean of a normal distribution with known variance and a one-sided alternative hypothesis (a one-tail test).
With these premises, the related statistic is written in Table 4 (second row) and gives

Zcalc = (x̄ − μ0)/(σ/√n) = (1.708 − 2.0)/(0.55/√10) = −1.679.
In addition, the analyst must assume the risk α, say 0.05. This means that the decision rule that is going to be applied to the
experimental results will accept an inadequate (chemical) solution 5% of the times. Therefore, the critical or rejection region is
written in Table 4, second row, as CR = {Zcalc < −1.645}, meaning that the null hypothesis will be rejected for the samples of size
10 that provide values of the statistic less than −1.645. In the example, the actual value Zcalc = −1.679 belongs to the critical region;
thus, the decision is to reject the null hypothesis at the 5% significance level.
Given the present facilities of computation, instead of the CR, the available statistical software calculates the so-called P-value,
which is the probability, under the null hypothesis H0, of obtaining a value of the statistic at least as extreme as the one observed.
In our case, P-value = pr{Z ≤ −1.679} = 0.0466. When the P-value is less than the significance level α, the null hypothesis is rejected,
because this is the same as saying that the value of the statistic belongs to the critical region.
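Alongside the MATLAB live-script mentioned above, the one-tail Z-test of Example 8 can be sketched in Python with scipy; this is an illustrative reproduction of the computation, not part of the chapter’s supplementary material, and the variable names are ours.

```python
from math import sqrt
from scipy.stats import norm

# Example 8: 10 pH measurements, known sigma = 0.55, H0: mu = 2.00 vs H1: mu < 2.00
pH = [2.09, 1.53, 1.70, 1.65, 2.00, 1.68, 1.52, 1.71, 1.62, 1.58]
sigma, mu0, alpha = 0.55, 2.00, 0.05
n = len(pH)

x_bar = sum(pH) / n                          # sample mean, 1.708
z_calc = (x_bar - mu0) / (sigma / sqrt(n))   # test statistic from Table 4, row 2
p_value = norm.cdf(z_calc)                   # one-tail P-value, pr{Z <= z_calc}

z_alpha = norm.ppf(1 - alpha)                # 1.645
reject = z_calc < -z_alpha                   # True: z_calc = -1.679 is in CR
```

Both routes give the same decision: the statistic falls in the critical region and the P-value (0.0466) is below α = 0.05.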
The next question that immediately arises is about the power of the applied decision rule (statistic and critical region).
To calculate β, defined in Eq. (39), it is necessary to specify exactly the meaning of the alternative hypothesis, in our case, what is
meant by pH smaller than 2. From a mathematical point of view, the answer is clear: any number less than 2, for example, 1.9999,
which clearly does not make sense from the point of view of the analyst. In this context, sometimes due to previous knowledge, in
other cases because of regulatory stipulations or simply by the detail of the standardized work procedure, the analyst can decide
the value of pH that is considered to be less than 2.00, for example, a pH less than 1.60. This is the same as assuming that “pH equal
to 2” is any smaller value whose distance to 2 is less than 0.40. In these conditions,

β = pr{N(0,1) < zα − (|δ|/σ)√n}   (40)

where |δ| = 0.40 in our problem; replacing it in Eq. (40), we have β = 0.26 (calculations can be seen in Example A9 of
Appendix). That is to say, whatever the decision made, the decision rule leads to throwing a valid chemical solution away 26% of the
times. Evidently, this is an inadequate rule.
A simple examination of Eq. (40) explains the situation. To decrease β, we should decrease the value zα − |δ|√n/σ. This may be
done by decreasing zα (i.e., increasing the significance level α) or by increasing |δ|√n/σ. As both the procedure precision, σ, and the
difference of pH that we wish to detect are fixed, the only possibility left is to increase the sample size n. Solving Eq. (40) for n we
have

n ≈ (zα + zβ)² / (|δ|/σ)²   (41)
The values of β and α for sample sizes of 10, 15, 20, and 25, maintaining δ and σ fixed, are drawn in Fig. 4. As can be seen, α and β
exhibit opposite behavior and, unless the sample size is increased, it is not possible to simultaneously decrease the probability of
both errors. In our case, Eq. (41) gives n = 20.5 for α = β = 0.05, thus n = 21 because the sample size must be an integer. The dotted
lines in Fig. 4 intersect at values of β of 0.263, 0.126, 0.058 and 0.025 when increasing the sample size while maintaining the
significance level α = 0.05. Again, we see that for a given α, the risk β decreases with the increase in n.
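The β of Eq. (40) and the sample size of Eq. (41) can be reproduced with a short Python sketch (hypothetical helper names, offered as an illustration of the calculations referenced in Examples A9 of the Appendix):

```python
from math import sqrt, ceil
from scipy.stats import norm

sigma, delta, alpha = 0.55, 0.40, 0.05   # Example 8 settings
z_a = norm.ppf(1 - alpha)                # z_alpha = 1.645

def beta(n):
    """Eq. (40): type II risk of the one-tail Z-test for sample size n."""
    return norm.cdf(z_a - (delta / sigma) * sqrt(n))

def sample_size(beta_target):
    """Eq. (41): n needed for given alpha and beta (one-tail test)."""
    z_b = norm.ppf(1 - beta_target)
    return (z_a + z_b) ** 2 / (delta / sigma) ** 2

b10 = beta(10)              # ~0.26, the inadequate rule discussed above
n_req = sample_size(0.05)   # ~20.5, so 21 measurements for alpha = beta = 0.05
```

For n = 10 the rule discards a valid solution about 26% of the times; reaching α = β = 0.05 requires n = 21.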
Eq. (40) also allows the analyst to decide the standard deviation (precision) necessary to obtain a decision rule according to the
risks a and b that the analyst is willing to admit. For example, if one must decide on the validity of the prepared solution with
Fig. 4 Simultaneous (opposite) behavior of α and β for different sample sizes, n = 10, 15, 20 and 25, maintaining δ and σ fixed at 0.4 and 0.55 respectively.
Dotted lines intersect the different curves for α = 0.05.
Fig. 5 Diagram for selecting the appropriate hypothesis test under normal distribution(s). For one sample: a Z-test on the mean μ0 if the variance is known, a t-test if it is unknown, and a χ²-test on the standard deviation σ0. For two independent samples: a Z-test on the difference in means μ1 − μ2 if the variances are known, a t-test (with equal or unequal variances) if they are unknown, and an F-test on the ratio of standard deviations σ1/σ2.
10 results and the analyst states α = β = 0.05, the only option according to Eq. (40) is to increase |δ|/σ. By solving
0.05 = pr{N(0,1) < 1.645 − (0.40/σ)√10}, one obtains σ = 0.3845. This means that the procedure should be improved from the
current value of 0.55 to 0.38. If only five results were allowed, the standard deviation would have to decrease to 0.27 to
maintain both the significance level and the power of the test.
Finally, there is an aspect of Eq. (40) that should not go unnoticed. Maintaining α, β, and n fixed, it is possible to reduce δ (the
pH value that can be distinguished from 2.00) if the analyst simultaneously increases the precision of the method, provided that the
ratio |δ|/σ remains constant. Said otherwise, without changing any of the specifications of the hypothesis test, by diminishing σ we
can discriminate a value of pH nearer to 2. Qualitatively this argument is clear: with a more precise procedure, similar results are
easier to distinguish, so values that would be considered equal with a less precise procedure will now appear as different.
Eq. (40) quantifies this relation for the hypothesis test we are conducting.
In summary, a hypothesis test includes the following steps: (1) defining the null, H0, and alternative, H1, hypotheses according
to the purpose of the test and the properties of the distribution of the random variable, which, according to Eq. (1), is the
distribution of the values provided by a method of measurement; (2) deciding on the probabilities α and β, that is, the risk for the
two types of error that will be assumed for the decision; (3) computing the needed sample size; (4) obtaining the results, computing
the corresponding test statistic and evaluating whether it belongs to the critical region CR; and, finally, drawing the analytical
conclusion, which should entail more than reporting the pure statistical test decision. The conclusion should include the elements
of the statistical test, the assumed distribution, α, β, and n. Care must be taken in writing the conclusion; for example, it is more
adequate to say “there is no experimental evidence to reject the null hypothesis” than “the null hypothesis is accepted”.
Table 4 summarizes the tests most frequently used in the validation of analytical procedures and in the analysis of their results.
Fig. 5 is the diagram equivalent to the one in Fig. 3 but for hypothesis testing.
Hypothesis Test on the Mean of a Normal Distribution
Case 1: Known variance
Admit that the data follow a normal N(μ,σ) distribution with unknown μ, as in the worked Example 8. The corresponding tests are
in rows 1 to 3 of Table 4. The test statistic is always the same, but, depending on whether the alternative hypothesis is two-sided (row
1 in Table 4) or one-sided (rows 2 and 3), the critical region is different. The value zα/2 satisfies α/2 = pr{Z > zα/2}, and analogously
for zα. For the two-tail test, the relation among n, α, and β is given by

n ≈ (zα/2 + zβ)² / (|δ|/σ)²   (42)

whereas for the one-tail tests, Eq. (41) must be used.
Case 2: Unknown variance
In this case, both the mean, μ, and the standard deviation, σ, of the normal distribution are unknown. The hypothesis tests are
in row 4 of Table 4 for the two-tail case and in rows 5 and 6 for the one-tail tests. The statistic tcalc should be compared to the values
tα,n−1 or tα/2,n−1 of the Student’s t distribution with n − 1 d.f. The equation that relates α, β, and n is
β = pr{−tα/2,n−1 ≤ tn−1(Δ) ≤ tα/2,n−1}   (43)

where Δ = (|δ|/σ)√n is the noncentrality parameter of a noncentral t(Δ) distribution, which in Eq. (43) has n − 1 d.f. Note the analogy
with the “shift” of the N(0,1) in Eq. (40). The discussion about the relative effect of sample size and precision is similar to the case in
which the variance is known. The corresponding equations for the one-tail tests are β = pr{−tα,n−1 ≤ tn−1(Δ)} if H1: μ < μ0 and β = pr
{tn−1(Δ) ≤ tα,n−1} if H1: μ > μ0.
To compute n from Eq. (43), the standard deviation is needed. To solve this additional difficulty, the comments in Case 2 of
Section “Confidence Interval on the Mean of a Normal Distribution” are valid and can be applied here also. Usually, δ = 2σ
or 3σ. Let us compare the solutions with known and unknown variance with the same data of Example 8, but supposing that the
variance is unknown. We wish to detect differences in pH of 0.73σ (the same δ/σ as in Example 8). By using a sample of
size 10, the probability β is 0.31 instead of the previous 0.26 (calculations can be seen in Example A10 of Appendix). This increase
in the probability of type II error is due to the lesser information we have about the problem; now the standard deviation is
unknown.
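The noncentral t computation behind Eq. (43) is available in scipy; the following sketch (our own illustration, with assumed variable names) reproduces the β ≈ 0.31 quoted above for the one-tail case H1: μ < μ0:

```python
from math import sqrt
from scipy.stats import t, nct

n, alpha = 10, 0.05
effect = 0.40 / 0.55        # |delta|/sigma, as in Example 8
ncp = effect * sqrt(n)      # noncentrality parameter Delta = (|delta|/sigma) * sqrt(n)

t_crit = t.ppf(1 - alpha, n - 1)        # t_{alpha, n-1}
# beta = pr{ t_{n-1}(Delta) <= t_{alpha,n-1} }, using the symmetry of the
# noncentral t to express the H1: mu < mu0 case with a positive ncp
beta = nct.cdf(t_crit, n - 1, ncp)      # ~0.31, versus 0.26 with known variance
```

The extra uncertainty about σ raises the type II risk from 0.26 (Z-test) to about 0.31 (t-test) for the same data.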
Case 3: The paired t-test
In Case 3 of Section “Confidence Interval on the Difference in Two Means” the experimental procedure and the reasons for
considering paired samples have already been explained. To decide on the effect of a treatment, the null hypothesis is that the mean
of the differences is zero, that is, H0: μd = 0, with the two-sided alternative H1: μd ≠ 0. This is the test shown in row 11 of Table 4,
where there is only one one-tail test (row 12) because, if needed, it suffices to consider the opposite differences di = x2i − x1i instead of
di = x1i − x2i, i = 1,. . .,n. The statistic and the critical region are analogous to those of Case 2 (test on the mean with unknown
variance).
Example 9: Table 5 shows the recovery rates obtained with two solid-phase extraction (SPE) cartridges after fortification of
wastewater samples with a sulfonamide. The samples came from 10 different locations. We want to decide whether cartridge A is
more efficient than cartridge B and to compute the β risk of the test. To answer these questions, it is important to specify that we
consider “different” those differences between the means of recovery rates that are greater than 2%.
We use a paired t-test on the mean of the differences between the recovery rates obtained with the two cartridges on the same
sample (those of cartridge A minus those of cartridge B). By considering these differences, we eliminate the effect of the location of
the wastewater samples on the performance of the two cartridges. The test is carried out as follows:
H0: μd = 0 (no differences in recoveries)
H1: μd > 0 (cartridge A gives recoveries greater than cartridge B)   (44)
Following row 12 of Table 4, the critical region is CR = {tcalc > tα,n−1} and the value of the statistic is

tcalc = d̄/(sd/√n) = 2.69/(3.526/√10) = 2.412.

The critical value t0.05,9 is equal to 1.833; thus, the actual tcalc belongs to the critical region. Therefore, the null hypothesis is
rejected for α = 0.05 and we can conclude that cartridge A is more efficient than cartridge B, because the mean of the differences is
positive.
To evaluate the power (1 − β) of the test, the equation β = pr{tn−1(Δ) ≤ tα,n−1} with Δ = d√n, for d = |μ − μ0|/σ = |δ|/
σ = 2/3.53 = 0.57, provides 1 − β = 1 − 0.492 = 0.508 for α = 0.05 and n = 10 (calculations can be seen in Example A11 of
Appendix). Hence, 50% of the times the conclusion of accepting that there is no difference between recovery rates is wrong.
In this case, the risk of a type II error is very large; in other words, the power is very poor when we want to discriminate differences of
2% in recovery because the ratio d is small.
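The paired t-test of Example 9 maps directly onto `scipy.stats.ttest_rel`; this sketch is an illustrative cross-check of the values quoted in the text, not part of the chapter’s supplementary material:

```python
from scipy import stats

# Table 5: recovery rates (%) for the same 10 locations with each cartridge
cartridge_a = [77.2, 74.0, 75.6, 80.0, 75.2, 69.2, 75.4, 74.0, 71.6, 60.4]
cartridge_b = [74.4, 70.0, 70.2, 77.2, 75.9, 60.0, 77.0, 76.0, 70.0, 55.0]

# paired one-tail t-test, H1: mu_d > 0 (cartridge A recovers more)
res = stats.ttest_rel(cartridge_a, cartridge_b, alternative='greater')

t_crit = stats.t.ppf(1 - 0.05, len(cartridge_a) - 1)   # t_{0.05,9} = 1.833
reject = res.statistic > t_crit                        # True: 2.41 > 1.833
```

Pairing by location removes the between-location variability, which is why the differences, and not the two raw series, carry the information about the cartridges.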
Hypothesis Test on the Variance of a Normal Distribution
The variance is a measure of the dispersion of the data used to evaluate the precision of a procedure of analysis; thus, decisions must
be made on this parameter frequently. The corresponding hypothesis tests are in rows 13 and 14 of Table 4.
Example 10: A validated procedure has a repeatability of σ0 = 1.40 mg L−1 when measuring concentrations around 400 mg L−1.
After a technical revision of the instrument, the laboratory is interested in testing the hypothesis

H0: σ² = σ0² = 1.96 (the repeatability did not change)
H1: σ² > 1.96 (the repeatability got worse)   (45)
Table 5 Recovery rates obtained by using two different extraction cartridges for a sulfonamide spiked in wastewater.
Location 1 2 3 4 5 6 7 8 9 10
Cartridge A (%) 77.2 74.0 75.6 80.0 75.2 69.2 75.4 74.0 71.6 60.4
Cartridge B (%) 74.4 70.0 70.2 77.2 75.9 60.0 77.0 76.0 70.0 55.0
See Example 9 for more details.
The analyst decides that a repeatability up to 2.0 times the initial one, 1.40 mg L−1, is admissible, and assumes the risks α = β = 0.05.
The sample size needed to guarantee the requirements of the analyst, which is formally a one-tail hypothesis test on the variance, is
obtained from Eq. (46).

β = pr{χ²n−1 < k/λ²}   (46)

where k is the value such that α = pr{χ²n−1 > k} and λ = σ/σ0. As λ = 2.0, Eq. (46) gives, for n = 14, β = 0.0402, whereas for n = 13,
β = 0.0511 (calculations can be seen in Example A12 of Appendix). Therefore, the analyst decides to carry out 14 determinations on
aliquot parts of a sample with 400 mg L−1, obtaining a variance of 3.10 (s = 1.76 mg L−1).
The statistic related to the decision (row 14 in Table 4) is χ²calc = (14 − 1)·3.10/1.96 = 20.56. As the critical region is CR =
{χ²calc > χ²0.05,13 = 22.36}, the conclusion is that there is not enough experimental evidence to conclude that the precision has
worsened. In this case, the acceptance of the null hypothesis, that is, maintaining the repeatability below 2.0 times the initial one,
will be erroneous 5% of the times because β was fixed at 5%. The decision rule is equally protected against type I and II errors.
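Both the decision and the sample-size search of Example 10 can be sketched with scipy’s χ² distribution (illustrative code, with our own helper name for Eq. (46)):

```python
from scipy.stats import chi2

sigma0_sq, s_sq, n = 1.96, 3.10, 14
chi2_calc = (n - 1) * s_sq / sigma0_sq    # 20.56 (row 14 of Table 4)
chi2_crit = chi2.ppf(1 - 0.05, n - 1)     # chi^2_{0.05,13} = 22.36

def beta(n, lam=2.0, alpha=0.05):
    """Eq. (46): type II risk for the one-tail variance test, lambda = sigma/sigma0."""
    k = chi2.ppf(1 - alpha, n - 1)        # k such that alpha = pr{chi2_{n-1} > k}
    return chi2.cdf(k / lam**2, n - 1)

b14, b13 = beta(14), beta(13)   # 0.0402 and 0.0511, as quoted in the text
```

Searching n upwards until beta(n) drops below 0.05 reproduces the analyst’s choice of 14 determinations.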
Hypothesis Test on the Difference in Two Means
Case 1: Known variances
We assume that X1 is normal and has unknown mean μ1 and known variance σ1², and that X2 is also normal with unknown mean μ2
and known variance σ2². We will be concerned with testing the hypothesis that the means μ1 and μ2 are equal. The two-sided
alternative hypothesis is in row 7 of Table 4 and the one-sided one in row 8, when we have a random sample of size n1 of X1 and another
random sample of size n2 of X2.
Example 11: A solid-phase microextraction (SPME) procedure to extract triazines from wastewater was carried out. The results must
be compared with previous ones where extraction was made by means of SPE. The repeatability of both procedures is known to be
5.36% for the SPME procedure and 3.12% for SPE. The mean recovery rate for 10 replicated samples (Table 6) is 85.9% for SPME and
81.8% for SPE. We want to decide, at a 0.05 significance level, if the recovery rate of the SPME procedure is greater than that of SPE.
As the repeatability of both procedures is known, a test to compare two means with normal distributions and known
variances is adequate. The hypotheses are

H0: μSPME = μSPE (recovery rates are the same for both procedures)
H1: μSPME > μSPE (the recovery using the SPME procedure is greater than the one using SPE)   (47)
Following row 8 of Table 4, for a significance level α = 0.05, CR = {Zcalc > zα = 1.645}. The statistic is

Zcalc = (x̄SPME − x̄SPE)/√(σ²SPME/n1 + σ²SPE/n2) = (85.9 − 81.8)/√(28.73/10 + 9.73/10) = 2.091.

As the value of the statistic, 2.091, belongs to the CR, the null hypothesis is rejected, and we conclude that the mean recovery rate with SPME is
greater than with SPE.
The next question could be related to the risk β for this hypothesis test, provided a difference in recovery of 3% is enough in the
analysis. To answer this question, a simple modification of Eq. (40) shows that

β = pr{Z < zα − |δ|/√(σ1²/n1 + σ2²/n2)}   (48)

By substituting our data in Eq. (48), one obtains β = 0.55. That means that in 55% of the cases, we will incorrectly accept that the
recovery is the same for both procedures.
Table 6 Recovery rates for triazines in wastewater using solid-phase microextraction (SPME) and solid-phase
extraction (SPE).

Recovery rate (%)
SPME 91 85 90 81 79 78 84 87 93 91
SPE  86 82 85 86 79 82 80 77 79 82

See Example 11 for more details.
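A compact Python sketch of Example 11, including the β of Eq. (48), may help; it is an illustration written by us (scipy-based), not the chapter’s own supplementary code:

```python
from math import sqrt
from scipy.stats import norm

x_spme = [91, 85, 90, 81, 79, 78, 84, 87, 93, 91]   # Table 6
x_spe  = [86, 82, 85, 86, 79, 82, 80, 77, 79, 82]
sigma_spme, sigma_spe = 5.36, 3.12                  # known repeatabilities (%)
n1 = n2 = 10

m1 = sum(x_spme) / n1                               # 85.9
m2 = sum(x_spe) / n2                                # 81.8
se = sqrt(sigma_spme**2 / n1 + sigma_spe**2 / n2)
z_calc = (m1 - m2) / se                             # 2.091 (row 8 of Table 4)

# Eq. (48): beta when the true difference worth detecting is delta = 3%
z_alpha = norm.ppf(1 - 0.05)
beta = norm.cdf(z_alpha - 3 / se)                   # ~0.55
```

The null hypothesis is rejected (2.091 > 1.645), yet the test is weakly powered for a 3% difference, as the β ≈ 0.55 shows.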
It is also possible to derive formulas to estimate the sample size required to obtain a specified β for given δ and α. For the one-
sided alternative, the sample size n = n1 = n2 is

n ≈ (zα + zβ)² (σ1² + σ2²) / δ²   (49)

Again, with the data of Example 11 and β = 0.05, Eq. (49) gives 46.25, that is, 47 aliquot samples should be analyzed for each
procedure so that α = β = 0.05.
For the two-sided alternative, the sample size n = n1 = n2 is

n ≈ (zα/2 + zβ)² (σ1² + σ2²) / δ²   (50)
Case 2: Unknown variances
As in Section “Confidence Interval on the Difference in Two Means” (Case 2), there are two possibilities: (1) the unknown variances
are equal, σ1² = σ2² = σ², although the numerical values differ just by chance, and (2) the variances are different, σ1² ≠ σ2². The
question of deciding between (1) and (2) will be approached in Section “Hypothesis Test on the Variances of Two Normal
Distributions”.
Let X1 and X2 be two independent normal random variables with unknown but equal variances. The statistic and critical region
for the two-tail test are in row 9 of Table 4, and in row 10 we can see the one-tail case.
For the two-sided alternative, with risks α and β, we additionally consider that the two means are different if their difference is at
least a quantity δ = |μ1 − μ2|. As the variances are unknown, the comments in Case 2 of Section “Confidence Interval on the Mean
of a Normal Distribution” are also applicable. If we have samples from a pilot experiment with respective sizes n10 and n20, and sp²
is the pooled variance as defined in Eq. (23), the sample size needed, n = n1 = n2, is

n ≈ (tα/2,ν + tβ,ν)² · 2sp² / δ²   (51)
where ν = n10 + n20 − 2 are the d.f. of the Student’s t distribution. If this is not possible, the difference to be detected should be
expressed in standard deviation units, that is, δ = |μ1 − μ2| = kσ, and the following expression applies:

n ≈ (zα/2 + zβ)² / (k²/2)   (52)
where zα/2 and zβ are the corresponding upper percentage points of the standard normal distribution Z.
Example 12: An experimenter wishes to compare the means of two procedures, stating that they should be considered different if
they differ by 2 or more standard deviations (k = 2), and defining assumable risks of, at most, α = 0.05 and β = 0.10.
As z0.025 = 1.960 and z0.10 = 1.282, Eq. (52) gives n = 5.25, so six samples must be determined using each procedure. If the
experimenter had wanted to distinguish 1 standard deviation (k = 1), then n = 21.01, that is, 22 determinations with each
procedure would have been necessary.
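Eq. (52) is simple enough to script; the following sketch (our own illustrative helper, not from the chapter) reproduces the two sample sizes of Example 12:

```python
from math import ceil
from scipy.stats import norm

def sample_size(k, alpha=0.05, beta=0.10):
    """Eq. (52): n per group to detect a difference of k standard deviations
    (two-sided alternative, unknown but equal variances)."""
    z_a2 = norm.ppf(1 - alpha / 2)   # z_{alpha/2}
    z_b = norm.ppf(1 - beta)         # z_{beta}
    return (z_a2 + z_b) ** 2 / (k ** 2 / 2)

n_k2 = sample_size(2)   # 5.25 -> 6 samples per procedure
n_k1 = sample_size(1)   # 21.01 -> 22 samples per procedure
```

Halving the detectable difference (k from 2 to 1) quadruples the required sample size, as the k² in the denominator dictates.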
Although it is preferable to always take equal sample sizes, it may be more expensive or laborious to collect data from X1 than
from X2. In this case, there are weighted sample sizes to be considered.36
In case the equality of the variances σ1² and σ2² cannot be admitted, there is no completely justified solution for the test.
However, approximations with good power and easy-to-use tests exist, such as the Welch test. This method consists of substituting the
known variances in the expression of Zcalc in rows 7 and 8 of Table 4 by their sample estimates, in such a way that the statistic
becomes
tcalc = (x̄1 − x̄2)/√(s1²/n1 + s2²/n2)   (53)

which follows a Student’s t with the degrees of freedom ν in Eq. (25). The critical region for the two-tail test is CR = {tcalc < −tα/2,ν}
∪ {tcalc > tα/2,ν}. The critical region for the one-tail test is CR = {tcalc < −tα,ν} if H1: μ1 < μ2.
As the variances are different, it seems reasonable to take the sample sizes, n1 and n2, also different. If σ2 = r·σ1, similarly to
Eq. (52), one obtains the expression in Eq. (54)

n1 ≈ (zα/2 + zβ)² / (k²/(r + 1))   (54)

Once n1 is determined, n2 is obtained as n2 = r·n1. The computation of the sample sizes with different variances when pilot
samples are at hand can be found in Schouten.36
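The Welch statistic of Eq. (53) is what `scipy.stats.ttest_ind` computes with `equal_var=False`; as an illustration (our own, applied to the Table 6 data, whose sample variances happen to be close to the known repeatabilities used in Example 11):

```python
from scipy import stats

spme = [91, 85, 90, 81, 79, 78, 84, 87, 93, 91]   # Table 6
spe  = [86, 82, 85, 86, 79, 82, 80, 77, 79, 82]

# Welch's t-test (Eq. (53)): unequal variances, one-tail H1: mu_SPME > mu_SPE
res = stats.ttest_ind(spme, spe, equal_var=False, alternative='greater')
```

The statistic (≈2.09) is close to the Zcalc of Example 11, and the same conclusion is reached at the 5% level, now without assuming the variances known.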
Test Based on Intervals
The problem of deciding “the equality” of the means of two distributions discussed in the previous section shows the fact that the
result of interest (the two means are equal) is obtained by accepting the null hypothesis. Hence, the type II error becomes very
important. To compute it, it is necessary to decide which is the least difference between the means that is to be detected,
δ = |μ1 − μ2|. A more natural framework is to define the null and alternative hypotheses in such a way that the decision of
accepting the equality of means is made by rejecting the null hypothesis, that is, the test should be posed as

H0: |μ1 − μ2| ≥ δ (the means are different)
H1: |μ1 − μ2| < δ (the means are considered to be equal)

Contrary to the tests so far, the hypotheses of this test, called interval hypotheses, are made not of one point but of an interval. The two
one-sided tests (TOST) procedure consists of decomposing the interval hypotheses H0 and H1 into two sets of one-sided hypotheses:

H01: μ1 − μ2 ≤ −δ
H11: μ1 − μ2 > −δ

and

H02: μ1 − μ2 ≥ δ
H12: μ1 − μ2 < δ

The TOST procedure consists of rejecting the interval hypothesis H0 (and thus concluding equality of μ1 and μ2) if and only if
both H01 and H02 are rejected at a chosen level of significance α.
If two normal distributions with the same unknown variance, σ², are assumed and two samples of sizes n1 and n2 are taken
from each one, the two sets of one-sided hypotheses will be tested with the ordinary one-tail test (row 10 of Table 4). Thus, the critical
region is

CR = {((x̄1 − x̄2) + δ)/(sp√(1/n1 + 1/n2)) ≥ tα,ν and (δ − (x̄1 − x̄2))/(sp√(1/n1 + 1/n2)) ≥ tα,ν}   (55)

where sp² is the pooled variance and ν = n1 + n2 − 2 its d.f.
The TOST procedure turns out to be operationally identical to the procedure of declaring equality only if the usual confidence
interval at 100(1 − 2α)% on μ1 − μ2 is completely contained in the interval [−δ, δ].
As we are supposing that the variances are equal, the expression that relates the sample sizes n1 = n2 = n with α and β is written in
Eq. (56)

n ≈ ((zα + zβ/2)/δ)² · 2σ²   (56)

Again, σ is unknown in Eq. (56), so it should be adapted as in Case 2 of Section “Hypothesis Test on the Difference in Two Means”.
When comparing Eq. (56) with those corresponding to the two-tail t-test on the difference of means, one observes that it is
completely analogous by exchanging the two risks (see Eqs. (50) and (52)). That is, the significance level and the power of the t-test
become the power and significance level, respectively, of the TOST procedure, which completely agrees with the exchange of the
hypotheses.
The tests based on intervals have a long tradition in Statistics; see for example the book (very technical) by Lehmann.37 The TOST
procedure is a particular case that has also been used under the name bioequivalence test.38 Mehring39 has proposed some technical
improvements to obtain optimal interval hypothesis tests, including equivalence testing. It is shown that TOST is always biased, in
particular, the power tends to zero for increasing variances independently of the difference in means. As a result, an unbiased test40
and a suitable compromise between the most powerful test and the shape of its critical region41 have been proposed. In Chemistry
the use of TOST has been suggested to verify the equality of two procedures.42,43 Kuttatharmmakull et al.44 provide a detailed
analysis of the sample sizes necessary in a TOST procedure to compare methods of measurement. There are different versions of
TOST for ratio of variables and for proportions; the details of the equations for these cases can be consulted in Section 8.13 of the
book by Martin Andrés and Luna del Castillo45 and in the book by Wellek.46 The latter is a comprehensive reviewof inferential
procedures that enable one to “prove the null hypothesis” for many areas of applied statistical data analysis.
Hypothesis Test on the Variances of Two Normal Distributions
Suppose that two procedures follow normal distributions X1 and X2 with unknown means and variances. We wish to test the
hypothesis of the equality of the two variances, that is, H0: σ1² = σ2². In practice, this is a relevant problem because this hypothesis is
Table 7 Data for analysis of stability (arbitrary units).
Control sample 46.31 44.90 44.12 36.07 39.20 36.39 50.71 47.85 45.60
Test sample 43.12 43.00 44.75 39.66 37.74 37.50 54.79 53.08 55.07
See Example 13 for more details.
Quality of Analytical Measurements: Statistical Methods for Internal Validation 23
related to the equality of the precision of the two procedures, and also as a previous step to decide about the equality of variances
before applying the test on the equality of means (Case 2 of Section “Hypothesis Test on the Difference in Two Means”) or
to compute a confidence interval on the difference of means (Case 2 of Section “Confidence Interval on the Difference in
Two Means”).
Assume that two random samples of size n1 of X1 and of size n2 of X2 are available and let s1² and s2² be the sample variances. To test the two-sided alternative, we use the statistic and CR of row 15 of Table 4. The probability β can be computed as a function of the ratio of variances λ² = σ1²/σ2² that is to be detected by Eq. (57)

β = pr{ (F1−α/2, n1−1, n2−1)/λ² < Fn1−1, n2−1 < (Fα/2, n1−1, n2−1)/λ² }   (57)

where Fn1−1, n2−1 denotes an F-distribution with n1 − 1 and n2 − 1 d.f. and Fα/2, n1−1, n2−1 its upper α/2 point, so that pr{Fn1−1, n2−1 > Fα/2, n1−1, n2−1} = α/2. Similarly, F1−α/2, n1−1, n2−1 is the upper 1 − α/2 point.
Example 13: Aliquot samples have been analyzed in random order under the same experimental conditions to carry out a stability
test. The results are given in Table 7 and must be compared for assessing the test material stability. Different questions can be asked:
(1) Is there experimental evidence of instability in the material? (2) Taking into account that the analyst considers the material unstable if the mean of the test sample differs from the mean of the control sample by two standard deviations, what is the probability of accepting the null hypothesis when it is in fact wrong? (3) What should the sample size be if a difference of just one standard deviation must be detected for this analysis to be fit for purpose (with α = β = 0.05)?
The answers would be:
(1) Stability: As we only know the estimates of the variance, we should use a t-test to compare means.
The first step is to test whether the variances can be considered equal by using a two-tail F-test:

H0: σ1² = σ2²
H1: σ1² ≠ σ2²

Following row 15 in Table 4, CR = {Fcalc < F1−α/2, n1−1, n2−1 or Fcalc > Fα/2, n1−1, n2−1} with Fα/2, n1−1, n2−1 = F0.025, 8, 8 = 4.43 and F1−α/2, n1−1, n2−1 = F0.975, 8, 8 = 1/F0.025, 8, 8 = 1/4.43 = 0.23. As Fcalc = s1²/s2² = 50.76/26.22 = 1.94, there is no experimental evidence to conclude that the variances differ.
Therefore, a t-test on the difference of the two means with equal variances is formulated (Case 2 of Section “Hypothesis Test on the Difference in Two Means”). The statistic and the CR are given in row 9 of Table 4.

H0: μ1 = μ2 (the test material is stable)
H1: μ1 ≠ μ2 (the test material is not stable)

The “pooled” variance is sp² = 38.49, so sp = 6.20 with 9 + 9 − 2 = 16 d.f. Then tcalc = (x̄1 − x̄2)/(sp √(1/n1 + 1/n2)) = (45.41 − 43.46)/(6.20 √(1/9 + 1/9)) = 0.67 and t0.025, 16 = 2.12. Therefore, the critical region is the set of values of tcalc less than −2.12 or greater than 2.12, which does not contain 0.67. Hence, there is no evidence to reject the null hypothesis, that is, with these data there is no experimental evidence of instability.
(2) Power of the test: With the condition imposed by the analyst, Eq. (52) with k = 2 gives β ≤ 0.05, so the power is greater than 0.95.
(3) Sample size: In this case, the analyst is interested in computing the sample size under the assumption that only one standard deviation is admissible for fitness for purpose of this analysis. Therefore, k = 1 and Eq. (52) gives n = 25.99, so n1 = n2 = 26. The sample sizes are greater than in point (2), reflecting the fact that the analyst is now interested in distinguishing a much smaller quantity.
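The numerical answers to question (1) can be cross-checked with a short script. The sketch below is illustrative plain Python; the tabulated critical values 4.43 and 2.12 are taken from the text rather than computed.

```python
import math

control = [46.31, 44.90, 44.12, 36.07, 39.20, 36.39, 50.71, 47.85, 45.60]
test_sample = [43.12, 43.00, 44.75, 39.66, 37.74, 37.50, 54.79, 53.08, 55.07]

def mean_var(x):
    """Sample mean and variance (n - 1 denominator)."""
    m = sum(x) / len(x)
    return m, sum((v - m) ** 2 for v in x) / (len(x) - 1)

m1, s1_sq = mean_var(test_sample)   # larger variance goes in the numerator
m2, s2_sq = mean_var(control)

# Two-tail F-test: tabulated F_{0.025, 8, 8} = 4.43
f_calc = s1_sq / s2_sq
equal_variances = 1 / 4.43 < f_calc < 4.43

# Pooled t-test with 9 + 9 - 2 = 16 d.f.; tabulated t_{0.025, 16} = 2.12
n1, n2 = len(test_sample), len(control)
sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
t_calc = (m1 - m2) / math.sqrt(sp_sq * (1 / n1 + 1 / n2))
stable = abs(t_calc) < 2.12

print(round(f_calc, 2), round(t_calc, 2), equal_variances, stable)
```

Both tests stay outside their critical regions, reproducing the conclusions of the example.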
Regarding the sample size of the F-test in the answer to question (1), when the aim is to detect a standard deviation twice that of the control samples, Eq. (57) gives a probability β for this test of 0.56. That means that 56% of the time the null hypothesis will be wrongly accepted, and in this case we have accepted the null hypothesis. When the F-test is used as a previous step to the test on the equality of means, and we decided β = 0.05 for the latter (t-test), it is common to use β = 0.10 for the former (F-test). Eq. (57) gives n = n1 = n2 = 24 with β = 0.098 (the closest to the intended 0.1) while maintaining that a change of twice the standard deviation of the control samples is to be detected (all calculations of β can be seen in Example A13 of Appendix). In general, the F-tests on the equality of variances are very conservative, and large sample sizes are needed to assure an adequately small probability of type II error.
Table 8 Determination of the degree of acidity of a vinegar by means of an acid-base titration.
Group 1 Group 2 Group 3 Group 4 Group 5
6.028 5.974 5.886 6.132 5.916
6.028 6.004 5.970 6.120 6.123
5.998 6.005 5.880 6.131 6.034
6.089 5.852 5.910 6.072 6.004
6.059 5.944 5.910 6.071 6.152
Mean x̄i: 6.0404 5.9558 5.9112 6.1052 6.0458
Variance si²: 1.203 × 10⁻³ 3.997 × 10⁻³ 1.267 × 10⁻³ 0.969 × 10⁻³ 8.993 × 10⁻³
See Example 14 for more details.
Fig. 6 pH obtained by the different groups of students (Example 14), depicted to visually inspect the equality of variances.
Hypothesis Test on the Comparison of Several Independent Variances
When the hypothesis of equality of variances of several groups of data coming from normal and independent distributions is to be
tested, a good practice is to draw the data for a visual inspection of their dispersion.
Example 14: Table 8 shows the results of the determination of the degree of acidity by means of an acid-base titrimetry, employing
sodium hydroxide as the titrant. These data are adapted from the practice “Analysis and comparison of the acetic grade of a vinegar”
included in Ortiz et al.,47 and each series is a replicated determination carried out by a group of students on the same vinegar
sample. The means and variances obtained by each group are also included in Table 8.
Fig. 6 shows the plot of the results obtained by the five different groups of students. Some differences are observed.
The most commonly used tests to compare several variances are Cochran’s, Bartlett’s, and Levene’s tests. In all the cases, the
hypotheses to be tested are
H0: σ1² = ... = σi² = ... = σk²
H1: at least one σi² is different   (58)

The sample size of each group is denoted by ni, i = 1, 2, ..., k, and N = Σ_{i=1}^k ni.
Case 1: Cochran’s test
The null hypothesis is that the variances within each of the k groups of data are the same. This test detects whether one variance is greater than the rest. The statistic is

Gcalc = max si² / Σ_{i=1}^k si²   (59)

The critical region at significance level α is given by

CR = {Gcalc > Gα, k, ν}   (60)

where Gα, k, ν is the value tabulated in Table 9 for ν d.f. In the case ni = n for all i, ν = n − 1.
With the data of Example 14 in Table 8, Gcalc = 8.993 × 10⁻³/(16.429 × 10⁻³) = 0.5474 and G0.05, 5, 4 = 0.5441 (Table 9). Thus, at the 0.05 significance level, the null hypothesis should be rejected and the variance of group 5 should be considered different from the rest.
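Because Gcalc only involves the sample variances, Cochran's test takes a few lines. The sketch below is illustrative plain Python (`cochran_G` is a hypothetical helper name); the critical value 0.5441 is read from Table 9 rather than computed.

```python
def cochran_G(variances):
    """Cochran's statistic (Eq. 59): largest variance over the sum of all."""
    return max(variances) / sum(variances)

# Example 14 (Table 8): five groups of five replicates each, so nu = 4
s2 = [1.203e-3, 3.997e-3, 1.267e-3, 0.969e-3, 8.993e-3]
G = cochran_G(s2)
# Tabulated critical value G_{0.05, 5, 4} = 0.5441 (Table 9)
print(round(G, 4), G > 0.5441)   # the null hypothesis is rejected
```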
Table 9 Critical values for Cochran’s test for testing homogeneity of several variances at 5% significance level.
k \ ν  1 2 3 4 5 6 7 8 9 10
2 0.9985 0.9750 0.9392 0.9057 0.8772 0.8534 0.8332 0.8159 0.8010 0.7880
3 0.9669 0.8709 0.7977 0.7457 0.7071 0.6771 0.6530 0.6333 0.6167 0.6025
4 0.9065 0.7679 0.6841 0.6287 0.5895 0.5598 0.5365 0.5175 0.5017 0.4884
5 0.8412 0.6838 0.5981 0.5441 0.5065 0.4783 0.4564 0.4387 0.4241 0.4118
6 0.7808 0.6161 0.5321 0.4803 0.4447 0.4184 0.3980 0.3817 0.3682 0.3568
7 0.7271 0.5612 0.4800 0.4307 0.3974 0.3726 0.3535 0.3384 0.3259 0.3154
8 0.6798 0.5157 0.4377 0.3910 0.3595 0.3362 0.3185 0.3043 0.2926 0.2820
9 0.6385 0.4775 0.4027 0.3584 0.3286 0.3067 0.2901 0.2768 0.2659 0.2568
10 0.6020 0.4450 0.3733 0.3311 0.3029 0.2823 0.2666 0.2541 0.2439 0.2353
k, number of levels; ν, degrees of freedom.
Adapted from Sachs, L. Applied Statistics. A Handbook of Techniques; Springer-Verlag: New York, 1982.
Case 2: Bartlett’s test
This test is appropriate to detect groups of similar variance that nevertheless differ from one group to another. The statistic is defined by the following equations:

χ²calc = 2.3026 q/c   (61)

q = (N − k) log10(sp²) − Σ_{i=1}^k (ni − 1) log10(si²)   (62)

c = 1 + [Σ_{i=1}^k 1/(ni − 1) − 1/(N − k)] / [3(k − 1)]   (63)

In Eq. (62), “log10” means the decimal logarithm and sp² is the pooled variance that, extending Eq. (23) to k variances, is sp² = [Σ_{i=1}^k (ni − 1) si²]/(N − k).

The critical region is

CR = {χ²calc > χ²α, k−1}   (64)
In Example 14, c = 1.10, q = 3.43, and χ²calc = 7.19, which does not belong to the critical region defined in Eq. (64) because χ²0.05, 4 = 9.49. Consequently, we have no evidence of difference in variances.
Cochran’s and Bartlett’s tests are very sensitive to the normality assumption. Levene’s test, particularly when it is based on the
medians of each group, is more robust to the lack of normality of data.
Case 3: Levene’s test
For each i-th group of replicates, compute the absolute deviations of the values xij from the corresponding group mean:

lij = |xij − x̄i|, j = 1, 2, ..., ni   (65)

Consider the data arranged as in Table 8 and compute the usual F statistic for the deviations lij

Fcalc = [Σ_{i=1}^k ni (l̄i − l̄)² / (k − 1)] / [Σ_{i=1}^k Σ_{j=1}^{ni} (lij − l̄i)² / (N − k)]   (66)

where l̄i is the mean of the deviations of the i-th group and l̄ is their overall mean. Note that the numerator of Eq. (66) measures the variability of the deviations between groups, whereas the denominator is the pooled within-group variance of these deviations. The critical region at significance level α is

CR = {Fcalc > Fα, k−1, N−k}   (67)
Computing the differences in Eq. (65) with the data of Table 8, Fcalc ¼ (2.205 � 10−3)/(0.905 � 10−3) ¼ 2.44. As F0.05,4,20 ¼ 2.866,
there is no evidence to reject the null hypothesis (the variances are equal).
Levene’s test is more advisable using group medians instead of group means. The adaptation is simple; one considers the absolute value of the differences, but now with respect to the median, x̃i, of each group:

lij = |xij − x̃i|, j = 1, 2, ..., ni   (68)

The statistic is again the one of Eq. (66) and it is applied in the same way. With the same data of Table 8, but with the values in Eq. (68), one obtains Fcalc = (2.146 × 10⁻³)/(1.360 × 10⁻³) = 1.58, and the conclusion is the same: the variances of the five groups should be considered equal.
It often happens that the three tests do not agree, as is the case here, but a joint interpretation clarifies the situation. In the data of Example 14, the variance of group 5 is greater than the variances of the other groups, as Cochran’s test shows. When Levene’s test is applied, a large difference between the two statistics is observed depending on whether the mean or the median is used. This suggests that the increase in the variance of the last group is caused by some data differing from the others, which can be seen graphically in Fig. 6.
Goodness-of-Fit Tests: Normality Tests
Tests on distributions, or goodness-of-fit tests, are designed to decide whether the experimental data are compatible with a predetermined probability distribution, generally characterized by one or several parameters, such as the normal, Student’s t, F, or uniform distributions. Almost all the inferential procedures proposed in this article are based on normality; thus, in most cases, it is necessary to check whether the data are compatible with this assumption. In this section, we will show the chi-square test, which can be used for any distribution, and the D’Agostino test, which is advised for evaluating the normality of a data set.
Case 1: Chi-square test
The test is designed to detect frequencies inadequate for a specified probability distribution F0. Given a sample x1, x2, . . ., xn from a
random variable, one is interested in testing the hypothesis
H0 : The distribution of the random variable is F0
H1 : This is not the case
(69)
To compute the statistic, the n sample values are grouped into k classes (intervals). Denote by Oi, i = 1, ..., k, the frequency observed in each class and by Ei the expected frequency for the same class provided the distribution is exactly F0. Then, the statistic in Eq. (70)

χ²calc = Σ_{i=1}^k (Oi − Ei)²/Ei   (70)

follows a χ²k−p−1 distribution, which is used to define the critical region at significance level α as

CR = {χ²calc > χ²α, k−p−1}   (71)
where χ²α, k−p−1 is the value such that pr{χ²k−p−1 > χ²α, k−p−1} = α and p is a number that depends on the distribution F0, for instance p = 2 for a normal, p = 1 for a Poisson, and p = 0 for a uniform distribution. The test requires that the expected frequencies are not too small; if they are, the data are regrouped into bigger classes. In the practice of chemical analysis, the sample sizes are not large, so after grouping the data the chi-square statistic has few d.f., the critical value of Eq. (71) becomes large, and a large discrepancy between the expected and observed frequencies is needed to reject the null hypothesis. This means that the test is very conservative.
Example 15: To show the validity of the use of the crystal violet (CV) as an internal standard in the determination by LC-MS-MS of
malachite green (MG) in trout, a sample of trout was spiked with 1 mg kg−1 of CV and increasing concentrations of MG between 0.5
and 5.0 mg kg⁻¹. The areas of the CV-specific peak (transition 372 > 356) in these calibration standards were: 1326, 1384, 1419, 1464, 1425, 1409, 1387, 1449, 1311, 1338, 1350 and 1345. To verify whether the signal of CV is constant and independent of the concentration of MG in every standard, we can test the null hypothesis, H0,
H0 : The distribution of the random variable is uniform
H1 : This is not the case
Table 10 shows the calculation of both observed and expected frequencies under the uniform distribution in the interval [1311,
1464], the endpoints being respectively the minimum and maximum values in the sample.
By summing the values of the last column of Table 10, the statistic is χ²calc = 0.51, which does not belong to the critical region because it is not greater than χ²0.05, 5−0−1 = 9.49. Therefore, there is no evidence to reject the hypothesis that the data come from a uniform distribution.
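The whole of Table 10 can be reproduced with a short script (illustrative plain Python; `class_index` is a hypothetical helper name). Note that the exact statistic is 0.50; the 0.51 above comes from summing the rounded entries of the table.

```python
data = [1326, 1384, 1419, 1464, 1425, 1409, 1387, 1449, 1311, 1338, 1350, 1345]
k = 5
lo, hi = min(data), max(data)                 # uniform on [1311, 1464]
width = (hi - lo) / k

def class_index(x):
    """Class number of x, with the maximum assigned to the last class."""
    return min(int((x - lo) / width), k - 1)

obs = [0] * k
for x in data:
    obs[class_index(x)] += 1

expected = len(data) / k                      # 2.4 per class under H0
chi2 = sum((o - expected) ** 2 / expected for o in obs)
# chi2_{0.05, 4} = 9.49; the statistic is far below the critical value
print(obs, round(chi2, 2))
```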
Case 2: D’Agostino normality test
The problem of checking the normality of a set of data has been extensively treated. When the empirical and theoretical histograms
are compared, the most commonly used tests are those of the chi-square and the Kolmogoroff–Smirnov. However, there are several
Table 11 Significance limits for the D’Agostino normality test.
Sample size
Significance level
a ¼ 0.05 a ¼ 0.01
DL DU DL DU
10 0.2513 0.2849 0.2379 0.2857
12 0.2544 0.2854 0.2420 0.2862
14 0.2568 0.2858 0.2455 0.2865
16 0.2587 0.2860 0.2482 0.2867
18 0.2603 0.2862 0.2505 0.2868
20 0.2617 0.2863 0.2525 0.2869
22 0.2629 0.2864 0.2542 0.2870
24 0.2639 0.2865 0.2557 0.2871
26 0.2647 0.2866 0.2570 0.2872
28 0.2655 0.2866 0.2581 0.2873
30 0.2662 0.2866 0.2592 0.2873
Adapted from Martín Andrés, A.; Luna del Castillo, J. D. Bioestadística para las ciencias de la salud; Norma Capitel: Madrid, Spain, 2004.
Table 10 χ² Goodness-of-fit test to a uniform distribution applied to assess the validity of crystal violet as internal standard; data of Example 15.

Class  Observed frequency (Oi)  Expected frequency (Ei)  (Oi − Ei)²/Ei
[1311.0, 1341.6) 3 2.40 0.15
[1341.6, 1372.2) 2 2.40 0.07
[1372.2, 1402.8) 2 2.40 0.07
[1402.8, 1433.4) 3 2.40 0.15
[1433.4, 1464.0) 2 2.40 0.07
characteristics that are specific to the pdf of a normal distribution, for example the skewness and the kurtosis, which are statistics related to moments of order higher than two. A very powerful test is the D’Agostino test, with hypotheses

H0: The distribution of the random variable is normal
H1: This is not the case
To apply the test, the data are sorted in increasing order, so that x1 ≤ x2 ≤ ⋯ ≤ xn. The statistic is

Dcalc = [Σ_{i=1}^n (i − (n + 1)/2) xi] / ( n √( n [Σ_{i=1}^n xi² − (Σ_{i=1}^n xi)²/n] ) )   (72)
Index i in Eq. (72) refers to the ordered data. Table 11 shows some of the critical values of the statistic Dα,n, with the two values DLα,n and DUα,n for each sample size n and significance level α. The critical region of the test is

CR = {Dcalc < DLα,n} ∪ {Dcalc > DUα,n}   (73)
For further details, consult the work by D’Agostino and Stephens.48
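Eq. (72) is straightforward to code (illustrative plain Python; `dagostino_D` is a hypothetical helper name). As a quick check, evenly spaced data, which are uniform-like rather than normal, give a Dcalc above the upper limit DU of Table 11:

```python
import math

def dagostino_D(data):
    """D'Agostino's D statistic (Eq. 72) computed on the ordered sample."""
    x = sorted(data)
    n = len(x)
    num = sum((i - (n + 1) / 2) * xi for i, xi in enumerate(x, start=1))
    ss = sum(v * v for v in x) - sum(x) ** 2 / n
    return num / (n * math.sqrt(n * ss))

x = list(range(1, 21))                        # n = 20, evenly spaced
D = dagostino_D(x)
# Table 11, n = 20, alpha = 0.05: DL = 0.2617, DU = 0.2863
print(round(D, 4), D < 0.2617 or D > 0.2863)  # D exceeds DU: reject normality
```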
As with the confidence intervals, a Bayesian approach exists for the construction of the hypothesis tests that several statisticians
prefer because of its internal coherence. For a recent comparative analysis of both approaches, see Moreno and Girón.49
One-Way Analysis of Variance
Sometimes, more than two means must be compared. One could think of comparing, say, five means by applying the test of comparison of two means of Section “Hypothesis Test on the Difference in Two Means” to each of the 10 pairs of means that can be formed by taking them two by two. This option has a serious drawback: it requires enormous sample sizes, because to test the null hypothesis “the five means are equal” with α = 0.05 and assuming that the 10 tests are independent, each one of the hypotheses “the means x̄i
Table 12 Arrangement of data for an ANOVA.
Factor
Level 1 Level 2 Level 3 ⋯ Level k
x11 x21 x31 ⋯ xk1
x12 x22 x32 ⋯ xk2
x13 x23 x33 ⋯ xk3
⋮ ⋮ ⋮ ⋱ ⋮
x1n x2n x3n ⋯ xkn
and x̄j are equal” should be tested at a significance level of 0.0051 to obtain a confidence equal to (1 − 0.0051)¹⁰ ≈ 0.95. The appropriate procedure for testing the equality of several means is the analysis of variance (ANOVA).
The ANOVA has many more applications; it is particularly useful in the validation of a model fit to some experimental data and,
hence, in an analytical calibration or in the analysis of response surfaces as can be seen in the corresponding chapters of the
present book.
Table 12 shows how the data are usually arranged in a general case: in columns, the k levels of a factor (e.g., five different extraction cartridges) and, in rows, the n data obtained (e.g., four determinations with each cartridge). Each of the N = k × n values xij (i = 1, 2, ..., k, j = 1, 2, ..., n) is the result obtained, in our example, when using the i-th cartridge with the j-th aliquot sample. In general, a different number of replicates ni may be available in each level i, with N = Σ_{i=1}^k ni. To make the notation easier, we will suppose that all ni are equal, that is, ni = n for each level.
Suppose that the data in Table 12 can be described by the model

xij = μ + ti + eij, i = 1, 2, ..., k, j = 1, 2, ..., n   (74)

where μ is a parameter common to all treatments, called the overall mean, ti is a parameter associated with the i-th level, called the factor effect, and eij is the random error component. In our example, μ is the content of the sample and ti is the variation in this quantity caused by the use of the i-th cartridge. Note that in the model of Eq. (74) the effect of the factor is additive; this is an assumption that may be unacceptable in some practical situations.
The ANOVA is posed to test some hypotheses about the treatment effects and to estimate them. To support the conclusions when testing the hypothesis, the model errors eij are assumed to be normally and independently distributed random variables with mean zero and variance σ², NID(0, σ). Besides, the variance σ² is assumed to be constant for every level of the factor.
The model of Eq. (74) is called the one-way ANOVA, because only one factor is studied. The analysis for two or more factors can
be seen in the chapter about factorial techniques in this book. Furthermore, the data of Table 12 are required to be obtained in
random order to reduce the effect of other uncontrolled factors.
There are two ways for choosing the k levels of the factor in the experiment. In the first case, the k levels are specifically chosen by
the researcher, as the cartridges in our example. In this case, we wish to test the hypothesis about the magnitude of ti and
conclusions will apply only to the levels of the factor explicitly considered in the analysis and they cannot be extended to similar
levels that were not considered. This is called the “fixed effects model”.
Alternatively, the k levels could be a random sample from a larger population of levels. In this case, we would like to be able to
extend the conclusions based on the sample to all levels in the population, regardless of whether they have been explicitly
considered in the analysis or not. Hence, each ti is a random variable and information about the specific values included in the
analysis is useless. Instead, we test the hypothesis about the variability. This is called the “random effects model”. This model is used
to evaluate the repeatability and reproducibility of a method and also the laboratory bias when the method of analysis is being
tested by a proficiency test. In the same experiment, and provided there are at least two factors, fixed and random effects can
simultaneously appear.50,51 The reader can easily combine them, when appropriate, from the explanations and examples in the
following subsections.
Also for this section, all the computations for the examples can be followed with the live-script in the supplementary material
named ANOVA_section1024_live.mlx.
The Fixed Effects Model
In this model, the effect of the factor ti is defined as the difference between the mean in each level and the overall mean with the
constraint:
Σ_{i=1}^k ti = 0   (75)
From the individual data, the mean value per level is defined as

x̄i = (Σ_{j=1}^n xij)/n, i = 1, 2, ..., k   (76)

And the overall mean is

x̄ = (Σ_{i=1}^k Σ_{j=1}^n xij)/N   (77)
A simple calculation gives

Σ_{i=1}^k Σ_{j=1}^n (xij − x̄)² = n Σ_{i=1}^k (x̄i − x̄)² + Σ_{i=1}^k Σ_{j=1}^n (xij − x̄i)²   (78)

Eq. (78) shows that the total variability of the data, measured by the sum of squares of the differences between each datum and the overall mean, can be partitioned into a sum of squares of differences between level means and the overall mean plus a sum of squares of differences between individual values and their level mean. The term n Σ_{i=1}^k (x̄i − x̄)² measures the differences between levels, whereas Σ_{i=1}^k Σ_{j=1}^n (xij − x̄i)² is due to random error alone. It is common to write Eq. (78) as
SST = SSF + SSE   (79)

where SST is the total sum of squares, SSF is the sum of squares due to changing the levels of the factor, called the sum of squares between levels, and SSE is the sum of squares due to random error, called the sum of squares within levels. There are N individual values, thus SST has N − 1 d.f. Similarly, as there are k levels of the factor, SSF has k − 1 d.f. Finally, SSE has N − k d.f. We are interested in testing

H0: t1 = t2 = t3 = ... = tk = 0 (there is no effect due to the factor)
H1: ti ≠ 0 for at least one i

Because of the assumption that the errors eij are NID(0, σ), the values xij are NID(μ + ti, σ), and therefore SST/σ² is distributed as a χ²N−1. Cochran’s theorem guarantees that, under the null hypothesis, SSF/σ² and SSE/σ² are independent chi-square distributions with k − 1 and N − k d.f., respectively. Therefore, under the null hypothesis, the statistic
Fcalc = (SSF/(k − 1)) / (SSE/(N − k)) = MSF/MSE   (80)

follows an Fk−1, N−k distribution, whereas under the alternative hypothesis, it follows a noncentral F with the same d.f.50 The quantities MSF and MSE are called mean squares. Their expected values are E(MSF) = σ² + n(Σ_{i=1}^k ti²)/(k − 1) and E(MSE) = σ², respectively. Therefore, under the null hypothesis, both are unbiased estimators of the residual variance, σ², whereas under the alternative hypothesis, the expected value of MSF is greater than σ². The critical region of the test at significance level α is in Eq. (81) and reflects the idea that, if the null hypothesis is false, the numerator of Eq. (80) is significantly greater than the denominator.

CR = {Fcalc > Fα, k−1, N−k}   (81)

Usually, the test procedure is summarized in a table (called the ANOVA table) like the one in Table 13, except that we have added a column, the one corresponding to E(MS), just to emphasize the values each MS estimates and their relation to the previous discussion.
Example 16: To investigate the influence of the composition of some fibers on a SPME procedure, an experiment was performed
using five different fibers. The data shown in Table 14 are the results of four replicated analyses carried out after extraction with each
Table 13 Skeleton of an ANOVA of fixed effects.
Source of variation      Sum of squares  d.f.   Mean squares  E(MS)                          Fcalc
Factor (between levels)  SSF             k − 1  MSF           σ² + n Σ_{i=1}^k ti²/(k − 1)   MSF/MSE
Error (within levels)    SSE             N − k  MSE           σ²
Total                    SST             N − 1
Table 14 Experimental results (mg L−1 of triazine), means and variances obtained in the study of the effect of
the type of fiber in a SPME procedure.
Type of fiber
Fiber 1 Fiber 2 Fiber 3 Fiber 4 Fiber 5
Replicates 490 612 509 620 490
478 609 496 601 502
492 599 489 580 495
499 589 500 603 479
Mean x̄i: 489.75 602.25 498.50 601.00 491.50
Variance si²: 76.25 108.92 69.67 268.67 93.67
Table 15 Results of the ANOVA for data of Table 14.
Source Sum of squares d.f. Mean squares Fcalc
Between fibers 56 551.3 4 14 137.8 114.54
Error (within fibers) 1 851.5 15 123.4
Total 58 402.8 19
fiber on a sample spiked with 1000 mg L−1 of triazine. All the analyses were carried out in random order maintaining the rest of
experimental conditions controlled.
In the last two rows of Table 14, the means and variances for each fiber are given. Before conducting the ANOVA, the hypothesis
of equality of variances should be tested:
H0: σ1² = ... = σi² = ... = σk²
H1: at least one σi² is different
With the variances in Table 14, the statistic of Cochran’s test (Eq. (59)) is Gcalc = 268.67/617.18 = 0.435. As G0.05, 5, 3 = 0.5981 (see Table 9), the statistic does not belong to the critical region (Eq. (60)) and there is no evidence to reject the null hypothesis at the 5% significance level.

The statistic of Bartlett’s test is χ²calc = 1.792 (Eq. (61)) and the critical value is χ²0.05, 4 = 9.488, so there is no evidence to reject the null hypothesis either (Eq. (64)). The same happens with Levene’s test; computing the absolute deviations of the data of Table 14 according to Eq. (65), Fcalc = 14.70/44.01 = 0.33, and F0.05, 4, 15 = 3.06, so there is no evidence to reject the null hypothesis on the equality of variances. Using the median instead of the mean (Eq. (68)), Fcalc = 15.13/46.23 = 0.33, and the conclusion is the same. From the analysis of the equality of variances, we can conclude that the variances of the five levels should be considered equal.
The ANOVA of the experimental data gives the results in Table 15. Considering the critical region defined in Eq. (81), as
Fcalc ¼ 114.54 is greater than the critical value F0.05,4,15 ¼ 3.06, we reject the null hypothesis and hence the conclusion is that there is
a significant effect of “fiber composition” on the extracted amount.
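The ANOVA of Table 15 can be reproduced with a short script. This is an illustrative plain-Python sketch (the supplementary live-script performs the same computation in MATLAB); the critical value 3.06 is taken from the text.

```python
data = {
    "Fiber 1": [490, 478, 492, 499],
    "Fiber 2": [612, 609, 599, 589],
    "Fiber 3": [509, 496, 489, 500],
    "Fiber 4": [620, 601, 580, 603],
    "Fiber 5": [490, 502, 495, 479],
}

groups = list(data.values())
k, n = len(groups), len(groups[0])
N = k * n
grand = sum(sum(g) for g in groups) / N
means = [sum(g) / n for g in groups]

ss_f = n * sum((m - grand) ** 2 for m in means)                     # between levels
ss_e = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)  # within levels
f_calc = (ss_f / (k - 1)) / (ss_e / (N - k))
# F_{0.05, 4, 15} = 3.06: the fiber effect is significant
print(round(ss_f, 1), round(ss_e, 1), round(f_calc, 2))
```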
Power of the Fixed Effects ANOVA model
The power of the ANOVA is computed from the following expression:

1 − β = pr{F*k−1, N−k, δ > Fα, k−1, N−k}   (82)

where Fα, k−1, N−k is the critical value of Eq. (81), F*k−1, N−k, δ is a noncentral F distribution with k − 1 and N − k d.f. in the numerator and denominator, respectively, and δ is the noncentrality parameter, whose value is given by

δ = n (Σ_{i=1}^k ti²)/σ²   (83)
The noncentrality parameter δ depends on the number of replicates n and also on the difference in means that we wish to detect in terms of Σ_{i=1}^k ti². When the error variance is unknown, which is usually the case, we must define the differences to be detected in terms of the ratio Σ_{i=1}^k ti²/σ². As the power, 1 − β, of the test increases with δ, the next question would be about the minimum δ needed
Table 16 Probability of type II error, b, as a function of the number n of replicates in the ANOVA for comparing
fiber types.
n 4 5 6 7 8 9
b 0.347 0.203 0.111 0.058 0.029 0.014
(for a given β) to distinguish differences of at least D between two of the t’s. This minimum δ can be computed provided that two of the ti differ by D and the remaining k − 2 are kept at the mean of these two,50 and is given by

Σ_{i=1}^k ti² = D²/2   (84)
For example, with the data of Example 16 (Table 14), we are now interested in the risk of falsely affirming that the type of fiber is not significant for the recovery. The answer consists of evaluating the probability β by Eq. (82). Suppose that we want to discriminate effects greater than twice the MSE, that is, Σ_{i=1}^k ti²/σ² ≥ 2, and thus δ = n × 2 = 8. Notice that, by substituting Eq. (84) into Eq. (83), this value of δ means that we want to discriminate a difference D between two types of fiber of at least 2σ. In these conditions, F0.05, 4, 15 = 3.06 and β = 0.54 (calculations can be seen in Example A14 of Appendix and in the live-script ANOVA_section1024_live.mlx in the supplementary material). In other words, 54 out of 100 times we will accept the null hypothesis (there is no effect of the composition of the fiber) when it is wrong. This is not good enough for a suitable decision rule.
Eq. (82) can also be used to determine the sample size before starting an experiment, so that risks α and β are both good enough. For example, suppose we want to know how many replicates we need in the experiment for α = β = 0.05 while maintaining the ratio Σ_{i=1}^k ti²/σ² ≥ 3. Note that, in this case, the analyst considers there to be an “effect of fiber type” if it is greater than 3 times σ², which is equivalent, using Eq. (84), to detecting a difference between two fibers at least equal to D = √6 σ ≈ 2.5σ. To calculate the sample size, a table must be made writing β as a function of n in Eq. (82) with k, α, and δ fixed at 5, 0.05, and 3 × n, respectively. Following the results shown in Table 16, computed with the code in the mentioned live-script, we need n = 8 replicates with each fiber to achieve β ≤ 0.05, although in practice n = 7 would be enough.
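Eqs. (82)–(84) only require the central and noncentral F distributions. The sketch below is illustrative and assumes Python with SciPy (`scipy.stats.ncf`); `anova_beta` is a hypothetical helper name, and the supplementary live-script performs the equivalent computation in MATLAB.

```python
from scipy.stats import f, ncf

def anova_beta(k, n, ratio, alpha=0.05):
    """Type II error of the one-way ANOVA (Eq. 82) when
    sum(t_i^2)/sigma^2 = ratio, so that delta = n * ratio (Eq. 83)."""
    delta = n * ratio
    f_crit = f.ppf(1 - alpha, k - 1, k * n - k)
    return ncf.cdf(f_crit, k - 1, k * n - k, delta)

# Example 16: k = 5 fibers, n = 4 replicates, ratio = 2, so beta is about 0.54
print(round(anova_beta(5, 4, 2), 2))

# Smallest n with beta <= 0.05 when ratio = 3 (cf. Table 16)
n_req = next(n for n in range(2, 50) if anova_beta(5, n, 3) <= 0.05)
print(n_req)
```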
Uncertainty and Testing of the Estimated Parameters in the Fixed Effects Model
It is possible to derive estimators for the parameters μ and ti (i = 1, ..., k) in the one-way ANOVA modeled by Eq. (74). The normality assumption on the errors is not needed to obtain an estimate by least squares; however, the solution is not unique, so the constraint of Eq. (75) is imposed. Using this constraint, we obtain the following estimates

μ̂ = x̄ and t̂i = x̄i − x̄, i = 1, ..., k   (85)
where x̄i and x̄ have been defined in Eqs. (76), (77), respectively. If the number of replicates ni in each level is not equal (unbalanced ANOVA), then the constraint in Eq. (75) should be changed to Σ_{i=1}^k ni ti = 0 and the weighted average of the x̄i should be used instead of the unweighted average in Eq. (85). Now, if we assume that the errors are NID(0, σ) and ni = n, i = 1, ..., k, the estimates of Eq. (85) are also the maximum likelihood ones. For unbalanced designs, the maximum likelihood solution is better because the least squares solution is biased. The reader interested in this subject should consult statistical monographs that describe this matter at a high level, such as Milliken and Johnson52 and Searle.53
The mean of the i-th level is μ_i = μ + τ_i, i = 1,…,k. In our case, with a balanced design, an estimator of μ_i would be μ̂_i = μ̂ + τ̂_i = x̄_i and, as the errors are NID(0,σ), x̄_i is NID(μ_i, σ/√n). Using MSE as an estimator of σ², Eq. (16) gives the confidence interval at the (1 − α)100% level:

[x̄_i − t_{α/2,N−k} √(MSE/n) ; x̄_i + t_{α/2,N−k} √(MSE/n)]   (86)
A (1 − α)100% confidence interval on the difference in the means of any two levels, say μ_i − μ_j, would be

[(x̄_i − x̄_j) − t_{α/2,N−k} √(2MSE/n) ; (x̄_i − x̄_j) + t_{α/2,N−k} √(2MSE/n)]   (87)

With the data in Example 16 (Table 14), a 95% confidence interval on the difference between fibers 1 and 2 is given by

(489.75 − 602.25) ± 2.131 √(2 × 123.43/4)

which is [−129.24, −95.76].
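As a quick numerical check (a sketch in Python rather than the chapter's MATLAB live-script), Eq. (87) can be evaluated directly with the values of Example 16:

```python
# Direct evaluation of Eq. (87) with MSE = 123.43, n = 4 replicates and
# N - k = 15 d.f. from Example 16 (fibers 1 and 2).
from math import sqrt
from scipy.stats import t

mse, n, df = 123.43, 4, 15
t_crit = t.ppf(1 - 0.05 / 2, df)            # t_{0.025,15}, about 2.131
diff = 489.75 - 602.25                      # xbar_1 - xbar_2
half_width = t_crit * sqrt(2 * mse / n)
lo, hi = diff - half_width, diff + half_width
print(round(lo, 2), round(hi, 2))           # reproduces [-129.24, -95.76]
```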
32 Quality of Analytical Measurements: Statistical Methods for Internal Validation
Finally, the (1 − α)100% confidence interval on the overall mean is

[x̄ − t_{α/2,N−k} √(MSE/(n·k)) ; x̄ + t_{α/2,N−k} √(MSE/(n·k))]   (88)

In the example, as there is an effect of the type of fiber, it makes no sense to compute this interval.
Rejecting the null hypothesis in the fixed effect model of the ANOVA implies that there are differences between the k levels, but
the exact nature of the differences is not specified. To address this question, two procedures are used: orthogonal contrasts and
multiple tests.
Case 1: Orthogonal contrasts
For example, with the data of Table 14, we would like to test the hypothesis H0: μ4 = μ5. A contrast is a linear combination of the means; in this case, the linear relation related to this hypothesis is x̄4 − x̄5 = 0. The contrast is then tested by comparing its sum of squares to the mean square error with a statistic that follows an F distribution with 1 and N − k d.f.
Each contrast is defined by the coefficients of the linear combination, in the previous case (0,0,0,1,−1). Two contrasts C = (c1,c2,…,ck) and D = (d1,d2,…,dk) are orthogonal if ∑_{i=1}^k c_i d_i = 0. There are numerous ways to choose the orthogonal contrast coefficients for a set of levels. Usually, something in the experiment should suggest which comparison(s) will be of interest.
To illustrate the procedure, and with a purely didactic purpose, we pose a fictitious case with the data of Table 14. In each problem, its peculiarities and the previous knowledge of the analyst will suggest the contrasts to be studied. The comparisons between the means per fiber type and the associated orthogonal contrasts proposed are

H0: μ4 = μ5    C1 = x̄4 − x̄5
H0: μ1 + μ3 = μ4 + μ5    C2 = x̄1 + x̄3 − x̄4 − x̄5
H0: μ1 = μ3    C3 = x̄1 − x̄3
H0: 4μ2 = μ1 + μ3 + μ4 + μ5    C4 = −x̄1 + 4x̄2 − x̄3 − x̄4 − x̄5
The sum of squares associated with each contrast C is

SS_C = n (∑_{i=1}^k c_i x̄_i)² / ∑_{i=1}^k c_i²   (89)

For example, SS_{C1} = 4 (601.00 − 491.50)²/2 ≈ 23981, with 1 d.f. These sums of squares are incorporated into the ANOVA table as shown in Table 17.
Now, to test each of the hypotheses, it suffices to compare the corresponding Fcalc in Table 17 with the critical value F_{0.05,1,15} = 4.54. The conclusion is that, except for C3, the rest of the contrasts are significant according to Eq. (81). Thus, we should
reject the hypothesis that fibers 4 and 5 give the same recovery. The hypothesis that the mean of the recovery rates for fibers 1 and 3
is the same as for fibers 4 and 5 is also rejected. Also, fiber 2 differs significantly from the mean of the other four, whereas there is no
experimental evidence to reject that fibers 1 and 3 provide the same recovery. It is just an example of the wide range of possibilities
for analyzing experimental results.
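Eq. (89) is straightforward to compute; a minimal Python sketch (variable names are ours) that reproduces SS_C1 and its Fcalc from the fiber means of Table 14:

```python
# Minimal computation of Eq. (89) for a balanced design; the means are those
# of Table 14 (n = 4 replicates) and the variable names are ours.
means = [489.75, 602.25, 498.50, 601.00, 491.50]   # xbar_1 ... xbar_5
n = 4

def ss_contrast(c, means, n):
    """Sum of squares of the contrast with coefficients c."""
    return n * sum(ci * xi for ci, xi in zip(c, means)) ** 2 / sum(ci ** 2 for ci in c)

ss_c1 = ss_contrast((0, 0, 0, 1, -1), means, n)    # H0: mu4 = mu5
print(ss_c1, ss_c1 / 123.43)                       # SS_C1 (about 23981) and Fcalc
```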
Case 2: Comparison of several means
Many different methods have been described that were specifically designed for the comparison of several means. Here, we will
describe the method of Newman-Keuls. The hypothesis test is the following:
H0: All the differences two by two are equal to zero
H1: At least one difference is non-null
Table 17 ANOVA table with orthogonal contrasts for composition of fibers for SPME.
Source Sum of squares d.f. Mean squares Fcalc
Between fibers 56551.3 4 14137.8 114.54
C1: μ4 = μ5 23981.0 1 23981.0 194.33
C2: μ1 + μ3 = μ4 + μ5 10868.0 1 10868.0 88.07
C3: μ1 = μ3 153.1 1 153.1 1.24
C4: 4μ2 = μ1 + μ3 + μ4 + μ5 21550.0 1 21550.0 174.63
Error (within fibers) 1851.5 15 123.4
Total 58402.8 19
Table 18 Results of Newman-Keuls for multiple comparison test; data of SPME fibers.
Levels  Rank  Mean    Homogeneous groups
2       1     602.25  •
4       2     601.00  •
3       3     498.50     •
5       4     491.50     •
1       5     489.75     •
The symbols "•" aligned in columns indicate that the corresponding means are all equal two by two.
Table 19 Skeleton for using the corresponding tabulated values for the Newman-Keuls procedure.
t = k                  t = k − 1                …    t = 2
x̄_{r(1)} − x̄_{r(k)}    x̄_{r(1)} − x̄_{r(k−1)}    …    x̄_{r(1)} − x̄_{r(2)}
x̄_{r(2)} − x̄_{r(k)}                             …    x̄_{r(2)} − x̄_{r(3)}
⋱                                                    ⋮
x̄_{r(k−1)} − x̄_{r(k)}
q_α(k, k(n − 1))       q_α(k − 1, k(n − 1))     …    q_α(2, k(n − 1))
t denotes the difference of ranks plus one; the subscript r(i) indicates the i-th rank. k is the number of levels in the ANOVA and the q_α are the tabulated values at significance level α.
Quality of Analytical Measurements: Statistical Methods for Internal Validation 33
The procedure consists of the following steps:
1. To sort the means per level, x̄_i, i = 1,2,…,k, in decreasing order, x̄_{r(1)} ≥ x̄_{r(2)} ≥ ⋯ ≥ x̄_{r(k)}. The subindex r(i) refers to the rank of the corresponding mean, that is, the position that it occupies in the ordered list. For example, the means of Table 14 have the following ranks: r(1) = 2, r(2) = 4, r(3) = 3, r(4) = 5, and r(5) = 1. That means that the first one, which is 489.75, has rank 5, that is, it occupies the fifth position in the list ordered in decreasing order. Table 18 shows the ordered means and the ranks in the second column.
2. To create a table with the differences between the means from greatest to lowest in columns identified by t, which is equal to the difference of ranks plus one. Table 19 contains all the possible contrasts two by two of the means.
3. Finally, each of the following hypotheses is tested:

H0: x̄_{r(i)} − x̄_{r(i+t−1)} = 0
H1: x̄_{r(i)} − x̄_{r(i+t−1)} > 0
with the statistic in Eq. (90)

R_t = q_α(t, k(n − 1)) √(MSE/n)   (90)

The values q_α(t, k(n − 1)) in Eq. (90) are tabulated. Table 20 shows some of them. They depend, as usual, on the significance level α, on t, and on the d.f. N − k of MSE. Further, the first term in R_t changes with the difference of ranks, t. The corresponding values are written in the last row in Table 19.
The critical region is made up by

CR = {x̄_{r(i)} − x̄_{r(i+t−1)} ≥ R_t}   (91)
The results obtained when applying the method of Newman-Keuls to the data of Example 16 are given in Table 21. The first column contains the means to be compared; for example, 1–2 indicates that the comparison is between x̄1 and x̄2. The second column contains the differences (without sign) between the means. The values of t (difference of ranks plus one) are in the third column; for example, t = 5 in the first row because, with the ranks in Table 18, x̄1 has rank 5 and the rank of x̄2 is 1. The next column contains the critical value (R_t) computed with the value of q in Table 20 and Eq. (90). The critical value R_t defines the critical region so that the analyst can decide whether the estimated difference is significant or not. At the 5% significance level, the resulting decision of rejecting or not rejecting the null hypothesis is shown in the last column of Table 21.
Usually, the result of this multiple comparison is presented as in the last column of Table 18, which is more graphic. The columns of aligned symbols indicate that the corresponding means are all equal two by two. In our example, this happens, on the one hand, for the means x̄2 and x̄4 and, on the other hand, for any pair among x̄1, x̄3, and x̄5, according to the decisions in Table 21. It is possible to conclude that there are two groups of fibers; as far as the recovery is concerned, fibers 2 and 4 provide results that are significantly equal and greater than the recovery rates obtained with the other three fibers, which are similar to each other.
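The critical values R_t of Eq. (90) can be reproduced without tables; a sketch in Python assuming SciPy ≥ 1.7, whose `studentized_range` distribution provides the quantiles q_α(t, ν) otherwise read from Table 20:

```python
# Reproduction of the critical values R_t of Eq. (90). Assumes SciPy >= 1.7
# (scipy.stats.studentized_range); MSE = 123.43, n = 4, 15 error d.f. as in
# Example 16.
from math import sqrt
from scipy.stats import studentized_range

mse, n, df, alpha = 123.43, 4, 15, 0.05

def r_t(t):
    q = studentized_range.ppf(1 - alpha, t, df)   # q_0.05(t, k(n-1))
    return q * sqrt(mse / n)

print(round(r_t(2), 2), round(r_t(5), 2))         # compare with Table 21
```

Small discrepancies with Table 21 in the second decimal place come from the rounding of q to two decimals in Table 20.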
Table 20 Values of q_α(t,n), the upper percentage points of the studentized range for α = 0.05.
n t
2 3 4 5 6 7 8 9 10
1 17.969 26.98 32.82 37.08 40.41 43.12 45.50 47.36 49.07
2 6.085 8.33 9.80 10.88 11.74 12.44 13.03 13.54 13.99
3 4.501 5.91 6.82 7.50 8.04 8.48 8.85 9.18 9.46
4 3.926 5.04 5.76 6.29 6.71 7.05 7.35 7.60 7.83
5 3.635 4.60 5.22 5.67 6.03 6.33 6.58 6.80 6.99
6 3.460 4.34 4.90 5.30 5.63 5.90 6.12 6.32 6.49
7 3.344 4.16 4.68 5.06 5.36 5.61 5.82 6.00 6.16
8 3.261 4.04 4.53 4.89 5.17 5.40 5.60 5.77 5.92
9 3.199 3.95 4.41 4.76 5.02 5.24 5.43 5.59 5.74
10 3.151 3.88 4.33 4.66 4.91 5.12 5.30 5.46 5.60
11 3.113 3.82 4.26 4.57 4.82 5.03 5.20 5.35 5.49
12 3.081 3.77 4.20 4.51 4.75 4.95 5.12 5.27 5.39
13 3.055 3.73 4.15 4.45 4.69 4.88 5.05 5.19 5.32
14 3.033 3.70 4.11 4.41 4.64 4.83 4.99 5.13 5.25
15 3.014 3.67 4.08 4.37 4.59 4.78 4.94 5.08 5.20
16 2.998 3.65 4.05 4.33 4.56 4.74 4.90 5.03 5.15
17 2.984 3.63 4.02 4.30 4.52 4.70 4.86 4.99 5.11
18 2.971 3.61 4.00 4.28 4.49 4.67 4.82 4.96 5.07
19 2.960 3.59 3.98 4.25 4.47 4.65 4.79 4.92 5.04
20 2.950 3.58 3.96 4.23 4.45 4.62 4.77 4.90 5.01
t, difference of ranks plus one; n, degrees of freedom of MSE.
Adapted from Sachs, L. Applied Statistics. A Handbook of Techniques; Springer-Verlag: New York, 1982.
Table 21 Results of the Newman-Keuls test applied to data of SPME fibers.
Contrast levels Differences |x̄_i − x̄_j| t q_{0.05,t,15} Critical values, R_t Decision according to CR = {x̄_{r(i)} − x̄_{r(i+t−1)} ≥ R_t}
1–2 112.50 5 4.37 4.37 × 5.555 = 24.27 Reject H0
1–3 8.75 3 3.67 3.67 × 5.555 = 20.39 No evidence to reject H0
1–4 111.25 4 4.08 4.08 × 5.555 = 22.66 Reject H0
1–5 1.75 2 3.01 3.01 × 5.555 = 16.72 No evidence to reject H0
2–3 103.75 3 3.67 3.67 × 5.555 = 20.39 Reject H0
2–4 1.25 2 3.01 3.01 × 5.555 = 16.72 No evidence to reject H0
2–5 110.75 4 4.08 4.08 × 5.555 = 22.66 Reject H0
3–4 102.50 2 3.01 3.01 × 5.555 = 16.72 Reject H0
3–5 7.00 2 3.01 3.01 × 5.555 = 16.72 No evidence to reject H0
4–5 109.50 3 3.67 3.67 × 5.555 = 20.39 Reject H0
H0: the difference is null, x̄_i − x̄_j = 0; H1: x̄_i − x̄_j ≠ 0; α = 0.05.
The Random Effects Model
In many cases, the factor of interest is a random variable as well, so that the chosen levels are in fact a sample of this random
variable and we want to extract conclusions about the population from which the sample comes. For example, in the case of
validating an analytical method, several laboratories will apply it to aliquot samples so that it is possible to decide what part of
the variability of the results is attributable to the change of laboratory and what part is due to the repetition of the procedure
inside the same laboratory. These are the concepts of reproducibility and repeatability. The same happens in the analytical control
of processes: It is necessary to split the variability observed between the one due to the measurement procedure and the one
assignable to the process.
The linear statistical model is

x_ij = μ + τ_i + e_ij with i = 1, 2, …, k, j = 1, 2, …, n   (92)

where τ_i and e_ij are independent random variables. Note that the model is identical in structure to the fixed effects case (Eq. (74)), but the parameters have a different interpretation. If V(τ_i) = σ_τ², then the variance of any observation is

V(x_ij) = σ_τ² + σ²   (93)
The variances of Eq. (93) are called variance components, and the model, Eq. (92), is called the components of variance or random effects model. To test hypotheses in this model, we require that the e_ij are NID(0,σ), that all of the τ_i are NID(0,σ_τ), and that the τ_i and e_ij are independent of one another.
The sum of squares equality SST = SSF + SSE still holds. However, instead of testing hypotheses about individual level effects, we test the hypothesis

H0: σ_τ² = 0
H1: σ_τ² > 0

If σ_τ² = 0, all levels are identical; if σ_τ² > 0, then there is variability between levels. Thus, under the null hypothesis, the ratio

F_calc = (SSF/(k − 1)) / (SSE/(N − k)) = MSF/MSE   (94)

is distributed as an F with k − 1 and N − k d.f. The expected values (means) of MSF and MSE are

E(MSF) = σ² + nσ_τ²   (95)

and

E(MSE) = σ²   (96)

Therefore, the critical region is

CR = {F_calc > F_{α,k−1,N−k}}   (97)
Power of the Random Effects ANOVA model
The power of the random effects ANOVA model is obtained from

1 − β = pr(F_{k−1,N−k} > F_{α,k−1,N−k}/λ²)   (98)

where λ² = 1 + nσ_τ²/σ². As σ² is usually unknown, we may either use a prior estimate or define the value of σ_τ² that we are interested in detecting in terms of the ratio σ_τ²/σ². An application to determine the number of replicates in a proficiency test can be seen in Example A15 of Appendix and in ANOVA_section1024_live.mlx in the supplementary material.
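Because under the alternative the statistic of Eq. (94) is distributed as λ² times a central F variate, Eq. (98) needs only the central F distribution. A minimal Python sketch (the function name and example figures are ours):

```python
# Sketch of Eq. (98): power of the random-effects ANOVA F-test. Only the
# central F distribution is needed; function name and figures are ours.
from scipy.stats import f

def power_random_effects(k, n, ratio, alpha=0.05):
    """Power for detecting ratio = sigma_tau^2/sigma^2 with k levels, n replicates."""
    dfn, dfd = k - 1, k * (n - 1)
    lam2 = 1 + n * ratio                  # lambda^2 = 1 + n*sigma_tau^2/sigma^2
    f_crit = f.ppf(1 - alpha, dfn, dfd)
    return f.sf(f_crit / lam2, dfn, dfd)  # pr(F > F_crit / lambda^2)

print(round(power_random_effects(10, 2, 1.0), 3))
```

When ratio = 0 (no between-level variance), λ² = 1 and the power reduces to α, as it should.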
Confidence Intervals for the Estimated Parameters in the Random Effects Model
In general, the mean value per level x̄_i does not have more statistical meaning than being a sample of the random factor. But sometimes, as in the case of proficiency tests, this mean value is of interest for each participating laboratory. The variance of the mean value per level is theoretically equal to V(x̄_i) = σ_τ² + σ²/n. From Eqs. (95) and (96), MSF/n (with k − 1 d.f.) estimates the variance of the mean per level. As a consequence, the 100(1 − α)% confidence interval is

[x̄_i − t_{α/2,k−1} √(MSF/n) ; x̄_i + t_{α/2,k−1} √(MSF/n)]   (99)

When calculating the variance for the overall mean, it is necessary to consider the variability provided by the factor, as the factor always acts. For example, when evaluating an analytical method, the results without the variability attributable to the factor laboratory are not conceivable. The variance of the overall mean is V(x̄) = ∑_{i=1}^k V(x̄_i)/k², which is estimated by MSF/(nk), with k − 1 d.f., so that the 100(1 − α)% confidence interval is

[x̄ − t_{α/2,k−1} √(MSF/(nk)) ; x̄ + t_{α/2,k−1} √(MSF/(nk))]   (100)
The random effects ANOVA is a model of practical interest because it allows attributing real meaning to many statements that seem evident. For example, the samples distributed to laboratories in a proficiency test must be homogeneous. Strictly speaking, on most occasions it is impossible to assure homogeneity, but it is enough that the variability attributable to the change of sample be significantly smaller than the one attributable to the procedure of analysis. This can be guaranteed by means of a random effects ANOVA.
Statistical Inference and Validation
Trueness
The trueness is a key concept; several international organizations are unifying its definition. For example, the definition “The
closeness of agreement between the average value obtained from a large series of test results and an accepted reference value” has
been adopted by the IUPAC (Inczédy et al.11, Chapter 18). The definition of the ISO7 exactly coincides with it, and it is the definition
accepted by the European Union in the Decision 2002/657/EC3 as far as the operation of the analytical methods and the
interpretation of results are concerned. The trueness is usually expressed in terms of bias, which combines all the components of
the systematic error, denoted by D in Eq. (1).
The decisions on the trueness of a method are made by hypothesis testing on the central value of a distribution; in case the random error can be assumed to have a zero mean, they are in fact tests on the mean because, according to Eq. (1), the expected mean value for a series of measurements will be μ + D and the question reduces to testing whether D is zero or not (equivalently, to testing whether x̄ is significantly equal to μ).
Which test to use depends only on the information available about the distribution of the random error: its type (normal, another parametric family, or unknown) and, in the case of normality, whether the variance is known or not. Some common cases are given below:
1. To decide whether an analytical procedure fulfills trueness using a reference sample whose value is assumed to be true. If normal data with known variance σ² are supposed, then the tests of Section “Hypothesis Test on the Mean of a Normal Distribution” will be used.
2. To decide whether an analytical procedure has bias specifically positive (or negative) by using a reference sample whose value is
assumed to be true. If normal data are assumed, the one-tail test versions of Cases 1 and 2 of Section “Hypothesis Test on the
Mean of a Normal Distribution” will be of use.
3. In other occasions, the question of trueness is considered comparatively between two methods: “To decide if the difference in
means between them is significant or not, when they are applied to the same reference sample”. It is the two-tail test. The one-tail
case is “to decide whether one has bias of specific sign against the other”. In these tests (Section “Hypothesis Test on the
Difference in Two Means”), two experimental means are compared, one coming from applying n1 times the first method to the
reference sample (aliquot parts) and the other one coming from applying n2 times the second method. Under the normality
assumption, we will have to know whether the variances of both methods are known (for applying tests in rows 7 and 8 of
Table 4) or it is necessary to estimate them from the samples. In this second case we have to decide, with the test in row 15 of
Table 4, whether they are equal or different, for applying test in rows 9 or 10 in Table 4, or the statistic in Eq. (53), respectively.
4. Sometimes it is impossible to use similar enough samples. A solution is the use of the “test on the difference of means with
paired data” (Case 3 of Section “Hypothesis Test on the Mean of a Normal Distribution”). For example, in an on-line system, say
in flow analysis, before introducing a new faster method to indirectly determine the content of an analyte in wastewater, the test
can be used to decide if the new method maintains the trueness at the same level as the previous one. Once the new method is
ready and validated with reference samples, real samples must be measured. The difficulty is that we cannot be sure about the
amount that is to be found because this may vary from day to day. In order to eliminate the effect of the sample (the factor
“sample”), paired designs are used: Both methods are applied to aliquot parts of the same sample, and two series of paired
results x1i and x2i are obtained when applying the old and the new method, respectively. Individual means make no sense here because in that case we would be introducing the variability due to the change of sample in each series. The correct procedure is to compute the differences d_i = x1i − x2i, so that the differences are caused exclusively by the change of method and their mean estimates the bias attributable to the new method. It is then enough to apply a test on the mean; thus, the normality and independence hypotheses must be evaluated on the differences, which are also used to estimate the standard deviation.
This test for paired data is frequently used to evaluate the improvement achieved in a procedure by a technical variation, as is
the case of Example 9. The effect of the change on the trueness must always be evaluated in the range of concentrations in which
the procedure will be used. An alternative to the use of this test is the analysis of the pairs of data by a linear regression; in this
case, the regression method used should consider the existence of error in both axes.
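The paired procedure of point 4 can be sketched as follows (a Python example with invented data for illustration); `ttest_rel` is equivalent to a one-sample t-test on the differences d_i:

```python
# Sketch of the paired test described in point 4; the data are invented for
# illustration only. ttest_rel performs the t-test on d_i = x1_i - x2_i.
from scipy.stats import ttest_rel

old = [10.2, 11.5, 9.8, 12.1, 10.9]   # previous method on five real samples
new = [10.0, 11.4, 9.9, 11.8, 10.7]   # new method on aliquots of the same samples
stat, pvalue = ttest_rel(old, new)
print(round(stat, 3), round(pvalue, 3))
```

A nonsignificant p-value here would mean there is no experimental evidence of bias of the new method with respect to the old one.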
The hypotheses of normality (Section “Goodness-of-Fit Tests: Normality Tests”), and the equality of variances, when applicable,
will have to be tested with the appropriate tests (Section “Hypothesis Test on the Comparison of Several Independent Variances”).
When a hypothesis test is to be posed, one can think about omitting some known data, for example, the variance. The effect is a loss of power; that is, with the same value of α and the same sample size, there is a greater probability of type II error. Said otherwise, to maintain power, larger sample sizes are needed to get the same experimental evidence; a calculation on this matter is in Case 2 of Section “Hypothesis Test on the Mean of a Normal Distribution”. The same applies to the use of one-tail tests with respect to the corresponding two-tail tests, or to the use of nonparametric tests that do not impose any type of distribution a priori.
Also, it is important to remember that the presence of outlier data tends to greatly increase the variance so that the tests become
insensitive, that is to say, larger experimental evidence is needed to reject the null hypothesis. The nonparametric alternative has, in
general, a high cost in terms of power for the same significance level (or in terms of sample size). For this reason, its use is not
advised unless it is strictly necessary. In addition, some nonparametric tests also assume hypothesis on the distribution of the values,
for example, to be symmetric or unimodal.
Precision
The other very important criterion in the validation of a method is the precision. In the ISO 5725,7 the IUPAC (Inczédy et al.11,
Sections 2 and 3), and the 2002/657/EC European Decision,3 we can read “Precision, the closeness of agreement between
independent test results obtained under stipulated conditions”.
The precision usually is expressed as imprecision. The lesser the dispersion of the random component in Eq. (1), the more
precise the procedure. It must be remembered that the precision depends solely on the distribution of the random errors and is not
related to the value of reference or the value assigned to the sample. In a first approach, it is computed as a standard deviation of the
results; nevertheless, even the ISO 5725-5 recommends the use of a robust estimation.
Two measures, limits in a certain sense, of the precision of an analytical method are the reproducibility and the repeatability.
Repeatability is defined as precision under repeatability conditions. Repeatability conditions means conditions where indepen-
dent test results are obtained with the same method on identical test items in the same laboratory by the same operator using the
same equipmentin short intervals of time. Repeatability as standard deviation is denoted as sr.
The repeatability limit, r, is the value below which lies, with a probability of (1 − α)100%, the absolute value of the difference between two results of a test obtained under repeatability conditions. The repeatability limit is given by

r = z_{α/2} √2 s_r   (101)

where z_{α/2} is the upper α/2 percentage point of the standard normal distribution.
Reproducibility is defined as precision under reproducibility conditions. Reproducibility conditions means conditions where test results are obtained with the same method on identical test items in different laboratories with different operators using different equipment. Reproducibility as standard deviation is named s_R.
The reproducibility limit, R, defined in Eq. (102), is the value below which lies, with a probability of (1 − α)100%, the absolute value of the difference between two results of a test obtained under reproducibility conditions.

R = z_{α/2} √2 s_R   (102)

When estimating s_r (or s_R) with n < 10, a correction factor7 should be applied to Eq. (6).
Notice that both R and r define, in fact, two-sided tolerance intervals for the difference of two measurements.
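At α = 0.05, the factor z_{α/2}√2 in Eqs. (101) and (102) is about 2.77, which is the familiar "2.8 s" rule of thumb. A one-line sketch in Python:

```python
# Sketch of Eqs. (101) and (102): repeatability/reproducibility limits.
# At alpha = 0.05, z_{alpha/2}*sqrt(2) is about 2.77 (the "2.8 s" rule).
from math import sqrt
from scipy.stats import norm

def limit(s, alpha=0.05):
    """Repeatability (s = s_r) or reproducibility (s = s_R) limit."""
    return norm.ppf(1 - alpha / 2) * sqrt(2) * s

print(round(limit(1.0), 3))   # the multiplying factor itself
```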
The ISO introduces the concept of intermediate precision when only some of the factors described in the reproducibility conditions are varied. A particularly interesting case is when the “internal” factors of the laboratory (analyst, instrument, day) are varied, which in the Commission Decision3 is called intralaboratory reproducibility.
One of the causes of the ambiguity when defining precision is the laboratory bias. When the method is applied only in a
laboratory, the laboratory bias is a systematic error of that laboratory. If the analytical method is evaluated in general, the laboratory
bias becomes a part of the random error: to change the laboratory contributes to the variance expected for a determination
conducted with that method in any laboratory.
The most eclectic position is the one described in the ISO 5725 that declares “The laboratory bias is considered constant when
the method is used under repeatability conditions but is considered as a random variable if series of applications of the method are
made under reproducibility conditions”.
With these premises, we can realize that to evaluate the precision of an analytical method is equivalent to estimating the variance
of the random error in the results and that the discrepancies that can appear when establishing the sources of variability must be
explicitly identified, for example, the laboratory bias.
The precision of two methods can be compared by a hypothesis test on the equality of variances, under the normality
assumption, that is, an F-test (Section “Hypothesis Test on the Variances of Two Normal Distributions”).
Another usual problem is to decide whether the observed variance can be considered significantly equal or not to an external value, which is decided by using a χ² test (Section “Hypothesis Test on the Variance of a Normal Distribution”).
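Both variance tests can be sketched as follows (Python; the helper names are ours and two-tailed p-values are assumed):

```python
# Sketch of the two variance tests just mentioned: an F-test comparing the
# precision of two methods, and a chi-square test of a sample variance
# against an external value sigma0^2. Helper names are ours.
from scipy.stats import chi2, f

def f_test_two_variances(s1_sq, n1, s2_sq, n2):
    """Two-tailed F-test comparing the variances of two methods."""
    fcalc = s1_sq / s2_sq
    pvalue = 2 * min(f.cdf(fcalc, n1 - 1, n2 - 1), f.sf(fcalc, n1 - 1, n2 - 1))
    return fcalc, pvalue

def chi2_test_variance(s_sq, n, sigma0_sq):
    """Two-tailed chi-square test of s^2 (n-1 d.f.) against sigma0^2."""
    stat = (n - 1) * s_sq / sigma0_sq     # ~ chi2 with n-1 d.f. under H0
    pvalue = 2 * min(chi2.cdf(stat, n - 1), chi2.sf(stat, n - 1))
    return stat, pvalue

print(f_test_two_variances(2.5, 10, 1.0, 10))
print(chi2_test_variance(1.8, 10, 1.0))
```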
It is common that the lack of control of a concrete aspect of an analytical procedure is the origin of a great variability. If the experimental conditions are not stable, we will have an additional variability in the determinations. The F-test makes it possible to decide whether the precision improves significantly when an optimization is carried out.
In fact, many improvements in the procedures are the consequence of acting after the identification of some causes of variability
in the results and their quantification. More details about this aspect of control and improvement of the precision are given in the
section dedicated to the ruggedness of chemical analysis.
The technique used in the random effects ANOVA is also the appropriate one to split the variance of each experimental datum into addends, and it is therefore specially suited to estimating the repeatability and the reproducibility of an analytical method when an interlaboratory test comparison has been carried out. In the following, the use of an ANOVA to estimate reproducibility and repeatability in a proficiency test is briefly explained.
There is no doubt that a good analytical procedure has to be insensitive to the laboratory where it is conducted. To decide
whether the “change of laboratory” has any effect, k laboratories apply a procedure to aliquot samples; each laboratory makes n
determinations. In the terminology of the ANOVA, we have a random factor (the laboratory) at k levels and n replicates in each level.
It has already been said that in general, it is not necessary to have the same number of replicates in all the levels.
We denote by x_ij the experimental results, where i = 1,…,k identifies the laboratory and j = 1,…,n the replicate.
Fig. 7 is a skeleton of Eqs. (93)–(96) and shows how to compute an estimate of the variance of the random variable e in Eq. (92). If the analytical procedure is well defined, the k estimates s_i² are expected to be approximately equal and to gather the variability due to the use of the analytical method by only one laboratory. In these conditions, the pooled variance s_p² is a joint (“pooled”) estimate of the same variance, that is, by definition, the repeatability of the method expressed as standard deviation (ISO 5725)

s_r = √V(e) ≅ s_p   (103)
Fig. 7 ANOVA of random effects for an interlaboratory study.
From the same data we can obtain k estimates of the bias D_i (Fig. 7, top) and then the variance of the laboratory bias, considering this bias as a random variable. Taking into account the quantities estimated by the variances described in Fig. 7, one obtains the following expression for the interlaboratory variance:

V(D) ≅ s_x̄² − s_p²/n   (104)

which, linked to Eq. (1), provides the following estimate of the reproducibility as standard deviation (ISO 5725):

s_R ≅ √(V(D) + V(e))   (105)

In the ANOVA, the null hypothesis is that V(D) = 0 (i.e., there is no effect of the factor), and the alternative is that at least one laboratory has non-null bias (there is an effect of the factor).
The conclusion of the ANOVA is obtained by deciding whether both variances, n·s_x̄² and s_p², can be considered significantly equal. To decide this, an F-test is applied. The logic is clear: If there is no laboratory effect, V(D) should be significantly zero and, thus, both variances are equal or, in other words, they estimate the same quantity.
In practice, the expression for the computation of the power of the random effects ANOVA (Eq. (98)) is useful in deciding the number of laboratories that should participate, k, and the number of replicated determinations, n, that each one must conduct. It is essential to remember that an ANOVA requires normal distribution of the residuals and equality of the variances σ1², σ2², …, σk².
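The computation outlined in Fig. 7 can be sketched for a balanced design as follows (a Python example with invented data; variable names are ours):

```python
# Sketch of the Fig. 7 computation for a balanced interlaboratory table
# (k laboratories, n replicates); the data are invented for illustration.
import numpy as np

x = np.array([[10.1, 10.3, 10.2],    # laboratory 1
              [10.6, 10.8, 10.7],    # laboratory 2
              [ 9.9, 10.0, 10.1],    # laboratory 3
              [10.4, 10.2, 10.3]])   # laboratory 4
k, n = x.shape
lab_means = x.mean(axis=1)
msf = n * lab_means.var(ddof=1)              # between-laboratory mean square
sp_sq = x.var(axis=1, ddof=1).mean()         # pooled within-laboratory variance
s_r = np.sqrt(sp_sq)                         # repeatability, Eq. (103)
var_bias = max((msf - sp_sq) / n, 0.0)       # V(D), Eq. (104), truncated at zero
s_R = np.sqrt(var_bias + sp_sq)              # reproducibility, Eq. (105)
print(round(s_r, 3), round(s_R, 3))
```

Truncating V(D) at zero is a common practical choice when MSF happens to be smaller than the pooled within-laboratory variance.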
When the number of replicates is two (n = 2), a common way of carrying out the interlaboratory analysis is to use the graph of Youden54 to show the trueness and precision of each laboratory. Actually, Youden’s graph is nothing but the graphical representation of an ANOVA, as shown in Kateman and Pijpers.55 In addition to being used for comparing the quality of the laboratories, Youden’s graph can be used to compare two methods of analysis in terms of the laboratory bias they have.
An approach for the comparison of two methods in the intralaboratory situation has been proposed by Kuttatharmmakul et al.56 Instead of the reproducibility, as included in Fig. 7 and the ISO guidelines, the (operator + instrument + time)-different intermediate precision is considered in the comparison.
In the case of precision, the effect of outlier data is really devastating; hence, a very careful analysis to detect those outliers is essential. In general, more than one test is needed (usual ones are those of Dixon, Grubbs, and Cochran), especially to accept the hypotheses of the ANOVA made for the determination of repeatability and reproducibility. In view of the difficulties, the AMC5,6 advises the use of robust methods to evaluate precision and trueness and for proficiency testing. This path is also followed in the new ISO norm on reproducibility and repeatability.
Statistical Aspects of the Experiments to Determine Precision
The analysis of data implies three steps:
1. Critical examination of the data, in order to identify outliers or other irregularities, and to verify the suitability of the model.
2. To compute for each level of concentration the preliminary values of precision and mean.
3. To establish the final values of precision and means, including the establishment of a relation between precision and the level of
concentration when the analysis indicates that such relation may exist.
The analysis includes a systematic application of statistical tests for detecting outliers, and a great variety of such tests are available
from the literature and could be used for this task.
Consistency Analysis and Incompatibility of Data
From the data collected in a specific number of levels, a decision must be taken about certain individual results or values that seem
to be “different” from those of the rest of laboratories or that can modify the estimations. Specific tests are used for the detection of
these outlier numerical results.
Case 1: Elimination of data
It is the classic procedure based on detecting and, if this is the case, eliminating the outlier data. The tests are of two types. Cochran’s
test is related to the interlevel variability of the factor and should be applied first; its objective is to detect an anomalous variance in
one or several of the levels of the factor. Cochran’s test has already been described in Section “Hypothesis Test on the Comparison of
Several Independent Variances”.
Later Grubbs’ test will be applied. It is basically a test on the intralevel variability to discover possible outlier individual data.
It can be used (if ni > 2) for those levels in which Cochran’s test has led to the suspicion that the interlevel variation is attributable to
an individual result. It is applied in two stages:
1. Detection of a unique outlying observation (single Grubbs’ test)
In a data set xi (i = 1, 2, …, n) sorted in increasing order, to test whether the greatest observation, xn, is incompatible with the rest, the following statistic is computed:
Table 22 Critical values for Grubbs’ test.
n One largest or one smallest Two largest or two smallest
α = 0.05 α = 0.01 α = 0.05 α = 0.01
4 1.481 1.496 0.0002 0.0000
5 1.715 1.764 0.0090 0.0018
6 1.887 1.973 0.0349 0.0116
7 2.020 2.139 0.0708 0.0308
8 2.126 2.274 0.1101 0.0563
9 2.215 2.387 0.1492 0.0851
10 2.290 2.482 0.1864 0.1150
11 2.355 2.564 0.2213 0.1448
12 2.412 2.636 0.2537 0.1738
13 2.462 2.699 0.2836 0.2016
14 2.507 2.755 0.3112 0.2280
15 2.549 2.806 0.3367 0.2530
16 2.585 2.852 0.3603 0.2767
17 2.620 2.894 0.3822 0.2990
18 2.651 2.932 0.4025 0.3200
19 2.681 2.968 0.4214 0.3398
20 2.709 3.001 0.4391 0.3585
Adapted with permission from ISO 5725-2. Accuracy, Trueness and Precision of Measurement Methods and Results; Genève, 1994; p. 22.
Gn,calc = (xn − x̄)/s (106)
On the contrary, to verify whether the smallest observation, x1, is significantly different from the rest, the statistic G1 is computed as
G1,calc = (x̄ − x1)/s (107)
In Eqs. (106), (107), x̄ and s are, respectively, the mean and standard deviation of the xi.
To decide whether the greatest or smallest value is significantly different from the rest at the 100α% significance level, the values obtained in Eqs. (106), (107) are compared to the corresponding critical values in Table 22.
The decision includes two “anomaly levels”:
(a) If Gi,calc < G0.05, with i = 1 or i = n, accept that the corresponding x1 or xn is similar to the rest.
(b) If G0.05 < Gi,calc < G0.01, with i = 1 or i = n, the corresponding x1 or xn is considered a straggler.
(c) If G0.01 < Gi,calc, with i = 1 or i = n, the corresponding x1 or xn is incompatible with the rest of the data of the same level (statistical outlier).
2. Detection of two outlying observations (double Grubbs’ test)
Sometimes it is necessary to verify that two extreme data (very large or very small) incompatible with the others do not exist.
In the case of the two greatest observations, xn and xn−1, the statistic G is computed as
G = s²n−1,n / s²0 (108)
where s²0 = Σi=1,…,n (xi − x̄)² and s²n−1,n = Σi=1,…,n−2 (xi − (1/(n − 2)) Σj=1,…,n−2 xj)².
Similarly, it is possible to jointly decide on the two smallest observations, x1 and x2, by means of the following statistic:
G = s²1,2 / s²0 (109)
where s²1,2 = Σi=3,…,n (xi − (1/(n − 2)) Σj=3,…,n xj)².
The decision rule is analogous to that of the single extreme value, but with the corresponding critical values in Table 22.
In general, norms such as ISO 5725,7 propose inspecting the origin of the anomalous results and, if no assignable cause exists, eliminating the incompatible ones and keeping the stragglers, indicating their condition with an asterisk.
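As an illustration, the single Grubbs' statistics of Eqs. (106) and (107) can be sketched in Python (the function name is ours; the critical values used are those of Table 22 for n = 20):

```python
import statistics

def grubbs_single(data):
    """Single Grubbs' statistics (Eqs. 106-107) for the largest
    and smallest observations of a data set."""
    xbar = statistics.mean(data)
    s = statistics.stdev(data)          # sample standard deviation
    g_max = (max(data) - xbar) / s      # Eq. (106)
    g_min = (xbar - min(data)) / s      # Eq. (107)
    return g_max, g_min

# Data of Table 23 treated as a single series of 20 results (Example 17)
data = [13.50, 13.40, 13.47, 13.49, 13.50, 13.51, 13.35, 13.35,
        13.70, 13.71, 13.76, 13.80, 13.04, 13.03, 15.93, 13.04,
        13.48, 13.47, 13.92, 13.46]

g_max, g_min = grubbs_single(data)
# Critical values for n = 20 (Table 22): 2.709 (alpha = 0.05), 3.001 (alpha = 0.01)
is_outlier = g_max > 3.001   # 15.93 exceeds even the 1% critical value
```

Running this on the data of Table 23 reproduces the statistics of Example 17 below (G ≈ 3.89 for 15.93 and G ≈ 0.94 for 13.03).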
Table 23 Data of Example 17.
Series 1 Series 2 Series 3 Series 4 Series 5
13.50 13.50 13.70 13.04 13.48
13.40 13.51 13.71 13.03 13.47
13.47 13.35 13.76 15.93 13.92
13.49 13.35 13.80 13.04 13.46
Table 24 Robust and nonrobust estimates of the centrality and dispersion parameters (data of Table 23).
With all data (n = 20) Without 15.93 (n = 19)
Nonrobust procedures
Mean, x̄ 13.60 13.47
Standard deviation, s 0.60 0.25
Robust procedures
Median 13.49 13.48
H15, centrality parameter 13.50 13.48
MAD/0.6745 0.26 0.21
H15, dispersion 0.27 0.24
Example 17: For didactic purposes, to apply Grubbs' test and to verify the effect of outliers, the data of Table 23 have been considered as a unique series of 20 results.
The greatest value is 15.93 and the lowest is 13.03, with s = 0.60 and x̄ = 13.60. Eq. (106) gives G20,calc = 3.889 and Eq. (107) gives G1,calc = 0.942. By consulting the critical values in Table 22, G0.05,20 = 2.709 and G0.01,20 = 3.001; therefore, according to the decision rule in Case 1 (single Grubbs' test), the value 15.93 should be considered different from the rest.
Applying the test again, with 19 data, the greatest value now is 13.92 and the lowest is still 13.03, with G19,calc = 1.804 and G1,calc = 1.785. As the tabulated values are G0.05,19 = 2.681 and G0.01,19 = 2.968, there is no evidence to say that either of the extreme values is different from the rest. Table 24 contains the mean and standard deviation, with and without the value 15.93. A large effect is observed on the standard deviation, which is reduced by more than 50% when removing the point.
Grubbs’ test can also be applied to the mean values per level. In practice, Grubbs’ test is also used to restore the equality of
variances in the ANOVA when the homogeneity of variances is rejected (section “Hypothesis Test on the Comparison of Several
Independent Variances”). The work by Ortiz et al.47 contains a complete analysis with sequential application of Cochran’s, Bartlett’s,
and Grubbs’ tests.
Case 2: Robust methods
The procedure described in the previous section is focused on the detection of anomalous data within a set of results. Nevertheless, the elimination of these data is not advisable when the variability of the analytical procedure is to be evaluated, because the result is sensitive to which values are eliminated (Eqs. (106)–(109) can lead to elimination of data in successive stages because of the reduction of the variance) and because the real attainable variance is underestimated.
As previously indicated, the values of repeatability (sr) and reproducibility (sR) are determined by means of an ANOVA whose
validity depends on whether the hypotheses of normality and homogeneity of variances are fulfilled. The robust methodology
proposed in this section avoids these limitations. Its technical details can be found in Hampel et al.57 and Huber.58
An alternative to the procedures based on the elimination of outlier data, as set out in the ISO 5725-5 norm, consists of using the H15 estimator proposed by Huber (c = 1.5 and "Proposal 2" scale, Huber58), recommended by the Analytical Methods Committee5,6 and accepted in the Harmonized Protocol.59 It is an estimator whose influence function is monotone and limits the influence of the anomalous data by "moving them" toward the position of the majority, while maintaining for them the maximum influence. This is carried out by transforming the original data by means of the function
Ψm,s,c(x) = max[m − cs, min(m + cs, x)] (110)
where m and s are the centrality and dispersion parameters, which must be estimated iteratively. The function in Eq. (110) is represented in Fig. 8.
The estimate is a generalization of the maximum-likelihood estimate. It is asymptotically optimal for high-quality data, that is, data with little contamination and not very different from data following a Student's t distribution with three d.f. Remember that Hampel et al.57 have shown that Student's t distributions with between 3 and 9 d.f. reproduce high-quality experimental data, and that for t3 the efficiency of the mean and of the standard deviation is 50% and 0%, respectively. Therefore, in practice, robust estimates are needed even for high-quality empirical data (such as those obtained with present analytical methods).
Fig. 8 Function Ψm,s,c(x).
Table 25 Robust and nonrobust estimates of the repeatability and reproducibility with data of Table 23.
ANOVA With all data (n = 20) Without 15.93 (n = 19) Without series 4 and 13.92 (n = 15)
Nonrobust procedure
Fcalc (P-value) 0.22 (0.92) 17.02 (<5 × 10⁻⁵) 24.91 (<5 × 10⁻⁵)
SSF (d.o.f.) 0.094 (4) 0.229 (4) 0.083 (3)
SSE (d.o.f.) 0.431 (15) 0.013 (14) 0.003 (11)
sR 0.657 0.260 0.153
sr 0.657 0.116 0.058
P-value, Cochran's test 8.9 × 10⁻⁹ 0.001 0.093
P-value, Bartlett's test 3.9 × 10⁻⁸ 0.001 0.100
P-value, Levene's test 0.53 0.61 0.005
Robust procedures
Robust sR 0.281 0.172
Robust sr 0.072 0.072
The H15 estimator provides enough protection against a high concentration of data that are abnormally large but near to the correct data. Nevertheless, clearly anomalous data are not rejected by the H15 estimator; they maintain the maximum, though bounded, influence. This produces an avoidable loss of efficiency of the H15 estimator of between 5% and 15% when the proportion of anomalous data is also between 5% and 15% (rather usual percentages in routine analyses). In order to avoid this limited weakness, robust estimators such as the median and the median of absolute deviations (MAD) (Eq. (111)) are necessary, at least in the first step of the calculation, to reliably identify most of the "suitable" data.
MAD = median{|xi − median{xi}|} (111)
The robust procedure obtained when adapting the H15 estimator to the estimation of repeatability and reproducibility as posed in the ISO norm consists of two stages and has been applied in an identical way to the proposal in Sanz et al.60 As in the parametric procedure, it uses the mean and standard deviation of the data. Therefore, once the robust procedure is applied, the data necessary to estimate the reproducibility or the intermediate precision are at hand.
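A minimal sketch of such an iterative H15-type estimate in Python, following the general description above (starting values are the median and MAD/0.6745, then repeated winsorization at m ± cs with c = 1.5; the consistency factor 1.134 for the clipped standard deviation is the one used in ISO 13528 Algorithm A, an assumption on our part since the exact scale update of the cited procedure is not reproduced here):

```python
import statistics

def h15(data, c=1.5, tol=1e-6, max_iter=200):
    """Iterative Huber-type estimate of centrality (m) and dispersion (s).
    Starts from the median and MAD/0.6745, then repeatedly winsorizes
    the data at m - c*s and m + c*s as in Eq. (110)."""
    m = statistics.median(data)
    s = statistics.median([abs(x - m) for x in data]) / 0.6745
    for _ in range(max_iter):
        clipped = [max(m - c * s, min(m + c * s, x)) for x in data]
        m_new = statistics.mean(clipped)
        s_new = 1.134 * statistics.stdev(clipped)  # consistency factor for c = 1.5
        if abs(m_new - m) < tol and abs(s_new - s) < tol:
            return m_new, s_new
        m, s = m_new, s_new
    return m, s

# Table 23 treated as a single series of 20 results
data = [13.50, 13.40, 13.47, 13.49, 13.50, 13.51, 13.35, 13.35,
        13.70, 13.71, 13.76, 13.80, 13.04, 13.03, 15.93, 13.04,
        13.48, 13.47, 13.92, 13.46]
m, s = h15(data)   # centrality stays near the median; 15.93 barely inflates s
```

With the data of Table 23, the centrality estimate stays close to 13.5 and the dispersion estimate close to the robust values of Table 24, without eliminating any datum.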
In order to verify the utility of these robust procedures, with the same data of Table 23 considered as a unique series of 20 values, the median and the centrality parameter of the H15 estimator have been written down in Table 24. These are very similar to the nonrobust estimates, both for 20 and for 19 values. Nevertheless, the robust parameters of dispersion, MAD/0.6745 and H15, hardly differ when considering 20 or 19 data and are similar to the standard deviation obtained after applying Grubbs' method and repeating the calculations without the outlier. For this reason, it is a good strategy to systematically apply robust procedures together with the classic ones. A difference in the results is an indication of the presence of outlier data, in which case the robust estimations will have to be used.
The effect, and therefore the advantage, of the robust procedures is much more remarkable when a random effects ANOVA is
evaluated, for example, to estimate the reproducibility and repeatability of a method by means of an interlaboratory test as the one
described in Fig. 7. To show this, we will use the data of Table 23, this time considering its structure of levels of the factor (k = 5) and replicates (n = 4).
The values of reproducibility and repeatability should not be accepted if the hypothesis of homogeneity of variances in the ANOVA is not fulfilled. In this case, it is necessary to verify whether some of the levels have outlier data. The first column of Table 25 shows that the ANOVA with all the data is not acceptable because the variances cannot be considered equal (rejection in the tests on variance homogeneity). In addition, it is observed that the anomaly in the data causes the estimates sR and sr to be equal and very different from the robust estimates.
Once the value 15.93 of series 4 is removed, the ANOVA (column 2 of Table 25) points to a significant effect, but lack of variance homogeneity is still observed. Nevertheless, the new estimates of sR and sr are more similar to those obtained with the robust procedure.
The lack of equality of variances forces one to eliminate series 4, which has a very different variance (smaller than the others), and later the value 13.92 of series 5. The final result of this sequential process is in the third column of Table 25: the effect is significant in the ANOVA, the homogeneity of variances can be accepted, and the estimates of the reproducibility and the repeatability are 0.153 and 0.058, similar to those obtained with the robust procedure without series 4. The values sR and sr can be too small due to the elimination of data, with the risk of underestimations that are not realistic and thus impossible to fulfill by the laboratories. For this reason, it is advised5–7 to avoid reducing the sample and to keep the initial robust estimates.
As the presence of outliers in experimental work is unavoidable, robust statistical methodology has become established as an essential tool in chemical analysis. Further information can be found, for example, in the chapter of this book dedicated to robust statistical techniques.
Accuracy
According to the IUPAC (Inczédy et al.11, Sections 2–3), the ISO,7 and the Directive of the European Union (Definition 1.1 of the Commission Decision3), accuracy is a concept defined as "Closeness of agreement between a test result and the accepted reference value". It is estimated by determining trueness and precision. Evidently, this definition collects together the systematic and random errors, because for an individual determination xi − μ = (xi − x̄) + (x̄ − μ) = e + Δ.
In practice, it is unreasonable to think that an analytical procedure has no bias; experimentally, we can decide about the hypothesis of null bias. If the bias is significant, it is possible to correct the measurement by subtracting the value Δ. However, this implies an increase in the variance of the final result, because Δ is estimated from experimental replicates and therefore has uncertainty. For this reason, when the uncertainty of a measurement is expressed, it is usual to include a term that takes the bias into account, in a form similar to Eq. (105). For a detailed treatment of this question, consult the EURACHEM/CITAC guide.1
Ruggedness
The ruggedness of a method is defined as its capacity to maintain trueness and precision over time. The same applies to the robustness of a reference material or any other reagent.
Ruggedness refers to the susceptibility of an analytical method to changes in experimental conditions, which can be expressed as a list of sample materials, analytes, storage conditions, environmental conditions, and/or sample preparation conditions under which the method can be applied as presented or with specified minor modifications.3
The study on ruggedness can be approached using two different statistical methodologies, one of which consists of using the
well-known control charts (they are confidence intervals on the mean, the variance, or the range of the measured parameter) and
continuously writing down the results obtained on known samples throughout the time. This type of “a posteriori” control is
essential to maintain the quality (precision, trueness, capability of detection, etc.) of a measurement method and to establish alarm
mechanisms when an observed drift can alter the quality of the procedure affecting the value of the analytical results. There is also a
chapter in this book that deals specifically with control charts.
The other approach to the problem of ruggedness involves evaluating “a priori” the variability expected in the analytical
procedure and identifying the sources of that variability.
Before routinely using a procedure, the effect of small changes in the reagents, in the conditions of work, or in the specifications
of its protocol must have been verified. It can happen that small changes in the volume of extracting reagent do not lead to great
variations in the response, whereas a small variation in, say, pH does. One way of knowing and controlling this quality criterion is
by making small changes in the potentially influential factors and observing the effect on the response.
The influence of each factor should not be analyzed separately, since this is not methodologically adequate and, in addition, it is not realistic, because in practice unforeseeable combinations of all the factors will occur that can affect the results. Instead, the methodology of the design of experiments should be used; details about this are in the corresponding chapters of this collection. As the number of factors that potentially affect the response is large, highly fractionated two-level factorial designs have to be used (to reduce, e.g., the 2⁷ = 128 different experiments needed for a complete factorial design with seven factors).
Plackett–Burman designs and D-optimal designs have proven to be useful tools in ruggedness analysis.61–67 For more alternatives,
consult the chapter dedicated to these strategies.
Example 18: An analysis of ruggedness of a procedure of extraction of three sulfonamides is carried out. The seven considered
factors are buffer solution, pH, methanol as extracting agent, extraction cycles, petroleum benzin, volume of elution, and
evaporation mode. A Plackett-Burman design has been proposed to estimate the effects of the factors by fitting a linear model for
each sulfonamide:
y = b0 + b1x1 + b2x2 + … + b7x7 (112)
where xi denotes the i-th factor (Table 26) and y represents the response to be modeled, which is the chromatographic peak area
for each of the three sulfonamides. The details about the experimental domain can be seen in Table 26, where the nominal level is
codified as “−” and the extreme level as “+”.
Table 26 Experimental factors with nominal (−) and extreme (+) levels selected for a Plackett-Burman design
for seven factors (ruggedness analysis of an extraction procedure of sulfonamides).
Factor (units)
Level
− +
x1: buffer solution (mL) 1.0 1.5
x2: pH 4.5 4.8
x3: extracting agent (mL) 20 25
x4: extraction cycles One Two
x5: petroleum benzin (mL) 20 25
x6: volume of elution (mL) 7.0 8.0
x7: evaporation mode To dryness Evaporated until 1 mL
Table 27 Plackett-Burman design for the seven factors of Table 26.
Run
Factors Responses
x1 x2 x3 x4 x5 x6 x7 SDZ SMT SMP
1 1 1 1 1 1 1 1 10.50 10.50 10.50
10.30 10.30 10.30
2 1 1 −1 1 −1 −1 −1 5.87 6.31 8.91
7.56 7.74 8.72
3 1 −1 1 −1 1 −1 −1 7.55 8.88 7.06
8.08 11.01 8.23
4 1 −1 −1 −1 −1 1 1 3.82 6.58 6.11
5.61 7.20 7.53
5 −1 1 1 −1 −1 1 −1 6.74 10.00 9.61
8.19 10.63 10.38
6 −1 1 −1 −1 1 −1 1 6.89 5.35 4.29
7.63 6.99 6.40
7 −1 −1 1 1 −1 −1 1 8.21 7.86 10.24
8.70 8.56 9.94
8 −1 −1 −1 1 1 1 −1 8.56 6.95 7.02
5.85 8.11 7.96
The values of the three responses are the areas under the chromatographic peak (in a.u.) of sulfadiazine (SDZ), sulfamethazine (SMT), and sulfamethoxypyridazine (SMP). All
experiments are replicated twice.
Table 28 Estimated coefficients of the linear model (Eq. (112)) fit for each sulfonamide by means of a Plackett-Burman design.
Sulfadiazine Sulfamethazine Sulfamethoxypyridazine
Coefficient P-value Coefficient P-value Coefficient P-value
b0 7.504 <0.0001a 8.309 <0.0001a 8.325 <0.0001a
b1 −0.093 0.726 0.252 0.274 0.095 0.635
b2 0.456 0.111 0.165 0.465 0.314 0.142
b3 1.030 0.004a 1.409 0.000a 1.208 0.000a
b4 0.690 0.020a −0.021 0.924 0.874 0.002a
b5 0.666 0.039a 0.203 0.374 −0.605 0.014a
b6 −0.057 0.820 0.475 0.058 0.351 0.105
b7 0.204 0.447 −0.391 0.106 −0.161 0.426
aSignificant factor at a 0.05 significance level.
Table 27 shows the experimental runs and the two values (replicates) of the three responses.
Finally, Table 28 contains the estimated coefficients of model in Eq. (112) and their P-values. The conclusion is that only the
extracting agent (x3), the number of cycles in the extraction (x4), and the volume of petroleum benzin (x5) are significant at 5% level.
Hence, special care should be taken with these factors because small changes in any of them can cause large variation in the
response.
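Because a Plackett-Burman design is orthogonal, each coefficient of Eq. (112) is simply the average of the coded design column multiplied by the run responses. A sketch in Python with the sulfadiazine (SDZ) data of Table 27, reproducing the corresponding coefficients of Table 28:

```python
# Coded design matrix of Table 27 (8 runs x 7 factors, -1/+1 levels)
X = [
    [ 1,  1,  1,  1,  1,  1,  1],
    [ 1,  1, -1,  1, -1, -1, -1],
    [ 1, -1,  1, -1,  1, -1, -1],
    [ 1, -1, -1, -1, -1,  1,  1],
    [-1,  1,  1, -1, -1,  1, -1],
    [-1,  1, -1, -1,  1, -1,  1],
    [-1, -1,  1,  1, -1, -1,  1],
    [-1, -1, -1,  1,  1,  1, -1],
]
# SDZ peak areas, two replicates per run (Table 27)
y = [[10.50, 10.30], [5.87, 7.56], [7.55, 8.08], [3.82, 5.61],
     [6.74, 8.19], [6.89, 7.63], [8.21, 8.70], [8.56, 5.85]]

run_means = [sum(r) / len(r) for r in y]
b0 = sum(run_means) / 8
# For an orthogonal +/-1 design, b_j = sum_i(x_ij * ybar_i) / 8
b = [sum(X[i][j] * run_means[i] for i in range(8)) / 8 for j in range(7)]
# b[2] (extracting agent, x3) and b[3] (extraction cycles, x4) match Table 28
```

The computed b0, b[2], and b[3] coincide with the values 7.504, 1.030, and 0.690 reported in Table 28 for sulfadiazine.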
Appendix
Some Basic Elements of Statistics
A distribution function (cumulative distribution function (cdf)) in ℝ is any function F such that
1. F is a map from ℝ to the interval [0, 1]
2. lim x→−∞ F(x) = 0
3. lim x→+∞ F(x) = 1
4. F is a monotonically increasing function, that is, a ≤ b implies F(a) ≤ F(b).
5. F is continuous on the left or on the right. For example, F is continuous on the left if lim x→a, x<a F(x) = F(a) for each real number a.
Any probability defined in R corresponds to a distribution function and vice versa.
If p is the probability defined for intervals of real numbers, F(x) is defined as the probability accumulated up to x, that is, F(x) = p(−∞, x). It is easy to show that F(x) verifies the above definition of distribution function.
If F is a cdf continuous on the left, its associated probability is defined by
pr[a, b] = pr{a ≤ x ≤ b} = F(b) − F(a)
pr(a, b] = pr{a < x ≤ b} = F(b) − lim x→a, x>a F(x)
pr[a, b) = pr{a ≤ x < b} = F(b) − F(a)
pr(a, b) = pr{a < x < b} = F(b) − lim x→a, x>a F(x)
If the distribution function is continuous, then the above limits coincide with the value of the function in the corresponding point.
The probability density function f(x), abbreviated pdf, if it exists, is the derivative of the cdf.
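As a quick numerical check of this relation, an interval probability can be computed either from the cdf or by integrating the pdf; a sketch in Python with an exponential distribution (our choice of example), whose cdf is F(x) = 1 − e^(−λx):

```python
import math

lam = 2.0
F = lambda x: 1.0 - math.exp(-lam * x)   # cdf of an exponential distribution
f = lambda x: lam * math.exp(-lam * x)   # its pdf (the derivative of the cdf)

a, b = 0.5, 1.5
p_cdf = F(b) - F(a)                      # pr{a < X <= b} directly from the cdf

# Same probability by numerical (trapezoidal) integration of the pdf
n = 10000
h = (b - a) / n
p_pdf = h * (0.5 * f(a) + sum(f(a + i * h) for i in range(1, n)) + 0.5 * f(b))
```

Both computations agree to within the accuracy of the numerical integration.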
Each random variable X is characterized by a distribution function FX(x).
When several random variables are handled, it is necessary to define the joint distribution function.
FX1,X2,…,Xk(a1, a2, …, ak) = pr{X1 ≤ a1 and X2 ≤ a2 … and Xk ≤ ak} (A1)
If the previous joint probability is equal to the product of the individual probabilities, the random variables are said to be independent:
FX1,X2,…,Xk(a1, a2, …, ak) = pr{X1 ≤ a1} · pr{X2 ≤ a2} · … · pr{Xk ≤ ak} (A2)
Eqs. (A3), (A4) define the mean and variance of a continuous random variable whose pdf is f:
E(X) = ∫ x f(x) dx (A3)
V(X) = ∫ (x − E(X))² f(x) dx (A4)
Some basic properties are E(aX + bY) = aE(X) + bE(Y) for any X and Y, and V(aX) = a²V(X) for any random variable X.
Given a random variable X, the standardized variable is obtained by subtracting the mean and dividing by the standard deviation, Y = (X − E(X))/√V(X). The standardized variable has E(Y) = 0, V(Y) = 1.
For any two random variables, the variance of the sum is
V(X + Y) = V(X) + V(Y) + 2Cov(X, Y) (A5)
and the covariance is defined as
Cov(X, Y) = ∬ (x − E(X))(y − E(Y)) fX,Y(x, y) dx dy (A6)
In the definition of the covariance (Eq. (A6)), fX,Y(x, y) is the joint pdf of the random variables. When they are independent, the joint pdf is equal to the product fX(x)fY(y) and the covariance is zero.
In general, E(XY) ≠ E(X)E(Y), except if the variables are independent, in which case the equality holds.
In the applications in Analytical Chemistry, it is very frequent to use formulas to obtain the final measurement from other intermediate ones that have experimental variability. A strategy for the calculation of the uncertainty (variance) in the final result under two basic hypotheses has been developed. The strategy is to make a linear approximation of the formula and then to assimilate the quadratic terms to the variance of the random variable at hand (see, e.g., the "Guide to the Expression of Uncertainty in Measurement"2). This procedure, called in many texts the method of transmission of errors, can lead to unacceptable results. Hence, an improvement based on Monte Carlo simulation has been suggested for the calculation of the uncertainty (see Supplement 1 to the aforementioned guide).
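A sketch (with hypothetical numbers of our own) contrasting linear transmission of errors with Monte Carlo propagation for a ratio y = a/b:

```python
import random, statistics

random.seed(1)

# Hypothetical intermediate measurements: a ~ N(10, 0.2), b ~ N(2, 0.05)
mu_a, s_a = 10.0, 0.2
mu_b, s_b = 2.0, 0.05

# Linear transmission of errors for y = a/b
y0 = mu_a / mu_b
u_linear = y0 * ((s_a / mu_a) ** 2 + (s_b / mu_b) ** 2) ** 0.5

# Monte Carlo propagation: simulate the formula, take the standard deviation
sims = [random.gauss(mu_a, s_a) / random.gauss(mu_b, s_b) for _ in range(100000)]
u_mc = statistics.stdev(sims)
# For this mildly nonlinear formula both answers nearly coincide (about 0.16);
# for strongly nonlinear formulas the Monte Carlo result is the reliable one.
```

The agreement here illustrates why the linear method often works, while the Monte Carlo route remains valid when the linear approximation breaks down.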
A useful representation of the data is the so-called box and whisker plot (or simply box plot). It consists of a box built with the
first, Q1, and third, Q3, quartiles, that is, the 0.25 and 0.75 percentiles, so that the box contains half of the central data. The line in
between is the median (Q2 or 0.5 percentile). Then, the whiskers extend on both sides of the box up to the maximum and minimum values, provided they are not further than 1.5 times the interquartile range Q3 − Q1.
Fig. A1 Box and whisker plots computed with A: data of method A in Fig. 2, and B: data of method A with an outlier.
Fig. A1 shows a box plot of the 100 values of method A of Fig. 2A, the first on the left. Two values appear as squares, "disconnected" at the bottom, meaning that these two values lie below Q1 by more than 1.5 times the interquartile range.
The advantage of using box plots is that the quartiles are practically insensitive to outliers. For example, suppose that the maximum value 7.86 is changed to 8.86; this change does not affect the median or the quartiles, and the box plot remains similar but with a datum outside the upper whisker, as can be seen in the second box plot, on the right in Fig. A1.
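The quartiles and the 1.5 × IQR whisker rule are easy to apply directly; a sketch in Python with a made-up data set (not the data of Fig. 2), using the median-of-halves convention for the quartiles (conventions vary slightly between software packages):

```python
import statistics

def box_plot_summary(data):
    """Quartiles and Tukey fences: points beyond Q1 - 1.5*IQR or
    Q3 + 1.5*IQR are drawn individually, outside the whiskers."""
    x = sorted(data)
    n = len(x)
    lower, upper = x[:n // 2], x[(n + 1) // 2:]   # halves around the median
    q1 = statistics.median(lower)
    q2 = statistics.median(x)
    q3 = statistics.median(upper)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outside = [v for v in x if v < low_fence or v > high_fence]
    return q1, q2, q3, outside

# Hypothetical data set containing one abnormally large value
data = [5.2, 5.6, 5.9, 6.1, 6.3, 6.4, 6.6, 6.8, 7.0, 9.9]
q1, q2, q3, outside = box_plot_summary(data)   # 9.9 falls outside the upper whisker
```

As in Fig. A1, the extreme value is flagged individually while the box itself is unaffected by it.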
The Normal Distribution
A normal distribution with mean μ and standard deviation σ, N(μ,σ), has as pdf the following function defined for all real numbers:
f(x) = (1/(σ√(2π))) e^(−(1/2)((x − μ)/σ)²) (A7)
The normal distribution is a continuous random variable with E(N(μ,σ)) = μ and V(N(μ,σ)) = σ², and these two parameters completely define the distribution.
Particularly interesting is the N(0,1), usually called Z, because any other normal distribution N(μ,σ) is transformed into a Z when standardized, that is, Z = (N(μ,σ) − μ)/σ.
The distribution function of a normal random variable does not have an analytical expression; hence it is necessary to use tables or somewhat complex formulas to calculate the probabilities. As any normal distribution can be transformed into a N(0,1), it is customary to use only the table of this distribution. Table A1 contains some of its values that, in any case, cover the cases used in this article. For example, if z = 1.83, from the reading in row 1.8 and column 0.03, p = pr{N(0,1) > 1.83} = 0.0336.
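The tail probabilities of Table A1 can also be computed from the complementary error function, since pr{N(0,1) > z} = erfc(z/√2)/2; a one-line check in Python:

```python
import math

def z_tail(z):
    """Upper-tail probability pr{N(0,1) > z} via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

p = z_tail(1.83)   # Table A1, row 1.8, column 0.03: 0.0336
```

This reproduces the tabulated entry for z = 1.83 and, likewise, the familiar 0.025 for z = 1.96.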
The sum of normal and independent random variables, Σi=1,…,n N(μi, σi), also follows a normal distribution, N(Σi μi, √(Σi σi²)).
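A quick Monte Carlo check of this additivity, with illustrative parameters of our own choosing:

```python
import random, statistics

random.seed(7)

# X ~ N(1, 0.3) and Y ~ N(2, 0.4), independent
sums = [random.gauss(1.0, 0.3) + random.gauss(2.0, 0.4) for _ in range(200000)]

m = statistics.mean(sums)    # should approach 1 + 2 = 3
s = statistics.stdev(sums)   # should approach sqrt(0.3**2 + 0.4**2) = 0.5
```

The simulated mean and standard deviation approach 3 and 0.5, as the property predicts.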
Student’s t Distribution
If X is a random variable N(μ, σ) and X1, X2, …, Xn are n independent random variables with the same distribution as X, then the random variable (X̄ − μ)/(σ/√n) is a N(0,1), where X̄ denotes the random variable Σi=1,…,n Xi/n.
However, with the sample variance S² instead, the statistic t = (X̄ − μ)/(S/√n) follows a t distribution with ν = n − 1 d.f. The mean and variance of a Student's t distribution are, respectively, E(t) = 0 and V(t) = ν/(ν − 2), ν > 2. The general shape of its pdf is similar to that of the standard normal distribution; both are symmetrical around zero, unimodal, and defined on (−∞, ∞). However, the t distribution has heavier tails than the normal, that is, it exhibits greater variability. As the number of d.f. tends to infinity, the
Table A1 Selected probabilities of the Z = N(0,1) distribution.
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
1.5 0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.0559
1.6 0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455
1.7 0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367
1.8 0.0359 0.0351 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.0294
1.9 0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.0233
2.0 0.02275 0.02222 0.02169 0.02118 0.02068 0.02018 0.01970 0.01923 0.01876 0.01831
Values of p such that p = pr{N(0,1) > z}. First decimal of z in rows, second decimal in columns.
Table A2 Selected points of the t distribution with ν degrees of freedom.
ν α
 0.25 0.10 0.05 0.025 0.01 0.005 0.0025 0.001 0.0005
1 1.000 3.078 6.314 12.706 31.821 63.657 127.321 318.309 636.619
2 0.816 1.886 2.920 4.303 6.965 9.925 14.089 22.327 31.598
3 0.765 1.638 2.353 3.182 4.541 5.841 7.453 10.214 12.924
4 0.741 1.533 2.132 2.776 3.747 4.604 5.598 7.173 8.610
5 0.727 1.476 2.015 2.571 3.365 4.032 4.773 5.893 6.869
6 0.718 1.440 1.943 2.447 3.143 3.707 4.317 5.208 5.959
7 0.711 1.415 1.895 2.365 2.998 3.499 4.029 4.785 5.408
8 0.706 1.397 1.860 2.306 2.896 3.355 3.833 4.501 5.041
9 0.703 1.383 1.833 2.262 2.821 3.250 3.690 4.297 4.781
10 0.700 1.372 1.812 2.228 2.764 3.169 3.581 4.144 4.587
20 0.687 1.325 1.725 2.086 2.528 2.845 3.153 3.552 3.850
100 0.677 1.290 1.660 1.984 2.364 2.626 2.871 3.174 3.390
1000 0.675 1.282 1.645 1.962 2.330 2.581 2.813 3.098 3.300
∞ 0.675 1.282 1.645 1.960 2.326 2.576 2.807 3.090 3.290
Values of t such that α = pr{tν > t}.
Table A3 Selected α percentage points of the χ² distribution with ν degrees of freedom.
ν α
 0.990 0.975 0.950 0.900 0.100 0.050 0.025 0.010
1 0.00016 0.00098 0.0039 0.0158 2.71 3.84 5.02 6.63
2 0.0201 0.0506 0.1026 0.2107 4.61 5.99 7.38 9.21
3 0.115 0.216 0.352 0.584 6.25 7.81 9.35 11.34
4 0.297 0.484 0.711 1.064 7.78 9.49 11.14 13.28
5 0.554 0.831 1.15 1.61 9.24 11.07 12.83 15.09
6 0.872 1.24 1.64 2.20 10.64 12.59 14.45 16.81
7 1.24 1.69 2.17 2.83 12.02 14.07 16.01 18.48
8 1.65 2.18 2.73 3.49 13.36 15.51 17.53 20.09
9 2.09 2.70 3.33 4.17 14.68 16.92 19.02 21.67
10 2.56 3.25 3.94 4.87 15.99 18.31 20.48 23.21
100 70.06 74.22 77.93 82.36 118.50 124.34 129.56 135.81
Values of x such that α = pr{χ²ν > x}.
limiting distribution is the standard normal one. The family of t distributions depends on only one parameter, the degrees of freedom.
Table A2 contains some values of the t distribution. For example, if ν = 5 and α = 0.025, the value t = 2.571 in the table is the one such that 0.025 = pr{t5 > 2.571}. Compare with the value 1.96 in Table A1 that would correspond, under the same conditions, to a N(0,1), that is, 0.025 = pr{N(0,1) > 1.96}.
Because of the symmetry, 0.025 = pr{t5 < −2.571} also holds and, consequently, 0.95 = pr{−2.571 < t5 < 2.571}.
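The heavier tails can be checked by simulation, building a t5 variable as N(0,1)/√(χ²5/5) (a sketch; with a large sample, the simulated tail frequency approaches the tabulated 0.025, while the normal tail beyond the same point is much smaller):

```python
import random

random.seed(3)

def t_draw(nu):
    """One draw of a Student's t variable with nu d.f.:
    Z / sqrt(V / nu), where V is a sum of nu squared N(0,1) variables."""
    z = random.gauss(0.0, 1.0)
    v = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(nu))
    return z / (v / nu) ** 0.5

n = 200000
p_t = sum(1 for _ in range(n) if t_draw(5) > 2.571) / n          # about 0.025
p_z = sum(1 for _ in range(n) if random.gauss(0.0, 1.0) > 2.571) / n  # much smaller
```

The comparison p_z < p_t makes the heavier tails of the t distribution concrete.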
The χ² (Chi-square) Distribution
Under the conditions of the previous section for variables X1, X2, …, Xn, the random variable χ² = (n − 1)S²/σ² follows a chi-square distribution with ν = n − 1 d.f.
The mean and variance of a χ² distribution with ν d.f. are E(χ²ν) = ν and V(χ²ν) = 2ν. The chi-square distribution is nonnegative and its pdf is skewed to the right. However, as ν increases, the distribution becomes more symmetric. Some percentage points of the chi-square distribution are given in Table A3.
For example, if ν = 6 and α = 0.025, the value of the chi-square distribution with 6 d.f. that leaves to its right the probability 0.025 is χ²0.025,6 = 14.45. That is, 0.025 = pr{χ²6 > 14.45}. Analogously, 0.975 = pr{χ²6 > 1.24} and, consequently, 0.95 = pr{1.24 < χ²6 < 14.45}.
The chi-square distribution has an important property: let χ²₁, χ²₂, …, χ²ₙ be independent chi-square random variables with ν1,
ν2, …, νn d.f., respectively. Then the random variable χ²₁ + χ²₂ + ⋯ + χ²ₙ follows a chi-square distribution with ν = ν1 + ν2 + ⋯ + νn d.f.
This property becomes apparent if we note that if Z1, Z2, …, Zn are independent standard normal random variables, then the
random variable Z1² + Z2² + ⋯ + Zn² follows a chi-square distribution with n d.f.
Table A4 Selected percentage points of the Fν1,ν2 distribution for α = 0.025.
ν2 ν1
1 2 3 4 5 6 7 8 9 10
1 647.79 799.50 864.16 899.58 921.85 937.11 948.22 956.66 963.28 968.63
2 38.506 39.000 39.165 39.248 39.298 39.331 39.355 39.373 39.387 39.398
3 17.443 16.044 15.439 15.101 14.885 14.735 14.624 14.540 14.473 14.419
4 12.218 10.649 9.979 9.605 9.365 9.197 9.074 8.980 8.905 8.844
5 10.007 8.434 7.764 7.388 7.146 6.978 6.853 6.757 6.681 6.619
6 8.813 7.260 6.599 6.227 5.988 5.820 5.696 5.600 5.523 5.461
7 8.073 6.542 5.890 5.523 5.285 5.119 4.995 4.899 4.823 4.761
8 7.571 6.060 5.416 5.053 4.817 4.652 4.529 4.433 4.357 4.295
9 7.209 5.715 5.078 4.718 4.484 4.320 4.197 4.102 4.026 3.964
10 6.937 5.456 4.826 4.468 4.236 4.072 3.950 3.855 3.779 3.717
11 6.724 5.256 4.630 4.275 4.044 3.881 3.759 3.664 3.588 3.526
12 6.554 5.096 4.474 4.121 3.891 3.728 3.607 3.512 3.436 3.374
13 6.414 4.965 4.347 3.996 3.767 3.604 3.483 3.388 3.312 3.250
14 6.298 4.857 4.242 3.892 3.663 3.501 3.380 3.285 3.209 3.147
15 6.200 4.765 4.153 3.804 3.576 3.415 3.293 3.199 3.123 3.060
Values of x such that 0.025 = pr{Fν1,ν2 > x}.
The F Distribution
Let X1 and X2 be independent chi-square random variables with ν1 and ν2 d.f., respectively. Then the ratio F = (X1/ν1)/(X2/ν2) follows an
F distribution with ν1 d.f. in the numerator and ν2 d.f. in the denominator, usually abbreviated as Fν1,ν2. The mean and variance
of Fν1,ν2 are E(Fν1,ν2) = ν2/(ν2 − 2) for ν2 > 2, and V(Fν1,ν2) = 2ν2²(ν1 + ν2 − 2)/[ν1(ν2 − 2)²(ν2 − 4)] for ν2 > 4.
The F distribution is nonnegative and skewed to the right. Some percentage points of the F distribution are given in Table A4 for
α = 0.025. For, say, ν1 = 5 and ν2 = 10, F0.025,5,10 = 4.24 is the value such that 0.025 = pr{F5,10 > 4.24}.
The lower percentage points can be found taking into account that F1−α,ν1,ν2 = 1/Fα,ν2,ν1. For example, from Table A4,
F0.975,5,10 = 1/F0.025,10,5 = 1/6.62 = 0.15. Therefore, 0.95 = pr{0.15 < F5,10 < 4.24}.
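As a quick numerical check (added here, with the upper percentage point read from Table A4), the reciprocal relation can be verified in a couple of lines of Python:

```python
# Lower percentage point of the F distribution from the reciprocal relation
# F(1-a; n1, n2) = 1 / F(a; n2, n1), using F(0.025; 10, 5) = 6.619 read from
# Table A4 (row nu2 = 5, column nu1 = 10).
f_upper = 6.619
f_lower = 1.0 / f_upper   # F(0.975; 5, 10)
print(round(f_lower, 2))  # 0.15, the value used in the text
```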
Convergence of Random Variables
Sometimes it is useful to think about how a sequence of random variables converges to another random variable. Let X1, X2, … be a
sequence of random variables and let F1(x), F2(x), … be the corresponding sequence of distribution functions.
If the distribution functions become more and more similar to the distribution function F of a random variable X when n → ∞,
then we say that the sequence converges to X "in distribution". Formally, this means that limn→∞ Fn(x) = F(x) for each x.
We say that Xn converges to X "in probability" if, for every ε > 0, limn→∞ pr{|Xn − X| > ε} = 0. This means that the probability of
the set where Xn differs from X by more than ε becomes smaller and smaller.
Furthermore, we say that Xn converges to X "almost surely" if pr{limn→∞ |Xn − X| = 0} = 1. Almost sure convergence means
that the set of outcomes for which the values of Xn get closer and closer to X has probability one.
It can be proven that almost sure convergence implies convergence in probability, which in turn implies convergence in
distribution. The following three fundamental results on convergence are the most widely used in practice.
• The "weak law of large numbers" states that if X1, X2, …, Xn, … are independent and identically distributed random variables with
finite mean μ, then (X1 + X2 + ⋯ + Xn)/n → μ in probability
• If the random variables also have a finite variance (a weaker condition suffices), then the "strong law of large numbers" holds,
that is, (X1 + X2 + ⋯ + Xn)/n → μ almost surely
• The "central limit theorem" states that for independent (or weakly correlated) random variables X1, X2, …, Xn with the same
distribution, (X̄ − μ)/(σ/√n) → Z = N(0,1) in distribution, where μ and σ² are the common mean and variance of the random
variables Xi. This means that the distribution of the standardized sample mean becomes more and more like that of a standard
normal random variable as n increases.
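A small Monte Carlo sketch of the central limit theorem may help fix ideas (an added illustration; the uniform distribution, seed, and sample sizes are arbitrary choices): even though Uniform(0, 1) variables are far from normal, the standardized sample mean leaves close to 2.5% of its values above 1.96.

```python
import math
import random

random.seed(7)
n = 30          # observations per sample mean
reps = 50_000   # number of simulated means
mu, sigma = 0.5, math.sqrt(1.0 / 12.0)  # mean and s.d. of Uniform(0, 1)

count = 0
for _ in range(reps):
    xbar = sum(random.random() for _ in range(n)) / n
    z = (xbar - mu) / (sigma / math.sqrt(n))  # standardized sample mean
    if z > 1.96:
        count += 1

tail = count / reps
print(tail)  # close to 0.025, the upper 2.5% tail of N(0, 1)
```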
Some Computational Aspects
The accessibility of personal computers makes it possible to do statistical calculations without tables. It is advisable to use software
specifically devoted to statistics, but in the early stages of learning it is worthwhile to do the calculations by hand in order to acquire
the intuition needed to avoid the errors that derive from an unreflective, automatic use of the software.
The basic distributions (normal, Student’s t, F, chi-square) can be programmed with the algorithms in Abramowitz and Stegun.68
Appendices of the book by Meier and Zünd69 show the necessary numerical approximations and programs in BASIC for the same
distributions. To compute the noncentral F, the needed numerical approximation can be consulted in Johnson and Kotz,70 and
Evans et al.71
All the calculations in this article have been made with the Statistics Toolbox for MATLAB.72 What follows is a list of the basic
commands used, which the reader can also find in the live scripts Appendix_1probDistr_live.mlx and Appendix_2power_live.mlx (in
the form of MATLAB mlx-files) in the supplementary material.
Note that all the MATLAB commands referring to cumulative distribution functions, Eqs. (A8)–(A11), compute the
cumulative probability α up to the corresponding value of the distribution. However, throughout the text and in Tables A1–A4,
the reported probability α is always the upper percentage point, that is, the cumulative probability above the corresponding
value.
Normal distribution
α = pr{N(μ,σ) < zα} (A8)
• z = norminv(α, μ, σ)
Example A1: α = 0.05, μ = 0, σ = 1; then norminv(0.05, 0, 1) gives z = −1.645.
• α = normcdf(z, μ, σ)
Example A2: z = 1.645, μ = 0, σ = 1; then normcdf(1.645, 0, 1) gives α = 0.95.
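For readers without MATLAB, the two normal-distribution commands above can be reproduced with Python's standard library (`statistics.NormalDist`); this equivalent is an addition to the original list:

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Equivalent of norminv(0.05, 0, 1): the 5% quantile of N(0, 1)
z = std_normal.inv_cdf(0.05)
print(round(z, 3))   # -1.645

# Equivalent of normcdf(1.645, 0, 1): cumulative probability up to 1.645
a = std_normal.cdf(1.645)
print(round(a, 2))   # 0.95
```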
Student’s t distribution with ν degrees of freedom
α = pr{tν < tα,ν} (A9)
• t = tinv(α, ν)
Example A3: α = 0.05, ν = 5; then tinv(0.05,5) gives t = −2.015.
• α = tcdf(t, ν)
Example A4: t = 1.645, ν = 5; then tcdf(1.645,5) gives α = 0.9196.
χ² distribution with ν degrees of freedom
α = pr{χ²ν < χ²α,ν} (A10)
• x = chi2inv(α, ν)
Example A5: α = 0.05, ν = 5; then chi2inv(0.05,5) gives x = 1.1455.
• α = chi2cdf(x, ν)
Example A6: x = 9.24, ν = 5; then chi2cdf(9.24,5) gives α = 0.9001.
Fν1,ν2 distribution with ν1 and ν2 degrees of freedom
α = pr{Fν1,ν2 < Fα,ν1,ν2} (A11)
• x = finv(α, ν1, ν2)
Example A7: α = 0.95, ν1 = 5, ν2 = 15; then finv(0.95,5,15) gives x = 2.9013.
• α = fcdf(x, ν1, ν2)
Example A8: x = 2.90, ν1 = 5, ν2 = 15; then fcdf(2.90,5,15) gives α = 0.9499.
Power for the z-test, Eq. (40)
Example A9: With the data of Example 8, |δ| = 0.40, σ = 0.55, n = 10, α = 0.05; then
normcdf(norminv(0.95,0,1) − 0.40∗sqrt(10)/0.55) gives 0.2562.
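The same computation, β = Φ(z1−α − |δ|√n/σ), can be sketched with Python's standard library (an added equivalent of the MATLAB sentence above, not part of the original text):

```python
import math
from statistics import NormalDist

phi = NormalDist()  # standard normal N(0, 1)

delta, sigma, n, alpha = 0.40, 0.55, 10, 0.05

# beta = Phi( z_{1-alpha} - |delta| * sqrt(n) / sigma ), Eq. (40)
z_crit = phi.inv_cdf(1 - alpha)
beta = phi.cdf(z_crit - abs(delta) * math.sqrt(n) / sigma)
print(round(beta, 4))  # 0.2562, matching the MATLAB result
```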
Power for the t-test, Eq. (43)
Example A10: With the same data as for the z-test, the sentence includes the noncentral t distribution "nctcdf" and the t
distribution "tinv", both of them with n − 1 degrees of freedom and noncentrality parameter 0.73√10:
nctcdf(tinv(0.95,9),9, 0.73∗sqrt(10)) gives β = 0.3137.
Example A11: With the data of Example 9, α = 0.05, n = 10, and the noncentrality parameter is 0.57√10:
nctcdf(tinv(0.95,9), 9, 0.57∗sqrt(10)) gives β = 0.4918.
Power for the chi-square test, Eq. (46)
Example A12: We have λ = 2, α = 0.05, and n = 14 or n = 13 to obtain a value of β ≈ 0.05.
chi2cdf(chi2inv(0.95,13)/(2∗2),13) gives β = 0.0402.
chi2cdf(chi2inv(0.95,12)/(2∗2),12) gives β = 0.0511.
Notice that the d.f. equal 14 − 1 = 13 or 13 − 1 = 12.
Power for the F-test, Eq. (57)
Example A13: For α = 0.05, n1 = n2 = 9 and λ = σ1/σ2 = 2, which are those of question (3) in Example 13,
fcdf(finv(0.975,8,8)/(2∗2),8,8) − fcdf(finv(0.025,8,8)/(2∗2),8,8) gives β = 0.5558.
If we look for the sample size n = n1 = n2 so that β ≤ 0.10, trying some values, we get
fcdf(finv(0.975,22,22)/(2∗2),22,22) − fcdf(finv(0.025,22,22)/(2∗2),22,22), which gives β = 0.1115, and
fcdf(finv(0.975,23,23)/(2∗2),23,23) − fcdf(finv(0.025,23,23)/(2∗2),23,23), which gives β = 0.0981.
Consequently, n = 24.
Power for fixed effects ANOVA, Eq. (82)
Example A14: For α = 0.05, ν1 = 4, ν2 = 15, and noncentrality parameter δ = n Στi²/σ² = 4 × 2 = 8, the command
ncfcdf(finv(0.95,4,15),4,15,8) gives β = 0.5364.
Notice that in the ANOVA model the number of levels of the factor is k = 5 and there are n = 4 replicates per level.
Power for random effects ANOVA, Eq. (98)
Example A15: Suppose that 10 laboratories participate in a proficiency test to evaluate a method. The assumed risks are α = β = 0.05
and it is desirable to detect at least an interlaboratory variance equal to the intralaboratory variance, that is, στ²/σ² = 1, so that
λ² = 1 + nστ²/σ² = 1 + n. With these data,
k = 10; n = 4; fcdf(finv(0.95,k − 1,k∗(n − 1))/(1 + 1∗n),k − 1,k∗(n − 1)) gives β = 0.099
k = 10; n = 5; fcdf(finv(0.95,k − 1,k∗(n − 1))/(1 + 1∗n),k − 1,k∗(n − 1)) gives β = 0.050
Thus, each laboratory must do five determinations.
References
1. EURACHEM/CITAC, Guide CG4. In Quantifying Uncertainty in Analytical Measurement, 2nd ed.; Ellison, S. L. R., Rosslein, M., Williams, A., Eds.; 2000. ISBN: 0-948926-15-5.
Available from the Eurachem Secretariat. See http://www.eurochem.org.
2. Evaluation of Measurement Data, Supplement 1 to the ‘Guide to the Expression of Uncertainty in Measurement’—Propagation of Distributions Using a Monte Carlo Method; Joint
Committee for Guides in Metrology, 2008. JCGM 101.
3. Commission Decision (EC), No 2002/657/EC of 12 August 2002 Implementing Council Directive 96/23/EC Concerning the Performance of Analytical Methods and the
Interpretation of Results. Off. J. Eur. Commun. 2002, L221, 8–36.
4. Aldama, J. M. Practicum of Master in Advanced Chemistry; University of Burgos: Burgos, Spain, 2007.
5. Analytical Methods Committee, Robust Statistics-How Not to Reject Outliers, Part 1. Basic Concepts. Analyst 1989, 114, 1693–1697.
6. Analytical Methods Committee, Robust Statistics-How Not to Reject Outliers, Part 2. Inter-laboratory Trials. Analyst 1989, 114, 1699–1702.
7. ISO 5725, Accuracy Trueness and Precision of Measurement Methods and Results, Part 1. General Principles and Definitions, Part 2. Basic Method for the Determination of
Repeatability and Reproducibility of a Standard Measurement Method, Part 3. Intermediate Measures of the Precision of a Standard Measurement Method, Part 4. Basic Methods
for the Determination of the Trueness of a Standard Measurement Method, Part 5. Alternative Methods for the Determination of the Precision of a Standard Measurement Method,
Part 6. Use in Practice of Accuracy Values. Genève, 1994.
8. Analytical Methods Committee, In Technical Brief No 4; Thompson, M., Ed.; 2006. www.rsc.org/amc/.
9. Silverman, B. W. Density Estimation for Statistics and Data Analysis; Chapman and Hall: London, Great Britain, 1986.
10. Wand, M. P.; Jones, M. C. Kernel Smoothing; Chapman and Hall: London, Great Britain, 1995.
11. Inczédy, J.; Lengyel, T.; Ure, A. M.; Gelencsér, A.; Hulanicki, A. Compendium of Analytical Nomenclature IUPAC, 3rd ed.; Port City Press Inc.: Baltimore 2nd printing, 2000.
12. Lira, I.; Wöger, W. Comparison Between the Conventional and Bayesian Approaches to Evaluate Measurement Data. Metrologia 2006, 43, S249–S259.
13. Zech, G. Frequentist and Bayesian confidence intervals. Eur. Phys. J. Direct 2002, C12, 1–81.
14. Armstrong, N.; Hibbert, D. B. An Introduction to Bayesian Methods for Analyzing Chemistry Data, Part 1: An Introduction to Bayesian Theory and Methods. Chemom. Intel. Lab.
Syst. 2009, 97, 194–210.
15. Armstrong, N.; Hibbert, D. B. An Introduction to Bayesian Methods for Analyzing Chemistry Data, Part II: A Review of Applications of Bayesian Methods in Chemistry. Chemom.
Intel. Lab. Syst. 2009, 97, 211–220.
16. Sprent, P.; Smeeton, N. C. Applied Nonparametric Statistical Methods, 4th ed.; Chapman & Hall/CRC: Boca Raton, 2007.
17. Patel, J. K. Tolerance Limits. A Review. Commun. Stat. Theory Methods 1986, 15 (9), 2716–2762.
18. Meléndez, M. E.; Sarabia, L. A.; Ortiz, M. C. Distribution Free Methods to Model the Content of Biogenic Amines in Spanish Wines. Chemom. Intel. Lab. Syst. 2016, 155,
191–199.
19. Reguera, C.; Sanllorente, S.; Herrero, A.; Sarabia, L. A.; Ortiz, M. C. Study of the Effect of the Presence of Silver Nanoparticles on Migration of Bisphenol A From Polycarbonate
Glasses into Food Simulants. Chemom. Intel. Lab. Syst. 2018, 176, 66–73.
20. Wald, A.; Wolfowitz, J. Tolerance Limits for a Normal Distribution. Ann. Math. Stat. 1946, 17, 208–215.
21. Wilks, S. S. Determination of Sample Sizes for Setting Tolerance Limits. Ann. Math. Stat. 1941, 12, 91–96.
22. Kendall, M.; Stuart, A. The Advanced Theory of Statistics, Inference and Relationship; Charles Griffin & Company Ltd.: London, 1979; Vol. 2, pp 547–548 (Section 32.11).
23. Willink, R. On Using the Monte Carlo Method to Calculate Uncertainty Intervals. Metrologia 2006, 43, L39–L42.
24. Guttman, I. Statistical Tolerance Regions; Charles Griffin and Company: London, 1970.
25. Huber, P.; Nguyen-Huu, J. J.; Boulanger, B.; Chapuzet, E.; Chiap, P.; Cohen, N.; Compagnon, P. A.; Dewé, W.; Feinberg, M.; Lallier, M.; Laurentie, M.; Mercier, N.; Muzard, G.;
Nivet, C.; Valat, L. Harmonization of Strategies for the Validation of Quantitative Analytical Procedures. A SFSTP Proposal – Part I. J. Pharm. Biomed. Anal. 2004, 36, 579–586.
26. Huber, P.; Nguyen-Huu, J. J.; Boulanger, B.; Chapuzet, E.; Chiap, P.; Cohen, N.; Compagnon, P. A.; Dewé, W.; Feinberg, M.; Lallier, M.; Laurentie, M.; Mercier, N.; Muzard, G.;
Nivet, C.; Valat, L.; Rozet, E. Harmonization of Strategies for the Validation of Quantitative Analytical Procedures. A SFSTP Proposal—Part II. J. Pharm. Biomed. Anal. 2007, 45,
70–81.
27. Huber, P.; Nguyen-Huu, J. J.; Boulanger, B.; Chapuzet, E.; Cohen, N.; Compagnon, P. A.; Dewé, W.; Feinberg, M.; Laurentie, M.; Mercier, N.; Muzard, G.; Valat, L.; Rozet, E.
Harmonization of Strategies for the Validation of Quantitative Analytical Procedures. A SFSTP Proposal—Part III. J. Pharm. Biomed. Anal. 2007, 45, 82–86.
28. Feinberg, M. Validation of Analytical Methods Based on Accuracy Profiles. J. Chromatogr. A 2007, 1158, 174–183.
29. Rozet, E.; Hubert, C.; Ceccato, A.; Dewé, W.; Ziemons, E.; Moonen, F.; Michail, K.; Wintersteiger, R.; Streel, B.; Boulanger, B.; Hubert, P. Using Tolerance Intervals in Pre-Study
Validation of Analytical Methods to Predict In-Study Results. The Fit-for-Future-Purpose Concept. J. Chromatogr. A 2007, 1158, 126–137.
30. Rozet, E.; Ceccato, A.; Hubert, C.; Ziemons, E.; Oprean, R.; Rudaz, S.; Boulanger, B.; Hubert, P. Analysis of Recent Pharmaceutical Regulatory Documents on Analytical Method
Validation. J. Chromatogr. A 2007, 1158, 111–125.
31. Dewé, W.; Govaerts, B.; Boulanger, B.; Rozet, E.; Chiap, P.; Hubert, P. Using Total Error as Decision Criterion in Analytical Method Transfer. Chemom. Intel. Lab. Syst. 2007, 85,
262–268.
32. González, A. G.; Herrador, M. A. Accuracy Profiles from Uncertainty Measurements. Talanta 2006, 70, 896–901.
33. Rebafka, T.; Clémençon, S.; Feinberg, M. Bootstrap-Based Tolerance Intervals for Application to Method Validation. Chemom. Intel. Lab. Syst. 2007, 89, 69–81.
34. Fernholz, L. T.; Gillespie, J. A. Content-Correct Tolerance Limits Based on the Bootstrap. Technometrics 2001, 43 (2), 147–155.
35. Cowen, S.; Ellison, S. L. R. Reporting Measurement Uncertainty and Coverage Intervals Near Natural Limits. Analyst 2006, 131, 710–717.
36. Schouten, H. J. A. Sample Size Formulae with a Continuous Outcome for Unequal Group Sizes and Unequal Variances. Stat. Med. 1999, 18, 87–91.
37. Lehmann, E. L. Testing Statistical Hypotheses; Wiley & Sons: New York, 1959.
38. Schuirmann, D. J. A Comparison of the Two One-Sided Tests Procedure and the Power Approach for Assessing the Equivalence of Average Bioavailability. J. Pharmacokinet.
Biopharm. 1987, 15, 657–680.
39. Mehring, G. H. On Optimal Tests for General Interval Hypothesis. Commun. Stat. Theory Methods 1993, 22 (5), 1257–1297.
40. Brown, L. D.; Hwang, J. T. G.; Munk, A. An Unbiased Test for the Bioequivalence Problem. Ann. Stat. 1998, 25, 2345–2367.
41. Munk, A.; Hwang, J. T. G.; Brown, L. D. Testing Average Equivalence. Finding a Compromise Between Theory and Practice. Biom. J. 2000, 42 (5), 531–552.
42. Hartmann, C.; Smeyers-Verbeke, J.; Penninckx, W.; Vander Heyden, Y.; Vankeerberghen, P.; Massart, D. L. Reappraisal of Hypothesis Testing for Method Validation: Detection of
Systematic Error by Comparing the Means of Two Methods or of Two Laboratories. Anal. Chem. 1995, 67, 4491–4499.
43. Limentani, G. B.; Ringo, M. C.; Ye, F.; Bergquist, M. L.; McSorley, E. O. Beyond the t-Test. Statistical Equivalence Testing. Anal. Chem. 2005, 77, 221A–226A.
44. Kuttatharmmakull, S.; Massart, D. L.; Smeyers-Verbeke, J. Comparison of Alternative Measurement Methods: Determination of the Minimal Number of Measurements Required
for the Evaluation of the Bias by Means of Interval Hypothesis Testing. Chemom. Intel. Lab. Syst. 2000, 52, 61–73.
45. Martín Andrés, A.; Luna del Castillo, J. D. Bioestadística para las ciencias de la salud; Ediciones Norma-Capitel: Madrid, 2004.
46. Wellek, S. Testing Statistical Hypotheses of Equivalence; Chapman & Hall/CRC Press LLC: Boca Raton, FL, 2003.
47. Ortiz, M. C.; Herrero, A.; Sanllorente, S.; Reguera, C. The Quality of the Information Contained in Chemical Measures; Servicio de Publicaciones Universidad de Burgos: Burgos,
2005. (Electronic Book).
48. D’Agostino, R. B., Stephens, M. A., Eds.; In Goodness-of-Fit Techniques; Marcel Dekker Inc.: New York, 1986.
49. Moreno, E.; Girón, F. J. On the Frequentist and Bayesian Approaches to Hypothesis Testing (with discussion). Stat. Oper. Res. Trans. 2006, 30 (1), 3–28.
50. Scheffé, H. The Analysis of Variance; Wiley & Sons: New York, 1959.
51. Anderson, V. L.; MacLean, R. A. Design of Experiments. A Realistic Approach; Marcel Dekker Inc.: New York, 1974.
52. Milliken, G. A.; Johnson, D. E. Analysis of Messy Data: Designed Experiments; Wadsworth Publishing Co.: Belmont, NJ, 1984; vol. I.
53. Searle, S. R. Linear Models; Wiley & Sons, Inc.: New York, 1971.
54. Youden, W. J. Statistical Techniques for Collaborative Tests; Association of Official Analytical Chemists: Washington, DC, 1972.
55. Kateman, G.; Pijpers, F. W. Quality Control in Analytical Chemistry; Wiley & Sons: New York, 1981.
56. Kuttatharmmakull, S.; Massart, D. L.; Smeyers-Verbeke, J. Comparison of Alternative Measurement Methods. Anal. Chim. Acta 1999, 391, 203–225.
57. Hampel, F. R.; Ronchetti, E. M.; Rousseeuw, P. J.; Stahel, W. A. Robust Statistics. The Approach Based on Influence Functions; Wiley-Interscience: Zurich, 1985.
58. Huber, P. J. Robust Statistics; Wiley & Sons: New York, 1981.
59. Thompson, M.; Wood, R. J. Assoc. Off. Anal. Chem. Int. 1993, 76, 926–940.
60. Sanz, M. B.; Ortiz, M. C.; Herrero, A.; Sarabia, L. A. Robust and Non Parametric Statistic in the Validation of Chemical Analysis Methods. Quím. Anal. 1999, 18, 91–97.
61. García, I.; Sarabia, L.; Ortiz, M. C.; Aldama, J. M. Usefulness of D-optimal Designs and Multicriteria Optimization in Laborious Analytical Procedures. Application to the Extraction
of Quinolones From Eggs. J. Chromatogr. A 2005, 1085, 190–198.
62. García, I.; Sarabia, L. A.; Ortiz, M. C.; Aldama, J. M. Robustness of the Extraction Step When Parallel Factor Analysis (PARAFAC) is Used to Quantify Sulfonamides in Kidney by
High Performance Liquid Chromatography-Diode Array Detection (HPLC-DAD). Analyst 2004, 129 (8), 766–771.
63. Massart, D. L.; Vandeginste, B. G. M.; Buydens, L. M. C.; de Jong, S.; Lewi, P. J.; Smeyers-Verbeke, J. Handbook of Chemometrics and Qualimetrics: Part A; Elsevier:
Amsterdam, 1997.
64. Herrero, A.; Reguera, C.; Ortiz, M. C.; Sarabia, L. A.; Sánchez, M. S. Ad-Hoc Blocked Design for the Robustness Study in the Determination of Dichlobenil and 2,6-
Dichlorobenzamide in Onions by Programmed Temperature Vaporization-Gas Chromatography–Mass Spectrometry. J. Chromatogr. A 2014, 1370, 187–199.
65. Arce, M. M.; Sanllorente, S.; Ortiz, M. C.; Sarabia, L. A. Easy-To-Use Procedure to Optimise a Chromatographic Method. Application in the Determination of Bisphenol-A and
Phenol in Toys by Means of Liquid Chromatography with Fluorescence Detection. J. Chromatogr. A 2018, 1534, 93–100.
66. Oca, M. L.; Rubio, L.; Ortiz, M. C.; Sarabia, L. A.; García, I. Robustness Testing in the Determination of Seven Drugs in Animal Muscle by Liquid Chromatography–Tandem Mass
Spectrometry. Chemom. Intel. Lab. Syst. 2016, 151, 172–180.
67. Rodríguez, N.; Ortiz, M. C.; Sarabia, L. A. Study of Robustness Based on N-Way Models in the Spectrofluorimetric Determination of Tetracyclines in Milk When Quenching Exists.
Anal. Chim. Acta 2009, 651, 149–158.
68. Abramowitz, M.; Stegun, I. A. Handbook of Mathematical Functions; Government Printing Office, 1964.
69. Meier, P. C.; Zünd, R. E. Statistical Methods in Analytical Chemistry, 2nd ed.; Wiley & Sons: New York, 2000.
70. Johnson, N.; Kotz, S. Distributions in Statistics: Continuous Univariate Distributions—2; Wiley & Sons: New York, 1970; p 191 (Eq. (5)).
71. Evans, M.; Hastings, N.; Peacock, B. Statistical Distributions, 2nd ed.; Wiley & Sons: New York, 1993; 73–74.
72. The MathWorks, Inc. Statistics and Machine Learning Toolbox for Use with MATLAB®, version 11.4 (R2018b); The MathWorks, Inc., 2018.