I have provided an explanatory commentary, followed by an outline I have been developing. Please send comments to jmsiegel.info.research@worldnet.att.net
Deming's view of experimental design is based on a perspective going back
at least to R.A. Fisher. (Edgington claimed that some of the ideas were
actually invented by E.J.G. Pitman and adopted by Fisher later.) Unlike the
underlying theory of variation, key elements of the Fisher approach to
experimental design have been accepted by many prominent statisticians, as
I shall show. But for the reasons I will explain, Fisher's approach is
nonetheless very different from the one most statisticians actually use in
practice.
Most people want to use experiments for the purpose of making some sort of
generalization. We'd like to select a sample from some population, do an
experiment on it, and draw an inference from the sample to the whole. We
want to use statistical theory to shore up the reliability of that
inference. The traditional approach uses sampling theory and draws
inferences based on that theory.
But Fisher noticed that most of the time we can't really do this. Not only
is it rare for experimental subjects to be drawn randomly from the frame of
real interest, it is rare for them to be drawn randomly from any frame at
all. Even in agriculture, we don't randomly sample the soil types or
climate conditions we would like to try our seeds in. In industry,
prototypes are often made under conditions quite different from production
versions. And when we work with people, we can't choose our subjects -- we
have to work with volunteers. The kind of people who volunteer for
experiments tend to be different from the kind who don't. Even when we
survey we have to deal with nonresponse bias. Often our experiment is done
on whoever happens to show up. Of course we can't sample from the future.
But the problem is actually broader even than that. Often, we can't even sample
from the present.
In The Design of Experiments (1935), Fisher described a technique to
partially get around this problem, which was later to be called a
randomization test. His example used an experiment on a "lady," a staff
member at the Rothamsted Experimental Station in England who claimed to be
able to discern whether her cup of tea had been prepared by pouring the tea
first and then the milk into the tea, or by pouring the milk first and then
the tea into the milk. Fisher gave her eight cups of tea, in random order, and
told her that four were tea-in-milk and four milk-in-tea. If she had no
discerning ability at all, the chance of getting them all correct would be
1 in 70 (recall that 8C4 is 8!/(4!4!) = 70). Thus, the experiment tested her
individual ability to distinguish the two kinds of tea preparation during
the course of the experiment, and provided an exact probability with
respect to this. But the experiment did not test anyone else's ability in
this regard. Any inference that another "lady" would have a similar
ability, or even the same lady would get comparable results with a
different kind of tea or milk, would have to be nonstatistical in nature.
The experiment produced a probability result valid for the specific
conditions at hand, at the cost of losing any claim that an inference to
anything more general has a statistically-justified basis.
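The 1-in-70 figure can be checked by brute force. The following is a minimal
sketch, not Fisher's own calculation: it assumes the eight cups are labelled
0 to 7, picks an arbitrary (hypothetical) set of four milk-first cups as the
truth, and counts how many of those cups each possible guess names correctly.

```python
from collections import Counter
from itertools import combinations


def guess_distribution(truth, n_cups=8):
    """Tally, over every possible guess, how many milk-first cups it names correctly.

    `truth` is the set of cups that really were milk-first. The cup labels
    are an illustrative assumption; any four labels give the same tally.
    """
    truth = set(truth)
    counts = Counter()
    for guess in combinations(range(n_cups), len(truth)):
        counts[len(truth & set(guess))] += 1
    return counts


dist = guess_distribution(truth=(0, 3, 4, 6))
total = sum(dist.values())
print(total)      # 70 possible guesses
print(dist[4])    # 1 guess gets all four right
```

Only one of the 70 equally likely guesses is entirely correct, which is where
the exact probability comes from. Nothing in the count refers to any other
taster, tea, or occasion.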
From this origin came what statisticians would call the randomization test
(or permutation test). We can take the particular subjects in front of us
and, if we can randomize them with complete freedom, we can use probability
theory to draw a valid inference about the ability of the particular group
experimented on to distinguish the factors being examined under the
specific conditions of the experiment. The use of probability theory is in
this instance perfectly legitimate. But when we do this we can no longer
claim that statistical theory provides a basis for drawing inferences from
the particular experimental subjects and conditions to anything else.
Statistical theory provides no valid basis for doing so. In order to draw
such an inference, we must use nonstatistical expert judgement.
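As an illustration of the mechanics, here is a minimal sketch of an exact
randomization test for a two-group experiment. The choice of test statistic
(difference in group means, one-sided) and the data are assumptions made for
the example; they do not come from any of the works cited here.

```python
from itertools import combinations


def randomization_p_value(treated, control):
    """Exact one-sided p-value for the difference in means.

    Every way of assigning the observed responses to a treated group of the
    same size is enumerated. The p-value is the fraction of assignments whose
    difference in means is at least as large as the one actually observed.
    """
    pooled = list(treated) + list(control)
    n_t = len(treated)
    n_c = len(pooled) - n_t
    grand_total = sum(pooled)

    def diff(indices):
        s = sum(pooled[i] for i in indices)
        return s / n_t - (grand_total - s) / n_c

    observed = diff(range(n_t))
    as_extreme = 0
    n_assignments = 0
    for indices in combinations(range(len(pooled)), n_t):
        n_assignments += 1
        if diff(indices) >= observed - 1e-12:  # tolerance for float rounding
            as_extreme += 1
    return as_extreme / n_assignments


# Hypothetical responses for six subjects, three per group.
print(randomization_p_value([12, 14, 15], [8, 9, 10]))  # 0.05
```

The result, 1 in 20, is a statement about these six subjects under the random
assignment actually carried out. It says nothing by itself about any seventh
subject.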
Unlike many other Deming ideas, this idea has been fully accepted -- with
its full implications -- by leading mainstream statisticians for many
years. Kempthorne wrote in 1955 that "Tests of significance in the
randomized experiment have frequently been presented by way of normal law
theory, whereas their validity stems from randomization theory"
(Kempthorne, "The Randomization Theory of Experimental Inference", 1955).
Box, Hunter and Hunter wrote in 1978 "It is curious that the hypothesis of
random sampling is treated in much statistical writing as if it were a law
of nature. In fact, for real data it is a property that can never be relied
on, although special precautions in the design of an experiment
can make the assumption relevant." (p. 31). (Box, Hunter, and Hunter,
Statistics For Experimenters: An Introduction to Design, Data Analysis, and
Model Building, 1978).
This kind of approach basically remained a parlor-conversation piece for
many years. The reason was simple: the calculations required for anything
other than a simple demonstration case were astronomical. Fisher's example
served something like the same purpose as Einstein's thought-experiments in
critiquing quantum mechanics. Such examples raised crucial points to think
about, but one couldn't actually carry them out.
Until the advent of computers, statisticians had no choice but to focus on
what they could actually do and hope for the best.
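To give a sense of scale, the sketch below counts the distinct assignments
that an exact test would have to examine when n subjects are split evenly
between two treatments. The even split and the sample sizes are illustrative
assumptions.

```python
from math import comb


def n_assignments(n_subjects):
    """Number of ways to choose which half of the subjects get treatment A."""
    return comb(n_subjects, n_subjects // 2)


for n in (8, 20, 40, 100):
    print(n, n_assignments(n))
# 8 subjects   -> 70
# 20 subjects  -> 184756
# 40 subjects  -> 137846528820
# 100 subjects -> a 30-digit number, roughly 1e29
```

Seventy arrangements can be worked by hand. A hundred and eighty thousand
cannot, and the count keeps growing rapidly from there.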
It turns out that the complexities of randomization-theory calculations
could sometimes be avoided by using an approximation. One was found.
Linear-models theory statistics are a reasonably good approximation to the
behavior of randomization-theory statistics (except at the extreme tails),
even for fairly small samples, as long as standard designs are used and
the standard linear-theory conditions (homoskedasticity and the like) hold.
Both Kempthorne and Box, Hunter, and Hunter discuss this. Under the
assumed conditions, one can legitimately use the machinery of linear-models
statistics, as long as one understands that the results need to be
interpreted in a completely different way. Thus the situation was something
like a control chart: one performs calculations using a mechanism usually
associated with one theory, while interpreting the results in a manner
derived from another. One has to keep in the back of one's head that the
linear-models stuff is just an approximation, a calculational convenience.
But this theoretical sleight-of-hand has been hard for statisticians to
grasp. Performing calculations based on one theory while interpreting
based on another is a bit like trying to rub your stomach and pat your head
at the same time. It creates cognitive dissonance. Over the years, students
of statistics have tended to remember the theory that explains how to do
the calculations, and forget the theory that explains how to interpret the
results. We naturally gravitate to the theory that seems to apply to what
we actually do, and school has traditionally emphasized calculation. In
addition, linear-models theory supports a nice ability to draw inferences,
while randomization theory suggests otherwise. Linear-models theory was
calculable. It gave results people wanted. It was better in absolutely
every respect except that it happens not to be true. It was what most
statisticians ended up using.
The advent of computers has given us the ability to actually calculate
exact randomization-theory results in ways impossible before. This recent
ability has resulted in renewed interest in randomization-test theory by
mainstream mathematical statisticians, although often in ways quite alien
to Deming's overall approach. The majority of this type of work is perhaps
exemplified by Bradley Efron of Stanford University and colleagues, who
have been attracted to the complex, computer-intensive calculations
involved in actually doing these tests exactly. (See also e.g. J.S. Urban
Hjorth, "Computer Intensive Statistical Methods"; Eric W. Noreen, "Computer
Intensive Methods for Testing Hypotheses"; J.S. Maritz, "Distribution-Free
Statistical Methods".) One weakness of this line of thinking is that it has
tended to emphasize calculation problems to the point of not laying much
stress on developing an understanding of what the calculations really mean.
Since drawing any useful inference necessarily involves a judgement call
under the theory, an applied researcher would find precise calculations a
waste of time wherever an approximation will do. But I think this
calculation-oriented work will ultimately prove useful, for two reasons.
The main one is psychological: history suggests that statisticians tend to
think about only the theories they actually use in making calculations. If
we could get people to do calculations based on the real theory, perhaps
they would be more willing to think about it. The second reason is that
linear-models statistics are a good approximation to randomization
statistics only under certain distributional assumptions, such as symmetry
and homoskedasticity. These assumptions are often not true. Exact results
will work even under quirky conditions, where approximations won't.
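Where full enumeration is too expensive, one common compromise is to sample
re-randomizations at random instead of listing them all. This is a swapped-in
technique, offered only as a sketch: it is not described in the works cited
above, and the statistic, the number of draws, and the seed are all
assumptions. It still makes no distributional assumption about the data.

```python
import random


def monte_carlo_p_value(treated, control, n_draws=10_000, seed=0):
    """Approximate one-sided randomization p-value for the difference in means.

    Instead of enumerating every assignment, the pooled responses are
    reshuffled `n_draws` times. The observed arrangement is counted as one
    draw, so the estimate can never be exactly zero.
    """
    rng = random.Random(seed)
    pooled = list(treated) + list(control)
    n_t = len(treated)
    n_c = len(pooled) - n_t

    def diff(values):
        return sum(values[:n_t]) / n_t - sum(values[n_t:]) / n_c

    observed = diff(pooled)  # computed before any shuffling
    hits = 0
    for _ in range(n_draws):
        rng.shuffle(pooled)
        if diff(pooled) >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return (hits + 1) / (n_draws + 1)


# Hypothetical data; full enumeration of these six values gives exactly 1/20.
estimate = monte_carlo_p_value([12, 14, 15], [8, 9, 10])
print(round(estimate, 2))
```

The estimate carries sampling error of its own, which shrinks as the number
of draws grows. Whether that extra precision matters is a judgement call for
the researcher.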
But there is another line of work that may be of more immediate interest.
The theory has also been developed by researchers who have thought
reflectively about the problems of doing applied research, and adapted the
randomization approach as a result of this thinking, independently of
Deming. The best example I have found is the first full book on the
subject, "Randomization Tests" by Eugene S. Edgington (1st ed., 1980. 3rd
ed., 1995). Edgington, a psychologist, gives careful attention to the
validity problems of standard statistics in doing real-world experiments,
explains how the approach deals with some of these problems, and gives a
very frank discussion of the unavoidable need to use nonstatistical
expertise and judgement under these conditions. The following quote
(emphasis in original) should suffice as an example:
"Statistical inferences cannot be made without random samples from those populations, and random sampling from the populations of interest to the experimenter is likely to be impossible. Thus for practically all experiments valid statistical inferences about a POPULATION of interest cannot be drawn. In the absence of random sampling, STATISTICAL inferences about treatment effects must be restricted to the subjects (or other experimental units) used in an experiment. Inferences about treatment effects for other subjects must be NONSTATISTICAL INFERENCE -- inferences without a basis in probability. We generalize from our experimental subjects to individuals who are quite similar in those characteristics that we consider relevant. ... the main burden of generalizing from experiments always has been, and continues to be, carried by nonstatistical rather than statistical logic."
Deming regularly sought out people to learn from who were working on
similar problems. Alfie Kohn is probably the most notable recent example. I
think we should do the same. In the area of experimental design, as I have
tried to show, a number of other folks have been trying to grapple with some
of the same issues and problems as Deming did. Edgington, for example, used
a somewhat different vocabulary from Deming, but I find the fundamental
issues involved very consistent, and I think these folks have developed
some useful ideas, ideas it would be worth our while to learn something
from.
Deming's critical contribution, as I see it, was to associate the
underlying theory of variation with the theory of experimental design. If
we can get our processes under statistical control, we can perform designed
experiments on the subjects before us and then legitimately draw inferences
on into the future. The combination of the two provides a complete way to
get around the problem of inference in the real world, at least where
control exists or can be instituted. Without understanding both concepts
and how they work together, all we have is a critique, not a solution. It
is this combination that lets us go forward. With just control charts, we
can control, but not easily improve, even if we have the correct theory.
With just experiments, the correct theory of inference gives us no basis
for taking action on the results, since without more we cannot legitimately
draw inferences to anything else. The two have to be taken together to be
fully effective.
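For readers who have not built one, the control-chart half of this
combination can be sketched in a few lines. The example assumes an
individuals chart with limits set three estimated standard deviations either
side of the mean, where the estimate comes from the average moving range
divided by the conventional constant 1.128. That chart type, the constant,
and the data are assumptions for illustration; the text above does not
specify them.

```python
from statistics import mean

D2_FOR_PAIRS = 1.128  # conventional constant for moving ranges of two points


def control_limits(values):
    """Return (lower, centre, upper) three-sigma limits for an individuals chart."""
    centre = mean(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    sigma_hat = mean(moving_ranges) / D2_FOR_PAIRS
    return centre - 3 * sigma_hat, centre, centre + 3 * sigma_hat


def signals(values):
    """Positions, in time order, of points falling outside the limits."""
    lower, _, upper = control_limits(values)
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical measurements, listed in the order they were taken.
stable = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10]
shifted = [10, 11, 9, 10, 12, 10, 9, 11, 10, 20]
print(signals(stable))   # []
print(signals(shifted))  # [9]
```

The limits depend on the temporal order of the data, since the moving ranges
are taken between successive points. A point outside them is a prompt to look
for a cause, not a probability statement.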
1. Entertain for a moment the concept of statistics as an empirical
science of variation rather than a branch of mathematics. As an empirical
science, statistical ideas would be merely theories subject to revision
based on empirical research into natural variation. The laws of probability
would merely be our best current idea about how variation works, subject to
revision based on evidence.
2. As in other sciences, if evidence demonstrated that our theory didn't
explain real-world behavior, we would be obligated to revise our theory to
account for the difference.
3. Areas of study thought to be branches of mathematics have turned out to
be empirical sciences before. Examples of cases where ideas thought to be a
priori turned out to be contradicted by evidence include geometry (the
General Theory of Relativity shows that the universe is non-Euclidean) and
even logic (quantum phenomena do not obey classical axioms).
4. Empirical evidence suggests that key assumptions of classical statistics
are invalid.
5. Empirical evidence suggests that independent, identically distributed
phenomena do not exist. Research takes several forms.
5.1. Most complex real-world phenomena behave chaotically rather than
either deterministically or probabilistically. There is a lot of evidence
for this, from Shewhart's early work through current research in everything
from physics and astronomy (the orbits of planets in the solar system turn
out to be chaotic, for example) to social systems.
5.2. Chaotic behavior is fundamentally non-probabilistic.
5.3. Chaotic behavior sometimes approximates probabilistic behavior, but
only under special conditions which must be checked for and can never
validly be simply assumed.
6. Empirical research suggests that chaotic behavior can be approximated
probabilistically only in systems which are stable over time.
6.1. Because of the crucial role time plays in the physics of variation,
tests which ignore the temporal order of data are invalid unless the system
is demonstrably stable over time.
6.2. Complex natural systems tend to be stable over time only if subject to
natural homeostatic processes, including an input of energy. (Basic
thermodynamics.)
6.3. Human social systems tend to be stable over time only if they are
managed to this end, including an input of information. (Shannon's
information entropy theory -- see Myron Tribus's work in this area).
6.4. The results of an information-gathering process will tend to be
unreliable unless the process is actively managed for stability.
6.5. The Shewhart control chart tests for temporal stability, and hence
for whether (a) future predictions can be made from past behavior, and (b)
probability theory and related methods are approximately applicable.
7. Empirical research suggests that the results obtained from surveys can
be highly dependent on the method of gathering the information (e.g. people
will give different ages to different interviewers; the distribution of red
beads obtained is affected by the particular paddle used). Just as in
quantum mechanics, the act and method of gathering information affect the
information obtained.
Statistical work is subject to an uncertainty principle. There is no true
value of any quantity.
8. Empirical evidence suggests that p-values predicted by classical
probability theory are invalid.
8.1. Even when a chaotic system is stable and probability theory
approximately applies, it does not apply evenly. Shewhart and Deming found
that stable chaotic systems best resemble the center of probability
distributions, and differ more and more from what probability theory would
predict the further one gets towards the extrema. The closer one is to a
tail, the worse probability models tend to be as predictors of actual
system behavior. Because probability theory is least realistic at the
tails, statistical methods based on the tails of distributions are most
likely to be erroneous. The further out on the tails one goes, the more
erroneous it will be.
8.2. All of classical hypothesis testing is based on the tails of
distributions, precisely the place where probability theory least resembles
reality even in stable systems. P-values predicted by standard hypothesis
testing, therefore, have little basis in reality. In fact, the lower the
p-value, the less confidence one can have in it.
8.3. Even though the p-values obtained by classical methods do not
represent a numerical measurement of the probability of events occurring,
it is nonetheless generally true that when a system involved is stable over
time, outcomes which probability theory predicts to be at the tails of a
distribution will generally occur infrequently. This fact permits us to
construct heuristics that permit us to make qualitative judgements of
likelihood, even though we cannot validly quantify them. If the underlying
system is temporally stable, we can validly say that an event at the
extreme tails is likely to be unusual, even though we cannot validly
quantify exactly how unusual.
8.4. Shewhart's three-standard-deviation rule for setting control chart
limits is an example of a heuristic of this sort. It is a qualitative
judgement of likelihood. It would be an error to ascribe a numerical
probability to it.
9. Evidence suggests that enumerative sampling theory is not a valid basis
for most statistical studies.
9.1. It is rarely possible to obtain a random sample in the real world. We
cannot sample from the future, which may not resemble the past. Most
experiments must be performed on subjects who volunteer, or objects which
are available and at-hand. And any two compilations of a frame, or
repetitions of an experiment, will always be under different conditions,
often sufficiently different to meaningfully affect the outcome. For the
same reasons that there is no true value of a quantity, there is also
no true value of any frame.
9.2. When samples are not randomly obtained, our analysis must consist of
two parts: a statistical analysis, based on the sample, and a
non-statistical expert judgement, which relates the sample to what we
really wish to know about as best we can, always in some degree
qualitatively and heuristically, never exactly.
9.3. The proper statistical analysis of designed experiments on judgement
samples is based on randomization theory, not sampling theory. In
randomization theory, we must have the ability to assign treatments to
subjects in any order we wish (complete control over the order in which
events occur). We must use this control to pick the order randomly. Only
this degree of control permits us to use probability theory. The statistic
we assess is based on the permutation distribution of our randomization,
not on a sampling distribution of any kind. Where we do not have sufficient
control to perform the randomization needed to do a valid designed
experiment, we cannot validly use randomization theory.
9.4. Even when we can use designed experiments, for our statistical
analysis to have any applicability outside the particular subjects and
events studied, we must make an expert judgement that the particular
subjects and events studied are similar to a more general class. This
expert judgment must be based on some kind of qualitative heuristic, not on
any statistical exactitude.
9.5. A control chart serves as a heuristic for permitting extrapolation of
past events to future ones.
9.6. The problem of extrapolation is not limited just to time. Whenever we
wish to extrapolate the results of an experiment to a more general
population, we must base our extrapolation on some kind of qualitative,
heuristic expert judgement which will inevitably be inexact. For example,
pharmaceutical clinical trials are generally performed on the first
volunteers meeting selection criteria who show up in hospitals which have
agreed to participate. Not only is no sample ever randomly obtained from
any frame, the kinds of patients and institutions who volunteer to
participate are known to have demographic characteristics different from
the general patient and institutional population. Thus, the judgement to
extrapolate from the population who participated in a clinical trial to a
more general population involves an element that is necessarily
non-statistical. Such an extrapolation cannot be validly supported by a
statistical argument alone.
10. Scientific validity comes from doing the best we can in the face of
uncertainty, and being as honest and explicit as possible about what we
have done and can do, and what we have not done and cannot. It is always
better, from a scientific point of view, to admit ignorance when we in fact
do not know, than to cling to a pretended certainty that cannot be validly
sustained.
11. Descartes taught that what can be described exactly is necessarily more
real than what can be described only vaguely. We live, imprisoned, in an
intellectual structure built on the prejudices of this foundational
falsehood. Modernity teaches the opposite. We can describe our illusions
with easy precision. Reality, however, has turned out to be a far more
difficult quarry. The more we really know about something, the more
mysterious it seems to become.
This page was created by Jim Clauson on 04DEC97, and last updated 11JAN98.
Contents, images, and structure Copyrighted by the Deming Electronic Network, 1995-98 (unless otherwise noted). All rights reserved.