Deming Electronic Network
WEB SITE

Shewhart-Deming Critique of Classical Statistics

Jonathan Siegel
Revised December 26, 1997


I have provided an explanatory commentary, followed by an outline I have been developing. Please send comments to jmsiegel.info.research@worldnet.att.net

Deming's view of experimental design is based on a perspective going back at least to R.A. Fisher. (Edgington claimed that some of the ideas were actually invented by E.J.G. Pitman and adopted by Fisher later.) Unlike the underlying theory of variation, key elements of the Fisher approach to experimental design have been accepted by many prominent statisticians, as I shall show. But for the reasons I will explain, Fisher's approach is nonetheless very different from the one most statisticians actually use in practice.

Most people want to use experiments for the purpose of making some sort of generalization. We'd like to select a sample from some population, do an experiment on it, and draw an inference from the sample to the whole. We want to use statistical theory to shore up the reliability of that inference. The traditional approach uses sampling theory and draws inferences based on that theory.

But Fisher noticed that most of the time we can't really do this. Not only is it rare for experimental subjects to be drawn randomly from the frame of real interest, it is rare for them to be drawn randomly from any frame at all. Even in agriculture, we don't randomly sample the soil types or climate conditions we would like to try our seeds in. In industry, prototypes are often made under conditions quite different from production versions. And when we work with people, we can't choose our subjects -- we have to work with volunteers. The kind of people who volunteer for experiments tend to be different from the kind who don't. Even when we survey we have to deal with nonresponse bias. Often our experiment is done on whoever happens to show up. Of course we can't sample from the future. But the problem is actually broader even than that. Often, we can't even sample from the present.

In The Design of Experiments (1935), Fisher described a technique to partially get around this problem, which was later to be called a randomization test. His example used an experiment on a "lady," a staff member at the Rothamsted Experimental Station in England who claimed to be able to discern whether her cup of tea had been prepared by pouring the tea first and then the milk into the tea, or by pouring the milk first and then the tea into the milk. Fisher gave her eight cups of tea, in random order, and told her that four were tea-in-milk and four milk-in-tea. If she had no discerning ability at all, the chance of getting them all correct would be 1 in 70 (recall that 8C4 is 8!/(4!4!) = 70). Thus, the experiment tested her individual ability to distinguish the two kinds of tea preparation during the course of the experiment, and provided an exact probability with respect to this. But the experiment did not test anyone else's ability in this regard. Any inference that another "lady" would have a similar ability, or even that the same lady would get comparable results with a different kind of tea or milk, would have to be nonstatistical in nature. The experiment produced a probability result valid for the specific conditions at hand, at the cost of losing any claim that an inference to anything more general has a statistically-justified basis.
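The 1-in-70 figure can be checked by brute force. The following is only a minimal sketch: the cup labels and the choice of which four cups count as milk-first are arbitrary assumptions made for illustration, not details from Fisher's account.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

CUPS = range(8)                     # eight cups, labelled 0..7 (arbitrary labels)
truth = frozenset({0, 1, 2, 3})     # assumed: these four were prepared milk-first

# Every way the taster could name four of the eight cups as milk-first.
guesses = [frozenset(g) for g in combinations(CUPS, 4)]

# Count how many guesses identify exactly k of the four milk-first cups.
tally = {k: 0 for k in range(5)}
for guess in guesses:
    tally[len(guess & truth)] += 1

total = len(guesses)
p_all_correct = Fraction(tally[4], total)

print(total)          # 70, the same as comb(8, 4)
print(tally)          # {0: 1, 1: 16, 2: 36, 3: 16, 4: 1}
print(p_all_correct)  # 1/70
```

Note that the probability comes entirely from the random ordering of the cups. Nothing in the calculation refers to a population of tasters or of cups of tea.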

From this origin came what statisticians would call the randomization test (or permutation test). We can take the particular subjects in front of us and, if we can randomize them with complete freedom, we can use probability theory to draw a valid inference about the ability of the particular group experimented on to distinguish the factors being examined under the specific conditions of the experiment. The use of probability theory is in this instance perfectly legitimate. But when we do this we can no longer claim that statistical theory provides a basis for drawing inferences from the particular experimental subjects and conditions to anything else. Statistical theory provides no valid basis for doing so. In order to draw such an inference, we must use nonstatistical expert judgement.
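A two-group version of the same idea can be sketched in a few lines. This is an illustration under stated assumptions, not a reproduction of any particular author's procedure: the function name `randomization_test`, the choice of a one-sided test on the treated-group total, and the data are all hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def randomization_test(treated, control):
    """Exact one-sided randomization test for a two-group experiment.

    Returns the fraction of all possible treatment assignments whose
    treated-group total is at least as large as the one observed.
    With fixed group sizes, ranking assignments by the treated total is
    equivalent to ranking them by the difference in group means.
    """
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    observed = sum(treated)

    at_least_as_large = 0
    assignments = 0
    # Enumerate every way the treatment labels could have been assigned.
    for idx in combinations(range(len(pooled)), n_treated):
        assignments += 1
        if sum(pooled[i] for i in idx) >= observed:
            at_least_as_large += 1
    return Fraction(at_least_as_large, assignments)


# Hypothetical measurements, invented purely for illustration.
treated = [12, 14, 15, 17]
control = [8, 9, 10, 11]
p_value = randomization_test(treated, control)
print(p_value)  # 1/70
```

The result is a statement about these eight units under the random assignment actually performed. Whether other units would respond similarly is a question the calculation does not address.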

Unlike many other Deming ideas, this idea has been fully accepted -- with its full implications -- by leading mainstream statisticians for many years. Kempthorne wrote in 1955 that "Tests of significance in the randomized experiment have frequently been presented by way of normal law theory, whereas their validity stems from randomization theory" (Kempthorne, "The Randomization Theory of Statistical Inference", 1955). Box, Hunter and Hunter wrote in 1978, "It is curious that the hypothesis of random sampling is treated in much statistical writing as if it were a law of nature. In fact, for real data it is a property that can never be relied on, although special precautions in the design of an experiment can make the assumption relevant." (p. 31). (Box, Hunter, and Hunter, Statistics For Experimenters: An Introduction to Design, Data Analysis, and Model Building, 1978).

This kind of approach basically remained a parlor-conversation piece for many years. The reason was simple: the computations needed for anything other than a simple demonstration case were astronomical. Fisher's example served something like the same purpose as Einstein's thought-experiments in critiquing quantum mechanics. They raised crucial points to think about, but one couldn't actually do them. Until the advent of computers, statisticians had no choice but to focus on what they could actually do and hope for the best.
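To give a rough sense of scale, consider the number of distinct ways to split 2n units into two equal groups of n, which is the number of assignments an exact two-group randomization test would have to examine. The group sizes below are arbitrary choices for illustration, and `assignments` is a hypothetical helper name.

```python
from math import comb


def assignments(n):
    """Number of ways to assign 2n units to two groups of n each."""
    return comb(2 * n, n)


for n in (4, 10, 20):
    print(n, assignments(n))
# 4 70
# 10 184756
# 20 137846528820
```

Even with only twenty units per group there are more than a hundred billion assignments to evaluate, which is hopeless by hand.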

It turns out that the complexities of randomization theory calculations could sometimes be avoided by using an approximation. One was found. Linear-models theory statistics are a reasonably good approximation to the behavior of randomization-theory statistics (except at the extreme tails), even for fairly small samples, as long as standard designs are used and the standard linear-theory conditions (homoskedasticity and the like) hold. Both Kempthorne and Box, Hunter, and Hunter discuss this. Under the assumed conditions, one can legitimately use the machinery of linear models statistics, as long as one understands that the results need to be interpreted in a completely different way. Thus the situation was something like a control chart: One performs calculations based on a calculation mechanism usually associated with one theory, while interpreting the results in a manner derived from another. One has to keep in the back of one's head that the linear-models stuff is just an approximation, a calculational convenience.

But this theoretical sleight-of-hand has been hard for statisticians to grasp. Performing calculations based on one theory while interpreting based on another is a bit like trying to rub your stomach and pat your head at the same time. It creates cognitive dissonance. Over the years, students of statistics have tended to remember the theory that explains how to do the calculations, and forget the theory that explains how to interpret the results. We naturally gravitate to the theory that seems to apply to what we actually do, and school has traditionally emphasized calculation. In addition, linear-models theory supports a nice ability to draw inferences, while randomization theory suggests otherwise. Linear-models theory was calculable. It gave results people wanted. It was better in absolutely every respect except that it happens not to be true. It was what most statisticians ended up using.

The advent of computers has given us the ability to actually calculate exact randomization-theory results in ways impossible before. This recent ability has resulted in renewed interest in randomization-test theory by mainstream mathematical statisticians, although often in ways quite alien to Deming's overall approach. The majority of this type of work is perhaps exemplified by Bradley Efron of Stanford University and colleagues, who have been attracted to the complex, computer-intensive calculations involved in actually doing these tests exactly. (See also e.g. J.S. Urban Hjorth, "Computer Intensive Statistical Methods"; Eric W. Noreen, "Computer Intensive Methods for Testing Hypotheses"; J.S. Maritz, "Distribution-Free Statistical Methods".) One weakness of this line of thinking is that it has tended to emphasize calculation problems to the point of not laying much stress on developing an understanding of what the calculations really mean. Since drawing any useful inference necessarily involves a judgement call under the theory, an applied researcher would find precise calculations a waste of time wherever an approximation will do. But I think this calculation-oriented work will ultimately prove useful, for two reasons. The main one is psychological: history suggests that statisticians tend to think about only the theories they actually use in making calculations. If we could get people to do calculations based on the real theory, perhaps they would be more willing to think about it. The second issue is that linear-models statistics are a good approximation to randomization statistics only under certain distributional assumptions, such as symmetry and homoskedasticity. These assumptions are often not true. Exact results will work even under quirky conditions, where approximations won't.

But there is another line of work that may be of more immediate interest. The theory has also been developed by researchers who have thought reflectively about the problems of doing applied research, and adapted the randomization approach as a result of this thinking, independently of Deming. The best example I have found is the first full book on the subject, "Randomization Tests" by Eugene S. Edgington (1st ed., 1980. 3rd ed., 1995). Edgington, a psychologist, gives careful attention to the validity problems of standard statistics in doing real-world experiments, explains how the approach deals with some of these problems, and gives a very frank discussion of the unavoidable need to use nonstatistical expertise and judgement under these conditions. The following quote (emphasis in original) should suffice as an example:

"Statistical inferences cannot be made without random samples from those populations, and random sampling from the populations of interest to the experimenter is likely to be impossible. Thus for practically all experiments valid statistical inferences about a POPULATION of interest cannot be drawn. In the absence of random sampling, STATISTICAL inferences about treatment effects must be restricted to the subjects (or other experimental units) used in an experiment. Inferences about treatment effects for other subjects must be NONSTATISTICAL INFERENCE -- inferences without a basis in probability. We generalize from our experimental subjects to individuals who are quite similar in those characteristics that we consider relevant. ... the main burden of generalizing from experiments always has been, and continues to be, carried by nonstatistical rather than statistical logic."

Deming regularly sought out people to learn from who were working on similar problems. Alfie Kohn is probably the most notable recent example. I think we should do the same. In the area of experimental design, as I have tried to show, a number of other folks have been trying to grapple with some of the same issues and problems as Deming did. Edgington, for example, used a somewhat different vocabulary from Deming, but I find the fundamental issues involved very consistent, and I think these folks have developed some useful ideas, ideas it would be worth our while to learn something from.

Deming's critical contribution, as I see it, was to associate the underlying theory of variation with the theory of experimental design. If we can get our processes under statistical control, we can perform designed experiments on the subjects before us and then legitimately draw inferences on into the future. The combination of the two provides a complete way to get around the problem of inference in the real world, at least where control exists or can be instituted. Without understanding both concepts and how they work together, all we have is a critique, not a solution. It is this combination that lets us go forward. With just control charts, we can control, but not easily improve, even if we have the correct theory. With just experiments, the correct theory of inference gives us no basis for taking action on the results, since without more we cannot legitimately draw inferences to anything else. The two have to be taken together to be fully effective.


Outline:

1. Entertain for a moment the concept of statistics as an empirical science of variation rather than a branch of mathematics. As an empirical science, statistical ideas would be merely theories subject to revision based on empirical research into natural variation. The laws of probability would merely be our best current idea about how variation works, subject to revision based on evidence.

2. As in other sciences, if evidence demonstrated that our theory didn't explain real-world behavior, we would be obligated to revise our theory to account for the difference.

3. Areas of study thought to be branches of mathematics have turned out to be empirical sciences before. Examples of cases where ideas thought to be a priori turned out to be contradicted by evidence include geometry (the General Theory of Relativity shows the universe is non-Euclidean) and even logic (quantum phenomena do not obey classical axioms).

4. Empirical evidence suggests that key assumptions of classical statistics are invalid.

5. Empirical evidence suggests that independent, identically distributed phenomena do not exist. Research takes several forms.

5.1. Most complex real-world phenomena behave chaotically rather than either deterministically or probabilistically. There is a lot of evidence for this, from Shewhart's early work through current research in everything from physics and astronomy (the orbits of planets in the solar system turn out to be chaotic, for example) to social systems.

5.2. Chaotic behavior is fundamentally non-probabilistic.

5.3. Chaotic behavior sometimes approximates probabilistic behavior, but only under special conditions which must be checked for and can never validly be simply assumed.

6. Empirical research suggests that chaotic behavior can be approximated probabilistically only in systems which are stable over time.

6.1. Because of the crucial role time plays in the physics of variation, tests which ignore the temporal order of data are invalid unless the system is demonstrably stable over time.

6.2. Complex natural systems tend to be stable over time only if subject to natural homeostatic processes, including an input of energy. (basic thermodynamics)

6.3. Human social systems tend to be stable over time only if they are managed to this end, including an input of information. (Shannon's information entropy theory -- see Myron Tribus's work in this area).

6.4. The results of an information-gathering process will tend to be unreliable unless the process is actively managed for stability.

6.5. The Shewhart control chart tests for temporal stability, and hence for whether (a) future predictions can be made from past behavior, and (b) probability theory and related methods are approximately applicable.

7. Empirical research suggests that the results obtained from surveys can be highly dependent on the method of gathering the information (e.g. people will give different ages to different interviewers; the distribution of red beads obtained is affected by the particular paddle used). Just as in quantum mechanics, the act and method of gathering information affect the information obtained. Statistical work is subject to an uncertainty principle. There is no true value of any quantity.

8. Empirical evidence suggests that p-values predicted by classical probability theory are invalid.

8.1. Even when a chaotic system is stable and probability theory approximately applies, it does not apply evenly. Shewhart and Deming found that stable chaotic systems best resemble the center of probability distributions, and differ more and more from what probability theory would predict the further one gets towards the extrema. The closer one is to a tail, the worse probability models tend to be as predictors of actual system behavior. Because probability theory is least realistic at the tails, statistical methods based on the tails of distributions are most likely to be erroneous. The further out on the tails one goes, the more erroneous they will be.

8.2. All of classical hypothesis testing is based on the tails of distributions, precisely the place where probability theory least resembles reality even in stable systems. P-values predicted by standard hypothesis testing, therefore, have little basis in reality. In fact, the lower the p-value, the less confidence one can have in it.

8.3. Even though the p-values obtained by classical methods do not represent a numerical measurement of the probability of events occurring, it is nonetheless generally true that when a system involved is stable over time, outcomes which probability theory predicts to be at the tails of a distribution will generally occur infrequently. This fact permits us to construct heuristics that permit us to make qualitative judgements of likelihood, even though we cannot validly quantify them. If the underlying system is temporally stable, we can validly say that an event at the extreme tails is likely to be unusual, even though we cannot validly quantify exactly how unusual.

8.4. Shewhart's three-standard-deviation rule for setting control chart limits is an example of a heuristic of this sort. It is a qualitative judgement of likelihood. It would be an error to ascribe a numerical probability to it.

9. Evidence suggests that enumerative sampling theory is not a valid basis for most statistical studies.

9.1. It is rarely possible to obtain a random sample in the real world. We cannot sample from the future, which may not resemble the past. Most experiments must be performed on subjects who volunteer, or objects which are available and at-hand. And any two compilations of a frame, or repetitions of an experiment, will always be under different conditions, often sufficiently different to meaningfully affect the outcome. For the same reasons that there is no true value of a quantity, there is also no true value of any frame.

9.2. When samples are not randomly obtained, our analysis must consist of two parts: a statistical analysis, based on the sample, and a non-statistical expert judgement, which relates the sample to what we really wish to know about as best we can, always in some degree qualitatively and heuristically, never exactly.

9.3. The proper statistical analysis of designed experiments on judgement samples is based on randomization theory, not sampling theory. In randomization theory, we must have the ability to assign treatments to subjects in any order we wish (complete control over the order in which events occur). We must use this control to pick the order randomly. Only this degree of control permits us to use probability theory. The statistic we assess is based on the permutation distribution of our randomization, not on a sampling distribution of any kind. Where we do not have sufficient control to perform the randomization needed to do a valid designed experiment, we cannot validly use randomization theory.

9.4. Even when we can use designed experiments, for our statistical analysis to have any applicability outside the particular subjects and events studied, we must make an expert judgement that the particular subjects and events studied are similar to a more general class. This expert judgement must be based on some kind of qualitative heuristic, not on any statistical exactitude.

9.5. A control chart serves as a heuristic for permitting extrapolation of past events to future ones.

9.6. The problem of extrapolation is not limited just to time. Whenever we wish to extrapolate the results of an experiment to a more general population, we must base our extrapolation on some kind of qualitative, heuristic expert judgement which will inevitably be inexact. For example, pharmaceutical clinical trials are generally performed on the first volunteers meeting selection criteria who show up in hospitals which have agreed to participate. Not only is no sample ever randomly obtained from any frame, the kinds of patients and institutions who volunteer to participate are known to have demographic characteristics different from the general patient and institutional population. Thus, the judgement to extrapolate from the population who participated in a clinical trial to a more general population involves an element that is necessarily non-statistical. Such an extrapolation cannot be validly supported by a statistical argument alone.

10. Scientific validity comes from doing the best we can in the face of uncertainty, and being as honest and explicit as possible about what we have done and can do, and what we have not done and cannot. It is always better, from a scientific point of view, to admit ignorance when we in fact do not know, than to cling to a pretended certainty that cannot be validly sustained.

11. Descartes taught that what can be described exactly is necessarily more real than what can be described only vaguely. We live, imprisoned, in an intellectual structure built on the prejudices of this foundational falsehood. Modernity teaches the opposite. We can describe our illusions with easy precision. Reality, however, has turned out to be a far more difficult quarry. The more we really know about something, the more mysterious it seems to become.
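As a concrete illustration of items 6.5 and 8.4 of the outline, here is a minimal sketch of three-sigma limits for a chart of individual values. Several things here are assumptions on my part and not drawn from the outline: the individuals-and-moving-range form of the chart, the conventional constant 1.128 used to turn the average moving range into a sigma estimate, the function names, and the data, which are invented.

```python
from statistics import mean


def xmr_limits(values):
    """Return (lower, center, upper) three-sigma limits for individual values.

    The moving range is taken between successive points, so the result
    depends on the temporal order of the data.
    """
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    center = mean(values)
    # 1.128 is the conventional bias-correction constant for ranges of two points.
    sigma_hat = mean(moving_ranges) / 1.128
    return center - 3 * sigma_hat, center, center + 3 * sigma_hat


def signals(values):
    """Indices of points falling outside the three-sigma limits."""
    lower, _, upper = xmr_limits(values)
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical measurements in time order, invented for illustration.
stable = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10]
disturbed = [10, 11, 9, 10, 12, 10, 9, 11, 10, 25]

print(signals(stable))     # []
print(signals(disturbed))  # [9]
```

In keeping with item 8.4, the sketch reports only which points fall outside the limits. It deliberately attaches no probability to a signal.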



The URL for this page is http://deming.ces.clemson.edu/pub/den/deming_seigel1.htm

 This page was created by Jim Clauson on 04DEC97, and last updated 11JAN98.

Contents, images, and structure Copyrighted by the Deming Electronic Network, 1995-98 (unless otherwise noted). All rights reserved.