• whether the population from which the sample is to be drawn has been properly constructed or scoped to address a question raised by the legal theory;
• whether the salient features are common to members of the population; and
• whether the sampling unit (i.e., the individual sampled observation) is defined in a way that the sample embodies the important features of the larger population.
Obtaining a valid sample begins by creating a testable hypothesis that is based on some aspect of the legal case. There should be a direct connection from a case issue to the testable hypothesis and then to the target population.
Next, there should be an assessment regarding the extent to which the relevant features are common to members of the target population.
It may be the case that there really are no relevant features—no “glue” as the Wal-Mart Court called it—that are common to members of the population. In Wal-Mart, the putative class alleged that the company had discriminated against female employees with regard to promotions and pay. The Court concluded that it was impossible to draw a valid sample because the reasons for promotions seemed as varied as the individuals who received or were denied them. Hence, there was no glue to bind together a class.
Typically, there is commonality among observations at some high enough level of classification. However, this level may be too far removed from the relevant hypothesis to be useful. This point was an issue in Tyson Foods. In that case, Tyson Foods was sued for not paying certain of its workers (the class) overtime related to the time required to put on protective equipment at the beginning of their shifts (the donning time) or to remove the gear at the end of their shifts (the doffing time). Because actual times had not been recorded, an expert study was offered that provided average donning and doffing times for different processing departments based on observations of 57 employees. The debate centered on whether the variability in times and variety of gear available to the employees precluded the identification of common features. The Court concluded that the population of employees had in common the fact that they worked at the same plant and performed the same type of work. The dissent conceded that those common features existed but maintained that the particular features were not relevant to the hypothesis itself, which had to do with donning and doffing times.
From the Target Population to the Sample
Sample validity begins by rooting the testable hypothesis in the theory of the case. Issues of validity continue as we move from the determination of the target population to the creation of the sample itself.
As noted in the third bullet point above, the sampled unit should embody the characteristics of the larger population. This can be compromised for a number of reasons. Two of them are
• differences between the target population and the actual sampled population; and
• failure of the sampling process to capture all of the relevant characteristics.
With regard to the first of these two points, the target population is the proposed ideal population. The sample is actually drawn from something called the sampling frame or sample population. Mismatches between the ideal and actual are bound to occur, and they contribute to inaccuracies. The severity depends on the nature of the mismatch. For example, the target population in a hypothetical survey may be all adults in the United States, but the sampled population may be a list of residential landline telephone numbers. In such a case, there is an obvious potential for the sampled unit not to embody the characteristics of the larger target population because many households no longer have (or answer) landlines. Those who retain their landlines may be different in an important and systematic way from those who do not.
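The landline example can be illustrated with a small simulation. The sketch below is hypothetical throughout: the population size, the share of adults with landlines, and the assumed age difference between the two groups are invented for illustration and are not taken from any survey. It compares a sample drawn from the landline frame with a sample drawn from the full target population.

```python
import random
import statistics

def simulate_frame_mismatch(seed=0, pop_size=100_000, landline_share=0.3,
                            sample_size=1_000):
    """Compare a sample from a landline-only frame with a sample from the
    full target population. All parameter values are hypothetical."""
    rng = random.Random(seed)
    population = []
    for _ in range(pop_size):
        has_landline = rng.random() < landline_share
        # Assumption: landline holders differ systematically (here, older).
        age = rng.gauss(60, 10) if has_landline else rng.gauss(40, 10)
        population.append((has_landline, age))

    target = [age for _, age in population]                  # ideal population
    frame = [age for landline, age in population if landline]  # sampling frame

    return {
        "target_mean": statistics.mean(target),
        "frame_sample_mean": statistics.mean(rng.sample(frame, sample_size)),
        "target_sample_mean": statistics.mean(rng.sample(target, sample_size)),
    }

result = simulate_frame_mismatch()
for name, value in result.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions the frame-based sample lands far from the target population's mean no matter how many observations are drawn, because the error comes from the frame rather than from the sample size.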
An inaccurate sample can also occur even if the sampled population accurately represents the target population but, for some reason related to the sampling process itself, the sample contains only part of the population’s pattern. Absent other information, one cannot make valid inferences about the population related to the portion of the pattern that is missing. Increasing the sample size, but still failing to capture the entirety of the pattern, does not help with making correct inferences about the missing portion or the totality of the pattern itself. Inaccuracies caused by incomplete patterns in the sample are apt to arise if a characteristic that is important to the legal case has a low prevalence in the general population. Stratification is sometimes used to produce a valid sample from such patterns while maintaining economy of sample size.
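A short simulation can make the low-prevalence problem concrete. The sketch below assumes a hypothetical population of 10,000 units in which 1 percent carry a rare characteristic with much larger values; every number is invented for illustration. It first estimates how often a simple random sample of 50 contains no rare units at all, and then shows a basic stratified estimate that samples each stratum separately and weights each stratum mean by its population share.

```python
import random
import statistics

def srs_miss_rate(pop_size=10_000, rare_count=100, n=50, trials=2_000, seed=1):
    """Share of simple random samples of size n that contain no rare units.
    Units are labeled 0..pop_size-1; labels below rare_count are rare."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        sample = rng.sample(range(pop_size), n)
        if all(unit >= rare_count for unit in sample):
            misses += 1
    return misses / trials

def stratified_mean(strata, n_per_stratum, seed=2):
    """Estimate the population mean by sampling within each stratum and
    weighting each stratum's sample mean by its population share."""
    rng = random.Random(seed)
    total = sum(len(values) for values in strata.values())
    estimate = 0.0
    for name, values in strata.items():
        sample = rng.sample(values, n_per_stratum[name])
        estimate += (len(values) / total) * statistics.mean(sample)
    return estimate

# Hypothetical population: 9,900 common units and 100 rare, high-value units.
rng = random.Random(0)
strata = {
    "common": [rng.gauss(100, 10) for _ in range(9_900)],
    "rare": [rng.gauss(1_000, 100) for _ in range(100)],
}
true_mean = statistics.mean(strata["common"] + strata["rare"])
estimate = stratified_mean(strata, {"common": 40, "rare": 10})
miss_rate = srs_miss_rate()

print(f"share of simple random samples with no rare units: {miss_rate:.2f}")
print(f"true mean: {true_mean:.1f}, stratified estimate (n=50): {estimate:.1f}")
```

With these made-up numbers, a majority of simple random samples of 50 miss the rare group entirely, while the stratified design guarantees that the rare group is represented at the same total sample size.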
Statistical Precision
Having determined a valid target population and sampled population, we can turn to the issue of precision. Statistical precision is stated as a margin of error and its related confidence level. The term “margin of error” itself may have a familiar ring to it insofar as it is footnoted in many political polls, for example.
Precision refers to how tightly clustered we would expect to see the sample estimates in repeated experiments. Precision is independent of whether the population is valid. One can observe tightly clustered estimates of an inaccurate (i.e., invalid) number. While validity is associated with scoping the population and creating the sampling process, statistical precision is a feature of the sample itself. A sample size equal to the entire population will be perfectly precise, but it need not be accurate.
The margin of error is typically expressed as a plus-or-minus percentage of the sample’s average value (e.g., +/− 10 percent). The confidence level (e.g., 95 percent) reflects the proportion of times in which the true but unknown parameter value would be expected to fall within the margin of error of the sample estimate if somehow one were to repeat the experiment many times.
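The "repeat the experiment many times" interpretation can be checked by simulation. The sketch below assumes a normally distributed population with a mean of 18 and a standard deviation of 4 (illustrative values only), draws many samples of 57 observations, and counts how often the sample mean plus or minus its margin of error contains the true mean. It uses the normal multiplier 1.96 for 95 percent confidence, which is an approximation.

```python
import math
import random
import statistics

def coverage_rate(true_mean=18.0, true_sd=4.0, n=57, z=1.96,
                  trials=2_000, seed=3):
    """Fraction of repeated samples whose interval (mean +/- margin)
    contains the true mean. All inputs are illustrative assumptions."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        mean = statistics.mean(sample)
        # Margin of error in the same units as the data (minutes here).
        margin = z * statistics.stdev(sample) / math.sqrt(n)
        if mean - margin <= true_mean <= mean + margin:
            hits += 1
    return hits / trials

rate = coverage_rate()
print(f"intervals containing the true mean: {rate:.1%}")
```

The printed share should sit close to the nominal 95 percent, which is what the confidence level is meant to convey.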
In Tyson Foods, the expert reported average donning and doffing times of 18 minutes for two of the three departments and 21 minutes for the third department. Intuitively, we would have more confidence in the results of the study if the donning and doffing times for most of the observations were very close to 18 or 21 minutes. Substantial variation in donning and doffing times around the averages may indicate that a deeper investigation (and possibly modifications to the sampling methodology) was needed to ensure that the resulting sample or samples actually represented the true patterns of the population.
I will make some assumptions to keep the discussion going. Suppose that individuals donned and doffed their protective gear with an average of 18 minutes and a standard deviation of 4.0 minutes. The dissent in Tyson Foods noted that in some departments the times for donning equipment ranged from 0.583 to over 10 minutes, so the actual standard deviation may have been larger than the 4.0 assumed here, but the assumption will suffice for discussion.
Next, I specify the precision that we want the sample itself to exhibit. This is done by selecting a margin of error of the sample and a confidence level related to this margin of error.
There is no rule of thumb for selecting margin of error and confidence level. From a purely economic perspective, the user would want a sample that strikes a balance between the incremental cost of an additional observation and the incremental benefits of more precise results. Such a balance will depend in part on how sensitive the downstream analyses (e.g., damages computations or potential fines) are to the sampled results. This is case-specific. It involves some prior assessment of the effects that the anticipated sample results will have on the downstream analyses and inferences. If the results of the downstream analyses that use the sample are not particularly sensitive to the sample values, then a smaller and less costly sample size (and a less precise estimate) may be sufficient.
Such a purely economic cost-benefit rule may also need to be adjusted to account for other costs or factors, such as the legal requirement of due process. Striking a balance in using fewer observations than the totality of the population, if such a balance exists, is the issue that courts must address when evaluating the adequacy of sample size.
To continue with an example, in MBIA Insurance Corp. v. Countrywide Home Loans, Inc., 958 N.Y.S.2d 647, 2010 WL 5186702 (N.Y. Sup. Ct. Dec. 22, 2010), the judge approved MBIA’s request to use statistical sampling to present evidence regarding alleged breaches of representations and warranties of certain loans. The proposed sample was to have a 5 percent margin of error at 95 percent confidence.
Had the precision metrics for MBIA been applied to Tyson Foods, the required sample size would have been 74 (I will not explain the formula used for the computation), assuming that the standard deviation of the sample was 4.0 as I discussed in the setup of the hypothetical. If the assumed standard deviation were higher than 4.0, so too would be the required sample size. In either instance, the observed sample size of 57 would have been insufficient to provide the desired level of precision.
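For readers who want a sense of the arithmetic, the sketch below uses a common textbook approximation for the sample size needed to estimate a mean: n = (z × SD / (margin × mean))², with z taken from the normal distribution. This is an assumption on my part and is not necessarily the computation behind the figure of 74 above. With the hypothetical inputs (mean 18, standard deviation 4.0, 5 percent margin, 95 percent confidence) it gives 76, close to but not identical to 74; the difference may stem from inputs or conventions not stated here. The function names are illustrative.

```python
import math
from statistics import NormalDist

def z_value(confidence):
    """Two-sided normal multiplier, e.g. about 1.96 for 0.95."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

def required_sample_size(mean, sd, rel_margin, confidence):
    """Textbook approximation; ignores any finite-population adjustment."""
    z = z_value(confidence)
    return math.ceil((z * sd / (rel_margin * mean)) ** 2)

def achieved_rel_margin(mean, sd, n, confidence):
    """Margin of error, as a share of the mean, that a sample of n delivers."""
    return z_value(confidence) * sd / math.sqrt(n) / mean

n_needed = required_sample_size(mean=18.0, sd=4.0, rel_margin=0.05,
                                confidence=0.95)
margin_at_57 = achieved_rel_margin(mean=18.0, sd=4.0, n=57, confidence=0.95)

print(f"approximate required sample size: {n_needed}")
print(f"approximate margin of error with 57 observations: {margin_at_57:.1%}")
```

Under the same assumptions, 57 observations deliver a margin of error of roughly 5.8 percent rather than 5 percent, which is consistent with the conclusion that 57 falls short of the MBIA standard.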
On the other hand, in cases involving federal health services overpayments, the Office of Inspector General (OIG) at the U.S. Department of Health and Human Services has required precision of 20 percent at a 90 percent confidence level. Had these precision metrics been applied mechanistically to Tyson Foods, the required sample size would have been 30. Actually, the sample size produced by the OIG’s parameter values is far less than 30, but the OIG establishes a floor at 30 for technical reasons beyond the scope of this discussion.
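The effect of the floor can be sketched with the same textbook approximation, n = (z × SD / (margin × mean))², which is an assumption here and not necessarily the OIG's own computation. The function name and the `floor` parameter are illustrative; the inputs are the hypothetical mean of 18 and standard deviation of 4.0.

```python
import math
from statistics import NormalDist

def sample_size_with_floor(mean, sd, rel_margin, confidence, floor=30):
    """Return (size from the approximation, size after applying the floor)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    unfloored = math.ceil((z * sd / (rel_margin * mean)) ** 2)
    return unfloored, max(unfloored, floor)

unfloored, required = sample_size_with_floor(mean=18.0, sd=4.0,
                                             rel_margin=0.20, confidence=0.90)
print(f"approximation alone: {unfloored}; with a floor of 30: {required}")
```

Under these assumptions the approximation alone calls for only a handful of observations, so the floor of 30 is what determines the required sample size.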
These two examples illustrate that the required sample size is sensitive to the assumptions made about the margin of error and level of confidence. In general, holding other factors constant, the following can be said:
• The lower the margin of error (e.g., 1 percent margin of error versus 5 percent margin of error), the greater the sample size needs to be (all else the same).
• The higher the required confidence level (e.g., 90 percent, 95 percent, or 99 percent), the greater the sample size needs to be (all else the same).
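Both relationships can be seen in a small table. The sketch below again relies on the textbook approximation n = (z × SD / (margin × mean))² and the hypothetical mean of 18 and standard deviation of 4.0; the specific margins and confidence levels in the grid are chosen only for illustration.

```python
import math
from statistics import NormalDist

def required_sample_size(mean, sd, rel_margin, confidence):
    """Textbook approximation; ignores any finite-population adjustment."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sd / (rel_margin * mean)) ** 2)

margins = [0.01, 0.05, 0.10, 0.20]
confidences = [0.90, 0.95, 0.99]

# Required sample size for every combination of margin and confidence level.
table = {
    (m, c): required_sample_size(18.0, 4.0, m, c)
    for m in margins
    for c in confidences
}

print("margin  " + "  ".join(f"{c:.0%}".rjust(6) for c in confidences))
for m in margins:
    row = "  ".join(str(table[(m, c)]).rjust(6) for c in confidences)
    print(f"{m:>6.0%}  {row}")
```

Reading down a column shows the sample size falling as the margin of error widens; reading across a row shows it rising with the confidence level.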
In Tyson Foods, both Chief Justice Roberts in his concurring opinion and the dissent expressed concerns about the effects that different average donning and doffing times could have had on the damages estimate. Addressing such concerns would involve revisiting the precision of the sample estimates.
Summary and Additional Thoughts
Using a sample can provide useful and reliable information at a lower cost than using the entire population. However, care must be taken to ensure that the sample can reliably and accurately answer the question that it is being asked to address. This means that there should be a direct logical connection between the hypothesis that is being tested and the population whose characteristics are being sampled. The sample must measure what it is supposed to measure. A very large sample from an invalidly specified population does not produce accurate results. The degree of precision exhibited by the sample is determined by the needs of the downstream analysis. If the results in the subsequent analysis are not sensitive to variations in sample results, one can typically use a smaller (and therefore less costly but also less precise) sample size.
To make the computations used in the hypotheticals, I used a program called RAT-STATS. This freeware program was created and is maintained by the OIG in the Department of Health and Human Services and is used by practitioners to develop samples to evaluate federal health services payment issues.
Keywords: litigation, expert witnesses, statistics, statistical sampling, class certification, Tyson Foods, Wal-Mart
Francis (Frank) X. Pampush, PhD, CFA, is a director and principal at Navigant Economics in Chicago, Illinois.
Navigant Consulting is the Litigation Advisory Services Sponsor of the ABA Section of Litigation. This article should not be construed as an endorsement by the ABA or ABA Entities.