Of course, there is more to science and to legal investigations than significance testing, and this gives rise to some confusion that we will turn to in a moment. However, the evaluation of the statistical reliability of empirical results (and especially projections) is, if anything, still underemployed in many financial accounting and valuation disputes. When an accountant or appraiser divides one number by another to form a ratio, or "averages" several disparate value estimates, the statistical properties of the resulting estimate are rarely inspected with care. For example, if the inherent mathematical, error-propagation properties of discounted cash flow analysis were better understood, the evaluation of such valuations might be considerably improved.
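To make the error-propagation point concrete, consider the following sketch. It is purely illustrative; the cash flow, growth rate, and discount rate figures are hypothetical and are not drawn from any actual engagement. It simply shows how modest, assumed uncertainty in two discounted cash flow inputs spreads into a wide range of value estimates.

```python
# Purely illustrative sketch: how assumed uncertainty in two DCF inputs
# propagates into the value estimate. All figures are hypothetical.
import numpy as np

rng = np.random.default_rng(seed=1)
n = 50_000

base_cash_flow = 10.0                      # hypothetical year-1 cash flow
growth = rng.normal(0.03, 0.01, n)         # assumed uncertain growth rate
discount = rng.normal(0.10, 0.015, n)      # assumed uncertain discount rate
years = np.arange(1, 11)                   # ten-year horizon

# Present value of each simulated ten-year cash-flow path
pv = np.sum(
    base_cash_flow * (1.0 + growth[:, None]) ** (years - 1)
    / (1.0 + discount[:, None]) ** years,
    axis=1,
)

print(f"mean value estimate: {pv.mean():.1f}")
print(f"standard deviation:  {pv.std():.1f}")
print(f"90% range:           {np.percentile(pv, 5):.1f} to {np.percentile(pv, 95):.1f}")
```

Even with input uncertainties of only a percentage point or so, the resulting spread of value estimates is wide, which is precisely the sort of property that is rarely reported alongside a single point estimate.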
Significance Testing and Expert Testimony
Given that my academic research is available, I will merely summarize the conclusions here. There are several important points for attorneys working with experts.
By way of background, the basic modern statistical tests of reliability consist of "significance tests" and "hypothesis tests" or "confidence tests." These methods were developed in the 1920s and 1930s by two rival groups of scholars. My data show that significance testing (as a collective term for both methods) has been adopted by a majority of scientific practitioners for over 50 years and today is virtually universal. Some academics worry that the methods are misunderstood, misinterpreted, and misapplied, but no one disputes that significance testing is widely used and a standard scientific method.
For a decade or more following the introduction of confidence tests, a point of view persisted that one or the other of these must be superior in principle or in practice, or both, a divide that was sharpened by the personal animosities of leading proponents of each view—R. A. Fisher for significance testing; Jerzy Neyman and Egon Pearson for confidence intervals. However, by the 1950s the data show that both methods were widely used, often together. Prior to the 1980s, a researcher's disposition to adopt one method had little bearing on whether he or she also reported results from the other. In more recent decades, a researcher who adopted either one was statistically more likely to report the other set of measurements as well. Neither side of this once-spirited debate "won" the argument, and it is largely a non-issue for practical purposes today.
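As a purely illustrative sketch (hypothetical data, and not part of the research described above), the two approaches are easily reported side by side, as is now common practice: a p-value from a significance test of a sample mean, together with the corresponding 95 percent confidence interval.

```python
# Illustrative sketch with hypothetical data: a Fisher-style significance test
# and a Neyman-Pearson-style confidence interval reported together.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
sample = rng.normal(0.5, 2.0, size=40)     # hypothetical sample of 40 observations

# Significance test: p-value for the null hypothesis that the true mean is zero
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)

# Interval estimate: 95% confidence interval for the mean
ci_low, ci_high = stats.t.interval(
    0.95, df=len(sample) - 1,
    loc=sample.mean(), scale=stats.sem(sample),
)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print(f"95% confidence interval for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```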
The importance of this research result for attorneys is, first, that there is no question whether significance testing is routinely used in peer-reviewed research and accepted by the community of scientists. It is. Second, technical debates between opposing experts about the superiority of one method over another would only rarely be germane to the issues being tested.
A related issue has to do with the numerical value, the threshold or critical level of the test statistic, used to reach the conclusion that the results are "significant." As the reader may know, this is ordinarily set at 95 percent (or p=0.05). Where does this come from? Is it really the standard? The economist F. Y. Edgeworth used a 95 percent threshold, which he termed "significant" in his lectures to the Royal Society in 1885. His methods predated modern significance testing, however. A p value of 0.05 was first recommended by the founder of significance testing (R. A. Fisher) in his correspondence with a researcher. It soon made its way into print in his various writings. All this occurred in the 1920s, but there were antecedents.
There are two, maybe more, intuitively appealing arguments in favor of that number. The oldest—it was present in the work of Sir Isaac Newton at the Royal Mint, beginning around 1696—is that 1 chance in 20 is about the lowest sensible threshold for making an error in measurement. One in 10 seems unacceptably risky to risk-averse people for important subjects. The next higher round number is arguably 1 in 50, which was actually advocated by Jacob Bernoulli (mentioned earlier) as meeting the standard of "moral certainty" of being correct. This level, equivalent to a 98/2 percent standard, never caught on, and even Bernoulli found it impractically high in some of his empirical work. Other levels were suggested in the years before modern testing, but the point is that 95/5 percent resonated and other levels did not.
The more modern argument is that 95/5 percent (more formally, 1.96 times the standard error of the estimate) corresponds to roughly two standard deviations (a reference enshrined in case law, by the way) and therefore has natural appeal to modern statistically trained individuals. However, Fisher, the founder of significance testing, also warned that no single number could constitute an absolute bright line to be used in all times and places. Textbooks and reference books are seldom very helpful in this regard, maintaining that it is a choice for the investigator to make. And some scientists report p levels and standard errors numerically and leave the conclusion to the reader. But this is often a bit too ambiguous. In science, it is common to see higher levels of 99 percent and even 99.9 percent highlighted by researchers. In courts, the question is sometimes asked whether the number could be lower—perhaps 90 percent or even less.
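For readers who want the arithmetic behind these thresholds, the following short sketch computes the two-sided critical values of the normal distribution that correspond to the levels just mentioned; the 95 percent level yields the familiar 1.96, roughly two standard errors.

```python
# Illustrative sketch: two-sided normal critical values for common confidence levels.
from scipy import stats

for confidence in (0.90, 0.95, 0.99, 0.999):
    alpha = 1 - confidence
    z = stats.norm.ppf(1 - alpha / 2)   # two-sided critical value
    print(f"{confidence:.1%} level  ->  |z| > {z:.2f} standard errors")
```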
It is with regard to this issue that my research is perhaps most interesting. In over 500 published articles, there was exactly one reference to a lower test threshold than 95 percent—90 percent—and when 99 percent could be highlighted, it often was. The 95 percent level may be the offspring of an obscure birth and childhood, but empirically it is without doubt the established and virtually universal practice.
Significance Testing and Materiality
Materiality is an important subject in securities fraud and other cases, and over the years there has been regular progress in analyzing this issue scientifically and introducing the results in courtrooms. The Supreme Court's decision in Matrixx did not repudiate significance testing, but it appears to have caused some confusion that should be cleared up.
The confusion arises from the difference between the potential materiality of corporate announcements, when the content of the announcements involves scientifically inconclusive data, and the use of scientific methods of analyzing, in retrospect, whether a given announcement really was material to the stock market. These two issues are distinct, yet both involve the terminology of statistical significance and materiality. The Supreme Court was concerned with the former. Ironically enough, the Matrixx decision confirms the Court's abiding interest in stock market reactions to news, which is the more common application of significance testing.
When public companies and their auditors consider the materiality of potential disclosures, they do so with no hard evidence of how the market will react to the news, because that event still lies in the future. Accounting principles—much like legal doctrine established in Basic Inc. v. Levinson, 485 U.S. 224 (1988), and elsewhere—and SEC practice eschew bright lines and may consider but do not rely on rules of thumb. This situation changes in the courtroom, because by the time of trial, the market's reaction to the news is a matter of record. The market's reaction is still ambiguous, because there can be many influences on share price movements, and the market does not necessarily react to news the way the company, auditors, and regulators might have anticipated, ex ante.
The impact of the news on the market's valuation of the company can be evaluated using statistical and econometric methods. The oldest and best known of these is the so-called "event study." The usefulness of event studies has been recognized by the courts in many cases. More recently, a body of empirical literature referred to as "value relevance" research has become established and, for certain types of investigations, offers additional, powerful analytical results. Because these methods are scientific, they are evaluated scientifically, and that means they are amenable to significance testing.
The Court's decision in Matrixx seems clear enough: "This is not to say that statistical significance (or the lack thereof) is irrelevant—only that it is not dispositive of every case." In point of fact, the majority opinion carefully noted the share price movement when news and corporate announcements were made. It noted the share price decline from $13.55 to $11.97 following a January 30, 2004, press report about a Food and Drug Administration inquiry and product liability lawsuits. This was an 11.6 percent drop. It noted the price recovery following a corporate press release and the subsequent "plummet" of the share price to $9.94 following a national television news story. To what extent, if any, were these carefully noted price movements due to the natural, underlying volatility of the share price or to unrelated shifts of stock market mood? Such questions are answered through the use of event studies to establish the statistical significance of the price movement. The majority ruling shows the Court's continuing interest in such price movements as a guide to materiality.
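As a purely hypothetical sketch of how such a question is examined (using simulated returns, not the actual Matrixx trading data), an event study estimates a market model over a pre-event window and then asks whether the event-day abnormal return stands out from the stock's normal volatility.

```python
# Illustrative event-study sketch. All returns are simulated and hypothetical;
# they are not the actual Matrixx data.
import numpy as np

rng = np.random.default_rng(seed=3)

# Hypothetical daily returns over a 120-day estimation window
market = rng.normal(0.0005, 0.010, 120)
stock = 0.0002 + 1.2 * market + rng.normal(0, 0.020, 120)

# Market model: stock return = alpha + beta * market return + error
beta, alpha = np.polyfit(market, stock, deg=1)
residuals = stock - (alpha + beta * market)
sigma = residuals.std(ddof=2)              # residual (firm-specific) volatility

# Event day: an 11.6% price decline, with the market assumed roughly flat
event_stock_return = -0.116
event_market_return = -0.002
abnormal_return = event_stock_return - (alpha + beta * event_market_return)

t_stat = abnormal_return / sigma
print(f"abnormal return: {abnormal_return:.1%}, t-statistic: {t_stat:.1f}")
# |t| greater than about 1.96 would mark the move as significant at the 95% level
```

In actual practice, the choice of estimation window, market index, and adjustments for confounding news all matter, and experts ordinarily present those choices explicitly.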
In Matrixx itself, the majority was instead commenting on statistical significance in a quite different application: whether the frequency of adverse event reports had to be statistically significant. Because virtually all therapies are associated with a non-zero number of such reports, physicians, regulators, and companies seek evidence (statistical and otherwise) as to whether the true causes of these adverse events might have been unobserved factors, rather than the therapeutic regimen. A number of specialists in the health sciences are disappointed that the ruling did not clarify the disclosure requirements for that industry, but that is a separate issue from the question of whether the Supreme Court was taking on significance testing more generally. It clearly was not.
Finally, there is the issue of semantics. In Matrixx, the Court quotes a famous passage from the landmark Basic decision: "In Basic, we held that this materiality requirement is satisfied when there is '"a substantial likelihood that the disclosure of the omitted fact would have been viewed by the reasonable investor as having significantly altered the 'total mix' of information made available."'" The Court later refers also to information that "would otherwise be considered significant to the trading decision of a reasonable investor."
Significance testing is a powerful tool in evaluating the "total mix" of information, but that does not mean that tests of statistical significance are using the word "significant" with exactly the same meaning it carries in these passages. When the phrase "statistical significance" was coined, the intended meaning was that the statistical measure signified a given result, that is, it conveyed a conclusion. The more common meaning of the word today, that something is significant if it is important, was the secondary connotation in the late 1800s. Ordinarily, of course, a scientist is interested in statistically significant results that are also important to the body of scientific knowledge, but there can be some confusion as to whether statistical significance is a direct test of the importance of the phenomenon tested. Of course, attorneys must often link the two. This confusion lies at the root of the seeming ambiguity of Matrixx: statistically insignificant adverse events can be significant to the stock market, but that does not mean that significance tests should not be applied to the stock price movements themselves. Significance testing is no doubt here to stay in the sciences and in the courts.
Keywords: litigation, expert witnesses, statistical significance, significance testing, materiality, semantics, Matrixx Initiatives, Inc. v. Siracusano, stock market, admissibility
David A. Gulley teaches at Columbia University in the City of New York.