
The Antitrust Source

Antitrust Magazine Online | December 2020

Keanu Reeves, Simpson’s Paradox, Nicolas Cage, and Dead Salmon: Four Insights into Econometrics for Antitrust Counsel

Ai Deng

Summary

  • Economic analyses are often a key component of effective antitrust advocacy. As a result, it is essential that attorneys understand basic economic principles and can effectively convey these concepts to fact finders. 
  • In this article, Ai Deng uses real-world examples to explain four key economic concepts to a non-expert audience.
  • Specifically, this article explains (1) what a “but-for” analysis is and describes some common approaches; (2) what a regression is and why it is different from a simple correlation; (3) the key difference between causal and predictive analysis; and (4) the danger of “data mining” and why attorneys should care about this risk.

This is not nitpicking. [The expert’s] regression had as many bloody wounds as Julius Caesar when he was stabbed 23 times by the Roman Senators led by Brutus . . . .

If a party’s lawyer cannot understand the testimony of the party’s own expert, the testimony should be withheld from the jury. Evidence unintelligible to the trier or triers of fact has no place in a trial.

Judge Richard Posner’s words send a powerful message: it is critical that lawyers understand the economic analyses presented by their experts. If they do not, they will lose credibility and, in turn, risk losing their case. Even if a lawyer does not have a deep understanding of the underlying mathematics and statistics used by the expert, the lawyer can learn and apply four powerful insights drawn from everyday life to obtain a better understanding of an expert’s quantitative analyses and areas of vulnerability. Specifically, this article explains (1) what a “but-for” analysis is and describes some common approaches, (2) what a regression is and why it is different from a simple correlation, (3) the key difference between causal and predictive analysis, both of which are widely used in litigation, and (4) the danger of “data mining” and why attorneys should care about this risk.

The Problem of Counterfactuals: In Search of a Parallel Universe

In 2017, Keanu Reeves reprised his role as John Wick, retired hitman, in John Wick Chapter 2. During one interview on his promotional tour, Reeves was asked the interesting question of whether he thought his years of acting experience had helped him better understand human relationships and made him a better person. The actor paused before responding: “[I]t is hard for me to say, to contrast and compare, because I don’t have anything to contrast with . . . . I’ve been working as an actor since I was 15, 16, so I don’t know what it is like not to be [an actor], really, you know?”

This exchange captures a question that we commonly face in merger reviews and litigation: how do you know what would have happened if things had been different? For example, a key question in a cartel case is whether prices would have been lower absent the cartel conduct. In a no-poach case, the critical question is whether wages would have been different absent the agreement. This also applies to predictions of the future: in a merger review, would prices increase (or output decrease) if the parties were allowed to merge?

All of these questions ask about the counterfactual, or “but-for,” world. The challenge that Keanu Reeves pointed out in his response is closely related to the “fundamental problem of causal inference”: we can observe what has actually happened but not what would have happened. What can be done about this challenging problem?

One common approach used by economists (and others) is based on an intuitive process that economists refer to as benchmarking. For example, to understand the impact of certain conduct, one could use lived experience, i.e., history, as a benchmark. Indeed, after noting the challenging nature of the interviewer’s question, Keanu Reeves stated that he had learned many things that he would not have known were he not an actor, referring in particular to his experience in making the movie Little Buddha. By doing so, he used his own before and after history as the benchmark.

This simple yet powerful idea has been widely used in litigation. Figure 1 illustrates the use of this type of benchmarking to calculate cartel overcharges in In re Vitamins Antitrust Litigation, showing the actual prices of Vitamin E acetate oil as well as the but-for, or counterfactual, prices absent the cartel. The approach used price history, shown as a time series, and the historical relationship between the price and supply and demand factors outside the cartel period as a benchmark against which to compare prices during the cartel period.

Figure 1: Benchmarking Approach in In re Vitamins Antitrust Litigation

History is not the only source of a benchmark. When history is not available or deemed too different from the period of interest, information from other sources unaffected by the event or conduct in question can also provide a benchmark. This is also intuitive. If, for example, a certain anticompetitive conduct affects only one geographic area, then the observations from adjacent areas might be used as benchmarks. Datasets that measure different geographic areas (or individuals) at a single point in time are commonly known as “cross sections” in statistics and econometrics.

In some other cases, both the history and cross sections may be available as benchmarks. Such data are known as panel or longitudinal data. Detailed transactional data covering different customers over time, common in antitrust class certification cases, are an example of panel data.

Unsurprisingly, for any benchmarking to yield reliable results, several conditions must be met. And because of the complexity, as explained below, it will often be necessary to use analytical tools, such as regressions, when benchmarking.

Regressions: What, Why, and How?

In modern terms, regressions are mathematical models used to understand how things are related to each other. Economists use them widely in antitrust litigation and merger investigations. For example, in a price-fixing case, economists often use regressions to understand the impact of the cartel conduct on prices. In a merger case, economists often use regressions to understand the effect of competition on prices. These examples are illustrated in the figures below. Note that, in both cases, regressions take into account not only the variable of interest (cartel conduct in the cartel case; some measure of competition in the merger case) but also other demand and cost factors. This is an important point to which we will return.

Illustration of how a regression could be used in a cartel case.

Illustration of how a regression could be used in a merger case.

The wide use of regressions in these cases is not surprising: because benchmarking is about understanding how things are similar to or different from each other, regressions, a tool to study relationships among different variables, are often used for that purpose. Regressions are no panacea, however. As the discussion that follows will make clear, because regressions rely on data to reveal the relationship of interest, they are not applicable when data are insufficient or simply unavailable.

But why is this type of modeling called a regression? The term dates back to the empirical observation of “regression to the mean.” One of the earliest articles documenting this phenomenon was published in 1886 by Sir Francis Galton. Galton reported his study of the relationship between a father’s height and the height of the father’s first full-grown son. Galton found that a tall father would have a son, on average, shorter than himself and vice versa. To put it differently, if a father’s height is either above or below the average height, the son’s height tends to “regress” to the mean, or the average height. While Galton’s original “regression” was about a biological phenomenon (the relationship between individual and average heights), others, such as Pearson, have extended it to a general approach to statistical relationships between different variables, which ultimately led to the modern formulation of regressions.
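
Galton’s observation is easy to reproduce in a short simulation. The following sketch (in Python; the heights and the strength of inheritance are invented for illustration, not Galton’s data) generates father-son pairs whose heights are positively but imperfectly correlated, then shows that the sons of unusually tall fathers are, on average, closer to the mean than their fathers:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100_000
mean_height, sd = 69.0, 2.5          # inches; illustrative values, not Galton's data
fathers = rng.normal(mean_height, sd, n)

# Sons inherit only part of the father's deviation from the mean (slope < 1),
# plus independent variation -- this is what produces regression to the mean.
sons = mean_height + 0.5 * (fathers - mean_height) + rng.normal(0, sd * 0.85, n)

tall = fathers > mean_height + sd    # fathers at least 1 SD above average
print(f"Tall fathers average: {fathers[tall].mean():.1f} in")
print(f"Their sons average:   {sons[tall].mean():.1f} in")  # closer to the mean
```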

Linear Regression Made Simple

Economic experts commonly use linear regressions. The simplest linear regression involves only two variables—call them X and Y. For example, variable Y may be the price of an allegedly price-fixed product and X may be a demand or cost variable. A linear regression can be expressed as a simple equation: Y = c + bX + e. That is, variable Y (the price of the product) equals the sum of c, a constant, plus b, another constant that is multiplied by variable X, plus another mysterious e.

Economists use a fair amount of jargon when discussing regression. The variable of interest, Y, is sometimes referred to as the “dependent variable,” “outcome variable,” or even the esoteric “regressand,” and X is sometimes referred to as the “independent variable,” “explanatory variable,” “predictor,” or the “regressor.” X and Y are typically observed data: you can observe the price, Y, during and outside the cartel period, and you may also observe the cost, X, during the same periods.

The constants c and b are often referred to collectively as the “coefficients,” although the constant c is also known as the “intercept” and the constant b is also known as the “slope coefficient.” Specifically, because X is multiplied by b in the regression model, the slope coefficient, b, reflects a mathematical relationship between X and Y that, when plotted on a graph, will generate a straight line. You can see this graphically in Figure 2. Note that the data (the blue dots) are identical in both charts.

Figure 2: How Regression Works with Only One Variable

Why are the lines in the graphs in Figure 2 different, and why does one appear to be “closer” to the data points than the other? This is the role of the coefficients, c and b. Unlike the observed data, Y and X, the coefficients c and b are unknown numbers that the economist needs to “estimate.” Specifically, “estimation” refers to the process of finding the values of c and b that minimize the difference or “distance” between Y and the line described by the formula c + bX. Because c + bX is a straight line on this chart, “estimation” can also be thought of as the process of finding coefficients that shift and tilt the line c + bX so that it will, on average, track as closely as possible to the observed data points of Y.

What do economists mean by “close”? The most popular definition of closeness is measured by the average squared difference between our straight line, c + bX, and the observed values of Y across all the data points. This definition of closeness is the basis of the Ordinary Least Squares (OLS) estimation method: OLS estimates the values of c and b as those that minimize this average squared difference. It is one of the most widely used methods for regression estimation, thanks largely to its tractability and desirable statistical properties.

Another term that economists use often is “regression fit.” In Figure 2, the “closer” the line is to all the data on average, the better the regression fit is, i.e., the closer c + bX is to Y. Therefore, the regression in the right panel has a better fit than the one in the left panel. Intuitively, if the values of c and b produce a better regression fit, then they do a “better” job describing the relationship between X and Y. This is why, for example, the regression model in the right panel is often preferred to the one in the left panel. Economists use the R-squared (R²) to measure how well a regression fits the data. Intuitively speaking, R² measures the percentage of the variation in Y that is accounted for by c + bX. The higher the R², the better the regression fit. The regression model in the right panel has a higher R² than the one in the left.

Finally, that mysterious term e also has many names, including the “error term” and the “disturbance.” For our purposes here, it suffices to think of it as a “residual” that accounts for whatever differences remain between Y and the model c + bX.
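
For readers who want to see these mechanics in action, here is a minimal sketch in Python (the numbers are invented for illustration). It simulates data from Y = c + bX + e with known coefficients, estimates c and b by OLS, and computes the R² just described:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate data from Y = c + b*X + e with known "true" coefficients.
true_c, true_b = 2.0, 3.0
X = rng.uniform(0, 10, 200)
e = rng.normal(0, 2.0, 200)           # the error term / disturbance
Y = true_c + true_b * X + e

# OLS: choose c and b to minimize the average squared distance
# between Y and the line c + b*X.  np.polyfit does exactly this.
b_hat, c_hat = np.polyfit(X, Y, deg=1)
print(f"estimated c = {c_hat:.2f}, estimated b = {b_hat:.2f}")

# Regression fit: R-squared = share of the variation in Y
# accounted for by the fitted line c_hat + b_hat*X.
fitted = c_hat + b_hat * X
r2 = 1 - np.sum((Y - fitted) ** 2) / np.sum((Y - Y.mean()) ** 2)
print(f"R-squared = {r2:.3f}")
```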

So far, we have focused on the case of a single X variable—a single independent variable, such as a cost. How does a regression work when there are instead multiple independent variables? Certainly, it is not difficult to think of many variables that could affect the price of a product or service. It turns out that multiple variables do not change the intuition of the mechanics of how we estimate those unknown “coefficients,” such as c and b. Figure 3 illustrates the situation of two independent variables—call them X1 and X2. In this case, instead of a flat graph with x and y axes and a straight line, the model is three-dimensional, generating a plane rather than a line. More precisely, for those comfortable with this type of 3-D chart, c + b1X1 + b2X2 represents a plane (the dark-colored plane in Figure 3) in a 3-D space. In this case, the OLS-estimated coefficients c, b1, and b2 are chosen by shifting and tilting the plane (as opposed to moving a straight line, as in Figure 2) so that it is closest to the variable Y (the light blue dots).

Figure 3: Regression Plane with Two Variables
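
For the two-variable case illustrated in Figure 3, a sketch along the same lines (again with invented numbers) estimates c, b1, and b2 using the standard least-squares routine; conceptually nothing changes except that the fitted object is a plane rather than a line:

```python
import numpy as np

rng = np.random.default_rng(2)

n = 500
X1 = rng.uniform(0, 10, n)            # e.g., a cost variable
X2 = rng.uniform(0, 5, n)             # e.g., a demand variable
Y = 1.0 + 2.0 * X1 - 1.5 * X2 + rng.normal(0, 1.0, n)

# Design matrix: a column of ones (for the intercept c) plus X1 and X2.
X = np.column_stack([np.ones(n), X1, X2])

# OLS coefficients via least squares: shifts and tilts the plane
# c + b1*X1 + b2*X2 until it is closest (in squared distance) to Y.
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
c_hat, b1_hat, b2_hat = coef
print(f"c = {c_hat:.2f}, b1 = {b1_hat:.2f}, b2 = {b2_hat:.2f}")
```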

Why Do We Need Multiple Variables?

As illustrated above, the mechanics of a regression are quite simple. In fact, when there is a single X and Y, a regression is conceptually similar to the simple correlation between the two variables. Then why bother with multiple variables? The following illustration should drive home the importance of the use of multiple variables.

Consider two medical treatments for a certain disease, treatment A and treatment B. To assess which treatment is more effective, each patient in a group is given one of the two treatments. You have just received data on 700 patients, indicating whether each has recovered after treatment. Table 1 shows the recovery rates; the numbers in parentheses are the number of recovered patients over the total number of patients receiving each treatment.

Table 1: Overall Recovery Rates

Treatment A Treatment B
78% (273/350) 83% (289/350)

 

Table 1 tells us that 83 percent of the patients recovered after receiving treatment B, 5 percentage points higher than the recovery rate for treatment A. Is treatment B the better option for this disease? It is tempting to say yes, but the answer is not that simple. You find out from a medical expert that the treatment may very well have a different effect on men and women, so you decide to tabulate the recovery rates by gender, shown in Table 2.

Table 2: Recovery Rates by Gender

  Treatment A Treatment B
Female 93% (81/87) 87% (234/270)
Male 73% (192/263) 69% (55/80)

 

Table 2 tells us that treatment A actually has a higher recovery rate than treatment B, regardless of the patient’s gender. To put it another way, including—or in statistical jargon, “controlling for”—gender in the analysis completely reverses the result. The realization that including more variables in the analysis could change the results was documented by Edward H. Simpson 70 years ago and is referred to as Simpson’s Paradox. It remains a powerful reminder of the importance of carefully considering additional variables in a statistical analysis. Note that considering additional variables does not mean more variables are always better. Among other reasons, the parsimony principle, a topic outside the scope of this article, advises against including too many variables in certain types of regression analysis.
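
The reversal in Tables 1 and 2 can be verified directly from the counts reported above. The following short sketch in Python recomputes the recovery rates:

```python
# Counts from Tables 1 and 2: (recovered, total) per treatment and gender.
data = {
    "A": {"Female": (81, 87),   "Male": (192, 263)},
    "B": {"Female": (234, 270), "Male": (55, 80)},
}

for treatment, groups in data.items():
    recovered = sum(r for r, _ in groups.values())
    total = sum(t for _, t in groups.values())
    print(f"Treatment {treatment} overall: {recovered / total:.1%}")
    for gender, (r, t) in groups.items():
        print(f"  {gender}: {r / t:.1%}")

# Treatment B wins overall (82.6% vs. 78.0%), yet treatment A wins
# within each gender -- Simpson's Paradox.
```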

Simpson’s Paradox applies to the types of questions antitrust practitioners face every day. To see an example, all we need to do is to change the labels in Tables 1 and 2. Imagine that we are assessing whether a merged grocery store has raised its prices on more products relative to another remaining competitor. Using exactly the same numbers, Table 1 could be telling us that the merged store raised prices on 83 percent of its products (289 out of 350), 5 percentage points more than the competitor. Table 2, however, reveals that the merged store sells more organic foods than the other competitor, and when we look at organic and non-organic foods separately, the other competitor actually raised prices on higher percentages of its products than the merged store. As in the medical treatment example, failing to recognize the importance of another factor (here, product type) would have led to misleading inferences.

To Explain or To Predict? That Is the Question

In his 1969 book Theory Building, Robert Dubin made it clear that theories of social and human behavior have two distinct goals: prediction and understanding. The same is true for statistical analysis where the goal of “prediction” is addressed by predictive statistical modeling and the goal of “understanding” by causal or explanatory modeling.

While this distinction is almost never explicitly recognized, both types of analyses appear in antitrust litigation. Examples of explanatory modeling include the “dummy variable” models used in damages estimation and any other case where a regression coefficient is interpreted as a “causal” effect. Examples of predictive modeling include forecasting models used in damages estimation and the event study techniques widely used in securities litigation. Why bother making this distinction? Important but different methodological issues arise depending on the goal of a statistical analysis. Failing to recognize this distinction could lead to misplaced and wasted effort and, worse, inferior and misleading empirical findings.

To Explain: Interpreting A Linear Regression Is Not Always Simple

Let us start with explanatory modeling. This is a situation where we want to directly measure the effect of a given variable on another. The dummy variable model often used to estimate antitrust damages is such an example.

Suppose we have price data both during and outside the cartel period, much like in the case In re Vitamins Antitrust Litigation illustrated in Figure 1 above. For simplicity, assume that the cartel had a constant, time-invariant impact on the prices. How do we estimate this cartel effect? Consider the following strategy: calculate the average prices both during and outside the cartel period and measure the cartel effect as the difference between these two average prices. This is a simple benchmarking process, an idea described earlier.

But even if this comparison shows that the average price during the cartel is ten dollars higher than that outside the cartel period, is the actual cartel overcharge ten dollars? Before we answer this question, it is interesting to note that this comparison of averages can be cast into a simple regression of the following form:

pₜ = c + b × Cartelₜ + eₜ

Here, pₜ denotes the prices, i.e., the dependent variable that we have previously denoted by Y. Cartelₜ is a “dummy variable” that is equal to one during the cartel period and zero otherwise. It turns out that, mathematically, using the OLS method, the estimated coefficient c is simply equal to the average price outside the cartel period, and the estimated coefficient b is equal to the difference between the average prices during and outside the cartel period. Therefore, in our hypothetical example, b is precisely equal to ten dollars. Figure 4 illustrates this simple regression model graphically. The two horizontal bars represent the average prices during the non-cartel and cartel periods.

Figure 4: Graphic Illustration of the Simple Regression Model
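
A quick simulation confirms the algebra: in a regression of price on a cartel dummy, the OLS intercept equals the average non-cartel price and the slope equals the difference between the two period averages. The prices below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# 60 non-cartel months around $50, then 60 cartel months around $60.
price = np.concatenate([rng.normal(50, 3, 60), rng.normal(60, 3, 60)])
cartel = np.concatenate([np.zeros(60), np.ones(60)])   # the dummy variable

b_hat, c_hat = np.polyfit(cartel, price, deg=1)
print(f"c (intercept)       = {c_hat:.2f}")   # ~ average non-cartel price
print(f"b (cartel 'effect') = {b_hat:.2f}")   # ~ $10 difference in averages

print(f"avg non-cartel      = {price[cartel == 0].mean():.2f}")
print(f"difference of avgs  = "
      f"{price[cartel == 1].mean() - price[cartel == 0].mean():.2f}")
```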

Yet intuition suggests that the cartel impact may not be ten dollars, because the comparison does not account for anything else that could also have affected the price differences. To give two concrete examples: suppose, unrelated to the cartel conduct, there were unprecedented supply disruptions due to natural disasters during the cartel period, or suppose the data indicated weaker demand for the product while production costs remained largely constant over time. Even without complicated econometrics, intuition tells us that the ten-dollar difference that results from the simple comparison of average prices could overstate the cartel effect in the first instance (decreased supply could independently lead to higher prices) and understate it in the second (weaker demand could independently lead to lower prices).
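
This intuition can also be demonstrated in a few lines. In the hypothetical sketch below, a cost increase that happens to coincide with the cartel period inflates the naive difference in averages, while adding the cost variable to the regression recovers an estimate close to the true overcharge. All numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(4)

n = 120
cartel = np.concatenate([np.zeros(60), np.ones(60)])
# A supply disruption raises costs mostly during the cartel period.
cost = 20 + 5 * cartel + rng.normal(0, 1, n)
true_overcharge = 10.0
price = 30 + true_overcharge * cartel + 1.0 * cost + rng.normal(0, 2, n)

# Naive model: price on the dummy alone -- attributes the cost
# increase to the cartel and overstates the overcharge (~$15).
b_naive, _ = np.polyfit(cartel, price, deg=1)

# Controlled model: price on dummy and cost -- recovers ~$10.
X = np.column_stack([np.ones(n), cartel, cost])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)

print(f"naive overcharge estimate:      {b_naive:.2f}")
print(f"controlled overcharge estimate: {coef[1]:.2f}")
```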

To Predict: Better Regression Fit Does Not Always Lead to Better Prediction

Economists often make predictions in the antitrust context either to understand the effect of a merger upon a relevant market (forward-looking) or to estimate the effect of a cartel on prices (retrospective). When the goal is to obtain the best prediction, rather than to interpret a coefficient as a causal effect, we need to adopt a different focus. To help drive home the idea, let’s play a betting game.

Figure 5 shows a time series, which could be anything––for example, the price for widgets in Wisconsin. For the purpose of this illustration, we are going to assume that an economist only has the price data in the unshaded area and wants to predict the prices in the shaded area. Because we know the actual prices that the economist is trying to predict, we can assess how good any predictions are.

Figure 5: Historical Data and the Prediction Exercise

Three regression models are estimated using the historical data in the unshaded area shown in Figure 5. The question is which model would generate the most accurate predictions. Figure 6 shows the regression fit of each of the three models. In Model 1, the regression fit appears to be a straight line. It captures the upward trend in the data, but does not capture any of the actual ups and downs that were observed in the historical data.

Figure 6: Regression Fit of Three Regression Models

Model 2 seems to do a better job than Model 1 in capturing the movements in the historical data. It captures not only the upward trend but also the appearance of a small downward trend at the very beginning of the historical data. Model 3, however, does even better. It seems to capture most of the movements in the historical data.

Before we reveal the winner, take out a $100 bill and put it on the model that you would use to make a prediction. Now look at Figure 7, which shows the predictions generated by each of the three models.

Figure 7: Predictions of the Three Models

There are several noteworthy observations. First, over the entire forecast horizon, Model 1 appears on balance to be the best, despite being the worst-fitting model over the historical data. At the other extreme, Model 3, despite being the best-fitting model, performs poorly: its predictions completely overshoot the actual values. Model 2 is somewhere in between: its predictions are relatively close to the actual values in the first few time periods before they diverge from them. In fact, Model 2 does even better on average than Model 1 in the first few time periods. Over the entire forecasting horizon, however, Model 2 performs much worse than Model 1. Those of you who bet on Model 1 are fortunate—you can keep your $100 bill.

These results may appear surprising or even puzzling. How can a model that fits the historical data so well (Model 3, in particular) predict so poorly? We will not concern ourselves with a technical discussion, which inevitably relies on statistical concepts such as estimation bias, variance, mean squared errors, and others. But the intuition behind these observations is simple and powerful.

Imagine that you are buying a new dress or a suit for an upcoming event next month, say a wedding reception or a job interview where you want to impress. What size of dress or suit are you going to buy? Unless you aspire to lose or gain ten pounds for the event, you are probably going to get one that fits you well, nothing too small or too big. But if you are buying a sweater that you want to wear for the next ten years, you would buy something overlarge and shapeless to make sure it will fit comfortably forever.

The intuition in using regressions to make predictions is the same: the near future has fewer unknowns, so you can “afford” to have the regression more closely “fit” the historical data; on the other hand, the longer into the future we want to predict, the greater the number of unknowns, requiring a looser “fit” to leave room for them to play out.

That intuition explains what happens in our example. Model 3 fits the historical data too well to generate useful predictions. Model 2 fits the historical data less well than Model 3 but better than Model 1. Consequently, it performs reasonably well in the near-term prediction but fails in the long term. Model 1 does not fit the historical data as well as either Model 2 or Model 3 but performs the best on average over the entire forecast horizon.

Of course, this does not mean that the best prediction model must have the worst fit. Theoretical studies have shown that there is a delicate balance between the fit of a model and its predictive power, and a good predictive model needs to take this important insight into account. But there is one lesson that readers can apply immediately: be skeptical if someone tries to convince you that their regression model can generate a reliable prediction just because the model fits the historical data well. Using the concept of R² introduced above, another way to state this is that a high R² (better regression fit) alone should not be used as a justification for a prediction model.
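
The fit-versus-prediction trade-off is easy to reproduce. The sketch below mimics the exercise in Figures 5 through 7 with simulated data (not the data behind those figures): three polynomial trend models of increasing flexibility are fit to a “historical” window and then compared on out-of-sample forecast error. In simulations like this one, the most flexible model fits best in sample yet typically forecasts worst:

```python
import numpy as np

rng = np.random.default_rng(5)

# A noisy upward-trending "price" series: 80 historical + 20 future periods.
t = np.arange(100)
y = 50 + 0.4 * t + 3 * np.sin(t / 6) + rng.normal(0, 1.5, 100)
x = t / 100.0                  # rescale time so high-degree fits stay stable
x_hist, x_fut = x[:80], x[80:]
y_hist, y_fut = y[:80], y[80:]

for degree in (1, 4, 12):      # increasingly flexible trend models
    coefs = np.polyfit(x_hist, y_hist, deg=degree)
    fit = np.polyval(coefs, x_hist)
    pred = np.polyval(coefs, x_fut)
    r2 = 1 - np.sum((y_hist - fit) ** 2) / np.sum((y_hist - y_hist.mean()) ** 2)
    rmse = np.sqrt(np.mean((y_fut - pred) ** 2))
    print(f"degree {degree:2d}: in-sample R2 = {r2:.3f}, "
          f"out-of-sample RMSE = {rmse:.1f}")
```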

Explanatory vs. Predictive Modeling

As discussed above, explanatory and predictive modeling have different focuses. It is useful to look at some real-life examples to crystallize the difference.

Figure 8 compares the number of films Nicolas Cage appeared in from 1999 to 2009 against the number of people who drowned by falling into a pool during the same period.

Figure 8: Spurious Correlation - Nicolas Cage Films and U.S. Drownings in Swimming Pools (1999–2009)

It does not take an econometrician to see the strong positive correlation between the two. Yet intuition tells us there cannot really be a causal or explanatory relationship between them, despite the apparently high correlation. It is highly unlikely that people were so upset that a new Nicolas Cage film was released that they preferred to drown themselves, or that Nicolas Cage was so saddened by how many people drowned that year that he decided to make more films as a result.

But if neither variable explains the other, can they be used to predict each other? Specifically, should public health officials use Cage’s films to predict the number of drownings, given the historically high correlation? Figure 9 below shows that the predictions based on Cage’s films fail miserably post-2009. In fact, except for a lucky (or unlucky, if you will) coincidence in 2017, the predictions (the blue dotted line) persistently overestimate the number of drownings (the blue line). It is safe to conclude that Cage’s films and swimming pool deaths are one example where the correlation is neither causal nor predictive.

Figure 9: Can Cage Films Predict Drownings?

Yet prediction without direct causation is entirely possible. As an example, empirically, ice cream production is also positively correlated with swimming pool drownings. Figure 10 shows the monthly values of these two variables. The correlation is 0.57, which is slightly lower than the correlation between Cage’s films and drownings between 1999 and 2009.

Figure 10: Ice Cream vs. Drownings

Do we think producing or consuming more ice cream leads to more drownings, or vice versa? Probably not. But there is something about the correlation in this example that makes sense. As the temperature rises in the summer, people tend to consume more ice cream, and hence manufacturers produce more. At the same time, and unrelated to their enjoyment of ice cream, the hot weather leads more people to head to pools, resulting in more pool drownings. In other words, it is not that ice cream consumption or production itself explains swimming pool drownings, or vice versa. It is the season, specifically the temperature, that explains both, and it is this common cause that produces the positive correlation between the two. Would you consider using either ice cream production or drownings to predict the other? In the absence of information about the season or the temperature itself, yes, it is reasonable to do so. In fact, unlike the predictions based on Cage’s films (which consistently overpredict), Figure 11 shows that predictions of drownings based on ice cream production alone are, while far from perfect, reasonable.

Figure 11: Can Ice Cream Predict Drownings?
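
The common-cause story is simple to simulate. In the hypothetical sketch below, “temperature” drives both monthly ice cream production and drownings; the two outcomes end up correlated even though neither causes the other, and one predicts the other tolerably well when temperature itself is unobserved:

```python
import numpy as np

rng = np.random.default_rng(6)

months = 120
temperature = (60 + 20 * np.sin(2 * np.pi * np.arange(months) / 12)
               + rng.normal(0, 3, months))

# Both series respond to temperature but not to each other.
ice_cream = 100 + 2.0 * temperature + rng.normal(0, 15, months)
drownings = 5 + 0.3 * temperature + rng.normal(0, 4, months)

print(f"correlation: {np.corrcoef(ice_cream, drownings)[0, 1]:.2f}")

# Predicting drownings from ice cream alone (temperature unobserved):
# far from perfect, but systematically informative.
b, c = np.polyfit(ice_cream[:96], drownings[:96], deg=1)
pred = c + b * ice_cream[96:]
print(f"forecast RMSE: {np.sqrt(np.mean((drownings[96:] - pred) ** 2)):.2f}")
```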

Let Data Speak, But Do Not Torture Them

The discovery of unexpected correlations, like that between the release of Nicolas Cage’s films and pool drownings, should come as no surprise. In his book The Improbability Principle, David Hand explains that what is at work is “the Law of Truly Large Numbers.” He defines the principle succinctly: “With a large enough number of opportunities, any outrageous thing is likely to happen.” In other words, if one looks hard enough, one may identify a statistical coincidence that is neither causal nor predictive.
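
Hand’s principle is easy to see with simulated data: generate enough unrelated series and some pair will look impressively correlated. The sketch below searches 200 independent random walks (invented data) for the best-correlated pair:

```python
import numpy as np

rng = np.random.default_rng(7)

# 200 completely unrelated random walks, 11 periods each
# (about the length of the Cage/drownings series).
series = rng.normal(size=(200, 11)).cumsum(axis=1)

corr = np.corrcoef(series)            # all pairwise correlations
np.fill_diagonal(corr, 0)             # ignore each series with itself
i, j = np.unravel_index(np.abs(corr).argmax(), corr.shape)
print(f"best 'match': series {i} vs {j}, correlation {corr[i, j]:.2f}")
# With 200 series there are nearly 20,000 pairs -- a near-perfect
# correlation is virtually guaranteed to appear by pure chance.
```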

The Law of Truly Large Numbers is also a reason to exercise caution when we analyze data and draw inferences from them. For example, with a large number of tests comparing the effectiveness of a drug with that of a placebo, it is almost guaranteed that at least one comparison will appear to show that a drug is “effective,” when, in fact, it is not. This type of spurious finding often results from a process known as data mining. Other colorful names for this concept include “data dredging,” “data snooping,” and “data torturing.” As one author put it, such practices are “the analytical equivalent of bunnies in the clouds, poring over data until you found something. Everyone knew that if you did enough poring, you were bound to find that bunny sooner or later, but it was no more real than the one that blows over the horizon.”
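
A short simulation shows how this plays out. Below, a drug with no true effect is compared with a placebo in 100 independent trials; even though the null hypothesis is true in every trial, roughly five comparisons come out “statistically significant” at the 5 percent level:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

false_positives = 0
for _ in range(100):                 # 100 trials of a drug with no effect
    drug = rng.normal(0, 1, 50)      # outcomes under the (useless) drug
    placebo = rng.normal(0, 1, 50)   # outcomes under the placebo
    _, p_value = stats.ttest_ind(drug, placebo)
    false_positives += p_value < 0.05

print(f"'significant' results in 100 tests of a useless drug: {false_positives}")
# Expected: about 5 -- statistical coincidence, not effectiveness.
```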

As another example of the danger of data torturing, consider the study that a team of neuroscientists once conducted on a salmon whose brain underwent an fMRI scan:

When they presented the fish with pictures of people expressing emotions, regions of the salmon’s brain lit up. . . . [H]owever, as the researchers argued, there are so many possible patterns that a statistically significant result was virtually guaranteed, so the result was totally worthless. . . . [T]here was no way that the fish could have reacted to human emotions. The salmon in the fMRI happened to be dead.

That dead salmon saved the researchers from some misleading discoveries. An economist is unlikely to have the benefit of such dead giveaways. Instead, the economist needs to take extreme care not to “cherry pick” findings just because they support the client’s or lawyer’s preferences.

Data mining is not just a theoretical or academic concept. In fact, it has been alleged as a basis for excluding experts’ reports and testimony in several recent cases. In In re Processed Egg Products Antitrust Litigation, the plaintiffs’ expert used a regression model to relate prices to other factors. The defendants asked the court to disregard the plaintiffs’ expert’s regression model because, when the model was estimated using only a subset of the data, specifically using “just one certain [d]efendant’s transactions,” some aspects of the regression results changed. The plaintiffs countered that the defendants’ results were “the product of inappropriate ‘data mining.’” Judge Gene E.K. Pratter denied the defendants’ challenge against the model and found the plaintiffs’ data mining counterargument persuasive.

In another antitrust class certification case, In re Pool Products Distribution Market Antitrust Litigation, the plaintiffs filed a motion to exclude the testimony of the defendants’ expert, based in part on an argument that alleged data mining bias rendered the testimony unreliable. In particular, according to the court’s order, the defendants’ expert estimated the plaintiffs’ expert’s regression model using subsets of the data and argued that the results showed that “common factors do not predominate in determining pricing across the class.” The plaintiffs argued that applying their regression model to subsets of data was “impermissible ‘data mining.’” Citing various literature and case law, the court concluded that the defendants’ expert’s sensitivity check was “sufficiently reliable” and ultimately denied the motion to exclude the testimony.

Finally, in Karlo v. Pittsburgh Glass Works, an age discrimination case, Judge Terrence F. McVerry found one expert’s analysis of impact to be “improper” because it did not correct for “the likelihood of a false indication of [statistical] significance.” He added that it was “data-snooping, plain and simple.” The Third Circuit, however, vacated Judge McVerry’s ruling, stating, “We conclude that the District Court applied an incorrectly rigorous standard for reliability,” although it did not expressly refer to the alleged data-mining issue. Given the discussions in these cases, this important but subtle statistical concept will continue to receive well-deserved attention in the legal domain.

Conclusion

At this point, the reader may have many additional questions. Indeed, time and space, as well as pedagogical goals, limit what this article can offer, but counsel’s interactions with their experts will have fewer limitations. Econometrics has become an indispensable and widely used tool in both merger control and antitrust litigation. By probing the questions discussed in this article, counsel can clearly communicate the expert’s work to the finders of fact and, at the same time, make the expert’s analysis more robust. Doing so can potentially reduce Daubert and other litigation risks. In fact, failing to appreciate these key concepts could easily result in deeply flawed and misleading analyses.

The author thanks Tim Watts for helpful comments and Mike Packard for excellent research assistance. The views expressed in this article are those of the author and do not necessarily reflect the opinions of NERA or its clients, Johns Hopkins University, or their affiliates.
