What Is a Multiple Regression Analysis?
A simple regression may be visualized as the average linear relationship in an x-y scatterplot. As shown in figure 1, a scatterplot presents pairs of values, analogous to longitude and latitude on a map. Each observation, or data point, has an x value and a y value. For instance, x might be “Years Employed” and y might be “Annual Salary.”
The regression analysis fits a line through the cloud of observations. The regression line does not pass through every actual observation. Rather, it represents the average linear relationship between the two variables.
A common way of determining the line is a process called least squares. This process identifies the line that produces the minimum sum of squared deviations from the line, hence the name.
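The least squares computation for a simple regression can be sketched in a few lines of Python. The data below are hypothetical, chosen only to mirror the "Years Employed" versus "Annual Salary" example; they are not the values plotted in figure 1.

```python
# A minimal sketch of ordinary least squares for a simple (one-variable)
# regression, using the standard closed-form slope and intercept formulas.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical "Years Employed" vs. "Annual Salary" (in $1,000s)
years = [1, 3, 5, 8, 12, 20]
salary = [42, 55, 61, 74, 90, 128]

slope, intercept = fit_line(years, salary)
print(f"salary = {intercept:.1f} + {slope:.1f} * years")
```

The fitted line passes through none of the six points exactly; it is the single line whose squared vertical deviations from the points sum to the smallest possible value.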
Regression analysis is not limited to fitting lines to simple scatterplots of points in two dimensions. A regression can fit annual salary to a linear combination of explanatory variables such as years employed, educational attainment, and sex, for example. An extended model that includes multiple variables is called a multiple regression. A model with multiple variables tells us the effect of each variable on annual salary while holding constant the effects of the other variables. For example, a variable such as sex can be used to test for salary differences not explained by years employed or educational attainment.
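The multiple regression idea can be sketched in Python, assuming the NumPy library is available. The variable names and numbers are illustrative, not the article's data; the salaries are constructed to follow an exact linear rule so the recovered coefficients are easy to verify.

```python
# A sketch of a multiple regression fit, assuming NumPy is available.
# Variable names ("years", "education") are illustrative, not from the article.
import numpy as np

years = np.array([1, 3, 5, 8, 12, 20], dtype=float)
education = np.array([12, 16, 12, 18, 16, 20], dtype=float)  # years of schooling
# Salaries built from an exact linear rule so the fit is easy to check:
salary = 10.0 + 4.0 * years + 2.5 * education

# Design matrix: a column of ones (intercept) plus one column per regressor.
X = np.column_stack([np.ones_like(years), years, education])
coef, *_ = np.linalg.lstsq(X, salary, rcond=None)

# Each coefficient is the effect of its variable holding the others constant.
print(coef)  # approximately [10.0, 4.0, 2.5]
```

The coefficient on years (4.0) is the estimated salary change per additional year employed, holding education constant, which is exactly the "holding constant" interpretation described above.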
What You Should Know about Regression Analysis
Secret number 1: Data issues. There is an old joke about a restaurant being panned by its patrons because the food is no good and, anyway, the portions are too small. Similarly, in litigation, debates about sample size, statistical significance, and whether the data are good enough are common. Even when the data are valid and measure what they are intended to measure, real-world shortcomings remain: missing observations, data entry errors, and discontinuities that arise when businesses change their computer or accounting systems midstream. When discontinuities create non-homogeneous data, one cannot necessarily expect the relationship between the dependent variable and the explanatory variables to be constant (which is what the regression analysis seeks to estimate).
With this in mind, counsel are often well served to go beyond the evergreen issues of sample size and sample validity to understand how the expert identified and dealt with any data problems. Combining potentially disparate data may be necessary and proper, but the process should be vetted and understood. The expert’s domain knowledge of the data can help ensure that the data set is assembled in a responsible and unbiased way. Counsel will want to know what adjustments or splices were taken (or not taken) and the impact that an alternative adjustment would have had on the conclusions and inferences.
An issue of data discontinuities that resulted in the exclusion of an expert’s report arose in Reed Construction Data Inc. v. McGraw-Hill Companies, Inc., 49 F. Supp. 3d 385 (S.D.N.Y. 2014). In that case, the defendant was, among other things, alleged to have violated the Lanham Act by improperly accessing a database and using that information to generate false or misleading product comparisons. A regression analysis was performed to compute the price change caused by the alleged conduct. However, the district court held that Reed Construction Data Inc.’s combining of local and national pricing data in the regression analysis was improper. The court concluded that the national and local markets had different characteristics and should not have been combined—at least not as was done by the expert.
Secret number 2: Outliers and influential observations. Many regression techniques work by squaring deviations and finding the line that minimizes the sum of these squared deviations. Squaring means that the effect of outliers is amplified. Consequently, a single observation that is located far from the average can have an outsized influence on the regression results, even to the point of making them misleading.
Figure 2 is identical to figure 1 except for a change to the outlying data point at 20 years. A regression run on this new data set produces results that differ markedly from the original regression.
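The sensitivity to a single point can be illustrated in Python with hypothetical data (not the actual values behind figures 1 and 2). Five observations lie exactly on an upward trend; moving the one outlying point is enough to reverse the estimated relationship.

```python
# A sketch of how one influential point can swing a least squares fit.
# Data are hypothetical; the formulas are the standard closed-form OLS solution.

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

years = [1, 2, 3, 4, 5, 20]
salary_a = [40, 44, 48, 52, 56, 116]   # the 20-year point lies on the trend
salary_b = [40, 44, 48, 52, 56, 20]    # the same point moved far off the trend

print(fit_line(years, salary_a))  # slope of 4.0: salary rises with tenure
print(fit_line(years, salary_b))  # slope turns negative from one changed point
```

Because the deviation at 20 years is squared, the single altered observation flips the fitted slope from positive to negative, turning an apparent pay premium for tenure into an apparent penalty.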
When there are multiple variables, influential data points may not be as evident as is the case in figure 2. The expert should be able to produce diagnostic statistics identifying any influential points and should be able to show how including or excluding those outliers affects the overall results of the analysis.
Results that are contingent on whether the influential data point is included in the analysis may not be reliable for purposes of the litigation. However, throwing out influential observations is not necessarily appropriate, either. Indeed, the outlier may be a critical feature of the data and thus may have a significant impact on the legal case. Determining the correct disposition of the outlier will usually require the expert's domain knowledge.
Secret number 3: Omitted variables. The results of a regression are biased when an important explanatory variable is omitted and that omitted variable is correlated with the included explanatory variables. The regression coefficient will be overestimated if the sign (+ or -) of the correlation between the omitted variable and the included explanatory variable is the same as the sign of the correlation between the omitted variable and the dependent variable. The coefficient will be underestimated if the signs differ.
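The sign rule can be checked numerically. In this Python sketch the data are made up so that the omitted variable is positively correlated with both the included variable and the dependent variable, the same-sign case in which the coefficient is overestimated.

```python
# A sketch of omitted-variable bias with illustrative, made-up data.
# x2 is positively correlated with both x1 and y, so dropping it should
# overstate x1's coefficient (the same-sign case described in the text).

def slope_of(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

x1 = [1, 2, 3, 4, 5, 6]
x2 = [1, 3, 2, 5, 4, 6]          # positively correlated with x1, not identical
y = [5 + 2 * a + 3 * b for a, b in zip(x1, x2)]  # true effect of x1 is 2

biased = slope_of(x1, y)          # "short" regression that omits x2
print(biased)                     # greater than 2 because x2 was omitted
```

The short regression attributes part of x2's effect to x1, so the estimated coefficient comes out well above the true value of 2, exactly the overestimation the sign rule predicts.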
While omitting a relevant variable may induce bias in the remaining variable, this does not necessarily render the analysis unreliable. In Bazemore v. Friday, 478 U.S. 385, 400 (1986), the Supreme Court concluded that the standard of admitting a regression analysis does not require all measurable variables thought to have an effect on the dependent variable to be included; rather, the standard requires the court to consider whether the model is “so incomplete [due to omission] as to be inadmissible.”
Curing the omitted variable problem can be difficult when the omitted variable is difficult to observe or quantify. For example, in a pay discrimination analysis, an individual’s past pay and tenure at a previous employer may have a significant influence on starting pay at a new employer. However, that information may not have been retained by the new employer and may be impossible to obtain from prior employers (especially in the instance of a large class action), and so it might be omitted by the expert even though it may have an effect on pay.
As another example, a firm may accuse another of unfair practices that resulted in price and profit compression. An analysis that seeks to isolate the effect of the alleged practice would have to control for other factors that affect profit margins, such as anticipated technological change and the amount and characteristics of existing and anticipated competition, any one of which may be difficult to quantify. The expert may be able to resolve an omitted variable problem by using so-called instrumental variables. In principle, an instrumental variable is one that is obtainable, correlated with the affected explanatory variable, and uncorrelated with the regression's error term (which absorbs the omitted influence).
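The instrumental variables idea can be sketched with made-up numbers chosen so that the instrument is exactly uncorrelated with the unobserved error. Ordinary least squares overstates the true effect because the explanatory variable is contaminated by the error; the instrumental variable estimator recovers it.

```python
# A sketch of the instrumental variables estimator with illustrative data.
# The true effect of x on y is 2, but x is correlated with the error u,
# so ordinary least squares is biased. The instrument z is constructed to
# be exactly uncorrelated with u, so cov(z, y) / cov(z, x) recovers 2.

def cov(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n

z = [-2, -1, 0, 1, 2]                    # the instrument
u = [1, -1, 0, -1, 1]                    # unobserved error; cov(z, u) == 0
x = [zi + ui for zi, ui in zip(z, u)]    # x is contaminated by u
y = [2 * xi + ui for xi, ui in zip(x, u)]

beta_ols = cov(x, y) / cov(x, x)   # biased upward by cov(x, u)
beta_iv = cov(z, y) / cov(z, x)    # recovers the true effect of 2
print(beta_ols, beta_iv)
```

In practice, of course, the hard part is finding an instrument with these properties in the actual data; the sketch only shows why such a variable, if available, undoes the bias.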
Secret number 4: Overfitting and irrelevant variables. As noted by the Supreme Court in Bazemore, the standard for including variables should not be a kitchen sink approach.
Adding irrelevant variables to the analysis creates a fragile model that fits the sample at hand but predicts poorly. A model that essentially connects the dots on all of the observations may appear to fit all of the known data, but it is likely to fail when asked to make adequate predictions from new data.
Including irrelevant variables can also reduce the precision of the estimates for the other explanatory variables. This loss of precision can lead to the erroneous conclusion that a key variable is statistically insignificant when in fact it is significant. An expert may thus claim that an added variable is critical when its inclusion merely renders the key variable insignificant without improving the overall fit of the model.
Including every variable that might conceivably affect the dependent variable is therefore not a benign practice. Counsel will want a good understanding of why each variable is included in the analysis; variables should earn their place rather than be included out of a false abundance of caution.
Regression is based on certain assumptions about the relationships among the variables. Happily, the regression approach is relatively robust when it comes to violations of its fundamental assumptions. This makes the approach useful in the real world. But this robustness is not infinite, and the expert should be sure to demonstrate, first to counsel and then to the court, how well the analysis fits with the theoretical requirements of the technique, how it is adapted to the facts of the case, and how the common pitfalls discussed here were addressed.
Keywords: litigation, expert witnesses, statistics, regression, Bazemore v. Friday
Frank Pampush, PhD, CFA, is a director at Navigant in its Chicago, Illinois, office. Jeremy Guinta is an associate director at Navigant’s Los Angeles, California, office.
Navigant Consulting is the Litigation Advisory Services Sponsor of the ABA Section of Litigation. This article should not be construed as an endorsement by the ABA or ABA Entities.