February 19, 2016 Practice Points

Data Processing and Hidden Assumptions

An overview of three issues that can arise with data manipulation.

By Frank Pampush

In data, as in life, there is no perfection. One area that can be overlooked in litigation is the pre-processing of raw data for use in downstream analyses. How fundamental data-manipulation issues are resolved can be an important factor in determining the reliability of even the most sophisticated downstream statistical analysis.

There are three data issues that can arise: discontinuities or incompleteness; data that are themselves the result of a process rather than raw measurements; and (in time-series data) data periodicity mismatches.

1. Incompleteness and discontinuities. These can arise when a company has changed its accounting or operations-tracking systems (e.g., from Oracle to PeopleSoft). The before-and-after cutover can produce discontinuities because account or tracking numbers change, and there may be no simple crosswalk between the two because the data are aggregated differently at the most fundamental level. The expert may conclude that pre-cutover data simply cannot be used because any reasonable splicing method would hide (e.g., aggregate relevant with irrelevant data) the data that are most relevant to the case. In other instances, the two data series can be spliced without causing downstream problems, but counsel will still want to know how and why a particular splicing approach was used. Depending upon the nature of the discontinuity, the splice may be made in the downstream statistical analysis itself, such as by using a dummy variable in a regression analysis to account for the discontinuity. The upshot is that counsel should understand what the options were for addressing a discontinuity, why a particular path was taken, and what the outcome would have been under an alternative approach.
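To make the dummy-variable option concrete, the following sketch (in Python, using pandas and statsmodels) regresses a simulated cost series on volume while an indicator variable absorbs a level shift at a hypothetical cutover date. The column names, dates, and simulated values are illustrative assumptions only, not a representation of how any particular expert handled the issue.

    # Minimal sketch: a regression dummy absorbing a system-cutover discontinuity.
    # All names, dates, and values here are hypothetical.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    dates = pd.date_range("2012-01-01", periods=48, freq="MS")  # monthly observations
    df = pd.DataFrame({"date": dates, "volume": rng.normal(100, 10, size=48)})

    # Simulated cost series with a level shift at the assumed cutover date.
    cutover = pd.Timestamp("2014-01-01")
    df["post_cutover"] = (df["date"] >= cutover).astype(int)
    df["cost"] = 50 + 2.0 * df["volume"] + 15 * df["post_cutover"] + rng.normal(0, 5, 48)

    # The dummy variable captures the discontinuity so the estimated relationship
    # between cost and volume is not distorted by the change in tracking systems.
    model = smf.ols("cost ~ volume + post_cutover", data=df).fit()
    print(model.params)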

It is not obvious when missing data, or records with incomplete fields, will cause a problem. One approach (and the default in many statistical packages) is to drop incomplete cases. But dropping incomplete cases can produce misleading results if the remaining records do not represent the characteristics of the population. For example, if survey responses were submitted by one class (e.g., men) and not by another (e.g., women), but the topic of interest included men and women, the data would not be representative. An expert will have to decide (possibly in consultation with company subject-matter experts) whether to splice, interpolate, or impute observations for the missing data, and which approach is best suited to the facts of the case.
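The following sketch (in Python, using pandas, with hypothetical survey figures) illustrates how the treatment of incomplete records can move a headline number: dropping incomplete cases over-represents one group, while a simple group-mean imputation keeps both groups in the calculation. The data and the choice of mean imputation are assumptions made only for illustration.

    # Minimal sketch: listwise deletion versus group-mean imputation.
    # The survey data below are hypothetical.
    import numpy as np
    import pandas as pd

    survey = pd.DataFrame({
        "respondent": ["m1", "m2", "m3", "f1", "f2", "f3"],
        "group": ["men", "men", "men", "women", "women", "women"],
        "spend": [120.0, 135.0, 110.0, np.nan, np.nan, 95.0],  # most responses from women are missing
    })

    # Default in many packages: drop incomplete cases. The remaining records
    # over-represent men, so the overall mean is pulled toward their spending.
    complete_only = survey.dropna(subset=["spend"])
    print("listwise deletion:", complete_only["spend"].mean())

    # One alternative: impute within each group so both groups stay represented.
    imputed = survey.copy()
    imputed["spend"] = imputed.groupby("group")["spend"].transform(lambda s: s.fillna(s.mean()))
    print("group-mean imputation:", imputed["spend"].mean())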

2. Data that are the result of a process rather than raw measurements. Issues can also arise with data such as accounting data, where figures such as plant value accounts are not direct measurements of a physical process or of an economic transaction. They are instead accounting constructions based on accounting principles and rules. This might be fine for some analyses, but not for others. For example, an expert may conclude that a firm happens to be worth its book value, but that conclusion is the outcome of his or her analysis involving various tests and the use of market-based data; it would not be based on the accounting data except as an input into that analysis.

3. Periodicity mismatches. These may occur when some data, such as the book data on plant accounts, are available quarterly, yet the more relevant analysis is monthly. The expert may seek to create a monthly variable via interpolation. Whether this is a good idea depends on the structure of the downstream analysis and on the output of interest. If the important output is a point estimate, and if the interpolated variable would (in real life) be expected to evolve in a way that is reasonably characterized by the interpolation, its use may be acceptable. Interpolated data could, however, affect the statistical significance of results in unanticipated ways relative to the result that would occur were monthly data actually available.
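A minimal sketch of the interpolation step appears below (in Python, using pandas), converting a quarterly plant-account series to monthly frequency by linear interpolation. The balances are hypothetical, and whether linear interpolation reasonably characterizes how the series actually evolves is precisely the judgment described above.

    # Minimal sketch: upsampling a quarterly series to monthly by linear interpolation.
    # The plant-account balances below are hypothetical.
    import pandas as pd

    quarterly = pd.Series(
        [1000.0, 1040.0, 1100.0, 1130.0],
        index=pd.period_range("2015Q1", periods=4, freq="Q").to_timestamp(),
        name="plant_balance",
    )

    # Insert month-start dates between the quarterly observations, then fill the
    # gaps linearly. Other interpolation methods are possible and may track the
    # underlying economics better or worse.
    monthly = quarterly.resample("MS").asfreq().interpolate(method="linear")
    print(monthly)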

Conclusion
Downstream statistical analyses can command attention because that is where the estimate of the variable of interest to the litigation is actually produced. But the upstream data pre-processing step can be as important to the end result as the analysis itself. It is important to understand how the experts addressed tricky data issues, and how the use of alternative approaches would have affected the results of the analysis.

Frank Pampush, Ph.D., CFA, is with Navigant Economics in North Carolina.

Keywords: expert witnesses, litigation, downstream data, data processing, downstream statistical analysis

Navigant Consulting is the Litigation Advisory Services Sponsor of the ABA Section of Litigation. This article should not be construed as an endorsement by the ABA or ABA Entities.

Copyright © 2016, American Bar Association. All rights reserved. This information or any portion thereof may not be copied or disseminated in any form or by any means or downloaded or stored in an electronic database or retrieval system without the express written consent of the American Bar Association. The views expressed in this article are those of the author(s) and do not necessarily reflect the positions or policies of the American Bar Association, the Section of Litigation, this committee, or the employer(s) of the author(s).