The Danger of Forensic Measurement Results: Accuracy, Reliability, and Illusions of Certainty
Many believe that measurement is a mechanical process: perform the measurement, and the result is the quantity's value. Unfortunately, no matter how sophisticated or competently performed, measurement does not yield a quantity's value. For measured results to be useful, forensic scientists must explain the relationship between the values measured and those actually attributable to the quantity measured.
This is typically addressed through testimony characterizing measured results as “accurate and reliable” when obtained in a scientifically rigorous manner. When uttered by an expert, this characterization assures fact finders that the result can be trusted as representing the measured quantity’s value. Despite the confidence it may bestow, however, it actually says relatively little about the conclusions a result supports. In fact, the characterization may do more to mislead fact finders than to facilitate the discovery of truth.
This was illustrated during a week-long hearing in Washington State in 2009. The head of Washington’s breath test program was provided with the results of a breath test whose values either equaled or exceeded the per se limit (0.08 g/210L). The results were provided as is typically done in courtroom proceedings, without their uncertainty. Because the results had been obtained in compliance with strict scientific requirements, the expert correctly determined that they were “accurate and reliable.” Based on this, he concluded that the individual’s BrAC exceeded the per se limit beyond a reasonable doubt. When presented with similar results characterized in this manner, judges and juries typically come to the same conclusion.
The witness was subsequently provided with the result's uncertainty, whereupon he withdrew his previous conclusion. The uncertainty revealed that there was actually a 44 percent likelihood that the individual's BrAC was less than the limit! The expert's initial mistake was not his conclusion that the result was "accurate and reliable." Rather, it was his belief that he could adequately understand the conclusions supported by an "accurate and reliable" result without knowing its uncertainty.
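To see how an "accurate and reliable" result can nonetheless sit almost evenly astride a legal limit, consider a minimal sketch of the underlying arithmetic. The numbers below are hypothetical (the actual figures from the hearing are not reproduced here), and the normal distribution is the conventional modeling assumption for such calculations, not a fact established by the case:

```python
# Minimal sketch: probability that the true BrAC lies below the per se
# limit, modeling the measurement result as a normal distribution (a
# common assumption in uncertainty analysis). All numbers are
# hypothetical illustrations, not the figures from the hearing.
from statistics import NormalDist

limit = 0.08    # per se limit (g/210L)
mean = 0.0803   # hypothetical bias-corrected mean of the measured results
u_c = 0.002     # hypothetical combined standard uncertainty

p_below = NormalDist(mean, u_c).cdf(limit)
print(f"P(BrAC < {limit}) = {p_below:.0%}")  # ~44% for these inputs
```

With these inputs, a result that "exceeds" the limit still leaves a roughly 44 percent likelihood that the true value lies below it.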
Characterization of results as “accurate and reliable” is not a common scientific practice. The reason is that accuracy and reliability are qualitative (and relative) terms that convey little real information. The convention of doing so in expert testimony is the product of court-imposed requirements/rituals. Labeling a result “accurate and reliable” without also providing its uncertainty, however, can be misleading because it invites the fact finder to place greater confidence in a result’s value than is justified by the science it is based upon. As shown by this example, “[a]bsent the reporting of uncertainty, there is a substantial possibility that even an expert would not make a meaningful analysis of a particular breath reading.”
Measurement, Metrology, and Meaning
Two types of empirical activities are generally relied on to gather scientific information: measurement and observation. Observation is a qualitative activity whose results communicate nominal or ordinal properties such as classification or identification. The determination of whether a set of fingerprints or multiple samples of DNA match falls within this category.
Measurement, on the other hand, is a quantitative activity. It is defined as the experimental process whereby the values attributable to a quantity, such as length, weight, or time, are determined. Determinations of the angle at which a bullet entered a wall, of the weight of drugs possessed, and of an individual's blood alcohol concentration are all examples of forensic measurements. As a method for gathering scientific information, measurement is generally favored over observation because the results it generates have greater information content.
Metrology is the science of measurement and its application. Where measurement is concerned, "without metrology there can be no science." William Thomson (later Lord Kelvin), Electrical Units of Measurement, Lecture to the Institution of Civil Engineers, London (May 3, 1883). It encompasses "all theoretical and practical aspects of measurement, whatever the . . . field of application," including forensics. International Vocabulary of Metrology—Basic and General Concepts and Associated Terms (VIM) § 2.1 (Joint Comm. for Guides in Metrology, Int'l Bureau of Weights and Measures 2008). Forensic metrology is simply the application of metrology and measurement to the investigation and prosecution of crime. Metrology provides the framework necessary for developing and critically evaluating measurements.
Before considering this framework, however, a significant question concerning the meaning of scientific propositions must be considered. Do scientific propositions describe physical states of the universe itself or simply our state of knowledge concerning such physical states? If the former, the direct object of a proposition is an external, fully independent reality. If the latter, the direct object is an internal cognitive position that is information dependent. Although the question may seem esoteric, the position adopted has practical implications where measurement is concerned.
In the context of metrology, what we want to know is: When a measurement result is reported, is it to be interpreted as a statement about the physical state (value) of the quantity being measured? Or is it simply an expression of our state of knowledge about the quantity’s physical state (value)? What are the practical implications of the choice made?
Metrology provides two distinct paradigms for the analysis of measured results: measurement error and measurement uncertainty. While the terms error and uncertainty are often used interchangeably, they represent distinct paradigms. Measurement “error is an unknowable quantity in the realm of the state of nature,” while measurement “uncertainty . . . is a quantifiable parameter in the realm of the state of knowledge about nature.”
Measurement Error
The traditional (error) approach to characterizing the limitations associated with a measurement is error analysis. The error of a result is simply the difference between the value of the result obtained and a measured quantity’s true value. Error analysis is based on the premise that if the error associated with a measurement can be determined, a quantity’s value can also be determined. The reality, though, is that the error associated with a particular measurement is unknowable. Accordingly, the goal of error analysis is to obtain an estimate of a quantity’s value as close as possible to its true value by identifying, accounting for, and minimizing as many sources of measurement error as possible.
Two types of error are associated with every measurement: random and systematic. Systematic error, also referred to as bias, is the tendency of a measurement to consistently (on average) under- or overestimate the value of a quantity. Bias leads to measured values that tend to be artificially elevated or depressed compared to a quantity’s true value. Random error is the unpredictable (random) fluctuation of measured values under fixed conditions. The variability it introduces determines a method’s precision (characterization of the dispersion of individually measured values under fixed conditions) and places a limit on repeatability of measured results. Together, systematic and random error make up what is formally known as “measurement error.”
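The distinction between the two error types can be made concrete with a short simulation. This is an illustrative sketch with made-up values, not a model of any particular instrument:

```python
# Sketch: the two types of measurement error. Systematic error (bias)
# shifts every measured value in the same direction; random error
# scatters values unpredictably around the biased center. All values
# here are made up for illustration.
import random

true_value = 10.0   # the quantity's true value (unknowable in practice)
bias = 0.3          # systematic error: consistent overestimation
sigma = 0.2         # standard deviation of the random error

measurements = [true_value + bias + random.gauss(0, sigma)
                for _ in range(1000)]
avg = sum(measurements) / len(measurements)

print(f"average measured value: {avg:.3f}")  # ~10.3: reflects the bias
print(f"random spread (sigma):  {sigma}")    # limits repeatability
```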
Ironically, error analysis doesn’t provide a generally accepted means for combining systematic and random errors to yield an estimate of a measurement’s total error that also indicates how well a measured result corresponds to a quantity’s true value. Instead, the tool most frequently relied upon within error analysis to communicate limitations associated with measured results is the confidence interval. A confidence interval is a range of values symmetrically distributed about the sample mean of a set of measurements accompanied by a probability.
$$I_{\mathrm{con}} = \bar{x} \pm y \quad (99\%)$$
Care must be taken to avoid misunderstandings when interpreting a confidence interval. For example, the probability associated with a confidence interval does not tell us that a quantity's value lies within $\bar{x} \pm y$ with a probability of 99 percent. In fact, it is not a statement about a quantity's value at all but, rather, a statement about the quality of the process leading to a result. It tells us that if one were to conduct multiple sets of measurements and generate a confidence interval for each set, 99 percent of the intervals so generated would overlap the population mean, that is, the mean that an infinite number of measurements would yield.
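This frequentist reading is easy to verify by simulation. A minimal sketch, assuming normally distributed measurements with a known spread (both assumptions made only for illustration):

```python
# Sketch: the frequentist meaning of a 99% confidence interval. If the
# experiment is repeated many times, ~99% of the intervals constructed
# will cover the population mean; no single interval carries a 99%
# probability of containing it. Assumes normal data with known sigma.
import random
from statistics import NormalDist, mean

MU, SIGMA, N = 10.0, 0.5, 20        # population mean, spread, sample size
z = NormalDist().inv_cdf(0.995)     # two-sided 99% -> z ~ 2.576
half_width = z * SIGMA / N ** 0.5

trials, covered = 10_000, 0
for _ in range(trials):
    xbar = mean(random.gauss(MU, SIGMA) for _ in range(N))
    if xbar - half_width <= MU <= xbar + half_width:
        covered += 1

print(f"intervals covering the population mean: {covered / trials:.1%}")
```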
The primary weaknesses of the error approach are that the actual error associated with a measurement is unknowable, and error analysis is generally incapable of revealing how well a particular estimate approximates a measured quantity’s value. In response to this, the scientific community developed an improved approach during the latter half of the twentieth century: measurement uncertainty.
Measurement Uncertainty
Measurement uncertainty incorporates and expands upon the error approach. As with error analysis, the uncertainty approach does not reveal a quantity's true value. Nor does it permit us to know how well a particular estimate approximates that value. Unlike error analysis, though, the uncertainty approach treats the core issue not as error but as the impossibility of ever possessing perfect or complete knowledge concerning a quantity's physical state. Uncertainty reflects the lack of exact knowledge concerning the value of a measured quantity.
Reframing the issue in this manner takes the focus off of physical quantities whose values can never be known (i.e., a quantity’s true value and a measurement’s total error) and places it on something that can be known: what the information in our possession permits us to conclude about a quantity’s value. The goal is the characterization of our state of knowledge about a quantity’s value. The uncertainty paradigm provides the mathematical tools necessary for communicating how justified conclusions about a quantity’s value are given our state of knowledge. Measurement uncertainty quantitatively characterizes the values that the information possessed permits to be reasonably attributed to a measured quantity based upon the results obtained.
When a measurement is performed, an infinite number of values dispersed about its result can be attributed to the quantity measured with varying degrees of credibility. The credibility of a particular value is simply its relative likelihood with respect to all other possible values given our state of knowledge concerning that quantity. These likelihoods provide a measure of how scientifically justified the belief in a particular value is. This permits every measurement result to be modeled as a probability distribution.
The distribution associated with a result characterizes our knowledge of the values attributable to a quantity. It specifies both the values a result permits, as well as their relative likelihoods. This provides an unambiguous map of the inferences supported by a measured result. Importantly, this distribution is not centered on a measured mean but, rather, on the value obtained once the mean of such results has been corrected for bias. The bias-corrected mean of a set of measurements provides the best estimate of a quantity’s true value.
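In the common case where that distribution is taken to be normal (an assumption frequently made in practice, not a requirement of the framework), the model can be written compactly as follows, where $\bar{y}_{\mathrm{corr}}$ denotes the bias-corrected mean and $u_c$ the combined standard uncertainty:

```latex
% Measurement result modeled as a probability distribution (normal case).
% The probability that the measurand Y lies between a and b is the area
% under the curve over that range.
Y \sim \mathcal{N}\!\left(\bar{y}_{\mathrm{corr}},\, u_c^{2}\right),
\qquad
P(a \le Y \le b) = \int_{a}^{b}
  \frac{1}{u_c\sqrt{2\pi}}
  \exp\!\left[-\frac{(y - \bar{y}_{\mathrm{corr}})^{2}}{2u_c^{2}}\right] dy
```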
The entire range of values spanned by this distribution includes many that are quite unlikely. The objective of measurement uncertainty, however, is to convey only those that can be reasonably attributed to a quantity. The measurement's probability distribution provides a conceptually straightforward way of accomplishing this: simply exclude the highly unlikely values in the distribution's tails while retaining enough values that a significant proportion of the distribution remains.
The range of values remaining, along with their combined likelihood, define a coverage interval. The combined likelihood of the values included is referred to as the interval’s associated level of confidence, which conveys how strongly our state of knowledge, given the measured results obtained, permits us to believe that a measured quantity’s value is provided by the coverage interval.
When reported with its uncertainty, a result is typically expressed in terms of a coverage interval as $Y_{99\%} = Y_c \pm U$ (a brief sketch follows the list below). This tells us that:
- the best estimate of the quantity’s value, represented by the bias-corrected mean of measured results, is Yc;
- the values that can reasonably be attributed to the measurand lie within a range from $Y_c - U$ to $Y_c + U$, where $U$ is the result's expanded uncertainty;
- the coverage interval’s level of confidence is 99 percent.
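A minimal sketch of how such an interval is assembled, assuming a normal distribution so that the 99 percent coverage factor is k ≈ 2.58; the bias-corrected mean and standard uncertainty below are hypothetical:

```python
# Sketch: forming a 99% coverage interval from a bias-corrected result.
# Assumes a normal distribution, so the coverage factor k ~ 2.576.
# The input values are hypothetical.
from statistics import NormalDist

y_corr = 0.0826                  # bias-corrected mean (hypothetical)
u_c = 0.0030                     # combined standard uncertainty (hypothetical)

k = NormalDist().inv_cdf(0.995)  # coverage factor for a 99% level of confidence
U = k * u_c                      # expanded uncertainty
print(f"result: {y_corr} ± {U:.4f} (99%)")
print(f"coverage interval: {y_corr - U:.4f} <-> {y_corr + U:.4f}")
```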
Confidence and coverage intervals express quite different things. A confidence interval characterizes the performance of a process used to measure a quantity with respect to the mean of measured values. A coverage interval characterizes a range of values attributed to a quantity itself.
When reported without an estimate of its uncertainty, the values a result permits are, at best, vague. No measurement result is considered complete until its associated uncertainty has been included.
Knowledge of the uncertainty associated with measurement results is essential to the interpretation of the results. Without quantitative assessments of uncertainty, it is impossible to decide . . . whether laws based on limits have been broken. Without information on uncertainty, there is a risk of misinterpretation of results. Incorrect decisions taken on such a basis may result in . . . incorrect prosecution in law. . . .
Guidance for the Use of Repeatability, Reproducibility and Trueness Estimates in Measurement Uncertainty Estimation, ISO 21748:2010 (International Organization for Standardization 2010). As pointed out by the National Academy of Sciences, this applies to all forensic measurements.
Error versus Uncertainty
Although measurement uncertainty is the more advanced approach to characterizing the results of measurement, error analysis is still an acceptable tool within the scientific community. Hence, either may be relied upon. For correct inferences to be drawn, however, forensic scientists must be consistent in the methods employed and testimony provided. Unfortunately, this is not always the case. Despite the distinctions between coverage and confidence intervals, forensic scientists often present confidence intervals in testimony as if they were coverage intervals while being totally unaware of the meaning of the former and the independent existence of the latter.
An advantage of the uncertainty paradigm is that it makes explicit that scientific conclusions are matters of belief subject to uncertainty. What distinguishes beliefs generated by science is their anchoring in the scientific method, which provides quantitative measures of epistemological robustness. Another advantage is that the uncertainty approach permits researchers to consider all of the information at hand, whereas error analysis limits the types of information that may be considered.
The Law
Few jurisdictions require that results of forensic measurements be accompanied by their uncertainty when introduced into evidence. If the admissibility of such results is evaluated at all, it typically revolves around a court’s determination of their reliability. The issue is whether a judge finds the evidence a reliable enough basis for a jury verdict. This serves the important gatekeeping function of minimizing convictions based on scientifically unsound forensic evidence. As illustrated, though, the fact of a result’s accuracy or reliability doesn’t divest it of its potential to mislead. To the contrary, it may increase any tendency for it to do so.
If evidence is to be properly understood and weighed, fact finders must be given the necessary information. Asking them to weigh measurement results without uncertainty renders any determination speculative, even if ultimately correct, as the conclusions supported by the results cannot be understood without it. This undermines confidence in such determinations. “[C]onsidering or not the uncertainty of a critical result can make the difference between acquittal and a guilty sentence.” Walter Bich, Interdependence between Measurement Uncertainty and Metrological Traceability, 14(11) Accred. Qual. Assur. 581 (2009). “Helping the trier to better understand the evidence by giving the trier a meaningful insight into the uncertainty of any scientific measurement can prevent . . . miscarriage[s] from occurring.” Edward Imwinkelried, The Importance of Forensic Metrology in Preventing Miscarriages of Justice: Intellectual Honesty About the Uncertainty of Measurement in Scientific Analysis, VII(2) J. Marshall L. J. 331 (2014).
Federal and most state evidentiary rules require that, to be admissible, scientific evidence must help the trier of fact understand the evidence or determine facts in issue. Although this requirement is interpreted primarily as one of relevancy, it raises other admissibility questions. Assuming that a piece of scientific evidence is relevant, if jurors aren't provided the information necessary to understand which conclusions it supports, can it be deemed "helpful"?
In forensic DNA analysis, random match probability refers to the frequency with which randomly selected samples of DNA from the population would match a sample in evidence. This provides a measure of how strongly a DNA match supports the conclusion that the source of an unknown sample may have been the defendant. The majority of jurisdictions have found that, to be admissible, DNA match evidence must be accompanied by match frequencies because “[t]o say that two DNA patterns match, without providing any scientifically valid estimate of the frequency with which such matches might occur by chance, is meaningless.” Nelson v. State, 628 A.2d 69, 76 (Del. 1993).
The concern in these jurisdictions is not that DNA results are inaccurate or unreliable. Rather, it is that, absent information concerning match frequencies, a jury cannot know how strongly a "match" result supports particular conclusions, rendering its significance a "matter of speculation." State v. Brown, 470 N.W.2d 30, 33 (Iowa 1991); State v. Cauthron, 120 Wn.2d 879, 906–07, 846 P.2d 502 (Wash. 1993).
The same principles apply to measurement results and their uncertainty. Both measurement uncertainty and match frequencies convey information about the conclusions, or limitations thereof, that the science behind a result supports. This is why the National Academy of Sciences (NAS) concluded that the results of every forensic measurement must be accompanied by their uncertainty—so that lay jurors may be able to understand, properly weigh, and arrive at valid conclusions based on them.
The question of whether the admissibility of measured results requires that they be accompanied by their uncertainty was raised for the first time in the courtroom in the last decade. Trial courts in Washington and Michigan, the first to consider it, got it right, ruling that the results of breath and blood alcohol measurements would be inadmissible unless accompanied by their uncertainty. The courts gave significant weight to the NAS conclusion that breath and blood alcohol measurements must be accompanied by their uncertainty to be properly understood and weighed.
While the Michigan rulings still stand, the Washington Court of Appeals ignored science, finding that the admissibility of measured results does not require that a result's uncertainty be determined or reported. Although aware of the NAS opinion, it determined that the conclusions supported by breath test results are apparent standing alone, as long as the results are accurate and reliable. The flaw in this conclusion is demonstrated by the earlier example.
Another example helps reveal how fundamental this flaw is. The evidence before the Washington Court of Appeals included two accurate and reliable breath tests, administered to two different individuals, each of which yielded the identical pair of results: 0.081 g/210L and 0.084 g/210L. The uncertainty of each test, expressed as a coverage interval, differed, however:
Test 1: 0.0749 g/210L ↔ 0.0903 g/210L
Test 2: 0.0764 g/210L ↔ 0.0913 g/210L
The differing coverage intervals reveal that the two identical results do not support identical sets of conclusions. Although the difference may seem unremarkable, it is significant. The first result yields a 19.2 percent likelihood that the individual’s BrAC is less than the legal limit. The second yields only a 9.2 percent likelihood.
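Those likelihoods can be recovered from the reported intervals themselves. A sketch, assuming normal distributions and a 99 percent level of confidence (coverage factor k ≈ 2.58); the interval endpoints come from the case, while the distributional assumption is supplied here for illustration:

```python
# Sketch: back-calculating the likelihood that each BrAC lies below the
# 0.08 g/210L limit from the reported 99% coverage intervals, assuming
# normal distributions (an assumption made here for illustration).
from statistics import NormalDist

LIMIT = 0.08
k = NormalDist().inv_cdf(0.995)   # 99% coverage factor, ~2.576

for label, lo, hi in [("Test 1", 0.0749, 0.0903),
                      ("Test 2", 0.0764, 0.0913)]:
    center = (lo + hi) / 2        # bias-corrected mean at interval center
    u_c = (hi - lo) / (2 * k)     # standard uncertainty implied by the interval
    p = NormalDist(center, u_c).cdf(LIMIT)
    print(f"{label}: P(BrAC < {LIMIT}) = {p:.1%}")  # ~19.2% and ~9.2%
```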
This demonstrates the court’s flawed reasoning in two ways. First, presenting measurement results absent their uncertainty hides the fact that, even though they may exceed a particular limit, there may be a significant likelihood that they are actually less than that limit. Second, by describing identical results identically, it hides the fact that identical results may support importantly distinct sets of conclusions. The only way jurors can know the conclusions supported by a measured result is to be provided with its uncertainty.
Conclusion
At issue is the ability of juries to render verdicts consistent with scientific and factual reality. Reporting the uncertainty of forensic measurements . . . promotes honesty in the courtroom. It is axiomatic that measurements are inherently uncertain. As the Washington cases emphasize, it is misleading to present the trier of fact with only a single point value. There is a grave risk that without the benefit of qualifying testimony, the trier will mistakenly treat the point value as exact and ascribe undue weight to the evidence. The antidote—the necessary qualification—is a quantitative measure of the margin of error or uncertainty.
If the results of forensic measurements are to facilitate the determination of truth in the courtroom, then the law must adhere to the same principles in the interpretation of such evidence as science does. This requires that the admissibility of measured results be conditioned on their being accompanied by their uncertainty. Failure to do so divests such evidence of the fact-finding power science provides and undermines confidence in verdicts obtained in reliance upon it.