For over a hundred years, firearms examiners have testified in criminal trials linking fired bullets or cartridge cases to a particular firearm. Such testimony continues to be in great demand, given the degree of firearm-related violence in the United States, with more than 100,000 requests for bullet or cartridge case comparisons each year. But more recently, the consequences of the uncritical judicial acceptance of firearms comparison testimony have come into sharper focus. Indeed, we now know that firearms evidence has played a role in numerous high-profile wrongful convictions and that multiple forensic laboratories have shuttered as a result of errors by practitioners. In the 2014 per curiam opinion in Hinton v. Alabama, for example, the U.S. Supreme Court reversed a conviction because of a defense lawyer’s inadequate performance in failing to develop firearms evidence at a capital murder trial in response to a state examiner’s comparison. Hinton was subsequently exonerated, and he commented: “I shouldn’t have [sat] on death row for thirty years . . . . All they had to do was to test the gun.”
As detailed in a recent comprehensive review of all judicial rulings in the United States concerning firearms comparison evidence, judges did not anticipate or account for these consequences for many decades. Instead, after some initial skepticism, judges uncritically accepted firearms comparison experts and did not quickly change their approaches—even in the years after the United States Supreme Court’s 1993 decision in Daubert v. Merrell Dow Pharmaceuticals, which imposed clearer and more rigorous gatekeeping responsibilities to assess the reliability of scientific evidence. However, recent scientific scholarship has called into question the validity and reliability of firearms examination—contributing to an explosion of judicial rulings, some of them critical of the field. In a 2008 report, the National Academy of Sciences (NAS) found that “[t]he validity of the fundamental assumptions of uniqueness and reproducibility of firearms-related toolmarks has not yet been fully demonstrated.” In its 2009 report, the NAS concluded that “[s]ufficient studies have not been done to understand the reliability and repeatability of the methods.”
Solidifying this trend, in 2016, the President’s Council of Advisors on Science and Technology (PCAST) reviewed in detail all of the firearms examination studies that had been conducted to date. While numerous studies were reviewed, PCAST found that only one had been appropriately designed to test the performance of firearms examiners, and that study had yet to be published in a peer-reviewed scientific journal. PCAST therefore concluded that “the current evidence falls short of the scientific criteria for foundational validity.” Additionally, numerous research scientists—including psychologists, statisticians, and other academics with training in conducting science rather than applying a forensic technique—have furthered and expanded the NAS and PCAST critiques in recent years by discussing how sampling issues, attrition, error rate calculations, external validity, and the like affect the value of foundational research into the accuracy and consistency of firearms examination methods. Indeed, several such scientists have even testified about the research base of firearms examination during pretrial admissibility hearings or contributed to amicus briefs in support of defense motions to exclude such evidence. Generally, these methods experts reach the same conclusion as the NAS and PCAST: firearms evidence remains scientifically unvalidated. As one judge put it, “[R]arely do the experts fall into such cognizable camps, forensic practitioners on one side and academic researchers on the other.”
These modern critiques have gradually produced concrete results on the admissibility of firearms examination evidence. Judges were slow to react to the scientific concerns raised regarding firearms comparison evidence, even after the Daubert ruling. But as lawyers have litigated the findings of scientific reports and error rate studies with the addition of methods experts, more judges have imposed increasingly stringent limits. Rather than permit the expert to testify to a “source identification,” to the exclusion of all other firearms in the world, judges have instructed that experts must testify to a “reasonable certainty,” a “more likely than not” conclusion, or, still more limited, that they “cannot exclude” a firearm. Such restrictions have become more common since a well-known 2019 ruling in the D.C. Superior Court. One trial judge even excluded firearms expert testimony outright, finding that the field’s methods lack general acceptance given inadequate evidence of reliability. But with the exception of that one ruling, admissibility remains the norm.
If courts have intended such limits on the overstatement of findings to counteract what has been described as the “talismanic significance” that jurors afford forensic testimony, there are reasons to doubt that this compromise solution, which focuses on conclusion language (in combination with vigorous cross-examination), will have its desired effect. One prior study examined how lay jurors react to variations in the testimonial framing of conclusions as well as to confrontation through cross-examination; it found that neither was effective in moderating participants’ guilty verdicts, with the exception of precluding any inclusionary testimony and permitting experts to say only that they could not exclude the firearm in question. But there are also reasons to think judges may examine firearms testimony more carefully in future years and potentially impose greater restrictions or bars on admissibility. On December 1, 2023, Federal Rule of Evidence 702 was amended for the first time since 2000. The Advisory Committee notes emphasized that these revisions were “especially pertinent” to forensic evidence. The rule changes squarely addressed two issues that judges have grappled with in the area of firearms evidence: the reliability of the methods and the overstatement of conclusions.
The effectiveness of other approaches to moderating (by judges) or contesting (by litigants) firearms examination evidence is not well understood. Stakeholders in the criminal legal system will therefore benefit, in navigating the changes to Rule 702, from an expanded understanding of lay reactions to testimony offered by defense rebuttal experts, including testimony raising concerns about the methods used to compare fired munitions. Given trends in judicial regulation of firearms expert testimony, advances in the scientific understanding of that testimony, and changes to Rule 702 itself, we sought to examine how laypeople evaluate firearms testimony when the defense contests it with expert testimony of its own. We turn next to an introduction to how firearms experts conduct their work and testify, before describing our methods and study results.
I. An Introduction to Firearms Expert Testimony
A. Firearms Comparison Methods and Testimony
Firearms examination is a subspecies of toolmark examination, the practice of examining marks to opine on whether they were left on a surface by a particular type of tool or by a particular tool. When conducting comparisons, firearms examiners seek to link crime scene evidence—such as spent cartridge casings or bullets—with a firearm. These examiners assume that the manufacturing processes used to cut, drill, and grind a gun leave distinct and identifiable markings on the gun’s barrel, breech face, firing pin, and other components. When the firearm discharges, those components contact the ammunition and leave marks. Because examiners have long assumed that firearms leave distinct toolmarks on expended munitions, they believe they can use those toolmarks to definitively link spent ammunition to a particular firearm.
When firearms examiners testify as experts, they begin by opining on class characteristics. Class characteristics are design features, such as the shape of the firing pin or the number and direction of the grooves on the barrel of the gun, that vary by manufacturer and type of firearm. Because those design features are shared by all firearms of that type, however, they do not permit any more specific identification of a particular firearm.
So-called “individual” characteristics permit those more searching conclusions. By the late 1990s, firearms examiners premised expert testimony on a “theory of identification” set out by a professional association, the Association of Firearm and Tool Mark Examiners (AFTE). AFTE defines individual characteristics as “[m]arks produced by the random imperfections or irregularities of tool surfaces. These random imperfections or irregularities are produced incidental to manufacture and/or caused by use, corrosion, or damage. They are unique to that tool to the practical exclusion of all other tools.”
In reviewing such class and individual characteristics, AFTE instructs practitioners to use the term “identification” in testimony when they find “sufficient agreement” of markings between the bullets or cartridge cases examined. There are no quantitative guidelines or numeric thresholds for how many individual characteristics must be observed to reach an “identification.” Rather, the AFTE protocol states that an identification is justified “when the unique surface contours of two toolmarks are in sufficient agreement.” As the PCAST Report observed, this is a circular definition, ultimately relying on the expert’s own subjective decision that sufficient commonalities, nowhere defined, exist. AFTE nevertheless associates statistical certainty with “identifications,” claiming that such a conclusion means “the likelihood another tool could have made the mark is so remote as to be considered a practical impossibility.”
Even in the face of criticism of both the “sufficient agreement” standard (as described above) and their assertions of certainty, firearms examiners have largely refused, in recent years, to temper their conclusions. For example, federal experts, following Department of Justice guidelines regarding expert testimony, use the term “source identification” to express their ultimate conclusions; are not prohibited from expressing their conclusions as a practical certainty; and are encouraged, having reached an “identification,” to describe “the probability that the two toolmarks were made by different sources” as “so small that it is negligible.”
More importantly, where firearms examiners have limited their conclusions regarding firearms evidence, they have typically done so for a different reason: judges have ordered them to. Those court-imposed restrictions have included limiting examiners to opinions of “reasonable certainty,” “more likely than not,” “consistent,” or “could not exclude.”
B. Jury Research on Firearms and Forensic Experts
As noted, few studies have examined how laypeople evaluate firearms examination testimony. One prior paper, presenting two studies, found as a preliminary matter that laypeople place great weight on such testimony. In the first study, the authors found that variation in conclusion language (reasonable certainty, more likely than not, source identification, and the like) affected neither guilty verdicts nor jurors’ estimates of the likelihood that the defendant’s gun fired the recovered bullet. In contrast, a more limited conclusion that an examiner “cannot exclude the defendant’s gun” did significantly reduce both guilty verdicts and likelihood estimates. In the second study presented in that paper, the presence of cross-examination largely did not affect these findings.
A small earlier study examined how laypeople evaluate firearms conclusion language, surveying 107 participants and finding “a significant main effect for certainty,” with increased expression of expert certainty generally leading to increased participant certainty. A follow-up study with 437 U.S. participants examined the impact of cross-examination on lay evaluations of firearms testimony. Half of the participants were told the conclusion of a firearms expert, while the other half were given a statement in which the expert acknowledged limitations on cross-examination; neither group was provided with a transcript. As predicted, where the expert had earlier professed certainty, the acknowledgment of limitations on cross-examination reduced the weight participants placed on the evidence.
Finally, a recent study explored the impact of firearms examination testimony on 492 and 1,002 undergraduate psychology students across two experiments. Experiment 1 varied the firearms examination conclusion offered by the prosecution (identification vs. inconclusive vs. elimination) and found statistically significant differences in guilty verdicts across the three conditions. Experiment 2 added a control condition without firearms examination testimony of any kind, as well as a condition in which the firearms examiner considered the evidence unsuitable for comparison, and found that while identification testimony substantially increased guilty verdicts, lay participants treated inconclusive conclusions as essentially neutral (i.e., less inculpatory than identification testimony, more inculpatory than elimination testimony, and equivalent to the unsuitable and no-forensic-evidence conditions).
This small number of studies examining firearms expert evidence follows a larger body of research examining how jurors evaluate other types of forensic testimony. Generally, laypeople place strong weight on forensic science and view it as highly accurate and persuasive. Other studies have found that laypeople are “sometimes insensitive” to variations in the way in which a forensic “match” is communicated using qualitative terms. For example, jurors place great weight on fingerprint evidence and regard it as accurate and reliable, regardless of whether the expert expresses conclusions in more certain or more cautious terms. However, some evidence suggests that the weight mock jurors place on forensic evidence varies depending on the forensic discipline. In addition, multiple research efforts have concluded that cross-examination shows “little or no ability . . . to undo the effects of an expert’s testimony on direct examination.” But specific lines of cross-examination, about error rates and the proficiency of experts as well as subjectivity and bias, appear to buck this trend and do influence laypeople’s evaluations.
One prior study examined the impact of defense expert testimony during a battle of the experts in a criminal case. That study examined three types of rebuttal testimony in a mock trial involving fingerprint expert testimony: (1) a methodological rebuttal explaining the general risk of error in the fingerprint-comparison process; (2) a new-evidence rebuttal concluding the latent fingerprint recovered in this case was not suitable for comparison; and (3) a new-evidence rebuttal excluding the defendant as the source of the latent fingerprint. All three rebuttals significantly altered perceptions of the prosecution’s fingerprint evidence, but the new-evidence rebuttals proved most effective. No such study has been conducted in the context of firearms evidence, that is, one examining the impact of different types of defense expert witnesses on lay perceptions of the evidence.
Other studies have focused on how laypeople evaluate DNA evidence, which is presented using statistical rather than qualitative conclusions. Studies have found that jurors place especially high weight on DNA evidence. However, jurors have been found in a variety of studies to be sensitive to different presentation formats of statistical conclusions in the DNA context, including undervaluing the evidence in some instances, falling prey to logical fallacies of different types, and underestimating the risk of error.
Overall, that body of work suggests that laypeople place great weight on firearms expert testimony and that alterations to the language used to communicate conclusions have little impact. What has yet to be tested is whether, in the specific context of firearms examination testimony, defense experts—including research scientists who explain the weaknesses in the scientific foundation of the relevant methods, or “methods experts”—have any effect on jurors. Methods experts testify almost exclusively in admissibility hearings before judges. They have rarely been called at trial to explain their analyses to jurors. It may take a fair amount of evidence regarding the lack of reliability of firearms methods to moderate jurors’ prior beliefs. It is also unclear whether jurors have the wherewithal to understand the scientific foundation of firearms examination. As one judge noted after conducting an extensive pretrial evidentiary hearing:
[A] full exploration of the issues surrounding the reliability of [firearms examiner] evidence in the present case required several days of testimony from multiple expert witnesses, close evaluation of numerous applied-science studies, exploration into the studies’ design and methodology and the problems arising therefrom, and advocacy by counsel on each side specially tasked with litigating forensic science issues. It would be fanciful to conclude that the normal adversarial process would enable a lay jury to adequately understand these issues . . . .
The present Article reports the results of a study designed to empirically test the impact of defense experts, including methods experts, on jurors. As described below, we report the results of an online study in which 351 jury-eligible adults read transcripts of a criminal trial and rendered a verdict along with several other judgments regarding the guilt of the defendant and the strength and reliability of the prosecution’s evidence.
II. Study Design and Methods
A. Participants
The study participants were recruited through Prolific, a crowdsourcing platform that can produce high-quality data suitable for social science research, and completed the study online. To participate, individuals had to be jury-eligible (i.e., at least 18 years old, a resident of the United States, and able to speak English). The study also included several attention check and reading comprehension questions to ensure participants were engaged with the materials. Participants who failed attention or reading check questions, or who were identified as coming from suspicious, duplicate geolocations, were removed from the study and excluded from subsequent analyses.
The final sample comprised 351 participants, with ages ranging from 18 to 75 (median = 34, IQR = 18). The sample was gender balanced, with 50% self-identifying as male and 50% as female. In terms of self-reported racial and ethnic backgrounds, 10% identified as Black, 9% as Asian, 66% as White, 10% as Hispanic, and 0.3% as Native American or Pacific Islander, with the remaining participants selecting other categories. In terms of education, 54% held a two-year college degree or less, 33% possessed at least a four-year college degree, and 13% held a post-graduate degree. The sample included residents of 43 states.
Participants were asked to self-identify their political preferences: 3.4% identified as “Very Conservative,” 14.8% as “Somewhat Conservative,” 24.5% as “Middle of the Road,” 30.2% as “Somewhat Liberal,” and 26.8% as “Very Liberal.” Additionally, 9% reported an annual household income of less than $20,000, while 15% reported an income exceeding $100,000. Furthermore, 17% reported having previously served on a jury, and most (75%) of those participants reported having served in a criminal trial. When asked which trial error causes more harm in society, 45% thought that “erroneously convicting an innocent person” caused the most harm, 8% thought “failing to convict a guilty person” caused the most harm, and 47% thought “both are equally bad.”
A minority (18%) of participants reported being firearm owners. Approximately one-third of the sample reported being extremely, moderately, or slightly comfortable with firearms; over half of the sample reported being extremely, moderately, or slightly uncomfortable with firearms; and a small minority (12%) were neither comfortable nor uncomfortable. Notably, none of these demographic or individual difference variables appeared to be related to or predictive of the outcome measures in this study (e.g., guilty verdicts).
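To give a concrete sense of the kind of check underlying that observation, the following is a minimal sketch, not the authors’ actual analysis code; the file name and the variable names (verdict, age, gender, race, education, political_preference, firearm_owner) are hypothetical. It illustrates one conventional approach, a logistic regression of guilty verdicts on demographic and individual-difference variables, in which non-significant coefficients would be consistent with the observation that these variables did not predict the study’s outcome measures.

# Illustrative sketch only; file and variable names are hypothetical,
# and this is not the Article's actual analysis code.
import pandas as pd
import statsmodels.formula.api as smf

# One row per participant; "verdict" coded 1 = guilty, 0 = not guilty.
df = pd.read_csv("participants.csv")

# Regress guilty verdicts on demographic and individual-difference variables.
model = smf.logit(
    "verdict ~ age + C(gender) + C(race) + C(education)"
    " + C(political_preference) + C(firearm_owner)",
    data=df,
).fit()

# Non-significant coefficients would indicate that these variables
# do not predict guilty verdicts.
print(model.summary())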
B. Procedure
After consenting to participate in the study, participants provided the demographic information reported above. In addition, participants answered the question shown in Figure 1, adapted from Koehler, about the false positive error rate:
Figure 1. Question to Participants About the False Positive Error Rate
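Although the precise wording of the item is set out in Figure 1 and is not paraphrased here, the quantity it asks about, the false positive error rate, has a standard definition: the proportion of comparisons involving ammunition actually fired by different firearms in which the examiner nevertheless reports an identification. Stated as a formula (an illustrative formulation, setting aside how inconclusive responses are treated, and not the text of the figure):

\[
\text{false positive error rate} = \frac{\text{false identifications}}{\text{false identifications} + \text{correct exclusions}}
\]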