What Is Good Science?
First, we provide some definitions of the terms used below. “Non-science”—as one would expect—refers to pursuits that by both intent and necessity do not follow scientific methodology. (Examples include law, art, religion, and public policy.) “Pseudoscience” pertains to endeavors that do not follow scientific methodology consistently or at all, but can be confused with science because they project a scientific façade (e.g., through the use of graphs and other seemingly technical displays). Examples of pseudoscience include astrology and creationism. “Bad science” occurs when accepted scientific methods are followed but yield—by error, happenstance, or bias—erroneous results. The “cold fusion” fiasco of 1989 is an example of good scientists gone awry and producing bad science.
What constitutes good science and the underlying “scientific method” has varied through history. The Daubert standard is most closely aligned with the philosophy of science espoused by Karl Popper. According to Popper, the scientific method includes development of testable hypotheses, design and implementation of studies to test these hypotheses, and re‑formulation and retesting of hypotheses as new information becomes available. Other components of the scientific method include open debate within the scientific community and peer review prior to publication to minimize author bias and errors.
Strictly speaking, hypotheses can be tested but never “proved” using Popper’s scientific method. Rather, hypotheses can only be shown to be false (i.e., disproved). This dimension of Popper’s philosophy has profoundly affected current views of science. Thus, science pertains only to assertions (framed as testable hypotheses) that are disprovable. By this definition, regulatory policies that cannot be disproved—for example, the assertion that an additional cancer risk of one in a million (10⁻⁶) is “safe”—express value judgments that, although perhaps informed by science, are themselves non-science. Popper’s emphasis on disproof also affects scientists’ views of the quality of science; hypotheses are considered scientifically tenable only after likely alternative hypotheses have been disproved.
Regardless of the nuances of scientific philosophy, several readily recognizable and broadly accepted characteristics differentiate good science from non-science and pseudoscience. Good science is progressive. It accepts new data and willingly replaces old ideas and concepts with new and better ones. Good science is rigorous. It is based on development of hypotheses with roots in existing knowledge and on experimental designs that can test hypotheses. Good science is transparent. It is based on methods that are well described in the published literature and can be replicated by others. Finally, good science is explicitly non-biased. It is based on fact rather than on ideology and personal preferences. (For a further introduction to the philosophy of science, we recommend Goodstein, D. 2000, How Science Works, or the Federal Judicial Center’s Reference Manual on Scientific Evidence, 2d Ed., at pp. 67–83. For a general discussion of pseudoscience, cold fusion, and the scientific method, the Wikipedia discussions are quite good.)
What Are ERAs and NRDAs?
Site characterization and selection of remedial alternatives under the Comprehensive Environmental Response, Compensation, and Liability Act (CERCLA), the Resource Conservation and Recovery Act (RCRA), and equivalent state programs rely heavily on ERAs to identify concentrations of contaminants in soil, surface water, and sediment that are protective of ecological resources. The results of ERA are, in turn, often used as the basis for natural-resource-damage claims for injury to ecological resources and to human uses of those resources. On the surface, ERAs appear to many regulators, practitioners, and responsible parties as meeting the criteria for good science. ERAs typically include extensive calculations, statistical analyses, descriptions of hypotheses, and references from the scientific literature. They are often conducted by technical experts with advanced degrees in the relevant sciences. Nevertheless, by intent and design, ERAs do not adhere to the scientific method.
ERAs generally consist of two tiers of evaluation. For the first tier, the screening assessment is conservatively biased in a compounded manner. Maximum concentrations of contaminants detected in environmental media are compared to “ecological screening values,” which are generally the lowest (i.e., most conservative) concentrations at which adverse effects to ecological receptors have been reported in the literature. In effect, the screening assumes that ecological receptors are exposed only to the maximum concentrations and that all receptors are as vulnerable as the most sensitive receptor. These conservative biases are intentional to minimize the chance of missing potential impacts and reflect a policy judgment essentially to place a “thumb on the scale” in favor of the most protective standard.
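The compounded conservatism of this first tier can be sketched in a few lines of code. The fragment below is purely illustrative: the contaminant concentrations and screening value are hypothetical numbers, not drawn from any actual site or guidance document.

```python
# Illustrative sketch of a first-tier ecological screening: the MAXIMUM
# detected concentration is compared to the LOWEST reported effect level.
# All numbers are hypothetical.

# Hypothetical detected concentrations of one contaminant (mg/kg) across samples
detections = [12.0, 18.5, 9.3, 41.0, 15.2]

# Hypothetical ecological screening value: the lowest effect concentration
# reported anywhere in the literature (i.e., the most sensitive endpoint)
esv = 20.0

max_conc = max(detections)   # worst-case exposure assumption
exceeds = max_conc > esv     # one exceedance retains the contaminant for tier 2

print(f"max detected = {max_conc} mg/kg; ESV = {esv} mg/kg; "
      f"retained for further evaluation: {exceeds}")
```

Note the compounding: even if the median sample is well below the ESV, a single maximum detection triggers retention, and the ESV itself already reflects the most sensitive receptor reported.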
For the second tier of the ERA process, the baseline assessment, the conservative assumptions can be relaxed in favor of less conservative but still biased scenarios. These baseline ERAs call for development of risk “hypotheses” and design of studies to address those “hypotheses.” This language suggests a scientific process. Notwithstanding the nomenclature, subsequent studies to address these “hypotheses” are descriptive in nature and do not meet the criteria for the scientific method. Studies typically undertaken include collection of biological samples for chemical analysis, characterization of structure and function of benthic invertebrate communities (i.e., organisms that live in sediment), and toxicity tests. The information gathered is important in assessing risks, but no scientific hypotheses are rigorously tested or disproved.
In contrast to the ERA, the injury-assessment phase of NRDA is purportedly intended to be an accurate assessment of injuries to ecological systems and their ability to provide services. In our experience, however, natural-resource trustees routinely rely on the results of ERAs as prima facie evidence of actual injuries, and injury assessments typically employ the same intentionally conservative methods developed for ERAs.
ERAs and NRDAs Are Conservative Regulatory Policy, Not Science
As should be clear from the preceding, the underlying assumptions of ERAs and, by extension, many NRDAs differ from, and are sometimes antagonistic to, those of good science. Notably, whereas good science vigorously attempts to be unbiased and eliminate ideology, ERAs explicitly embrace conservative bias as appropriate. While good science demands that results be replicable, ERAs routinely rely on worst-case exposure scenarios and most sensitive toxicological responses that are, by their very nature, outlier data that are not corroborated by other results. Whereas good science should be transparent, ERAs are inherently obscure because there are no concrete definitions of unacceptable ecological risk. Thus, none of the broadly accepted characteristics of good science, outlined earlier, are generally met during the ERA process.
This description is not intended as a criticism of the ERA process. The ERA process is public policy, and as such, must include judgments that exceed the limits of the questions that can be answered by good science. Indeed, science itself is not competent to address such issues as “how safe is safe.” The problem lies in the commonly held misconception that ERA is a robust scientific process that, by testing hypotheses, can demonstrate a high likelihood or actual occurrence of ecological impacts. This view of the process is simply inaccurate. Given the high levels of overt bias and conservative ideology deliberately included in the ERA process, a conclusion of “unacceptable ecological risk” often really means “we cannot be certain that there are no impacts” rather than “unacceptable impacts will occur or are likely to occur.” Consequently, conclusions of unacceptable ecological risk are not strong evidence of actual, significant injuries in NRDA.
That ERAs and good science occur in very different universes is illustrated by the widespread use of ecological screening values (ESVs) that, from the viewpoint of rigorous science, range from questionable to ridiculous. For much of the 1990s, for example, Environmental Protection Agency (EPA) Region 3 promulgated ESVs for soil that were based on unknown methods and the overriding goal of conservatism. Many of these values were within or below naturally occurring soil concentrations—a fact that conflicts with common sense and the scientific theory of evolution and natural selection. Most notable was an ESV of 50 mg/kg for aluminum, which is about 1,000 times below naturally occurring background concentrations. In other words, this ESV would require remediation for aluminum to levels 1,000 times below the concentrations naturally present before the presence of the first human beings in North America. Explanations that these values have now been rescinded have an “It’s true I’ve stopped beating my spouse, regularly, in public” quality. That such obviously wrong values, from a scientific perspective, could be generated, promulgated, and persist in public for about a decade demonstrates that the ERA process falls well outside the realm of good science. Because the overriding goal of the ERA process is to avoid underestimating risk, even literally incredibly low ESVs and very improbable risk scenarios are not considered evidence of technical deficiency. That stark difference between ERAs and good science persists today. For example, the EPA recently generated an ecological soil screening level (EcoSSL) for vanadium that is lower than about 90 percent of naturally occurring background levels.
The Non-Science of Assessing Risks and Injury to the Benthos
As an illustration of these issues, we consider next the pseudoscience of assessing risk and injury to aquatic sediments and benthos. Benthos are biota, usually macroinvertebrates, that live in or on the sediments at the bottom of aquatic systems. Although some benthos (e.g., oysters, crabs, and lobsters) are themselves economically important, the primary ecological and economic importance of benthos pertains to their role as food for fish and wildlife. In addition to being ecologically important, benthos are also very susceptible to harm by toxic chemicals because they are highly exposed to legacy contaminants that persist in aquatic sediments. Hence, CERCLA/RCRA investigations and associated NRDAs often focus on assessment of risks and injuries to benthos. Compared to impacts in terrestrial areas, impacts to sediments also tend to be more extensive and more costly both to remediate and to resolve through NRDA claims.
The “sediment triad” is often used to evaluate risk and injury to benthos. The triad consists of (1) characterization of the structure and function of benthic communities, (2) laboratory bioassays that expose surrogate species to sediments collected from the assessment area, and (3) comparison of chemical concentrations to ecological benchmarks.
This approach appears to approximate the scientific method, but falls short on myriad fronts. Appropriate characterization of benthic communities depends on numerous factors, including substrate size, time of year, temporal variation, physical stressors, presence of substances other than contaminants of concern, and identification and characterization of appropriate reference sites that have characteristics similar to the assessment area. Laboratory bioassays, the second leg of the triad, are of limited value for several reasons: a) they cannot identify risk and injury attributable to specific contaminants; b) results of the tests can be confounded by numerous factors other than contaminant concentrations; c) the tests often do not demonstrate a clear dose‑response relationship (which is another indication of confounding effects); and d) results of laboratory tests do not always easily translate to actual site conditions.
The Pseudoscience of Assessing Risks and Injury to the Benthos
Widely used sediment benchmarks may be the most problematic component of the sediment triad, which is unfortunate because this leg of the triad is frequently the most heavily weighted. Precise assessment of risks and injuries to aquatic benthos based on chemical concentrations is very difficult. Sediment toxicity is a complex function of contaminants’ inherent toxicity, bioavailability, and biotic sensitivity, all of which vary across sediment type, specific chemical type, and benthic species. Although the EPA has been trying to develop scientifically defensible sediment benchmarks since the early 1990s, it still has not issued definitive benchmarks and, given the complexity of sediment toxicity, may never do so.
Desperate for toxicologically based sediment benchmarks—any benchmarks—with which to evaluate sediment toxicity, a variety of federal and state regulatory agencies took matters into their own hands. Using non-scientific methods and field-contaminated sediments, these agencies developed a variety of so-called co-occurrence sediment quality benchmarks (CoSQBs). Commonly used CoSQBs include the National Oceanic and Atmospheric Administration ER-L and ER-M values, Florida’s TEL and PEL values, Ontario’s LEL and SEL values, and the TEC and PEC values. Use of these CoSQBs is strongly recommended in the guidance of many states and EPA regions. It is important to note that the CoSQBs are widely discussed and ostensibly vetted in the peer-reviewed literature. (Specific references and detailed descriptions of the most widely used CoSQBs can be found in MacDonald, D.D., C.G. Ingersoll, and T. Berger. 2000. Development and evaluation of consensus‑based sediment quality guidelines for freshwater ecosystems. Arch. Environ. Contam. Toxicol. 39:20–31.)
As implied by the name, the CoSQBs are supposedly based on the coincidence of concentrations of potentially toxic chemicals with observed impacts on benthos. “Impacts” are typically negative effects noted in sediment bioassays conducted in the laboratory (although the absence of benthic species in a sample was also considered evidence of impacts in the calculation of some CoSQBs).
A critical aspect of CoSQBs is that the sediments used in their derivation were contaminated in the field, including anthropogenically impacted sediments from large harbors and near large cities. Use of field-contaminated sediments, as opposed to clean sediments spiked with specific amounts of specific chemicals, is simultaneously seen as both a key advantage and a critical failing of CoSQBs. On the plus side, the resulting information considers the potential effects of chemicals at concentrations and in combinations that occur in the field. On the negative side, the presence of multiple potential causal agents in the same sediment makes identification of a specific causal agent, or agents, all but impossible.
The potential for spurious relationships between toxicity and any one chemical is especially problematic for sediments from working harbors and urban areas. Sediments near large urban areas typically have multiple sources of many toxic chemicals. Once released to the environment, many of these toxicants will tend to accumulate in the same fine‑grained, organic‑rich sediments. Hence, sediments with elevated concentrations of one heavy metal will often have high concentrations of other heavy metals and other toxic chemicals (e.g., polycyclic aromatic hydrocarbons [PAHs] and pesticides). These chemical stressors also tend to co‑occur with high levels of other stressors (e.g., ammonia, hydrogen sulfide, and low dissolved oxygen). Citing the complexity of sediment toxicity, inadequacy of then-available science, and the need to be conservative, the generators of CoSQBs made little or no attempt to identify the causal chemical(s)—or for that matter, any non-chemical causes—in sediments where impacts to benthos were observed.
The potential for spurious relationships between observed impacts and any specific chemical was exacerbated because, as a conservative measure, the same impacted sediment sample was used as “evidence” of toxic effects for as many as 10 or more different chemicals. For example, the same impacted sediment sample would be used as evidence of the toxicity of all of the heavy metals and some of the co-occurring organics (e.g., PAH, PCBs, and pesticides).
In effect, the methodology assumes that all observed impacts are 1) due to chemical toxicity; 2) due to a chemical that was analyzed; and 3) due to all of the chemicals that were analyzed. These three assumptions progress from uncertain to unlikely to not credible, respectively. In reality, impacts observed in some sediments were not due to chemical toxicity at all. Further, some impacts that were due to chemical toxicity were due to unmeasured chemicals. And refined analyses of sediment samples indicate that toxicity is usually due to one or two chemicals or chemical groups, not the 5 to 15 different chemicals assumed by the CoSQB calculation methods.
Even less credible are instances in which these conservative assumptions are applied to calculation of CoSQBs based on the presence or absence of specific benthic species. To be conservative, calculation methods assumed that the presence or absence of a specific benthic species, in a specific sediment sample, was due entirely to the toxicity from chemicals measured in that sample. This highly conservative assumption ignores the fact that the distribution of benthic species across different sediment habitats depends on a host of physical and biological factors. In short, this assumption ignores the entire science of ecology (which can be defined as determination of factors controlling the abundance and distribution of biota). By analogy, if the CoSQB methods had been used to develop soil benchmarks for terrestrial plants, these conservative methods would have attributed the absence of bald cypress from the desert and hothouse orchids from the tundra to toxicity exerted by contaminants in underlying soils.
The inherent problems with the methods described above should be apparent to scientists and non-scientists alike. One simply cannot estimate the toxic threshold for a specific contaminant by looking at effects caused largely, or entirely, by other factors. By analogy, one could not hope to gauge accurately the toxicity of cigarettes by compiling the smoking history of people who had died of all causes (smoking-related or not). What one would end up with would be an extremely conservative, or in the terminology of good science, “wrong” estimate of cigarette toxicity based on the background incidence of smoking.
Testing the Pseudoscientific CoSQBs with Good Science Methods
The above discussion points out significant issues with the “science” underlying the CoSQBs. Using an approach that is based on good science, we conducted a series of analyses designed to test the toxicological basis of CoSQBs. (These and the subsequently described results are based on a series of analyses by the authors. More detailed discussion, methods, and scientific backup can be obtained from the following or by contacting the first author.) As a first step, we hypothesized that CoSQBs for a specific compound could not be based on the actual toxicity of that chemical. Instead, it was predicted that CoSQBs would be based on the compound’s background incidence (as with the smoking analogy above). To test this hypothesis, we compared CoSQBs for specific metals versus their median concentrations in sediments from various locations across the United States. As predicted, median background concentrations for metals were highly significant (Figure 1, r² values of 98 to 99.6 percent) predictors of the values of the primary CoSQBs (ER‑L, LEL, TEL, and TEC). In other words, there are very tight relationships between the relative magnitude of a CoSQB and its ambient concentrations in the environment.
Figure 1. Relationships between two types of CoSQBs (ERL and LEL) and median background concentrations for metals. Median background concentrations from Rice 1999, Environ. Sci. Technol. 33: 2499–2504.
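The regression test described above can be sketched in code. This is a minimal illustration of the approach, not the authors’ actual analysis; the background concentrations and benchmark values below are hypothetical, chosen only to show how an r² is computed when a benchmark largely tracks ambient levels.

```python
# Sketch of regressing benchmark values on median background concentrations.
# All data are hypothetical and for illustration only.

def r_squared(x, y):
    """Coefficient of determination for a simple linear fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    return (sxy ** 2) / (sxx * syy)

# Hypothetical median background concentrations (mg/kg) for several metals
background = [0.3, 1.1, 25.0, 48.0, 90.0]
# Hypothetical benchmark values that largely track the background
benchmark = [0.6, 2.0, 46.0, 95.0, 170.0]

print(f"r^2 = {r_squared(background, benchmark):.3f}")
```

When a set of benchmarks is essentially proportional to ambient levels, as in this contrived example, the r² approaches 1, which is the signature the authors report for the real CoSQBs.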
In the second step of our analysis, we obtained a database of sediment toxicity and chemistry that was used to generate one set of commonly used CoSQBs. First, the data were used to determine which chemical(s) were actually causing toxic impacts in a given sample. The causal agents could be identified by comparing their concentrations to known toxic levels that have been determined in laboratory experiments and were available in the scientific literature. As hypothesized, the results demonstrated that the CoSQBs had been based on sediment concentrations that were largely, and for some chemicals, always below actual toxic levels. Second, the data set was again (as above) used to test whether the CoSQBs reflected background concentrations across all the samples. As in the previous analysis, the magnitude of a CoSQB, for organics as well as metals, was predictable by the ambient concentration in the data sets. That is, chemicals that had high ambient concentrations in the environment had high CoSQBs and vice versa for rarer chemicals.
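The causal-agent check described above amounts to a simple comparison of measured concentrations against independently established toxic levels. The sketch below is hypothetical in every particular (the chemical names, thresholds, and sample values are invented for illustration):

```python
# Hedged sketch of a causal-agent screen: a chemical is a plausible cause of
# observed toxicity only if its concentration meets or exceeds a toxic level
# established independently in spiked laboratory studies. All values invented.

# Hypothetical laboratory-derived toxic levels (mg/kg)
toxic_levels = {"cadmium": 10.0, "copper": 400.0, "zinc": 800.0}

# Hypothetical measured concentrations in one impacted sediment sample
sample = {"cadmium": 1.2, "copper": 950.0, "zinc": 120.0}

plausible_causes = [chem for chem, conc in sample.items()
                    if conc >= toxic_levels[chem]]

# Only chemicals at or above known toxic levels are retained as causes; the
# rest merely co-occur with the impact and cannot explain it.
print(plausible_causes)
```

In this toy sample only one chemical clears its toxic level, illustrating the authors’ point that toxicity in a given sample is usually attributable to one or two chemicals rather than to every analyte measured.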
These very tight relationships between ambient concentrations of different chemicals and their CoSQBs demonstrate that these supposed toxic values cannot, in fact, reflect relative toxicity. There is no reason to believe that the relative toxicity of different chemicals would scale more or less linearly with their relative abundance in sediment. Rare chemicals are not necessarily more toxic than common ones, or vice versa. Therefore, the tight relationships between background concentrations and CoSQBs disprove the hypothesis that they are based on toxicity. Rather, the CoSQBs are most parsimoniously interpreted as simply indices of background concentrations.
The irrelevance of the toxicity of a chemical to its CoSQB (which, remember, is supposedly a toxicological value) was further demonstrated by another analysis. In this analysis, a set of real CoSQBs were compared to randomly generated CoSQBs. (Although developed by the Ontario Ministry of the Environment, the Ontario CoSQBs are widely used and embedded in the regulations of several states, including New York and New Jersey.) To generate the “random” CoSQBs, chemical values were first randomly selected from the same data set of chemical concentrations used to generate the real CoSQBs. Critically, the “impacts” on benthos associated with these chemical concentrations—the toxic effects that supposedly were determinative in estimating the real CoSQBs—had no influence whatsoever on these random samples. After this, “random” CoSQBs for individual chemicals were derived using the same calculation methods as the real CoSQBs. In support of our hypothesis, the randomly generated CoSQBs closely resembled the real CoSQBs, meaning that essentially the same CoSQBs are generated whether or not the impacts are considered (Figure 2). In essence, the random CoSQBs are an experimental control that demonstrates that the impacts had no significant effect on the real CoSQBs.
Figure 2. Relationships between actual LEL (lowest effect level) and actual SEL (severe effect level) values versus randomly generated LEL and SEL values.
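The logic of this random-control analysis can be illustrated with a short simulation. Everything here is hypothetical: the concentration distribution, the “impact” flags, and the simplified low-percentile rule stand in for the actual CoSQB calculation methods, which are more involved.

```python
import random

# Sketch of an experimental control: derive a benchmark once using recorded
# "impact" flags and once using randomly shuffled flags, then compare.
# Data and the benchmark rule are hypothetical simplifications.

random.seed(0)

# Hypothetical sediment concentrations (mg/kg) for one chemical
concs = [round(random.lognormvariate(3.0, 0.8), 1) for _ in range(200)]
# "Impact" flags assigned independently of concentration (as when impacts are
# driven by other stressors); roughly a third of samples are flagged
impacts = [random.random() < 0.35 for _ in concs]

def benchmark(concs, flags, pct=0.05):
    """Simplified co-occurrence rule: a low percentile of the concentrations
    found in 'impacted' samples."""
    hits = sorted(c for c, f in zip(concs, flags) if f)
    return hits[int(pct * len(hits))]

real = benchmark(concs, impacts)
shuffled = impacts[:]
random.shuffle(shuffled)
rand = benchmark(concs, shuffled)

print(f"benchmark with recorded flags: {real}")
print(f"benchmark with shuffled flags: {rand}")
# When impacts are unrelated to the chemical, both values are just low
# percentiles of the same ambient distribution.
```

Because the flags carry no information about the chemical, shuffling them changes little: either way the “benchmark” is a low percentile of background, which is the authors’ point about the random CoSQBs.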
Contrary to conventional wisdom and frequent assertion, then, these CoSQBs are not really based on the co-occurrence of impacts and chemical concentrations. There apparently is no “Co”—that is, co-occurrence—in CoSQBs. Instead, they are just semi-randomly generated values based on ambient concentrations. More simply, CoSQBs are background numbers.
The story of CoSQBs illustrates much of the ambiguity about good science, generally, and specifically with respect to its relevance to ERA and NRDA. These supposed toxicological values have elements of non-science, pseudoscience, bad science, and even good science, because they are published and discussed in the scientific literature. This pretense to good science is frequently cited to justify their widespread use in both ERAs and NRDAs. However, as demonstrated above, CoSQBs are not science and, given the widespread misapprehension that they are based on toxic effects, arguably not even good regulatory policy.
Conservative regulatory policy, not science, is the underlying basis of ERAs and related studies in NRDA. This policy judgment may be appropriate because scientific methods are not capable of addressing critical issues such as “how safe is safe.” However, it becomes problematic when regulators, counsel, and the courts confuse ERA and its results with good science and when these intentionally biased methods are misapplied to injury assessments in NRDA.
The confusion between non-scientific regulatory policy and good science is exemplified by the widespread use of CoSQBs to assess risks/injuries to benthos. Although CoSQBs may appear to non-experts as good science, they were not derived by rigorous application of the scientific method, and they lack necessary characteristics of science. Worse, most users believe that they have a toxicological basis rooted in co-occurrence with negative effects on benthos, but that supposed link is apparently baseless. Expert testimony regarding risks/injuries that relies on CoSQBs is not good science.