Stats and Methods Urban Legend 4: Effect Size vs. Hypothesis Testing
In what I can only assume is a special issue of Organizational Research Methods, several researchers discuss statistical and methodological myths and urban legends (MUL) commonly seen in the organizational sciences (for more introduction, see the first article in the series). Fourth and final in this series: Cortina and Landis1 write “The Earth is Not Round (p = .00).”
I had initially hoped to report on this fourth article in the same week as Part 3 of this series, but quickly realized that I would need to parse it with a much finer-toothed comb. With this article, Cortina and Landis jump squarely into the null hypothesis significance testing (NHST) vs. effect size testing (EST) debate with a reasonably strong position. If you aren’t familiar with what these terms refer to, here’s a little reminder:
- NHST: The comparison of obtained differences/relationships to a theoretical sampling distribution to determine the probability that we would find that difference/relationship (or one larger) if there were really no difference/relationship in the population (called the null hypothesis). If an observed result is improbable (assuming the null hypothesis were true), we typically use the term “statistically significant.” (A minimal code sketch of this logic follows the list.)
- EST: The simple reporting of the observed result, such as the size of a correlation or d-statistic, and the confidence with which we have made that estimate.
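To make the NHST bullet concrete, here is a minimal sketch in Python (mine, not the article’s; the data and numbers are invented) of that logic: an observed t statistic is compared to the theoretical sampling distribution it would follow if the null hypothesis were true.

```python
import numpy as np
from scipy import stats

# Invented data: two groups of 60, with a modest true difference.
rng = np.random.default_rng(1)
treatment = rng.normal(loc=0.4, scale=1.0, size=60)
control = rng.normal(loc=0.0, scale=1.0, size=60)

# NHST: scipy compares the observed t to the theoretical t sampling distribution.
t_obs, p_two_sided = stats.ttest_ind(treatment, control)
df = len(treatment) + len(control) - 2

# The p-value is the probability, if the null hypothesis were true, of a
# t statistic at least this extreme in either direction.
p_by_hand = 2 * stats.t.sf(abs(t_obs), df)
print(f"t({df}) = {t_obs:.2f}, p = {p_two_sided:.4f} (recomputed: {p_by_hand:.4f})")
```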
An influential 1994 paper by Cohen2 entitled “The earth is round (p<.05)” raised a number of valid criticisms of the state of NHST in psychology at the time, and its strong perspective is best summarized by this statement:
NHST has not only failed to support the advance of psychology as a science, but also has seriously impeded it. (p. 997)
NHST, Cohen argued, is so commonly misused and misunderstood by psychologists that it has had a net negative effect on scientific progress. I’ve detailed potential problems with NHST elsewhere, so I won’t go into them here, but here’s the basic problem: it is so easy to reduce statistical significance testing to “it’s significant and therefore a real effect” versus “it’s not significant and therefore not a real effect” that many researchers do exactly that, even though that conclusion is completely invalid.
Cohen argued that replacing NHST with EST would go a long way toward fixing this problem. Instead of statements like this…
- NHST: The difference between conditions was statistically significant [t(145) = 4.12, p < .05].
…you would see statements like this…
- EST: The difference between conditions was 0.68 standard deviations in magnitude [d = .68, CI(.95): .35 < d < 1.01].
Same phenomenon; different reporting. EST discourages researchers from making categorical statements like “there was an effect” while simultaneously giving information about the precision of the estimate obtained.
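The specific numbers above are just an illustration, but the contrast is easy to see in code. Here is a minimal sketch (the function name and data are my own invention, not from the article) that produces both summaries from the same two groups: the NHST-style line, and the EST-style line with Cohen’s d and an approximate 95% confidence interval.

```python
import numpy as np
from scipy import stats

def two_reports(a, b):
    """Return an NHST-style and an EST-style summary for the same two groups."""
    n1, n2 = len(a), len(b)
    t, p = stats.ttest_ind(a, b)                        # NHST: t and p
    pooled_sd = np.sqrt(((n1 - 1) * np.var(a, ddof=1) +
                         (n2 - 1) * np.var(b, ddof=1)) / (n1 + n2 - 2))
    d = (np.mean(a) - np.mean(b)) / pooled_sd           # EST: Cohen's d
    se_d = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    lo, hi = d - 1.96 * se_d, d + 1.96 * se_d           # approximate 95% CI
    nhst = f"t({n1 + n2 - 2}) = {t:.2f}, p = {p:.3f}"
    est = f"d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]"
    return nhst, est

# Invented data for illustration only.
rng = np.random.default_rng(7)
nhst_line, est_line = two_reports(rng.normal(0.5, 1, 74), rng.normal(0.0, 1, 73))
print(nhst_line)  # the categorical-sounding report
print(est_line)   # the magnitude-and-precision report
```

Roughly speaking, the second line carries everything the first does (a 95% CI that excludes zero corresponds to p < .05, two-tailed) plus the magnitude and precision of the estimate.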
In the present article, Cortina and Landis argue that this shift will do no good. They say that while NHST is frequently misused, there at least exists a well-structured system by which to make judgments about NHST. An effect either is statistically significant, or it is not. EST, on the other hand, has very few standards by which to make judgments.
The most prominent standard currently used in EST is in fact a bastardization of Cohen’s own recommendations. Cohen offered benchmarks for what might be considered “small,” “medium,” and “large” effects, but recommended that researchers develop their own standards of comparison within individual research literatures. And yet researchers typically take the specific values Cohen supplied and wield them like a hammer, making claims about “medium effects” regardless of context. This is the same “dichotomous thinking” that plagues the use of NHST (although perhaps in this context it is better called “trichotomous” or “polytomous” thinking).
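For concreteness, here is what that rote use looks like in code. This is my own illustration, not anything Cortina and Landis propose; it simply applies Cohen’s conventional benchmarks of roughly d ≈ .2, .5, and .8 as if they were universal cutoffs.

```python
def rote_cohen_label(d: float) -> str:
    """Apply Cohen's conventional benchmarks (d ~ .2 small, .5 medium, .8 large)
    as if they were universal cutoffs -- the context-free habit criticized above."""
    d = abs(d)
    if d >= 0.8:
        return "large"
    if d >= 0.5:
        return "medium"
    if d >= 0.2:
        return "small"
    return "smaller than 'small'"

# 'Trichotomous thinking' in action: the same d = 0.45 gets labeled "small"
# whether it comes from a noisy field survey or a tightly controlled lab task.
print(rote_cohen_label(0.45))
```

Used this way, the benchmarks reproduce exactly the kind of categorical judgment that EST was supposed to move us away from.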
Cortina and Landis thus argue that a shift to EST would leave us even worse off than NHST does, because EST lacks structured expectations. Although researchers misuse NHST, at least there are standards by which to say there is misuse! This is a straw man. Any new technique that gains popularity in the literature requires a “breaking-in” period as researchers explore it independently; just look at meta-analysis, SEM, HLM, and other relatively new approaches to data. A lack of clarity and direction with a new technique does not mean that the old technique is better or even safer.
So what exactly is the urban legend here? It is the belief that a shift to EST will magically solve the interpretability problems of a research literature built on NHST. I certainly agree that this belief is a myth and potentially a problem. Both NHST and EST are simplifications and summaries of much more complex phenomena, and as a result information is lost along the way. It takes expertise in the content area, the methodology, and the statistics to draw valid conclusions about constructs from data; sometimes all of that expertise resides in one person, and sometimes it does not. Even with a complete shift to EST, many researchers would continue to overestimate their abilities and misuse the tools they are given.
The authors ultimately conclude that it’s time to simply embrace the fact that taking one analytic approach to a dataset is not enough; some combination of well-used NHST and EST is needed. What I find peculiar, though, is that after decrying the lack of structure and expectations in EST, they offer a set of “situations…we as a field should be able to agree upon” that are a little vague:
- “If one’s sample is very large, then chance is irrelevant.” N = 50,000 is offered as an example of a “large sample.”
- “If one’s sample is very small, then quantitative analysis as a whole is irrelevant.” This seems ripe for abuse by articles citing it to say, “We only had N = 30, so we don’t need to do quantitative analysis.”
- “If one’s model is simple, then expectations for rejection of chance and for effect magnitude should be higher than if one’s model is complex.”
- “If one’s results could be characterized as counterintuitive, either because of the design or because of the questions being asked, then expectations for magnitude should be lower than for results that replicate previous research or might be characterized as reflecting conventional wisdom.”
Something tells me this problem will not be solved any time soon.
- Cortina, J., & Landis, R. (2010). The Earth is not round (p = .00). Organizational Research Methods, 14(2), 332-349. DOI: 10.1177/1094428110391542
- Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997-1003. DOI: 10.1037/0003-066X.49.12.997
Nice and brief critique, Richard. As much as I like Cortina and Landis’ work, I have to concur with your position here. Well articulated and, I think, insightful. To my mind, NHST has not been a disaster. Sure, there are problems, but if it were that bad, then science really would not have advanced in the manner that it has. NHST has a place, and has had a place, and now more work is needed with effect sizes, Bayesian techniques, etc.
—Sincerely,
Mark