
Stats and Methods Urban Legend 4: Effect Size vs. Hypothesis Testing

2011 May 16
by Richard N. Landers

In what I can only assume is a special issue of Organizational Research Methods, several researchers discuss common statistical and methodological myths and urban legends (MUL) commonly seen in the organizational sciences (for more introduction, see the first article in the series). Fourth and final in this series: Cortina and Landis [1] write “The Earth is Not Round (p = .00).”

I had initially hoped to report on this fourth article in the same week as Part 3 of this series, but quickly realized that I would need to parse it with a much finer-toothed comb. Cortina and Landis, with this article, are jumping squarely into the null hypothesis significance testing (NHST) vs. effect size testing (EST) debate with a reasonably strong position. If you aren’t familiar with what these terms refer to, here’s a little reminder:

  • NHST: The comparison of obtained differences/relationships to a theoretical sampling distribution to determine the probability that we would find that difference/relationship (or one larger) if there were really no difference/relationship in the population (called the null hypothesis).  If an observed result is improbable (assuming the null hypothesis were true), we typically use the term “statistically significant.”
  • EST: The simple reporting of the observed result, such as the size of a correlation or d-statistic, along with the confidence with which we have made that estimate.
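To make the contrast concrete, here is a minimal sketch of my own (not from either article) of the two kinds of statements for a single correlation. The data, sample size, and seed are invented; the NHST line reports a p-value against the null of zero correlation, while the EST line reports the estimate itself with a 95% confidence interval via the Fisher z transformation.

```python
# A minimal sketch (mine, not from the article) of both kinds of statements
# for a single correlation, using invented data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 150
x = rng.normal(size=n)
y = 0.2 * x + rng.normal(size=n)   # a small built-in relationship

# NHST: how improbable is a correlation this large if the population r were 0?
r, p = stats.pearsonr(x, y)
print(f"NHST: r({n - 2}) = {r:.2f}, p = {p:.3f}")

# EST: report the estimate itself plus a 95% CI (Fisher z transformation)
z, se = np.arctanh(r), 1 / np.sqrt(n - 3)
lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)
print(f"EST:  r = {r:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```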

An influential 1994 paper by Cohen [2], entitled “The earth is round (p < .05),” raised a large number of valid criticisms of the state of NHST in psychology, and its strong perspective is best summarized by this statement:

NHST has not only failed to support the advance of psychology as a science, but also has seriously impeded it. (p. 997)

NHST, Cohen argued, is so commonly misused and misunderstood by psychologists that it has had a net negative effect on scientific progress.  I’ve detailed potential problems with NHST elsewhere, so I won’t go into them here, but here’s the basic problem: it is so easy to reduce statistical significance testing to “it’s significant, therefore the effect is real” versus “it’s not significant, therefore there is no effect” that many researchers do exactly that, despite the fact that both conclusions are completely invalid.

Cohen argues that the replacement of NHST with EST would do much good to fix this problem.  Instead of statements like this…

  • NHST: The difference between conditions was statistically significant [t(145) = 4.12, p < .05].

…you would see statements like this…

  • EST: The difference between conditions was 0.12 standard deviations in magnitude [d = .12, CI(.95): .01 < d < .23].

Same phenomenon; different reporting.  EST discourages researchers from making categorical statements like “there was an effect” while simultaneously giving information about the precision of the estimate obtained.
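For the curious, here is a rough sketch of how both statements could be produced from the same two-group data. It is not the authors’ example; the group sizes, means, and the approximate confidence-interval formula for d are my own choices for illustration.

```python
# A hedged sketch: the same simulated two-group comparison reported
# NHST-style (t and p) and EST-style (d with an approximate 95% CI).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(loc=0.0, scale=1.0, size=74)
treatment = rng.normal(loc=0.12, scale=1.0, size=73)
n1, n2 = len(treatment), len(control)
df = n1 + n2 - 2

# NHST-style statement: is the difference statistically significant?
t, p = stats.ttest_ind(treatment, control)
print(f"NHST: t({df}) = {t:.2f}, p = {p:.3f}")

# EST-style statement: how big is the difference, and how precisely is it estimated?
sp = np.sqrt(((n1 - 1) * treatment.var(ddof=1) + (n2 - 1) * control.var(ddof=1)) / df)
d = (treatment.mean() - control.mean()) / sp
se_d = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))  # common approximation
print(f"EST:  d = {d:.2f}, 95% CI [{d - 1.96 * se_d:.2f}, {d + 1.96 * se_d:.2f}]")
```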

In the present article, Cortina and Landis argue that this shift will do no good.  They say that while NHST is frequently misused, there at least exists a well-structured system by which to make judgments about it: an effect either is statistically significant, or it is not.  EST, on the other hand, has very few standards by which to make judgments.

The most prominent standard currently used in EST is in fact a bastardization of Cohen’s own work.  Cohen defined what might be considered “small,” “medium,” and “large” effects, but recommended that researchers come up with their own standards of comparison within individual research literatures.  And yet researchers typically take the specific values Cohen supplied and brandish them as a hammer, making claims about “medium effects” regardless of context.  This is the same “dichotomous thinking” that plagues the use of NHST (although perhaps in this context, it is better called “trichotomous” or “polytomous thinking”).
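To see just how mechanical that “hammer” can be, here is a tiny illustration of my own (not from the article): a function that applies Cohen’s conventional d cutoffs no matter where the effect came from.

```python
# A toy example of "trichotomous thinking": labeling any d by Cohen's
# conventional cutoffs (.2, .5, .8), with no regard for research context.
def cohen_label(d: float) -> str:
    d = abs(d)
    if d < 0.2:
        return "negligible"
    if d < 0.5:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"

# The same d = 0.45 gets the same label whether it comes from a cheap
# one-item survey manipulation or a decade-long field intervention.
print(cohen_label(0.45))  # "small", by fiat
```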

Cortina and Landis thus argue that a shift to EST would leave us even worse off than NHST because EST lacks structured expectations.  Although researchers misuse NHST, at least there are standards by which to say there is misuse!  This is a straw man.  Any new technique that gains popularity in the literature will require a “breaking in” period as it is explored independently by new researchers – just look at meta-analysis, SEM, HLM, and other relatively new approaches to data.  A lack of clarity and direction with a new technique does not mean that the old technique is better or even safer.

So what exactly is the urban legend here?  It is that a shift to EST will magically solve the interpretability problems associated with a research literature filled with NHST.  I certainly agree that this belief is a myth and potentially a problem.   Both NHST and EST are simplifications and summaries of much more complex phenomena, and as a result, information is lost along the way.  It takes an expert in the content area, the methodology, and the stats to make valid conclusions about constructs from data.  Sometimes these are the same person; sometimes they are not.  Even with a complete shift to EST, many researchers would continue to overestimate their abilities and misuse the tools they are given.

The authors ultimately conclude that it’s time to simply embrace that taking one analytic approach to a dataset is not enough.  Some combination of well-used NHST and EST is needed.  What I find peculiar, though, is that after decrying a lack of structure and expectations in EST, they give a set of “situations…we as a field should be able to agree upon” that are a little vague:

  1. “If one’s sample is very large, then chance is irrelevant.”  N = 50,000 is given as an example of a “large sample.”
  2. “If one’s sample is very small, then quantitative analysis as a whole is irrelevant.”  This seems ripe for articles citing this and saying “We only had N = 30, so we don’t need to do quantitative analysis.”
  3. “If one’s model is simple, then expectations for rejection of chance and for effect magnitude should be higher than if one’s model is complex.”
  4. “If one’s results could be characterized as counterintuitive, either because of the design or because of the questions being asked, then expectations for magnitude should be lower than for results that replicate previous research or might be characterized as reflecting conventional wisdom.”

Something tells me this problem will not be solved any time soon.

  1. Cortina, J., & Landis, R. (2010). The Earth is not round (p = .00). Organizational Research Methods, 14(2), 332-349. DOI: 10.1177/1094428110391542
  2. Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997-1003. DOI: 10.1037//0003-066X.49.12.997

FCVW 2011: Virtual Worlds Conference Live-Blog

2011 May 12
by Richard N. Landers

I’m virtually attending the Federal Consortium for Virtual Worlds 2011 conference today. The purpose of this conference is to discuss innovation in the area of 3D virtual worlds in government service. This entry will be a live blog of my experiences at this virtual conference.

Day 1 10:20 AM – About half an hour after first attempting to get to this conference, I am finally able to listen to a session.  The conference has multiple gateway points to get to the live streaming content, all of which are inside virtual worlds.

That’s fine theoretically, but it doesn’t really meet my needs as an attendee – I need simple, straightforward instructions such that I can get into the conference material easily and quickly.  Instead, there’s an artificial barrier to entry.  Instead of viewing a webpage with the live video stream, I have to download software and walk over to a webpage with a live video stream.  Why?!  If there was dedicated content inside the virtual world that encouraged participant interaction, that would be fine, as that is added value to the experience – but as it is, I had to hop between 3 virtual worlds to find anyone else even present, and when I finally got there, the only added value I received was that one of the other users was having microphone problems, which was covering up the conference speaker audio.  Hopefully this experience will improve now that I’m settled in…

10:30 AM – Apparently we’ll now be listening to another presenter through Second Life, through the live-stream, through another VW.  A lot of layers.

10:33 AM – Speaker: there is no reason to have a 3D environment unless it is valuable in a way that can’t be replicated elsewhere, e.g. on a webpage. Yes! 3D virtual worlds add specific value, but using them blindly across situations without specific consideration of their value in those situations doesn’t make any sense.

10:37 AM – Apparently kids REALLY get into role playing in 3D virtual environments.  Is it better engagement than in-person though, I wonder? Apparently Chinese kids are very shy in-person, so the MUVE gives them an opportunity to act out in a way they would not otherwise feel comfortable to do.

10:39 AM – “Artificial intelligence is not yet good enough to replace a real teacher.”  Inevitable though?

10:54 AM – The best projects, with the most added value, are those that benefit from creative use of the 3D virtual environment – for example, 3D content creation.

11:04 AM – Panel disagreement: “Be one person” and determine who you are online and offline versus “Multi-persona” approaches, where you choose to be a different person depending on the goal – social media, virtual world, whatever. Tricky, tricky. But I wonder what effect these different approaches would have on the person engaging in them.

12:50 PM – The feed has been dead for over an hour now – we’re in a lunch break and vendor fair. Keynote upcoming in about 10 minutes.

12:57 PM – At a vendor booth for the National Center for Telehealth & Technology. They have a PTSD simulator, which starts with the simulation of a “traumatic event.” Disturbing… but looks effective.

1:08 PM – Sitting in the virtual expo hall, but no video. Not sure what’s happening.

1:17 PM – Switched to Internet Explorer… didn’t seem to like Firefox. Working now, but have to catch up with what’s going on.

1:21 PM – Keynote by Ms. Mk Haley: VR not being used so much in the public sphere, but more so behind the scenes.  Working on full-body interfaces.  Considers “virtual reality” to be a much wider term – Disneyland as a virtual reality, for example.

1:33 PM – I am definitely using the Marshmallow Challenge.

1:38 PM – Engineers given a basically-unwinnable game to play as an exercise in innovation. By working outside the perceived “rules,” the engineers could have won, but considered that cheating. Sometimes the innovation/cheating line is unclear. The engineers got mad.

3:07 PM – Legal ambiguity surrounding virtual environments (e.g. what happens if your student vandalizes something online while in your class?), although scary, is not really important because “on the whole, we want to use these spaces for fairly well-known, fairly understood, fairly innocuous things” and thus teaching activities online are no more risky than doing them in person. Not sure I agree…

3:15 PM – The tech needs to evolve such that anyone can access MUVEs anywhere, from mobile phones to immersive VR systems, for them to be truly valuable for getting things done.

3:19 PM – Question from the audience: How do we create standards of behavior in MUVEs? Sometimes people come out of their shell online and the thing that comes out of that shell isn’t very pretty.

3:45 PM – That’s it for me for today – tune in tomorrow!

Day 2 8:48 AM – Back in the saddle! This speaker’s content makes sense – using virtual worlds for science education by creating simulations that accomplish what cannot be accomplished in person (situated learning). Personal simulated ecosystem, for example.

9:01 AM – Paper-and-pencil tests are invalid in education? Methinks someone doesn’t write very good tests. The immersive assessment strategy can certainly assess different competencies than a paper-and-pencil test, but that doesn’t invalidate paper-and-pencil tests – that argument is unnecessary.

9:13 AM – Has similar goals to mine: virtual environments automatically customizing themselves to student needs.

9:16 AM – Oi… my concern with that video is that it encourages children to run in seedy back alleys without supervision. 😉

10:36 AM – Panel on the use of VWs for “command and control centers”

10:40 AM – Again, the appeal of simulations seems to be largely in the ability to quickly make cheap simulations. Rapid prototyping with VOIP and instant messaging.

10:47 AM – Using VW-built prototypes to model hypothetical tactical systems and system displays… sounds similar to the process modeling work by Ross Brown but with more of a focus on information flow rather than physical flow

11:07 AM – “People do better playing war on XBOX Live than we do in the field in…urban occupation environment[s]”

11:35 AM – In Q&A with command and control session… and I’m out of time. Very interesting comments – will summarize thoughts in a dedicated post Monday.

Stats and Methods Urban Legend 3: Myths About Meta-Analysis

2011 May 3
by Richard N. Landers

In what I can only assume is a special issue of Organizational Research Methods, several researchers discuss common statistical and methodological myths and urban legends (MUL) commonly seen in the organizational sciences (for more introduction, see the first article in the series). Third up: Aguinis et al. [1] write “Debunking Myths and Urban Legends About Meta-Analysis.”

Meta-analysis has become such a de facto method for synthesizing a research literature in the organizational sciences that I can hardly imagine a modern narrative literature review without one.  If you aren’t familiar with it, meta-analysis essentially involves computing a grand mean of some effect size across research studies.  This might be a mean difference (usually a Cohen’s d) or a correlation (usually a Pearson’s r).
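As a rough illustration of that “grand mean” idea, here is a bare-bones, sample-size-weighted average of correlations across studies, in the spirit of a simple Hunter–Schmidt-style bare-bones meta-analysis. The study values are invented, and this sketch skips the corrections (unreliability, range restriction, and so on) that a real meta-analysis would consider.

```python
# A bare-bones sketch: N-weighted mean correlation across hypothetical studies.
studies = [(0.25, 120), (0.31, 85), (0.10, 240), (0.22, 60)]  # (r, N) pairs

total_n = sum(n for _, n in studies)
mean_r = sum(r * n for r, n in studies) / total_n

# N-weighted observed variance of r across studies
var_r = sum(n * (r - mean_r) ** 2 for r, n in studies) / total_n

print(f"k = {len(studies)}, total N = {total_n}")
print(f"weighted mean r = {mean_r:.3f}, observed variance = {var_r:.4f}")
```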

Unfortunately, the surge in popularity of this statistical technique has brought with it a large number of researchers employing it without really understanding it – imagine the person who computes an ANOVA without any clue what “ratio of the between- to within-group variability” means.  And even if we were to assume all researchers do understand it completely, we now have a large population of “consumers of meta-analyses” that need that same understanding just to accurately interpret a literature review.

Aguinis et al. provide a list of what they believe to be the seven most common myths and urban legends associated with meta-analysis.  My understanding is that this list came out of a session I attended at SIOP 2010 and subsequent discussions.  I’ll list each of the myths as Aguinis et al. stated them, along with my own interpretation of each:

  • MUL #1: A single effect size can summarize a literature. Much as you cannot use a sample mean or sample correlation to conclude anything about a single person within that sample, you cannot generalize from a single meta-analytic estimate about any particular setting.  This is why we have moderator analyses; the overall effect size from a meta-analysis only tells you what happens “on average.”  There is not necessarily even a single study or setting where you would find the relationship described by that overall effect size.
  • MUL #2: Meta-analysis can make lemonade out of lemons; meta-analysis allows researchers to gather a group of inconclusive and perhaps poorly designed studies and draw impressive conclusions with confidence. Meta-analysis certainly aggregates larger samples than are possible in a single study, which is a real strength of approaching data from this perspective.  But this has led to the common misconception that you can throw anything you want into a meta-analysis and get out “good” results.  It reminds me of the old computer science expression, GIGO: garbage in, garbage out.  If you include only poor-quality studies, you’ll get a poor-quality average.
  • MUL #3: File drawer analysis is a valid indicator of possible publication bias. One of the techniques recommended to identify whether your research suffers from publication bias (published studies tend to show stronger results than unpublished ones) is to compute a failsafe N; a small sketch of this computation follows the list.  This value represents how many studies with null results would need to be added to nullify the results of the present meta-analysis.  While a low failsafe N indicates potential publication bias, a high failsafe N does not necessarily indicate the absence of it.
  • MUL #4: Meta-analysis provides evidence about causal relationships. GIGO all over again.  If you aren’t meta-analyzing experiments that provide evidence of causality, your meta-analysis will not magically add that interpretation.
  • MUL #5: Meta-analysis has sufficient statistical power to detect moderating effects. It’s a common assumption that by meta-analyzing a research literature, you automatically have sufficient power to detect moderators.  While it is true that meta-analyses have greater power to detect moderators than individual primary studies, you do not automatically have sufficient power to detect anything you want to detect.
  • MUL #6: A discrepancy between results of a meta-analysis and randomized controlled trials means that the meta-analysis is defective. While a discrepancy might indicate a poorly designed meta-analysis, this is by no means conclusive.  Some discrepancy is inevitable because a meta-analysis is an average of studies, and those studies will vary randomly.
  • MUL #7: Meta-analytic technical refinements lead to important scientific and practical advancements. Most refinements in meta-analytic technique do not dramatically alter computed estimates.  Although you should certainly use the most recent refinements (as they will produce the most accurate estimates), you don’t need to worry too much about forgetting one… although there are certainly a few exceptions to this (my own work on indirect range restriction comes to mind!). The biggest mistake is to redo and attempt to publish a meta-analysis that directly replicates another meta-analysis with only minor changes in approach; the difference between the old and new results will almost never be large enough to justify this unless the meta-analytic k is also dramatically increased.
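As promised under MUL #3, here is a hedged sketch of the failsafe N computation (Rosenthal’s version), with invented study-level z-values. Keep the myth in mind: a large value here does not prove the absence of publication bias.

```python
# Rosenthal's failsafe N: how many unpublished null-result (z = 0) studies
# would be needed to drag the combined result below one-tailed p = .05.
import math

z_values = [2.1, 1.8, 2.6, 1.3, 2.9]   # one z per included study (invented)
k = len(z_values)
sum_z = sum(z_values)
z_crit = 1.645                          # one-tailed alpha = .05

failsafe_n = (sum_z / z_crit) ** 2 - k

print(f"Combined Z = {sum_z / math.sqrt(k):.2f} across k = {k} studies")
print(f"Failsafe N = {failsafe_n:.1f} null studies would nullify the result")
```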
  1. Aguinis, H., Pierce, C., Bosco, F., Dalton, D., & Dalton, C. (2010). Debunking myths and urban legends about meta-analysis. Organizational Research Methods, 14(2), 306-331. DOI: 10.1177/1094428110375720