In what I can only assume is a special issue of Organizational Research Methods, several researchers discuss statistical and methodological myths and urban legends commonly seen in the organizational sciences (for more introduction, see the first article in the series). Second in the exploration: Spector and Brannick1 write “Methodological Urban Legends: The Misuse of Statistical Control Variables.”
Spector and Brannick criticize the tendency of researchers conducting correlational research to blindly include “control variables” in an attempt to get better estimates of population correlations, regression slopes, and other statistics. Such effort is typically an attempt to improve methodological rigor when true experimentation isn’t possible, feasible, or convenient. Unfortunately, the belief that adding controls automatically improves rigor is a methodological urban legend. And yet, strikingly, the authors report a study finding a mean of 7.7 control variables in macro-org research and 3.7 in micro-org research.
I will let the authors explain the problem:
Rather than being included on the basis of theory, control variables are often entered with limited (or even no) comment, as if the controls have somehow, almost magically, purified the results, revealing the true relationships among underlying constructs of interest that were distorted by the action of the control variables. This is assumed with often little concern about the existence and nature of mechanisms linking control variables and the variables of interest. Unfortunately, the nature of such mechanisms is critical to determining what inclusion of controls actually does to an analysis and to conclusions based on that analysis.
The authors call the assumption that blindly including control variables yields more accurate results the purification principle. The problem with the purification principle is that it is false; the inclusion of statistical controls does not purify measurement. Instead, it simply removes the covariance between the control variable and the other variables from later analyses, even though that covariance may be meaningful to the researcher’s hypotheses. The authors give this illustrative example:
A supervisor’s liking for a person might inflate the supervisor’s rating of that person’s job performance across multiple dimensions. Correlations among those dimensions might well be influenced by liking, which in effect has contaminated ratings of performance. Thus, researchers might be tempted to control for liking when evaluating relationships among rating dimensions. Note, however, that whether it is reasonable to control liking in this instance depends on whether liking is in fact distorting observed relationships. If it is not (perhaps, liking is the result of good performance), treating liking as a control will lead to erroneous conclusions. This is because removing variance attributable to a control variable (liking) that is caused by a variable of interest (performance) will remove the effect you wish to study (relationships among performance dimensions) before testing the effect you wish to study, or “throwing out the baby with the bathwater.”
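To make the quoted scenario concrete, here is a minimal simulation (my own toy numbers, nothing from the article): two performance rating dimensions and liking are all driven by true performance, so liking is an outcome of performance rather than a contaminant. Partialling liking out of the rating dimensions then strips away most of the relationship we actually care about.

```python
# Toy illustration of the "baby with the bathwater" problem: the control
# variable (liking) is caused by the variable of interest (performance).
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

performance = rng.normal(size=n)                        # true underlying performance
dim1 = performance + rng.normal(scale=0.5, size=n)      # rating dimension 1
dim2 = performance + rng.normal(scale=0.5, size=n)      # rating dimension 2
liking = performance + rng.normal(scale=0.2, size=n)    # liking is an OUTCOME of performance

def residualize(y, x):
    """Residuals of y after removing its linear relationship with x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

raw_r = np.corrcoef(dim1, dim2)[0, 1]
partial_r = np.corrcoef(residualize(dim1, liking), residualize(dim2, liking))[0, 1]

print(f"correlation between rating dimensions:           {raw_r:.2f}")
print(f"same correlation after 'controlling' for liking: {partial_r:.2f}")
```

With these made-up numbers the dimension-to-dimension correlation is large before the control and small afterward, even though nothing about performance itself has changed; the control did not purify anything, it removed the effect under study.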
So how should one actually use control variables? Two recommendations are given:
- Use specific, well-explored theory to drive the inclusion of controls, which goes beyond simple statements like, “previous researchers used this control” or “this variable is correlated with my outcomes.” If you believe that a specific relationship may be contaminating your results, this may be justification for a control, but you should explicitly state why and defend this decision when describing your methods. Follow up on this discussion; test hypotheses about control variables.
- Don’t control for demographic variables, e.g. race, gender, sex, age. For example, if you find a gender difference in your outcome of interest, controlling for that variable may hide real variance in the outcome that could be explained by whatever real phenomenon is causing that difference. In my own research area, it is not uncommon to control for age when examining the effects of technology on outcomes of interest (e.g. learning). But age does not itself cause trouble with technology; instead, underlying differences like familiarity with technology, comfort with technology, or other characteristics may be driving those differences. Simply controlling for age not only removes “real” variance that should remain in the equation but also camouflages a real relationship of interest (see the sketch below).
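Here is a similar sketch for the age example, again with made-up numbers rather than data from any study. Age has no direct effect on learning in this simulation; it merely tracks technology familiarity, the real driver. A researcher who never measures familiarity sees a strong age effect and is tempted to control for it, but once the real cause is in the model the age coefficient collapses to roughly zero.

```python
# Toy illustration: age is only a proxy for technology familiarity, which is
# what actually drives the learning outcome.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

familiarity = rng.normal(size=n)                            # real driver of learning
age = -0.9 * familiarity + rng.normal(scale=0.3, size=n)    # age merely proxies familiarity
learning = familiarity + rng.normal(scale=0.5, size=n)      # age contributes nothing directly

# Naive model: learning ~ age. Age looks important, so it gets "controlled for."
X_naive = np.column_stack([np.ones(n), age])
beta_naive = np.linalg.lstsq(X_naive, learning, rcond=None)[0]

# Model that includes the real cause: the apparent age effect disappears.
X_full = np.column_stack([np.ones(n), age, familiarity])
beta_full = np.linalg.lstsq(X_full, learning, rcond=None)[0]

print(f"age slope when familiarity is unmeasured: {beta_naive[1]:.2f}")
print(f"age slope when familiarity is measured:   {beta_full[1]:.2f}")
```

Controlling for age in the naive model soaks up familiarity variance and hides the relationship we would actually want to theorize about; measuring and modeling the underlying characteristic is the better route.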
So generally, Spector and Brannick are calling for an organizational science based on iterative theory building, progressively testing alternative hypotheses and narrowing in on answers bit by bit. This approach is closer to what is employed in the natural sciences; instead of testing one-off theories, researchers build on prior work, approaching a problem from as many perspectives as possible to narrow in on real results.
My only concern is this: since one-off studies with revealing and/or controversial results are the ones most often rewarded with recognition, is this an approach that organizational researchers will really take?
- Spector, P., & Brannick, M. (2010). Methodological urban legends: The misuse of statistical control variables. Organizational Research Methods, 14(2), 287-305. DOI: 10.1177/1094428110369842
In what I can only assume is a special issue of Organizational Research Methods, several researchers discuss statistical and methodological myths and urban legends commonly seen in the organizational sciences (which is a term I’ve adopted for organizational behavior, human resources, and industrial/organizational psychology). Four articles in this issue stuck out to me, which are the ones I’ll be discussing over the next few weeks. First up: Edwards1 writes “The Fallacy of Formative Measurement.”
Before getting into this, I will give a small disclaimer: I am not a research methods specialist, nor am I a quantitative psychologist. I am an industrial/organizational psychologist with an interest in research methods as a means to study phenomena I find interesting. As a result, I am approaching this from the perspective of an end-user of methods rather than a person who studies them explicitly. So any misrepresentation here is probably my fault! That said…
According to Edwards, there are two models of the relationships between constructs (theoretical concepts we want to measure, like job satisfaction, happiness, learning, etc.) and measures (the actual data we end up analyzing). The distinction between these is important when creating models to analyze data, especially in the context of structural equation modeling or other explicit data modeling techniques.
The first and more traditional view is called reflective measurement. From this perspective, constructs cause measures, i.e. the construct is the “real” underlying characteristic, and a measure is simply a reflection of that characteristic.
The second and more recent view is called formative measurement. From this perspective, measures cause constructs, such that the only real constructs are latent constructs, i.e. ones that can be detected by looking for patterns in data and then carefully refining measurement.
The major conflict between these two perspectives is the role of theory. From a reflective perspective, we should develop theories of the relationships between constructs and use those theories to shape how data will be collected. From a formative perspective, we should look for patterns in data and let the data provide the basis for our theories, progressively shaping our theories as the data lead us to do so.
To add to the confusion, the two perspectives typically provide similar results. One researcher coming from the perspective of reflective measurement (developing theory, testing it) may end up at the same conclusions as another researcher coming from the formative perspective (measuring a host of constructs, looking for patterns). But that doesn’t necessarily mean that the approaches are equally preferable.
Edwards discusses six key ways that formative measurement differs from reflective measurement:
- Dimensionality: Reflective measurement accepts that redundancy between measures is inevitable because all measurement is an imperfect representation of a construct. The goal, however, is to get as “pure” a measure of the construct as possible, so redundancy should be minimized. This is reflected in the typical measure development process – iteratively removing items that load on multiple factors in order to tap the construct as effectively as possible. In a formative measurement model, measures (and items) that are multidimensional by definition reflect multidimensional constructs. Consider this double-barreled Likert-type agreement item: “Sometimes I like pancakes and sometimes I like broccoli.” This item is multidimensional, but is it more likely that this is a bad question that should be refined (reflective) or that there is an underlying multidimensional construct (formative)?
- Internal Consistency: In a reflective model, two measures/items designed to tap the same construct should correlate highly. Formative measures carry no such expectation; in fact, internal consistency might be a sign of a poor model, as there is no requirement that the facets of a construct be correlated with one another. Edwards argues that this difference has led some researchers, upon finding that their reflective measures are multidimensional, to conclude that those measures are in fact formative. Unfortunately, this is not a valid conclusion.
- Identification: Identification refers to the ability to derive unique values for each model parameter. It’s probably a little overkill to go into the details here, but in a nutshell, a model must be overidentified in order to determine how well it fits the data. For a formative measure to be identified, one must add reflective measures as outcomes of the formative measure. But because measures cause constructs in a formative model, adding these measures has implications for the construct itself (which, if I may editorialize a bit here, makes a huge mess of things).
- Measurement Error: In a reflective model, measurement error is associated with each item – this is the degree to which the item is unique and measures constructs other than the construct it is supposed to measure. In a formative model, measurement error is associated with the latent construct instead. Each measure contributes some explanation of the latent construct, and whatever is left over is error. Thus the implicit assumption in a formative model is that the items do not possess unique measurement error – which seems a little odd, at least in the context of psychological constructs. Errors in recall, day-to-day fluctuations in affect, etc. – these and many other sources of item-level measurement error appear to be discarded in formative models.
- Construct Validity: In a reflective model, construct validity refers to the degree to which a measure reflects the construct it is supposed to measure. Construct validation is intended to determine how well the measures represent the construct, which is a process that ultimately relies on researcher judgment. In a formative model, these decisions are made statistically, by examining the interrelationships within the model itself. This means that the construct’s validity is driven by the kinds of variables you test its validity with, which makes it quite tricky.
- Causality: As stated before, in a reflective model, constructs cause measures, while in a formative model, measures cause constructs.
Thus, reflective models describe measures as imperfect indicators of underlying phenomena while formative models describe measures as an indistinguishable part of the constructs they are tied to.
I think the problem with pure formative models can be illustrated with this example. Consider a 10-question scale measuring the psychological state of happiness. Using a reflective model, we would use theory to conceptually define happiness and develop 10 questions to measure the construct. We would iteratively work to include only items that had high item-total correlations so that we had consistency of measurement, although we accept that we won’t have perfect measurement, as that would be impossible. We also realize that happiness exists independently of whether or not we measure it, and whether or not we measure it well.
A formative model would not make many of these assumptions. For example, once we have created a 10-question scale, we have created a construct. If multidimensionality in measurement is discovered, that indicates a multidimensional construct. Rather than saying happiness exists as a human characteristic that we are attempting to measure, we say that this measurement is part of the definition of whatever construct is being measured (hopefully happiness). Individual items in the measure do not contain both common elements of happiness and their own unique contributions; instead, they contribute directly to the variance in the construct. For example, an item like “I am generally happy at work” contains variance contributed both by general happiness and by work-specific (unique) happiness when explored reflectively. From a formative perspective, the unique variance gets added to the error associated with the latent trait – it is a part of the construct, but a part that doesn’t help explain anything theoretically. Simply because we asked the question, it becomes a part of the construct – which, in my opinion, doesn’t really make any sense.
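A small simulation can make this contrast tangible. The numbers below are my own invention, not Edwards's: reflective items all share a common latent cause, so they hang together and Cronbach's alpha is high; a formative “happiness index” built from indicators with no common cause shows near-zero internal consistency, which a formative model would not treat as a flaw, because the construct is simply whatever the chosen indicators define.

```python
# Toy contrast between reflective and formative measurement of "happiness."
import numpy as np

rng = np.random.default_rng(7)
n = 5_000

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents x n_items) array."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Reflective: a latent happiness trait causes every item; each item also carries
# its own unique error, which stays at the item level.
happiness = rng.normal(size=(n, 1))
reflective_items = happiness + rng.normal(scale=0.6, size=(n, 10))

# Formative: the "construct" is defined as a composite of indicators that need not
# share any common cause (say income, sleep, time with friends); whatever the
# composite fails to capture becomes error at the construct level, not the item level.
formative_indicators = rng.normal(size=(n, 3))
happiness_index = formative_indicators.mean(axis=1)   # the construct exists because we built it

print(f"alpha for the reflective items:     {cronbach_alpha(reflective_items):.2f}")
print(f"alpha for the formative indicators: {cronbach_alpha(formative_indicators):.2f}")
```

Under the reflective model the high alpha follows from the shared latent cause; under the formative model the near-zero alpha is unremarkable, and the happiness_index is “happiness” only in the sense that we declared it to be.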
Thus, Edwards’ conclusion is that formative measurement is based on invalid assumptions about the nature of data in the six categories listed above. He goes on to suggest alternatives to pure formative models that combine elements of reflective models with them. These mixed models are no longer purely formative, but they retain many of the advantages of formative measurement – for example, better handling of multidimensional constructs.
For my own work, this has convinced me that, at a minimum, formative models as they exist currently are too controversial to be employed safely, both from a publishing perspective and for me to be confident in my own conclusions.
- Edwards, J. (2010). The fallacy of formative measurement. Organizational Research Methods, 14(2), 370-388. DOI: 10.1177/1094428110378369
SIOP 2011 Coverage: Schedule Planning | Junior Faculty Consortium
Day 1 Live Blog | Day 1 Summary | Day 2/3 Live Blog | Day 2/3 Summary
Day 2 at SIOP started with a session not quite related to tech research, but rather something I found personally interesting: ways that I/O Psychology is currently “making a difference.” The presentation that struck me the most in the set was one covering the role of I/O in the Bureau of Indian Affairs, which is apparently a division of the US Department of the Interior tasked with working with the several hundred Native American tribes residing in the United States. Historically, the Bureau was responsible for overseeing “Indian Affairs” but is currently in the midst of a cultural transition towards an advisory help-from-within sort of role.
As a result, the Bureau is actively trying to hire Native Americans to fill its ranks (Native Americans serving Native Americans), and there are many, many job roles within the Bureau that need to be filled. Although their recruitment efforts are fairly successful – they are able to recruit several hundred Native Americans each year – these folks often leave within a year. I/O psychologists working within the Bureau discovered the reason and helped design new recruitment and other materials to support Native American retention.
I also attended a session on online recruitment. It was fine, but there was not a whole lot of new information. Online recruitment in the military, for example, consists of live chat and e-mail with potential recruits. The Navy alone apparently holds 700-800 recruitment chats per day along with around 100 e-mails per day – that’s a lot of recruits. But that’s also a recruitment technology and approach that has been around for at least a decade. While the volume is impressive, it’s not particularly innovative.
The one piece of that presentation that I did find interesting was the report on their online social network called MyNavySpace, which is a space for potential recruits to chat and communicate prior to showing up for basic training. Across the board, 26% of recruits don’t show up for basic training, but among those using MyNavySpace, the number drops to about 6%. Whether that’s because recruits using the social network are more motivated or because the social network motivates them is unclear.
On Day 3, I only attended one session, but it was a good one: a group of practitioners discussing serious games and virtual worlds. They touched on several major issues that remain relatively unexplored in the research literature, including the distinction between serious games and gamification, the limitations of artificial intelligence for automatic assessment within serious games, and the lack of evidence that behaviors transfer from serious games to the workplace.
Two ideas discussed were particularly interesting to me. First, Ben Hawkes at Kenexa brought up research on the uncanny valley and its implications for video-based simulations. The uncanny valley is a fascinating theory – the idea that increasing the fidelity of human representation in 2D/3D media is appealing only up to a point, past which it suddenly becomes very disturbing. For example, a small photo of a person is more “human” than a name in a chatroom; a Second Life avatar is more “human” than a photo. But at a certain point – think Polar Express – the representation is just downright creepy. It’s close to “human” and yet there is something wrong that really jumps out at us. Some researchers say this is why people find zombies so disturbing – human, but not quite.
Second, I really did not like the idea of “stealth assessment.” There was some belief that people really engaged in a serious game would enter a “flow state,” and people in this flow state would forget they were being assessed (i.e. they would drop their self-monitoring defenses because they were so engaged). Thus, the assessor would get a more honest read of the applicant’s personality. The two problems I see are 1) this may be somewhat unethical, as we should never trick job applicants for any reason, and 2) this creates measurement inequalities. If John the Applicant enters the flow state and starts yelling in frustration and Mary the Applicant does not enter the flow state and does not yell, it doesn’t mean that Mary has greater emotional stability than John. There is no way to disentangle an applicant’s propensity to enter the flow state from the psychological constructs we think that flow state should let us see.
So that’s it for my SIOP 2011 conference experience. The relatively light density of technology presentations meant that I only spent about half my time at presentations and posters, and the other half chatting with old friends and new collaborators. And isn’t that what conferencing is all about?