In a recent article by Blackhurst, Congemi, Meyer and Sachau[1] in The Industrial-Organizational Psychologist, e-mail addresses from a group of 14,718 people who had applied for entry-level jobs in manufacturing were examined for their appropriateness. The researchers found that roughly 25% of e-mail addresses were inappropriate or antisocial, and that the level of inappropriateness predicted several qualities of interest to hiring managers: conscientiousness, professionalism, and work-related experience. Interestingly, cognitive ability was not related.
The types of e-mail addresses found, which were extracted by a team of 25 graduate students with high inter-rater reliability (a sub-sample of 1000 was used for this purpose), appear in the table below.
The graduate students next categorized all 15,000ish e-mail addresses (600 addresses assigned to each of the 25 students). At the same time, the graduate students coded the e-mails as either “appropriate when applying to a job,” “questionable,” or “inappropriate when applying to a job.” Afterward, one of the researchers reviewed all 15,000 and brought any questionable judgments to the attention of a 3-person panel for discussion. The researchers then compared mean scores on cognitive ability, conscientiousness, professionalism, and work-related experience across those with appropriate, questionable, and inappropriate e-mail addresses. Statistically significant differences were found on all dimensions.
Unfortunately, statistical significance is easy to attain in this sample. Even tiny effects will be statistically significant. The article did not report any standard deviations to give us a sense of effect sizes, so I had to do a little detective work. Here’s a table comparing outcomes for specific subtypes of inappropriate e-mail addresses:
This is the only table containing means, and fortunately, there are also degrees of freedom – that means we can reverse-engineer the t-test formula to estimate the standard deviation of this scale. It’s not perfect, but it’s the best available. Because these are independent-samples t-tests, t equals the mean difference (in this case, code group minus control group) divided by the pooled standard error (roughly s/SQRT(N)), and we can get the value of N by adding 2 to df (some of the numbers are a little odd in here – for example, DF should be equal to N*2-2, but it’s not – so this is my best guess). If we assume the SD of each group is equal, we can use the following formula to solve for s: s = sqrt(N)*(mean difference)/T. That produces SDs for each group here between 54.6 and 58.9, so if we assume these SDs hold up for the other scales, the differences on the predictors between appropriate and inappropriate e-mail addresses range from d=0.00 to d=0.11. So these are not by any stretch of the imagination big effects. But they are effects, about in line with what we’d expect from the intercorrelations between psychological predictors and performance generally.
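To make that arithmetic concrete, here is a minimal Python sketch of the back-of-the-envelope approach described above. The function names and the input numbers are hypothetical and chosen only for illustration; they are not values from the article's table, and the formula is the same rough approximation (N = df + 2, equal SDs in both groups), not an exact pooled-variance solution.

```python
import math

def estimate_sd(mean_diff, t_value, df):
    """Rough SD estimate from a reported t-test, using s = sqrt(N) * diff / t.

    Assumes N = df + 2 and equal SDs in both groups (a simplification).
    """
    n = df + 2
    return math.sqrt(n) * mean_diff / t_value

def cohens_d(mean_diff, sd):
    """Standardized mean difference: raw difference divided by the SD."""
    return mean_diff / sd

# Hypothetical inputs, NOT taken from the article
example_sd = estimate_sd(mean_diff=10.0, t_value=2.0, df=98)
example_d = cohens_d(10.0, example_sd)
print(example_sd, example_d)
```

With these made-up inputs, a 10-point difference on a scale with an estimated SD of 50 works out to d = 0.2, which gives a sense of how a "significant" difference can still be small.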
The study is not without limitations; all of the measures provided were by a consulting firm, so we do not have any way to independently verify their content. The e-mail address ratings were also made by graduate students, and it is not clear how well their judgments would generalize to actual hiring managers. Actual hiring decisions made later were not available, nor was job performance data, so validation evidence is missing. All we really have is a new correlate of predictors.
Interestingly, the study also identified that roughly 5% of e-mail addresses contained information that looked like a date; considering legislation forbidding discrimination on the basis of age, the legality of hiring managers having access to this information is unclear. Although e-mail address appropriateness predicts characteristics of interest (and thus should potentially be included in a packet of information used for hiring), it may contain information itself inappropriate for a manager to see (and thus should not be included). Further research is needed to resolve this tension.
Footnotes:- Blackhurst, E., Congemi, P., Meyer, J. & Sachau, D. (2011). Should you hire BlazinWeedClown@Mail.Com? The Industrial-Organizational Psychologist, 49(2), 27-38.
A recent study by Information Solutions Group, sponsored by PopCap Games, led Gamasutra to claim:
A new study from PopCap Games finds that those who cheat while playing social games are nearly 3.5 times more likely to be dishonest in the real world than non-cheaters, with offenses ranging from cheating on taxes to illegally parking in handicapped spaces.
Although it doesn’t say so explicitly, that’s pretty obviously phrased to lead you to believe that social cheating leads to real-life cheating. Since this was most likely a survey study, it seemed quite unlikely that they could make causal conclusions like that. So that led me to investigate the original study, which you can find for yourself here.
From that, it is clear that this was in fact a survey: a web-based presentation of 38 questions to a sample that ultimately consisted of 801 US respondents and 400 UK respondents (total N=1201). The study specifically excluded anyone who played less than 15 minutes of social games per week. There’s no discussion of how many respondents completed the survey but were excluded, so it’s not very clear how well this survey would generalize to gamers in general (or any larger group).
The survey report starts by emphasizing the growing importance of social games by referencing another study that estimates 118.5 million social gamers in the US and UK in 2011, about a 17% increase from January 2010. There are a lot of social gamers; no surprise there. In PopCap’s study, they further identified that about 81 million play at least once per day, with 49 million playing more than once per day.
The report continued by exploring the profiles of current social gamers: mostly women (55%) with a mean age of 39 years old (down a little bit from last time). They play these games because they think they’re fun, competitive, stress relieving, and a mental workout. I’m curious exactly which social games are a mental workout (FarmVille?), but it was left unreported.
8% of respondents reported using hacks, bots, or cheats in a social game, with 11% saying they had considered it but had not actually tried it. That actually seems a bit high to me, and I wonder how their sample was located; they do not say. If their sample is loaded more toward (and bear with me here) “hardcore social gamers,” the rest of their results are a little less trustworthy. Without details on sampling, there’s no way to know.
Imagine my surprise when I reached the end of the report with absolutely no mention of the finding Gamasutra reports above. You are welcome to search for yourself (and if you find it, please let me know!), but after scanning through page by page, I searched the text for “cheat”, and for the specific percentages reported by Gamasutra. Nothing. So we are left to simply trust Gamasutra’s reporting with no verifiable source. That’s not that uncommon, but it is a bit suspicious when they point to a PDF report to provide support for their statement.
Without that support, there’s not much available to analyze, but we can at least say that the reporting above is a bit misleading. Here are several possible explanations for the reported cheating correlation, assuming it is accurate in the first place:
- People that cheat in social games are rewarded for doing so, and that leads them to cheat in real life.
- People that cheat in real life are rewarded for doing so, and that leads them to cheat in social games.
- People that self-report cheating in one category are more likely to self-report cheating in another category.
- There is an underlying psychological characteristic (e.g. integrity) that leads to cheating behaviors across situations.
As you can probably guess, the last two are more probable than the first two. Although it’s tempting to attribute causality here (much like in the debate on violence in video games causing violence), there is no evidence to suggest this – correlation is not causation. It is more likely that cheaters are cheaters, regardless of context. We’ve just found a new way to identify them.
In a recent study in the Journal of Experimental Psychology: Applied, Roediger, Agarwal, McDaniel and McDermott[1] provide additional evidence for test-enhanced learning as a way to improve memory. It echoes an earlier study of Roediger’s in which he found in a controlled laboratory experiment that students randomly assigned to take a test had greater long-term retention than students randomly assigned to study the material. In this new research, Roediger and colleagues replicate this finding in 3 quasi-experimental field studies.
- Experiment 1: Students were quizzed on the material. They then completed later items on an exam with items parallel to the quiz items. Both chapter exam and semester exam scores of those completing quizzes were higher. Students from multiple course sections participated, and different sections received different pre-test questions; the effect held only for those questions presented in the pre-test.
- Experiment 2: Students were quizzed on the material. They then completed later items on an exam with both parallel and identical items to the quiz items. Again, exam scores were higher. The design in this experiment was similar, except for the addition of a control condition. Recall on the control was similar to that of the non-pre-tested items, lending additional support to the effect.
- Experiment 3: Students were given a multiple choice quiz in class and encouraged to continue quizzing themselves at home using a web-based tool. Students using the quizzing tool had higher exam scores on items from the quiz.
The third of these is the iffiest – the increased test scores could reflect higher-quality students rather than higher-quality studying, and it’s not clear to what extent the first testing effort or the home testing elicited the effect. But the general approach did seem to work for at least some of them.
Test-enhanced learning is potentially valuable in several ways. First, it is a potential application of gamification, which I covered last week. By motivating students to complete optional quizzes using badges and other motivational game-derived elements, students may learn more (and enjoy it!). Second, it is a potential pedagogical tool in both education and employee training. For example, a mid-training 5-minute practice test may increase retention more than simply asking people to review their notes for 5 minutes.
But here’s the big question for me as an educator – does this mean that adding regular quizzes to a course will increase scores on the final exam, even if the quiz questions don’t appear on the final? And perhaps more importantly, does that mean that students actually learned more, or is it because you focused their attention on the topics you knew you’d be testing on? Future research is clearly warranted.
Footnotes:- Roediger, H., Agarwal, P., McDaniel, M., & McDermott, K. (2011). Test-enhanced learning in the classroom: Long-term improvements from quizzing. Journal of Experimental Psychology: Applied, 17(4), 382-395. DOI: 10.1037/a0026252
In a recent study by Landers and Callan[1], undergraduates completed optional multiple-choice tests online and reported them, on average, as “fun”, “enjoyable”, and “rewarding”. They did this in the context of an online social network platform previously covered on this blog. Students were awarded badges (social rewards) in exchange for completing optional practice tests theorized to improve their learning.
This is, to my knowledge, the first published empirical study of gamification in educational settings.
Landers and Callan posit that gamification can be best expressed as an extension of goal setting theory. By making explicit goals and recognizing their achievement, we can motivate people to action. Gamification, in this sense, is the recognition of goals electronically and automatically, without the need for a human mediator (most often an instructor in an educational context or a supervisor in an organizational context). This makes the reward for goal achievement more immediate than is possible with traditional methods and thus more motivating. Goal setting is well established as a motivational intervention in a wide variety of contexts, so we would expect gamification to be similarly versatile – and perhaps even more powerful.
What’s especially interesting about this study (if I do say so myself) is that the authors managed to make the completion of optional multiple choice tests a valid student goal. Most of the time, grades on tests are themselves a performance goal for students. But if you ask them to complete practice tests on their own time, you are often met with varying levels of resistance – or simple apathy. With gamification, about 30% of students enrolled in the social network platform opted to take these tests for no reward other than a virtual badge. I expect you’d see similar success with an organizational training intervention (upcoming research!).
So why should we get students to complete optional multiple choice tests? Because other research suggests that the act of testing promotes long term retention of knowledge better than studying does. Not only do they get a badge, but they learn material for their courses more effectively than they could do on their own! I consider that a win-win.
Footnotes:- Landers, R. N., & Callan, R. C. (2011). Casual social games as serious games: The psychology of gamification in undergraduate education and employee training. In M. Ma, A. Oikonomou, & L. C. Jain (Eds.), Serious Games and Edutainment Applications (pp. 399-423). Surrey, UK: Springer. DOI: 10.1007/978-1-4471-2161-9_20
You might have noticed that I missed last week. Well, that’s because it’s the holiday season, which for academics means intense sessions of writing to make up for all the not-writing during the Fall semester! I’ll be returning to my regular weekly coverage of technology, education, and psychological scholarly articles in January.
In the meantime, I wanted to assure everyone that I was in fact coming back, as well as wish everyone a happy, safe, and productive holiday season!
One of the questions faced by survey designers is presentation order. Does it matter if I put the demographics first? Should I put the cognitive items up front because they require more attention? If I put 500 personality items in a row, will anyone actually complete this thing? Some recent research in the Journal of Business and Psychology reveals that placing demographic items at the beginning of a survey increases the response rate to those items in comparison to demographic items placed at the end. And more importantly, it did not affect scores on the three noncognitive measures that came afterward, in this case: leadership, conflict resolution, and culture and goals measures.
To investigate this, Teclaw, Price and Osatuke[1] conducted a large survey (roughly N = 75000) on behalf of the Veterans Health Administration. Respondents were randomly assigned to one of three surveys. Of those randomly assigned to the third survey, participants received one of seven scales, three of which were those listed above, resulting in a sample size for this study of N = 4508. Respondents completing each of these three surveys were in turn randomly assigned to complete demographic items at either the beginning or end of their survey. The authors compared response rates, considering both skipped items and “Don’t Know” responses to be a lack of response, and included all respondents that opened the survey regardless of how many questions were actually completed.
Response rates were indeed different. On the first of the three focal surveys, the response rate to demographics placed at the beginning of the survey was around 97%, while the response rate to demographics placed at the end was around 87%. While this isn’t a huge difference, if demographics are involved in your primary research questions (and they often are), then placing them at the beginning may be a good idea.
What’s especially interesting about this is that conventional wisdom is to place demographic items at the end. The argument that I have most often heard is that priming your survey respondents with their demographic characteristics (e.g. race) will lead them to respond differently than they otherwise would have. This is especially salient in the context of race-based stereotype threat, the tendency for minority group performance on cognitive measures to decrease as a result of anxiety associated with confirming negative stereotypes about intelligence. So what should we do?
There are two important facts about this study that limit its applicability. First, all measures investigated were noncognitive, i.e. survey items. Stereotype threat typically applies in contexts where there is a “right answer,” for example, knowledge tests or intelligence tests. So the placement of demographic items on such measures may still be important. Second, the study did not control for cognitive fatigue – survey length was confounded with experimental condition. Was the difference because the demographic items were at the beginning vs. the end, or was it simply because respondents had already responded to many, many items and were bored/tired/at a loss for time/etc? Would the effect still hold with a 20-item survey? A 50-item survey? We don’t really know.
If you’re giving a noncognitive voluntary survey, you are probably interested in demographics specifically and want to ensure they are responded to more so than any other items. For now, it appears to be safe to put demographic items up front if that is your goal. Whether your survey is 20 items or 200 items, it is a low cost to move the demographic items on your survey. But if your survey has cognitively-loaded items, I’d still recommend against it.
Footnotes:- Teclaw, R., Price, M., & Osatuke, K. (2011). Demographic Question Placement: Effect on Item Response Rates and Means of a Veterans Health Administration Survey. Journal of Business and Psychology. DOI: 10.1007/s10869-011-9249-y
One of the biggest challenges associated with this newfangled social media is demonstrating monetary return on investment (ROI). A properly run social media campaign can be very expensive, as it takes a lot of time to properly engage an audience. Up to this point, there has been little to link social media to ROI other than an intuitive sense from practitioners of “of course it must have value!” Fortunately, a new research study soon to be published in the Journal of Business Research finally ties social media marketing to some more tangible outcomes. Customers with better perceptions of social media marketing are more likely to purchase the brands represented there.
In their study, Kim and Ko[1] first developed a list of luxury fashion brands to be the focus of the study. They did this by asking a team of fifteen graduate students to list three brands “that came to mind when thinking of luxury.” The list ultimately consisted of: Louis Vuitton, Gucci, Burberry and Dolce & Gabbana, and from this list, Louis Vuitton was chosen to be the focus of the study. This made sense because the next phase of the study would be conducted in Korea, and this was a high-profile brand there with a strong social media presence.
Next, researchers staked out malls in luxury shopping districts in Seoul, where they would intercept shoppers and provide them with a survey. 400 surveys were collected, 362 of which contained complete data. The survey asked questions about shopper perceptions of Louis Vuitton’s social media presence and their perceptions of the Louis Vuitton brand. This enabled researchers to examine how beliefs about social media were related to brand perceptions. Study participants were shown a picture of Louis Vuitton’s Facebook page and Twitter feeds before responding to questionnaires about them.
Using structural equation modeling, the authors tested three potential mediators of the relationship between social media marketing perceptions and purchase intentions/customer equity. If you aren’t familiar with mediation, the basic idea is that A affects B only through its effect on C. For example, eating candy does not itself make one happy – instead, it’s because candy is delicious that it makes one happy. Thus, candy creates deliciousness creates happiness. We would conclude from this that the relationship between candy and happiness is mediated by deliciousness.
Here are the three mediators tested:
- Value Equity. This is a customer’s assessment of how “worth it” the product is. If it’s priced well for a good product, you have very high value equity. If it’s priced badly for a poor product, you have very low value equity.
- Relationship Equity. This is a customer’s assessment of how loyal they are to a brand.
- Brand Equity. This is a customer’s assessment of the added value of the brand beyond the product itself; for example, a customer with high brand equity would perceive a Louis Vuitton product to be of greater value than an identical product from another brand.
Researchers looked at two outcomes:
- Purchase Intentions. Hopefully this one’s pretty obvious. However, the scope of this was not clear from the article – intentions within the next month, six months, year?
- Customer Equity. This is a customer’s assessment of their expected lifetime value with a brand, a combination of expected total purchases, purchase frequency, purchase volume, expected purchases over other brands, and a few other features.
The researchers found a relationship between perceptions of social media marketing and all three mediators, with the strongest relationships to brand equity and relationship equity. But the relationships between mediators and outcomes were more complex: only value equity and brand equity predicted purchase intentions, while only brand equity predicted customer equity.
Overall, we can conclude that customer perceptions of social media marketing are linked to purchase intentions through their effect on value equity. Or in other words, the better a customer perceives your social media marketing effort, the more likely they are to think your products give them more for their money, and the more likely they are to actually purchase something from you.
There are certainly some limitations. First, the study is limited to the variables chosen; there may be other mediators not examined that could affect outcomes. Second, actual behavioral outcomes (e.g. purchases) were not measured; we are relying on self-report of purchase intentions. Third, and most importantly, as a survey-based study, we can make no causal conclusions. So we cannot safely say “if you increase your social media marketing efforts, more people will intend to purchase your products.” That is left to future research. But even with these limitations, this marks the first explicit tying of social media efforts to measurable cash-related outcomes.
Footnotes:- Kim, A., & Ko, E. (2011). Do social media marketing activities enhance customer equity? An empirical study of luxury fashion brand. Journal of Business Research. DOI: 10.1016/j.jbusres.2011.10.014
Computing Intraclass Correlations (ICC) as Estimates of Interrater Reliability in SPSS
If you think my writing about statistics is clear below, consider my student-centered, practical and concise Step-by-Step Introduction to Statistics for Business for your undergraduate classes, available now from SAGE. Social scientists of all sorts will appreciate the ordinary, approachable language and practical value – each chapter starts with and discusses a young small business owner facing a problem solvable with statistics, a problem solved by the end of the chapter with the statistical kung-fu gained.
Recently, a colleague of mine asked for some advice on how to compute interrater reliability for a coding task, and I discovered that there aren’t many resources online written in an easy-to-understand format – most either 1) go in depth about formulas and computation or 2) go in depth about SPSS without giving many specific reasons for why you’d make several important decisions. The primary resource available is a 1979 paper by Shrout and Fleiss[1], which is quite dense. So I am taking a stab at providing a comprehensive but easier-to-understand resource.
Reliability, generally, is the proportion of “real” information about a construct of interest captured by your measurement of it. For example, if someone reported the reliability of their measure was .8, you could conclude that 80% of the variability in the scores captured by that measure represented the construct, and 20% represented random variation. The more uniform your measurement, the higher reliability will be.
In the social sciences, we often have research participants complete surveys, in which case you don’t need ICCs – you would more typically use coefficient alpha. But when you have research participants provide something about themselves from which you need to extract data, your measurement becomes what you get from that extraction. For example, in one of my lab’s current studies, we are collecting copies of Facebook profiles from research participants, after which a team of lab assistants looks them over and makes ratings based upon their content. This process is called coding. Because the research assistants are creating the data, their ratings are my scale – not the original data. Which means they 1) make mistakes and 2) vary in their ability to make those ratings. An estimate of interrater reliability will tell me what proportion of their ratings is “real”, i.e. represents an underlying construct (or potentially a combination of constructs – there is no way to know from reliability alone – all you can conclude is that you are measuring something consistently).
An intraclass correlation (ICC) can be a useful estimate of inter-rater reliability on quantitative data because it is highly flexible. A Pearson correlation can be a valid estimator of interrater reliability, but only when you have meaningful pairings between two and only two raters. What if you have more? What if your raters differ by ratee? This is where ICC comes in (note that if you have qualitative data, e.g. categorical data or ranks, you would not use ICC).
Unfortunately, this flexibility makes ICC a little more complicated than many estimators of reliability. While you can often just throw items into SPSS to compute a coefficient alpha on a scale measure, there are several additional questions one must ask when computing an ICC, and one restriction. The restriction is straightforward: you must have the same number of ratings for every case rated. The questions are more complicated, and their answers are based upon how you identified your raters, and what you ultimately want to do with your reliability estimate. Here are the first two questions:
- Do you have consistent raters for all ratees? For example, do the exact same 8 raters make ratings on every ratee?
- Do you have a sample or population of raters?
If your answer to Question 1 is no, you need ICC(1). In SPSS, this is called “One-Way Random.” In coding tasks, this is uncommon, since you can typically control the number of raters fairly carefully. It is most useful with massively large coding tasks. For example, if you had 2000 cases to rate, you might assign your 10 research assistants to make 400 ratings each – each case is rated by 2 research assistants (you always have 2 ratings per case), but you counterbalance them so that a random two raters make ratings on each subject. It’s called “One-Way Random” because 1) it makes no effort to disentangle the effects of the rater and ratee (i.e. one effect) and 2) it assumes these ratings are randomly drawn from a larger population (i.e. a random effects model). ICC(1) will always be the smallest of the ICCs.
If your answer to Question 1 is yes and your answer to Question 2 is “sample”, you need ICC(2). In SPSS, this is called “Two-Way Random.” Unlike ICC(1), this ICC assumes that the variance of the raters is only adding noise to the estimate of the ratees, and that mean rater error = 0. Or in other words, while a particular rater might rate Ratee 1 high and Ratee 2 low, it should all even out across many raters. Like ICC(1), it assumes a random effects model for raters, but it explicitly models this effect – you can sort of think of it like “controlling for rater effects” when producing an estimate of reliability. If you have the same raters for each case, this is generally the model to go with. This will always be larger than ICC(1) and is represented in SPSS as “Two-Way Random” because 1) it models both an effect of rater and of ratee (i.e. two effects) and 2) assumes both are drawn randomly from larger populations (i.e. a random effects model).
If your answer to Question 1 is yes and your answer to Question 2 is “population”, you need ICC(3). In SPSS, this is called “Two-Way Mixed.” This ICC makes the same assumptions as ICC(2), but instead of treating rater effects as random, it treats them as fixed. This means that the raters in your task are the only raters anyone would be interested in. This is uncommon in coding, because theoretically your research assistants are only a few of an unlimited number of people that could make these ratings. This means ICC(3) will also always be larger than ICC(1) and typically larger than ICC(2), and is represented in SPSS as “Two-Way Mixed” because 1) it models both an effect of rater and of ratee (i.e. two effects) and 2) assumes a random effect of ratee but a fixed effect of rater (i.e. a mixed effects model).
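If you'd like to see what is going on under the hood, here is a rough Python sketch of the single-rater versions of the three ICCs, using the standard mean-squares formulas as I understand them from the Shrout and Fleiss paper. The function name and the small ratings matrix (6 ratees by 4 raters) are illustrative assumptions, not data from this post. Note that the ICC(2,1) formula below is the absolute-agreement form and the ICC(3,1) formula is the consistency form, so SPSS output may differ depending on the type you select.

```python
def single_rater_iccs(ratings):
    """Compute ICC(1,1), ICC(2,1), ICC(3,1) from a ratee-by-rater matrix.

    ratings: list of rows, one row per ratee, one column per rater,
    with no missing values.
    """
    n = len(ratings)       # number of ratees (rows)
    k = len(ratings[0])    # number of raters (columns)
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Partition the total sum of squares
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_ratees = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_raters = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_within = ss_total - ss_ratees
    ss_error = ss_within - ss_raters

    # Mean squares
    bms = ss_ratees / (n - 1)             # between ratees
    wms = ss_within / (n * (k - 1))       # within ratees
    jms = ss_raters / (k - 1)             # between raters
    ems = ss_error / ((n - 1) * (k - 1))  # residual

    icc1 = (bms - wms) / (bms + (k - 1) * wms)
    icc2 = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)
    icc3 = (bms - ems) / (bms + (k - 1) * ems)
    return {"ICC(1,1)": icc1, "ICC(2,1)": icc2, "ICC(3,1)": icc3}

# Illustrative ratings only: 6 ratees (rows) rated by the same 4 raters (columns)
example_ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
example_iccs = single_rater_iccs(example_ratings)
for name, value in example_iccs.items():
    print(name, round(value, 3))
```

In this example the raters differ a lot in their average ratings, so ICC(1,1) is the smallest of the three and ICC(3,1), which ignores those rater mean differences, is the largest.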
After you’ve determined which kind of ICC you need, there is a second decision to be made: are you interested in the reliability of a single rater, or of their mean? If you’re coding for research, you’re probably going to use the mean rating. If you’re coding to determine how accurate a single person would be if they made the ratings on their own, you’re interested in the reliability of a single rater. For example, in our Facebook study, we want to know both. First, we might ask “what is the reliability of our ratings?” Second, we might ask “if one person were to make these judgments from a Facebook profile, how accurate would that person be?” We add “,k” to the ICC rating when looking at means, or “,1” when looking at the reliability of single raters. For example, if you computed an ICC(2) with 8 raters, you’d be computing ICC(2,8). If you computed an ICC(2) with the same 16 raters for every case but were interested in a single rater, you’d still be computing ICC(2,1). For ICC(#,1), a large number of raters will produce a narrower confidence interval around your reliability estimate than a small number of raters, which is why you’d still want a large number of raters, if possible, when estimating ICC(#,1).
After you’ve determined which specificity you need, the third decision is to figure out whether you need a measure of absolute agreement or consistency. If you’ve studied correlation, you’re probably already familiar with this concept: if two variables are perfectly consistent, they don’t necessarily agree. For example, consider Variable 1 with values 1, 2, 3 and Variable 2 with values 7, 8, 9. Even though these scores are very different, the correlation between them is 1 – so they are highly consistent but don’t agree. If using a mean [ICC(#, k)], consistency is typically fine, especially for coding tasks, as mean differences between raters won’t affect subsequent analyses on that data. But if you are interested in determining the reliability for a single individual, you probably want to know how well that score will assess the real value.
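The 1, 2, 3 versus 7, 8, 9 example is easy to check for yourself. Here is a minimal Python sketch; the helper name pearson_r is my own, and the correlation is computed by hand so nothing beyond the standard library is needed.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed from deviations around each mean."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

variable_1 = [1, 2, 3]
variable_2 = [7, 8, 9]

consistency = pearson_r(variable_1, variable_2)
# Average gap between paired scores: large even though the ordering is identical
mean_gap = sum(b - a for a, b in zip(variable_1, variable_2)) / len(variable_1)
print(consistency, mean_gap)
```

The correlation comes out at 1 (perfect consistency), while every pair of scores is 6 points apart (poor absolute agreement).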
Once you know what kind of ICC you want, it’s pretty easy in SPSS. First, create a dataset with columns representing raters (e.g. if you had 8 raters, you’d have 8 columns) and rows representing cases. You’ll need a complete dataset for each variable you are interested in. So if you wanted to assess the reliability for 8 raters on 50 cases across 10 variables being rated, you’d have 10 datasets containing 8 columns and 50 rows (400 ratings per dataset, 4000 total points of data).
A special note for those of you using surveys: if you’re interested in the inter-rater reliability of a scale mean, compute ICC on that scale mean – not the individual items. For example, if you have a 10-item unidimensional scale, calculate the scale mean for each of your rater/target combinations first (i.e. one mean score per rater per ratee), and then use that scale mean as the target of your computation of ICC. Don’t worry about the inter-rater reliability of the individual items unless you are doing so as part of a scale development process, i.e. you are assessing scale reliability in a pilot sample in order to cut some items from your final scale, which you will later cross-validate in a second sample.
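As a rough sketch of that preprocessing step, the Python below collapses item-level ratings into one scale mean per rater per ratee and lays them out with ratees as rows and raters as columns. The rater and ratee names, the three-item scale, and the function `scale_mean_table` are all invented for illustration.

```python
from statistics import mean

# Hypothetical item-level ratings: (rater, ratee) -> item scores on one scale.
# Three items are used here for brevity; the names and numbers are made up.
item_scores = {
    ("rater_a", "ratee_1"): [4, 5, 3],
    ("rater_a", "ratee_2"): [2, 2, 5],
    ("rater_b", "ratee_1"): [5, 5, 5],
    ("rater_b", "ratee_2"): [1, 3, 2],
}


def scale_mean_table(scores):
    """Collapse items to one scale mean per rater per ratee.

    Returns (raters, ratees, table), where table has one row per ratee and
    one column per rater, i.e. the layout used for the ICC dataset.
    """
    raters = sorted({rater for rater, _ in scores})
    ratees = sorted({ratee for _, ratee in scores})
    table = [
        [mean(scores[(rater, ratee)]) for rater in raters]
        for ratee in ratees
    ]
    return raters, ratees, table


raters, ratees, table = scale_mean_table(item_scores)
for ratee, row in zip(ratees, table):
    print(ratee, row)
```

The resulting table, not the item-level data, is what you would feed into the ICC computation.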
In each dataset, you then need to open the Analyze menu, select Scale, and click on Reliability Analysis. Move all of your rater variables to the right for analysis. Click Statistics and check Intraclass correlation coefficient at the bottom. Specify your model (One-Way Random, Two-Way Random, or Two-Way Mixed) and type (Consistency or Absolute Agreement). Click Continue and OK. You should end up with something like this:
In this example, I computed an ICC(2) with 4 raters across 20 ratees. You can find the ICC(2,1) in the first line – ICC(2,1) = .169. The corresponding ICC(2,k), which in this case is ICC(2,4), is .449. Therefore, 44.9% of the variance in the mean of these raters is “real”.
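The single-rater and mean-rating values are linked by the Spearman-Brown formula, so you can sanity-check one against the other. Here is a small Python sketch using the numbers above; the function name is just an illustrative choice.

```python
def average_measures_icc(single_icc, k):
    """Spearman-Brown step-up: reliability of the mean of k raters."""
    return k * single_icc / (1 + (k - 1) * single_icc)


# ICC(2,1) = .169 with 4 raters, as in the example output.
print(round(average_measures_icc(0.169, 4), 3))  # 0.449
```

This also shows why adding raters helps: the same weak single-rater reliability yields a noticeably more reliable mean.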
So here’s the summary of this whole process:
- Decide which category of ICC you need.
  - Determine if you have consistent raters across all ratees (e.g. always 3 raters, and always the same 3 raters). If not, use ICC(1), which is “One-way Random” in SPSS.
  - Determine if you have a population of raters. If yes, use ICC(3), which is “Two-Way Mixed” in SPSS.
  - If you didn’t use ICC(1) or ICC(3), you need ICC(2), which assumes a sample of raters, and is “Two-Way Random” in SPSS.
- Determine which value you will ultimately use.
  - If a single individual, you want ICC(#,1), which is “Single Measures” in SPSS.
  - If the mean, you want ICC(#,k), which is “Average Measures” in SPSS.
- Determine which set of values you ultimately want the reliability for.
  - If you want to use the subsequent values for other analyses, you probably want to assess consistency.
  - If you want to know the reliability of individual scores, you probably want to assess absolute agreement.
- Run the analysis in SPSS.
  - Analyze>Scale>Reliability Analysis.
  - Select Statistics.
  - Check “Intraclass correlation coefficient”.
  - Make choices as you decided above.
  - Click Continue.
  - Click OK.
- Interpret output.
- Shrout, P., & Fleiss, J. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420-428. DOI: 10.1037/0033-2909.86.2.420
So as you might have picked up by now, I’m part of the faculty at Old Dominion University. Imagine my surprise to discover that Matt Jabaily, Education Reference Librarian, is running a multiplayer Minecraft server that acts as an introduction to the new library extension! If you aren’t familiar with Minecraft, it is a runaway indie hit in which players can manipulate everything about their environment, one block at a time, with a sort of zombie survival aesthetic; you must carefully explore your environment and build a shelter to survive the zombie wave that comes out at night. Here’s a description of the library’s version:
Monsters have attacked the Learning Commons, and only you can save it! Destroy the invading mobs and light up the library to keep more from spawning. As you do, you’ll learn more about the resources and services available at the Learning Commons, finding weapons, armor, and rewards as you explore. Play alone or with friends live on the server (IP: 96.8.117.2:25607) or request the single player download to modify and play the game according to your desire (email mjabaily@odu.edu).
Each area of this new part of our library is represented in Minecraft for students to explore and learn – plus zombies! The zombie extension is a special Halloween-inspired version that will be retired on November 18, at which point the server will revert to an ordinary library again. So your time is short to try it out!
I don’t think defeating zombies is necessarily related to any vital library skills, but perhaps it will motivate students to investigate aspects of the library they otherwise would not have. The requirement to own Minecraft (which is not university-provided – although it should be!) is certainly a barrier to student adoption.
So will this be more effective at getting students to learn about new library resources than anything else? Who knows? At the very least, it sure is a fun way to try.
One of my graduate students, Katelyn Cavanaugh, has decided to venture into the brave new world of science funding called crowdsourcing for her I/O Psychology Master’s thesis. Here are the two facts that led us to this point:
- Science funding is very hard to come by, as most of it comes from federal grants. As a graduate student, you generally don’t qualify for many federal grants. As a result, many graduate students either a) ask less interesting research questions than they really want to ask because they don’t have enough money to ask them or b) pay for their projects out of pocket. And asking someone making less than $12,000 per year to supply nearly 10% of her yearly salary to advance the cause of science is not terribly fair.
- The Internet enables new forms of funding that previously did not exist. Thanks to services like RocketHub, anyone in need of funds can simply ask for them! Generous patrons surfing the Internet can browse through folks that need money to do good works, and choose whom to fund. Have an amazing art project that will require purchasing rare plants only found on one slope on one mountain in the Alps? Crowdsource it, and perhaps a curious art lover will provide you the cash you need to make it a reality.
So, as you might guess, cash-poor scientists have begun to use crowdsourcing to fund their research. The effort is called SciFund, and it may be the future of science funding. This year, researchers are seeking funding for 49 projects, one of which is Katelyn’s Master’s thesis. I’ll let her explain her project in her own words from her windowless-4-person-shared-office/closet provided generously by ODU (YouTube video below):
Less than a day old, and already 4 funders – please join them! Any donation is valuable; every dollar gets her more participants for her study. You can make your own donation to her research before December 15 by visiting http://rockethub.com/projects/3800-learner-control-in-online-training-programs
When students feel that their performance on learning assessments is out of their control, they create excuses before the assessment has even taken place. These excuses are often not based in reality; instead, the students are trying to protect their self-esteem – if they ultimately fail, they have already created an excuse as to why it happened, and if they succeed, they have done so in the face of adversity. Either way, the student’s self-concept is protected. This practice is called self-handicapping.
In a fascinating article in the Association for Psychological Science’s Observer, rather amusingly titled “Teacher, I May Not Do Well on the Test Next Week Because I May Have to Babysit My Sister,” Valkyrie and Tobin[1] discuss the research on self-handicapping. They describe the myriad excuses students use a priori to excuse themselves from the responsibility of a difficult test looming before them.
This isn’t necessarily done to annoy instructors, although it often feels like it to the instructor. Instead, students are trying to protect themselves. Knowing that the test is difficult, they shift the blame from themselves to external factors. It’s not that I didn’t study enough, or that I didn’t know what was coming – instead, the world is conspiring against me so that I have no choice but to fail.
Of course, in reality, these excuses are nonsense. Needing to babysit next week certainly does not prevent you from studying now. But they are convincing to the students making the argument, perhaps through a mechanism similar to a self-fulfilling prophecy.
The authors suggest several techniques that teachers might use to minimize the effect of self-handicapping:
- Teach students about self-handicapping so that they are strengthened against it.
- Create an environment that discourages self-handicapping by:
  - Being supportive of students.
  - Explaining why staying on task is important, and how course objectives relate to staying on task.
  - Emphasizing equity and fairness, and that the power to succeed is shared by the teacher and student.
- Don’t use motivational structures that pit students against each other (i.e. don’t use competition to motivate).
- Teach learning strategies to students.
- Consider why students are motivated to perform (or not) in your classes and work toward those motivations.
These are generally good recommendations, although I’m not sure about the one on avoiding competition. Certainly outright competition for grades is not a great idea, but revealing grade ranges so that students have some sense of how well they are doing relative to others can be useful. This de-legitimizes claims like, “This class is too hard, everyone is failing!” and can be a useful tool to give students a better sense of perspective.
Footnotes: Valkyrie, K., & Tobin, C. (2011). ‘Teacher, I may not do well on the test next week because I may have to babysit my sister.’ Observer, 24(8).
If you want to stay current with your research area, you need to read new journal articles as they come out. In the past, this was much more difficult; you needed to subscribe to journals and have them delivered to your office or home, or trudge over to your university library. Archaic!
These days, you can often be notified of and read new journal articles as they are placed in press. This puts you at the very forefront of new research; you can read articles after peer review, the moment they are made available to the public, often pre-publication. To do this, you need only two things:
- A news reader (sometimes called a news aggregator) that processes RSS feeds
- Each journal’s RSS feed
RSS stands for Really Simple Syndication. This is a technique used by many websites (not just journals) to provide plain, unformatted, transportable content to anyone that might want it, which is packaged as an RSS feed. For example, this site’s RSS feed is http://neoacademic.com/feed, which is also linked on the top right of this page. It contains the content of all the articles I post here, as they are released, and nothing more.
That way, if you have an RSS feed reader, you can go to a single website (or run a single program) and view the RSS feeds from all of your favorite websites (and journals!) in one place. This eliminates the need to visit a large number of different websites each day and hunt for new content. The news reader remembers what you’ve already read; all you see is new content.
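If you are curious what a news reader does under the hood, here is a minimal Python sketch that parses a feed and reports only the items it has not seen before. The sample feed is made up (it is not any journal’s real feed), and the function name `new_items` is an illustrative choice; a real reader would download the feed from its URL and store the seen list between sessions.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 document standing in for a journal feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Journal (in press)</title>
    <item>
      <title>First article</title>
      <link>http://example.com/articles/1</link>
    </item>
    <item>
      <title>Second article</title>
      <link>http://example.com/articles/2</link>
    </item>
  </channel>
</rss>"""


def new_items(feed_xml, seen_links):
    """Return (title, link) pairs not seen before and remember them."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        if link not in seen_links:
            seen_links.add(link)
            fresh.append((title, link))
    return fresh


seen = set()
print(new_items(SAMPLE_FEED, seen))  # both articles on the first pass
print(new_items(SAMPLE_FEED, seen))  # [] on the second pass
```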
So to enjoy the wide world of journal RSS feeds, you need an RSS reader – and there are many to choose from. To start, I suggest Google Reader, which is free and quite easy to use. If you already have a Google account (and who doesn’t?), just click on that link to see your very own Google Reader. The first time you use it, it will be a bit empty. That’s because you need to fill it up with RSS feeds. I suggest starting with this one. If you’re using a modern web browser, adding RSS feeds is very easy – just click on the link to a feed and follow the prompts.
Once you get comfortable with RSS, you should think about the easiest way to get your RSS content. Do you want to stick with the cloud (i.e. Google) or use a standalone program where you can more easily save your favorite RSS entries for later? If you already use an e-mail management program, like Outlook, an RSS reader is often built in so that you can read your RSS with your e-mail. If not, there are plenty of other options.
Finding RSS feeds can be quite easy (just look for the little orange RSS symbol, like the one on this page) or quite difficult (involving hunting through many layers of badly formatted publishers’ web pages). A well-crafted Google search is, as always, the best solution. Simply search Google for “JournalName rss” and you are likely to find what you’re looking for – at least, as long as the journal publisher has decided to release an RSS feed.
Here are a couple of lists of RSS feeds to get you started (thanks to Jeremy Anglim for providing me a starting point for the I/O list):
Related to I/O Psychology
- Administrative Science Quarterly
- Human Performance
- Human Resource Development Quarterly
- Human Resource Management
- In-press articles from the Academy of Management (includes AMJ, AMR, AMLE, and AMP)
- Industrial and Organizational Psychology: Perspectives on Science and Practice
- International Journal of Selection and Assessment
- International Journal of Training and Development
- Journal of Applied Psychology
- Journal of Business and Psychology
- Journal of Management
- Journal of Managerial Psychology
- Journal of Occupational and Organizational Psychology
- Journal of Organizational Behavior
- Journal of Personality and Social Psychology
- Organizational Behavior and Human Decision Processes
- Organizational Research Methods
- Personnel Psychology
You can also subscribe to the RSS feeds of the Society for Industrial and Organizational Psychology’s blog.
Related to Psychology, Games, Technology, and Education
- Computers and Education
- Computers in Human Behavior
- Cyberpsychology, Behavior and Social Networking
- Game Studies
- Games and Culture
- International Journal of Gaming and Computer-Mediated Simulation
- International Journal of Game-based Learning
- Journal of Computer-Mediated Communication
- Journal of Media Psychology
- Simulation and Gaming
And finally, consider subscribing to a few I/O Psychology blogs!
In an upcoming issue of Cyberpsychology, Behavior and Social Networking, Maslowska, van den Putte and Smit[1] explore the value of personalizing mass e-mail newsletters. They found that personalizing a newsletter caused an improved reaction to the content of that newsletter.
Research on personalization, to this point, has been mixed. The majority of studies have found personalization to improve outcomes, but a few have found negative effects. Hypothesized effects vary, but generally fall along the lines of persuasion; the article lists enhanced attention, memory, intentions and behaviors as potential outcomes.
The use of personalization itself is surprisingly simple. In this study, the authors manipulated it by adding the recipient’s name in three places in the newsletter – a relatively low-effort change. The authors randomly assigned Dutch undergraduates to one of two conditions: personalized or generic e-mails promoting the “University Sports Centre.” The newsletter was then distributed; at the bottom of the newsletter, readers could follow an optional link to a survey, compensating participants with a raffle entry. This resulted in 105 cases in the final dataset.
The study is not terribly clear as to response rates. My suspicion is that it was sent to a much larger number of students, and only 109 responded (4 were already Sports Centre users and were discarded). This probably doesn’t matter anyway, as the missingness was likely at random – I find it unlikely that the likelihood of responding to a newsletter survey to be entered into a raffle is correlated with the effect personalization might have on persuasion.
In comparing the two conditions with independent-samples t-tests, they found an effect on the evaluation of the newsletter but not on any attitude or behavioral outcomes. Or in other words, those with personalized newsletters liked the newsletters more, but didn’t like the subject of those newsletters any more (i.e. the Sports Centre). The effect was rather small (eta squared = .05).
The authors also examined a variety of two-way interactions, but approached the questions in a non-standard way. Typically, we would use hierarchical multiple regression to determine if interactive effects provided incremental prediction above and beyond main effects. Instead, the authors did some sort of convoluted analysis of those interactions using ordinary multiple regression, followed up by simple slopes comparisons. It’s hard to tease apart, but it seems as if they examined the effect of the interaction alone on the outcome, which would be blatantly incorrect, if true. But it is difficult to tell.
So what can we reasonably conclude from this article? It does seem that personalizing a newsletter e-mail results in the recipient liking the newsletter somewhat more. This might have other long-term effects; for example, recipients might be more likely to stay subscribed to the newsletter, although such implications are, of course, not tested in this article. But personalization does not seem to affect how the letter-receiver acts on the information it contains. At least when it comes to advertisement of on-campus resources to undergraduates, anyway.
Would I personalize a newsletter? Sure! There seem to be few if any negative effects, and it’s a very easy change to make. Mail merges are not that complicated. So even if the positive effect is limited, any benefit at all makes the cost-benefit ratio quite good.
Footnotes: Maslowska, E., Putte, B., & Smit, E. (2011). The Effectiveness of Personalized E-mail Newsletters and the Role of Personal Characteristics. Cyberpsychology, Behavior, and Social Networking. DOI: 10.1089/cyber.2011.0050
In his 2008 book, Outliers, Malcolm Gladwell criticizes the traditional “top-down” selection method used by organizations hiring on the basis of cognitive ability (colloquially referred to as “intelligence”). By his argument, there is a certain level at which additional cognitive ability simply is not valuable.
It is certainly an attractive idea; Gladwell argues that it is historical advantages that bias scores on employee selection devices. John is lucky to come from a good family, happens to put himself in a position to take advantage of the opportunities that present themselves, and gets hired to a solid position more by luck than by skill. By this logic, there are many qualified applicants; some were simply more lucky than others, and hiring from the top-down merely punishes the unlucky ones without any real benefit to the organization. And if the success of your highest performing employees is due to luck, there’s not much reason to worry about selecting the best of the best, right?
In a recent article in Psychological Science, Arneson, Sackett, and Beatty[1] call Gladwell’s approach the good-enough hypothesis while they call the traditional approach the more-is-better hypothesis. They test the relative value of these hypotheses by comparing cognitive ability and performance in four large datasets containing a total of 170,286 people:
- Project A. In the late 1980s, US soldiers took part in a large-scale project designed to link various selection devices to soldier performance. Those devices included the ASVAB, which itself contained a strong cognitive ability component. Actual ratings of job performance were standardized and compared with cognitive ability scores derived from the ASVAB.
- College Board SAT. This dataset contained SAT scores and GPA data. GPA was standardized within schools and compared with SAT performance.
- Project TALENT. The Department of Health, Education and Welfare (before Health and Education split in the late 1970s) ran a longitudinal study of high school seniors, starting in 1960, tracking a large group of them through their college GPA. This GPA was compared with a composite ability score collected from the study.
- National Education Longitudinal Study of the Class of 1988. Standardized verbal and mathematics ability scores of eighth graders were compared with their later college GPAs.
To determine which hypothesis better described the data across these datasets, the authors did the following:
- Checked for ceiling effects (to ensure that the data toward the upper end of ability was not clustered)
- Used a lowess smoother on the scatterplot of the ability-performance relationship to visualize the relationship at all levels of ability
- Used a power polynomial test to identify curvilinearity in the relationship
- Used a second power polynomial test on only the high ability folks (z > 1)
- Compared regression slopes (b) of performance on ability at four segments of the relationship: minimum to z = -1, -1 < z < 0, 0 < z < 1, and 1 to maximum
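The last of these checks is easy to picture with a small Python sketch: fit an ordinary least-squares slope separately within each ability band and see whether the slopes flatten (good-enough) or not (more-is-better). This is my own illustration, not the authors’ code; the data are a toy convex curve, the function names are invented, and the handling of scores that fall exactly on a band boundary is an assumption.

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def segment_slopes(z, y):
    """Slopes of y on z within four ability bands (boundary handling assumed)."""
    bands = [
        lambda v: v <= -1,       # minimum to z = -1
        lambda v: -1 < v <= 0,   # -1 < z < 0 (z = 0 is placed here by assumption)
        lambda v: 0 < v < 1,     # 0 < z < 1
        lambda v: v >= 1,        # 1 to maximum
    ]
    slopes = []
    for in_band in bands:
        pairs = [(a, b) for a, b in zip(z, y) if in_band(a)]
        slopes.append(ols_slope([a for a, _ in pairs], [b for _, b in pairs]))
    return slopes


# Toy data only: a gently convex curve, so the slope grows with ability.
z = [i * 0.25 - 2 for i in range(17)]
y = [0.5 * v + 0.1 * v * v for v in z]
print([round(s, 3) for s in segment_slopes(z, y)])  # [0.2, 0.425, 0.6, 0.8]
```

Under the good-enough hypothesis you would expect the last slope to drop toward zero; a pattern like the one printed here, with slopes rising across bands, is what more-is-better with a strengthening relationship looks like.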
Across analysis types, it became clear that the more-is-better hypothesis better described the data. But more surprisingly, ability appeared to be more important at the high end than it was at the low end. Or in other words, the relationship between ability and performance gets stronger among those with high ability. Gladwell could not have been more inaccurate in describing these relationships.
Interestingly, Gladwell may have previously had a more anti-selection bent in his book that was perhaps edited down later after he became aware of the massive research literature on hiring using psychological measurement.
Footnotes: Arneson, J., Sackett, P., & Beatty, A. (2011). Ability-Performance Relationships in Education and Employment Settings: Critical Tests of the More-Is-Better and the Good-Enough Hypotheses. Psychological Science. DOI: 10.1177/0956797611417004
According to recent research appearing in Science, the traditional advice to graduate students to minimize their time spent teaching and maximize their time spent on research may ultimately harm the development of their research skills.
Previous research examining the relationship between teaching and outcomes has been difficult to interpret. For example, in one study, 524 students self-reported their publications and presentations, and these numbers were compared between students teaching and conducting research versus students only conducting research. Students teaching had higher presentation and publication rates. But this relied on self-report; would it hold true if we had more objective measures?
In their article, Feldon et al.[1] describe their research study in which 95 graduate students within the first three years of their education completed a validated research skills measure at two time points: at the beginning and end of an academic year. The research skills measure involved the creation and revision of a research proposal in the student’s area of interest. No feedback was provided between the two time points.
Analyses revealed that student teacher-researchers were moderately better at generating testable hypotheses (d = .40) and at generating valid research designs (d = .48) than student researchers. This indicated to the authors that “teaching experience can contribute substantially to the improvement of essential research skills.”
Hold on a minute… how did they conclude causality? Teaching experience “contributing” to research skills implies that teaching causes an increase in research skills, but that conclusion is unjustified given the non-experimental design. After a bit of digging in the online supplement, I discovered that they included the Time 1 scores as a covariate in their MANCOVA to attempt to account for pre-existing group differences, but that does not change the fact that this is a correlational study.
Sure, there are differences between groups. The causal element could be group membership (teacher-researcher vs researcher). But it could also be any number of individual differences correlated with group membership but uncorrelated with pre-test scores. Perhaps, as I suggested earlier, more highly skilled/qualified graduate students are attracted to teaching roles? Perhaps students also teaching simply have more to prove? The source of this variance is unclear.
Perhaps more disconcerting – the authors only report 2 of the 10 dimensions as statistically significant in the expected direction. The other 8 were not; in fact, one was even opposite of the hypothesized direction.
Even with these limitations, the results are still interesting. Graduate students teaching do have higher scores on two outcome dimensions, even when controlling for pre-test differences. Why does this happen? Is it causation, or an interaction between graduate student individual differences and time?
In conclusion, I found this study somewhat of a paradox. It is in Science, which is a top-tier journal by any account, but it is a non-experimental design with causal conclusions. So most surprisingly to me, this study proves you can get a correlational design from the social sciences with 95 participants published in Science. Who knew?
Footnotes: Feldon, D., Peugh, J., Timmerman, B., Maher, M., Hurst, M., Strickland, D., Gilmore, J., & Stiegelmeyer, C. (2011). Graduate Students’ Teaching Experiences Improve Their Methodological Research Skills. Science, 333(6045), 1037-1039. DOI: 10.1126/science.1204109









