Are you a psychologist interested in learning some new techniques to leverage data science in your academic research or in your consulting practices? Web scraping may be the answer you need.
Last year, I published the first in what will likely be a series of articles focused on teaching psychologists techniques from data science. Specifically, I introduced the concept of web scraping, which involves the systematic, algorithmic curation of unstructured online data, usually from social media, and its conversion into an analyzable dataset. I furthermore provided a step-by-step tutorial explaining how to use the free programming language Python and its free package scrapy to do just that.
This year, I’ll be presenting three workshops on web scraping in various venues. Each presentation is somewhat different in focus and learning objectives, so feel free to attend all three!
- April 28, 2017: Automated conversion of social media into data: Demonstration and tutorial (3 hours)
Part of the Friday seminar series at the 2017 Annual Conference of the Society for Industrial and Organizational Psychology (SIOP) in Orlando, FL.
- July 17, 2017: Web scraping and machine learning for employee recruitment and selection: A hands-on introduction (3.5 hours)
A pre-conference workshop for the International Personnel Assessment Council (IPAC) annual conference in Birmingham, AL.
- August 3-6 (TBD), 2017: How to create a dataset from Twitter or Facebook: Theory and demonstration (1.8 hours)
A skill-building session for the American Psychological Association (APA) annual conference in Washington, DC.
All three presentations will start with an explanation of data source theory, the key theoretical consideration affecting external validity when trying to identify high-quality sources of online information for research.
Additionally, the SIOP presentation will focus on instruction in Python and scrapy, mimicking the online tutorial I provided but with some extra information and a lot of hands-on examples.
The IPAC presentation will focus on the practicalities of web scraping: the tradeoffs among data sources when using web scraping for employee selection and recruitment, demonstrations of both easy-to-use commercial scraping packages and the manual, Python-based approach, and interactive discussion of use cases.
The APA presentation will be a hands-on walkthrough of accessing the Facebook and Twitter APIs, which let you web scrape with far less programming than you would need without an API!
With any of the three, you should be able to leave the workshop and curate a new internet-sourced dataset immediately!
I believe all three provide CE credit, but I’ll update this when I know for sure! See you in Orlando, Birmingham and Washington!
Careless responding is one of the most fundamental challenges of survey research. We need our respondents to respond honestly and with effort, but when they don’t, we need to be able to detect and remove them from our datasets. A few years ago, Meade and Craig published an article in Psychological Methods exploring a significant number of techniques for doing exactly this, ultimately recommending a combination of three detection techniques for rigorous data cleaning, which, let’s face it, is a necessary step when analyzing any internet survey. These techniques are even-odd consistency, maximum longstring, and Mahalanobis D:
- The even-odd consistency index involves calculating subscale means for each measure on your survey, split by even and odd items. For example, the mean of items 1, 3, 5, and 7 becomes one subscale, whereas the mean of items 2, 4, 6, and 8 becomes the other. Next, for each respondent, you pair the even subscales with the odd subscales across all of the measures in your survey, calculate the correlation between them, and then apply the Spearman-Brown prophecy formula, yielding an index that ranges from -1 to 1.
- Maximum LongString is the largest value for LongString across all scales on your survey, where LongString is the number of identical responses in a row. Meade and Craig noted that LongString is most useful when the items are randomly ordered.
- Mahalanobis D is calculated from the regression of scale means onto all the scores that inform them. In a sense, you are checking whether responses to individual items correspond with the scale means they created, consistently across individuals. Some conceptualizations of this index instead regress participant number onto scores, which accomplishes much the same thing conceptually.
In all three cases, the next step is to create a histogram of the values and look for outliers.
Calculating Careless Responding Indices
Of these three, Mahalanobis D is the most easily calculated, because saving Mahalanobis D values is a core feature in regression toolkits. It is done easily in SPSS, SAS, R, etc.
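If you are working in Python rather than a regression toolkit, the distance itself is easy to compute directly from the inverse covariance matrix. Here is a minimal sketch (the function name `mahalanobis_d` is my own; it assumes complete data, scores stored respondent-by-row, and a non-singular covariance matrix):

```python
import numpy as np

def mahalanobis_d(scores):
    """Mahalanobis distance of each respondent from the sample centroid.

    scores: 2D array, shape (respondents, scale scores).
    """
    centered = scores - scores.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(scores, rowvar=False))
    # D^2 = x' S^-1 x for each centered row x
    d_squared = np.einsum('ij,jk,ik->i', centered, cov_inv, centered)
    return np.sqrt(d_squared)
```

A respondent answering very differently from everyone else will show a visibly larger D than the rest of the sample, which is exactly what you look for in the histogram.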
The second, the even-odd consistency index, is a bit harder but still fundamentally not too tricky; you just need to really understand how your statistical software works. Each step, individually, is simple: calculate scale means, calculate a correlation, apply a formula.
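Those three steps can be sketched in Python as well. This is a minimal illustration, assuming responses are stored respondent-by-row in a NumPy array; `even_odd_consistency` and the `scales` layout (a list of each measure's column indices, in administered order) are my own hypothetical conventions, not from the original tutorial:

```python
import numpy as np

def even_odd_consistency(responses, scales):
    """Per-respondent even-odd consistency index.

    responses: 2D array (respondents x items)
    scales: list of lists of column indices, one list per measure
    """
    indices = []
    for person in responses:
        evens, odds = [], []
        for items in scales:
            # Split each scale's items by position: 1st, 3rd, ... vs 2nd, 4th, ...
            odds.append(person[items[0::2]].mean())
            evens.append(person[items[1::2]].mean())
        # Correlate the even and odd subscale means across measures
        r = np.corrcoef(evens, odds)[0, 1]
        # Spearman-Brown prophecy formula for a test of doubled length
        indices.append((2 * r) / (1 + r))
    return np.array(indices)
```

Note that the index is undefined when a respondent's subscale means show no variance (or correlate at exactly -1), so in practice you would want to guard against those cases before plotting the histogram.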
The third, Max LongString, is the most intuitively understandable but also, perhaps unexpectedly, the most difficult to calculate. I imagine that the non-technically-inclined generally count by hand – "this person has a maximum of 5 identical answers in a row, the next person has 3…"
An SPSS macro already exists to do this, although it’s not terribly intuitive. You need to manually change pieces of the code in order to customize the function to your own data.
Given that, I decided to port the SPSS macro into Excel and make it a little easier to use.
An Excel Macro to Calculate LongString
```vba
Function LongString(cells As Range)
    Dim cell As Range
    Dim run As Integer
    Dim firstrow As Boolean
    Dim maxrun As Integer
    Dim lastvalue As String

    firstrow = True
    run = 1
    maxrun = 1

    For Each cell In cells
        If firstrow = True Then
            firstrow = False
            lastvalue = cell.Value
        Else
            If cell.Value = lastvalue Then
                run = run + 1
                maxrun = Application.Max(run, maxrun)
            Else
                run = 1
            End If
            lastvalue = cell.Value
        End If
    Next cell

    LongString = maxrun
End Function
```
To Use This Code Yourself
- With Excel open, press Alt+F11 to open the VBA Editor.
- Copy/paste the code block above into the VBA Editor.
- Close the VBA Editor (return to Excel).
- In an empty cell, simply type =LONGSTRING() and put the cell range of your scale’s survey items inside. For example, if your first scale was between B2 and G2, you’d use =LONGSTRING(B2:G2)
- Repeat this for each scale you’ve used. For example, if you measured five personality dimensions, you’d have five longstrings calculated.
- Finally, in a new cell use the =MAX() function to determine the largest of that set. For example, if you put your five LongStrings in H2 to L2, you’d use =MAX(H2:L2)
That’s it! Importantly, the cells need to be in Excel in the order they were administered. If you used randomly ordered items, this adds an additional layer of complexity, because you’ll need to recreate the original order for each participant first before you can apply LongString. That takes a bit of Excel magic, but if you need to do this, I recommend you read up on =INDIRECT() and =OFFSET(), which will help you get that original order back, assuming you saved that item order somewhere.
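For those working outside Excel, the same computation is straightforward in Python. This is a sketch of the same idea as the macro above, under my own assumed conventions: `responses` holds one respondent's answers in administered order, and `scales` is a hypothetical list of (start, end) index pairs marking where each scale's items sit:

```python
def max_longstring(responses, scales):
    """Max LongString across scales for one respondent.

    responses: list of answers, in the order administered
    scales: list of (start, end) index pairs, end exclusive
    """
    def longstring(values):
        # Longest run of identical consecutive responses
        run = max_run = 1
        for prev, cur in zip(values, values[1:]):
            run = run + 1 if cur == prev else 1
            max_run = max(max_run, run)
        return max_run

    return max(longstring(responses[a:b]) for a, b in scales)
```

As with the Excel version, you would compute this per respondent and then inspect the distribution of the resulting values.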
Once you have Max LongString calculated for each participant, create a histogram of those values to see if any strange outliers appear. If you see clear outliers (i.e., a cluster of very high Max LongString values off by itself, away from the main distribution), congratulations, because it’s obvious which cases you should drop. If you don’t see clear outliers, then it’s probably safer to ignore LongString for this analysis.
So, what’s the difference between industrial and organizational psychology?
The difference these days is quite fuzzy, but it used to be much clearer. Let me tell you a little story.
In the old and ancient times for the field of psychology – which of course means the end of the 19th and first half of the 20th century – there was only one field: industrial psychology. It did not always formally have this name (e.g., people calling themselves “industrial psychologists” were often found in “counseling psychology” or “applied psychology” organizations), but it was what we now think of as historical “industrial psychology.” Industrial psychology was for the most part (although not entirely) focused on improving production in manufacturing and other manual labor sorts of jobs, as well as improving soldier performance on the battlefield (which at that time was also often manual labor).
In manufacturing, managers noticed that employees seemed to work harder sometimes and less hard other times, and they were not sure why. A bunch of researchers with names you’ll recognize if you have studied I/O – Hugo Munsterberg, Walter Dill Scott, James Cattell, and Edward Titchener in particular – promoted the idea that the fledgling field of psychology might be able to shed some light on this. They would of course believe this as they were all students of Wilhelm Wundt, the grandfather of modern psychology.
The growth of industrial psychology was also heavily influenced by and contributed to a movement in the early 1900s called Taylorism, reflecting the viewpoint of Frederick Taylor, a mechanical engineer by training who was inspired by Munsterberg and others. His view was that the American worker was slow, stupid, and unwilling to do any work except by force or threat. However, he also viewed science as the only way to fix the problem he perceived. The popularity of Taylorism (sometimes called “scientific management”) in the US and around the world (Stalin reportedly loved the idea) paved the way for our field to grow, for better or worse.
As a result, industrial psychology at that time had a lot of overlap with what we now call “human factors psychology.” Studies were often conducted like the famous ones by Elton Mayo at the Hawthorne plant of Western Electric, where key elements of the environment – such as lighting – were varied systematically and the effects on worker behavior observed using the scientific method. In fact, if you poke into the history of specific I/O graduate programs, you’ll often find a split between I/O and HF somewhere in their past. The goal of many studies of that era could be described as trying to trick the worker into working harder. The interesting thing about such techniques: they do work… at least to a certain degree.
In addition to trying to change performance while people were at work, other industrial psychologists became interested in hiring. Specifically, many believed that if they could design the “perfect test,” they could find the absolute most productive workers for these businesses. These tests were typically intended to be assessments of intelligence, early versions of what we now conceptualize as latent “general cognitive ability.” Among the earliest and best-known examples of these efforts were the Army Alpha and Army Beta, tests used by the US Army in World War I and given to more than a million soldiers for the purposes of assessing readiness to become a soldier, placing recruits into specific military positions, and also – the first hint of a later shift – identifying high-potential leaders. These tests are early versions of a current test that is still maintained and studied by I/O psychologists: the ASVAB.
As industrial psychology grew, so did the feeling that our field was missing something. The Hawthorne studies I referenced earlier are often credited as being the trigger point for this, but Hawthorne better serves as an example of this shift than as its cause. As early as the 1930s, people became aware that industrial psychology’s focus on predicting and improving performance often ignored other aspects of the worker, specifically those involving people’s feelings. Motivation, satisfaction, how people get along with others – these topics were not of much concern among industrial psychologists, and a number of studies, including those at Hawthorne, increased interest in the application of psychology to the broader workplace. They also wondered if performance could be increased further by looking beyond hiring and worker manipulation – perhaps there is more we could do?
Thus, in 1937, the first organization devoted to I/O was created: Section D of the American Association for Applied Psychology, Industrial and Business. The AAAP merged into the American Psychological Association in 1945, rebirthing our field as APA Division 14: Industrial and Business Psychology. The shift from “Business” to “Organization” reflected changing priorities over several decades. Dissatisfaction with the explicit ties to Business (and not, for example, the military, government, etc.) resulted in the division being renamed simply “Industrial Psychology” in 1962. With the shift away from an industrial economy in the 1960s, dissatisfaction with the term “Industrial” led to the name we have today as of 1973: Industrial and Organizational Psychology.
So the short version of this answer is that: the distinction between industrial and organizational psychology these days is not a particularly strong one. It is instead based on historical shifts in priorities among the founding and early members of the professional organizations in our field. If I had to split them, I’d say people on the industrial side tend to focus more on things like employee selection, training and development, performance assessment and appraisal, and legal issues associated with all of those. People on the organizational side tend to focus more on things like motivation, teamwork, and leadership. But even with that distinction, people on both sides tend to borrow liberally from the other.
There was also a historical association of industrial psychology with more rigorous experimentation and statistics, largely because the focus on hiring could only be improved with those methods. The topics common to org psych were much broader with much more unexplored territory for a lot longer. But that has changed too – there aren’t many org psych papers published anymore without multilevel or structural equation modeling, as contributions on both the I and O sides have become smaller and more incremental than in the past. The old days of I/O were practically a Wild West! You could essentially just go into an organization, change something systematically, write it up, and you’d have added to knowledge. These days, it’s a lot harder.
Behind the scenes of all these theoretical and stance changes was also a huge ongoing battle with the American Psychological Association over where our field should fit as a professional organization (did you ever think it strange that SIOP incorporated as a non-profit while still a part of APA?), a problem that continues to this day. But that’s a different story!
“The Difference Between Industrial and Organizational Psychology” originally appeared in an answer I wrote on Quora.