Internet Scraping for Research: A Python Tutorial for Psychologists

2016 June 15
by Richard N. Landers

One of the biggest challenges for psychologists is gaining access to research participants. We go to great lengths, from elaborate and creative sampling strategies to spending hard-earned grant money, to get random(ish) people to complete our surveys and experiments. Despite all of this effort, we often lack perhaps the most important variable of all: actual, observable behavior. Psychology is fundamentally about predicting what people do and experience, yet most of our research starts and stops with how people think and feel.

That’s not all bad, of course. There are many insights to be gained from how people talk about how they think and feel. But the Internet has opened up a grand new opportunity to actually observe what people do – to observe their Internet behaviors. We can see how they communicate with others on social media, tracking their progress in relationships and social interactions in real time, as they are built.

Despite all of this potential, surprisingly few psychologists actually do this sort of research. Part of the problem is that psychologists are simply not trained on what the Internet is or how it works. As a result, research studies involving Internet behaviors typically involve qualitative coding, a process by which a team of undergraduate or graduate students reads individual posts on discussion boards, writing down a number or set of numbers for each thing they read. The process is tedious, which slows down research for some and makes the entire area too time-consuming to even consider for others.

Fortunately, the field of data science has produced a solution to this problem called internet scraping.  Internet scraping (also “web scraping”) involves creating a computer algorithm that automatically travels across the Internet or a select piece of it, collecting data of the type you’re looking for and depositing it into a dataset. This can be just about anything you want but typically involves posts or activity logs or other metadata from social media.
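To make that description a little more concrete, here is a minimal sketch of the "collect and deposit" part of a scraper. It is not the code from our paper or tutorial, just an illustration using only Python's standard library. Everything specific in it is an assumption made up for the example: the sample page, the "post" class name, the "data-author" attribute, and the names PostParser, scrape_posts, and posts_to_csv. A real scraper would download each page from the site rather than reading a string, and the markup you look for would depend entirely on how that site is built.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical forum page, invented for this example.
# A real scraper would download this HTML from a website instead.
SAMPLE_PAGE = """
<html><body>
<div class="post" data-author="user_a"><p>First example post.</p></div>
<div class="post" data-author="user_b"><p>Second example post.</p></div>
<div class="sidebar"><p>Not a post.</p></div>
</body></html>
"""


class PostParser(HTMLParser):
    """Collects the author and text of every <div class="post"> element."""

    def __init__(self):
        super().__init__()
        self.posts = []       # one dict per post found
        self._depth = 0       # greater than 0 while inside a post <div>
        self._current = None  # the post currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag != "div":
            return
        if self._depth > 0:
            # A nested <div> inside a post: track it so we know when the post ends.
            self._depth += 1
        elif attrs.get("class") == "post":
            self._depth = 1
            self._current = {"author": attrs.get("data-author", ""), "text": ""}

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace, then store the finished post.
                self._current["text"] = " ".join(self._current["text"].split())
                self.posts.append(self._current)
                self._current = None

    def handle_data(self, data):
        # Only keep text that appears inside a post.
        if self._depth > 0:
            self._current["text"] += data


def scrape_posts(html):
    """Return a list of {"author": ..., "text": ...} dicts found in the HTML."""
    parser = PostParser()
    parser.feed(html)
    return parser.posts


def posts_to_csv(posts):
    """Deposit the collected posts into a dataset, here CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["author", "text"])
    writer.writeheader()
    writer.writerows(posts)
    return buffer.getvalue()


posts = scrape_posts(SAMPLE_PAGE)
print(posts_to_csv(posts))
```

The sidebar text is skipped because it sits outside any "post" element. That is the basic idea: you tell the algorithm what the data you want looks like, and it ignores everything else.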

In a paper that my research team and I recently published in Psychological Methods[1], our goal is to teach psychologists how to use these tools to create and analyze their own datasets. We do this in a programming language called Python.

Now if you’re a psychologist and this sounds a little intimidating, I totally understand. Programming is not something most research psychologists learned in graduate school, although it is becoming increasingly common.

R, a statistical programming language, is increasingly becoming a standard part of graduate statistical training, so if you’ve learned any programming, that’s probably the kind you learned.  If you’re one of these folks, you’re lucky – if you learned R successfully, you can definitely learn Python.  If you haven’t learned R, then don’t worry – the level of programming expertise you need to successfully conduct an Internet scraping project is not as bad as you’d think – similar to the level of expertise you’d need to develop to successfully use structural equation modeling, or hierarchical linear modeling, or any other advanced statistical technique. The difference is simply that now you need to learn technical skills to employ a methodological technique. But I promise it’s worth it.

In our paper, we demonstrated this by investigating a relatively difficult question to answer: are there gender differences in the way that people use social coping to self-treat depression? This is a difficult question to assess with surveys because you always see reality through the eyes of people who are depressed. But on the Internet, we have access to huge databases of people with depression who are trying to self-treat, in the form of online discussion forums intended for people with depression. So to investigate our demonstration research question, we collected over 100,000 examples of people engaging in social coping, as well as self-reported gender from their profiles. We wanted to know if women use social coping more often than men, if women are more likely to try to support women, if men are more likely to try to support men, and a few other questions.

The total time to collect those 100,000 cases?  About 20 hours, during a solid 8 hours of which I was sleeping. One of the advantages to internet scraping algorithms collecting your data is that they don’t require much attention.

Imagine the research questions we could address if more people adopted this technique! I can hear the research barriers shattering already.

So I have you convinced, right? You want to learn internet scraping in Python but don’t know where to start? Fortunately, this is where my paper comes in. In the Psychological Methods article, which you can download here with a free ResearchGate account, we have laid out everything you need to know to start web scraping, including the technical know-how, the practical concerns, and even the ethics of it. All the software you need is free to use, and I have even created an online tutorial that will take you step by step through learning how to program a web scraper in Python! Give it a try and share what you’ve scraped!

Footnotes:
  1. Landers RN, Brusso RC, Cavanaugh KJ, & Collmus AB (2016). A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data From the Internet for Use in Psychological Research. Psychological Methods. PMID: 27213980

I’m Presenting at SIOP’s Leading Edge Consortium on Big Data Talent Analytics!

2016 May 18
by Richard N. Landers

This year’s Leading Edge Consortium is on Talent Analytics, with a focus on Big Data and data science. If you haven’t heard of the LEC, it’s SIOP’s specialty conference intended to train practicing I/O psychologists on bleeding edge practice and science for solving real, current problems. Each year has a different theme, depending upon precisely what SIOP believes to be at the bleeding edge that year. 2014’s conference was on Succession Strategies, whereas 2015’s was on High Performance Organizations.

This topic is near and dear to both my researcher and practitioner heart. It is officially titled: Talent Analytics: Data Science to Drive People Decisions and Business Impact. It is being held October 21 and 22, 2016, in the home of Turner Field, Coca-Cola, and a beautiful airport: Atlanta, Georgia.

If you can’t tell from the meeting title, the idea is to share the newest of the new in talent analytics, the stuff that our journals haven’t even been able to publish yet, the stuff that’s being used in the highest performing I/O teams in the country. There’s quite the all-star cast for me to stand beside, which is great, since a benefit of presenting is that I get to attend the conference!

Presenters (so far):

  • Leslie DeChurch, Georgia Tech
  • Eric Dunleavy, DCI Consulting
  • Alexis Fink, Intel
  • Ed Freeman, University of Virginia
  • Rick Guzzo, Mercer
  • Hailey Herleman, IBM
  • Allen Kamin, GE
  • Eden King, George Mason University
  • Richard Landers, Old Dominion University
  • Nathan Mondragon, HireVue
  • Adam Myer, Johnson & Johnson
  • Fred Oswald, Rice University
  • Dan Putka, HumRRO
  • Sara Roberts, Category One Consulting
  • Evan Sinar, DDI
  • Rich Tonowski, EEOC
  • Paul Tsagaroulis, U.S. General Services Administration
  • Alan Wild, IBM

My particular topic will be the Science of Data Visualization! There’s still time to register, but you’d better get to it quick! Hope to see you there!

I’m Writing a TIP Column! Crash Course in I-O Technology

2016 May 5
by Richard N. Landers

The explosive growth of new technology is fundamentally changing I/O psychology, and we are not in general well-prepared to respond to it! Technology is not why most I/O psychologists went into I/O psychology!

To help fix this, I’m the writer of a brand new column in The Industrial-Organizational Psychologist called Crash Course in I-O Technology. In each issue, I’ll be “demystifying” a technology that people are encountering in the field and academia, and collecting reactions from I/Os in the field, to get the real story on who really needs this tech and why.

Will it fundamentally change everything we know about I-O, is it just old wine in new bottles, or perhaps somewhere in the middle? Let’s find out!

To that end, I’m asking I-O practitioners, academicians, and students (you!) to provide in the survey linked below some examples of technologies that you have encountered in the field, heard about in practice, or have to deal with on a daily basis – yet if someone asked you to explain precisely what that technology is or how it works, you wouldn’t really have a great answer. Perhaps you’re the expert on this topic, but none of the other I-Os you work with are. Perhaps you needed to hire a random person into your organization’s IT function to run some kind of weird software that no one else understands! Whatever it is, I want to know the I-O technology you’re dealing with!

The most popular answers in this survey are very likely to be the topics I write about in Crash Course, so think carefully!

If you’re willing to help me out with this (thanks!) please complete this quick 2-question survey: https://odu.co1.qualtrics.com/SE/?SID=SV_bEIbKxHlC2Hjj9z

I’m looking to have answers collected before Friday May 13 (first writing deadline’s coming up!), so please complete it before then!