Internet Scraping for Research: A Python Tutorial for Psychologists
One of the biggest challenges for psychologists is gaining access to research participants. We go to great lengths, from elaborate and creative sampling strategies to spending hard-earned grant money, to get random(ish) people to complete our surveys and experiments. Despite all of this effort, we often lack perhaps the most important variable of all: actual, observable behavior. Psychology is fundamentally about predicting what people do and experience, yet most of our research starts and stops with how people think and feel.
That’s not all bad, of course. There are many insights to be gained from how people talk about how they think and feel. But the Internet has opened up a grand new opportunity to actually observe what people do – to observe their Internet behaviors. We can see how they communicate with others on social media, tracking their relationships and social interactions in real time, as they are built.
Despite all of this potential, surprisingly few psychologists actually do this sort of research. Part of the problem is that psychologists are simply not trained on what the Internet is or how it works. As a result, research studies involving Internet behaviors typically rely on qualitative coding, a process in which a team of undergraduate or graduate students reads individual posts on discussion boards and writes down a number or set of numbers for each one. The process is tedious and slow, which delays research for some and makes the entire area too time-consuming to even consider for others.
Fortunately, the field of data science has produced a solution to this problem called internet scraping. Internet scraping (also “web scraping”) involves creating a computer algorithm that automatically travels across the Internet or a select piece of it, collecting data of the type you’re looking for and depositing it into a dataset. This can be just about anything you want but typically involves posts or activity logs or other metadata from social media.
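To make that a little more concrete, here is a minimal sketch of the "collect and deposit into a dataset" step using only Python's standard library. It is not the code from our paper. The page contents, the `post` class, and the `data-author` attribute are all hypothetical stand-ins for whatever markup a real site uses, and the HTML is a hard-coded string so the example runs without touching the network.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical forum page (an assumption for illustration only). In a real
# project this HTML would be downloaded from the site being studied.
SAMPLE_PAGE = """
<html><body>
  <div class="post" data-author="user_a"><p>First example post.</p></div>
  <div class="post" data-author="user_b"><p>Second example post.</p></div>
  <div class="sidebar"><p>Navigation text to ignore.</p></div>
</body></html>
"""


class PostParser(HTMLParser):
    """Collects the author and text of every <div class="post"> element."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one dict per post found
        self._author = None   # author of the post currently being read
        self._depth = 0       # nesting depth of <div> tags inside that post
        self._text = []       # text fragments of the current post

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._author is None:
            # Not inside a post yet: look for the start of one.
            if tag == "div" and attrs.get("class") == "post":
                self._author = attrs.get("data-author", "")
                self._depth = 1
                self._text = []
        elif tag == "div":
            # Nested <div> inside a post: track it so we find the right end.
            self._depth += 1

    def handle_endtag(self, tag):
        if self._author is not None and tag == "div":
            self._depth -= 1
            if self._depth == 0:
                # The post's own <div> just closed: store the finished row.
                self.rows.append(
                    {"author": self._author, "text": " ".join(self._text)}
                )
                self._author = None

    def handle_data(self, data):
        # Keep text only while inside a post, skipping pure whitespace.
        if self._author is not None and data.strip():
            self._text.append(data.strip())


def scrape(html):
    """Return a list of {'author', 'text'} rows extracted from the HTML."""
    parser = PostParser()
    parser.feed(html)
    return parser.rows


def to_csv(rows):
    """Serialize the rows as CSV text, ready to be saved as a dataset."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["author", "text"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = scrape(SAMPLE_PAGE)
print(to_csv(rows))
```

A real scraper adds two things this sketch leaves out: downloading each page and moving from one page to the next. The core idea is the same, though: find the pieces of the page you care about and write each one out as a row.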
In a paper recently published in Psychological Methods by me and my research team, our goal was to teach psychologists how to use these tools to create and analyze their own datasets. We do this in a programming language called Python.
Now if you’re a psychologist and this sounds a little intimidating, I totally understand. Programming is not something most research psychologists learned in graduate school, although it is becoming increasingly common.
R, a statistical programming language, is increasingly becoming a standard part of graduate statistical training, so if you’ve learned any programming, that’s probably the kind you learned. If you’re one of these folks, you’re lucky – if you learned R successfully, you can definitely learn Python. If you haven’t learned R, then don’t worry – the level of programming expertise you need to successfully conduct an Internet scraping project is not as bad as you’d think – similar to the level of expertise you’d need to develop to successfully use structural equation modeling, or hierarchical linear modeling, or any other advanced statistical technique. The difference is simply that now you need to learn technical skills to employ a methodological technique. But I promise it’s worth it.
In our paper, we demonstrated this by investigating a relatively difficult question to answer: are there gender differences in the way that people use social coping to self-treat depression? This is a difficult question to assess with surveys because you always see reality through the eyes of people who are depressed. But on the Internet, online discussion forums intended for people with depression give us access to huge databases of people who have depression and are trying to self-treat. So to investigate our demonstration research question, we collected over 100,000 examples of people engaging in social coping, as well as self-reported gender from their profiles. We wanted to know if women use social coping more often than men, if women are more likely to try to support women, if men are more likely to try to support men, and a few other questions.
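Once scraped data is in a dataset, questions like "are women more likely to support women?" come down to counting. Here is a minimal sketch of that kind of cross-tabulation. The records are made up purely for illustration and are not data or results from our paper, the field names `replier_gender` and `poster_gender` are hypothetical, and this is not the analysis we actually ran.

```python
from collections import Counter

# Made-up illustrative records (NOT data from the paper). Each record is one
# supportive reply, with the self-reported gender of the person replying and
# of the person who wrote the original post.
replies = [
    {"replier_gender": "F", "poster_gender": "F"},
    {"replier_gender": "F", "poster_gender": "M"},
    {"replier_gender": "M", "poster_gender": "M"},
    {"replier_gender": "M", "poster_gender": "F"},
    {"replier_gender": "F", "poster_gender": "F"},
]


def cross_tabulate(records):
    """Count replies for each (replier gender, poster gender) pair."""
    return Counter(
        (r["replier_gender"], r["poster_gender"]) for r in records
    )


def same_gender_share(records, gender):
    """Proportion of replies from `gender` that went to the same gender.

    Returns None when there are no replies from that gender.
    """
    from_gender = [r for r in records if r["replier_gender"] == gender]
    if not from_gender:
        return None
    same = sum(1 for r in from_gender if r["poster_gender"] == gender)
    return same / len(from_gender)


table = cross_tabulate(replies)
for (replier, poster), count in sorted(table.items()):
    print(f"{replier} replying to {poster}: {count}")

print("Share of women's replies to women:", same_gender_share(replies, "F"))
print("Share of men's replies to men:", same_gender_share(replies, "M"))
```

With 100,000 real cases you would follow these counts with a proper statistical test, but the dataset itself has exactly this shape: one row per observed behavior.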
The total time to collect those 100,000 cases? About 20 hours, a solid 8 of which I spent sleeping. One of the advantages of having internet scraping algorithms collect your data is that they don’t require much attention.
Imagine the research questions we could address if more people adopted this technique! I can hear the research barriers shattering already.
So I have you convinced, right? You want to learn internet scraping in Python but don’t know where to start? Fortunately, this is where my paper comes in. In the Psychological Methods article, which you can download here with a free ResearchGate account, we have laid out everything you need to know to start web scraping, including the technical know-how, the practical concerns, and even the ethics of it. All the software you need is free to use, and I have even created an online tutorial that will take you step by step through learning how to program a web scraper in Python! Give it a try and share what you’ve scraped!
Footnotes:
- Landers RN, Brusso RC, Cavanaugh KJ, & Collmus AB (2016). A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data From the Internet for Use in Psychological Research. Psychological Methods. PMID: 27213980