As part of a recently completed project assessing interdisciplinarity rankings of I-O psychology Ph.D. programs, to be published in the next issue of The Industrial-Organizational Psychologist, we created a database of every paper ever published by anyone currently employed as faculty in an I-O psychology Ph.D. program, as recorded in Elsevier's Scopus database (special thanks to Bo Armstrong, Adrian Helms, and Alexis Epps for their work on that project!). Scopus is the most comprehensive database of research output across all disciplines, which is why we turned to it for our investigation of interdisciplinarity. But the dataset also provides a lot of interesting opportunities to ask questions about the state of I-O psychology research using publication population data, enabling some field-level self-reflection. No t-tests required when you have population data. I've already used the dataset once to very quickly identify which I-Os have published in Science or Nature. The line between self-reflection and navel-gazing can be a thin one, and I'm trying to be careful here, but if you have any ideas for further questions that could be answered with these bibliographic data, let me know!
For this first real delve into the dataset, I wanted to explore the relative popularity of I-O psychology's "top journals." To identify the 8 top journals (and I chose 8 simply because the figure starts to get crowded beyond that!), I counted how many publications with an I-O psychology faculty member as an author appeared in each journal and kept the 8 with the highest counts (more on this methodology, along with a list of the top hundred outlets, will appear in the TIP article in the next issue). Importantly, that means my database excludes two groups of people: I-O practitioners and business school academics. It also doesn't contain the total number of publications appearing in these journals each year, by anyone. Those are two important caveats for reasons we'll get to later.
First, I looked at raw publication counts using loess smoothing.
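(For the curious, a figure like that takes only a few lines of R. Here is a minimal sketch, assuming a data frame called pubs with one row per publication and columns journal and year; those names are illustrative stand-ins, not the actual dataset.)

```r
# Minimal sketch (hypothetical object and column names): count I-O-authored
# publications per journal per year, then plot loess-smoothed trends.
library(dplyr)
library(ggplot2)

pub_counts <- pubs %>%                      # pubs: one row per publication
  count(journal, year, name = "n_pubs")     # publications per journal per year

ggplot(pub_counts, aes(x = year, y = n_pubs, color = journal)) +
  geom_smooth(method = "loess", se = FALSE) +  # loess smoothing, no CI ribbon
  labs(x = "Year", y = "Publications by I-O faculty", color = "Journal")
```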
As you can see, the Journal of Applied Psychology has been the dominant outlet for I-O psychology academics for a long time, but that seems to have changed recently. The Journal of Business and Psychology actually published more articles by academic I-O psychologists in 2016 than JAP did. In fact, JAP appears to have published fewer and fewer articles by I-Os since around 2001. There are a few potential explanations for this. One, JAP might be publishing fewer articles in general; two, JAP may be publishing more research by business school researchers instead of I-O academics; three, JAP may be publishing more research by practitioners. I suspect the cause is not option 3.
Also notable is the volatility of Industrial and Organizational Psychology Perspectives, but this is easily attributable to the highly variable number of commentaries published each year, given IOP's focal-article-and-commentary format.
In addition to overall publishing popularity, relative popularity is also of interest. Relative popularity rescales each journal's yearly count of publications by academic I-Os against that journal's own peak year, so every journal tops out at 100%. This mostly makes patterns of change a bit easier to see. That figure appears next.
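(Computationally, this rescaling is a one-liner. Here is a hedged sketch that builds on the hypothetical pub_counts data frame from the earlier snippet.)

```r
# Sketch: divide each journal's yearly count by that journal's maximum yearly
# count, so every journal peaks at 1 (i.e., 100%) in its own best year.
library(dplyr)

relative_counts <- pub_counts %>%
  group_by(journal) %>%
  mutate(rel_popularity = n_pubs / max(n_pubs)) %>%
  ungroup()
```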
Of interest here are trajectories. Journals clearly fall into one of three groups:
- Consistency. IOP is the most consistent, as you'd expect; it's truly the only journal "for I-Os." Any articles in it not authored by I-O academics are likely authored by I-O practitioners.
- Upwards momentum. JBP, Journal of Occupational Health Psychology, and Journal of Organizational Behavior all show clear upward trajectories; they are either becoming more popular among I-Os or perhaps simply publishing more work in general with consistent “I-O representativeness” over time. All three published more work by I-O academics in 2015/2016 than they ever have before, as seen in the swing up to 100% at the far right of the figure.
- Decline. JAP, Human Performance, Journal of Applied Social Psychology, and Journal of Vocational Behavior are all decreasing in popularity among academic I-Os. Most journals either maintain their size or grow over time, which suggests that it is popularity with I-O academics in particular that is decreasing. Some of these are more easily explained than others. JVB, for example, has been experiencing a long shift back toward counseling psychology over the last decade, a field which is itself moving even further away from I-O than it already was. Fewer than half of the editorial board members are currently I-Os, and the editor is not an I-O. HP has only been publishing 2 to 5 articles per issue over the last several years, which could reflect either tighter editorial practices or declining popularity in general rather than with I-Os specifically. JASP is not really an I-O journal, and I suspect its reputation has suffered as the reputation of social psychology in general has suffered during the replication crisis, but that's a guess.
To I-Os, JAP is the most interesting of this set and the most difficult to explain; it still appears to publish the same general number of articles per issue that it has for decades, but from a casual glance, it looks like more business school faculty have been publishing there, which of course also brings "business school values" in terms of publishing: theory, theory, and more theory. This decline among I-Os might also be reflected in JAP's dropping citation/impact rank; perhaps JAP is just not publishing as much research these days that people find interesting, and as a result, I-O faculty are less likely to submit and publish there too. Given that JAP is the face of I-O psychology to the APA and much of the academic world in general, this is worth watching, at the very least.
Completing this analysis helped me realize that something I really want to know is what proportion of publications in each of these journals is not by I-Os. I suspect that the set of journals the average I-O faculty member considers a "primary outlet for the field" is changing, and that number would help explore this idea.
In the data we already have, there are some other interesting general trends to note; for example, the trajectories of all journals are roughly the same pre-2000. JAP started publishing I-O work a lot earlier than JOB, but their growth curves are very similar once you account for the horizontal offset between their lines (i.e., their different starting points). The most notable changes across all journals occur around 2000, when almost all of the curves are disrupted, with several journals arcing up or down, and then again in 2006. The first of these can probably best be explained as "because the internet," but the cause of the 2006 shifts is unclear to me.
As a side note, this analysis took me a bit under an hour in R to get from "dataset of 11,180 publications" to "exporting figures," and a lot of that hour was spent making those figures look nice. If you don't think you could do the same thing in R in under an hour, consider completing my data science for social scientists course, which is free and wraps around interactive online coding instruction in R provided by datacamp.com, starting at "never used R before" and ending with machine learning, natural language processing, and web apps.
Is your New Year’s resolution to learn R or to teach data science to psychology graduate students? If so, I have some great news for you. I have now freely released my course materials and lecture videos on R designed to teach data science to social scientists (in my case, psychologists). You can find them on datascience.tntlab.org.
The course combines four primary teaching techniques:
- Online assignments using datacamp.com, which teaches you brand-new programming and data science skills and provides real-time feedback as you practice them.
- A lecture video that contextualizes what you learned on datacamp.com to social science (mostly psychology, but it is broadly applicable).
- A project applying learned skills to a social scientific context that you can complete on your own.
- A debriefing video that explains how the skills you learned in DataCamp and in lecture are used to meet project requirements.
If you are teaching or taking a class, there is zero cost to you or your students: you can request premium DataCamp course access for yourself and your students for free. If you are working through the course on your own, self-paced, you will need to pay for access to DataCamp (the cost of which is pretty reasonable).
I have previously used this course to teach psychology graduate students, usually in groups of 10 to 20. Almost all start with zero programming experience and end up as decent intermediate-level programmers. You can too!
So far, I have provided a sample syllabus, final exam, final project, weekly projects, and a complete course schedule. This initial release also includes 4 weeks of video content (both lectures and project debriefing), and I’ll be releasing the next 4 in a month, and the last 4 a month after that.
Please let me know how you like it!
I am also considering offering a live version of the course sometime in the future. Please complete this survey if you think you might be interested in taking a live version of this data science course, such as via a MOOC or even in person.
When trying to learn about machine learning, one of the biggest initial hurdles for social scientists, and even for traditional statisticians, is terminology. The gap between the way social scientists talk about statistical concepts and the way machine learning experts talk about the same concepts is so vast that you may be using machine learning without even realizing it! So here's a glossary to convert words you see in machine learning into words you already know.
Importantly, I've written these in an order where they build on each other. So you might start reading from the top and just stop whenever it goes over your head… because it's only going to get worse from there!
- algorithm: Any carefully defined step-by-step process that converts inputs into outputs. When you run an ANOVA in SPSS, you convert raw data (input) into an ANOVA table (output); thus, you have executed an ANOVA algorithm. When you calculate a mean with pencil and paper by adding numbers together (step 1) and then dividing by the number of items (step 2), you have also executed an algorithm.
- learning: Fundamentally, the term "learning" in this context just refers to any procedure that allows you to make predictions about new data given current data. For example, 1-predictor ordinary least squares (OLS) regression allows you to make predictions of y given x; thus, OLS regression is a learning algorithm. Factor analysis allows you to make predictions about latent categories among your variables; thus, factor analysis is also a learning algorithm.
- statistical learning algorithm: Any statistical procedure that you already know how to do involves a statistical learning algorithm; it enables you to make predictions about a population from the data you have by applying a step-by-step mathematical manipulation to a dataset. Regression, factor analysis, cluster analysis, and ANOVA all involve statistical learning.
- supervised learning: A learning algorithm with a known DV. Examples of supervised statistical learning algorithms are OLS regression and ANOVA using the formulas you already know.
- unsupervised learning: A learning algorithm without a known DV. Examples of unsupervised statistical learning algorithms include exploratory factor analysis, principal components analysis, and cluster analysis. (A brief R sketch contrasting supervised and unsupervised learning appears after this glossary.)
- cost function: When using a learning algorithm, the cost function is the number you're trying to minimize. In OLS regression, this is the mean square error (i.e., simply speaking, the average squared difference between each observed value of y and its associated predicted value). When you find the "line of best fit," you know you have found it because it is the combination of predictor weights for which the cost function is at its lowest possible value across all possible combinations of predictor weights. For example, if in y = bx + a the line of best fit occurs at b = 1.1, then your cost function will produce a higher value at b = 1.0 or b = 1.2. In a sense, the minimized cost in this situation sits at the bottom of a parabola depicting cost across all possible values of b.
- regularization: In OLS regression, as the number of predictors approaches (or exceeds) the sample size, R^2 will approach and eventually equal 1. This happens because you run out of degrees of freedom; you get a "perfectly" predictive model because you have enough predictors to capture every tiny variation in y. The problem is that in a new sample, you will come nowhere close to that R^2 of 1. To deal with this, regularization adds a term to the cost function that penalizes model complexity. Literally: the cost function becomes the sum of the mean square error and something else. That additional something is called the regularization term, and there are several different types. There is also a weight attached to that term, called the regularization parameter: if you want a big penalty, that parameter is large, and if you want a small penalty, it approaches 0. In any regularized regression formula, if the regularization parameter is set to zero, you're just doing OLS regression (i.e., there is no penalty for model complexity).
- ridge regression: Ridge regression is a type of regularized regression. Instead of solving for the values of b that minimize mean square error alone, it solves for the values of b that minimize the sum of the mean square error and the sum of the squared predictor weights. For example, a ridge regression line of y = 2x + 1 has been optimized to minimize the mean square error plus 4 (i.e., 2 squared; the intercept is typically not penalized). Thus, for ridge, the sum of the squared predictor weights is the regularization term, and the mean square error plus the regularization term is its cost function.
- lasso regression: Lasso (really LASSO: least absolute shrinkage and selection operator) regression is another type of regularized regression. Instead of solving for the values of b that minimize mean square error alone, it solves for the values of b that minimize the sum of the mean square error and the sum of the absolute values of the predictor weights. For example, a lasso regression line of y = 2(x1) - 3(x2) + 1 has been optimized to minimize the mean square error plus 5 (i.e., |2| + |-3|; again, the intercept is not penalized). Thus, for lasso, the sum of the absolute values of the predictor weights is the regularization term, and the mean square error plus the regularization term is its cost function.
- L1 regularization: The regularization term used in lasso regression, i.e., the sum of the absolute values of the predictor weights.
- L2 regularization: The regularization term used in ridge regression, i.e., the sum of the squared predictor weights.
- elastic net: Elastic net is a type of regularized regression that combines L1 and L2 regularization and lets you choose, via a mixing weight, how much of each penalty to impose. For example, if you set that mixing weight to 0.5, you penalize by half the value of the L1 regularization term and half the value of the L2 regularization term; if you set it to 0.1, you penalize by 90% of the value of the L1 term and 10% of the value of the L2 term.
- hyperparameter: These are external settings given to algorithms. In the case of elastic net, the choice of balance between L1 and L2 is an example of a hyperparameter. These are sometimes called tuning parameters.
- model tuning: This refers to testing multiple hyperparameter values on a single dataset to determine which most effectively minimizes cost. It is typically done iteratively, e.g., by testing a grid of candidate hyperparameter values and choosing the set of values that minimizes cost.
- machine learning algorithm: In all of the regressions so far, we have statistical formulas that can be used to produce the parameter (b) estimates directly. Each of these cost functions can be rewritten mathematically to make b solvable, in the same way that solving for b in y = bx + a only requires multiplying the correlation between x and y by the ratio of the standard deviation of y to the standard deviation of x. But there are cost functions for which such a formula is not known. In those cases, the easiest way to solve for b is a series of educated guesses, each of which gives the algorithm a bit more information about what the "right" answer is likely to be. Any algorithm that teaches itself in this way is a machine learning algorithm. This means that all of the statistical procedures you currently know could be solved with either statistical learning or machine learning.
- stochastic gradient descent: The most common way for an algorithm to teach itself is by making educated guesses in sequence. When you do this with the goal of minimizing a cost function, you are engaging in gradient descent. For example, imagine you're in the same situation a machine learning algorithm is in for the y = bx + a case: you know the mean square error formula, i.e., the cost function, but you don't know the formula for b or a. If you were to make guesses about the value of b knowing only the formula for mean square error, you'd probably start with 1. So great: given my dataset, when b = 1.0, what is mean square error? Let's say it's 4. Next I try b = 1.2, and this time cost is 4.5. My error went up, so that must be the wrong direction; next time, I'll try b = 0.8. That guess-and-check process is the heart of gradient descent; in practice, the algorithm uses the local slope (the gradient) of the cost function to decide which way is "down," and the "stochastic" part (a fancy statistician term for "random") means each step estimates that slope from a randomly chosen observation or subset of the data rather than the whole dataset, which is the most common (and easiest) variant of gradient descent. If you can imagine all possible mean square errors across all possible values of b, they would form a parabola, and you hope each guess gets you closer to the bottom of the curve; i.e., one predictor means you need to figure out which way is "down" in a 2-dimensional space. With more predictors, you need more dimensions, so for 150 predictors, you are trying to travel down a surface in a 151-dimensional space. (A minimal R sketch of stochastic gradient descent for the one-predictor case appears after this glossary.)
- hyperplane: This is the mathematical object used to figure out which way is "down" in stochastic gradient descent. In a sense, it is the tangent to the curve you're trying to travel down, except in many more dimensions than we usually think about curves in. Hyperplanes can also be used to define groups in n-dimensional space, in the same way that a line can be drawn through a scatter plot to divide two groups of points based on their values on two variables simultaneously. In three dimensions, you can't use a line, so you use a plane. In four dimensions, you can't use a plane either, so you use a hyperplane. From there on, they're all called hyperplanes, no matter how many additional dimensions you're looking at.
- learning rate: This is how far down the curve you jump each time you make a guess.
- cross-validation: Fundamentally, this is the same concept as in the social sciences: an approach to determining how well the estimates in a model generalize to other samples. However, in machine learning, cross-validation is approached a bit differently, which is made possible by the large samples common in that world, especially those associated with "big data."
- big data: Such a general and overused term that it now has near-zero value. It used to mean "datasets that cannot be analyzed with traditional methods for any of a variety of reasons," but datasets that met that definition even 3 years ago often no longer do. Unless you're at the point where the data you're collecting literally cannot be opened in SPSS (for whatever reason), you probably don't have big data.
- 10-fold cross-validation: This is a very common type of cross-validation in machine learning in which a dataset is divided into 10 parts (i.e., folds) and a machine learning algorithm is then used to build your specified predictive model 10 times, each time using the data in 9 folds to predict the 10th, so that every fold is held out once. This procedure creates a distribution of R^2s and mean square error terms. The reason you might want that is that, by creating a shared output metric across all of the algorithms you try, you can compare the distributions of these statistics across algorithms in a meaningful way. For example, if you tried to predict y from 200 x variables and found equal R^2s for two different machine learning algorithms (each already tuned to its ideal hyperparameters given the dataset), yet the distribution of mean square errors across folds was wider for algorithm #1 than for #2, you'd probably trust algorithm #2 to replicate out of sample more than you'd trust algorithm #1. (The regularized regression sketch at the end of this glossary shows 10-fold cross-validation in action.)
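To make the supervised/unsupervised distinction concrete, here is a quick R sketch using built-in example datasets. It is an illustrative toy, not anyone's canonical analysis:

```r
# Supervised learning: the DV (mpg) is known, and we learn to predict it.
supervised <- lm(mpg ~ wt + hp, data = mtcars)
predict(supervised, newdata = mtcars[1:3, ])       # predictions for "new" cases

# Unsupervised learning: no DV; we learn structure (clusters) among the cases.
unsupervised <- kmeans(scale(iris[, 1:4]), centers = 3)
table(unsupervised$cluster)                        # how many flowers per cluster
```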
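Next, a minimal sketch of stochastic gradient descent for the one-predictor case described above (y = bx + a, minimizing mean square error). The simulated data, learning rate, and number of steps are arbitrary choices made purely for illustration:

```r
# Simulate data where the "true" line is y = 1.1x + 0.5.
set.seed(1)
x <- rnorm(500)
y <- 1.1 * x + 0.5 + rnorm(500)

b <- 0; a <- 0             # initial guesses for the slope and intercept
learning_rate <- 0.01      # how far to move with each update

for (step in 1:5000) {
  i <- sample(length(x), 1)              # "stochastic": one randomly chosen case
  error <- (b * x[i] + a) - y[i]         # prediction error for that case
  b <- b - learning_rate * error * x[i]  # step downhill along the gradient
  a <- a - learning_rate * error
}

c(b = b, a = a)            # should land near the true values of 1.1 and 0.5
```

Compare the result to coef(lm(y ~ x)), or to cor(x, y) * sd(y) / sd(x) for the slope, which solve for the same values directly with the closed-form formulas.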
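Finally, ridge, lasso, and elastic net with 10-fold cross-validated tuning of the regularization parameter, sketched with the glmnet package on simulated data. One caution: glmnet's alpha mixing parameter is defined as the share of the L1 (lasso) penalty, so alpha = 1 is pure lasso, alpha = 0 is pure ridge, and values in between are elastic net:

```r
library(glmnet)

# Simulate 200 cases and 50 predictors, only 5 of which actually matter.
set.seed(1)
n <- 200; p <- 50
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- as.vector(x[, 1:5] %*% rep(1, 5) + rnorm(n))

# cv.glmnet tunes the regularization parameter (lambda) via 10-fold CV.
cv_ridge   <- cv.glmnet(x, y, alpha = 0,   nfolds = 10)  # L2 penalty only
cv_lasso   <- cv.glmnet(x, y, alpha = 1,   nfolds = 10)  # L1 penalty only
cv_elastic <- cv.glmnet(x, y, alpha = 0.5, nfolds = 10)  # a 50/50 mix

cv_lasso$lambda.min                # the lambda value with the lowest CV error
coef(cv_lasso, s = "lambda.min")   # lasso shrinks most useless weights to zero
```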
There are so many new terms in machine learning, but I hope that gives you some understanding of the most basic ones. As you can see, in many cases the concepts are just a relatively small twist on concepts you already know. At the very least, you need to be able to hold an intelligent conversation with the data scientists on your team, and hopefully this will help with that. If you have any requests or your own definitions for terms, please share them!