A Complete Course for Social Scientists on Data Science Using R

2018 January 3
by Richard N. Landers

Is your New Year’s resolution to learn R or to teach data science to psychology graduate students? If so, I have some great news for you. I have now freely released my course materials and lecture videos on R designed to teach data science to social scientists (in my case, psychologists).  You can find them on

The course combines four primary teaching techniques:

  1. Online assignments using DataCamp, which teaches you brand new programming and data science skills and provides real-time feedback as you practice them.
  2. A lecture video that puts what you learned on DataCamp into a social science context (mostly psychology, but it is broadly applicable).
  3. A project applying learned skills to a social scientific context that you can complete on your own.
  4. A debriefing video that explains how the skills you learned in DataCamp and in lecture are used to meet project requirements.

If you are teaching or taking a class, there is zero cost to you or students.  You can request premium DataCamp course access for you and your students for free.  If you are doing this on your own, self-paced, you will need to pay for access to DataCamp (which is pretty reasonable).

I have previously used this course to teach psychology graduate students, usually in groups of 10 to 20. Almost all start with zero programming experience and end up as decent intermediate-level programmers. You can too!

So far, I have provided a sample syllabus, final exam, final project, weekly projects, and a complete course schedule. This initial release also includes 4 weeks of video content (both lectures and project debriefing), and I’ll be releasing the next 4 in a month, and the last 4 a month after that.

Please let me know how you like it!

I am also considering offering a live version of the course some time in the future. Please complete this survey if you think you might be interested in completing a version of this data science course live, such as via a MOOC or even in person.

Translating the Words Used in Machine Learning into Human Language

2017 December 7
by Richard N. Landers

When trying to learn about machine learning, one of the biggest initial hurdles for social scientists, or even traditional statisticians, is the difference in terminology. The gap between the way social scientists talk about statistical concepts and the way machine learning experts talk about the same concepts is so vast that many social scientists do not even realize that they are running statistical analyses that involve machine learning! So here’s a glossary to convert words you see in machine learning into words you already know.

Importantly, I’ve written these in an order where they build on each other.  So you might start reading from the top and then just stop whenever it’s gone over your head… because it’s just going to get worse!

  • algorithm: Any carefully defined step by step process that converts inputs into outputs.  When you run an ANOVA in SPSS, you convert raw data (input) into an ANOVA table (output).  Thus, you have executed an ANOVA algorithm.  When you calculate a mean on pencil and paper by adding numbers together (step 1) and then dividing by the number of items (step 2), you have also executed an algorithm.
  • learning: Fundamentally, the term “learning” in this context just refers to any procedure that allows you to make predictions about new data given current data.  For example, 1-predictor univariate ordinary least squares (OLS) regression allows you to make predictions of y given x.  Thus, OLS regression is a learning algorithm.  Factor analysis allows you to make predictions of latent categories among your variables.  Thus, factor analysis is also a learning algorithm.
  • statistical learning algorithm: Any statistical procedure that you already know how to do involves a statistical learning algorithm; it enables you to make population predictions from given data by solving a step-by-step mathematical manipulation of a dataset.  Regression, factor analysis, cluster analysis, and ANOVA all involve statistical learning.
  • supervised learning: A learning algorithm with a known DV.  Examples of supervised statistical learning algorithms are OLS regression and ANOVA using the formulas you already know.
  • unsupervised learning: A learning algorithm without a known DV.  Examples of unsupervised statistical learning algorithms are exploratory factor analysis, principal components analysis, or cluster analysis.
  • cost function: When using a learning algorithm, the cost function is the number you’re trying to minimize.  In OLS regression, this is the mean square error (i.e., simply speaking, the average squared difference in y between each observed value and its associated predicted value).  When you find the “line of best fit,” you know you have found it because it is the combination of predictor weights for which the cost function is at its lowest possible value given all possible combinations of predictor weights.  For example, in y=bx+a where the line of best fit occurs when b = 1.1, then your cost function will produce a higher value when b = 1.0 or b = 1.2.  In a sense, minimized cost in this situation occurs at the bottom of a parabola depicting all possible values of b.
  • regularization: In OLS regression, as the number of predictors approaches (or exceeds) the sample size, R^2 will approach and eventually equal 1.  This is because you run out of degrees of freedom; you have a perfectly predictive model because you have enough predictors to model every tiny little variation in y.  The problem with that is that in a new sample, you will have nowhere close to that R^2 of 1.  To deal with that problem, regularization adds an additional term to the cost function that penalizes model complexity.  Literally.  It adds to mean square error.  It is the sum of mean square error and something else.  That additional term is called a regularization term, and there are several different types.  There is also a weight associated with that term called a regularization parameter.  So if you want a big penalty, that parameter is large, and if you want a small penalty, it approaches 0.  In any regularized regression formula, if the regularization parameter is set to zero, you’re just doing OLS regression (i.e., there is no penalty for model complexity).
  • ridge regression: Ridge regression is a type of regularized regression. Instead of solving for the values of b which minimize mean square error, it solves for the values of b which minimize the sum of the mean square error and the squared predictor weights.  For example, a ridge regression line of y = 2x + 1 has been optimized to the mean square error plus 4.  Thus, for ridge, the sum of squared predictor weights is the regularization term, and the mean square error plus the regularization term is its cost function.
  • lasso regression: Lasso (really LASSO: least absolute shrinkage and selection operator) regression is another type of regularized regression. Instead of solving for the values of b which minimize mean square error, it solves for the values of b which minimize the sum of the mean square error and the sum of the absolute values of the predictor weights.  For example, a lasso regression line of y = 2(x1) – 3(x2) + 1 has been optimized to the mean square error plus 5.  Thus, for lasso, the sum of the absolute values of the predictor weights is the regularization term, and the mean square error plus the regularization term is its cost function.
  • L1 regularization: This is the regularization term used in lasso regression: the sum of the absolute values of the predictor weights.
  • L2 regularization: This is the regularization term used in ridge regression: the sum of the squared predictor weights.
  • elastic net: Elastic net is a type of regularized regression that combines L1 and L2 regularization and allows you to choose how much of each regularization penalty you want to impose, using a mixing value between 0 and 1.  For example, if you set this value to 0.5, you end up penalizing by half the value of the L1 regularization term and half the value of the L2 regularization term.  If the value is the share given to the L2 term and you set it to 0.1, you penalize by 90% of the value of the L1 term and 10% of the value of the L2 term.  Check which convention your software uses: in R’s glmnet, for example, alpha is the share given to the L1 term, so alpha = 0.1 means 10% L1 and 90% L2.
  • hyperparameter: These are external settings given to algorithms.  In the case of elastic net, the choice of balance between L1 and L2 is an example of a hyperparameter.  These are sometimes called tuning parameters.
  • model tuning: This refers to the testing of multiple hyperparameters given one dataset to determine which most effectively minimizes cost. It is typically done iteratively, e.g., by testing a range of all possible hyperparameter values and choosing the final set of values that minimizes cost.
  • machine learning algorithm: In all of these regressions so far, we have statistical formulas that can be used to create parameter (b) estimates.  Each of these cost functions can be rewritten mathematically to make b solvable in the same way that solving for b in y=bx+a only requires multiplying the correlation between x and y by the ratio of the standard deviation of y to the standard deviation of x.   But there are cost functions for which that formula is not known.  In those cases, the easiest way to solve for b is a series of educated guesses, each of which gives the algorithm a bit more information about what the “right” answer is likely to be.  Any algorithm that teaches itself in this way is machine learning.  This means that all statistical procedures you currently know could be solved with either statistical learning or machine learning.
  • stochastic gradient descent: The most common way for an algorithm to teach itself is by making educated guesses in sequence.  When you do this with the goal of minimizing a cost function, you are engaging in gradient descent.  For example, let’s imagine you’re in the same situation that a machine learning algorithm is in for the y=bx+a case.  You know the mean square error formula, i.e., the cost function, but you don’t know the formula for b or a.  If you were to make guesses about the value of b knowing only the formula for mean square error, you’d probably start with 1.  So great: given my dataset, when b=1.0, what is mean square error?  Let’s say it’s 4.  I don’t know where to go from here, so I’ll choose a random direction.  Let’s try b=1.2 – what’s mean square error this time?  Let’s say this time cost is 4.5.  My error went up! So that must be the wrong direction.  So next time, I’ll try b=0.8.  This step-by-step process (i.e., algorithm) is stochastic gradient descent.  If you can imagine all possible mean square errors given all possible values for b, it would form a parabola, and you hope each guess gets you closer and closer to the bottom of the curve, i.e., 1 predictor means you need to figure out which way is “down” in a 2-dimensional space.  That means with more predictors, you need more dimensions; so for 150 predictors, you need to progress down the curve in a 151-dimensional space.  There are many other types of gradient descent, but stochastic (which is a fancy statistician term for “random”) is the most common (and easiest).  Strictly speaking, the randomness in stochastic gradient descent usually comes from using a randomly chosen case (or small batch of cases) from the dataset at each step, while the direction of each step is taken from the slope of the cost function rather than guessed.
  • hyperplane: This is the mathematical approach used to figure out which way is “down” in stochastic gradient descent.  In a sense, it is the tangent to the curve you’re trying to travel down, except in a lot more dimensions than we usually think about curves.  Hyperplanes can also be used to define groups in n-dimensional space in the same way that a line can be drawn through a scatter plot to divide two groups of points based upon their values on two variables simultaneously. In three dimensions, you can’t use a line, so you use a plane.  In four dimensions, you can’t use a plane either, so you use a hyperplane. And then they’re all called hyperplanes, no matter how many additional dimensions you’re looking at.
  • learning rate: This is how far down the curve you jump each time you make a guess.

    A visualization of learning rate during stochastic gradient descent, showing two possible solutions for a minimized cost function. You can also imagine a hyperplane being created at each point that directs the algorithm which way to descend.

  • cross-validation: Fundamentally, this is the same concept as in the social sciences, which is an approach to determine how generalizable the estimates in a model are to other samples.  However, in machine learning, the way you approach cross-validation is a bit different, which is possible due to the large samples commonly found here, especially those associated with “big data.”
  • big data: Such a general and overused term that it now has near-zero value.  It used to mean, “datasets that cannot be analyzed with traditional methods for any of a variety of reasons,” but the datasets that applied to even 3 years ago no longer fall within that definition.  Unless you’re at the point where the data you’re collecting literally cannot be viewed in SPSS (for whatever reason), you probably don’t have big data.
  • 10-fold cross-validation: This is a very common type of cross-validation in machine learning in which a dataset is divided into 10 parts (i.e., folds), then a machine learning algorithm is applied to create your specified predictive model 10 times: using the data in 9 folds to predict the 10th, repeated for all 10.  This procedure creates a distribution of R^2s and mean square error terms.  The reason you might want to do that is that by creating a shared output metric across all of the algorithms you try, you can compare the distributions of these statistics across different algorithms in a meaningful way.  For example, if you tried to predict y from 200 ‘x’ variables and found equal R^2s for two different machine learning algorithms (each already tested to determine it is using ideal hyperparameters given the dataset) yet the distribution of mean square errors across folds was wider for algorithm #1 than #2, you’d probably trust algorithm #2 to replicate out of sample more than you trust algorithm #1.
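The 10-fold procedure described in the last entry can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are simulated, the model is 1-predictor OLS, and the helper names (`fit_ols`, `mean_square_error`, `k_fold_errors`) are made up for this example rather than taken from any library.

```python
import random
import statistics


def fit_ols(xs, ys):
    """Closed-form 1-predictor OLS: return slope b and intercept a."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return b, a


def mean_square_error(xs, ys, b, a):
    """Average squared difference between observed and predicted y."""
    squared = [(y - (b * x + a)) ** 2 for x, y in zip(xs, ys)]
    return sum(squared) / len(squared)


def k_fold_errors(xs, ys, k=10, seed=0):
    """Fit on k-1 folds, score on the held-out fold, repeat for every fold."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal parts
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        b, a = fit_ols([xs[i] for i in train], [ys[i] for i in train])
        errors.append(
            mean_square_error(
                [xs[i] for i in held_out], [ys[i] for i in held_out], b, a
            )
        )
    return errors


# Simulated (hypothetical) data: y = 1.1x + 2 plus random noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1.1 * x + 2.0 + rng.gauss(0, 1) for x in xs]

fold_errors = k_fold_errors(xs, ys, k=10)
print("MSE per fold:", [round(e, 2) for e in fold_errors])
print("mean:", statistics.mean(fold_errors), "sd:", statistics.stdev(fold_errors))
```

The mean and standard deviation across the 10 folds are the kind of distribution you would compare between two algorithms: similar means but a wider spread for one of them suggests that its out-of-sample performance is less stable.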

There are so many new terms in machine learning, but I hope that gives you some understanding of the most basic. As you can see, in many cases, the concepts are just a relatively small twist on other concepts you already know. At the very least, you need to be able to hold an intelligent conversation with the data scientists on your team, and hopefully this will help with that. If you have any requests or your own definitions for terms, please share them!
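For readers who want to see a few of these entries in action, here is a minimal Python sketch of the cost function, the L1 and L2 regularization terms, and gradient descent with a learning rate. It is an illustration, not a definitive implementation: the function names and the toy dataset are hypothetical, `l1_share` is assumed to be the share of the penalty given to the L1 term, and the descent shown is plain (full-batch) gradient descent, which uses the slope of the cost function at each step instead of guessing a direction.

```python
def mean_square_error(b, a, xs, ys):
    """Cost function for y = bx + a: average squared prediction error."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def l1_term(weights):
    """Lasso regularization term: sum of absolute predictor weights."""
    return sum(abs(w) for w in weights)


def l2_term(weights):
    """Ridge regularization term: sum of squared predictor weights."""
    return sum(w ** 2 for w in weights)


def regularized_cost(error, weights, penalty_weight, l1_share):
    """Elastic-net style cost: error plus a weighted mix of L1 and L2.

    penalty_weight is the regularization parameter (0 gives plain OLS);
    l1_share = 1 is lasso and l1_share = 0 is ridge (assumed convention).
    """
    mix = l1_share * l1_term(weights) + (1 - l1_share) * l2_term(weights)
    return error + penalty_weight * mix


def gradient_descent(xs, ys, learning_rate=0.02, steps=5000):
    """Repeatedly step b and a 'downhill' on the mean square error."""
    b, a = 1.0, 0.0  # starting guesses
    n = len(xs)
    for _ in range(steps):
        residuals = [(b * x + a) - y for x, y in zip(xs, ys)]
        slope_b = 2 / n * sum(r * x for r, x in zip(residuals, xs))
        slope_a = 2 / n * sum(residuals)
        b -= learning_rate * slope_b  # learning rate sets the jump size
        a -= learning_rate * slope_a
    return b, a


# The glossary's examples: y = 2x + 1 under ridge, y = 2(x1) - 3(x2) + 1 under lasso.
print(l2_term([2.0]))        # 4.0
print(l1_term([2.0, -3.0]))  # 5.0

# Toy data generated from y = 1.1x + 0.5 with no noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1 * x + 0.5 for x in xs]

b, a = gradient_descent(xs, ys)
print(round(b, 4), round(a, 4))
```

On this toy dataset the descent settles at the same b and a that generated the data, and the cost at b = 1.0 or b = 1.2 is higher than at b = 1.1, which is the parabola described in the cost function entry. A learning rate that is too large can overshoot the bottom of the curve instead of approaching it.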

I-O Psychologists That Have Published in Science or Nature

2017 September 29
by Richard N. Landers

Science and Nature are, for better or worse, often regarded as the pinnacle of scientific achievement. Unfortunately for I-O psychology, the topics we study typically don’t fall within the sorts of things that they publish. Despite that, a handful of I-O psychologists have managed it!

You might wonder how we came across this information. The reason is that we have collected a complete list of every I-O researcher currently employed in an I-O psychology doctoral program in the SIOP directory of graduate programs as well as every publication they have ever published, sortable by a huge number of interesting features, created by scraping Elsevier’s Scopus.  We’re using this database to create rankings of interdisciplinarity among I-O psychology programs and faculty for a special feature to be published in TIP around May 2018 alongside several other new ranking systems. Until then, we’ll be releasing interesting little snippets as we poke around.  Thanks to Bo Armstrong for suggesting this particular search!

Importantly, I don’t have any publication information for people not employed in I-O psychology doctoral programs as either permanent or temporary faculty. So if you’re an I-O and have a Nature or Science publication I missed, let me know, and I’ll add it here. As we construct unique, interesting data slices, it will also help us understand where the problems in our dataset are so that we can fix them before you see the final version!

Without further ado, and in alpha order:

  1. Michele Gelfand at the University of Maryland’s Social-Decision-Organizational Sciences program, with an article in Science titled Differences Between Tight and Loose Cultures: A 33-Nation Study
  2. Nathan Kuncel and Sarah Hezlett at the University of Minnesota’s I-O Psych program, with an article in Science titled Standardized Tests Predict Graduate Students’ Success
  3. Mark Roebke, an I-O PhD student at Wright State, was part of the Open Science Collaboration that published an article in Science titled Estimating the Reproducibility of Psychological Science

We also have two honorable mentions for I-O adjacent faculty:

  1. John Antonakis at the University of Lausanne business school, with an article in Science titled Predicting Elections: Child’s Play!
  2. Michael Frese at the University of Lueneburg business school, with an article in Science titled Teaching Personal Initiative Beats Traditional Training in Boosting Small Business in West Africa

That’s it!! I-Os have never published in Nature and only a few times in Science! Better than nothing, I suppose…