A Plain-English Explanation of Correlation
[Interactive correlation simulator (Flash): http://rlanders.net/correlation_simulator.swf]
It came to my attention recently (primarily through my wife’s hunting for online resources on the topic) that there are no (or at least few) easy-to-understand plain-English explanations of what a correlation is or how to appropriately interpret it. Since percentages (which are fairly straightforward) and correlation (which is not) seem to have become the de facto statistics-to-be-reported by news agencies, this seems like a hole that needs to be plugged.
After all, the first Google result to “What is a correlation?” should not have formulas on it! If you can understand formulas, you probably wouldn’t be Googling “What is a correlation?” to begin with!
So here you go – my explanation of correlation as I would explain it to my undergraduate classes.
Correlation is an index of linear relationship between variables. Each of these concepts requires a little explanation.
Variable is simply the scientific way to refer to a set of measured numbers. For example, if I went to a classroom and measured the height of every person in that class, height would be a variable, and all of my measurements would be values contained within that variable. We call them variables because the values they contain vary – they can be any number of numbers gathered for any particular measurement, although outside forces usually constrain them to particular ranges.
An index is called an index because it is not a direct measure of anything, which is one of the reasons most people find indexes so confusing. You see a correlation of 0.5 and think “Well, 0.5 of what?” The answer is – “of nothing.” Indexes (including correlation) have no unit of measurement. For comparison, length is not usually measured with an index because its units are meaningful (e.g. 0.5 inches).
Covariance is actually the fancy measurable way to describe “how much are two variables related.” For example, imagine you have two paired variables:
X = 1, 2, 3
Y = 4, 5, 6
If you imagine those data points as pairs (e.g. [1,4], [2,5], [3,6]), you’ll see a clear pattern. When the first number goes up, the second number goes up predictably, which means we have a perfect relationship. If you graph those points, you’ll also notice that they form a straight line – this is what we call a linear relationship. Now replace Y with a second set of data:
Y = 4, 5, 7
If you imagine this new set of Ys with the previous set of Xs, you’ll still see a pattern, but it’s not quite as predictable as it used to be. You could still draw a line through the middle of your data, but it wouldn’t quite touch every point. It is still a strong positive (+) relationship, but it is no longer perfect.
Covariance captures how much two sets of numbers move together. The problem is that it is still expressed in the original units of the data, so its size depends on the scale of the numbers as well as on how closely they move together. For example, consider these two sets of two variables:
X1 = 1, 2, 3
Y1 = 4, 5, 6
X2 = 10, 20, 30
Y2 = 40, 50, 60
The strength of these two relationships is obvious – both are perfect relationships. However, because the numbers in the second dataset are more spread out, its covariance will also be larger (100 times larger, in fact). That’s a problem. We don’t want the raw amount of co-variation in each dataset, which depends on the units. We want to be able to compare the strength of the two relationships directly.
Thus was born the need for correlation – a standardized covariance. Instead of using covariance, which changes with the units and spread of the numbers, statisticians decided to create a scale so that the relationship between any pairs of data, no matter what their original units were, could be compared directly.
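As a rough illustration of what “standardized” means here, the Python sketch below computes a covariance and then divides it by the two standard deviations. The function names and the rescaled copy of the data (every value multiplied by 10) are assumptions made for this illustration, and the covariance shown is the sample version (dividing by n - 1).

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Sample covariance: average product of deviations (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Covariance divided by both standard deviations, so the units cancel."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

x_small, y_small = [1, 2, 3], [4, 5, 6]
# Hypothetical rescaled copy: same pattern, every value 10 times larger.
x_big = [10 * v for v in x_small]
y_big = [10 * v for v in y_small]

print(covariance(x_small, y_small))   # 1.0
print(covariance(x_big, y_big))       # 100.0
print(correlation(x_small, y_small))  # 1.0
print(correlation(x_big, y_big))      # 1.0
```

The covariance changes by a factor of 100 even though the pattern is identical, while the correlation stays the same.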
Thus, correlation was anchored at two numbers to indicate the magnitude of the relationship: 0 and 1. 1 represents the perfect relationship described above. 0 represents the total absence of a relationship.
But what about situations where one piece of data goes in reverse? Consider these variables:
X = 1, 2, 3
Y = 6, 5, 4
This relationship is still perfect, but the direction between the two is reversed. This is called a negative (-) relationship. Positive and negative thus indicate the direction of the linear relationship, while the magnitude represents the strength of the linear relationship.
A correlation is thus the combination of an indicator of direction (- or +) and a number representing the relationship’s magnitude (0 to 1). Overall then, it appears as if correlations range from -1 to +1. But it’s important to remember that a -1 is just as strong (and just as perfect!) a relationship as a +1.
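For readers who want to check the small examples above, here is a minimal Python sketch that computes the correlation for each of them. The helper names `pearson_r` and `stronger` are hypothetical, chosen only for this illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def stronger(r1, r2):
    """Return whichever correlation is stronger, ignoring its sign."""
    return r1 if abs(r1) >= abs(r2) else r2

x = [1, 2, 3]
print(pearson_r(x, [4, 5, 6]))            # 1.0   (perfect, positive)
print(round(pearson_r(x, [4, 5, 7]), 3))  # 0.982 (strong, not perfect)
print(pearson_r(x, [6, 5, 4]))            # -1.0  (perfect, negative)

# Strength is the magnitude, so -0.7 is a stronger correlation than +0.3.
print(stronger(0.3, -0.7))                # -0.7
```

The sign carries the direction and the absolute value carries the strength, which is why the last comparison picks -0.7.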
Now that you have all the background knowledge, try out that Flash program above to play with correlations!
Keep N=200 and simulate the following correlations in this order to see what they look like:
- +1
- +.95
- +.7
- +.3
- 0
- -.3
- -.7
- -1
You should notice that at 0, the data just looks like a big shapeless cloud while at +1, it’s a very straight line. When you change to negative correlations, that’s still true, but the line changes direction. Move the sliders back and forth to get a feel for what correlations tend to look like.
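If the simulator is not available, a similar exercise can be sketched in Python. How the simulator itself generates its data is not described here, so the method below is an assumption: it uses one standard construction, mixing a random variable with independent noise so that the expected correlation equals the target. The function names are hypothetical.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def simulate(target_r, n, rng):
    """Draw n (x, y) pairs whose expected correlation is target_r."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    noise = [rng.gauss(0, 1) for _ in range(n)]
    weight = sqrt(1 - target_r ** 2)  # share of y that is pure noise
    ys = [target_r * x + weight * e for x, e in zip(xs, noise)]
    return xs, ys

rng = random.Random(42)  # fixed seed so repeated runs give the same numbers
for target in [1, 0.95, 0.7, 0.3, 0, -0.3, -0.7, -1]:
    xs, ys = simulate(target, 200, rng)
    print(f"target {target:+.2f} -> sample r = {pearson_r(xs, ys):+.3f}")
```

With N=200 the sample correlation will usually land near the target but not exactly on it, except at +1 and -1 where there is no noise. That gap is sampling error.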
Another thing you should notice: correlation is simply used to describe data. If a correlation is found between two variables, that doesn’t tell you anything about whether or not one variable caused the other. The golden rule of correlation is simple and even sort of rhymes: correlation is not causation. Causation can only be proven through carefully designed experiments – statistics alone do not have the power to prove anything caused anything else.
If you’d like more information on the interactive correlation simulator above, please check out this additional information on my professional webpage. If you are using this page as part of a lesson in your classroom, I’d appreciate a comment letting me know who you are, what institution you work for, and what course you are using it in. The tool above can also be used to teach a little bit about sampling error!
And finally, if you are an educator or represent an organization interested in partnering with a major research university to use technology-enhanced hiring or training techniques like this tool, please leave a comment or visit my laboratory (the webpage of which may still be under construction!).
Great post! You are right that the layman needs a little bit more help understanding this idea. It is fundamental to understanding pretty much any statistics report in a meaningful way.
I tend to water down my explanation a bit further. Here’s my brief explanation:
Correlation is an estimate of how related two variables are. The scale runs from 0 to 1, where 0 indicates a complete lack of a relationship and 1 indicates a perfect relationship. If you see a (-) sign in front of the correlation, that means that the relationship is negative, so as one variable increases, the other decreases.
I think talking about it in terms of a 0 to 1 scale makes it more familiar for the uninitiated.
The problem that I get when explaining it that briefly is people getting this question wrong:
Which of these is the strongest correlation? a) .3 b) -.7
So as a result, I try to emphasize the difference between magnitude and direction. Plus I think having the visual (the Flash program I wrote above) helps a great deal.
I will admit, however, that if I was trying to explain this to a manager who only wanted to learn what he absolutely had to, my explanation would probably be a bit closer to yours!
I tend towards the simpler explanation. Answering the question completely tends to generate more questions than the manager really wants answered.
I find that limiting myself to the question’s intent works pretty well; though people still regard me as some sort of magician…
By the way, that Flash application is a great explanatory tool. Have you thought about adding a “best fitting line” to represent the relationship? This might help people make the connection between correlation and linear relationships.
I did think about that, but ran out of time to integrate it by the time I wanted to put this post online. Maybe I will take a crack at implementing it tonight…
On a Friday night? You must be about as cool as me =)
Well, my wife tells me it’s a problem that I find work to be a good use of my free time. She’s probably right!
But anyway – a bright red regression line has now been added (you may need to refresh the page).
On your previous point – yes – I agree re: managers. But this is something the world needs to know!!
I have refreshed, but I don’t see the red line you mentioned.
That’s odd… It should be updated! You might need to Ctrl-Refresh to load background items, depending on what browser you’re using.
Brilliant! I’ve actually bookmarked this now…
Thanks Richard! I’ll be using this in PSY 3711. Very cool!
Great post, Richard! Your wife sounds like mine 😉
I am a disabled senior – I was very smart in school years ago, but even these explanations, when I asked for simple/layman’s terms for correlation and linear relationship, just boggle me and tangle me. I need much simpler definitions and it is making me feel very stuck – stranded.
Correlation, of the type I wrote about here, simply means that for a given set of data, values on one variable can be predicted from the other. For example, let’s say we measured the length of people’s left index finger and right index finger in inches. That might create a dataset like this:
L R
3.25 3.23
3.05 3.06
3.43 3.44
If I gave you this dataset, even if you didn’t know what the numbers mean, and I asked, “If L = 4, what would you guess R equals?” you’d probably say “My best guess is 4.” You can do that because the correlation here is very, very strong – almost perfect. When you already know L, you can guess the value of R and be correct within a very small margin. When expressed numerically, that’s a correlation of “almost 1”.
You can then draw these three dots on a piece of graph paper, with L on the x-axis and R on the y-axis. The challenge then is to draw a single straight line that cuts between these three points most closely, i.e., with the smallest possible distances between the line and each point. This is the “linear relationship.”
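The line described above can also be found with a short calculation. The Python sketch below is an illustration, assuming the usual least-squares definition of “closest” (smallest squared vertical distances), and the function name `fit_line` is hypothetical. It fits the line to the three finger measurements and then uses it to guess R when L = 4.

```python
from math import sqrt

left = [3.25, 3.05, 3.43]    # L column from the table above
right = [3.23, 3.06, 3.44]   # R column from the table above

def fit_line(xs, ys):
    """Return (slope, intercept, r) for the least-squares line through the points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r

slope, intercept, r = fit_line(left, right)
guess = intercept + slope * 4  # predicted R for a left finger of 4 inches

print(f"correlation: {r:.3f}")
print(f"line: R = {intercept:.3f} + {slope:.3f} * L")
print(f"best guess for R when L = 4: {guess:.2f}")
```

The correlation comes out just under 1, and the guess for R lands very close to 4, which matches the intuition in the example.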
In datasets where the values are less predictable (i.e., there is more noise), the correlation decreases. At the extreme, you observe a correlation of 0 when your data are a cloud of dots, and knowing one value doesn’t help you predict the other whatsoever.