# A Plain-English Explanation of Correlation

[Interactive correlation simulator (Flash): http://rlanders.net/correlation_simulator.swf]

It came to my attention recently (primarily through my wife’s hunting for online resources on the topic) that there are no (or at least few) easy-to-understand plain-English explanations of what a correlation is or how to appropriately interpret it. Since percentages (which are fairly straightforward) and correlation (which is not) seem to have become the de facto statistics-to-be-reported by news agencies, this seems like a hole that needs to be plugged.

After all, the first Google result to “What is a correlation?” should not have formulas on it! If you can understand formulas, you probably wouldn’t be Googling “What is a correlation?” to begin with!

So here you go – my explanation of correlation as I would explain it to my undergraduate classes.

Correlation is an **index** of **linear relationship** between **variables**. Each of these concepts requires a little explanation.

**Variable** is simply the scientific way to refer to a set of measured numbers. For example, if I went to a classroom and measured the height of every person in that class, *height* would be a variable, and all of my measurements would be values contained within that variable. We call them variables because the values they contain **vary** – they can be any number of numbers gathered for any particular measurement, although outside forces usually constrain them to particular ranges.

An **index** is called an index because it is not a direct measure of anything, which is one of the reasons most people find indexes so confusing. You see a correlation of 0.5 and think “Well, 0.5 of what?” The answer is – “of nothing.” Indexes (including correlation) have no unit of measurement. For comparison, length is not usually measured with an index because its units *are* meaningful (e.g. 0.5 inches).

**Covariance** is actually the fancy, measurable way to describe “how much are two variables related.” For example, imagine you have two paired variables:

X = 1, 2, 3

Y = 4, 5, 6

If you imagine those data points as pairs (e.g. [1,4], [2,5], [3,6]), you’ll see a clear pattern. When the first number goes up, the second number goes up predictably, which means we have a **perfect relationship**. If you graph those points, you’ll also notice that they form a *straight line* – this is what we call a **linear relationship**. Now replace Y with a second set of data:

Y = 4, 5, 7

If you imagine this new set of Ys with the previous set of Xs, you’ll still see a pattern, but it’s not quite as predictable as it used to be. You could still draw a line through the middle of your data, but it wouldn’t quite touch every point. It is still a strong **positive (+) relationship**, but it is no longer perfect.

All else being equal, the more predictable one set of numbers is from another, the larger the covariance will be. The problem is that covariance still carries the units of the original numbers. For example, consider these two sets of two variables:

X1 = 1, 2, 3

Y1 = 4, 5, 6

X2 = 4, 5, 6

Y2 = 7, 8, 9

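The strength of these two relationships is obvious – both are perfect relationships. However, covariance is expressed in the units of the original numbers, so its size depends on how spread out those numbers are, not just on how predictable the relationship is. If we measured the second dataset on a scale ten times larger (X2 = 40, 50, 60 and Y2 = 70, 80, 90), the relationship would be exactly as perfect, yet the covariance would be 100 times larger. That’s a problem. We don’t want to know the **raw** amount of predictability between the two datasets. We want to be able to compare them directly.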

Thus was born the need for correlation – a **standardized** covariance. Instead of using covariance, which grows and shrinks with the scale of the numbers, statisticians decided to create a scale so that the relationship between any pairs of data, no matter what their original units were, could be compared directly.
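For readers who would rather see this in code than in formulas, here is a minimal Python sketch of what "standardized" means. It assumes the usual sample covariance (dividing by n - 1), and the rescaled dataset (every value multiplied by 10) is a hypothetical illustration rather than data from this post.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance: summed products of paired deviations, over n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance divided by both standard deviations, so the units cancel."""
    sd_x = sqrt(covariance(xs, xs))  # variance of x is its covariance with itself
    sd_y = sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)

x1, y1 = [1, 2, 3], [4, 5, 6]

# Hypothetical rescaling: the same data with every value multiplied by 10.
x1_big = [10 * v for v in x1]
y1_big = [10 * v for v in y1]

print("original:", covariance(x1, y1), correlation(x1, y1))
print("rescaled:", covariance(x1_big, y1_big), correlation(x1_big, y1_big))
```

The covariance changes when the numbers are rescaled, while the correlation stays the same, which is exactly why correlation can be compared across datasets with different units.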

Thus, correlation was anchored at two numbers to indicate the **magnitude** of the relationship: 0 and 1. 1 represents the perfect relationship described above. 0 represents the total absence of a relationship.

But what about situations where one piece of data goes in reverse? Consider these variables:

X = 1, 2, 3

Y = 6, 5, 4

This relationship is still perfect, but the direction between the two is reversed. This is called a **negative (-) relationship**. Positive and negative thus indicate the **direction** of the linear relationship, while the magnitude represents the **strength** of the linear relationship.

A correlation is thus the combination of an indicator of direction (- or +) and a number representing the relationship’s magnitude (0 to 1). Overall then, it appears as if correlations range from -1 to +1. But it’s important to remember that a -1 is just as strong (and just as perfect!) a relationship as a +1.
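The split between direction and magnitude can be sketched in a few lines of Python. This is only an illustration using the small datasets from this post; the `describe` helper is a hypothetical name introduced here, and the correlation is computed with the standard Pearson formula.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def describe(r):
    """Hypothetical helper: split a correlation into direction and magnitude."""
    direction = "positive" if r > 0 else "negative" if r < 0 else "none"
    return direction, abs(r)

x = [1, 2, 3]
for y in ([4, 5, 6], [4, 5, 7], [6, 5, 4]):
    r = correlation(x, y)
    print(y, round(r, 3), describe(r))
```

Comparing magnitudes rather than raw values is what makes -.7 a stronger correlation than +.3: `describe(-0.7)` has the larger magnitude even though the number itself is smaller.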

Now that you have all the background knowledge, try out that Flash program above to play with correlations!

Keep N=200 and simulate the following correlations in this order to see what they look like:

- +1
- +.95
- +.7
- +.3
- 0
- -.3
- -.7
- -1

You should notice that at 0, the data just looks like a big shapeless cloud while at +1, it’s a very straight line. When you change to negative correlations, that’s still true, but the line changes direction. Move the sliders back and forth to get a feel for what correlations tend to look like.
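If you cannot run the simulator, a rough stand-in can be sketched in Python. This is not the code behind the Flash tool; it assumes one common way of generating correlated data (mixing a standard normal variable with independent noise), and the function names are made up for this example.

```python
import random
from math import sqrt

def simulate(r, n=200, seed=None):
    """Draw n (x, y) pairs whose underlying correlation is r (assumed method)."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and +1")
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        # Mix x with independent noise; the weights keep the variance of y at 1.
        y = r * x + sqrt(1 - r * r) * noise
        pairs.append((x, y))
    return pairs

def sample_correlation(pairs):
    """Pearson correlation actually observed in the simulated pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)

for target in (1, 0.95, 0.7, 0.3, 0, -0.3, -0.7, -1):
    observed = sample_correlation(simulate(target, n=200))
    print(f"target {target:+.2f}  observed {observed:+.3f}")
```

The observed value will usually land near the target without matching it exactly, and the gap tends to shrink as N grows. Plotting the pairs for each target should give the same cloud-to-line progression described above.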

Another thing you should notice: correlation is simply used to **describe** data. If a correlation is found between two variables, that doesn’t tell you anything about whether or not one variable **caused** the other. The golden rule of correlation is simple and even sort of rhymes: **correlation is not causation**. Causation can only be proven through carefully designed experiments – statistics alone do not have the power to prove anything caused anything else.

If you’d like more information on the interactive correlation simulator above, please check out this additional information on my professional webpage. If you are using this page as part of a lesson in your classroom, I’d appreciate a comment letting me know who you are, what institution you work for, and what course you are using it in. The tool above can also be used to teach a little bit about sampling error!

And finally, if you are an educator or represent an organization interested in partnering with a major research university to use technology-enhanced hiring or training techniques like this tool, please leave a comment or visit my laboratory (the webpage of which may still be under construction!).


Great post! You are right that the layman needs a little bit more help understanding this idea. It is fundamental to understanding pretty much any statistics report in a meaningful way. I tend to water down my explanation a bit further. Here’s my brief explanation:

Correlation is an estimate of how related two variables are. The scale runs from 0 to 1, where 0 indicates a complete lack of a relationship and 1 indicates a perfect relationship. If you see a (-) sign in front of the correlation, that means that the relationship is negative, so as one variable increases, the other decreases.

I think talking about it in terms of a 0 to 1 scale makes it more familiar for the uninitiated.

The problem that I get when explaining it that briefly is people getting this question wrong:

Which of these is the strongest correlation? a) .3 b) -.7

So as a result, I try to emphasize the difference between magnitude and direction. Plus I think having the visual (the Flash program I wrote above) helps a great deal.

I will admit, however, that if I were trying to explain this to a manager who only wanted to learn what he absolutely had to, my explanation would probably be a bit closer to yours!

I tend towards the simpler explanation. Answering the question completely tends to generate more questions than the manager really wants answered.

I find that limiting myself to the question’s intent works pretty well; though people still regard me as some sort of magician…

By the way, that Flash application is a great explanatory tool. Have you thought about adding a “best fitting line” to represent the relationship? This might help people make the connection between correlation and linear relationships.

I did think about that, but ran out of time to integrate it by the time I wanted to put this post online. Maybe I will take a crack at implementing it tonight…

On a Friday night? You must be about as cool as me =)

Well, my wife tells me it’s a problem that I find work to be a good use of my free time. She’s probably right!

But anyway – a bright red regression line has now been added (you may need to refresh the page).

On your previous point – yes – I agree re: managers. But this is something the world needs to know!!

I have refreshed, but I don’t see the red line you mentioned.

That’s odd… It should be updated! You might need to Ctrl-Refresh to load background items, depending on what browser you’re using.

Brilliant! I’ve actually bookmarked this now…

Thanks Richard! I’ll be using this in PSY 3711. Very cool!

Great post, Richard! Your wife sounds like mine