# Computing Intraclass Correlations (ICC) as Estimates of Interrater Reliability in SPSS

If you think my writing about statistics is clear below, consider my student-centered, practical and concise Step-by-Step Introduction to Statistics for Business for your undergraduate classes, available now from SAGE. Social scientists of all sorts will appreciate the ordinary, approachable language and practical value – each chapter starts with and discusses a young small business owner facing a problem solvable with statistics, a problem solved by the end of the chapter with the statistical kung-fu gained.

This article has been published in the *Winnower*. You can cite it as:

Landers, R.N. (2015). Computing intraclass correlations (ICC) as estimates of interrater reliability in SPSS. *The Winnower 2*:e143518.81744. DOI: 10.15200/winn.143518.81744


Recently, a colleague of mine asked for some advice on how to compute interrater reliability for a coding task, and I discovered that there aren’t many resources online written in an easy-to-understand format – most either 1) go in depth about formulas and computation or 2) go in depth about SPSS without giving many specific reasons for why you’d make several important decisions. The primary resource available is a 1979 paper by Shrout and Fleiss [1], which is quite dense. So I am taking a stab at providing a comprehensive but easier-to-understand resource.

Reliability, generally, is the proportion of “real” information about a construct of interest captured by your measurement of it. For example, if someone reported the reliability of their measure was .8, you could conclude that 80% of the variability in the scores captured by that measure represented the construct, and 20% represented random variation. The more uniform your measurement, the higher reliability will be.
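One way to write this down, using classical test theory notation (the symbols $\sigma^2_T$ for construct or "true" variance and $\sigma^2_E$ for random error variance are introduced here purely for illustration):

```latex
\text{reliability} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E},
\qquad \text{e.g.}\quad \frac{0.8}{0.8 + 0.2} = .80
```

Under that reading, a reliability of .8 corresponds to 80% of the observed score variance coming from the construct and 20% from random variation.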

In the social sciences, we often have research participants complete surveys, in which case you don’t need ICCs – you would more typically use coefficient alpha. But when you have research participants provide something about themselves from which you need to extract data, your measurement becomes what you get from that extraction. For example, in one of my lab’s current studies, we are collecting copies of Facebook profiles from research participants, after which a team of lab assistants looks them over and makes ratings based upon their content. This process is called coding. Because the research assistants are creating the data, their ratings are my scale – not the original data. Which means they 1) make mistakes and 2) vary in their ability to make those ratings. An estimate of interrater reliability will tell me what proportion of their ratings is “real”, i.e. represents an underlying construct (or potentially a combination of constructs – there is no way to know from reliability alone – all you can conclude is that you are measuring *something* consistently).

An intraclass correlation (ICC) can be a useful estimate of inter-rater reliability on quantitative data because it is highly flexible. A Pearson correlation can be a valid estimator of interrater reliability, but only when you have meaningful pairings between two and only two raters. What if you have more? What if your raters differ by ratee? This is where ICC comes in (note that if you have qualitative data, e.g. categorical data or ranks, you would not use ICC).

Unfortunately, this flexibility makes ICC a little more complicated than many estimators of reliability. While you can often just throw items into SPSS to compute a coefficient alpha on a scale measure, there are several additional questions one must ask when computing an ICC, and one restriction. The restriction is straightforward: you must have the same number of ratings for every case rated. The questions are more complicated, and their answers are based upon how you identified your raters, and what you ultimately want to do with your reliability estimate. Here are the first two questions:

- Do you have consistent raters for all ratees? For example, do the exact same 8 raters make ratings on every ratee?
- Do you have a sample or population of raters?

If your answer to Question 1 is no, you need ICC(1). In SPSS, this is called “One-Way Random.” In coding tasks, this is uncommon, since you can typically control the number of raters fairly carefully. It is most useful with massively large coding tasks. For example, if you had 2000 cases to rate, you might assign your 10 research assistants to make 400 ratings each – each case is rated by 2 research assistants (you always have 2 ratings per case), but you counterbalance them so that a random two raters make ratings on each subject. It’s called “One-Way Random” because 1) it makes no effort to disentangle the effects of the rater and ratee (i.e. one effect) and 2) it assumes these ratings are randomly drawn from a larger population (i.e. a random effects model). ICC(1) will always be the smallest of the ICCs.

If your answer to Question 1 is yes and your answer to Question 2 is “sample”, you need ICC(2). In SPSS, this is called “Two-Way Random.” Unlike ICC(1), this ICC assumes that the variance of the raters is only adding noise to the estimate of the ratees, and that mean rater error = 0. Or in other words, while a particular rater might rate Ratee 1 high and Ratee 2 low, it should all even out across many raters. Like ICC(1), it assumes a random effects model for raters, but it explicitly models this effect – you can sort of think of it like “controlling for rater effects” when producing an estimate of reliability. If you have the same raters for each case, this is generally the model to go with. This will always be larger than ICC(1) and is represented in SPSS as “Two-Way Random” because 1) it models both an effect of rater and of ratee (i.e. two effects) and 2) assumes both are drawn randomly from larger populations (i.e. a random effects model).

If your answer to Question 1 is yes and your answer to Question 2 is “population”, you need ICC(3). In SPSS, this is called “Two-Way Mixed.” This ICC makes the same assumptions as ICC(2), but instead of treating rater effects as random, it treats them as fixed. This means that the raters in your task are the only raters anyone would be interested in. This is uncommon in coding, because theoretically your research assistants are only a few of an unlimited number of people that could make these ratings. This means ICC(3) will also always be larger than ICC(1) and typically larger than ICC(2), and is represented in SPSS as “Two-Way Mixed” because 1) it models both an effect of rater and of ratee (i.e. two effects) and 2) assumes a random effect of ratee but a fixed effect of rater (i.e. a mixed effects model).
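To make the three models more concrete, here is a minimal Python sketch of the single-rater coefficients in the Shrout and Fleiss framework, computed from ANOVA mean squares. The function name `icc_single` and the small ratings matrix are illustrative assumptions, not SPSS output or data from my lab. Note that the ICC(2,1) below is the absolute-agreement form and the ICC(3,1) is the consistency form, so the numbers should only match SPSS when the corresponding type is selected there.

```python
def icc_single(ratings):
    """Single-rater ICCs from a matrix with rows = ratees and columns = raters."""
    n = len(ratings)       # number of ratees
    k = len(ratings[0])    # number of raters per ratee
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Partition the total sum of squares into ratee, rater, and residual parts.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_ratees = k * sum((m - grand) ** 2 for m in row_means)
    ss_raters = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_ratees - ss_raters

    bms = ss_ratees / (n - 1)                    # between-ratee mean square
    jms = ss_raters / (k - 1)                    # between-rater mean square
    wms = (ss_total - ss_ratees) / (n * (k - 1)) # within-ratee mean square
    ems = ss_error / ((n - 1) * (k - 1))         # residual mean square

    return {
        "ICC(1,1)": (bms - wms) / (bms + (k - 1) * wms),
        "ICC(2,1)": (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n),
        "ICC(3,1)": (bms - ems) / (bms + (k - 1) * ems),
    }


# Illustrative data only: 6 ratees (rows) rated by the same 4 raters (columns).
example_ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
result = icc_single(example_ratings)
for name, value in result.items():
    print(name, round(value, 3))
```

On this made-up matrix the three values come out in the order described above, with ICC(1,1) smallest and ICC(3,1) largest, because the raters differ a lot in their average level and only the two-way models separate that out.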

After you’ve determined which kind of ICC you need, there is a second decision to be made: are you interested in the reliability of a single rater, or of their mean? If you’re coding for research, you’re probably going to use the mean rating. If you’re coding to determine how accurate a single person would be if they made the ratings on their own, you’re interested in the reliability of a single rater. For example, in our Facebook study, we want to know both. First, we might ask “what is the reliability of our ratings?” Second, we might ask “if one person were to make these judgments from a Facebook profile, how accurate would that person be?” We add “,k” to the ICC rating when looking at means, or “,1” when looking at the reliability of single raters. For example, if you computed an ICC(2) with 8 raters, you’d be computing ICC(2,8). If you computed an ICC(2) with the same 16 raters for every case but were interested in a single rater, you’d still be computing ICC(2,1). For ICC(#,1), a large number of raters will produce a narrower confidence interval around your reliability estimate than a small number of raters, which is why you’d still want a large number of raters, if possible, when estimating ICC(#,1).

After you’ve determined which specificity you need, the third decision is to figure out whether you need a measure of absolute agreement or consistency. If you’ve studied correlation, you’re probably already familiar with this concept: if two variables are perfectly consistent, they don’t necessarily agree. For example, consider Variable 1 with values 1, 2, 3 and Variable 2 with values 7, 8, 9. Even though these scores are very different, the correlation between them is 1 – so they are highly consistent but don’t agree. If using a mean [ICC(#, k)], consistency is typically fine, especially for coding tasks, as mean differences between raters won’t affect subsequent analyses on that data. But if you are interested in determining the reliability for a single individual, you probably want to know how well that score will assess the real value.
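The 1, 2, 3 versus 7, 8, 9 example can be checked with a few lines of Python. This is only a sketch: the helper name `pearson` and the variable names are mine, and the "mean gap" is just a simple way to show the disagreement that a correlation ignores.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


variable_1 = [1, 2, 3]
variable_2 = [7, 8, 9]

r = pearson(variable_1, variable_2)
mean_gap = sum(b - a for a, b in zip(variable_1, variable_2)) / len(variable_1)

print("correlation:", round(r, 3))     # the two variables are perfectly consistent
print("mean difference:", mean_gap)    # yet every pair of scores is 6 points apart
```

A consistency-type ICC behaves like the correlation here, while an absolute-agreement ICC is pulled down by the 6-point gap.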

Once you know what kind of ICC you want, it’s pretty easy in SPSS. First, create a dataset with columns representing raters (e.g. if you had 8 raters, you’d have 8 columns) and rows representing cases. You’ll need a complete dataset for each variable you are interested in. So if you wanted to assess the reliability for 8 raters on 50 cases across 10 variables being rated, you’d have 10 datasets containing 8 columns and 50 rows (400 ratings per dataset, 4000 total points of data).
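If your ratings start out as one row per rating (a "long" layout), they need to be pivoted into the raters-as-columns layout first. Here is a minimal sketch of that step in Python; the function name `to_wide`, the tuple layout, and the example labels are all hypothetical, and in practice you might do the same thing with SPSS's own restructuring tools.

```python
def to_wide(long_rows, raters):
    """Pivot (ratee, rater, score) tuples into rows = ratees, columns = raters."""
    by_ratee = {}
    for ratee, rater, score in long_rows:
        by_ratee.setdefault(ratee, {})[rater] = score

    wide = []
    for ratee in sorted(by_ratee):
        scores = by_ratee[ratee]
        # ICC needs the same number of ratings for every case, so stop on gaps.
        missing = [r for r in raters if r not in scores]
        if missing:
            raise ValueError(f"ratee {ratee!r} is missing ratings from {missing}")
        wide.append([scores[r] for r in raters])
    return wide


# Hypothetical long-format ratings for one rated variable.
long_rows = [
    ("case1", "A", 3), ("case1", "B", 4),
    ("case2", "A", 2), ("case2", "B", 2),
]
wide = to_wide(long_rows, raters=["A", "B"])
print(wide)
```

Each inner list is one case (row) and each position is one rater (column), which is the shape the Reliability Analysis dialog expects.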

A special note for those of you using surveys: if you’re interested in the inter-rater reliability of a scale mean, compute ICC on that scale mean – not the individual items. For example, if you have a 10-item unidimensional scale, calculate the scale mean for each of your rater/target combinations *first* (i.e. one mean score per rater per ratee), and then use that scale mean as the target of your computation of ICC. Don’t worry about the inter-rater reliability of the individual items unless you are doing so as part of a scale development process, i.e. you are assessing scale reliability in a pilot sample in order to cut some items from your final scale, which you will later cross-validate in a second sample.
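As a small sketch of the "scale mean first" step, assuming the item ratings are stored as nested lists (the layout and the name `scale_means` are hypothetical):

```python
def scale_means(item_scores):
    """item_scores[ratee][rater] is a list of item ratings for one scale.

    Returns one scale mean per rater per ratee (rows = ratees, columns = raters).
    """
    return [
        [sum(items) / len(items) for items in ratee_row]
        for ratee_row in item_scores
    ]


# Two ratees, two raters, a three-item scale (made-up numbers).
item_scores = [
    [[1, 2, 3], [2, 2, 2]],
    [[4, 5, 3], [5, 5, 5]],
]
means = scale_means(item_scores)
print(means)
```

The resulting matrix of means, not the item-level ratings, is what would go into the ICC computation.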

In each dataset, you then need to open the **Analyze** menu, select **Scale**, and click on **Reliability Analysis**. Move all of your rater variables to the right for analysis. Click **Statistics** and check **Intraclass correlation coefficient** at the bottom. Specify your model (One-Way Random, Two-Way Random, or Two-Way Mixed) and type (Consistency or Absolute Agreement). Click **Continue** and **OK**. You should end up with something like this:

In this example, I computed an ICC(2) with 4 raters across 20 ratees. You can find the ICC(2,1) in the first line – ICC(2,1) = .169. The second line gives ICC(2,k), which in this case is ICC(2,4) = .449. Therefore, 44.9% of the variance in the mean of these raters is “real”.
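The single-rater and average-rater values in this example are tied together by the Spearman-Brown formula, which steps a single-rater reliability up to the reliability of the mean of k raters. A quick check in Python (the function name `spearman_brown` is just a label I chose for the sketch):

```python
def spearman_brown(single_icc, k):
    """Project a single-rater reliability to the reliability of a k-rater mean."""
    return k * single_icc / (1 + (k - 1) * single_icc)


# Values from the example above: ICC(2,1) = .169 with 4 raters.
average_icc = spearman_brown(0.169, 4)
print(round(average_icc, 3))  # 0.449
```

This reproduces the ICC(2,4) reported for the example, and it also shows why adding raters helps: the same single-rater reliability yields a more reliable mean as k grows.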

So here’s the summary of this whole process:

**Decide which category of ICC you need.**

- Determine if you have consistent raters across all ratees (e.g. always 3 raters, and always the same 3 raters). If not, use ICC(1), which is “One-way Random” in SPSS.
- Determine if you have a population of raters. If yes, use ICC(3), which is “Two-Way Mixed” in SPSS.
- If you didn’t use ICC(1) or ICC(3), you need ICC(2), which assumes a sample of raters, and is “Two-Way Random” in SPSS.

**Determine which value you will ultimately use.**

- If a single individual, you want ICC(#,1), which is “Single Measure” in SPSS.
- If the mean, you want ICC(#,k), which is “Average Measures” in SPSS.

**Determine which set of values you ultimately want the reliability for.**

- If you want to use the subsequent values for other analyses, you probably want to assess consistency.
- If you want to know the reliability of individual scores, you probably want to assess absolute agreement.

**Run the analysis in SPSS.**

- Analyze > Scale > Reliability Analysis.
- Select Statistics.
- Check “Intraclass correlation coefficient”.
- Make choices as you decided above.
- Click Continue.
- Click OK.
- Interpret output.
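The decision steps in this summary can also be written as a tiny helper. This is only a sketch that encodes the rules of thumb above: the function name `choose_icc` and its parameter names are hypothetical, and it does not run anything in SPSS.

```python
def choose_icc(same_raters_for_all, raters_are_population, use_mean, n_raters):
    """Map the summary's questions onto an ICC label and the SPSS choices."""
    if not same_raters_for_all:
        number, model = 1, "One-Way Random"
    elif raters_are_population:
        number, model = 3, "Two-Way Mixed"
    else:
        number, model = 2, "Two-Way Random"

    k = n_raters if use_mean else 1
    return {
        "label": f"ICC({number},{k})",
        "spss_model": model,
        "spss_row": "Average Measures" if use_mean else "Single Measure",
        # Rule of thumb from the summary: means feeding later analyses -> consistency,
        # reliability of individual scores -> absolute agreement.
        "spss_type": "Consistency" if use_mean else "Absolute Agreement",
    }


# The same 8 research assistants (a sample of raters) rate every case,
# and the mean rating is used in later analyses.
choice = choose_icc(True, False, True, 8)
print(choice["label"], "|", choice["spss_model"], "|", choice["spss_row"])
```

Treat the `spss_type` entry as a default rather than a rule; as discussed above, which type you want depends on what you plan to do with the scores.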

1. Shrout, P., & Fleiss, J. (1979). Intraclass correlations: Uses in assessing rater reliability. *Psychological Bulletin, 86*(2), 420-428. DOI: 10.1037/0033-2909.86.2.420


Thank you, Dr. Landers. Very informative. Just what I was looking for in my research on ICC and inter-rater reliability.

Glad to help! This has actually become one of my most popular posts, so I think there really was a need here. Just be sure to note that this terminology is the Shrout and Fleiss terminology – for example, some researchers refer to ICC(1,k) as ICC(2), especially in the aggregation/multilevel models literature.

Dr Landers,

This was very helpful. I just have one question for you. In my research I am averaging my raters to develop a new variable; therefore is it correct that the Average Measures ICC is the coefficient I am most interested in?

Yes, if you’re using the average of the ratings you collected in further analyses, the reliability of that variable will be the “average measures” (,k) version.

Thank you so much. I had looked at other sources and still had some doubts. Your explanations are so clear. You really know how to explain to non-statisticians.

Congratulations for your didactic ability.

Sofía

In my case, I have only two raters (a sample) rating many individuals each (around 50), and rating them according to different criteria (for example: accuracy, speed, etc).

According to your explanations, I should apply 2-way random ANOVA with absolute agreement, since I am interested not only in consistency, but also in the reliability of a single rater.

My doubt is if I am able to apply this ICC, or as I only have two raters, if it would be preferable to apply a Pearson correlation.

Thank you very much.

Best wishes,

Sofía

Hello again,

Actually I am trying to validate an assessment rating scale, which evaluates different criteria by observing a subject.

Two observers then rated 50 subjects with this scale.

In order to validate the scale, I should validate each one of its questions (criteria).

I should do an ICC for each one of my criteria, shouldn’t I?

I guess if Sig < 0.05 for a criterion and if I obtain a high intraclass correlation (what should be considered high? greater than 0.75?) I can deduce that that particular criterion is consistent and reliable, even if I only had two raters?

Am I right?

Thank you very much.

Sofía

ICC only assesses reliability; this is a distinct concept from validation. For a measure to be valid, it must also be reliable, but reliability alone does not guarantee validity. So computing ICC should be only part of a larger validation strategy.

To assess ICC when looking at a scale, compute the scale mean for each rater separately first. Then compute ICC. If you are using the same 2 raters for every person, and you want to assess the reliability of a single rater, you are correct: you need ICC(2,1), which is 2-way Random with Absolute Agreement.

I would not worry about statistical significance testing in this context. A statistically significant ICC just tells you that it’s unlikely that your ICC was drawn from a population where the true ICC was 0. That’s not a very interesting question for what you’re doing. Instead, just see how high ICC is.

The general recommendation by Cohen was that reliability be above 0.7, which means 70% of observed variance is “real” variance. But it really depends on what you’re willing to accept, or what the research literature suggests is typical/necessary. Any non-perfect reliability will decrease the validity of your measure, so it’s mostly a matter of how large you are willing that effect to be.

Thank you very much for your reply.

Then, I understand that I can do ICC with only two raters (they are the same two raters for every person) to test the reliability of my scale.

However, I do not understand why should I do the mean… do you mean:

a) the mean of the ratings of all questions for each person and for each rater

b) the mean of the ratings of a single question given by one rater to all the people

I guess you meant option a), because otherwise, for option b), if my scale has 13 questions to rate, I should first of all calculate the mean of the answers of rater1 for all those 13 questions, then the mean of rater2. If I do this I would have 13 values for each of the raters, and then, doing the ICC with just two measures (the mean of each rater) I guess it does not make sense?

However, even if you mean option a), wouldn’t that give a reliability of the whole test (composed of 13 questions) but not of every single question?

Can’t I do:

1) the ICC for each question calculating the ICC comparing all the ratings given by each rater to all the people for that particular question.

2) the mean, according to option a) to calculate the ICC of the whole scale.

Thank you very much for your time and for your fast response.

Best wishes

Sorry – I should have been clearer. By “scale,” I assumed you meant you have multiple items (questions) assessing the same construct.

You should assess reliability for each distinct measurement concept. So if you had 13 questions that all assessed “happiness,” you should compute ICC on the mean of those items. If you have 13 questions that all assess different things, you should compute 13 ICCs. But I will warn you that ICC tends to be low if you are assessing any psychological constructs with only single items each.

Thank you very much for your reply. Now it is clear.

On the other hand, do you know about any site where I can find a more detailed explanation about how to validate a scale? (not only reliability, but also validity).

Best wishes,

Sofia

Thank you Dr. Landers, this was very helpful!

May I have one question, just to make it clear for myself.

If I have a 15-item interview (with a scale 1-2-3) and 10 raters (population, and always the same 10 raters) rate 5 patients. I am interested in the average reliability of the raters and also I would like to know the raters’ individual “performance”. This would be a “Two way mixed” in SPSS.

Should I then create a database for each item of the interview (that’d be 15 databases) and run “Two way mixed” and “Absolute agreement”, right? And then computing the mean of the results? Or can I do that I create a database for each patient with 15 rows (items) and 10 columns (raters) and run the Two way mixed?

I guess I am a little bit confused what does “add ,k to the ICC” or “add ,1” mean?

Thank you very much for your help!

Best regards

Mark

@Sofia – Validation is a much, much more complicated concept than reliability. I don’t know of any websites that explain it very clearly, although there are entire textbooks devoted to it.

@Mark – Are the 15 questions assessing the same thing? If so, compute a scale mean and then run your statistics on that. You would only look at reliability by item if you were going to use those 15 items individually later. If that’s what you’re going to do, yes, you would need a separate SAV file for each item.

,k means you are referring to an ICC of the average ratings. ,1 means you are referring to an ICC of an individual rater. You cannot use ICC to identify the “performance” of any particular rater; instead, all you can tell is the reliability of raters on average or the reliability of a single rater taken at random from your population.

Thank you very much, Richard. Your comments have been of great help. Do you know if there is any statistical forum where I can ask questions online?

Best wishes,

Sofía

Validation isn’t so much statistics as it is a family of research methods. But sure – one statistics forum you might find useful is http://stats.stackexchange.com/

Dear Dr. Landers,

thank you for your answer!

The 15 items measure the same psychological construct (borderline disorder), but not the same thing, since the items are about different symptoms of the disorder – I think this is what you asked. So, later in the research these items will be used (asked) separately and will be evaluated by the raters based on the interview.

So, if I get it right, in this case I’ll need separate SAV files for each item and then compute a mean and that’ll be the overall ICC of the raters.

Can I use ICC for binary data (e.g. 1=borderline, 2=not borderline)? Because we would like to not just compute ICC for the individual items and the overall interview, but we’d like to compute ICC for the final diagnosis (borderline/not borderline).

…and thank you, now I understand what ‘single measure’ means!

Regards, and thank you very much for your kind answers!

Mark Berdi

@Mark – if you are interested in assessing the reliability of symptom ratings, then you need ICCs for each item. If you are interested in assessing inter-rater reliability in general (i.e. the reliability relevant to any future statistics computed using a “borderline score”), you’ll want to compute the scale means for each rater and compute ICC for those means. You should not take an average of ICCs, as that’s not an interpretable value – if you’re interested in the ICC of overall scales, that’s what you should compute the ICC on.

For binary data, you could use ICC, but it is not recommended – I would look into Fleiss’ kappa.

Thank you!

I understood it. I just computed what I needed.

Best regards

Mark Berdi

Hello Dr. Landers,

First, thank you for an excellent resource. We are following your method to conduct interrater reliability. In reference to question 2: sample v. population of raters, what criteria do you use to determine the response?

Additionally, we have found an error will result when raters are in perfect agreement (e.g., all 3 raters assign a score of 2 for a given item). Is this due to the lack of variance and inability to proceed with additional calculations?

Any advice or direction is welcomed.

Sincerely,

Emily and Laura

@Emily & Laura – It just depends on what you want to generalize to. If your raters are the only raters that you ever want to worry about, you have a population. If they are a random sample of such raters, you have a sample. For example, if you have three people watch videos and make ratings from them, the three people you have watching the videos are only three possible raters – you could just as easily have chosen another three drawn from the same population. Therefore, they are a sample of raters.

As for your second question, if your raters are always in perfect agreement, you have no need for ICC. Your reliability is 100% (1.0), so there is nothing to assess.

Thank you for this helpful website. I just want to be clear about question 1. If I have 100 fifteen second clips of children misbehaving and I have a single rater (Sally) rate from 1-7 how bad the misbehavior is for each clip and I have Sam give the same misbehavior ratings to 33 of those clips, it sounds like I have answered “yes” to question 1. Is that right?

And if there is any deviation from this (e.g. I have Bill rate another 33 clips that Sally rated), I answer “no”. Is that also correct?

@Camilo – I’m not sure that I’m clear on your premise, but I will take a stab at it.

If Sally and Sam rate identical clips, then yes – you have “yes” to Q1. However, if you have 2 ratings for 33 clips and only 1 rating for 67 clips, you can only calculate ICC for those 33 clips. If you want to generalize to all the clips (e.g. if you were using all 100 clips in later analyses), you’d need to use the “Single Measure” version of ICC, since you only have 1 rater consistently across all ratees.

If you had Bill rate an additional 33 clips, you’d still have 2 ratings for 66 clips and 1 rating for 34 clips, so nothing has changed, procedurally. However, because you have a larger sample, you’d expect your ICC to be more accurate (smaller confidence interval).

The only way to use the “Average Measures” version of ICC is to have both raters rate all clips (two sets of 100 ratings).

Hi Dr Landers,

This is by far the most helpful post I have read.

I still have a question or two though..

I am looking at the ICC for 4 raters (sample) who have all rated 40 cases.

I therefore think I need the random mixed effects model (ICC, 2,k)

In SPSS I get a value for single measures and average measures and I am not sure which I want. My assumption is average measures?

Also i have seen variations in how ICC values are reported and I wondered if you knew the standard APA format, my guess would be [ICC(2, k) = ****].

Any guidance would be very much appreciated.

Many thanks, Anne-Marie.

@Anne-Marie – the “k” indicates that you want Average Measures. Single Measure would be ,1. If you’re using their mean rating in later analyses, you definitely want ,k/Average Measures.

As for reporting, there is no standard way because there are several different sources of information on ICC, and different sources label the different types slightly differently (e.g. in the multilevel modeling literature, ICC(1,1) and ICC(1,k) are sometimes referred to as ICC(1) and ICC(2), respectively, which can be very confusing!).

If you’re using the framework I discuss above, I’d recommend citing the Shrout & Fleiss article, and then reporting it as: ICC(2,4) = ##.

Dr Landers,

Thank you so much!

All calculated and all looking good!

Anne-Marie.

Very useful and easy to understand. I am currently completing my dissertation for my undergraduate physiotherapy degree and stats is all very new. This explained it really easily. Thanks again!

I am training a group of 10 coders, and want to assess whether they are reliable using training data. During training, all 10 code each case, but for the final project, 2 coders will code each case and I will use the average of their scores. So for the final project, I will calculate ICC (1, 2), correct? Then what should I do during training–calculating ICC (1, 10) on the training cases will give me an inflated reliability score, since for the real project it will be only 2 coders, not 10?

@Catherine – It’s important to remember that reliability is situation-specific. So if you’re not using the training codes for any later purpose, you don’t really need to compute their reliability. You would indeed just use ICC(1,2) for your final reliability estimate on your final data. However, if you wanted to get your “best guess” now as to what reliability will be for your final coding, using your training sample to estimate it, you could compute ICC(1,1) on your training sample, then use the Spearman-Brown prophecy formula to identify what the reliability is likely to be with 2 raters. But once you have the actual data in hand, that estimate is useless.

Hello,

I have a small question that builds on the scenario described by Emily & Laura (posted question and answer on March 26, 2012 in regards to the error that results when raters are in perfect agreement).

In my case, only a portion of my 21 raters are in perfect agreement, and so SPSS excludes them from the reliability analyses, resulting in (what I would think to be) an artificially low ICC value, given that only ‘non-agreeing’ raters are being included. Is there a way to deal with this?

Many thanks,

Stephanie

@Stephanie – My immediate reaction would be that ICC may not be appropriate given your data. It depends a bit on how many cases you are looking at, and what your data actually looks like. My first guess is that your data is not really quantitative (interval or ratio level measurement). If it isn’t, you should be using some variant of kappa instead of ICC. If it is, but you still have ridiculously high but not perfect agreement, you might simply report percentage agreement.

Dear Dr. Landers,

You are indeed correct – my data are ordered categorical ratings. 21 raters (psychiatrists) scored a violence risk assessment scheme (comprised of 20 risk factors that may be scored as absent, partially present, or present – so a 1,2,3 scale) for 3 hypothetical patients. So I am trying to calculate the reliability across raters for each of the 3 cases.

My inclination was to use weighted kappa but I was under the impression it was only applicable for designs with 2 raters?

Thanks again,

Stephanie

Ahh, yes – that’s the problem then. ICC is not really designed for situations where absolute agreement is common – if you have interval level measurement, the chance that all raters would score exactly “4.1” (for example) is quite low.

That restriction is true for Cohen’s kappa and its closest variants – I recommend you look into Fleiss’ kappa, which can handle more than 2 raters, and does not assume consistency of raters between ratings (i.e. you don’t need the same 3 raters every time).
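For readers who want to see what Fleiss’ kappa actually computes, here is a minimal Python sketch of the standard formula. The function name `fleiss_kappa` and the input layout (one row of category counts per rated subject) are my choices for illustration, and the counts below are made up. Keep in mind that Fleiss’ kappa treats the categories as nominal, so it ignores the ordering of an absent/partially present/present scale.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = number of raters placing subject i in category j."""
    n_subjects = len(counts)
    n_ratings = sum(counts[0])
    if any(sum(row) != n_ratings for row in counts):
        raise ValueError("every subject needs the same number of ratings")

    # Observed agreement: proportion of agreeing rater pairs, averaged over subjects.
    p_i = [
        (sum(c * c for c in row) - n_ratings) / (n_ratings * (n_ratings - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects

    # Chance agreement from the overall share of ratings in each category.
    n_categories = len(counts[0])
    p_j = [
        sum(row[j] for row in counts) / (n_subjects * n_ratings)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_j)

    if p_e == 1:
        # Every rating fell in one category: there is no variance to work with.
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)


# Made-up example: 4 subjects, 3 ratings each, 2 categories.
split_counts = [[2, 1], [1, 2], [2, 1], [1, 2]]
print(round(fleiss_kappa(split_counts), 3))
```

The `nan` branch mirrors the situation raised earlier in the comments: when all raters give the identical score to everything, there is no variance left for a chance-corrected index to work with.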

Thank you

Can you explain to me the importance of the upper and lower bounds (95% CI)? How can I “take advantage of” the CI?

That’s not really a question related to ICC. If you are looking for general help on confidence intervals, this page might help: http://stattrek.com/estimation/confidence-interval.aspx

Thank you for this clear explanation of ICCs. I was wondering if you might know about them in relation to estimate intrarater reliability as well? It seems as if the kappa may provide a biased estimate for ordinal data and the ICC may be a better choice. Specifically, I’m interested in the intrarater reliability of 4 raters’ rating of an ordinal 4 point clinical scale that evaluates kidney disease, each rater rated each patient 2 times.

I’m using SAS and can’t seem to get the ICC macro I used for interrater reliability to work. I’m wondering if the data need to be structured with raters as variables (as you say above)? If so, do you know if I would include the two measurements for each patient in a single observation or if I should make two observations per patient?

Thank you-

Lisa

I don’t use SAS (only R and SPSS), so I’m afraid I don’t know how you’d do it there. But I don’t think I’d do what you’re suggesting in general unless I had reason to believe that patients would not change at all over time. If there is true temporal variance (i.e. if scores change over time), reliability computed with a temporal dimension won’t be meaningful. ICC is designed to assess inter-rater consistency in assessing ratees controlling for intra-rater effects. If you wanted to know the opposite (intra-rater consistency in assessing ratees controlling for inter-rater effects), I suppose you could just flip the matrix? I am honestly not sure, as I’ve never needed to do that before.

Thanks for your thoughts.

Hello Richard,

Thanks a lot for your post. I have one question: I built an on-line survey that was answered by 200 participants. I want to know if the participants did not show much variance in their answers so to know that the ratings are reliable. For that matter, I am running a One-way random effects model and then I am looking at the average measures.

That’s the way I put it:

Inter-rater reliability (one-way random effects model of ICC) was computed using SPSS (v.17.0). One-way random effects model was used instead of Two-way random effects model because the judges are conceived as being a random selection of possible judges, who rate all targets of interest. The average measures means of the ICC was 0.925 (p= 0.000) which indicates a high inter-rater reliability therefore reassuring the validity of these results (low variance in the answers within the 200 participants).

I would be happy if you could tell me whether a one-way random effects model of ICC and then looking at the average measures is the way to go.

Thank you,

John

@John – Your study is confusing to me. Did you have all 200 participants rate the same subject, and now you are trying to determine the reliability of the mean of those 200 participants? If so, then you have done the correct procedure, although you incorrectly label it “validity.” However, I suspect that is not what you actually wanted to do, since this implies that your sample is n = 1. If you are trying to assess variance of a single variable across 200 cases, you should just calculate the variance and report that number. You could also compute margin of error, if you want to give a sense of the precision of your estimate. If you just want to show “variance is low,” there is no inferential test to do that, at least in classical test theory.

Thank you for your answer, Richard.

I built an on-line survey with a 100 items.

Subjects used a 5 point Likert scale to indicate how easy it was to draw those items.

200 subjects answered the on-line survey.

I want to prove the validity of the results. If the ICC cannot be used for that matter. Can I use the Cronbach’s alpha?

Thanks for your time.

Best,

John

@John – Well, “reliability” and “validity” are very different concepts. Reliability is a mathematical concept related to consistency of measurement. Validity is a theoretical concept related to how well your survey items represent what you intend them to represent.

If you are trying to produce reliability evidence, and you want to indicate that all 100 of your items measure the same construct, then you can compute Cronbach’s alpha on those items. If you have subsets of your scale that measure the same construct (at least 2 items each), you can compute Cronbach’s alpha for each. If you just asked 100 different questions to assess 100 different concepts, alpha isn’t appropriate.
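For readers who want to see what coefficient alpha is actually computing, here is a minimal sketch in Python with NumPy (both are assumptions of this example; the discussion above is about SPSS). The function name `cronbach_alpha` and the toy data are hypothetical. Rows are respondents and columns are items that are supposed to measure the same construct.

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha for a respondents-by-items score matrix."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    if n_items < 2:
        raise ValueError("alpha needs at least 2 items")
    item_variances = scores.var(axis=0, ddof=1)      # sample variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # sample variance of the scale total
    return (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical data: 5 respondents answering 4 items on a 1-5 scale.
toy_scores = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
]
print(round(cronbach_alpha(toy_scores), 3))
```

If the scale has subsets of items measuring different constructs, the same function would be applied to each subset's columns separately.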

If your 100 items are all rating targets theoretically randomly sampled from all possible rating targets, and you had all 200 subjects rate each one, you could calculate ICC(2). But the measure count depends on what you wanted to do with those numbers. If you wanted to compute the mean for each of the 100 rating targets and use that in other analyses, you’d want ICC(2, 200). If you just wanted to conclude “if one person rates one of these, how reliable will that person’s ratings be?”, then you want ICC(2, 1).
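The single-measure versus average-measure distinction can be sketched in code. This is a rough Python/NumPy illustration of the two-way random, absolute agreement formulas from Shrout and Fleiss (1979), not what SPSS runs internally. The function name `icc2` is hypothetical, and the 6 x 4 example matrix is the small worked example usually attributed to that paper. Rows are rating targets and columns are raters.

```python
import numpy as np

def icc2(ratings):
    """Return (ICC(2,1), ICC(2,k)) for a complete targets-by-raters matrix.

    Two-way random effects, absolute agreement.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape                      # n targets, k raters
    grand = x.mean()
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)   # between targets
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)   # between raters
    ss_error = np.sum((x - grand) ** 2) - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    single = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    average = (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
    return single, average

# Example matrix: 6 targets rated by 4 judges.
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
single, average = icc2(example)
print(round(single, 2), round(average, 2))
```

The average-measure value is always at least as large as the single-measure value when reliability is positive, which is why it matters which one gets reported.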

Thank you for your answer, Richard.

I will use Cronbach’s alpha for internal consistency. That way I can say that my survey measured the same general construct: the imageability of the words.

I will use ICC (2,1)(two-way random, average measures) for inter-rater reliability. That way I can say if the subjects answered similarly.

I am interested in the mean of each item to enter the data in a regression analysis (stepwise method). I am interested then in saying that a word like ‘apple’ has a mean imageability of 2 out of 5. I am not interested in the mean of the answers of each subject. That is, subject 1 answered 3 out of 5 all the time.

After running those two statistics, is it OK to talk about validity?

Thanks again.

Best,

John

You can certainly *talk* about validity as much as you want. But the evidence that you are presenting here doesn’t really speak to validity. Statistically, the only thing you can use for this purpose is convergent or discriminant validity evidence, which you don’t seem to have. There are also many conceptual aspects of validity that can’t be readily measured. For example, if you were interested in the “imageability” of the words, I could argue that you aren’t really capturing anything about the words, but rather about the people you happened to survey. I could argue that the 200 pictures you chose are not representative of all possible pictures, so you have biased your findings toward aspects of those 200 pictures (and not a general property of all pictures). I could argue that there are cultural and language differences that change how people interpret pictures, so imageability is not a meaningful concept anyway, unless it is part of a model that controls for culture. I could argue that color is really the defining component, and because you didn’t measure perceptions of color, your imageability measure is meaningless. I suggest reading Shadish, Cook, and Campbell (2002) for a review of validation.

But one key is to remember: a measure can be reliable and valid, reliable and invalid, or unreliable and invalid. Reliability does not ensure validity; it is only a necessary prerequisite.

Thank you for the note on validity, Richard.

Please let me know if the use I give to Cronbach’s alpha for internal consistency and the ICC (2,1)(two-way random, average measures) for inter-rater reliability is reasonable. Those are eerie statistics for me, hence, it is important for me to know your take on it.

About the imageability ratings. We gave 100 words to 200 subjects. The subjects told us in a 5 point scale how did they think the words can be put on a picture.

Thanks again.

Best,

John

ICC(2,1) is two-way random, single measure. ICC(2,200) would be your two-way random, average measures. Everything you list is potentially a reasonable thing to do, but it really depends on what you want to ultimately do with resulting values, in the context of your overall research design and specific research questions. I would at this point recommend you bring a local methods specialist onto your project – they will be able to help you much more than I can from afar.

Thank you for your answer, Richard.

I will try to ask a local methods specialist. So far, the people around me are pretty rusty when it comes to helping me out with all this business.

Thanks again.

Best,

John

Hello Richard,

Thank you for the wonderful post, it helped very much!

I have a question as well, if you can still answer, since it’s been a while since anyone posted here.

I have a survey with 28 items, rated 1-7, by 6 people. I need to know how much agreement there is among the 6 on this rating and, if possible, which items reach the highest rating (the ones that people agree most on). It’s a survey with values (single-word items) and I have to build an aggregate from these 6 people’s ratings, if they agree on the items, so I can later compare it with the ratings of a larger dataset of 100 people. Let us say that the 6 raters must have high reliability because they are the ones who optimally represent the construct for this sample of 105 (people including the raters).

Basically: 1. I need to build an aggregate of 6 people’s ratings (that’s why I need to calculate the ICC), then compare the aggregate of the rest of the sample with this “prototype” to see if they correlate.

2. Determine the items with largest agreement.

What I don’t understand is whether I should use the items as cases and run an ICC analysis, two-way mixed, and then look at the single or the average measure?

Hope this isn’t very confusing and you can help me with some advice.

Thank you,

Andra

@Andra – It really depends on which analysis you’re doing afterward. It sounds like you’re comparing a population of 6 to a sample of 99. If you’re comparing means of those two groups (i.e. you are using the means in subsequent analyses), you need the reliability of the average rating.

Dear Richard,

Thank you for your answer. I think though that I’ve omitted something that isn’t quite clear to me: the data set in SPSS will be transposed, meaning I will have my raters on columns and my variables on rows? Also in the post you say to make a dataset for every variable but in this case my variables are my “cases” so I assume I will have only a database with 9 people and 28 variables on rows and compute the ICC from this?

I have yet another question, hope you can help me. What I have to do here is to make a profile similarity index as a measure of fit between some firm’s values as rated by the 9 people and the personal values of the rest of the people, measured on the same variables. From what I understand this can be done through score differences or correlations between profiles. Does this mean that I will have to subtract the answers of every individual from the average firm score or that I’ll have to correlate each participant’s scores with that of the firm and then have a new variable that would be a measure of the correlation between the profiles? Is there any formula that I can use to do this? I have the formula for difference scores and correlation but unfortunately my quite moderate mathematical background doesn’t help ..

I sincerely hope that this isn’t far beyond the scope of your post and you can provide your expertise for this!

Appreciatively,

Andra

@Andra – It does not sound like you have raters at all. ICC is used in situations like this: 6 people create scores for 10 targets across 6 variables. You then compute 6 ICCs to assess how well the 6 people created scores for each of the 10 targets. It sounds like you may just want a standard deviation (i.e. spread of scores on a single variable), which has nothing to do with reliability. But it is difficult to tell from your post alone.

For the rest of your question, it sounds like you need a more complete statistical consult than I can really provide here. It depends on the specific index you want, the purpose you want to apply it toward, further uses of those numbers, etc. There are many statisticians with reasonable consulting rates; good luck!

Okay Richard, I understand, then I can’t use the ICC in this case. Thought I could because at least one author in a similar paper used it … something like this (same questionnaire, different scoring, theirs had Q sort, mine is Likert):

“As such, the firm informant’s Q Sorts were averaged, item by item, for each [] dimension representing each firm. Using James’ (1982) formula, intra-class correlation coefficients (ICCs) were calculated for ascertaining the member consensus on each dimension. ICC was calculated by using the following formula[]”

Maybe the rWG would be better?

Anyhow, thank you very much for trying to answer, I will re-search the online better for some answers

All the best,

Andra

Hi Richard,

I just wanted to check something.

If I have 28 cases, and 2 raters rating each case, from a wider population of raters, I think I need…

Two-way Random effects model

ICC 2, 2. for absolute agreement.

When I get the output, do I then report the average measures ICC value, rather than the single measures, as I want to identify the difference between raters?

Many thanks, Anne-Marie.

@Anne-Marie – Reliability doesn’t exactly assess “the difference between raters.” You are instead capturing the proportion of “true” variance in observed score variance. If you intend to use your mean in some later analysis, you are correct. If you are just trying to figure out how consistently raters make their ratings, you probably want ICC(2,1).

Hi Richard, I want to calculate ICC(1) and ICC(2) for the purpose of aggregating individual scores into group level scores. Would the process be similar to what you described for inter-rater reliability?

I would imagine that in that case the group would be equivalent to the rater. That is, instead of looking within rater versus between rater variance, I want within group versus between group variance? Therefore would I just put the grouping IDs along the columns (rather than rater) and the score for each group member along the row?

However, I have varying group sizes, so some groups have 2 members and some have as many as 20. It seems like that could get quite messy… Maybe there is some other procedure that is more appropriate for this?

Dear Richard

thank you, this makes things much clearer concerning how to choose the appropriate ICC.

Can I ask if you have any paper examples on how to report ICCs?

Thank you for your time

Best wishes

Laura

Dear Dr. Landers,

Thank you very much for the information on this forum!

Unfortunately, after studying all the responses on this forum I still have a question:

In total I have 292 cases from which 13 items were rated on a 4-point rating scale. Coder 1 has rated 144 cases, Coder 2 has rated 138 cases and 10 cases were treated as training cases. Three items were combined in a composite score and this score will be used in future studies. So in future studies I want to report this composite score on all 292 cases.

The question now is of course: “What is the intercoder reliability between Coder 1 and Coder 2 on the composite score?” To assess reliability between the coders I want to compare the composite score of both Coders on 35 cases (these were scored by both Coder 1 and Coder 2).

I think I have to use the ICC (2) and measure absolute agreement. Is this correct?

My doubt is if I have to look at the ‘single measure’ or ‘average measure’ in SPSS?

I hope you can help me. Many thanks in advance.

Janneke

@Gabriel – You can use ICC as part of the procedure to assess if aggregation is a good idea, but you can’t do this if group sizes vary substantially. I believe in this case, you want to use within-groups r, but I’m not sure – I don’t do much research of this type.

@Laura – Any paper that reports an ICC in your field/target journal would provide a great example. This varies a bit by field though – even ANOVA is presented differently across fields.

@Janneke – You should use “absolute agreement” and “single measure.” Single measure because you’re using ratings made by only one person in future analyses. If you had both raters rate all of them, and then used the mean, you would use “average measure.”

Dear Dr. Landers,

Thank you very much for your quick and clear answer.

Best regards, Janneke

Dear Dr. Landers,

Thank you for such a wonderful article. I am sure it is very beneficial to many people out there who are struggling.

I am trying to study the test-retest reliability of my questionnaire (HIV/AIDS KAP survey).

The exact same set of respondents completed the same questionnaire after a 3-week interval. Though anonymous, for the same respondent, I can link both tests at time1 and time2.

After a lot of research, the literature seems to suggest Kappa coefficients for categorical variables and ICC for continuous variables (please comment if I get that wrong).

However, I am still uncertain regarding the model of ICC in SPSS that I should use for my study.

The bottom line conclusion that I would like to make is that the questionnaire is reliable with acceptable ICC and Kappa coefficients.

I would really appreciate your suggestion.

Thank you very very much.

Teeranee

@Teeranee – Neither ICC nor Kappa are necessarily appropriate to assess test-retest reliability. These are both more typically estimates of inter-rater reliability. If you don’t have raters and ratees, you don’t need either. If you have a multi-item questionnaire, and you don’t expect scores to vary at the construct level between time points, I would suggest the coefficient of equivalence and stability (CES). If you’re only interested in test re-test reliability (e.g. if you are using a single-item measure), I’d suggest just using a Pearson’s correlation. And if you are having two and only two raters make all ratings of targets, you would probably use ICC(2).

Dr. Landers,

Thank you very much for your reply.

I will look into CES as you have suggested.

Thank you.

Dear Dr. Landers,

Many thanks for your summary, it is very, very helpful. I wanted to ask you one more question, trying to apply the information to my own research. I have 100 tapes that need to be rated by 5 or 10 raters and I am trying to set pairs of raters such that each rater codes as few tapes as possible but I am still able to calculate ICC. The purpose of this is to calculate inter-rater reliability for a newly developed scale.

Thank you very much.

@Violeta – I’m not sure what your question is. Reliability is a property of the measure-by-rater interaction, not the measure. For example, reliability of a personality scale is defined by both the item content and the particular sample you choose to assess it (while there is a “population” reliability for a particular scale given a particular assessee population, there’s no way to assess that with a single study or even a handful of studies). Scales themselves don’t have reliability (i.e. “What is the reliability of this scale?” is an invalid question; instead, you want to know “What was the reliability of this scale as it was used in this study?”). But if I were in your situation (insofar as I understand it), I would probably assign all 10 raters to do a handful of tapes (maybe 20), then calculate ICC(1,k) given that group, then use the Spearman-Brown prophecy formula to determine the minimum number of raters you want for acceptable reliability of your rater means. Then rate the remainder with that number of raters (randomly selected for each tape from the larger group). Of course, if they were making ratings on several dimensions, this would be much more complicated.
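The Spearman-Brown step in that plan can be sketched as follows. This is a minimal Python illustration, not a prescribed procedure: the function names and the pilot value of .90 are hypothetical, and the formula assumes the raters are interchangeable (roughly parallel). It first steps the pilot ICC(1,k) down to a single-rater reliability, then solves for the number of raters whose mean would reach a target reliability.

```python
import math

def spearman_brown(reliability, factor):
    """Predicted reliability when the number of raters is multiplied by `factor`."""
    return factor * reliability / (1 + (factor - 1) * reliability)

def raters_needed(single_rater_reliability, target_reliability):
    """Smallest whole number of raters whose mean reaches the target reliability."""
    r, t = single_rater_reliability, target_reliability
    k = t * (1 - r) / (r * (1 - t))
    return math.ceil(k - 1e-9)  # small tolerance so an exact integer is not rounded up

# Hypothetical pilot: all 10 raters coded the same tapes and ICC(1,10) came out at .90.
pilot_icc_average = 0.90
single = spearman_brown(pilot_icc_average, 1 / 10)  # step down to one rater
print(round(single, 3), raters_needed(single, 0.80))
```

With these made-up numbers the single-rater estimate is a little under .5, so several raters per tape would be needed before the mean rating reaches .80.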

Dear Dr. Landers,

Thank you very much for your clarifications and for extracting the question from my confused message. Thank you very much for the solution you mentioned, it makes a lot of sense, and the main impediment to applying it in my project is practical – raters cannot rate more than 20 tapes each for financial and time reasons. I was thinking of the following 2 scenarios: 1) create 10 pairs of raters, by randomly pairing the 10 raters such that each rater is paired with 2 others; divide the tapes in blocks of 10 and have each pair of raters code a block of 10 tapes; in the end I would calculate ICC for each of the 10 pairs of raters, using ICC(1) (e.g., ICC for raters 1 and 2 based on 10 tapes; ICC for raters 1 and 7 based on 10 tapes, etc.); OR, 2) create 5 pairs of raters, by randomly pairing the 5 raters such that each rater is paired with one other rater; divide the tapes in blocks of 20 and have each pair of raters code a block of 20 tapes; I would end up with 5 ICC(1) (e.g., ICC for raters 1 and 5 based on 20 tapes, etc.). Please let me know if these procedures make sense and if yes, which one is preferable. Thank you very much for your time and patience with this, it is my first time using this procedure and I am struggling a bit with it. I am very grateful for your help.

You need ICC(1) no matter what with different raters across tapes. It sounds like you only want 2 raters no matter what reliability they will produce, in which case you will be eventually computing ICC(1,2), assuming you want to use the mean rating (from the two raters) in subsequent analyses. If you want to assess how well someone else would be able to use this scale, you want ICC(1,1).
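To make ICC(1,1) versus ICC(1,2) concrete, here is a rough Python/NumPy sketch of the one-way random effects formulas from Shrout and Fleiss (1979). The function name `icc1` and the tape data are hypothetical. Because different raters score each tape, the column order carries no meaning; each row simply holds the ratings that one tape received.

```python
import numpy as np

def icc1(ratings):
    """Return (ICC(1,1), ICC(1,k)) for a targets-by-ratings matrix.

    One-way random effects: raters are not matched across targets.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape                        # n targets, k ratings per target
    row_means = x.mean(axis=1)
    ms_between = k * np.sum((row_means - x.mean()) ** 2) / (n - 1)
    ms_within = np.sum((x - row_means[:, None]) ** 2) / (n * (k - 1))
    single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    average = (ms_between - ms_within) / ms_between
    return single, average

# Hypothetical data: 6 tapes, each rated by a random pair of raters.
tape_ratings = [
    [4, 5],
    [2, 3],
    [5, 5],
    [1, 2],
    [3, 3],
    [4, 3],
]
icc_single, icc_mean_of_two = icc1(tape_ratings)
print(round(icc_single, 3), round(icc_mean_of_two, 3))
```

The first value speaks to how well one rater alone would do; the second is the reliability of the two-rater mean that would go into later analyses.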

As for your specific approaches, I would not recommend keeping consistent pairs of raters. If you have any rater-specific variance contaminating your ratings, that approach will maximize the negative effect. Instead, I’d use a random pair of raters for each tape. But I may not be understanding your approaches.

Also, note that you’ll only have a single ICC for each type of rating; you don’t compute ICC by rater. So if you are having each rater make 1 rating on each tape, you’ll only have 1 ICC for the entire dataset. If you are having each rater make 3 ratings, you’ll have 3 ICCs (one for each item rated). If you are having each rater make 3 ratings on each tape but then take the mean of those 3 ratings, you’ll still have 1 ICC (reliability of the scale mean).

Thank you very much, this is so helpful! I think that it is extraordinary that you take some of your time to help answering questions from people that you don’t know. Thank you for your generosity, I am very grateful!

Dear Richard,

I have two questions:

– Can you use the ICC if just one case (person) is rated by 8 raters on several factors (personality dimensions)? So not multiple cases but just one.

– Can you use ICC when the original items (which lead to a mean per dimension) are ordinal (rated on a 1-5 scale)? So can I treat the means of ordinal items as continuous or do I need another measure of interrater reliability?

I hope you can help me!

1) No; ICC (like all reliability estimates) tells you what proportion of the observed variance is “true” variance. ICC’s specific purpose is to “partial out” rater-contributed variance and ratee-contributed variance to determine the variance of the scale (given this measurement context). If you only have 1 case, there is no variance to explain.

2) ICC is for interval or higher level measurement only. However, in many fields in the social sciences, Likert-type scales are considered “good enough” to be treated as interval. So the answer to your question is “maybe” – I’d recommend consulting your research literature to see if this is common.

Dr. Landers,

In your response to Gabriel on 8/15/12, you indicated that ICC(1) is appropriate for ratings made by people arranged in teams when the teams vary in size (i.e., number of members). However, a reviewer is asking for ICC (2) for my project where my teams also vary in size. How would you respond?

Many kind thanks!

Looking back at that post, I actually don’t see where I recommended that – I was talking about ICC in general. As I mentioned then, the idea of aggregation in the teams literature is more complex than aggregation as used for coding. For an answer to your question, I’d suggest taking a look at:

Bliese, P. (2000). Within-group agreement, non-independence, and reliability. In K. Klein and S. Kozlowski (Eds), Multilevel theory, research, and methods in organizations. San Francisco: Jossey-Bass (pp. 349-381).

Hofmann, D. A. (2002). Issues in multilevel research: Theory development, measurement, and analysis. In S. Rogelberg (Ed.), Handbook of Research Methods in Industrial and Organizational Psychology. Malden, MA: Blackwell.

Klein, K. J., Dansereau, F., and Hall, R. J. (1994). Levels issues in theory development, data collection, and analysis. Academy of Management Review, 19, 195-229.

Thank you! Bliese (2000) was particularly helpful.

Dr. Landers,

Regarding consistency across raters – I have an instrument with more items (30) and 100 persons rating the items. If I analyse the internal consistency of the scale in SPSS and choose the ICC two way random option, can I use the result for the ICC as indicating level of consistency on scale items across my 100 raters? Is that correct or are there other calculations I must perform? I need to know if the ratings of the items are consistent to form an aggregate.

Thank you, hopefully you can provide some advice!

I don’t think you’re using the word “rating” in the methodological sense that it is typically used related to ICC. By saying you have 100 persons rating the items, you imply that those are not the “subjects” of your analysis. Based on what you’re saying, you have 100 _subjects_, which are each assessed with 30 items (i.e. 3000 data points). If so, you probably want coefficient alpha (assuming your scale is interval or ratio level measurement).

Dr Landers,

Thank you for the answer. Perhaps I did not express the issue properly. The 30 items refer to an “individual” (in a manner of speaking; it is really a larger entity, an organization), so the respondents would be assessing someone else on the 30 items, not rating themselves. Could I use the ICC as mentioned, by choosing the option from reliability analysis in SPSS in this case, or do more calculations need to be performed?

Anxiously waiting for a response …

It depends. How many people are being assessed by the raters, and a related question, how many total data points do you have? Are the 100 raters scoring the 30 items for 2 others, 3 others, etc?

For example, it’s a very different problem if you have 100 raters making ratings on 50 subjects, with each subject receiving 2 sets of ratings on 30 items (a total of 3000 data points), versus 100 raters making ratings on 100 subjects, with each subject receiving 1 set of ratings on 30 items (also a total of 3000 data points).

If you have 100 raters making ratings on 100 subjects (1 set of 30 ratings each), you have perfectly confounded raters and subjects, and there is no way to determine the variance contributed by raters (i.e. ICC is not applicable). If you have more than 1 rating for each subject, yes – you can calculate ICC for each of your 30 items, or if it’s a single scale, you can calculate scale scores and compute a single ICC on that. If you want to treat the items on the scale as your raters, that’s also a valid reliability statistic, but it still doesn’t assess any inter-rater component.

Hi Dr. Landers..

I came across this useful information as I was searching about ICC.

Thank you for this information explained in a clear way.

I have a question, though.

Say I have 5 raters marking 20 sets of essays. First, each rater will mark the same 20 essays using an analytic rubric. Then, after an interval of 2 weeks, they will be given the same 20 essays to be marked with a different rubric, a holistic rubric. The cycle will then be repeated after 2 weeks (after the essays are collected from Phase 2), with the same 20 essays and the analytic rubric. Finally, after another 2 weeks, they will mark the same 20 essays with the holistic rubric. In other words, for each type of rubric, the process of rating/marking the essays will be repeated twice at an interval of 1 month.

Now, if I am going to look for inter rater reliability among these 5 raters and there are 2 different rubrics used, which category of ICC should I use? Do I need to calculate it differently for each rubric?

If I am looking for inter rater reliability, I don’t need to do the test-retest method & I can just ask the raters to mark the essays once for each type of rubric, and use ICC, am I right?

And can you help me on how I should compute intra rater reliability?

Thank you in advance for your response.

Regards,

Eme

It depends on how you are collecting ratings and what you’re going to do with them. I’m going to assume you’re planning on using the means on these rubrics at each of these stages for further analysis. In that case, for example:

1) If each rater is rating 4 essays (i.e. each essay is rated once), you cannot compute ICC.

2) If each rater is rating 8 essays (i.e. each essay is rated twice), you would compute ICC(1,2).

3) If each rater is rating 12 essays (i.e. each essay is rated thrice), you would compute ICC(1,3).

4) If each rater is rating 16 essays (i.e. each essay is rated four times), you would compute ICC(1,4).

5) If each rater is rating 20 essays (i.e. each essay is rated by all 5 raters) you would compute ICC(2,5)

You’ll need a separate ICC for each time point and for each scale (rubric, in your case).
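The five cases above follow a simple rule, which can be sketched in Python. This is only an illustration of the logic in this reply under its stated assumption (the per-essay mean is used in later analysis); the function name `icc_form` is hypothetical and it assumes every essay receives the same number of ratings.

```python
def icc_form(n_raters, n_essays, essays_per_rater):
    """Name the ICC form for a design where ratings are averaged per essay.

    Returns None when each essay is rated only once (no ICC can be computed).
    """
    total_ratings = n_raters * essays_per_rater
    if total_ratings % n_essays != 0:
        raise ValueError("design does not give every essay the same number of ratings")
    ratings_per_essay = total_ratings // n_essays
    if ratings_per_essay < 2:
        return None
    if ratings_per_essay == n_raters:
        # Fully crossed: every rater scores every essay.
        return f"ICC(2,{n_raters})"
    # Different raters across essays: one-way model, averaged over the ratings received.
    return f"ICC(1,{ratings_per_essay})"

for essays_per_rater in (4, 8, 12, 16, 20):
    print(essays_per_rater, icc_form(5, 20, essays_per_rater))
```

Whichever form applies would then be computed once per rubric and per time point.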

Intra-rater reliability can be assessed many ways, depending on what your data look like. If your rubric has multiple dimensions that are aggregated into a single scale score, you might use coefficient alpha. If each rubric produces a single score (for example, a value from 0 to 100), I’d suggest test re-test on a subset of essays identical (or meaningfully parallel) to these. Given what you’ve described, that would probably require more data collection.

Thank you, Dr Landers, I think it makes sense now. Basically the 30 items make a scale. My 100 raters rate one individual (the organization in which they work, which is the same for all of them) on these 30 items or the scale. So 100 people rate, each of them, every item in the scale once for a rated “entity”. Then, at least if I have understood it correctly, I can report the ICC for the scale and use it as argument for aggregation. Hope I make more sense now .. Please correct me if it is still not a proper interpretation.

It sounds like you have one person making one rating on one organization. You then have 100 cases each reflecting this arrangement. This still sounds like a situation better suited to coefficient alpha, since you are really assessing the internal consistency of the scale, although ICC(2,k) should give you a similar value. You cannot assess the reliability of the raters (i.e. inter-rater reliability) in this situation – it is literally impossible, since you don’t have repeated measurement between raters.

If by aggregation, you mean “can I meaningfully compute a scale mean among these 30 items?”, note that a high reliability coefficient in this context does not alone justify such aggregation. In fact, because you have 30 items, the risk of falsely concluding the presence of a single factor based upon a high reliability coefficient is especially high. See http://psychweb.psy.umt.edu/denis/datadecision/front/cortina_alpha.pdf for discussion of this concept.

Dr. Landers,

I am trying to determine reliability of portfolio evaluation with 2 raters examining the same portfolios with an approximate 40-item Likert (1-4) portfolio evaluation instrument.

Question 1: If I am looking for a power of .95, how can I compute the minimum number of portfolios that must be evaluated? What other information do I need?

Question 2: After perusing your blog, I think I need to use ICC (3) (Two-way mixed) because the same 2 raters will rate all of the portfolios. I think I am interested in the mean rating, and I think I need a measure of consistency. Am I on the right track?

I appreciate your time and willingness to help all of us who are lost in Statistics-World.

Cindy

For Q1, I assume you mean you want a power of .95 for a particular statistical test you plan to run on the mean of your two raters. This can vary a lot depending on the test you want to run and what your other variable(s) looks like. I suggest downloading G*Power and exploring what it asks for: http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/

For Q2, are your two raters the only two people that could ever conceivably rate these portfolios? If so, you are correct – ICC(3,2), consistency. If not (for example, if other people trained equally well as these two could do the same job), you probably want ICC(2,2), consistency. ICC(3) is very uncommon, so unless you have a compelling reason to define your two raters as a complete population of raters, I’d stick with ICC(2).

Hi,

This post was very helpful – thank you for putting the information together.

Any thoughts on sample size? If I would like to estimate interrater reliability for various items on a measure, any thoughts on how many cases I should have for each item?

For instance, if I want to know if social workers can provide a reliable measure of impulsivity (one-item Q on an assessment form with 3 ordinal response options) when first meeting a child, how many reports of impulsivity (N= how many children?) would I need for this to be an accurate determination of interrater reliability?

And is there a suggested sample size of social workers?

Thank you –

I think your conceptualization of reliability may not be quite right. Reliability is itself a function of the raters and the situation they are in – in other words, you don’t check “if” social workers can provide a reliable measure, but instead you have them make ratings and check to see how reliable _they were_. So sample size/power isn’t really an issue – if you have fewer ratings, they will be less reliable. If you have more ratings, they will be more reliable. A reliability question would be, “how many social workers do I need to reliably measure impulsivity given this set of children?” But if you’re interested in determining the reliability of a single social worker as accurately as possible (i.e. minimizing the width of the confidence interval around ICC) to determine how accurate a social worker drawn from your population of social workers might be when making a judgment on their own, that is a different question entirely.

Thank you for getting back to me so quickly.

Yes, you’re right. I wasn’t conceptualizing this accurately. And you’re also right in speculating that I want to know how reliable a single social worker would be when asked to make a judgement on their own, using the provided scale.

After looking more closely at previous posts, it seems I have a situation similar to Stephanie that will require me to use Fleiss’ kappa?

It depends on what your scale looks like. If they are making ratings on an interval- or ratio-level scale, you can (and should) still use ICC. Kappa is for categorical ratings. You would also need to use the Spearman-Brown prophecy formula to convert your kappa to reflect the reliability of a single rater, whereas with ICC, you can just look at ICC(#, 1).

Dr. Landers,

Although there have already been asked so many questions, I still have one. Hopefully you can help me out (because I am totally lost).

I have done a pretest with 36 advertisements. Every advertisement can be categorized in one of the 6 categories. 15 people have watched all 36 advertisements and placed each advertisement individually in one of the six categories (So for example: Ad1 – Cat. A; Ad2 – Cat D; Ad1 – Cat B, etc). They also have rated how much they liked the advertisement (from 1 – 5).

In my final survey I want to use one or two advertisements per category. But to find out which advertisement belongs in what category I have done the pretest (so which advertisement is best identified as the advertisement that belongs in that category) (I want to measure the effects of the different category advertisements on people). The advertisements with the highest ratings (that the raters placed the most often in that category) will be used.

My mentor says I have to use inter-rater reliability. But I feel that Cohen’s Kappa is not usable in this case (because of the 15 raters). But I have no idea how or which test I have to use.

Hopefully you understand my question and hopefully you can help me out!

Best regards,

Judith

This is not really an ICC question, but I might be able to help. Your mentor is correct; you need to assess the reliability of your ratings. 15 raters is indeed a problem for Cohen’s kappa because it limits you to two raters; you should use Fleiss’ kappa instead – at least for the categorical ratings. You could use ICC for the likability ratings.
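As a rough illustration of what Fleiss’ kappa computes for the categorical ratings, here is a minimal sketch in Python rather than SPSS. The function name `fleiss_kappa` and the tiny data set are made up for illustration; with real data, each inner list would hold the 15 category choices for one advertisement.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for categorical ratings.

    ratings: list of subjects, each a list of category labels (one per rater).
    categories: list of all possible category labels.
    Every subject must be rated by the same number of raters.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if any(len(row) != n_raters for row in ratings):
        raise ValueError("every subject needs the same number of raters")

    # counts[i][c] = number of raters who placed subject i in category c
    counts = [Counter(row) for row in ratings]

    # Share of all assignments that fell into each category
    p_cat = [
        sum(c[cat] for c in counts) / (n_subjects * n_raters)
        for cat in categories
    ]

    # Per-subject agreement: proportion of rater pairs that agree
    p_subj = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]

    p_observed = sum(p_subj) / n_subjects
    p_expected = sum(p * p for p in p_cat)
    # Undefined (division by zero) if every rating is in a single category
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical mini example: 4 ads, 3 raters, categories A-C
example = [
    ["A", "A", "A"],
    ["B", "B", "C"],
    ["C", "C", "C"],
    ["A", "B", "B"],
]
print(fleiss_kappa(example, ["A", "B", "C"]))
```

The likability ratings (1 to 5) would not go through this function; those are the ones suited to ICC.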

Thank you for your quick answer!

However, I use SPSS, but the syntax I found online is not working properly (I only get errors). Is there some other way in SPSS (20.0) to calculate Fleiss’ kappa?

Thank you!

Thank you, Dr. Landers!

Dear Dr Landers,

I’m an orthopaedician trying to assess whether three observers measuring the length of the same 20 bones, using the same set of calipers and the same method, differ much in their measurements. I’ve calculated the ICC for intra-observer variation using a two-way mixed model (SPSS 16). Do I have to calculate three different ICCs, or can I put all three sets of data together and get a single ICC for all of us? We each measured the same 20 bones, and the raters were the same throughout. The variables were continuous, in mm. And if I can’t use ICC, what should I use instead to measure inter-observer variation between 3 observers?

Another small doubt- during the process of reading about this I chanced upon this article suggesting CIV or coefficient of inter observer variability-

Haber, M., Barnhart, H. X., Song, J., & Gruden, J. (2005). Observer variability: A new approach in evaluating interobserver agreement. Journal of Data Science, 3, 69-83.

Now do I need to use this as well?

I was referred here by another website that I was reading for information about ICC. 1) I really appreciate that you’ve taken time out to reply to all the questions, and 2) it is very lucidly explained (OK, OK, I admit I didn’t get the stuff about k and c etc., but that’s probably because I’m an archetypal orthopod).

regards,

Mathew.

ICC is across raters, so you’ll only have one ICC for each variable measured. So if length of bone is your outcome measure, and it’s measured by 3 people, you’ll have 1 ICC for “length of bone.” ICC also doesn’t assess inter-observer variation – rather the opposite – inter-observer consistency.

There are different standards between fields, so your field may prefer CIV as a measure of inter-rater reliability (I am not familiar with CIV). I can’t really speak to which you should use, as your field is very different from mine! But normally you would not need multiple measures of reliability. I doubt you need both – probably one or the other.

Dear Dr. Landers,

I wanted to know whether my raters agree with each other and whether any one of my 10 raters rated so differently that I have to exclude that rater. Is the ICC the right measure for that? And in my case, is it better to look at the single measures or the average measures?

Regards,

Martina

There isn’t anything built into ICC functions in SPSS to check what ICC would be if you removed a rater; however, it would probably be easiest just to look at a correlation matrix with raters as variables (and potentially conduct an exploratory factor analysis). However, if you believe one of your raters is unlike the others, ensure you have a good argument to remove that person – you theoretically chose these raters because they come from a population of raters meaningful to you, so any variation from that may represent variation within the population which would not be reasonable to remove.

For single vs. average measure, it depends on what you want to do with your ratings. But if you’re using the mean ratings for some other purpose (most common), you want average measures.
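As a sketch of the correlation-matrix idea outside SPSS, the Python below computes, for each rater, the average correlation with every other rater. A rater whose average sits far below the rest is the one to look at more closely. The function names and the data are hypothetical, and the layout assumed is one row per rated target with one column per rater.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_correlation_with_others(rows):
    """For each rater (column), average its correlation with all other raters."""
    columns = list(zip(*rows))
    k = len(columns)
    result = []
    for i in range(k):
        others = [pearson(columns[i], columns[j]) for j in range(k) if j != i]
        result.append(sum(others) / len(others))
    return result


# Hypothetical data: 5 targets rated by 3 raters; the third rater runs
# in the opposite direction to the first two.
ratings = [
    [1, 2, 5],
    [2, 3, 4],
    [3, 4, 3],
    [4, 5, 2],
    [5, 6, 1],
]
for rater, r in enumerate(mean_correlation_with_others(ratings), start=1):
    print(f"rater {rater}: mean r with others = {r:.2f}")
```

A low value only flags a rater for inspection; whether removal is defensible is still the theoretical question described above.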

Thank you very much for your answer!

Regards,

Martina

Dear Dr. Landers,

Thanks very much. I stuck with ICC.

regards,

Mathew

Dr. Landers,

I have individuals in teams who respond to 3 scale items (associated with a single construct). I would like to aggregate the data to the team level (as I have team performance data). How do I calculate the ICC(1) and ICC(2) to justify aggregation? The teams vary in size.

As I mentioned before in an earlier comment, the idea of aggregation in the teams literature is more complex than aggregation as used for coding. For an answer to your question, I’d suggest taking a look at:

Bliese, P. (2000). Within-group agreement, non-independence, and reliability. In K. Klein and S. Kozlowski (Eds), Multilevel theory, research, and methods in organizations. San Francisco: Jossey-Bass (pp. 349-381).

Hofmann, D. A. (2002). Issues in multilevel research: Theory development, measurement, and analysis. In S. Rogelberg (Ed.), Handbook of Research Methods in Industrial and Organizational Psychology. Malden, MA: Blackwell.

Klein, K. J., Dansereau, F., and Hall, R. J. (1994). Levels issues in theory development, data collection, and analysis. Academy of Management Review, 19, 195-229.

Dear Dr. Landers,

Your explanation is very helpful to non-statisticians. I have a question. I am doing a test-retest reliability study with only one rater, i.e. the same rater rated the test and retest occasions. Can you suggest which ICC I should use?

Thanks for your advice in advance!

Unfortunately, it’s impossible to calculate inter-rater reliability with only one rater. Even if you want to know the inter-rater reliability of a single rater, you need at least 2 raters to determine that value. If you just want an estimate of reliability, you can use test-retest reliability (a correlation between scores at time 1 and time 2), but it will by definition assume that there is no error in your rater’s judgments (which may or may not be a safe assumption).

Dear Dr. Landers,

I just wanted to express my appreciation for your “labor of love” on this site. Your willingness to provide help (with great patience!) to others struggling with reliability is much appreciated.

From a fellow psychologist…

Jason King

Dear Dr. Landers,

I will be grateful if you can spare some time to answer this.

I am translating an English instrument that measures medication regimen complexity into Arabic. The instrument has three sections and helps to calculate the complexity of a medication regimen. I have selected around 10 regimens of varying difficulty/complexity for this purpose and plan to use a sample of healthcare professionals such as pharmacists, doctors and nurses to rate them as per the instrument. How do I calculate the minimum number of regimens (I have selected 10 but can select more) and raters (healthcare professionals) if I want to test the inter-rater reliability?

I am not sure I completely understand your measurement situation. It sounds like you want to assess the inter-rater reliability of your translated measure, but I am not sure to what purpose. Are you planning on using this in a validation study? Are you trying to estimate the regimen complexity parameters for your ten regimens? An instrument does not itself have “reliability” – reliability in classical test theory is the product of the instrument by measurement context interaction. So it is not valuable to compute reliability in a vacuum – you only want to know reliability to ultimately determine how much unreliability affects the means, SDs, etc, that you need from that instrument.

Thank you for the comments Dr. Landers,

I want to use the instrument in a validation study to demonstrate the Arabic translation is valid and reliable. For validity, I will be running correlations with the actual number of medications to see if the scale has criterion related validity (increase in number of medications should go hand in hand with increase in complexity scores) and for reliability measurement I was thinking of performing ICC.

Thanks,

Tabish

That is my point – the way you are phrasing your question leads me to believe you don’t understand what reliability is. It is a nonsense statement to say “I want to demonstrate this instrument is reliable” in the context of a single validation study. A particular measure is neither reliable nor unreliable; reliability is only relevant within a particular measurement context, and it is always a matter of degree. If you’re interested in determining how unreliability affects your criterion-related validity estimates in a validation study, you can certainly use ICC to do so. Based upon your description, it sounds like you’d want ICC(1,k) for that purpose. But if you’re asking how many regimens/raters you need, it must be for some specific purpose – for example, if you want a certain amount of precision in your confidence intervals. That is a power question, and you’ll need to calculate statistical power for your specific question to determine that, using a program like G*Power.

Dear Dr Landers

Thank you very much for the clear explanation of ICC on your website. There is only one step I do not understand.

We have developed a neighbourhood assessment tool to determine the quality of the neighbourhood environment. We had a pool of 10 raters doing the rating of about 300 neighbourhoods. Each neighbourhood was always rated by two raters. However, there were different combinations (or pairs) of raters for the different neighbourhoods.

This appears a one-way random situation: 600 (2×300) ratings need to be made, which are distributed across the 10 raters (so each rater rated about 60 neighbourhoods).

We now have an SPSS dataset of 300 rows of neighbourhoods by 10 columns of raters. It is however not a *complete* dataset, in the sense that each rater did not rate about 240 (=300-60) neighbourhoods.

I am not sure how we should calculate the ICC here, or whether we should perhaps input the data differently.

Most grateful for your help

ICC can only be used with a consistent number of raters for all cases. Based on your description, it seems like you just need to restructure your data. In SPSS, you should have 2 columns of data, each containing 1 rating (the order doesn’t matter), with 300 rows (one per neighborhood). You’ll then want to calculate ICC(1,2), assuming you want to use the mean of your two raters for each neighborhood in subsequent analyses.
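The restructuring step and the one-way calculation can be sketched in Python as follows. This is only an illustration under stated assumptions: missing ratings are represented as `None`, the helper names `collapse_to_pairs` and `icc_one_way` are invented here, and the ICC formulas are the standard one-way random ANOVA ones, ICC(1,1) = (MSB − MSW) / (MSB + (k − 1)·MSW) and ICC(1,k) = (MSB − MSW) / MSB.

```python
def collapse_to_pairs(sparse_rows):
    """Turn a wide, mostly-empty rater layout into two ratings per row.

    sparse_rows: one list per neighborhood with one slot per rater;
    None marks a rater who did not rate that neighborhood.
    """
    pairs = []
    for row in sparse_rows:
        present = [value for value in row if value is not None]
        if len(present) != 2:
            raise ValueError("expected exactly two ratings per neighborhood")
        pairs.append(present)
    return pairs


def icc_one_way(rows):
    """One-way random ICC. Returns (ICC(1,1), ICC(1,k))."""
    n, k = len(rows), len(rows[0])
    grand = sum(sum(r) for r in rows) / (n * k)
    row_means = [sum(r) / k for r in rows]
    # Mean square between targets and mean square within targets
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for r, m in zip(rows, row_means) for x in r
    ) / (n * (k - 1))
    single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    average = (ms_between - ms_within) / ms_between
    return single, average


# Hypothetical 3 neighborhoods x 4 raters, two ratings per neighborhood
sparse = [
    [1, None, 2, None],
    [None, 3, None, 4],
    [5, 6, None, None],
]
pairs = collapse_to_pairs(sparse)
icc_single, icc_mean_of_two = icc_one_way(pairs)
print(pairs, icc_single, icc_mean_of_two)
```

Because the one-way model ignores which rater produced which rating, the order of the two values within a row does not change the result.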

This has been extremely helpful to our purposes when choosing how to assess IRR for a coding study. I want to be sure I am interpreting some things correctly. We currently have 3 coders being trained in a coding system, and “standard” scores for the training materials that we are trying to match. We are trying to get up to a certain IRR before coding actual study data.

The way I understand it, we want to run two-way random ICC using mean ratings and assessing for consistency, using ratings from our coders and the standard ratings. This should show reliability when our coders and the standard codes agree (or are consistent). When these ICC stats (we rate many dimensions) are above, say, .8, we should be ready to code real data.

We also may be interested in two-way random, single ratings, assessing for absolute match, between each one of our coders and the standard scores, as a measure of how well each individual is coding a certain dimension.

These three coders will be randomly coding real study data with some duplication for future IRR calculations. I think I am understanding the model we will use for those calculations will be different because we will not always have the same set of coders rating every subject, like we do with our training materials.

Am I on the right track?

If you want to assess how well you are hanging together with a particular value in mind – which it sounds like you are, since you have some sort of “standard” score that you are trying to get to – you will want agreement instead of consistency. Otherwise, if Coder #1 said 2, 3, 4 and Coder #2 said 4, 5, 6, you’d have 100% consistency but poor agreement.
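To make the difference concrete, here is a small Python sketch that runs those exact numbers (Coder #1: 2, 3, 4; Coder #2: 4, 5, 6) through the usual two-way ANOVA formulas for single-rater ICC. The function name is invented for illustration, and the formulas assumed are consistency = (MSR − MSE) / (MSR + (k − 1)·MSE) and agreement = (MSR − MSE) / (MSR + (k − 1)·MSE + k·(MSC − MSE)/n).

```python
def icc_two_way_single(rows):
    """Single-rater two-way ICCs. Returns (consistency, absolute agreement).

    rows: one list per rated target, one value per rater (no missing data).
    """
    n, k = len(rows), len(rows[0])
    grand = sum(sum(r) for r in rows) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(c) / n for c in zip(*rows)]

    # Mean squares for targets (rows), raters (columns), and residual error
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (x - rm - cm + grand) ** 2
        for r, rm in zip(rows, row_means)
        for x, cm in zip(r, col_means)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))

    consistency = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
    agreement = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
    return consistency, agreement


# Each row is one target: [Coder #1, Coder #2]
ratings = [[2, 4], [3, 5], [4, 6]]
consistency, agreement = icc_two_way_single(ratings)
print(consistency, agreement)
```

With these three cases, consistency comes out at 1.0 while agreement is only about .33, because the constant 2-point gap between the coders counts against agreement but not against consistency.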

You would also not want to include the standard ratings in computing ICC, since those would be considered population values – otherwise, you are including both populations and samples in your ICC, which doesn’t make any sense. You would be better off assessing ICC with only your sample, then conducting z-tests to compare your sample of raters with the population score (I’m assuming this is what you mean by “standard” – some sort of materials for which there is an accepted/known score – if this isn’t true, my recommendation won’t apply). Then compute a standardized effect size (probably Cohen’s d) to see, on average, how much your sample deviates from the standard scores. You would then need to set some subjective standard for “how close is close enough?”.

You would only need single measures if you wanted to see how well a single rater would code on their own. If you’re always going to have three raters, that isn’t necessary. However, if you are planning on using different numbers of raters in the future (you mention “with some duplication”), you’ll need to calculate ICC(1,1) to get an estimate of what your reliability will be in the future. You can then determine how reliable your ratings will be if you always have pairs versus always have trios doing ratings by using the Spearman-Brown prophecy formula on that ICC.

It sounds a little like what you’re trying to do is say “Now that we’ve established our raters are reliable on the test cases, we know they will be reliable on future samples.” That is unfortunately not really how reliability works, because reliability is both rater-specific and sample-specific. You have no guarantee that your reliability later will be close to your reliability now – for example, you can’t compute ICC(2,3) and then say “because our ratings are reliable with three raters, any future ratings will also be reliable.” That is not a valid conclusion. Instead, you can only compute ICC(1,1) to say “this is how reliable a single rater is most likely to be, as long as that rater is drawn from the same population as our test raters and our sample is drawn from the same population as our test population.”

So to summarize… if you plan to always have exactly 3 raters in the future for all of your coding, you should use ICC(2,3) as your best estimate of what ICC will be for your study data coding. If you plan to use any other number of raters, you should use ICC(1,1) and then use the Spearman-Brown prophecy formula to see what reliability is likely to be for your prophesied number of raters.
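For reference, the Spearman-Brown step is a one-liner. The sketch below is in Python, with a made-up single-rater ICC of .50 purely for illustration; the formula is the standard one, predicted reliability = m·r / (1 + (m − 1)·r), where m is the factor by which the number of raters changes.

```python
def spearman_brown(reliability, factor):
    """Predicted reliability when the number of raters is multiplied by `factor`.

    factor > 1 steps up (e.g. 1 rater -> 3 raters uses factor=3);
    factor < 1 steps down (e.g. 3 raters -> 1 rater uses factor=1/3).
    """
    return factor * reliability / (1 + (factor - 1) * reliability)


icc_single = 0.50  # hypothetical ICC(1,1)
for n_raters in (1, 2, 3):
    predicted = spearman_brown(icc_single, n_raters)
    print(f"{n_raters} rater(s): predicted reliability = {predicted:.3f}")
```

The same function run with a fractional factor goes the other way, from the reliability of a mean of several raters back down to a single rater.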

Okay. We do understand that the ICCs we’re calculating in training are not going to guarantee any level of reliability in future coding. The idea is in training to get to a point where we’re seeing the same things in our training material, and they agree with what has been seen in them before (the “standard” scores). The “standard” scores are of course still someone else’s subjective ratings and I don’t think we want to treat them as 100% accurate themselves. I think it’s ultimately, at least after the many hours of training we’re going through, more important that our coders are seeing the same things (reliable with each other) than that we agree all the time with the “standard” score.

But, we will not have these same three coders coding study material. We will have one (in many cases) and two (in some cases, to compute the actual IRR with study material), but not three. And, the study is going to go on long enough that in the future it will be coders trained in the future, not those we are training now, due to lab research assistant turnover, etc.

So it sounds like we should be assessing each coder’s reliability using ICC(1,1) between each coder and the standard score to estimate that coder’s future reliability? What if, as I say, agreement with other coders in training seems ultimately more important than absolute agreement with the standard score?

Thank you for your extensive reply. This is clearly somewhat above my level (I have not had grad stats yet!) but fortunately, I am not the final arbiter of these decisions. I will take this information upwards so we are sure to do this right!

If you’re going to be using different numbers of raters at any point in the future, and you want a prediction of that ICC in the future, your only option is ICC(1). Based on what you’re saying, I believe you’re interested in ICC(1,1) for agreement. I would not include the “standard” score, since this won’t be available to you in your actual study data and will affect your ICC unpredictably (it could artificially inflate or deflate the value). But you still might be interested in calculating Cohen’s d between your sample and that score, just to see how far off your coder sample is from that score, on average.

You also mention that in the future, you’ll be using 1 or 2 raters. If you do so, you will be unable to calculate ICC in the future, since ICC requires a consistent number of raters for all cases, and at least 2 raters. In that case, you could get an estimate of ICC by having two raters code a subset of the data and then calculating ICC(1,1) – by convention, you probably want at least 25% of your sample coded this way, but this isn’t a hard rule. However, that would not technically be ICC for your study – it would only be an estimate of that value (in a sense, an estimate of an estimate of reliability).

I’ll also mention that grad stats typically does not cover this area – although you can’t see e-mail addresses as a reader, most of the comments here are actually from faculty. Reliability theory is usually covered in a measurement, psychometrics, or scale development course (sometimes in research methods, but not always). Not everyone has access to such courses.

Thank you once more for being so helpful. I understand, I believe, that in training, I want to calculate ICC(1,1) for agreement with our three trainees’ ratings, since in the future we will be using single raters (not a mean) and we want to get the best estimate of reliability for any single rater since we will not be coding as a group. I don’t know if I’m phrasing all of this right, but I do think I understand (or am starting to).

And with real study data, we can never compute an actual IRR for all study data since most study data will be coded by only one coder. But 25% (as you say) is what we are planning to have coded by two coders, so we can estimate IRR (or as you say, estimate an estimate!).

And it seems we will still use ICC(1,1) with our two coders on that data, to get that estimate of an estimate.

If the applications I just finished submitting are viewed favorably, I will be a grad student next fall. Not faculty for a long time (if ever!)

Hi Dr. Landers,

Is there a way to get agreement statistics for each of the individual raters using SPSS?

Thanks!

Stefanie

I am not quite sure what you mean by “agreement statistics.” ICC assesses reliability – true score variance expressed as a proportion of observed variance. If you just want to see if a particular rater is not rating consistently with others, there are a couple of ways to do it. The easiest is probably to treat each rater as an item in a scale, and then calculate a coefficient alpha (Analyze > Scale > Reliability Analysis, I think), with the “scale if item deleted” option set. You can then see if alpha increases substantially when a particular item (rater) is removed. You can also do so manually with ICC by computing ICC with and without that rater. If you REALLY wanted to, you could also compute confidence intervals around ICC and compare with and without particular raters, but that is probably overkill.
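If it helps to see what that “scale if item deleted” output is doing, here is a rough Python equivalent that treats each rater as an item. The function names and data are hypothetical, and it needs at least three raters so that two remain after one is dropped.

```python
def sample_variance(values):
    """Unbiased sample variance (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def cronbach_alpha(rows):
    """Coefficient alpha with raters as items; rows = targets, columns = raters."""
    k = len(rows[0])
    item_variances = [sample_variance(col) for col in zip(*rows)]
    total_variance = sample_variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def alpha_if_rater_deleted(rows):
    """Alpha recomputed with each rater left out in turn."""
    k = len(rows[0])
    return [
        cronbach_alpha([[x for j, x in enumerate(r) if j != drop] for r in rows])
        for drop in range(k)
    ]


# Hypothetical data: raters 1 and 2 match exactly, rater 3 is noisier
ratings = [
    [1, 1, 3],
    [2, 2, 1],
    [3, 3, 2],
    [4, 4, 5],
    [5, 5, 4],
]
print("alpha, all raters:", round(cronbach_alpha(ratings), 3))
for rater, a in enumerate(alpha_if_rater_deleted(ratings), start=1):
    print(f"alpha without rater {rater}: {a:.3f}")
```

A clear jump in alpha when one rater is left out points to that rater as the inconsistent one.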

Hi there,

This is an incredibly helpful article and thread. I want to make sure I am about to use an ICC appropriately.

1.) Can an ICC be used when the number of response options varies across items? For example, one question has 5 possible answers (5-point Likert scale), while another question is a dichotomous yes/no, and yet another question is a 3-answer yes/no/I don’t know.

2.) I have 4 different sections of a scale that are rated by parents and children. I am trying to determine an ICC for each section based on how well the family agrees with each other. Not every child of the parent will necessarily be participating. Which ICC is the appropriate one to use?

Best

Elissa

1) ICC is not appropriate unless you have interval or ratio level measurement. Your 5-point scale is probably fine (depending on the standards of your field). The Yes/No could be fine if you dummy code it, but I wouldn’t use ICC here. ICC is absolutely not appropriate for Yes/No/Don’t Know. You want some variant on kappa for this.

2) You cannot use ICC unless you have a consistent number of raters for ALL cases. So ICC is not appropriate in the context you are describing.

Dear Dr. Landers

I have a query regarding which statistic to use for computing inter-rater reliability for my data. I have a collection of websites, each of which is being rated on a few dimensions by a number of raters. The number of raters is the same for each website. The ratings are qualitative (i.e. good, average and poor, denoted by 1, 2 and 3 respectively) for all the dimensions except one, in which ranks are given. Please guide me on which statistic I should use for computing inter-rater reliability for my data. Is it Fleiss’ kappa that I should use? If yes, then how (using SPSS)? If not, then which other statistic should I use?

please reply as soon as possible

thanks

namita

If your ratings are all ordinal (good/average/poor and 1st/2nd/3rd are effectively the same measurement-wise), Fleiss’ kappa is a good choice. I don’t think you can do it in SPSS; but it is very easy to conduct in Excel.

Dr. Landers,

I am doing interrater reliability with a group of nurses (5) who staged 11 pressure ulcers. Based on your explanation, I should use a two-way mixed model with absolute agreement, correct? Thanks for your help, MDG

If they are all the same five nurses, yes. And assuming that the ratings you are talking about have a “real world” meaning, then yes – you are most likely interested in absolute agreement.

thank you dr landers.

I must clarify that in my questionnaire, there are certain aspects of websites on which each website (included in the list) is to be evaluated as good (1), average (2), or poor (3), and at the end, the users are asked to choose which websites are, according to them, the top five (ranking them as 1, 2, 3, 4, 5). Now should I calculate the reliability of these two types of items separately, i.e. the reliability of the evaluation part using Fleiss’ kappa and that of the ranking item using Krippendorff’s alpha?

please help me out

You should calculate reliability for each scale on which ratings are being made. Since you have a rating quality scale and also a ranking quality scale, you have two scales. Then, each scale should be assessed with an appropriate type of reliability.

So is it right to calculate the reliability of the rating scale using Fleiss’ kappa and that of the ranking scale using Krippendorff’s alpha?

It sounds like both of your scales are ordinal, so you should probably be using the same approach for both. Kappa could theoretically be used for both. But for ordinal data, you would usually use a weighted kappa. I am not familiar with Krippendorff’s alpha, so I don’t know if that’d be appropriate.

dear sir

thank you very much for your suggestions.

Dear Dr. Landers

I have a question regarding which statistic to use for computing inter-rater reliability for my data. I have three raters who have rated images by their quality (0 = non-diagnostic, 1 = poor quality… etc. 5 = excellent). The raters have looked at 9 image stacks (with slightly different imaging parameters) and scored 10 image slices from each image stack. Can I use ICC (two-way random) to measure inter-rater reliability? Or does it make any sense, since the raters are in consensus in most of the cases?

Consensus doesn’t influence which statistic is appropriate – it should just be close to 1 if they are mostly in consensus. The only exception would be if they agree 100% of the time – then you would not be able to calculate ICC. I’d actually say that your scale is double-barreled – i.e. you are assessing both quality (1-5) and ability to be used as a diagnostic tool (0 vs 1-5). In that case, I’d probably use ICC for the quality scale and kappa for the diagnostic element (recoding 0 as “no” and 1-5 as “yes”). Given that you have high agreement, this would probably make a stronger case – i.e., you could say all raters agreed on diagnostic-appropriateness for 100% of cases, and ICC was .## for quality ratings. But that is somewhat of a guess, since I don’t know your field specifically.

Also, I’m assuming that your three raters each looked at the same 10 slices from the 9 stacks. If they looked at different slices, you cannot compute reliability (you must have replication of ratings across at least two raters to compute a reliability estimate).

Thank you. This was very helpful. So is it right to use 2 way random model with absolute agreement?

If your three raters are always the same three people, and they are all rating the same targets, yes – that is what I would use for the quality ratings component.

Hi Richard,

I wanted to run ICC(1) one-way random in SPSS. I have varying number of judges for each subject: each subject was rated by anywhere from 2 to 4 judges. All my subjects are in rows and the judges are in columns (judge1 – judge4). The problem is SPSS deletes missing data listwise and thus only subjects rated by 4 judges were included in the ICC calculation. Any suggestions?

Thanks in advance!

Jason

I’m afraid you won’t like my answer. Because ICC determines reliability given a particular number of raters [e.g. ICC(2,1) vs ICC(2,4)], one of its assumptions is that the number of raters is held constant. If you’re using the mean of this scale for a paper and can’t get 4 raters consistently, what I’d suggest doing is taking the first two raters for each case, calculating ICC(1,2) and then describing your calculated ICC as a conservative underestimate of the “true” reliability given your rater structure.

In reality, you’re going to have less reliable estimates of the mean when you have 2 raters than when you have 4. You could theoretically calculate ICC(1,2), ICC(1,3), and ICC(1,4) to determine reliability given each configuration, but I find that is generally not worthwhile.

I appreciate the quick reply. I poked around and figured out that using HLM I can estimate ICC(1, k) when k is not a constant. ICC(1) can be easily computed from the variance components in the HLM null model output.

On an unrelated issue, I ran into the ICC(2) versus ICC(1, k) issue you mentioned above: “some researchers refer to ICC(1,k) as ICC(2), especially in the aggregation/multilevel models literature”. When a reviewer does that, requesting ICC(2) while probably meaning ICC(1, k), what would be a good way to respond?

The HLM ICC is not precisely an ICC, if I recall my HLM correctly. But it has been a while; hopefully it will be close enough for your purposes!

As for the reviewer, I find the best approach is to ask the editor (assuming you have a relatively friendly editor) who may be willing to relay your request for clarification to the reviewer. If you don’t want to take that route, I’d report the one you think s/he meant in the text of your revision and then explain in your author’s reply why you did it that way.

I selected ratings with only k = 3 judges and obtained the ICC(1) and ICC(1, 3) for the ratings in both SPSS and HLM 7.0. Both programs gave identical results. So that gave me some confidence in HLM’s computation of ICC.

Thanks for the advice!

Jason

Yes, that makes sense; as I recall, they should be identical as long as you have a consistent number of raters for each case and your sample size is relatively large. The differences occur when you have a variable number of raters; I believe it may change something about the sampling distribution of the ICC. If I’m remembering this right, the original (Fisher’s) formula for ICC was unbiased but required a consistent number of raters; modern formulas for ICC are based on ANOVA and do not require equal raters but are biased upward as a result. I suspect SPSS uses the first formula. But anyway, for your purposes, I doubt it really matters.
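For readers wondering what “computed from the variance components” looks like in practice, here is a hedged Python sketch. It assumes the null model reports a between-subject variance (named `tau00` here) and a within-subject residual variance (named `sigma2` here); those names, the function names, and the numbers are placeholders rather than actual HLM output.

```python
def icc1_from_variance_components(tau00, sigma2):
    """ICC(1): between-subject variance as a share of total variance."""
    return tau00 / (tau00 + sigma2)


def icc1k_from_variance_components(tau00, sigma2, k):
    """ICC(1,k): reliability of the mean of k ratings per subject."""
    return tau00 / (tau00 + sigma2 / k)


# Placeholder variance components, not taken from any real model
tau00, sigma2 = 2.0, 6.0
print("ICC(1)   =", icc1_from_variance_components(tau00, sigma2))
for k in (2, 3, 4):
    print(f"ICC(1,{k}) =", icc1k_from_variance_components(tau00, sigma2, k))
```

With a constant k this gives the same value as stepping ICC(1) up with the Spearman-Brown formula; how best to handle a k that varies across subjects is exactly the part I would treat with caution.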

Hello Dr. Landers,

Thank you for this helpful post on ICC for interrater reliability. I am wondering if you have any insight on what would be an acceptable reliability coefficient (single-measures) for two raters that are rating the exact same cases. I have computed an ICC(2,1); the second rater has coded 20% of my data, as I am hoping to demonstrate the reliability of one rater who coded the entire set. I have not found many helpful resources for acceptable ICCs for interrater reliability in social science and educational research … some say above .70 for research purposes, while .80-.90 would be needed for major decisions (as with high-stakes testing, etc.). I got an ICC(2,1) = .76. I am happy enough with this, but I have been looking for a good citation to support the acceptability of this correlation and not having much luck.

Thanks for your time!

This is not going to be precisely an answer to your questions, but hopefully it will be helpful to you. All types of reliability are essentially the same in a particular measurement context. The difference between different estimators is what you treat as error variance. For example, a test re-test reliability coefficient treats differences over time as error, whereas coefficient alpha treats differences between items as error. In contrast, ICC(1) treats differences between rating targets as error, and ICC(2) treats differences between rating targets AND differences between raters as error. But all of these estimates try to get the same piece of information: what is the proportion of true variance out of observed variance? This is why the “best” measures of reliability tend to be the most conservative – because they consider multiple sources of error simultaneously (for example, the coefficient of equivalence and stability or generalizability theory overall).

The practical outcome of all of that is this: the lower your reliability, the more attenuated your effects will be (because you are mismeasuring your intended constructs) and the less likely you will find statistical significance even if the effect you want to find is present (i.e. increased chance of a Type II error). Low reliability also increases the standard error, making it less likely that any particular person’s scores actually represent their own true score (which is why very high reliability is recommended for high-stakes testing; even if group-level decisions are still predictive of performance, individual decisions will vary more than they should).

So the short answer to your question is this: since all reliability is the same, whatever standard you found for one type should be the same for other types, i.e. a .7 is a .7. But in practice, the lower that number, the less likely you find statistical significance (plus a variety of other negative consequences).
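To put rough numbers on those consequences, the Python sketch below uses two standard classical test theory formulas: the attenuation formula, observed r = true r × √(r_xx × r_yy), and the standard error of measurement, SEM = SD × √(1 − r_xx). The function names and input values are illustrative only.

```python
from math import sqrt


def attenuated_correlation(true_r, reliability_x, reliability_y=1.0):
    """Correlation you would expect to observe given unreliable measures."""
    return true_r * sqrt(reliability_x * reliability_y)


def standard_error_of_measurement(sd, reliability):
    """Typical distance between an observed score and the true score."""
    return sd * sqrt(1 - reliability)


# Illustrative values: a true correlation of .50 and a 7-point rating with SD = 1.5
for rel in (0.90, 0.76, 0.70):
    r_obs = attenuated_correlation(0.50, rel)
    sem = standard_error_of_measurement(1.5, rel)
    print(f"reliability {rel:.2f}: observed r = {r_obs:.3f}, SEM = {sem:.3f}")
```

Lower reliability shrinks the observed correlation and widens the uncertainty around any one person’s score, which is the practical cost described above.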

Thanks for the thorough response! That does answer my question. So am I correct in my interpretation that the ICC(2) is relatively conservative (it is the “better” measure) because it controls for both rater and ratee effects? I am actually preparing for my dissertation defense and I have a top-notch stats prof on my committee. Your blog and response have been extremely helpful!

In comparison to ICC(1), it is more liberal, because you must assume that your rater effects are the same across all of your ratings and then partial that out. But if that assumption is true, it should also be more accurate. ICC also still doesn’t account for temporal or other sources of error. I’d strongly recommend reading Ree & Carretta (2006) in Organizational Research Methods and Cortina (1993) in Journal of Applied Psychology – they will fill many gaps in what I’m able to tell you on a blog post and comments!

Thank you! I will look at those articles. I appreciate all of your help!

Dear Mr. Landers,

Thank you for your helpful post. You give really good explanations. However, I still have some uncertainties about my own research, on which I have to write my master’s thesis. I have conducted a study based on data stemming from a larger study on the same topic. Because I didn’t gather the data myself and didn’t administer the measures, I have run into some difficulties. One of the measures, whose reliability I want to report in my Method section, consists of four stories from participants, rated on a 7-point scale by two independent coders. From the ratings of those four stories, one composite score was created, which is the final score used in the analysis. However, I don’t know the separate scores of each coder for the different stories or for the composite score. I only have one complete dataset. What I do have is the scores from one of the coders for 10 subjects and, for those same 10 subjects, the scores from the inventor of the measure, who did not rate the population of my study. Is it possible to assess interrater reliability between those two, even if I don’t know the scores for the other coder from my data? Will it be possible to say something about the reliability of the test? I was thinking I could use a two-way mixed ICC, absolute agreement, but I don’t know if this is appropriate in this case. I hope you can help me. Thanks in advance!

Since reliability is population-specific, there is no way to calculate inter-rater reliability accurately in this context. The ICC you are talking about would be the reliability of mean ratings on that population for those 10 subjects, which is not the number you need. You must have all ratings to calculate reliability in this context. Otherwise, you are basically asking, “Can I calculate an ANOVA looking for interactions but not collect data for one of the conditions?”

Thank you very much for your quick response! I was afraid of that. I will see what I can do now. Do you know if there is any other measure which I can use to say something about the reliability of the test? I hope that, in some way, I will be able to get the scores of each coder.

I’m afraid not. You can’t determine the reliability of scores you don’t have. What some people do in the context of meta-analysis is conduct a sort of “mini-code,” i.e. having the coders re-code a subset of the dataset and examine reliability on the subset only. But there are assumptions attached to that (e.g. that the recode they completed is parallel to the scores that were not recoded). But that is the only potential option I can think of.

Hello Dr. Landers,

We have calculated an ICC for inter–rater reliability and would like to make sure we are calculating and interpreting it correctly. We developed a 6-item evaluation tool to measure “Clinical Reasoning” ability in Medical Students & Residents. (They hear a case, and write a summary, and 3 raters have used this tool to rate their summary). There are 6 items on the tool, each rated with a 0, 1, or 2, for a total possible score of 12.

All 3 raters rated every participant, and we have a sample of raters, thus we computed a Two-Way Random Model. We are interested in Absolute Agreement.

We have four questions:

1. Single Measures refers to how accurate a single rater would be if he/she used this tool in the future, while Average Measures refers to the actual reliability of the 3 raters’ scores? Is this correct?

2. Cronbach’s Alpha is reported when we run the analysis. What does it mean in this context?

3. When we entered the data, we entered the total score (of the 6 item tool) for each participant across the three raters. So, we had 3 columns representing the 3 raters. Do we calculate the ICC on the data as it is or should we calculate the mean of total scores for each rater and run the ICC on the mean?

4. How do we know if EACH ITEM on the tool is reliable? Should we calculate an ICC for each item on the tool, in addition to the total score? (I hope this makes sense).

Thank you in advance for any guidance you may be able to provide.

I have 4 judges that are being used in combination as rater1 and rater2 to rate 30 responses. The four judges could be either rater 1 or rater2 for any response. Since I’m interested in the average rating is ICC(2,k) the correct procedure? Thank you.

Since you have inconsistent raters, you need ICC(1,k).
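For anyone who wants to see what ICC(1) is doing outside of SPSS, here is a minimal sketch of the one-way formulas as I understand them from Shrout and Fleiss: both values come from the between-case and within-case mean squares of a one-way ANOVA. The function name and the example ratings are made up for illustration.

```python
def icc_one_way(ratings):
    """Return (ICC(1,1), ICC(1,k)) for a list of cases.

    Each case is a list of k ratings; which rater produced which
    rating is ignored, as the one-way model requires.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    case_means = [sum(row) / k for row in ratings]

    # Between-case and within-case mean squares from a one-way ANOVA
    ms_between = k * sum((m - grand_mean) ** 2 for m in case_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, case_means) for x in row
    ) / (n * (k - 1))

    single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    average = (ms_between - ms_within) / ms_between
    return single, average

# Hypothetical data: six responses, each scored by two of the four judges.
example = [[4, 5], [2, 2], [5, 4], [1, 2], [3, 3], [4, 4]]
icc_single, icc_average = icc_one_way(example)
print(f"ICC(1,1) = {icc_single:.3f}, ICC(1,k) = {icc_average:.3f}")
```

Because the column a rating sits in carries no meaning here, it does not matter which judge happened to be "rater1" or "rater2" for a given response.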

Thank you Dr. Landers for pointing out my oversight. Yes, I will be using ICC(1,k).

I was wondering, especially in my situation, what would have been the consequence of using a weighted Kappa? I had initially, planned to use that procedure.

Weighted kappa is going to be more appropriate for ordinal data – you can’t use ICC at all in that context. For interval+ data, I believe that weighted kappa approaches ICC(2,1) as sample size increases (at least, that seems to be the ICC referred to here: http://epm.sagepub.com/content/33/3/613.full.pdf+html). I believe weighted kappa will always be lower than ICC, but that’s a bit of a guess. I have honestly not looked into it too deeply, because I would just use kappa for ordinal and ICC for interval/ratio.

Hi, a lot of things to learn here: thank you! Could you, please, help me?

I have a group of 18 raters who each rated 40 ultrasound images twice. The ratings used a Likert-type scale: nothing, a little, a lot, and full of. Should I use ICC or kappa to test intra-rater and inter-rater agreement?

This really depends entirely upon the standards of your particular field. In psychology, for better or worse, we typically treat Likert-type psychological scales as interval-level measurement, in which case you would use ICC. But there are many fields where this is not standard practice. If it is not standard practice in your field, you should use kappa (or a variant). The easiest way to tell is if people typically report means of their Likert-type scales. If they report means, they are already assuming interval measurement. If they report medians only, they are probably assuming ordinal measurement.

I am struggling to write a null hypothesis for my doctoral dissertation. I am performing an ICC to determine reliability with 2 raters from a pool of 8 to rate 30 portfolios. Each portfolio will be rated 2 times. Any suggestions? My hypothesis is that there is no difference in the ratings… right?

If you are talking about the hypothesis test reported by SPSS, the null would be something like “this sample’s ICC is drawn from a population where ICC = 0”. I don’t know that I’ve ever seen a hypothesis test of an ICC reported, however, because you are usually just interested in the effect size (i.e. what proportion of observed variance is estimated to be true variance?). The hypothesis test doesn’t really tell you anything useful in that context.

I am wondering about the assumptions that one needs to meet for calculating an ICC. Are they the same as for ANOVA? I spent quite a bit of time reading about it on the internet, but it is difficult to get a clear answer.

If I did interval-level measurement and would like to calculate an ICC to justify aggregating across different raters (I know which type to choose etc.), do we have to care about issues of normality and linearity? Are there any other assumptions that one needs to consider?

Is there an alternative to the ICC? Weighted kappa?

Your help will be much appreciated!

Yes, the assumptions are the same. ICC(2) is essentially a two-way ANOVA, with a random rater effect and random ratee effect. ICC(3) assumes a fixed rater effect. There is an assumption of normality of ratings and independence among raters and ratees. There is no assumption of independence of raters in ICC(1). Neither ANOVA nor ICC assume linearity. But if the normality assumption is not met, you should probably use a different measure of inter-rater reliability – kappa is one such alternative. Regardless, if your goal is aggregation, you can’t meaningfully aggregate using a mean if the normality and independence assumptions are not met – so such a change might fundamentally alter your analytic approach.
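To show the ANOVA connection, here is a minimal sketch that computes the two-way mean squares by hand and turns them into single-measure ICC(2) and ICC(3), using the formulas as I understand them from Shrout and Fleiss. The data are the six-target, four-judge example usually attributed to that paper; treat the numbers as illustrative, and the function names as my own.

```python
def two_way_mean_squares(ratings):
    """Mean squares for cases (rows), raters (columns), and residual."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(r[j] for r in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in ratings for x in r)
    ss_error = ss_total - ss_rows - ss_cols
    return ss_rows / (n - 1), ss_cols / (k - 1), ss_error / ((n - 1) * (k - 1))

def icc2_single(ratings):
    """Two-way random, absolute agreement, single measures."""
    n, k = len(ratings), len(ratings[0])
    ms_r, ms_c, ms_e = two_way_mean_squares(ratings)
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

def icc3_single(ratings):
    """Two-way mixed, consistency, single measures."""
    k = len(ratings[0])
    ms_r, _, ms_e = two_way_mean_squares(ratings)
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e)

# Rows are rating targets, columns are the same four judges throughout.
shrout_fleiss = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc2_single(shrout_fleiss), 2), round(icc3_single(shrout_fleiss), 2))
```

The only difference between the two functions is whether the rater mean square enters the denominator, which is the "random versus fixed rater effect" distinction in practical terms.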

Thanks for your fast reply!

Just to clarify: The normality assumption refers to all the ratings per ratee, right? In other words, I need to check for normality of ratings for each individual ratee.

Technically, that is true. But in practice, you usually don’t have enough raters to reasonably make any conclusions about normality by ratee either way. ANOVA also assumes equality of variances between IV levels (in this case, between raters), so if that assumption is met, normality by rater is probably sufficient evidence – at least, that is what I would check. As with ANOVA, ICC is robust to minor violations of the normality assumption anyway – if everything looks vaguely bell-shaped-ish, you are probably safe.

Hello, Dr. Landers,

I have the same 2 raters rating a sample of 1000 people on the scale with 10 items with 5-point Likert scale treated as interval. I am interested in the measures of absolute agreement between the 2 raters. I plan to perform ICC(2,1) to calculate absolute agreement for each of 10 items.

But I am also interested in the total or mean agreement coefficient.

Is there a way I could calculate an average ICC for a scale based on individual item ICCs? Or is the only way to use scale means for raters 1 and 2?

Thanks!

What you are saying is contradictory. If the 10 items are not part of the same scale, you would compute ICC for each of the 10 items. If the 10 items were part of the same scale, you would compute the scale mean for each rater and then compute ICC. There is no common situation I can imagine where you would do both, except perhaps in the context of scale development. “Mean reliability” is not usually a meaningful concept, because reliability as you would normally need to know it is sample-specific – the only situation where you’d identify mean reliability might be in the context of meta-analytic work (where you have access to a distribution of reliabilities).

Thank you for your comment. Yes, it is a part of reviewing a new scale and I wanted to provide as much insight about the scale as possible.

Dear Dr. Landers,

I would like to ask you a question about how to analyze inter-rater agreement.

I created a prosodic reading scale with 7 items. Each item has four possible options, and each option is fully described. 120 children were evaluated: 2 raters rated 60 children, and 2 other raters rated the other 60 children. To analyze inter-rater agreement, should I use Cronbach’s alpha, kappa, or ICC?

Thanks in advance, Nuria

Cronbach’s alpha is not appropriate, given your measurement structure. I am not sure what you mean by “four possible options.” If they are Likert-type measurement (implied by your use of the word “scale”, which would mean the items could be considered interval-level measurement, e.g. Very little, little, much, very much), and if you want to use the mean of this scale for something else, you should compute the mean score for each rater and then compute ICC(1,2) on the means. If all those “ifs” are true except you want to know “in the future, how well could a single individual rate a child on this scale?”, you should determine ICC(1,1). If you don’t have interval-level measurement, some form of kappa will be needed.

Dear Dr. Landers,

Thanks for your quick reply. It has been very helpful.

I think I have a Likert-type measurement, given that the options for each item go from 1 to 4, where 1 = the lowest level of prosodic reading and 4 = the highest level, although the description of each option in the scale is longer. So, as you said, in this situation I should use ICC.

Another question: I have 4 raters (2 evaluated one half of the sample and the other 2 evaluated the other half). Could I compute ICC(2,1) for the first 2 raters and another ICC(2,1) for the other 2 raters, and then report these two results?

Thanks again, Nuria

You should use ICC if your field generally treats such scales as interval – many do not.

You could report the results separately only if you treat the two halves as separate studies. If you plan to compute statistics on the combined sample, you must use ICC(1).

Dear Dr. Landers,

I have read your wonderful post and still have a couple of questions about ICC.

Three raters have rated 50 cases based on a validated 6 item scale using a 5 point Likert measurement. Three items of the scale measure one construct (e.g. expertise) and the other three measure another construct (e.g. trustworthiness). How would I determine the ICC between the raters? And how do I deal with the two constructs with respectively three items each? Should I compute the mean score for each rater per item or per construct? Or do you suggest something else?

Additionally, one rater only rated about half of the cases, while the other two rated all 50. How can I treat this problem.

Thanks in advance,

Monika

Compute the mean score per scale (one each for expertise and trustworthiness), then compute ICC on the mean scores. Unless you are using the individual items in later analyses, you do not need to know the interrater reliability of each item.
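As a rough illustration of the "mean score per scale first, then ICC" order of operations, here is a minimal sketch that collapses item-level ratings into one scale mean per rater per case. The nested-list layout, the function name, and the numbers are all hypothetical; the resulting cases-by-raters table is what you would then feed into the ICC analysis.

```python
def scale_means(item_ratings):
    """Collapse item scores into one scale mean per rater per case.

    item_ratings[case][rater] is the list of item scores that rater
    gave that case on a single scale (e.g. the three expertise items).
    """
    return [
        [sum(items) / len(items) for items in case]
        for case in item_ratings
    ]

# Hypothetical expertise items: two cases, three raters, three items each.
expertise_items = [
    [[4, 5, 4], [3, 4, 4], [5, 5, 4]],
    [[2, 1, 2], [2, 2, 3], [1, 2, 2]],
]
expertise_means = scale_means(expertise_items)
for row in expertise_means:
    print([round(value, 2) for value in row])
```

You would repeat this once per scale (here, once for expertise and once for trustworthiness) and compute a separate ICC on each resulting table.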

I would compute ICC(1,2), since you have a minimum of 2 raters for all 50. I would then compute means for use in later analyses across all three raters. Your calculated ICC will be an underestimate for cases where you have 3 raters, but you can still take advantage of the reduced error variance. You could also just drop the rater that only examined half of the cases and use ICC(2,2).
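To give a sense of how much of an underestimate that might be, here is a minimal sketch using the Spearman-Brown formula, which links the reliability of a single rater to the reliability of a mean of k raters. It assumes the raters are interchangeable draws from the same population, which is what the one-way model assumes anyway. The function names and the .70 starting value are hypothetical.

```python
def step_down(icc_average, k):
    """Single-rater reliability implied by the reliability of a k-rater mean."""
    return icc_average / (k - (k - 1) * icc_average)

def step_up(icc_single, k):
    """Reliability of a k-rater mean implied by single-rater reliability."""
    return k * icc_single / (1 + (k - 1) * icc_single)

# Hypothetical: ICC(1,2) came out at .70; what would a 3-rater mean look like?
single = step_down(0.70, 2)
three_rater_mean = step_up(single, 3)
print(round(single, 3), round(three_rater_mean, 3))
```

The stepped-up value is only an estimate for the three-rater cases, not something you have measured directly.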

Dear Dr. Landers,

Thanks for your advice! Like you had suggested, I computed the mean scores per construct (trustworthiness and expertise) and computed ICC(1,2) on these scores. This resulted in an ICC value of .702 for the consistency measure and .541 for the absolute measure. I am now wondering:

– What do these ICC values tell me?

– What is an acceptable ICC value? .70 like Cronbach’s alpha?

– Are there other measures for interval variables I can use to check the inter-rater reliability between three raters?

Best,

Monika

This indicates that in terms of consistency (e.g. 1,2,3 is 100% consistent with both 2,3,4 and 4,5,6), about 70% of the observed variance in your raters’ scores reflects the same underlying construct (although it may not be the one you intend). The remaining 30% is error. In terms of absolute agreement (e.g. 1,2,3 is in agreement only with 1,2,3), about 54% of the observed variance reflects agreement on the same underlying numbers.
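Here is a minimal sketch of that distinction with a two-rater version of the same example: one rater gives 1, 2, 3 and the other gives 2, 3, 4. The single-measure formulas are the two-way ones as I understand them from Shrout and Fleiss and McGraw and Wong; the function name is my own.

```python
def single_measure_iccs(ratings):
    """Return (consistency, absolute agreement) single-measure ICCs
    from a cases-by-raters table using two-way ANOVA mean squares."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    ss_cases = k * sum((sum(r) / k - grand) ** 2 for r in ratings)
    ss_raters = n * sum(
        (sum(r[j] for r in ratings) / n - grand) ** 2 for j in range(k)
    )
    ss_total = sum((x - grand) ** 2 for r in ratings for x in r)
    ms_cases = ss_cases / (n - 1)
    ms_raters = ss_raters / (k - 1)
    ms_error = (ss_total - ss_cases - ss_raters) / ((n - 1) * (k - 1))
    consistency = (ms_cases - ms_error) / (ms_cases + (k - 1) * ms_error)
    absolute = (ms_cases - ms_error) / (
        ms_cases + (k - 1) * ms_error + k * (ms_raters - ms_error) / n
    )
    return consistency, absolute

# Rater A gives 1, 2, 3 and rater B gives 2, 3, 4 to the same three cases.
shifted = [[1, 2], [2, 3], [3, 4]]
consistency, absolute = single_measure_iccs(shifted)
print(round(consistency, 3), round(absolute, 3))
```

The constant one-point gap between the raters costs nothing in consistency terms but is counted as disagreement in absolute terms, which is why the second number is lower.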

.70 is a moderately strong reliability, for both ICC and alpha. You really want in the .8s or .9s if possible. The side effect of low reliability is that relationships and differences you observe become attenuated; that is, if your reliability is too low, things that should have been statistically significant will not be, and observed effect sizes will be biased downward from their population values.

There are certainly many ways to examine inter-rater reliability; but for inter-rater reliability of interval/ratio data, ICC is the most common.

Dear Dr. Landers,

Firstly, what a privilege to read such a well written and student friendly article; clearly explaining the whole process. Additionally, I found it very refreshing that you have found the time to answer all the questions that have been posted. I do have a little question of my own if possible, but understand if you have had enough of us all!

I have a simple design whereby 5 raters assess all 20 muscle reaction times (ratio data). They repeat it again on a second day. I need to assess both intra and inter reliability.

For inter-rater reliability, I have taken the mean of the two days for each rater, and used ICC(2,1) as I am interested in absolute agreement and single measures. However, what statistic would you use for intra-rater reliability/test-retest between the days? I know you mentioned Pearson’s earlier in response to a post, but recent texts have recommended that because Pearson’s is a relative form of reliability, a better approach is to use ICC for the relative form of reliability and SEM as an absolute measure of reliability. I wonder if you could share your views?

Thanks…. Pedro

The specifics of this differ a great deal by field – different areas have different “best practices” that are common, which I can’t really speak to. I’m not sure which “SEM” you are referring to (there are several), but this might be best practice in your field. In psychology, the answer is “it depends.”

If you believe that scores over the two days _should_ change (i.e. if differences between days can be due to variance in true score between days), then it is not appropriate to calculate reliability over time at all. If you believe that scores over the two days _should not_ change, then I am honestly not sure what you would do – it is not a common measurement situation in psychology to assess the same construct twice over time without expecting it to change.

Dear Dr. Landers,

Thank you for your unbelievably quick response; I have waited months for people in the statistics department of my University to get back to me. The SEM I was referring to was Standard Error of Measurement calculated as SEM = SD x SquareRoot (1 – ICC), but I shall consult my field on this.

Thanks for your honest reply regarding reliability over time. I would not expect them to change, but I shall investigate further.

One final question if I may… would it ever be appropriate to compute both an ICC to assess consistency and a separate one to assess absolute agreement and report together?

Thank you once again!

Ahh… using SEM would be a very unusual standard. Although it does capture reliability, it is an unstandardized reliability (in the terms of the original measurement). Usually, when we report reliability (on theses, in journal articles, etc) we are doing so to give the readers a sense of how reliable our measures are, in a general sense (i.e. a reliability coefficient of .80 indicates 80% of the observed variance is “true” variance). When you report the SEM, that information is lost (i.e. the standard error of measurement is interpreted, “on average, how far do observed scores fall from true scores, in the units of the original measure”). That is not usually terribly meaningful to a reader trying to evaluate the merits of your research.
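For reference, the standard error of measurement formula quoted in the comment above (SEM = SD x SquareRoot (1 – ICC)) can be sketched as follows. The function name and the example values (a standard deviation of 10 milliseconds and an ICC of .80) are hypothetical.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM in the units of the original measure: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical reaction-time data: SD of 10 ms, ICC of .80.
sem = standard_error_of_measurement(10.0, 0.80)
print(round(sem, 2))
```

Notice that the result changes with the scale of the measure, which is what I mean by "unstandardized": the same ICC produces a different SEM on a measure with a different standard deviation.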

Remember, reliability and measures of reliability are different. Each measure x measurement situation interaction has one and only one “reliability” – you are just trying to figure out the best number to capture that reliability given what you’ll be doing with those numbers later. It’s not the right approach to say “I’ll just calculate both” because then you are implying that both are meaningful in some way.

Instead, you should identify your goal in reporting reliability, and then choose a measure to capture that given your anticipated sources of error variance (e.g. time, inter-rater, inter-item, etc). If you are going to calculate statistics that only rely upon consistency on this data later (e.g. means, correlations, odds ratios), a consistency measure is appropriate. If you are going to use statistics that require precision, an absolute measure is appropriate. If you’re going to use both, then use both. But don’t calculate both just to report both.

Thank you for your advice; exceptionally clear, detailed and helpful.

Thank you.

Hi Dr. Landers,

Thank you so much for your article- my research supervisors and I have found this extremely helpful.

We do, however, have one question. You mention that there is one restriction for using ICC: there must be same number of ratings for every case rated. Unfortunately one of our data sets does not meet this requirement due to internet issues during online data collection. Is there some sort of accepted correction or cut-off, or must we simply collect more data?

Thank you,

Katelin

Well, keep in mind what you are saying. If you are using mean ratings, having different numbers of raters means that the amount of information you have about the true score for each case will be different. For example, one set of scores might be made up of 90% true scores and 10% error while another set might be 60% and 40%. That may not be desirable for your analytic purposes.

If that doesn’t matter to you, I believe there are techniques to come up with a mean reliability for your sample (in fact, it might even be as simple as a sample size-weighted mean of reliabilities calculated upon each set with different numbers of raters) but I have not used these techniques myself and am not sure what would be best in this particular regard. My approach is usually to randomly choose a subset of raters for each case so that all cases have an equal number of raters and then report the resulting ICC as a “lower” bound – assuming all raters are from the same population, adding raters should only increase reliability, after all. If your only purpose in calculating the reliability is for reporting (and not because you are worried that your scores contain too much error; i.e. your reliability is already high using this technique), this is probably sufficient.
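The random-subset approach can be sketched roughly as follows, assuming each case is stored as a list of however many ratings it received. The function name, the seed argument, and the example ratings are hypothetical; the balanced table it returns is what you would then analyze with ICC(1).

```python
import random

def equalize_raters(ratings_by_case, seed=0):
    """Randomly keep the same number of ratings for every case.

    The number kept is the smallest number of ratings any case received.
    A fixed seed makes the random selection reproducible.
    """
    rng = random.Random(seed)
    k = min(len(ratings) for ratings in ratings_by_case)
    return [rng.sample(ratings, k) for ratings in ratings_by_case]

# Hypothetical data: cases received between two and four ratings each.
ratings_by_case = [[4, 5, 3], [2, 2], [5, 4, 4, 5], [1, 2]]
balanced = equalize_raters(ratings_by_case, seed=42)
print(balanced)
```

Because ratings are thrown away, the ICC computed on the balanced table describes fewer raters than some cases actually had, which is why I would report it as a lower bound.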

Dear Dr. Landers,

Thank you for the clear information about ICC.

I still have a question about the ICC I would like to ask.

I had 33 parent couples fill in a questionnaire of 45 questions (5-point Likert scale) about their child.

I used Kappa for the inter-rater reliability of the individual questions and now I would like to use an ICC to measure inter-rater reliability of fathers and mothers on the 7 subscales.

Which ICC should I use? And should I use the single or average measures as shown in the SPSS Output?

Thanks in advance,

Florien

Since you have different raters for every case, you must use ICC(1). However, keep in mind that this assumes you believe the two parents to have identical true scores (i.e. you don’t expect their ratings to differ by anything but chance). Single vs. average depends upon your answers to the questions described in the article, i.e. it depends upon what you want to do with your subscale estimates. However, your question doesn’t really make sense to me, because if you used Kappa, that implies that you don’t have interval/ratio measurement and cannot compute ICC in the first place.

Dear Dr. Landers,

Thank you for your simple explanation. I would like to ask you about the significance level. I have conducted observations, which have been rated by two coders. I’m using the two-way mixed effects ICC with a cutoff where values below 0.4 are considered poor. Of all 78 observations, several have an average measures value higher than 0.40 but a significance level greater than 0.05. Does the significance level play a role in selecting which observations to include in my ANCOVA analyses?

Thank you very much for your help on this matter.

Regards,

Iylia Dayana

I am mostly confused by your comment, because it is never appropriate to include or exclude cases as a result of reliability analyses. You should always include all cases unless you have a specific reason to suspect poor quality data (e.g. an error in coding). Otherwise, removing cases capitalizes on sampling error and artificially inflates your test statistics – removing them is unethical, in this case. If you calculated reliability correctly, you should also not have ICCs for each case – rather, you should have ICCs for each variable. Two ratings of one variable across 1000 cases will produce only one ICC. So I am not sure how to answer your question.

Dear Dr Landers,

This was extremely useful, thank you.

How would you calculate degrees of freedom for ICC? I have two fixed raters, and 30 observations.

Simon Morgan

P.s. I have used a two-way mixed effects model with absolute agreement – there are only two raters and the same sample of observations are being rated. If one rater is to go on to rate the entire population of observations, does this mean single measures is the most relevant statistic to use?

Thank you in advance

I am not sure what DF is for ICC, because I have never needed to test the significance of an ICC (this is a fairly unusual need – you’re asking “in a population where ICC = 0, would I expect to find an ICC this large or larger in a sample of this size?”).
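For what it is worth, here is a minimal sketch under the assumption that the test is the usual F ratio against a null value of zero: the between-cases mean square over the error mean square for two-way models, or over the within-cases mean square for the one-way model. The function name is my own, and you should check the result against the df1 and df2 columns in your own SPSS output.

```python
def icc_f_test_df(n_cases, n_raters, two_way=True):
    """Degrees of freedom for the F test of ICC = 0.

    Two-way models: (n - 1, (n - 1)(k - 1)).
    One-way model:  (n - 1, n(k - 1)).
    """
    df1 = n_cases - 1
    if two_way:
        df2 = (n_cases - 1) * (n_raters - 1)
    else:
        df2 = n_cases * (n_raters - 1)
    return df1, df2

# Two fixed raters and 30 observations, as in the question above.
print(icc_f_test_df(30, 2))
```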

As for your second question, yes – if you’re trying to generalize from two raters on a sample to one rater on another sample, you’d want the single measures version – but it is important to note that you will only have an estimate of reliability in your second sample, not an actual measure of reliability in that sample.

Dear Dr Landers,

I was wondering if you could confirm whether ICC is suitable for an experiment I’m analysing?

We have 6 pairs of raters; each person in the pair rates their own performance on 10 tasks and is rated by the other person in the pair (so 20 observations, 2 x 10 observations of self, 2 x 10 of other) .

We want to see how similar/different ratings are for own versus other. I’ve calculated the ICC for each of the pairings (ICC2), but is there a way to get an overall sense of rater agreement across all 12 raters for own/other? Can you average the ICC or is that too crude?

Also, in your opinion is two-way random the correct method (as opposed to ICC3 mixed?).

Many thanks

It sounds like you want the inter-rater reliability on each pair’s ratings. This is easy if you’re not interested in ratings on both members of the pair, i.e. if you have self and other from raters 1 and 2 respectively, but are only interested in ratings made on rater 1 OR rater 2. In such a case, you’d use ICC(1). In your case, however, you have non-independent data because you have two target ratings, i.e. if you included every rating target as a case, you’d have paired cases in your dataset. ICC does not have any way to handle this because it violates the independence assumption (you have introduced a confound of rater type and the individual rater).

In any case, you definitely don’t want ICC(2) or ICC(3) – you don’t have consistent raters for every case. These could only be used if you had the same two people rating all of your cases.

In your situation, I’d probably calculate two ICC(1)s – inter-rater reliability of ratings of self and inter-rater reliability of ratings of other. If the distinction between them is not meaningful (i.e. if self and other are experimentally identical), then you could take a mean of these two ICC(1)s.

It’s good to know what you have explained, but I need to learn about something else I have seen in some papers as well: the average inter-scale correlation (AVISC). Sir, what is this, and how can we calculate it? Using some software, or what other means are possible?

Regards

I have never heard of AVISC, so I can’t help you there; also, this article is about ICC.

Thanks so much, will give it a try!

Sir, I am posting a reference from a research paper: “An empirical assessment of the EFQM Excellence Model: Evaluation as a TQM framework relative to the MBNQA Model,” Journal of Operations Management, 2008.

Discriminant validity

Three approaches were used to assess discriminant validity (Ghiselli et al., 1981; Bagozzi and Phillips, 1982). First, for all scales Cronbach’s alpha was higher than the average inter scale correlation (AVISC) (see 4th column in Table 5). Second, the average correlation between the scale and non-scale items (6th column in Table 5) was lower than between the scale and scale items (5th column in Table 5).

That’s fine – you will still need to research it on your own. I am not familiar with it. It’s not a statistic I’ve heard of, so it might just be the mean intercorrelation between every possible pair of scales (likely squared, averaged, and taking a square root) but I do not know for sure. You will need to read the paper and figure it out.

Dear Dr. Landers,

thank you very much for the article, which has really helped me. However, a few questions remain.

During a scale development process, we have constructed 40 items with the following structure: First, a problem situation is described. Then, four (more or less effective) possible solutions to the problem are presented, which can be rated on a 5 point Likert scale.

As part of the development process, we conducted an expert survey, where 14 problem solving experts rated these four possible solutions (for each item) regarding their effectiveness in solving the problem (on a 5 point Likert scale; they did exactly the same as the “normal” participants will do later).

In order to assess inter-rater agreement (with the goal of detecting “bad” items), I calculated ICC(2, 14) (consistency) for each Item (40 flipped datasets with 4 rows and 14 columns).

My questions:

Is this the right ICC I have chosen? For most Items, my ICC is very high (> .90). Is this “normal”? Descriptively, the agreement is quite good, but far from perfect.

In addition, I want to calculate an ICC for the whole scale (all 40 Items together; Items assess different facets of the same construct). If I remember right, you stated that in order to calculate the ICC of a scale, one should calculate the mean of the scale (for each participant) and then calculate ICC with these means. To me, this doesn’t make any sense, as I would then have only one row in my dataset, which makes it impossible to calculate anything.

Sorry for the long post; I guess the structure of my scale is a bit more complex than usual.

Thanks very much for your reply!

Tom

This actually sounds identical to something we have in I/O psychology called a situational judgment test. You might find helpful the more targeted discussions of reliability in the SJT literature. I am not super-familiar with SJTs, and there are probably specific techniques used for SJTs that will address the problem you are having. But in any case, you are right to suspect an ICC over .9 – any time you see a number that high, something is probably wrong somewhere.

I’m not sure your dataset is set up right. If you have 40 items with four situations each, you essentially have 160 items being rated. That would be 160 datasets, if you were interested in the reliability of each item on a target sample. Normally, you would take a mean of each dimension or scale and examine that instead (e.g. if you had four dimensions, you’d average across 40 items for each dimension within each rater and calculate ICC across four datasets). ICC may not be appropriate because you have no rating targets. We’d normally be interested in ICC when making ratings on some target sample – e.g. if 14 raters were examining 160 items on each of (for example) 50 experimental subjects (i.e. 14 x 160 x 50 = 112000 ratings).

In your case, you are missing the “subject” dimension – there is no target sample. The way you’ve set up your datasets, you are treating each problem situation as an experimental subject, i.e. you are assuming the problem situation is an independent rating target (which it may not be, since they are dependent upon their attached situation), and ICC assesses how consistently raters assess problem situations across the four solutions. So you may be violating the independence assumption of ICC, and you also may not have a valid sample (i.e. each of your four solutions must be considered a random sample from a population of solutions). I suspect there is a more standard way to examine reliability in SJTs, but I honestly don’t know what it is – but it is probably worthwhile for you to look into that literature for how reliability is assessed during SJT development.

Okay, thank you very much for the long and fast reply, many things do seem clearer right now. And especially thanks for the hint with situational judgement tests. I will look into it.

Dear Dr. Landers,

I’ve read all the questions on this excellent post, but I’m still not sure what to do about my own research. I have 25 essays, each rated by two raters on six criteria (resulting in a score between 1 and 10 for each criterion). In total there are five raters, each of whom rated between 8 and 12 essays. I want to know how consistent the raters are in their rating.

I thought that an ICC, one way, would be most useful and that I should look at the average measure. So ICC (1,5) is the notation in that case? My questions are:

1. Am I right about choosing ICC 1?

2. I’ve put the raters in the columns, the score for each criterion in the row. So some raters have 6 (criteria) × 12 scores (essays) and others have 6 × 8 scores. Is it a problem that some raters rated more cases than others? Or should I have computed a mean for each criterion so that there are 7 rows instead of 48 for one rater and 72 for another?

3. In the next phase of my study the same 5 raters will rate another 25 essays, each essay rated by two of them. The same six criteria will be used but the depth of the description of the criteria is different from the first situation. If I do the same analysis I want to see if the second condition leads to a more consistent rating. Is that the right way to do it?

Many thanks if you could help me out with this.

1. ICC(1) is the right choice since you are using different raters for every case. You need to ensure that your five raters are split up fairly randomly though amongst rating targets.

2. That is not the right setup. You should have 12 columns (6 criteria x 2 ratings) and 25 rows (one per essay). You then compute ICC on each criterion pair, one at a time (6 analyses to produce six ICCs, the inter-rater reliability of each criterion).

3. In this case, you’d probably want to look at the confidence interval of ICCs produced the first time and ICCs produced the second time to see if they overlap (no overlap = statistically significant = different ICCs). I am not sure if the sampling distribution is the same in these two cases though, so that may not be a valid comparison. But that is the best I can think of, given what you’ve said here. If you are interested in determining changes in ratings from one set of criteria to the other, I’d probably have had half of your raters rate all 25 essays with the old descriptions and the other half of your raters do so a second time with the new descriptions, using a simple independent-samples t-test to analyze the results (or possibly MANOVA).
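For readers who want to check the SPSS output by hand, here is a minimal Python sketch of the one-way random ICC for a single criterion, assuming the layout from point 2 (one row per essay, one column per rating). The function name `icc_one_way` and the scores are invented for illustration; the formulas are the ICC(1,1) and ICC(1,k) definitions from Shrout and Fleiss.

```python
def icc_one_way(ratings):
    """Return (ICC(1,1), ICC(1,k)) for rows = rating targets, columns = k ratings.

    Rater identity is ignored, which is what the one-way random model assumes.
    """
    n = len(ratings)       # number of targets (essays)
    k = len(ratings[0])    # ratings per target
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Between-target and within-target mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    average = (ms_between - ms_within) / ms_between
    return single, average


# Hypothetical scores on one criterion: five essays, two ratings each.
scores = [[7, 8], [5, 5], [9, 8], [4, 6], [6, 6]]
single, average = icc_one_way(scores)
print(f"ICC(1,1) = {single:.3f}, ICC(1,2) = {average:.3f}")
```

You would run this once per criterion, which mirrors the six separate analyses described above. The sketch does not guard against the degenerate case where every essay has the same mean score.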

Dear Dr. Landers,

I am hoping you can provide some insight on my proposed design for rater allocation. Currently, I am proposing to use a one-way random model, as all subjects are not scored by all raters. I have 20 subjects to be scored, and have access to 8-9 raters. Therefore, I have 2 options:

(1) 8 raters. Pair raters so each pair scores 5 subjects. This results in a large overlap (e.g., rater pair 1,2 will score subjects 1-5; rater pair 3,4 will score subjects 6-10, etc).

(2) 9 raters. Raters not paired. As such, each subject assigned unique rater pair (e.g., rater pair 1,2 scores only 1 subject).

Since I am using a one-way random design, I would think the 2nd option would be most appropriate since each subject is assigned a random pair of raters. However, since the subject factor is the only source of variation, I am not sure it matters either way.

Any insight would be sincerely appreciated!

Many Thanks!

It doesn’t really matter as far as ICC is concerned which approach you take as long as its assumptions are met, especially, in this case, that your sample of raters is randomly drawn from a population of raters. However, I would choose the second approach for basically the reason you mention. Because there is a possibility that one of your raters is not as high quality as the others (i.e. that you don’t actually have a completely random sample of raters), I would consider it a little safer to randomly distribute them among ratees – that way you wouldn’t have a consistent bias within the rater pair. But if you feel confident that your raters meet the assumption, it doesn’t really matter.

Thank you for your very helpful response. I have another follow-up question. As I previously stated, I am using a one-way random model with 20 subjects. I wish to calculate the reliability of 3 different components of a scoring rubric (ordinal scale):

(1) composite score, range 0-50.

(2) sub components #1, range 0-6.

(3) sub components #2, range 0-2.

I was planning on using an ICC(1,1) model, along with percent agreement to help determine if low ICC values are a result of low variability. For the sub component range of 0-2, I am not sure if an ICC model is appropriate. I know another option is weighted kappa, but given my one-way design, I am thinking this would not work. Fleiss (1973) showed that weighted kappa and ICC were equivalent; however, this was for the 2-way model.

Any insight is appreciated!

It has been a while since I read Fleiss (1973), but I believe that is a correct interpretation – weighted kappa is equivalent to ICC(2) when data are ordinal and coded appropriately (integers: 1, 2, 3…k). However, I don’t see any reason you’d use different reliability estimates for a scale and its subscales.

Thank you for the very helpful page describing the differences between ICCs. I have an interesting study design and I am wondering if you could give some insight on how to get a 95% CI for the ICC?

In this design we have three separate groups of raters (~n=30, n=36, n=45) that were each asked to evaluate (yes/no) three different sets of 9 subjects each. Each set of 9 subjects was chosen to represent a range of disease severity, and the idea was that each set of 9 subjects would be fairly comparable. It is easy enough to calculate an ICC and 95% CI for each set of 9 subjects, but the challenging thing is to combine them to get an overall average ICC estimate and 95% CI for this average.

This would be simple to do if an SE for the ICC was provided and normal theory was used to construct the CI. However, all implementations that I can find for the ICC (in particular, I am using ICC(2,1)) provide a 95% CI but no estimate for the variance of the ICC.

Unfortunately, bootstrapping doesn’t seem reasonable here due to the fact that the 9 subjects were not randomly selected (not to mention the fact that there are only 9/survey). I also considered Fleiss’ kappa for multiple raters rather than ICC, but there is only an SD available for the null hypothesis (of no agreement) rather than an SD for the sample.

Do you have any ideas?

Thank you,

Anja

Since the 3 sets of 9 were not randomly selected, you are explicitly stating that you do not have a randomly drawn sample from a given population. A confidence interval will not be meaningful in such a context (nor will the mean, for that matter). But if you want to assume that your raters are all drawn from the same population of raters and your subjects are drawn from the same population of subjects, I would probably just throw them all into one dataset and use ICC(1,1). If you have any such assumption violations, however, that won’t be a valid approach – and I don’t know of any alternate approach that would get around that.

Thank you for the very prompt reply! I did find an implementation of ICC that will run for the combined data in spite of the high degree of missingness, since each set of raters rates a different set of 9 subjects and there are no overlaps between the sets of raters. (I had tried this approach originally but the original ICC implementation that I found would not run.)

Yes, you are right I should use ICC(1,1) since the subjects are not randomly sampled.

It is interesting that the ICC(1,1) within each group was higher than the combined data (although the CIs overlapped):

group1: 0.45; 0.26 < ICC < 0.76

group2: 0.33; 0.17 < ICC < 0.66

group3: 0.40; 0.23 < ICC < 0.72

combined ICC(1,1) : 0.18; 0.12 < ICC < 0.30

I agree that the interpretation of a mean and a 95% CI are odd in this case (both within each set and across the three sets). It is agreement among randomly selected raters across a set of cases that represent the disease spectrum (and not the average case evaluated in practice). This design was chosen to help balance the sets and also to be able to observe evaluations at both ends of the disease spectrum. However, in terms of evaluating "overall agreement" this is certainly a study design limitation.

Thank you again for your help!

Actually, I did some probing around and I think the combined ICC is low due to the missing data which violates the ANOVA assumption that the ICC calculation relies on. Probably the best bet is to bootstrap the surgeons.

Dear Dr Landers,

I would be grateful if you could help me with an interpretation query?

I have calculated 2-way random ICC, absolute agreement, on some rating scale data (13 raters, all scoring 16 ratees on 12 independent performance measures). We are interested in the answers to 2 questions,

1) From a theoretical perspective (for developing the performance measurements), in this study how reliable were the raters at assessing each of the 12 performance measures?;

2) From a practical perspective, how reliable will these measurements be for assessing performance in the future, which will be done by a single rater?

To answer 1) am I right in thinking I need to look at the average measures coefficient and 2) the single measures coefficient?

If so, the coefficients are very different. The average rater coefficients are between .70 and .90 and therefore we had reasonable agreement. However, the single measure coefficients range from 0.18 to 0.51, which on the same scale are very poor. Does this mean our performance measures are only likely to show reasonable reliability when used by multiple raters but not a single rater?

Many thanks

Yes, yes, and yes. At least as long as there is no sampling problem with your raters (i.e. if one or more of your raters are substantially poorer quality than the others, or if there are non-randomly distributed individual differences among raters).

Great, thank you! We are investigating potential differences between raters next. They are a representative sample of our population of interest, but we suspect that some raters are better than others.

Dear Dr. Landers,

I am conducting an ICC(1,1) study. I am looking at both inter-rater and intra-rater reliability using 20 subjects and 7 raters. However, this is in regards to a scoring system, so technically there is a “correct” answer. Therefore, even though I may find acceptable ICC values, it doesn’t mean they are valid. Consequently, I wish to compare an expert’s scores to that of my raters. This is to establish face validity, as it is subjective. I am not sure if I can simply use ICC(1,1)? I do not think a Pearson correlation would be appropriate. I would be comparing 20 ratings computed from 7 raters to 20 ratings computed from 1 expert. It is also worth noting that the scoring scale is 0, 1, or 2 (ordinal). Technically, a Pearson correlation would not be appropriate, only Spearman’s.

Any insight is appreciated!

You are correct that ICC does not assess absolute agreement with a population value; only agreement among raters (i.e. do the ratings center around the same value?). There are a couple of approaches. Since you are using ICC, your data must be interval+ level measurement, so if you want to treat your expert’s score as perfect measurement (i.e. population values), I would suggest a simple z-test. No need to make it more complicated, especially if you’re just trying to provide evidence of face validity. Pearson’s/Spearman’s won’t work since you are assuming the expert’s judgment to contain population values (both Pearson’s and Spearman’s assume each variable is randomly drawn as a sample from a population).

Thank you for your helpful response! I have another question. I am also considering assessing the intra-rater reliability of both an expert (n=1) and trained raters (n=7). For the trained raters, I am using ICC(1,1) since no subject is scored by all raters. Using an ICC(2,1) design for trained raters isn’t plausible due to the time commitment required to score 20 subjects (needed for sufficient power). However, for the expert, it is possible to have him score all 20 subjects. Therefore, I can use ICC(2,1) for the expert.

I am wondering though if I should report ICC(1,1) for the expert, rather than ICC(2,1) to make results more comparable to trained raters? I am assuming the expert’s inter-rater reliability is superior to trained raters and I am not sure how to reflect this if I calculate each with a different ICC model.

Thank you for your help!

If you only have one expert rater, you can’t use ICC at all – ICC requires at least 2 raters (i.e. you must have a sample). So I am not sure what you mean by ICC(2,1) in this context. Any desire to calculate reliability for the expert also means that you are not assuming the expert rating to be error-free, which means you can’t use the z-test I recommended above. By definition, if you think unreliability will be a problem, you don’t have a population.

It might be helpful, in terms of research question framing, to think about which populations of raters you are interested in and which samples you actually have of those populations. One case does not a sample make. It sounds like you may have an n=1 sample from one expert population (useless for determining reliability since you need n>1 to have a sample) and one n=7 sample from a non-expert population. You either need to assume your expert is error-free or get a second expert.

I should have been more clear. Your z-test should work since I have an error free expert. This expert is allowed to use measurements to obtain correct scores.

However, for reliability, this same expert cannot use measurements to obtain “exact” scores. Therefore, his scores are not error free. My expert is from a sample from an expert population (n>1). I wish to calculate though the individual intra-rater reliability by having selected expert (n=1) score 20 subjects on 2 separate occasions. However, for the other non-expert raters (n=7), the same 20 subjects are scored, but using ICC (1,1). In this case, can I use ICC for the expert? And if so, which model?

Thank you again!

I don’t understand “I have an error free expert” followed by “his scores are not error free”.

You can’t use ICC if you have only one rater. Remember that all reliability estimates measure true score variance as a proportion of total observed variance. If you have zero observed variance between raters (sample of one), there is no variance to explain (reliability does not apply because you can’t divide by zero).

If you’re interested in test-retest reliability, you have 20 pairs of observations, so you can use a Pearson’s correlation between Time 1 and Time 2 data. But as an estimate of reliability, this does assume zero inter-rater variation.

At this point, you may need a statistical/methodological expert to be part of your project – it sounds like you have an unusual design, and someone on a local team may be best to work through these issues.

Dr Landers,

I wonder if you could clarify a few questions for me.

My study is examining inter-rater reliability of 3 raters who receive training on performing 5 tests compared to 2 raters who had no training.

Each rater performs all 5 tests on the same 20 subjects.

I would like to be able to generalise my findings to a wider population of raters, i.e. will these tests be reliable in clinical practice.

Questions

1. Should I use a two-way random model?

2. Is it possible for all raters combined to get a higher ICC value than each individual group of raters?

(e.g. Trained raters ICC value 0.93

untrained raters ICC value 0.83

all 5 raters ICC value 0.96)

If so why is this ?

Regards

Fran

1. Yes – I would use two 2-way random models, one for trained and one for untrained. Since you are trying to generalize to practice where a single person will be making that judgment, that would be ICC(2,1).

2. I assume you are asking why this can occur mathematically, and based upon your values, it looks like you are calculating ICC(2,k). The reason is that adding additional raters, as long as they are of similar skill at making ratings (same population), will increase the ratio of true score variance to total variance (each additional rater will add error variance, but this variance will be uncorrelated with that of other raters; whereas the true score variance WILL be correlated). If you look at ICC(2,1), the values computed on all 5 raters should be somewhere between the ICC(2,1) of each of your other groups.
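To make the effect of adding raters concrete, here is a small Python sketch. It relies on the fact that, under the Shrout and Fleiss formulas, the average-measures ICC equals the Spearman-Brown step-up of the single-measures ICC. The function name `spearman_brown` and the single-rater value of .70 are hypothetical, chosen only for illustration.

```python
def spearman_brown(single_icc, k):
    """Reliability of the mean of k raters, given single-rater reliability."""
    return k * single_icc / (1 + (k - 1) * single_icc)


# Hypothetical single-rater ICC of .70, averaged over 2, 3, and 5 raters.
for k in (2, 3, 5):
    print(k, round(spearman_brown(0.70, k), 3))
```

With the single-rater value held fixed, the average-measures value climbs as raters are added, which is why a pooled five-rater ICC(2,k) can exceed the ICC(2,k) of either subgroup.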

Dr. Landers,

I am hoping you could provide me with some guidance. I am the primary researcher and will be coding 70 videos using a 15-item scale. I will be the primary coder of all of the videos. I have two co-raters, each coding 35 of the videos. Which ICC should I use to run the reliability analyses considering that one of the raters remains constant while the other varies?

Thank you!

Diana

If the same raters do not rate every case, you must use ICC(1) because there is no stable rater effect to control for.

Thank you very much! My reliability was found to be fairly low (.5 range) and some variables had ICCs in the negative range. Qualitatively, the raters agreed on many of the cases and there was not a lot of variability in the ratings. Is there a way to account for the restriction of range and lack of variability when computing ICC? Thanks again!

I suppose you might be able to do a range restriction correction to get a population estimate of reliability, but that requires a lot of assumptions that are probably risky, and I’m not sure why you would do it in the first place. If ICC is negative, then some of your raters are responding in a reverse pattern from others – that is very severe disagreement relative to the amount of variance available. A lack of variability will also play out in other statistics (attenuated correlations, for example), which is exactly what a reliability estimate is supposed to tell you. That would suggest to me that you need to re-anchor your scale to increase variability (e.g. add more extreme response options).

Hi Dr. Landers. Discovering this site and reading the comments made my day. Thanks for being generous with your expertise.

Thank you so much! I’ve been scouring the internet trying to understand whether to use average or single measures and you are the only resource that explained this well enough for me to feel confident in my choice!!! Fabulous article!

Hi! Ditto on all those who commend your intellect and generosity! Here’s our scenario!

We would like to develop a new measure to support school admissions decisions. The measure has a rubric by which (once launched) any one rater (e.g., admission staff) would rate any # of new student applicants on 7 dimensions (e.g, academic, social, athletic…). Each dimension has only one rating/score, on a 5 pt likert scale 0-5. The total score would be the sum of the 7 ratings, possible range 0-35. in reality, any one student is rated by only one rater.

in measurement development efforts to date, 8 sample raters (e.g., staff) have used the rubric to each rate 15 different students (total: 120 students). no student has been rated yet by more than one test rater.

interest is in:

1. some indication that different raters can rate students similarly using the rubric once it’s launched.

2. some indication of the reliability of the measure, as used at this time only by the 7 “test” raters. and whether we could/would look at the total score (sum) as well as each of the 7 items.

the question is what’s the best next step w/ these test raters who have limited time to give. e.g.,

1. select a new sample of students (15-20 would be feasible) and ask ALL 8 raters to rate the same 15-20? or, alternatively with (i think) the same effect, ask 7 of the raters to rate the same 15 students the 8th rater has already rated.

2. create 4 pairs of raters among the 8, and ask each pair to rate its own new set of 15 new students so as to increase the total # of students rated (4 prs x 15 students = 60 students)?

or …! any other recommendation for some simple yet sound method for assessing this measure? Thanks VERY much for any suggestions!

If you can safely assume that each member of your staff is drawn from the same population of raters (i.e. if each is equally skilled at using this measure you’re developing), I would probably ask each of the raters to rate one student in your initial dataset that they haven’t rated yet (total of 240 ratings for 120 students), i.e. 15 more ratings for each rater. If those ratings are biased (e.g. if your staff are already aware of admissions decisions for those other students), then the second approach would be my second choice (60 students). However, I probably wouldn’t pair the raters – I would instead have a random pair rate each (i.e. still have each person rate 15 people, but randomly distribute the pairs). You definitely don’t want to take Approach 1 – the variability in 15 ratees just won’t be sufficient to get a stable estimate of reliability (your confidence interval would be quite large).

Thanks so much for your reply. Just to clarify your 1st suggestion: we’d ask each rater to rate 15 ratees (rather than “one”) that the rater has not rated before. If so, and if i’m thinking clearly (!), this suggests each ratee would end up with 2 ratings, 1 from 2 different raters. and, if so, is there preference if we 1) simply exchange full lists (of 15) among the raters vs. 2) take ~2 ratees from each of the 7 rater lists to distribute a different set of ~15 ratees to each of the “8th” raters? Thanks again!

Yes, that’s right. If you are absolutely comfortable assuming that all of your raters are equally skilled, you can just switch them in pairs since it is probably logistically easier. But I would usually recommend randomly distributing raters among ratees (so that each is rated by a different, random set of 2 raters). If you believe there to be some consistent trait across your lists of 15 that might be rated differently within group (e.g. if Group A is already more likely to be rated highly than Group B), then you might want to counterbalance groups across ratees such that raters always get a mix of groups. But that is really just insurance against unequal skill between raters or interactive effects between rater skill and target true score (i.e. if some raters are more skilled at rating people at the high end of the scale, and others are more skilled at the low end – not a very common situation anyway).

Sir, This is my problem.

There are ten sets of data obtained from ten mothers.

Seven specific questions( 7 variables) were selected from the questionnaire. Ten mothers were selected. Each mother was interviewed by all four data collectors where the same seven selected questions were asked by all 4 data collectors from each mother. The responses for the questions were given scores.

The columns are the scores given for each variable. The rows are the scores given by the four data collectors( there are 4 rows in each data set obtained from each mother).

If the response was ‘Rarely’ the score = 1, if the response was ‘Sometimes’ the score = 2, if the response was ‘Usually’ the score = 3, if the response was ‘Always’ the score = 4.

I want to see the level of agreement between the 4 data collectors in giving the scores for the 7 variables by checking Pearson’s r, using the data obtained from the 10 selected mothers.

You can’t use Pearson’s because you have four raters – Pearson’s only allows comparisons of 2. It also assumes consistency, but it sounds like you would want to know about agreement. As long as you’re comfortable considering the 1-4 scale to be interval level measurement (not a safe assumption in all fields), and if you’re using this data in other analyses, you’d want ICC(2,4). If you’re trying to generalize this measure to future uses by a single rater, you’d want ICC(2,1). If you can’t make the measurement assumption, you’d want Fleiss’ kappa (which is for categorical ratings across more than 2 raters).

For ICC, you will need to restructure your data. You’d want each rater/variable pair in columns (4 columns per variable) and independent cases in rows (10). Then you would calculate 7 ICCs, one per variable, each time using only the 4 rater columns for that variable.
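As a cross-check on SPSS, here is a minimal Python sketch of the two-way random, absolute agreement ICC for one variable, assuming rows are cases (mothers) and columns are raters (data collectors). The function name `icc_two_way_random` is made up, and the ratings are a small illustrative dataset (6 cases, 4 raters), not real interview data; the formulas are the ICC(2,1) and ICC(2,k) definitions from Shrout and Fleiss.

```python
def icc_two_way_random(ratings):
    """Return (ICC(2,1), ICC(2,k)), absolute agreement.

    ratings: list of rows, one row per case, one column per rater,
    with every rater rating every case.
    """
    n = len(ratings)       # cases
    k = len(ratings[0])    # raters
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for cases, raters, and the residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    single = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    average = (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
    return single, average


# Illustrative ratings for one variable: 6 cases rated by the same 4 raters.
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
single, average = icc_two_way_random(example)
print(f"ICC(2,1) = {single:.2f}, ICC(2,4) = {average:.2f}")
```

In the design described above you would call the function seven times, once per variable, each time passing that variable's 10 x 4 block of scores.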

Dear Dr. Landers,

thanks a lot for this helpful explanation! I was trying to follow your instructions, however SPSS stopped the operation telling me that I have not enough cases (N=0) for the analysis.

I did as you said: First, create a dataset with columns representing raters (e.g. if you had 8 raters, you’d have 8 columns) and rows representing cases. The difficulty in my case is that I do NOT have consistent raters across all ratees: I have 310 ideas being rated by 27 raters. Each rater rated only a subset of the 310 ideas, resulting in 3 ratings per idea (for some ideas only 2 ratings). Hence, most entries of my 310 x 27 dataset are empty. Is this the problem?

The ratings were based on 8 criteria with a 7-point Likert scale each, and I have already used the mean values (e.g. 5.5) for each idea rating. I have run the ICC(1) analysis (one-way random).

I would appreciate any helpful comment. Thanks!

I just constructed a small dummy dataset, and figured out that SPSS was excluding all rows where NOT all raters did a rating. In your explanation you provide an example:

For example, if you had 2000 ratings to make, you might assign your 10 research assistants to make 400 ratings each – each research assistant makes ratings on 2 ratees (you always have 2 ratings per case), but you counterbalance them so that a random two raters make ratings on each subject.

This is similar to my case (except that I have 2 or 3 ratings per idea, so this is not consistent). Now I am wondering how I should construct my dataset in SPSS.

The problem is that you will have different reliability for each case – cases with 3 raters will be more reliable than those with 2 raters. You could come up with a way to assess “mean” ICC, but this is not a feature of SPSS. Easiest approach would be to only use two raters for each case, randomly selected from your three raters and report this as an underestimate of actual reliability in your dataset.
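One way to prepare such a dataset is sketched below in Python. Because the one-way model ignores which rater produced which rating, a sparse cases-by-raters table can be packed into two "rating" columns per case. The function name `two_ratings_per_case`, the use of `None` for a missing rating, and the sample values are all assumptions for illustration.

```python
import random


def two_ratings_per_case(sparse_rows, seed=0):
    """Pack a sparse cases x raters table into exactly two ratings per case.

    sparse_rows: list of rows with None wherever a rater did not rate the case.
    Cases with fewer than two ratings are dropped; cases with more than two
    have two ratings drawn at random (without replacement).
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    compact = []
    for row in sparse_rows:
        observed = [x for x in row if x is not None]
        if len(observed) >= 2:
            compact.append(rng.sample(observed, 2))
    return compact


# Hypothetical sparse table: 4 ideas, 5 possible raters.
sparse = [
    [5.5, None, 4.0, None, 6.0],   # three ratings, two will be kept
    [None, 3.5, None, 4.5, None],  # two ratings, both kept
    [None, None, 2.0, None, None], # one rating, case dropped
    [6.5, 6.0, None, None, 5.0],   # three ratings, two will be kept
]
for pair in two_ratings_per_case(sparse):
    print(pair)
```

The resulting two-column table has no empty cells, so SPSS will not discard cases listwise when a one-way random ICC is requested on it.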

Thanks for your comment Dr. Landers! What I have done now is to use 260 ideas where 3 ratings are available and compute ICC(1,3) based on a 260 x 3 dataset. The selection of the 260 of 310 ideas can be deemed “random” because 1 rater simply did not make his ratings due to illness. The correlation coefficient (average) is 0.193 – which confirmed my supposition that the ideas have been rated quite inhomogeneously by the jury.

What’s been interesting is that the same ideas have been rated by idea contest participants, where I have 36 ratings per idea. Here I run ICC(1,36) analysis and the correlation is 0.771 – i.e. the ratings by the participants seem to be much more valuable (can I say this?). This was quite astonishing… although it partially could be explained by the larger number of raters per idea, I assume…

Although you might be able to consider it random, you still have fewer raters for some cases. That .193 is an overestimate of the reliability of your mean score. Also note that this assumes you are taking an average of judgments from each person – if there are any interpersonal processes that kick in later (e.g. if they discuss and come to group consensus), reliability will be even lower.

As for the other, that is not surprising – you have the equivalent of a 36-item scale. Almost any scale of that length would show at least reasonable reliability, even if it were multidimensional (which it likely is). In fact, the fact that it is only in the .7s means that your 36 are not really very consistent with one another. If you want to compare your two rater samples directly, you need to compare the ICC(1,1), not the ICC(1,k). For more detail on scale length and reliability (applies equally well to ICC), see Cortina, 1993.
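To put the jury's .193 (3 raters) and the participants' .771 (36 raters) on the same single-rater footing, the Spearman-Brown relation can be run in reverse. This is only a rough Python sketch: the function name `step_down` is invented, and it assumes both reported values are average-measures ICC(1,k) coefficients computed with the Shrout and Fleiss formulas.

```python
def step_down(average_icc, k):
    """Single-rater reliability implied by the reliability of a k-rater mean."""
    return average_icc / (k - (k - 1) * average_icc)


jury = step_down(0.193, 3)            # expert jury, 3 ratings per idea
participants = step_down(0.771, 36)   # contest participants, 36 ratings per idea
print(f"jury: {jury:.3f}, participants: {participants:.3f}")
```

Under these assumptions both implied single-rater values come out below .10, so most of the gap between .193 and .771 reflects the number of raters rather than better agreement among participants.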

Hi Dr. Landers,

I have tried to make sense of your last comments on my results. In order to test the effect of “number of raters” on the ICC value, I have run a simulation where I compare 100-times ICC of random values (normally distributed ratings with significant sigma for 3 cases of 4 raters, 8 raters and 32 raters)… the result is stunning: While I only get around 30% (30/100) ICC values for the 4-rater-case above 0.7, for the 32-rater case (same random distribution of rating values) 100% of ICC values exceed 0.7

Now I understand why you said “Almost any scale of that length would show at least reasonable reliability”… But here’s the problem: What can I do to show (or argue) that the reliability of my 36 raters with ICC of 0.771 is actually quite bad?

If you’re interested, here’s my simulation: http://user-ideas.com/ICCtest.xlsm

Thanks again for helping

When in doubt, it is always good to check for yourself. It is one of the things I talk about in my grad research methods course – if you ever see a scale with a moderate reliability but a very large number of items, it is either not a well-designed scale or the underlying construct is multidimensional.

In your case, I would take one of two approaches: 1) cite Cortina, 1993 who talks about this very issue (if I’m remembering this correctly, the table that I think would be particularly interesting is one where he simulates the joint effects of dimensionality vs. scale length on coefficient alpha) or 2) compare ICC(1,1)s. It is not meaningful to compare an ICC(1,3) with an ICC(1,36) anyway for the reason you just simulated.
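For anyone who wants to reproduce that kind of check without a spreadsheet, here is a hedged Python sketch of a similar simulation. It is not the commenter's workbook: the function names, the variance settings (true-score SD of 1, rater-error SD of 3, which implies a single-rater reliability of .10), and the 50 cases per replication are all assumptions chosen for illustration.

```python
import random


def icc_1k(ratings):
    """Average-measures one-way ICC, ICC(1,k); rows = cases, columns = ratings."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    means = [sum(row) / k for row in ratings]
    ms_between = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / ms_between


def simulate(k, n_cases=50, true_sd=1.0, error_sd=3.0, reps=100, seed=1):
    """Mean ICC(1,k) over `reps` simulated datasets with k raters per case."""
    rng = random.Random(seed)
    results = []
    for _ in range(reps):
        data = []
        for _ in range(n_cases):
            true_score = rng.gauss(0, true_sd)
            # Each rating = the case's true score plus independent rater error.
            data.append([true_score + rng.gauss(0, error_sd) for _ in range(k)])
        results.append(icc_1k(data))
    return sum(results) / reps


# Same rater quality in every condition; only the number of raters changes.
for k in (4, 8, 32):
    print(k, round(simulate(k), 2))
```

Because the single-rater quality is identical across conditions, any difference in the printed averages is attributable to the number of raters alone.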

Now I have a final question.

I would appreciate it a lot if you had an answer: I have seen that with my 310 ideas and 3 raters per idea (average) I get a pretty “bad” ICC(1,r) value. Now what I would be interested in is the question of whether the raters were at least able to (reliably) classify the ideas into groups (e.g. top-5%, top-10%, top-15% etc.), and especially whether they agree on the “best 25 ideas”. I am a bit confused as to whether this question is actually an inter-rater-reliability question, but I guess yes.

The problem is that I do NOT have rater consistency, i.e. raters only rated a subset of ideas (that’s why I use ICC(1,r)). Hence, I cannot really simply calculate the sum or average of my 3 raters per idea, because this sum/average depends on which raters I select as first, second and third.

My feeling is that I can only calculate the overall sum (or average) of 3 ratings per idea. Would I then use ICC(1,1) for checking rating reliability? Thanks again for your help!

It sounds like you want to polytomize ratings and check their reliability instead of the actual data. If you do that, you are artificially attenuating the variance in the original ratings (i.e. rather unscientifically hiding variance you don’t like). That is not a valid approach.

The lack of reliability means that your data collection did not occur reliably. That is an end-result, not a problem to solve with data analysis. The scales might have been poorly specified, the raters might not have been trained appropriately, or any of a large variety of other internal validity issues or combinations of issues. If you have poor reliability, you have poor reliability. The only solution is a new round of data collection with those problems fixed.

Hi Dr. Landers,

Thank you for your very helpful post! I am a bit confused about whether I should be using a one-way or two-way random ICC. I am working on a study in which therapists’ sessions are rated to determine if they are adhering to a specific therapeutic orientation. We conduct periodic consensus meetings in which many raters rate the same sessions to see if their ratings are reliable. The specific raters tend to vary from session to session (as does the number of raters rating each session) but they are all from the same pool of raters given that all the raters work in the same lab. Since the specific raters differ from ratee to ratee, I am inclined to go with a One-Way Random ICC. Does this sound correct?

Thank you very much for your help with this!

Yes, that’s right.

Dear Dr. Landers,

I’d like to compute ICC for my study in which I have a rating scale that assesses a construct comprising several dimensions (multidimensional scale); each of the 10 subjects was rated by 2 raters but I have 3 raters working in combination. My questions:

1. Should I calculate/use the mean of the scale (all items) or mean of each dimension for each rater? I gather I shouldn’t calculate ICC on individual items unless I’m in a scale development stage, right?

2. The type of ICC would be ICC 1 (one-way random), and the general standard would be at least 0.7?

Thanks so much in advance.

I would calculate means by dimension by rater by ratee, then examine ICC for each dimension across raters. If an overall mean would be meaningful (ha!) for your multidimensional scale, I’d also calculate the overall mean for each rater by ratee. These estimates give you somewhat different information, so it depends on what you want to use those scores for later.

You should use ICC(1) since your rater identities vary by ratee. 0.7 is a reasonable standard as far as standards go, and that is all most journals will expect. Imperfect reliability only serves to make your observed scores less accurate representations of true scores (and also makes it more difficult to achieve statistical significance). Bigger is better – aim for .9 in your scale development process, if you have the choice.
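For readers who want to see what the one-way model is doing, here is a minimal Python sketch of the Shrout and Fleiss ICC(1,1) and ICC(1,k) formulas computed from one-way ANOVA mean squares. The function name `icc_one_way` and the example ratings are hypothetical, and the sketch assumes every ratee has the same number of ratings. It is an illustration, not a replacement for the SPSS procedure described in the article.

```python
def icc_one_way(ratings):
    """Return (ICC(1,1), ICC(1,k)) for a list of rows, one row per ratee.

    Each row holds k ratings. Which rater produced which rating is ignored,
    which is the point of the one-way model. Assumes balanced data.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Between-ratee and within-ratee mean squares from a one-way ANOVA.
    bms = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    wms = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    single = (bms - wms) / (bms + (k - 1) * wms)
    average = (bms - wms) / bms
    return single, average


# Hypothetical data: 5 ratees, 2 ratings each (made up for illustration).
example = [[4, 5], [2, 2], [5, 4], [1, 2], [3, 3]]
single, average = icc_one_way(example)
print(f"ICC(1,1) = {single:.3f}, ICC(1,k) = {average:.3f}")

# Reordering the ratings within each ratee does not change the estimate.
reordered = [row[::-1] for row in example]
print(icc_one_way(reordered))
```

Because the one-way model only uses between-ratee and within-ratee variance, it does not matter which rating is listed "first" or "second" for a given ratee.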

Thanks so much Dr. Landers. I really appreciate it. I wish u wrote a stats book for psychology, not just for business :))

Hi Dr Landers,

Thank you for creating and maintaining such a helpful resource!

I’m considering using ICC for IRR in my current research.

I have ratio data and will have either 2 or 3 coders in total.

I have currently coded all the data (about 200 participants), but intend for subsequent coders only to code data for 50 or fewer participants, given that it’s a lengthy process.

3 questions:

1. Is ICC appropriate in this instance?

2. Would I be using ICC(1) because the additional coders will not be coding all 200 participants?

3. Is there a recognised proportion of the data that subsequent coders have to code, such that the IRR derived speaks to the reliability of the scoring system in general?

Thanks in advance,

Daniel

1. Yes.

2. It could be either. If you’re going to add another coder to yourself, i.e. a total of 2, and you both have assessed 50, then you can calculate ICC(2) on those 50 cases. But to get a more accurate measure, I’d recommend trying to get every case coded by at least 2 people – so, for example, if you’ve coded 200, to ask each of your other two raters to code 100 (and then calculate ICC(1)). Remember that you can only calculate ICC when you have 2 or more ratings – if you only have 1 rating, that score will not be used in reliability calculations.

3. This is a somewhat dangerous practice. When you calculate an estimate of reliability for your sample, you are trying to capture the unreliability inherent to your particular measurement situation. If you only code a subset, you are calculating an estimate of an estimate of reliability. So I’d recommend avoiding that. I have seen published research that takes this approach (e.g. in meta-analysis) but its appropriateness is going to vary widely by context. I don’t think you could get away with less than 50%, but the closer to 100% you can get, the better.

Hi Dr. Landers,

I have done repeated measurements to determine the repeatability of a device. I want to see how repeatable the measurements are between days instead of between raters. Is it possible to use the intraclass correlation in this case? There are about 10 people that are being measured and the measurements between subjects are not expected to be similar, if that makes a difference.

I suppose you could – the better approach would be to use hierarchical linear modeling to explicitly model the over-time effects and report the effect size estimates (which include ICC, I believe) from those analyses. Your sample size is quite small for this, either way.

Hi Dr. Landers,

Thank you so much for this article! It’s helped deepen my understanding of the ICC statistic, even after several attempts at reading the Shrout & Fleiss article, haha.

I have a question about the ICC that I still have yet to answer – can you use the ICC statistic when you only have ONE rater? For my dissertation, I used a new therapy (ACT) to treat a musician with performance anxiety. I was the only therapist in this study, and I had 2 raters independently rate my adherence to the ACT manual, using a scale called the DUACRS which measures ACT adherence. I’ve noticed that most examples of when to use the ICC involve multiple raters and multiple ratees. However, I’m wondering if I should use it in my study to reflect the inter-rater-reliability of the 2 raters for my adherence (I’m the only ratee)?

One solution I have is that the DUACRS has 3 sections to it (one for ACT adherence, one for CBT therapy, and one for Behavior therapy), and instead of having multiple ratees as a variable, I can have the multiple therapy styles be rated? For example, rather than entering multiple ratees into the rows of SPSS, I can enter “therapy type” ? Obviously the columns in SPSS will still be for raters (rater 1, rater 2).

So, visually this would look like this:

Rater 1 Rater 2

Therapy Type

ACT % adherent % adherent

CBT % adherent % adherent

Behavioral % adherent % adherent

What do you think? I greatly appreciate any feedback you can give me, as my dissertation would benefit from your expertise!

Cheers,

Dave

Dr. Landers,

I apologize the formatting of my previous email was bad. Basically, the Y axis variable in my study should be THERAPY TYPE (ACT, CBT, Behavioral) and the X axis variable should be RATER (Rater 1, Rater 2). And there will be 6 pieces of data, a percentage of adherence for each of the 6 conditions.

I hope that clears it up for you! Again, I greatly appreciate your help and have recommended this article to my dissertation chair!

Dave

It really depends on what you are trying to do with these estimates. I think the tendency is for people to think “I need to calculate some sort of reliability estimate for this paper” without remembering that we report reliability for a reason – it attenuates the very relationships and differences we are trying to investigate. By reporting it, we are telling the reader how much smaller the relationships we found are than they should have been if we’d had perfect measurement – and that is one of the reasons we can’t draw many substantive conclusions about a null result.

In your case, I am not sure why you are calculating this variable and why you want to know its reliability. If you want to know the reliability of a single rater for some reason, the only way to estimate that is to have 2 raters and then correct downward. Remember inter-rater implies “between raters.” If you only have one rater, there’s no way to know how that person’s scores match up to those of other people.

The data arrangement you are describing seems to imply that cases entered into analysis are no longer independently drawn from a population of interest – that is a violation of the measurement model. So I wouldn’t suggest it.

Dr. Landers

Thanks for your reply! I apologize for not making my case clearer – I actually have 2 raters, and 1 ratee (myself). The question was: can I still use the ICC with only one ratee? All the examples I’ve seen of the ICC online seem to involve multiple ratees and multiple raters (as you pointed out in your reply). However, when there’s just 1 ratee and two raters I’m not sure how to conceptualize it or how to set up the rows and columns in SPSS for data entry.

The only way I thought to set up in SPSS was – having the columns be for raters (rater 1, rater 2) and the rows be for “therapy type” (therapy A, therapy B, therapy C). Normally the rows are for all ratees, but since there’s only one of me that would yield only two pieces of data (one rating from rater 1, one from rater 2). That’s not enough data for me. I want to know how good their level of agreement was for my performance on therapy A, B, & C. Is this still an ICC situation?

Ah, I see. Your idea of SPSS is not right – since there’s only one of you, you’d only have one row of data, not two – with one column for each rater/variable combination.

Agreement and reliability, like correlation, are conceptualized as “proportion of variance explained”. Since you only have one case, there is no variance to explain – so there is no way to determine reliability in the traditional sense.

You could set SPSS up with three cases, one for each type of therapy, but you change the referent dramatically – you are examining a population (in a statistical sense) of yourself – i.e. you are looking at how consistently people rate you as an individual. You would not expect that number to generalize to any other person. So I am not sure what value that number would really give you. If you’re interested in how consistently people are rating you, I would just look at mean differences on each variable across your two raters to see how much they agree in an absolute sense (e.g. “rater 1 was consistently 2 points higher” or “ratings seem random”). You don’t have enough data to do much else.

Thanks for the quick reply again! I am discussing this today with my dissertation chair, and we will take your advice into serious consideration! Your input helps deepen my understanding of the ICC, as I didn’t think it’d be possible to use with only one ratee, me.

I will have to do a simple correlation of their ratings to see if there’s any trend, as you suggest.

Thanks a lot for the fast help,

Dave

Hi Dr. Landers,

I hope you’re doing well. Thank you for your previous guidance with the ICC situation for my dissertation last year, it was very helpful. You may remember, I conducted an N=1 study where I administered therapy on a participant and was then rated by 2 raters on how well I adhered to the therapy manual. You’d told me I couldn’t use the ICC to describe the IRR between the 2 raters in that scenario because there was only 1 ratee, me. My dissertation chair disagreed, but that’s another story…

I have now completed a follow-up study which repeated the same N=1 design. I used the same adherence rating system, where I had 2 raters rate my adherence to the therapy manual again. I’m wondering how I can describe the IRR between the 2 raters in this study ? If I can’t use the ICC value because there’s only 1 ratee and 2 raters, then what test, if any, can I use to describe the IRR between the 2 raters?

Each rater rated the same 3/10 therapy sessions, chosen at random. Their ratings are here, in case it helps:

| | Rater 1 | Rater 2 |
|---|---|---|
| How adherent I was in Session 4 | 0.1875 | 0.22159 |
| How adherent I was in Session 5 | 0.17045 | 0.21591 |
| How adherent I was in Session 7 | 0.10227 | 0.15909 |

You can see Rater 1’s ratings are consistently 0.04 -0.05 units lower than Rater 2’s. Is that the only way I can describe their ratings, or is there another test I can use to formally describe their ratings (i.e., simple correlation) ? The only ratings data I have is what you see here.

Thank you so much,

Dave Juncos
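Following the earlier suggestion in this thread to look at mean differences between the two raters, here is a small Python sketch that does exactly that with the three session ratings quoted above. The variable names are hypothetical, and with only three sessions this is a description of the data, not a reliability estimate.

```python
# Ratings copied from the comment above, keyed by session number.
rater1 = {4: 0.18750, 5: 0.17045, 7: 0.10227}
rater2 = {4: 0.22159, 5: 0.21591, 7: 0.15909}

# Per-session difference (Rater 2 minus Rater 1) and the mean difference.
diffs = {session: rater2[session] - rater1[session] for session in rater1}
mean_diff = sum(diffs.values()) / len(diffs)

for session, d in diffs.items():
    print(f"Session {session}: Rater 2 - Rater 1 = {d:.5f}")
print(f"Mean difference = {mean_diff:.5f}")
```

A consistently positive difference of this kind describes a systematic offset between the raters, which is a statement about absolute agreement for this one ratee and should not be generalized beyond it.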

Hi Dr. Landers,

Thanks for your great post on computing ICC.

I have one question concerning missing data. I actually want to aggregate individual-level responses to the org level and want to compute ICCs. I have 3-10 raters for every organization; in most cases three raters rated the organization. Each rater rates a 5-item construct. How do I compute the ICCs if I want to consider all cases and raters?

Thanks in advance!

Unfortunately, the procedure to do this is much more complicated than what is available in SPSS. You will need to use another analytic technique; I would use Hierarchical Linear Modeling (http://www.ssicentral.com/hlm/).

Dear Dr. Landers,

Thank you for the wonderful post. Could you please guide on the following:

In my study on leadership, a questionnaire with 20 items was used; every 5 items together make up a leadership style scale, for a total of 4 scales.

Leaders from different organizations were selected. Each leader was rated by his/her 2 subordinates on the 20 items. Could you please guide me how to compute Icc1 and Icc2 for these 5 scales in this case. Thank you.

It depends on what you want to do with the scale means. You probably mean ICC(1) and ICC(2) in the typical leadership/group research sense, which often refers to ICC(1,1) and ICC(1,2) in the Shrout/Fleiss sense. So I would calculate the 4 scale means and then calculate ICC for each of them.

Thank you so much Dr. Landers!

Could you please help me a bit more:

5 items make a scale or a leadership construct. Each item has 2 values by 2 subordinates.

Should I compute mean for each item for each leader averaging ratings by 2 subordinates?

OR Should I compute mean for the scale (all 5 items) for a leader using 2 ratings on all 5 items?

Two values, as in Yes/No? If yes, you need to code as 0 and 1. If no, you can’t use ICC.

Again, it depends on what you want to do with it later. If you’re only going to use the scale means in subsequent analyses, calculate the mean across the 5 items for each rater and calculate ICC(1,2) on the scale means across raters (you will end up with 4 ICCs for 4 scales).

Oh I am sorry here 2 values means 2 ratings by 2 subordinates. This is a Likert-type scale 1-5. Then ICC (1,2) will be ICC2. How to compute ICC1 or ICC(1,1)?

Thanks a lot!

Then yes, calculate scale means and use the instructions I wrote above for either ICC(1,1) or ICC(1,2).

Thank you so much for your great help!

Dear Dr. Landers, Thank you for your guidance! What are appropriate values of ICC1 and ICC2 which allows us to aggregate the data?

In my study Each leader was rated by his/her 2 subordinates and I have got some values for the 4 different leadership scales like this (N=150 leaders) from 9 different organizations:

F ratio p-value ICC(1) ICC(2)

2.03 0.000 0.33 0.51

1.76 0.000 0.26 0.42

1.68 0.001 0.25 0.41

2.10 0.000 0.34 0.51

There is not really a hard cut off. Assuming by ICC(2) you mean ICC(1,2), this indicates that only half (or less) of the variance shared between the subordinates could come from measurement of the same construct. The attenuation factor is .7 with that ICC – so any effect size related to whatever you’re trying to predict would be reduced by 30%, which also dramatically reduces statistical power.

I would not trust numbers that low – I would only be comfortable with ICC(1,2) above .80.
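As a rough check on numbers like those in the table above, the one-way ICCs can be recovered from the ANOVA F ratio, since ICC(1,1) = (F − 1) / (F + k − 1) and ICC(1,k) = (F − 1) / F when every target has k ratings. The Python sketch below applies this to the reported F ratios, assuming exactly k = 2 subordinates per leader (an assumption; the comment does not say whether the design was perfectly balanced). The function name `icc_from_f` is hypothetical. The attenuation factor is taken as the square root of the reliability of the averaged rating.

```python
import math


def icc_from_f(f_ratio, k):
    """Return (ICC(1,1), ICC(1,k)) from a one-way ANOVA F ratio.

    Assumes a balanced design with k ratings per target.
    """
    single = (f_ratio - 1) / (f_ratio + k - 1)
    average = (f_ratio - 1) / f_ratio
    return single, average


# F ratios as reported in the comment above; k = 2 is an assumption.
for f in (2.03, 1.76, 1.68, 2.10):
    single, average = icc_from_f(f, k=2)
    attenuation = math.sqrt(average)
    print(
        f"F = {f:.2f}: ICC(1,1) = {single:.2f}, "
        f"ICC(1,2) = {average:.2f}, attenuation = {attenuation:.2f}"
    )
```

The results land close to the reported values but not exactly on all of them, which may reflect rounding of F or unequal group sizes in the original data.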

Hi Dr. Landers,

I am writing again to you (the last time I wrote was Aug 27, we’d discussed using the ICC for only one ratee, and 2 raters, and you’d told me that wouldn’t work) for help understanding an ICC scenario. I was told by my dissertation adviser that I could use the ICC to calculate inter-rater-reliability with only one ratee, since we’re not interested in generalizing the findings of our raters to the population. I went ahead and calculated the ICCs but one was negative, and I don’t know how to interpret a negative ICC value.

To refresh your memory, I administered a therapy for one client, and I had 2 raters independently rate my adherence to that therapy’s manual. The scale they used to rate me has subscales for 3 therapy styles (the therapy I used, plus 2 I didn’t use). As I understand, it’s the raters’ job to rate my adherence to the correct therapy, while also rating my in-adherence to the 2 incorrect therapies. The ICC I’m using would be ICC (3, 2).

After instructing SPSS to do a two-way mixed ICC, looking at consistency, My results were the following:

ICC (Therapy 1, aka the correct therapy) = 0.396

ICC (Therapy 2) = -1.167

ICC (Therapy 3) = 0.591

I don’t know how to interpret the negative ICC for Therapy #2. Do you? Is a negative ICC a reflection of poor inter-rater-reliability? Because the raters’ agreement for Therapy #2 was actually quite high, so I’m confused.

Thanks!

Dave

My understanding is that negative ICCs can be calculated in SPSS because a bias correction is used (although it is normally interpretable as a proportion like all reliability estimates, that proportion has been adjusted to account for small-N bias). Negative ICCs are not interpretable and usually result from peculiar sample characteristics (e.g. near-perfect agreement or near-zero agreement) or possibly violations of the underlying assumptions of ICC. The fact that it is negative would imply that you have an exceptionally small sample size, which would make the size of the bias correction quite large. But that is a bit of a guess – I’ve never looked that deeply into it. You’d need to dig into the SPSS technical manuals to be sure.

Thanks for the reply. If SPSS is constructed in that way, then maybe calculating the ICC’s by hand will give a different result? I can try that, using S & F’s formulas. Hopefully that will change the negative ICC’s to positive values.
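For anyone attempting the hand calculation, here is a minimal Python sketch of the Shrout and Fleiss consistency formulas, ICC(3,1) = (BMS − EMS) / (BMS + (k − 1)·EMS) and ICC(3,k) = (BMS − EMS) / BMS, built from two-way ANOVA mean squares. The function name `icc_consistency` and both data sets are hypothetical. In these formulas the numerator is a difference of mean squares, so the estimate is negative whenever the error mean square exceeds the between-ratee mean square; this sketch makes no claim about what SPSS does internally.

```python
def icc_consistency(ratings):
    """Return (ICC(3,1), ICC(3,k)) for rows = ratees, columns = raters."""
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Partition the total sum of squares into ratee, rater and error parts.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols
    bms = ss_rows / (n - 1)
    ems = ss_error / ((n - 1) * (k - 1))
    single = (bms - ems) / (bms + (k - 1) * ems)
    average = (bms - ems) / bms
    return single, average


# Hypothetical: raters differ by a constant, so consistency is perfect.
concordant = [[1, 2], [2, 3], [4, 5]]
# Hypothetical: raters order the three rows in opposite directions.
discordant = [[1, 4], [2, 2], [4, 1]]

print(icc_consistency(concordant))
print(icc_consistency(discordant))
```

With only three rows, as in the design described above, a small amount of disagreement in ordering is enough to push EMS above BMS, so a hand calculation with the same formulas should be expected to stay negative.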

Thank you for all of your help thus far. I am currently working on my dissertation examining the relationship between stress and parenting behaviors as rated during videotaped interaction. Two raters rated each variable and inter-rater reliability was assessed using ICC. Would you recommend using my single rating or the average of the two ratings for subsequent analyses?

Thank you!

Diana

If you’re trying to determine the relationship between your scale means and the means of other variables/outcomes, the average should contain twice as much information as either rating would alone.
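To put a number on the gain from averaging, the Spearman-Brown formula links the single-measure and average-measure estimates: ICC(k) = k·ICC(1) / (1 + (k − 1)·ICC(1)). The short Python sketch below uses a made-up single-measure value of .60; the function names are hypothetical.

```python
def spearman_brown(single_icc, k):
    """Step a single-rater reliability up to the mean of k raters."""
    return k * single_icc / (1 + (k - 1) * single_icc)


def step_down(average_icc, k):
    """Invert Spearman-Brown: reliability of one rater from the mean of k."""
    return average_icc / (k - (k - 1) * average_icc)


# Hypothetical single-measure ICC of .60 with two raters.
average_of_two = spearman_brown(0.60, k=2)
print(f"Average of 2 raters: {average_of_two:.2f}")
print(f"Back to a single rater: {step_down(average_of_two, k=2):.2f}")
```

If the analysis uses the average of both ratings, the average-measures estimate is the one that describes the scores actually being analyzed.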

Dear Dr. Landers,

Thank you for the foolproof ICC manual. I have a question relating to the rationale to use ICC in my study after reading your explanation of the mixed model. I had 11 raters (purposive sampling from an unknown population) rate the importance of 91 indicators. These indicators were identified and selected to develop a set of indicators. Is it appropriate to use ICC (average measure) to evaluate the reliability of the instrument given that the indicators were fixed, not random? That means the effect of the ratee is fixed, and the raters are a sample but not a random one.

If it’s not justified to use ICC, then do you have any suggestions on which method I can use?

Thanks again and kind regards,

Q. Do

It depends on what you mean by purposive. If you’ve chosen raters that you believe to be representative of a broader population of raters (e.g. experts in the subject area), you can probably treat them as a random sample (though one drawn for their convenience). If your raters have different expertise, you might want to identify homogeneous subsets of raters within that group to examine in sets. I find it doubtful that you have identified 11 raters with completely distinct types of expertise to judge your scale, so you might want to think about what population(s) might be represented. Whether you end up with an overall ICC or subsets, the average is what you want, since you will be interpreting the average score provided by your raters to make judgments about the scale.

Thank you heaps, Dr. Landers.

As the 11 raters are from one area of expertise, and actually I think they are representative or I tried to achieve it at least. So your suggestion for the model in my study is ICC2, isn’t it?

I ran ICC2 and ICC3 in SPSS and the results are exactly the same: .201 for the single and .734 for the average (p=.000). And I’ll use the average-measures ICC of .734.

However, one of my colleagues, who actually referred to your writing, says I can’t use ICC in this case because the indicators, as the ratees, are fixed, not random. What do you think about this?

Many thanks.

Q. Do

The random/fixed effects model distinction does not apply to the criterion, only to the predictors. You only need to worry about the assumptions of the general linear model (i.e. ANOVA, regression, etc) in the DV: it must be 1) normally distributed, 2) interval or ratio level measurement, 3) consist of independently drawn samples, and 4) homoscedastic with respect to raters and ratees. Since you technically have a population of outcomes (every possible indicator you are interested in), you don’t need to worry about the assumption of independence. You could also check the homoscedasticity assumption with a 2-way ANOVA and Levene’s test in SPSS, if you really wanted to, but the general linear model is pretty robust to violations of the homoscedasticity assumption anyway, so it is not something I’d worry about much.

Dr. Landers

Thank you for your dedication to keep the dialogue moving forward on this topic. Your help is very much appreciated!

I am conducting a quasi experiment (intact student teams). My unit of analysis is the student team (4 members each); however, I collected data at the individual (member) level. Each member completed a likert-type survey of 6 items to measure team viability (would they want to work together in the future as a unit). I averaged the 4 members score to create a new score (team viability score).

A member of my committee asked me to use ICC(1) to justify aggregating the individual member data into a team level variable by looking for statistically significant results.

My issue is when I calculated the ICC(1) of each team some of the ICC(1) were negative and not statistically significant. I spoke with another faculty member about the results. He said the reason could be the result of a small n (4). However, only 6 of the 24 teams had negative numbers. Thirteen of the 24 produced non statistically significant results.

I have been unable to find scholarly articles to help me understand how to interpret non statistically significant negative ICC(1) for aggregation justification. Would you say the issue with the negative ICC(1) is the same as you mentioned in the post to Dave Juncos (9/17/13)? I ask b/c I am not looking for reliability, but justification to aggregate.

Again, many thanks for your guidance.

I’d recommend taking a look at LeBreton & Senter (2008). I would not take a statistical significance approach because it is not what you want to know, i.e. if we were to assume there were no agreement in the population, how likely is it that this team would have agreed as much as they did or more? That is not a useful question. Instead, you want to know how much “real” information is contained within each person’s judgment and the overall judgment. That means effect size interpretation. LeBreton and Senter argue that you want ICC(1) – which is ICC(1,1) in the Shrout & Fleiss framework – to be above .05 and ICC(2) – which is ICC(1,k) – above .7.

For negative values, I’d take a look at Question 7 in that article. They deal with interpreting negative rwg, but the issues are similar. In brief, the likely culprits are low variance or violations of distribution assumptions (like normality). The aggregation literature suggests looking at multiple aggregation statistics for that reason – sometimes you will have high agreement but low reliability, and looking at ICC alone doesn’t communicate that well (see Question 18 too).

Thank you for the information and prompt response. I have the LeBreton & Senter (2008) article. I have reviewed it several times, and once more again today. I am now questioning if I am understanding the components of ICC calculations correctly.

From LeBreton & Senter (2008, p. 11): “ICC is estimated when one is interested in understanding the IRR + IRA among multiple targets (e.g., organizations) rated by a different set of judges (e.g., different employees in each organization) on an interval measurement scale (e.g., Likert-type scale).”

Am I correct to run an ICC(1) on each team? Thereby comparing the ratings of each member of the team. The targets are the 6 items in the survey, and the judges are the individual members per team.

OR

Should I run the ICC(1) at the class (4 different classes) or total study (24 teams) level comparing all members of all teams ratings (on the 6 items) against one another?

Background on my study: I am conducting a quasi experiment (intact student teams). My unit of analysis is the student team (4 members each); however, I collected data at the individual (member) level. Each member completed a likert-type survey of 6 items to measure team viability (would they want to work together in the future as a unit). I averaged the 4 members score to create a new score (team viability score).

Very appreciative

I’m afraid you’re on the edge of my knowledge area here. When I do multilevel analyses, I always model multilevel effects explicitly – for example, by using hierarchical linear modeling. That enables you to ask group-related questions explicitly (e.g. do individual-level predictors or group-level predictors better explain the outcome?). I’ve never used ICC to collapse to a single level of analysis myself, so I am not sure about the answer to your question. But based on my understanding of ICC and the LeBreton article, my impression is that you would need to conduct ICC on each team, since you are asking how well individual perceptions represent the group average (the first approach you mention). You wouldn’t do this at the item level though – you’d want to compare the scale averages (i.e. how much of the observed mean score for each team member represents the aggregated team mean?), assuming your items are all on the same scale.

I would probably recommend a multi-level approach though. If you only have 24 teams and treated them quasi-experimentally aggregated to the team level, you have only n=24, which will be quite poor statistical power for between-team comparison (even worse if your quasi-experimental manipulation involves more than 2 conditions).

Much thankful.

Dear Dr. Landers,

Despite using ICC many times in my studies, I have a hard time understanding when I should use single or average measures if my measured variable is an average of N trials. What do you suggest in this case?

Thank you,

Dina

I am not sure who is doing the ratings or on what in the case you are describing. However, if you are using averages of trials, you are probably interested in construct-level conclusions, which means you are probably using average measures for your analyses, which means you would use average measures for your ICC determination as well. However, I will note that if you are averaging ratings across trials, you are missing inter-trial variance in your reliability determination, which may bias your estimate.

Dear Dr. Landers,

To make the case clearer: the study is about evaluating the inter-rater reliability between 2 raters, and the measured variable (e.g. tendon thickness) was the average of 3 trials, where each single trial was an average of 3 measures.

So based on your comment above, I should still take average measures? However, in which way can missing the inter-trial variance bias my results?

Thank you,

Dina

Well, more technically, it just changes the referent. You’re determining the reliability of your raters when taking the average of 3 trials. You would not be able to generalize that number to what would happen with a single trial. If you want to know the “real” ICC, you might be able to use a 3-level hierarchical linear model (trials at level 1, target at level 2, rater at level 3), but I’m not sure – not a problem I’ve faced before.

If the study’s purpose is investigating inter-rater reliability in a particular context (or family of contexts), you will probably need something more comprehensive than a single ICC regardless.

Hello Dr. Landers:

I am trying to compute inter-rater reliability for the Modified Ashworth Scale for rating spasticity on an ordinal scale with five levels (1, 1+, 2, 3, 4). I have two patients and twenty-seven raters. What is the best statistic for this, and is it available via the GUI in SPSS?

Thank you

If your scale doesn’t have interval measurement, you can’t use ICC. You would probably need Fleiss’ kappa (3+ raters of ordinal/nominal data). I believe you can calculate Cohen’s kappa in SPSS (2 raters of ordinal/nominal data), but I think you’d need to calculate Fleiss’ kappa by hand/Excel or in R.

Thank you very much for such prompt response.

Shahzad

Dr. Landers,

This is an extremely helpful site. I am confused on one point I do not see addressed here. I have read that along with the ICC being >.70, you also need the F value for the ANOVA to be non significant. This non significant value indicates that there is no significant difference between raters, which is what you desire in reliability testing.

In my results, I have an acceptable ICC, but my F values for the ANOVA are significant. How should I interpret this?

Thanks,

Amanda

I believe your confusion comes from two implicit assumptions you are making evident in your questions.

Assumption 1) You are assuming that you either have “sufficient” reliability or not. This is a meaningless distinction. 0.7 is not a magical line that you cross and suddenly you have “enough” reliability. You always want “greater reliability.” The closer that number is to 1, the greater the extent to which the shared variance in ratings loads onto a shared factor (or factors). The further it is from 1, the smaller your correlations, standardized differences, etc. will be using that information.

Assumption 2) You are assuming that a finding of statistical significance means “the raters are different” in some meaningful way. The raters are obviously different, because they are different people. You have no reason to expect them to make 100% identical ratings, so finding statistical significance doesn’t tell you anything you didn’t already know. A finding of statistical significance in this context simply means that the differences between your raters are large enough to be detected by your sample size.

I would instead focus on interpreting the proportion you calculated, and deciding for yourself if you are comfortable with the degree to which imperfect reliability will make your measured relationships weaker when drawing conclusions.

Thank you!! I now understand better the significance and why I should not focus on that.

I have already done the work of combing through the literature and looking at future studies to help me define the ICC I will accept, so now I can move on in peace without fretting about the significance of the F value.

Hi there,

Thanks for your comments. They are really helpful. I have a question:

I have 5 raters who rate 25 questions 1 or 0. I thought I should use Fleiss’ kappa for my case, as the data are binary and I have multiple raters. However, the Fleiss’ kappa for my data comes out negative! I don’t know why. I tested many cases, but this method doesn’t seem to work for such data (this is a sample):

case1 1 1 1 0

case2 1 1 1 1

case3 1 1 1 1

case4 1 1 1 1

I think the Fleiss’ kappa for this case should be more than 0.9, yet it is negative. Am I using the wrong method for finding the agreement among raters?

Could you please help?

Thanks!

That’s the right approach, but if that sample is really a good representation of your data, you may not have enough variance in ratings to get an accurate kappa (i.e. if most cases have 100% agreement). I would probably just report percentage of cases with 100% agreement.
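To see why kappa goes negative with data like the sample above, here is a minimal Python sketch of Fleiss’ kappa applied to the four sample cases exactly as posted (four ratings per case). The function names are hypothetical. Because almost every rating is a 1, the agreement expected by chance is already about .88, so a single disagreement is enough to drop observed agreement below it.

```python
def fleiss_kappa(ratings, categories=(0, 1)):
    """Fleiss' kappa for rows = cases, each row a list of category labels.

    Assumes every case has the same number of ratings. Undefined (division
    by zero) if every rating falls in a single category.
    """
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    # Count how many raters chose each category for each case.
    counts = [[row.count(c) for c in categories] for row in ratings]
    # Overall share of ratings in each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_cases * n_raters)
        for j in range(len(categories))
    ]
    # Observed agreement per case, then averaged.
    p_case = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_case) / n_cases
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)


def share_full_agreement(ratings):
    """Proportion of cases on which all raters gave the same rating."""
    return sum(len(set(row)) == 1 for row in ratings) / len(ratings)


# The four sample cases from the comment above.
sample = [[1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
print(f"Fleiss' kappa = {fleiss_kappa(sample):.4f}")
print(f"Cases with 100% agreement = {share_full_agreement(sample):.0%}")
```

The second function gives the simpler percentage-of-full-agreement figure suggested in the reply above.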

Hello Dr. Landers,

Thank you very much for your post about the ICC! It is very helpful!

I hope you can help clarify one thing for me regarding the use of the “single measures” versus “average measure” ICC. I have 5 rating scales and 3 raters (population of raters) to rate 10 patients (3 times/visits per patient), thus resulting in 90 ratings by the 3 raters for each of the 5 scales. I am interested in the “absolute agreement” between our raters for these first 10 patients. I believe this would be a “Two-Way Mixed Model” with “Absolute Agreement”. Is this correct?

If we achieve “good” inter-rater reliability with the first 10 cases, our goal is for the same 3 raters to split up and each rate a portion of the remaining cases (sample of raters). In order to justify dividing the effort among the same 3 raters for the remaining cases, should I use “single measure” ICC rather than “average measure” ICC? In future ratings, we’ll be using ratings made by all 3 raters but they will each be rating different patients.

Many thanks in advance,

Isabel

If you’re diagnosing patients, you probably have a sample of raters rather than a population of raters – unless you’re saying that the three people you have access to are the only people who will ever conduct such diagnoses, forever. If that’s not true, you want Two-Way Random with Absolute Agreement.

And yes – to justify dividing the effort between three raters, you’ll need to look at the Single Measure estimate.

Thank you Dr. Landers for your prompt reply. Yes, we’re indeed trying to diagnose patients with a neurodegenerative condition. For the purpose of this study, only these 3 raters will be rating all the patients. Thank you for clarifying the use of the “single measure” estimate. Much appreciated!

This is such a useful site – thank you so much! I just want verification that I am doing this correctly. I am doing observational coding on 200 interactions. I have one person rating all 200 (primary rater) and I have a second person rating 25% (or 50 cases). I want to determine the reliability of the primary and secondary raters on the 50 cases and then generalize to all 200 cases coded by the primary rater. So I know I have to use single measure since I only have 1 rater for the 200 cases. My question is: do I use “one-way random” or “two-way random”? The one-way random is more conservative and so I’ve been advised to use it, but is it appropriate since I don’t have randomly selected raters?

If you don’t have at least theoretically randomly selected raters, it is not meaningful to have them do the ratings at all – otherwise, why would you ever conclude that this person’s ratings are meaningful, or could be replicated by anyone else? Assuming you really do have a sample of raters, you want two-way random since your raters are consistent for all 50 cases. However, you will be generalizing from your pair of raters down to your single rater for the other 75% of cases, so you must trust that other assumptions hold (e.g. that both the full sample and subsample are randomly drawn from the same population).

Just what I needed! Thanks a lot.

Dr. Landers,

Thank you for this helpful website. I’m currently working on my thesis. To understand the writing ability of English language learners, I should apply an inter-rater scale of measurement. I have two raters who must score the students’ writing papers on four scales (CONTENT, ORGANIZATION, VOCABULARY, LANGUAGE, MECHANICS), and the final score will be the mean of the scores on these 4 scales. I would be thankful if you could help me select a reliable inter-rater measure in the context of language learning, and explain how I can calculate it by hand on around 5 or 10 sample papers scored by the two raters.

You’re mixing some concepts, which is making it difficult to figure out what you’re asking. Interrater is not a scale of measurement; it is a way of looking at reliability (consistency) of measurement between raters. ICC as a measure of inter-rater reliability assumes either interval or ratio scale of measurement. So you need to figure that out first. If you do have interval+ scale of measurement, ICC would be fine. If you always have the same 2 raters, you should probably use ICC(2). If you have different pairs of raters for each paper, you should use ICC(1). If you are interested in your four subscales separately in further analyses, I’d calculate 5 ICCs – one for each subscale and one for the overall scale mean. If you’re only interested in an overall assessment, you only really need the overall mean (one ICC). If by hand you mean without SPSS, this is fairly straightforward if you understand ANOVA – you need a dataset with every rating on its own row, rater identifiers in a second column, and ratee identifiers in a third column. You then conduct an ANOVA (for ICC(1), ratee as IV and scores as DV; for ICC(2), ratee and rater as IV and scores as DV) and run through the ICC formulas from the ANOVA table – for ICC(1,1), (MSb-MSw)/(MSb+MSw(k-1)). Or (MSb-MSw)/MSb for ICC(1,k). It is slightly more complicated for ICC(2,1).
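For anyone who wants to see the one-way version without SPSS, here is a rough sketch of the ANOVA arithmetic behind ICC(1,1) and ICC(1,k). The function name `icc_one_way`, the tuple layout (score, rater, ratee), and the eight ratings are all made up for illustration. It assumes every ratee has the same number of ratings.

```python
from collections import defaultdict


def icc_one_way(rows):
    """Return (ICC(1,1), ICC(1,k)) from long-format rows of (score, rater_id, ratee_id)."""
    by_ratee = defaultdict(list)
    for score, _rater, ratee in rows:
        # One-way model: ratee is the only factor, so the rater id is not used
        by_ratee[ratee].append(score)
    groups = list(by_ratee.values())
    n = len(groups)      # number of ratees
    k = len(groups[0])   # ratings per ratee
    if any(len(g) != k for g in groups):
        raise ValueError("this sketch needs the same number of ratings per ratee")
    grand_mean = sum(sum(g) for g in groups) / (n * k)
    # Between-ratee and within-ratee sums of squares from the one-way ANOVA
    ss_between = sum(k * (sum(g) / k - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / k) ** 2 for g in groups for x in g)
    ms_b = ss_between / (n - 1)
    ms_w = ss_within / (n * (k - 1))
    icc_1_1 = (ms_b - ms_w) / (ms_b + (k - 1) * ms_w)
    icc_1_k = (ms_b - ms_w) / ms_b
    return icc_1_1, icc_1_k


# Hypothetical data: 4 essays, each scored by a different pair of raters
rows = [
    (4, "r1", "e1"), (5, "r2", "e1"),
    (2, "r3", "e2"), (2, "r1", "e2"),
    (5, "r2", "e3"), (4, "r3", "e3"),
    (1, "r1", "e4"), (2, "r3", "e4"),
]
icc11, icc1k = icc_one_way(rows)
print(icc11, icc1k)
```

The single-measure value describes one rater’s score; the average-measure value describes the mean of the k scores per ratee.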

Dear Dr. Landers,

Thank you so much for your reply. From it, I understood that those 4 scales will only confuse me more, so I decided to measure the raters’ correlation using the total scores they give to students. Before the raters start to rate the papers of my final study subjects, how many subjects’ scores are enough to assess the agreement of the raters with ICC? For example, here I provide a table; the numbers are all examples and the scores are out of 100. I would be very thankful if you could show me the formula and how to calculate the ICC manually for these sample data.

| Student | Rater A score | Rater B score |
|---|---|---|
| 1 | 50 | 65 |
| 2 | 85 | 95 |
| 3 | 80 | 85 |
| 4 | 90 | 87 |
| 5 | 71 | 92 |

Sorry if the figures didn’t reach you in order. The first column is the student number, the second column is the score given by teacher A, and the third column is the score given by teacher B. Also, these two teachers will rate only once, meaning there is no pretest or posttest, just one composition paper. Thank you; I’m waiting for your kind reply.

I’m not sure how much I can help you beyond what I’ve told you already. If you want the formulas for ICC, they are in Shrout and Fleiss (cited above). You’ll need to first calculate the ANOVA appropriate to the type of ICC you want, then use the formulas derived there.

In terms of precision of ICC, the number of raters is nearly as important as the number of cases. You probably won’t be able to get a stable estimate of ICC(2,1) with only 2 raters. You can algebraically manipulate the ICC formula in Shrout and Fleiss to solve for k – that will tell you the number of raters you want for a given level of ICC(#,k).
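As a rough illustration only, the sketch below runs the two-way ANOVA arithmetic for ICC(2,1) on the five example score pairs posted above, then rearranges the Spearman-Brown formula to solve for k. The function names and the 0.90 target are my own assumptions, and with 5 papers and 2 raters the estimate itself should not be trusted.

```python
def icc_2_1(matrix):
    """ICC(2,1) (two-way random, absolute agreement, single measure).

    matrix: one row per ratee, one column per rater, no missing cells.
    """
    n = len(matrix)       # ratees
    k = len(matrix[0])    # raters
    grand = sum(sum(row) for row in matrix) / (n * k)
    row_means = [sum(row) / k for row in matrix]
    col_means = [sum(row[j] for row in matrix) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between ratees
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in matrix for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    bms = ss_rows / (n - 1)
    jms = ss_cols / (k - 1)
    ems = ss_error / ((n - 1) * (k - 1))
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)


def raters_needed(single_icc, target_icc):
    """Spearman-Brown solved for k: raters to average to reach target_icc."""
    return target_icc * (1 - single_icc) / (single_icc * (1 - target_icc))


# Example scores from the comment above: (rater A, rater B) for students 1-5
scores = [(50, 65), (85, 95), (80, 85), (90, 87), (71, 92)]
single = icc_2_1(scores)
k_needed = raters_needed(single, target_icc=0.90)  # 0.90 is an arbitrary target
print(single, k_needed)
```

Round `k_needed` up to the next whole rater. The result is only as stable as the single-rater ICC that goes into it.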

Dr. Landers, i hope you can help me.

I am currently conducting a reproducibility study on 26 young swimmers. I have measured their jumping height on two occasions, all measurements were performed by myself.

Which ICC is appropriate for my design?

So far I have only calculated ICC by hand using ICC (1,2) according to Rankin and Stokes (Reliability of assessment tools in rehabilitation: an illustration of appropriate statistical analyses), but I can’t figure out which one of the ICCs is appropriate for my study design according to Shrout and Fleiss.

I am to calculate ICC in SPSS, so I hope you can help me.

Best regards

Melene Lykke.

You are violating the assumptions of ICC by rating it twice. ICC(1) assumes a randomly drawn rater for every rating and ICC(2) assumes the same pairs (or trios, etc) of raters for all cases. In your case, you have neither. However, if you consider each of your measurements to be essentially random, you could use ICC(1), as your book suggests. The mapping of ICC(1,k) to SPSS commands is explained in the article above – one-way random, average measures.

Thank you for your reply. I wasn’t aware of the assumptions of ICC. I will follow your recommendation of using ICC (1).

Again thank you.

Dr. Landers,

Thank you so much for such a clear explanation. I have an unusual question involving a robot performing a rating as opposed to a human.

If I use a testing system to measure say joint motion but it does so automatically/robotically. I am interested in finding the reliability of that testing system over days. So if I construct a study to measure 10 subjects over 6 days with 3 tests per day, would the ICC score be a means to calculate its reliability?

Each test requires that the subject be placed in the testing system the same way which is performed in a routine fashion. The system then automatically calculates the subjects joint motion. The system itself has been studied to show that it reliably performs the measurement tasks and reproducibly calculates joint motion in an artificial test to within 0.5 mm and 0.1 degree.

I would now like to calculate the reliability of the testing process on subjects. I would think that each test would act as an independent rater over days/number of tests. Furthermore, I would think that the testing system would be the ‘ideal’ rater. Thus, the ICC(3,k) would be utilized.

Do you agree or can you shed some light on this situation?

Thank you so much,

Tom

Well, keep in mind first that a system itself cannot be reliable or unreliable – it only demonstrates reliability in a particular measurement context given particular rating targets. I am not sure what you are trying to measure exactly – if you’re interested in the reliability of the system in detecting joint motion from a particular population of joints over time, I don’t see why you would want so many measurements. Remember that any time you choose a reliability coefficient, you must think about the sources of error that you want to assess. It seems like you’re most concerned about temporal error – changes over time – in which case, why worry about both multiple tests per day and tests over multiple days? Do you expect different error rates depending upon the time span? I would choose just one (e.g. one rating per day over x days or x raters in one day) unless you expect such a difference. If you DO expect such a difference, you have a fairly complex relationship between reliability and time and may need to create your own reliability estimate if you wanted to capture that. If you don’t want to do that, I would instead model those differences explicitly (e.g. variance over both the short and long term). 18 measurements of the same number is quite a lot though.

In this context, you do not meet the assumptions of ICC 1, 2 or 3. However, if you don’t mind relaxing the random-rater-draw assumption, you could feasibly use ICC(1,k). If you’re interested in the reliability of a single joint measurement (which it sounds like you probably are), you’ll want ICC(1,1). This would be interpreted as a test re-test reliability (sort of) rather than an inter-rater reliability though.

As a side note, you will also need way more than 10 subjects to get any precision at all in any reliability estimate. Otherwise your standard errors will be huge.

Dear Dr Landers,

Thank you for generously taking time to educate those of us who are less familiar with ICC. I’ve been searching the Internet for days looking for information on this topic and have not been able to find useful webpages- yours is the closest to what I was looking for. I don’t believe my questions have been addressed previously in this thread, and hope that you might be able to help!

I have a balanced panel dataset with a sample of 900 firms in 194 industries spanning 9 years. I have three levels – time, firm, and industry. I need to decide the appropriate level of aggregation for each variable. That is, I must decide whether each variable should be regarded as transient (varying over time) or stable (ie explains only cross-sectional variance between firm or industry). The literature indicates that ICC(1) can be used to answer this question, and ICC(2) can estimate the reliability of the aggregate measure.

My questions:-

(1) According to Bliese (2000, p 355-356), the formula to compute ICC(1) based on one-way random-effects ANOVA is as follows:

“ICC(1) = [Mean Square (Between) minus Mean Square (Within)] / [Mean Square (Between) + (k-1)*Mean Square (Within)].”

Bliese (2000) defined k as the group size. In my study context (where there are 3 levels – year, firm, industry), what number should I plug into k for each of the 3 levels?

(2) Given that my study context has three levels, should I run one-way random-effects ANOVA three times whereby each grouping factor is time, firm, and industry in order to determine the ICC(1) for each level?

I would be grateful for any guidance you can provide!

In the aggregation literature, ICC(1) usually refers to ICC(1,1) and ICC(2) usually refers to ICC(1,k). That is the Bliese interpretation as well – you can see that his formula for ICC(2) is really ICC(1) with a Spearman-Brown correction (which is consistent with ICC[1,k] in Shrout/Fleiss).
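Written out, with MS_B and MS_W the between-group and within-group mean squares from the one-way ANOVA and k the group size (notation follows the Bliese formula quoted above), the relationship is:

```latex
\mathrm{ICC}(1,1) = \frac{MS_B - MS_W}{MS_B + (k-1)\,MS_W},
\qquad
\mathrm{ICC}(1,k) = \frac{k\,\mathrm{ICC}(1,1)}{1 + (k-1)\,\mathrm{ICC}(1,1)}
                  = \frac{MS_B - MS_W}{MS_B}
```

Substituting the first expression into the Spearman-Brown form and simplifying gives the last one, which is why the two labelling conventions describe the same pair of statistics.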

When answering aggregation questions, you’re not really interested in the higher levels organizing your data given a particular question. You just want to know if the most basic unit within your data (the one you are aggregating) is meaningful. So you could run a one-way random-effects ANOVA at the bottom level of your model (either firm or time, I imagine). If you wanted to aggregate across multiple categorizations, you’d need to create a new grouping variable indicating that group membership (e.g. for each unique combination of industry/firm).

However, I would recommend not doing any of that because you still lose information that might be important in later analyses, and it will be difficult to justify the particular aggregation strategy you end up using in a research paper given that you have a variable hierarchy (time within firm/industry or firm/industry within time). Hierarchical linear modeling (or linear growth modeling in the context of SEM) does not require such sacrifices. So I would use one of those approaches (and that will be an expectation of reviewers in many top journals, e.g. AMJ, anyway).

Dear Dr Landers,

Thank you most kindly for your prompt response!

Here’s a sample of my data structure:-

Firm_name TimeFirm_id industry_id

ABC 1234 3576

ABC 1234 3576

ABC 1234 3576

ABC 88553510 3576 4.00

So you’re saying that I should just run a one-way random-effects ANOVA using time as the

Dear Dr Landers,

Thank you most kindly for your prompt response! I apologize that my prior reply accidentally got sent before I was ready.

I have a couple of clarification questions.

Here’s a sample of my panel data structure:-

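| Firm_name | Time | Firm_id | industry_id | Firm size |
|---|---|---|---|---|
| ABC | 1 | 3576 | 0011 | 1.11 |
| ABC | 2 | 3576 | 0011 | 2.10 |
| ABC | 3 | 3576 | 0011 | 1.89 |
| . | | | | |
| DEF | 1 | 1234 | 7788 | 1.11 |
| DEF | 2 | 1234 | 7788 | 2.10 |
| DEF | 3 | 1234 | 7788 | 1.89 |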

Let’s say I want to determine whether firm size is a transient or stable factor.

My questions:

(1) If I understand correctly, you’re saying that I should just run a one-way random-effects ANOVA using time (the lowest level) as the grouping factor and firm size as the dependent variable?

(2) In order to compute ICC(1) using Bliese’s (2000) formula, what number should I plug into k, the group size? Since I have 9 years of data, is k=9 in my case? I’m a little confused because I’ve also got 900 firms in 194 industries, so would my group size “k” be the number of years of data (9) or average number of firms in each industry (900/194=4.64)? Bliese (2000) gave the example of average number of teachers per group as 40.67 for “k”, but I suppose that was for multilevel modeling. Since I’m using growth modeling involving time, perhaps my k should be 9?

Thanks for your patience with my questions! I’ve been reading the literature quite a bit but I’m still relatively new at this, so please pardon me if these are basic questions.

If you’re interested in calculating ICC, the score you are interested in is your DV, whereas your grouping variable is your IV, whatever that grouping variable might be. I am confused by your other question, because if you are using growth modeling, you should not need to aggregate in the first place.

Dear Dr Landers,

I apologize for not being clear. Let me try to explain again. So I have three levels – time, firm, and industry. I’m interested in using ICC(1) to examine the amount of variance in firm size that occurs within firms over time versus between-firms versus between-industry levels. Based on Bliese (2000), I know that I need to use one-way ANOVA with firm size as the dependent variable and time as the grouping factor (or independent variable).

Let’s say the one-way ANOVA result for firm size is as follows:-

| | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Between Groups | 58.400 | 11 | 5.309 | 2.040 | .021 |
| Within Groups | 23830.728 | 9155 | 2.603 | | |
| Total | 23889.128 | 9166 | | | |

Now, I need to compute ICC(1). Based on Bliese (2000), the formula is as follows:-

“ICC(1) = [Mean Square (Between) minus Mean Square (Within)] / [Mean Square (Between) + (k-1)*Mean Square (Within)].”

Using the one-way random effects ANOVA results above, I plug in the following numbers:-

ICC(1) = [5.309 – 2.603] / [5.309 + (????-1)* 2.603]

I’m not sure what value to plug into k here as depicted by ????. That’s the essence of my second question. But perhaps I’m misunderstanding how the whole ICC(1) thing works in growth modeling. If that’s the case, I would appreciate your advice to help me understand how to determine ICC(1) in growth models.

Thank you so much again!

Ah. In your example, you would use the number of time points for k.

My instinct is still that you should not be aggregating, but blog comments are not a very good forum to figure that out for sure. I’m sure you’ll be able to explain it in your paper, so no need to worry about it.
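Purely as an arithmetic check, here is the Bliese formula with the mean squares from the example table above plugged in. The value k = 9 is an assumption taken from the nine years of data mentioned earlier in this thread, the example table is taken at face value, and the function name is hypothetical.

```python
def icc1_from_mean_squares(ms_between, ms_within, k):
    """Bliese's ICC(1): (MSB - MSW) / (MSB + (k - 1) * MSW)."""
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Mean squares from the example ANOVA table; k = 9 time points is an assumption
icc1 = icc1_from_mean_squares(ms_between=5.309, ms_within=2.603, k=9)
print(round(icc1, 3))
```

A larger k makes the denominator grow, so the same pair of mean squares gives a smaller ICC(1) as the group size increases.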

Thank you so very kindly, Dr Landers! I greatly admire your generosity in sharing your knowledge and time with someone you don’t know. Happy Holidays to you!

Dear Dr. Landers

Thank you so much for your article, it is really helpful!

I´m doing some research where I use ICC to test for agreement between two raters.

I choose two-way mixed –> absolute agreement –> single measures.

My question is: I know that 0,70 is a cut-off score that people typically use when they interpret ICC, but since I test absolute agreement, should I still use this value? Or should I use other values that are often used for interpreting agreement – for example those suggested by Fleiss (1981): “0,0 – 0,4” ; “0,40 – 0,75” ; “0,75 – 1”?

It depends entirely upon what number you find to be realistic and worthwhile. The further from 1.00 your reliability estimate is, 1) the more difficult it will be for you to find statistical significance and 2) the more likely your scale will be multidimensional and measuring something you don’t realize it’s measuring. I would personally not trust a scale in my own research that does not typically demonstrate reliability over 0.85 in the type of sample I’m using. Whatever your cutoff, agreement vs. consistency should not affect the cutoff that you find compelling – the agreement vs. consistency question is driven by the research question you’re trying to address.

Dear Dr.Landers

Thank you so much for the article.

After having read so many comments and replies I’m still not able to find the exact way to work on my paper. My work is on pain reactions in preterm babies. I have 4 raters who have rated the same 20 babies across 20 variables with a yes or no (whether they see the particular pain reaction or not).

It looks like I can’t use ICC because the ratings are yes/no and I can’t use kappa because I have more than 2 raters. I read somewhere that in these cases I can use Kendall’s coefficient of concordance. Do you have any suggestions for my scenario? Your answer would be greatly appreciated.

Thank you!

Kruthika

By “kappa”, you probably mean Cohen’s kappa. Fleiss’ kappa is a more general case of Cohen’s and can be used with any number of raters. However, you could also use ICC by coding your yes/no ratings as 1/0. If your 20 variables are all part of the same scale, I would probably take a mean yes/no (1/0) rating, i.e. each person’s score is the percentage of yeses they answered, and calculate ICC on that. I would not use ICC to determine the reliability of individual yes/no ratings, because dichotomization will depress your estimates in comparison to kappa. Kendall’s W is used when you have rank-order data, which would also technically work here, but is less common.

Dear Dr. Landers,

Thank you for such a clear and quick reply!

Dear Dr Landers,

In our study we have determined the intra and interobserver reliability using ICC ( 4 raters, 51 cases, 8 options). Now we would like to have a 95 % CI of the mean ICC of the 4 observers ( inter and intra). Do you know how to determine the 95 % CI of the mean ICC?

Thanks for your help

It doesn’t sound like you’d need a mean ICC in the situation you described. And I don’t know what you mean by “intra-observer reliability” unless you had multiple scale items, in which case you probably wouldn’t use ICC. You should be able to calculate one inter-observer reliability estimate in SPSS given that data, and it will tell you the 95% CI by default.

Hi Dr. Landers,

I’ve read somewhere that the ICC can also be used to determine if there are predominantly trait or state differences between people. For example, if after a series of repeated measurements of the same function (let’s say “alertness”, measured 10 times a day), the intra-individual variance (within-subjects) is less pronounced than the inter-individual variance (between-subjects), then a trait-like operation is underlying the function. Put differently, the higher the intra-class correlation, the more likely it is to expect effects to be stable over time.

Now, I have such measurements, but I fail to be sure which model I should use to compute the ICC, and which result to report (single or average). Could you formulate some advice? Thank you in advance, and congrats on the article.

I would like to help, but I have unfortunately never heard of this. I think you’d be best off finding an example of this in prior literature and replicating their approach.

Thank you Dr. Landers.

I found the article I was talking about: J Sleep Res. 2007 Jun;16(2):170-80.

Trait interindividual differences in the sleep physiology of healthy young adults.

Tucker AM1, Dinges DF, Van Dongen HP.

That’s what they say about it: […] Among 18 sleep variables analyzed, all except slow-wave sleep (SWS) latency were found to exhibit significantly stable and robust–i.e. trait-like–interindividual differences. This was quantified by means of intraclass correlation coefficients (ICCs), which ranged from 36% to 89% across physiologic variables, and were highest for SWS (73%) and delta power in the non-REM sleep EEG (78-89%). […]

I computed ICCs using the Two-way mixed-consistency approach, and I plan on reporting the single measures, which could reflect the amount of between-subject variability considering a single time point. Their approach is somewhat different, as they computed both within and between subject variability using traditional ANOVA procedures and calculated the ratio between BS var and BS+WS var…

I’ll try that and see how well the results concur with the 2way-mixed approach.

ICC is calculated from the results of ANOVA. A two-way mixed effects model just means that raters are entered as a fixed-effects variable, whereas the items are entered as a random-effects variable. At that point, you can just use the formulas in Shrout & Fleiss to make the ICC(3,1) calculation, and use the Spearman-Brown prophecy formula to scale up to ICC(3,k) if needed. I am not sure who your “fixed raters” and “random items” are in this context though.
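A minimal sketch of that calculation follows, assuming a complete subjects-by-occasions table. The data are invented, and the function names `icc_3` and `spearman_brown` are mine rather than anything from SPSS or from the paper cited above.

```python
def icc_3(matrix):
    """Return (ICC(3,1), ICC(3,k)): two-way mixed, consistency.

    matrix: one row per subject, one column per rater/occasion, no missing cells.
    """
    n = len(matrix)
    k = len(matrix[0])
    grand = sum(sum(row) for row in matrix) / (n * k)
    row_means = [sum(row) / k for row in matrix]
    col_means = [sum(row[j] for row in matrix) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in matrix for x in row)
    # Residual after removing subject and column (rater/occasion) effects
    ss_error = ss_total - ss_rows - ss_cols
    bms = ss_rows / (n - 1)
    ems = ss_error / ((n - 1) * (k - 1))
    icc_3_1 = (bms - ems) / (bms + (k - 1) * ems)
    icc_3_k = (bms - ems) / bms
    return icc_3_1, icc_3_k


def spearman_brown(r, k):
    """Reliability of the mean of k measurements, given single-measure r."""
    return k * r / (1 + (k - 1) * r)


# Hypothetical data: 5 subjects measured on 3 occasions
data = [[7, 8, 7], [4, 5, 5], [9, 9, 8], [3, 4, 4], [6, 6, 7]]
single, average = icc_3(data)
print(single, average, spearman_brown(single, k=3))
```

Stepping the single-measure value up with Spearman-Brown reproduces the average-measure value, which is a quick way to check a hand calculation.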

Dear Dr. Landers,

thanks a lot for your explanation on the ICC. I still have a few questions though.

In my study, I have two groups, a clinical and a control group, who have to fill in a questionnaire which consists of 30 items. Each group consists of 2 subgroups. In the first subgroup are parents (48 clinical group / 100 control group) and in the second subgroup are students (43 clinical group / 70 control group).

I’d like to calculate the ICC for EACH item and compare them between the clinical and the control group; e.g., compare the ICC of the rater agreement between parents and students of item 1 of the clinical group with the ICC of item 1 of the control group.

Is it possible to first calculate a mean/median of the parents’ and the students’ ratings per item, and then calculate the ICC using the One-Way Random-Model?

And how can I calculate the ICC for each item instead across all items? Can I split my dataset between the items or do I have to create a dataset (separate SAV files) for each item?

Thanks a lot in advance!

You can calculate ICC for any subgroup you want; you’ll just need to set up variables indicating each subgroup/variable combination you are interested in. You calculate ICC for individual items the same way – you just need one column for each rater/item combination (instead of rater/scale combination). I will warn you, however, that comparing item reliabilities can be dangerous, because item reliabilities follow a sampling distribution just like any other statistic – and because you are comparing many individual items rather than one scale, the familywise error rate is dramatically inflated – so just because one item has a high ICC and another has a low ICC does not mean that either item is “better.” To make such conclusions, you’d need to compare Bonferroni-corrected confidence intervals.

Thank you for your quick response! For my study, it’s not so important to decide between “good” and “bad” items but to see if the rater agreement between the clinical and the control group is similar.

I’m not sure, if I understand your comment correctly. I tried to calculate the ICC for an individual item by putting the median of the parents’ ratings and the median of the students’ ratings in the item field and then pressed the statistics button, chose the intraclass correlation coefficient and selected the one-way random model. Instead of the results, I get the warning message “There are too few cases (N = 1) for the analysis. Execution of this command stops.”

Ah – I see. You don’t need item comparisons then – but you would still want to look for overlapping confidence intervals on your scale measures.

There are several red flags in what you’re saying. One, I’m thinking you’re using median ratings because your scale is not interval or ratio measurement, but this is a requirement of ICC. Means must be meaningful. If you are using ordinal scales, you can’t use ICC at all.

If you are in fact using interval or ratio scales, you should not calculate medians yourself as input to ICC – they should be means, and you should have 1 mean per rater per rating target (e.g., for 2 raters across 10 cases, you would have 20 scale means). Your cases must be independently drawn from some population of things (or people, or whatever) to be rated.

For ICC to make sense in this context, my mental model is that you have multiple parents and multiple students rating the same paper people. All of those paper people being rated (your N) must be on separate rows. If you only have 1 observation per case, it is impossible to calculate inter-rater reliability – so if you have only 1 parent and 1 student per rating target (or if you only have 1 rating target; for example, only 1 paper person), you cannot examine inter-rater reliability within parents (or within students) – it is impossible given that structure of data.

The main exception to this would be if the items on your survey are theoretical random draws from a population of items (e.g., if your items are really each themselves “paper people”). In that case, you really have one item per case (i.e., SPSS row). In that case, you would put all of your student raters as columns, and all of the items as cases, then calculate ICC; then repeat with parents; then compare confidence intervals (and raw effect sizes, of course).

That’s about all I can think of. You might want to bring a psychometrician in to work on your project – someone who can dig into your dataset would be able to give you more targeted advice.

Hi Dr. Landers,

Thank you for taking the time to write and follow-up on this blog post. It’s been very helpful.

In our study, we have ratings by four different raters but only one rater per case (n = 30-40). We had intended to have all raters rate one case to make sure the raters’ assessments were close together and then analyze this using an ICC but seeing as it won’t work with a sample size of 1, we’re trying to determine what to do.

While it’s not possible to have the four raters complete assessments for each individual in the study, if all four raters completed ratings for a small group of the participants (maybe 10%), would this be sufficient for determining the inter-rater reliability?

It depends what you mean by “sufficient.” It will allow you to calculate an estimate of ICC across your sample – rather than the actual ICC – which is better than nothing but still not perfect. But it will be inaccurate to whatever degree sampling error is a problem in the random raters you end up choosing. It sounds like you want to keep each rater completing only 3-4 additional ratings. That is not nearly enough to get a stable estimate. I would instead ask each rater to rate a second person (chosen at random). That would require each rater to complete 8-10 additional ratings. You could then calculate ICC(2,1) to get a stable estimate of reliability for your individual raters. You could alternatively do half of this (4-5 additional ratings each; getting 1 additional rating on half of your sample), but this is a little risky for the reasons outlined above. I wouldn’t do less than that.

Dear Dr. Landers,

this article really has proven to be a huge help for my evaluation, thank you very much for the clear and precise words on how to use the ICC.

I’m having a problem with my results though and hope that you might be able to give me a hint as to if there even is an issue and if so why.

I have 2 raters and 27 ratees. The data had to be coded into a scale with 7 categories so that every rater now has a value between 1-7 in each of his 27 rows. When looking at the data you can already see that most values only range from 4-5 and both raters seem to even concord most of the time so there seems to be very little variation but as soon as I let SPSS actually compute the ICC I only get a 0,173 as the result which is pretty low for data that seems to be so similar.

Should I just accept the low value as the correct result or did I do something wrong?

I actually counted how often both raters don’t agree and it’s 6 times but only marginally e.g Rater 1: 4 Rater 2: 5

Thank you for your help!

Kind regards,

Moritz

P.S. English is not my native language so should anything be incomprehensible to you I would be glad to try to give you a better explanation.

You are likely getting a low ICC because of low variance in ratings. If every answer is a 4 or 5, then that means for 6 cases out of 27 (22%), you have zero agreement. Zero agreement is very different from 100% agreement, so ICC is being dramatically suppressed. If possible, I’d suggest re-anchoring the rating scale and re-rating all of the cases with your new scale. For example, you might add more numbers with labels between 4 and 5. If that’s not an option, you might report simple percentage agreement instead (78%), which isn’t as rigorous an examination of agreement, but it probably more closely represents agreement in your sample.
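To make the variance point concrete, here is a small sketch with invented ratings (not your data): 27 ratees, 2 raters, and the same 6 one-point disagreements in both sets. Only the spread of the ratees differs. It uses the one-way ICC(1,1) for simplicity, so the numbers will not match an SPSS two-way model, and the function names are hypothetical.

```python
def percent_agreement(pairs):
    """Proportion of ratees on which both raters gave the identical rating."""
    return sum(a == b for a, b in pairs) / len(pairs)


def icc_1_1(pairs):
    """One-way random, single-measure ICC from a list of per-ratee rating tuples."""
    n = len(pairs)
    k = len(pairs[0])
    grand = sum(sum(p) for p in pairs) / (n * k)
    ms_b = sum(k * (sum(p) / k - grand) ** 2 for p in pairs) / (n - 1)
    ms_w = sum((x - sum(p) / k) ** 2 for p in pairs for x in p) / (n * (k - 1))
    return (ms_b - ms_w) / (ms_b + (k - 1) * ms_w)


# The same six one-point disagreements appear in both hypothetical data sets
disagreements = [(4, 5)] * 3 + [(5, 4)] * 3

# Narrow: everyone is rated 4 or 5
narrow = [(4, 4)] * 15 + [(5, 5)] * 6 + disagreements
# Wide: agreeing ratings spread across the full 1-7 scale
wide = [(v, v) for v in range(1, 8) for _ in range(3)] + disagreements

agreement = percent_agreement(narrow)
icc_narrow = icc_1_1(narrow)
icc_wide = icc_1_1(wide)
print(agreement, icc_narrow, icc_wide)
```

Percent agreement is identical in the two sets, but the ICC is far lower when the ratees barely differ from one another.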

Thank you very much, that sounds like I finally can move along since this has been bugging me for days without finding a solution to that odd result.

Re-anchoring won’t really work, I guess, since the rating described above is based on the same scale I am also using for another item, but the ratings have a much higher variance there. That’s why the example with the 4’s and 5’s is basically an exception.

Is there something like an approximate variance that should be present to effectively compute ICC’s?

Thanks again for your quick response, it’s rare that someone answers questions so patiently.

There aren’t any rules of thumb, I’m afraid. The general idea behind ICC is that it looks at how much raters vary from each other in relation to how much ratees vary from each other. So if those proportions are similar, you’re going to have a low ICC.

What you want is low inter-rater variance and high inter-ratee variance to demonstrate high agreement. In classical test theory, your scale should be designed this way in the first place – you want to maximize the differences between people (so that your scale mean is near the middle, e.g. 3 on a 5-point scale, with a normal distribution reaching out to 1 and 5, and without ceiling or floor effects) in order to explain those differences with other variables.

Another way to look at it is in terms of scale of measurement. Because your raters are only using 2 options on the scale, you have converted an interval or ratio level measure to a nominal one. That will certainly attenuate ICC.
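One way to write that idea down, as a sketch in the one-way random-effects notation of Shrout and Fleiss (1979), is below. The symbols are my labels for this comment rather than anything SPSS prints: $MS_B$ is the between-ratee mean square, $MS_W$ is the within-ratee mean square (which in the one-way model lumps rater differences together with error), and $k$ is the number of raters per ratee.

$$
\mathrm{ICC}(1,1) = \frac{\sigma^2_{\text{ratee}}}{\sigma^2_{\text{ratee}} + \sigma^2_{\text{within}}} \approx \frac{MS_B - MS_W}{MS_B + (k-1)\,MS_W}
$$

When ratees barely differ from each other, $MS_B$ shrinks toward $MS_W$, the numerator approaches zero, and ICC drops even if the raters rarely disagree by more than a point.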

Dear Dr. Landers,

I am conducting ICC for a scale to determine reliability but I am having an issue when trying to examine the reliability for one item. I can run the analysis for the scale as a whole, or two plus items, but when I try to input one item, I receive a warning in SPSS output saying there are too few cases, n=1. Do you know if it is possible to calculate the agreement between raters for individual items? I have 40 raters and would ideally like to calculate the reliability for each item of my scale.

Many thanks for any help and for your explanation.

Best wishes,

Adele

If you’re getting an error about n=1, you’re either not running the ICC calculation correctly or you don’t have more than 1 ratee (which is required to model variance among ratees). You can certainly calculate ICC for each item, if that is meaningful (although it usually isn’t, for reasons I’ve described in previous comments).

Dear Dr. Landers,

Thanks for taking the time to write the article and answer people’s questions on this page.

In our study, we want to compare the ICCs (one-way random) between two clinical samples.

Clinical sample A and B, both consist of an adult rater (who are actually more adults, who each rated a different patient who belongs to the same group, and are therefore placed in the “adult rater group”) and a student rater (who also consists of more students who belong to the “student rater group”) who rated patients on a questionnaire.

In a first step, the ICC for each item of the questionnaire for clinical sample A will be calculated and in a second step, the same method will be used for clinical sample B. This procedure will then be repeated for 3 different groups.

I’m unsure if I should use SINGLE or AVERAGE MEASURES in order to make basic assumption about the ICCs between clinical sample A and B. But also if I have more groups, and want to compare the agreement between different groups (e.g.: how does the ICC of a certain item for the adult rater group 1 and the student rater group differ from the ICC of the same item for the adult rater group 2 and the student rater group, or to the ICC of the same item for the adult rater group 1 and the adult rater group 2) within a clinical sample. Do I have to use single measures or is it also possible to compare average measures?

Although I’d like to help, I am not at all understanding your design. This sounds like a very unusual application of ICC, and reliability in general. My inclination is that you’d want to compare average measures since there is no meaningful single measure unit (i.e. your “single measure” is some ambiguous half student-half adult hybrid rater and is thus uninterpretable). But your conclusions about ICC comparisons will be limited to student/adult combinations regardless, so I’m not seeing why you’d want to do this in the first place.

Hi Dr. Landers,

This is one of the most useful statistics articles I’ve ever seen. I wanted to know if I could please ask your opinion. My colleague and I conducted an experiment in which we could use some advice, if you have time. He taught 4 classes on 2 occasions and I taught a different group of 4 classes on 2 occasions. I was the experimental group and taught all my sessions with a game. He was the control group and didn’t use games. We each gave students a pre-test at the beginning of the first session and a post-test at the end of the second class.

Our hypothesis is that students in the experimental (games) classes performed better on the post-test than the control. Unfortunately we didn’t think to assign each student a number so that we could figure out which pre and post-test belonged to whom. So basically we have a ton of pre and post-tests divided by class but not by student. Is there a way we could conduct statistical analyses for the groups instead of individuals to see if our hypothesis was confirmed?

Thanks so much,

Kate

This is a bit outside the scope of this article, and I’m not an ANOVA expert. But as far as I know, there is not really anything you can do as far as recovering the pre-tests goes. Assuming you were wanting to use pre-tests as a control variable in some sort of parametric analysis (e.g. ANCOVA), the basic idea is that the relationship between each person’s post-test scores and the covariate (pre-test) is determined first using regression, and then the variance explained in the post-tests from that regression is removed from the ANOVA model (creating what is referred to as the Corrected Model). Without the covariate-posttest link, you have no way to do that correction.

However, covariates do not necessarily need to be pre-tests. Although that’s generally best – and least error prone – you can also use anything you believe to be correlated with the pre-test. If you think that differences across classes in student achievement are the primary driver of pre-test differences in your DV, for example, you could just control for prior student achievement (using, for example, other test scores unrelated to the game – assuming you had the same tests in the two classes – or even prior GPA).

The last resort is to just assume that the classes were more-or-less equivalent ahead of time. Most research methods textbooks call that kind of design “generally uninterpretable”, but despite that, it is surprisingly common in education research.

Hi Dr. Landers,

This is really useful, thanks so much for your input. I really appreciate it.

Best,

Kate

Hi Dr. Landers,

I am conducting research to measure the inter- and intra-reliability of subjects in preparing three different thicknesses of liquids (i.e. mild, moderate, & extreme). In this study, 18 subjects are required to prepare each thickness of liquid three times (e.g. 3 mildly thick, 3 moderately thick & 3 extremely thick). The thicknesses will be measured in centimeters but they can also be categorized into mild, moderate & extreme. May I know if I can use intra-class correlation to analyze the inter-reliability of subjects?

Thank you.

Best wishes,

Bing

You can use ICC on any distance measurement (like cm), since that is ratio-level measurement.

Dear Dr. Landers,

Thank you for your reply.

Sorry, I have another question here. Is it correct to analyze the data separately?

Can I use ICC(1) to measure the intra-reliability among subjects?

Thank you so much.

Best wishes,

Bing

I’m not sure what you mean by “separately.” You can use ICC(1) any time you want to assess the degree of consistency or agreement between multiple raters making ratings on the same targets. ICC is sometimes used differently in non-social-science contexts, which seems to be the context you’re talking about, and which I know less about; you will probably want to find a journal in your field that has done this and use that as an example for which specific procedures to follow.

Sorry for being unclear. I’m trying to ask if I need to analyze the data for 3 times to get 3 ICC values (i.e. mild, moderate & extreme).

I’ve posted my question online and would greatly appreciate if you can give me some suggestions.

Thank you.

http://stats.stackexchange.com/questions/92713/inter-and-intra-reliability-intraclass-correlation

Best wishes,

Bing

So, you haven’t described this very well, but based on your post on StackExchange and what you’ve said here, this is how I understand it: you’ve asked nine people to make nine solutions each – three replications within each type, which is actually a within-subjects condition assignment. I don’t know what your research goals are, but your design does not sound like a reliability study. It sounds like you are interested in between-person consistency – which is reliability – but also differences between solutions – which is not reliability. I’m not at all familiar with your field, so I have no idea what research question you could even be asking here. I think you are going to need to bring in a statistician (or at least someone within your field) onto your project to help you on this one – I don’t think I can do much more for you.

Hi Dr. Landers,

Thank you so much for this overview, it is very helpful! I would really appreciate your opinion on a specific example that I’m working with, if you have the time. For our research, we asked staff across 40 different schools to rate their school across a variety of dimensions. We are planning on identifying the highest and lowest performing schools based on these ratings. For each school, we have between 4 to 10 different professionals rating their school on 4 survey scales. We want to know if the professionals within each school rate their school similarly to the other professionals within that school (for each of these scales), so we really only have one case for each group of raters (the individual school). Is the ICC the appropriate statistic to examine this?

Thank you for your time,

Nicole

Yes, it is. However, you won’t be able to use the ICC tool in SPSS to calculate it because you have variable raters. You will need to either 1) run an ANOVA instead and calculate ICC by hand from the output or 2) use hierarchical linear modeling, which usually spits out an ICC(1,1) by default.
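For the by-hand route, a rough sketch follows. It uses the usual one-way ANOVA estimator for unbalanced groups, in which the number of raters per case is replaced by an adjusted average group size (called `k0` here). That adjustment, the function name, and the three tiny example schools are my additions for illustration, not something taken from SPSS output.

```python
def icc1_unequal(groups):
    """ICC(1) from a one-way ANOVA when groups have different numbers of raters.

    groups: one list of ratings per school (or other rated target).
    """
    g = len(groups)
    sizes = [len(x) for x in groups]
    n_total = sum(sizes)
    grand = sum(sum(x) for x in groups) / n_total
    means = [sum(x) / len(x) for x in groups]

    ss_between = sum(n * (m - grand) ** 2 for n, m in zip(sizes, means))
    ss_within = sum((v - m) ** 2 for x, m in zip(groups, means) for v in x)
    ms_between = ss_between / (g - 1)
    ms_within = ss_within / (n_total - g)

    # Adjusted average group size; equals k exactly when every group has k raters.
    k0 = (n_total - sum(n * n for n in sizes) / n_total) / (g - 1)
    return (ms_between - ms_within) / (ms_between + (k0 - 1) * ms_within)


# Hypothetical ratings from three schools with 4, 5, and 6 raters.
schools = [[4, 5, 4, 5], [2, 3, 2, 2, 3], [3, 4, 3, 4, 4, 3]]
icc_schools = icc1_unequal(schools)
print(f"ICC(1) = {icc_schools:.2f}")
```

With equal group sizes this reduces to the ordinary ICC(1,1) formula, which is an easy way to check the function against SPSS on a balanced subset of your data.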

Thank you for such a quick response to my question. I really appreciate your help!

Dear Dr Landers,

It is very refreshing to have the seemingly incomprehensible world of statistics explained so well, thank you so much!

I have looked through the many posts on this page, but would like, if you have the time, to just clarify something for my own research.

Without boring you too much, I am looking at applying a specific technology in a new way, and my first step is to see whether I get reliable results. The research is in its pilot stages, so I am the only “rater” so far and I took measures from my participants at the same time, every day, for 5 days to see how reliable my measures are.

I have run an ICC on this data using a “Two way mixed” design and I have analysed for absolute agreement.

I have seen differing answers as to whether I should report the single measures or average measures output from SPSS? I was wondering what your advice might be?

Many thanks and kindest regards

Catherine

Well… first of all, since you’re holding raters constant (because it is just you) and varying time, you’re really calculating a variant on test re-test reliability instead of inter-rater reliability. So that affects the sorts of conclusions you can draw here. You are really only examining the reliability of yourself making ratings over time. You can’t speak to the reliability of anyone else’s use of this scale, as you could with an inter-rater ICC.

Also note that if you would expect true score variation over time (e.g. if the scores from your scale reflect real underlying change in your technology, or whatever it is that you’re rating), then you shouldn’t use ICC this way.

As for the single vs. average measures, it depends what you want to do with your data. If you’re comfortable with only a test re-test reliability estimate, and you want to draw conclusions from some statistics calculated upon the mean of your 5 days, then you want average measures. If you are trying to figure out “if I was to make one rating on one day using this scale, how reliable would that rating be?”, you would want single measures. Given your description of this as a pilot study for a technology you developed, I suspect you want single measures (since I’m thinking 5 days is an arbitrary number of days).
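The two values are tied together by the Spearman-Brown formula, so it can help to see how one converts to the other. This is a small sketch assuming both numbers come from the same ICC model and type; the function names are made up for illustration.

```python
def average_measures(single_icc, k):
    """Reliability of the mean of k ratings, given the single-measures ICC."""
    return k * single_icc / (1 + (k - 1) * single_icc)


def single_measures(average_icc, k):
    """Reliability of one rating, given the average-measures ICC over k ratings."""
    return average_icc / (k - (k - 1) * average_icc)


# Hypothetical example: a single-day reliability of .50 measured over 5 days.
five_day = average_measures(0.50, 5)
print(f"average measures over 5 days = {five_day:.3f}")
print(f"back to single measures      = {single_measures(five_day, 5):.3f}")
```

The same conversion lets you ask how reliable the mean would be with a different number of days than the 5 actually collected.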

Dear Dr Landers,

A huge thank you for your very swift and thorough response, it really is very much appreciated.

I will go back to the analysis with some fresh eyes and work on test-retest reliability for now.

Thank you once again

Catherine

Dear Dr. Landers,

I have 40 raters who are using a 100 point scale to rate 30 speech samples. Someone suggested I use Cronbach’s alpha. I thought that technique was not appropriate for this dataset. I thought ICC would be appropriate? Is that right?

Thank you.

Well, you can technically think of alpha as a specific type of ICC.

To conceptualize the relationship between the two, you can think of your raters as items. The trade-off is that you lose a lot of the specificity that ICC allows, because alpha makes a lot of assumptions about items.

So for example, alpha is always consistency-based (not agreement), and I’d imagine that agreement is important for speech sample ratings. Alpha also gives you the reliability of your scale assuming you continue using all of the items, and it requires that you have a fixed number of items for each case. I believe alpha also assumes a fixed item pool (don’t quote me on that last bit though). All of those restrictions together make it equivalent to ICC(3,k) for consistency. So if your data don’t match any of those assumptions, i.e., if you wouldn’t use an ICC(3,k) for consistency, alpha would not be appropriate.
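If you want to convince yourself of that equivalence, here is a minimal numerical check on invented data (6 speech samples in rows, 3 raters in columns). The function names and ratings are hypothetical; the formulas are the standard ones for alpha and for ICC(3,k) consistency from the two-way ANOVA mean squares.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Coefficient alpha, treating each column (rater) as an item."""
    k = len(rows[0])
    item_vars = [variance(col) for col in zip(*rows)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def icc_3_k(rows):
    """Two-way, consistency, average-measures ICC from ANOVA mean squares."""
    n, k = len(rows), len(rows[0])
    grand = sum(map(sum, rows)) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(c) / n for c in zip(*rows)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ms_rows = ss_rows / (n - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows


# Hypothetical ratings on a 100-point scale.
samples = [(70, 65, 80), (50, 55, 60), (90, 85, 95),
           (30, 40, 35), (60, 70, 75), (80, 75, 70)]
alpha = cronbach_alpha(samples)
icc3k = icc_3_k(samples)
print(f"alpha = {alpha:.4f}, ICC(3,k) consistency = {icc3k:.4f}")
```

The two numbers should match to rounding error on any complete dataset, which is why alpha inherits every assumption of ICC(3,k) for consistency.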

Dear Dr Landers,

Thank you for your excellent explanation. I still have one question, though, that I haven’t been able to find the answer to yet: are there any reference values as to what constitutes a high, medium, or low ICC (I’m particularly interested in the 2,k/consistency model)?

I’m in the process of developing a scale. I wrote a pool of statements that I want to assess in terms of their content validity. To do so, I’ve asked 6 judges (experts in the field) to rate each statement on a scale of 1 to 5, 1 being “statement does not represent underlying construct at all” and 5 “statement represents underlying construct very well”.

To report on the judges’ reliability, I am thinking of using the ICC. From your information I gather that the ICC(2, k) (i.e., (2, 6)) model is the one to go with (type: consistency) (is this assumption correct?). But when can I label such an ICC as high (or medium, or low)? Do you know of any reference values (and/or perhaps a source I could consult/quote)?

Any advice would be much appreciated!

Sean

Well… two issues here. First, ICC is simply a type of reliability, so any standard for reliability you normally find sufficient is the same here. The classic reference (Nunnally) suggests 0.7 as an absolute minimum (i.e. 70% of the observed variance is “real”), but it’s a very loose standard and usually misinterpreted. See http://wweb.uta.edu/management/marcusbutts/articles/LanceButtsMichels.ORM.2006.pdf

Second, the reliability of the judges has nothing to do with the reliability of your scale, so I’m not sure why you want to know it. ICC(2,1) would tell you how much true score variance in judgment each judge captures, on average. ICC(2,6) would tell you how much true score variance the mean judge score contains. Either way, I think you’d want absolute agreement, not consistency – because you want to know “do judges actually agree on their ratings?”, not “do judges have similar patterns of agreement?”

Dr. Landers, Thank you for your quick reply. Yes, I saw Alpha as specific ICC type; that is why I wanted to get your take. I thought the individual samples were the items. Each rater rates 30 samples. The ratings will not be combined. Since I’m using a sample of raters, I can use ICC(2)?

-Steve

Hi Dr. Landers, You’ve expanded my mind 😉 Raters are items and should appear in the columns of my SPSS dataset, while the speech samples are in the rows. Thank you! -Steve

You’ve got it! Glad to help.

Thank you for your swift reply!

I’m aware that ICC has nothing to do with the reliability of the scale as such. Based on the judges’ ratings I will select the ‘best’ statements (based on their mean score and SE), after which the scale will be piloted. Follow-up procedures such as confirmatory factor analysis and Cronbach’s alpha will then be used to determine validity (unidimensionality) and reliability respectively.

I just thought it would be helpful to calculate the ICC on the judges’ ratings as a measure of their reliability. Just suppose their answers are all over the place, and there’s little consistency (or agreement), then obviously something’s wrong with the statements (or with the selection of judges).

You mentioned that absolute agreement would be preferable over consistency for this type of application. Could you explain that? I’m using the ICC (I think) to find out whether the judges are consistent in their ratings (5-item Likert type scale). Suppose the following situation: one or two of the judges are of the strict type, and they consistently feel that a 5 is too high a score in any case (‘There’s always room for improvement’, you know the type…). But their rating patterns would be consistent with other less strict judges, wouldn’t ‘absolute agreement’ ICC then give a distorted picture?

I’m really interested in your take on this.

Thanks

Sean

Ah, I see. I guess I am surprised because typically content validity is decided by consensus judgment – the addition of reliability estimates at that stage doesn’t add much (IMO) because you are going to collect psychometric validation evidence later anyway. I suppose it doesn’t hurt though.

I was thinking absolute agreement because of your scale: up to “statement represents underlying construct very well”. If one judge rates “very well” and another rates “moderate” but their rating patterns are similar, you’d make the conclusion of high consistency in the presence of low agreement. I’d think you want a mean of “4” (for example) to be interpreted as all judges being equally approving, i.e. that “well” always means “well.” But if you think (and can argue in your writeup) that this is a situation where individual judges could be “strict” in making such judgments, then consistency is the better choice.

Of course, the progressive way to go about this would be to calculate both and see if they agree. If they do agree, no problem. If they don’t, figure out if it really is a by-judge severity effect (what you are arguing) or something else.
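As a sketch of what “calculate both” looks like, the example below computes single-measures consistency and absolute agreement from the same two-way ANOVA mean squares. The data are invented so that one judge is exactly 1 point stricter than the other two on every statement; the function name and ratings are hypothetical.

```python
def two_way_iccs(rows):
    """Return (consistency, absolute agreement) single-measures ICCs.

    rows: one tuple of ratings per statement, one column per judge.
    """
    n, k = len(rows), len(rows[0])
    grand = sum(map(sum, rows)) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(c) / n for c in zip(*rows)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ms_rows = ss_rows / (n - 1)                  # between statements
    ms_cols = ss_cols / (k - 1)                  # between judges (severity)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))

    consistency = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    # Absolute agreement also penalizes mean differences between judges.
    agreement = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    return consistency, agreement


# Hypothetical 1-5 ratings: the third judge is always 1 point stricter.
statements = [(5, 5, 4), (4, 4, 3), (3, 3, 2), (5, 5, 4), (2, 2, 1)]
consistency, agreement = two_way_iccs(statements)
print(f"consistency = {consistency:.2f}, absolute agreement = {agreement:.2f}")
```

Here consistency is essentially perfect while absolute agreement is noticeably lower, which is the signature of a pure by-judge severity effect. If the two diverge in your data without an obvious strict or lenient judge, something else is going on.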

Thank you very much for your feedback. I’m definitely going to calculate both, and screen the data thoroughly to see what is going on exactly.

Hello Dr. Landers,

Your post has been incredibly helpful for me! However, I still have some doubts and was hoping you might have the answer. I need to calculate the interrater reliability on a checklist assessing frailty using a dichotomous scale (yes-no answers). A total of 40 patients have been included and each patient was seen by 2 of 3 possible raters.

I am not sure if I should use a one-way or two-way random ICC, and if I should calculate ICC per item or on the total score of the checklist?

Also, when performing the one-way (and two-way) ICC in SPSS, I get an error saying my N=0. Could this be due to the fact not all three raters assessed all 40 patients?

Any help would be much appreciated!

-Eveline

If each was seen by 2 of 3 possible raters, you don’t have a consistent rater effect, which means you need ICC(1).

If you want to use ICC, you should do it on either the mean score of the checklist, using 0 and 1 as “no” and “yes”, or on the sum. If you want to analyze individual questions, I would not use ICC – your estimates will be biased downward pretty badly.

In SPSS, you’re going to want only 2 columns of rater data associated with each case. You should not leave the data in the original 3-column format, or SPSS will assume you have missing data (which you don’t, based upon your design). So you’ll need to collapse your 3 raters into “first rating” and “second rating” columns, so that every case has exactly 2 ratings in exactly 2 columns.
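The restructuring itself can be done in any tool before importing into SPSS. Here is a minimal sketch in plain Python; the column names (`A`, `B`, `C`, `rating1`, `rating2`) and the scores are hypothetical, with `None` marking the rater who did not see that patient.

```python
# Hypothetical checklist sum scores in the original 3-column layout.
wide = [
    {"patient": 1, "A": 7, "B": 8, "C": None},
    {"patient": 2, "A": None, "B": 5, "C": 6},
    {"patient": 3, "A": 9, "B": None, "C": 9},
]


def collapse(row, raters=("A", "B", "C")):
    """Keep the two ratings that exist, dropping rater identity."""
    scores = [row[r] for r in raters if row[r] is not None]
    if len(scores) != 2:
        raise ValueError(f"patient {row['patient']} does not have exactly 2 ratings")
    return {"patient": row["patient"], "rating1": scores[0], "rating2": scores[1]}


collapsed = [collapse(row) for row in wide]
for row in collapsed:
    print(row)
```

Because the one-way model ignores which rater produced which score, it does not matter which of the two ratings ends up in the first column.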

Hi,

I have some Likert scale questions, and a sample rating them. Each Likert scale question has 5 items. How should I calculate inter rater agreement?

So far I saw that Cronbach’s alpha is for item internal consistency, does this mean that it is not suitable for inter rater agreement and I should use ICC?

Thank you

If you just have a sample responding to scales about themselves (or their opinions), you don’t have raters as defined here.

Thank you very much for your quick reply! I’ve paired my data now so an ICC is calculated for each rater pair (A-B, A-C, B-C) on the sum score to avoid missing data.

I am writing a scale validation manuscript and the scale can be completed by either the patient or by an informant (usually a family member) on the patient’s behalf. The items are the same for whomever completes the form and so I would like to use both the patient and informant scores (n=4500+) in all analyses but I think I need to provide justification for this.

There are an additional 2700 patients who completed the scale themselves and also had an informant complete the scale, so each of these patients has two scores for the scale. If I were to perform an ICC with these patients’ two scores (comparing the scale total for the patients to the scale total for the informants) and return values of .4 or better (Shrout & Fleiss, 1979), could that be considered justification to use both the patient and informant scores (that I mentioned in the first paragraph) in the scale validation analyses?

For clarification purposes- I would only be using the 2700 patients’ data for the ICC purposes, not combining it with the other data set.

If this is something I can do, would the ICC be a one-way random with average measures? Is it inappropriate to use a two-way mixed model/consistency?

Thanks!

This is sort of a weird use of ICC. What you’re really doing with ICC is saying, “I have a sample of raters drawn from a population of raters – how well do they agree?” So by calculating ICC, you are already assuming that all of your raters come from the same population. If your ICC(1,k) is only .4, that means only 40% of the observed variance is true, which is quite low.

I don’t know your research literature, but the question I’d likely ask is, how do you know both sources are telling you the same information? How do you know that the quality of information isn’t better for one source versus the other? You lose any of that uniquely good information about a single rater source when you combine them.

If I were writing a paper, I’d conduct my analyses all three ways – using only informants, only patients, and with the average of both. If all three sets of analyses agree, then you have a strong case that rater source doesn’t really matter. If they don’t agree, you need to figure out why.

You definitely want ICC(1). ICC(2) requires identical raters, i.e. the same two people rating every case.

Hi Dr. Landers,

Your post is very useful for my research. Thanks a lot. But I have some questions here.

I’m doing multilevel analysis, which is to analyse student achievement in different schools. My data consist of 2500 students in 92 schools, and I have 16 items in 4 variables.

To compute ICCs, do I need to use the aggregated data (for each school) or each of the variables?

Thank you.

Since you have 4 variables you are presumably using in further analyses, you would calculate ICC for each of the 4 variables. If you’re looking at student ratings on their schools, you probably have different numbers of students per school. In that case, SPSS is not ideal for calculating ICC automatically, because it requires equal numbers of raters for each case. I would recommend instead calculating it by hand or by using a multilevel analysis program (like HLM).

Yes, I am doing multilevel structural equation modelling. I’m done with conventional SEM and now working on multilevel analysis. I just calculated ICC by hand. Thank you for your advice. I really appreciate it.

Thanks.

Hi,

I’m so sorry if this has already been asked and answered, but should you report a negative ICC? I thought the value should be between 0-1, but certain packages such as SPSS can report an out of range ICC?

Many thanks!

Hi,

Perhaps you can share some insight. For intra-rater reliability, ICC and Cohen’s Kappa, I don’t know if these tests are testing the null hypothesis that the two measures are the same or are not the same. A minor technicality which makes a big difference. If an ICC comes back with a p<0.001 and a coefficient of 0.8, does that support that there is a statistically significant difference between measures? Same question for Cohen's Kappa?

Richard

I’m not sure what you’re referring to with “these tests,” but if you mean the significance test being reported by SPSS, those are tests against a null of zero – they tell you that (for example), the coefficient alpha or ICC you observed is unlikely to have come from a population where the mean alpha/ICC was equal to 0. If you want to compare two ICCs to each other, you can do so by comparing their confidence intervals – if they do not overlap, they are statistically significantly different (unlikely to have been drawn from the same population, at least given the sample size you are using). However, it’s important to remember that statistical significance (or lack thereof) is insufficient evidence to conclude if two measures are “the same” or not. You would also need to investigate latent factor structure at a minimum, and preferably also explore the nomological net of each.

Thank you Professor Landers! Really good explanation on ICC for interrater agreement. Unfortunately I’m not too sure which ICC I should be using for comparison between twins. Hope you can help me out and point me in the right direction. Thank you very much!

Yi Ting

If you are talking about examining inter-twin reliability with twin pairs, all of your cases would involve unique sets of twins, which means there is no consistency between raters (you are assuming each twin is a randomly drawn rater from a population of twins), so you would use ICC(1).

Thanks for the reply! I have another question for you. I am under the impression that for ordinal data, that has not been assigned weights, an ICC is not an appropriate test for inter-rater reliability. Is this correct?

ICC relies on the normal distribution, which does not apply to ordinal data, so that is correct. You would most likely need some type of kappa.

Thank you! I’m now trying to figure out between using Weighted Kappa and simply Kappa. My data is not normally distributed and is ordinal. I’m comparing the inter-and intra-rater reliability of 3 different scales (similar to Likert scales but based on skeletal maturation in radiographs) based on three different locations (skull base, teeth and cervical spine). One structure is based on a 3 stage maturation, another location is based on a 4 stage maturation and the third location is based on a 6 stage maturation. The observations for the inter-rater reliability are done using the same methodology with only 1 other observer (Observer B). I had done Kappa measures for all 3, however, I was recently told that the 6-stage maturation has so many steps that it merits a weighted Kappa and that the other 2 indices may not. I’d like your professional opinion! Thanks Richard (By the way, Great name)

I’m less familiar with kappa than ICC, and I don’t know anything about stages or maturation or whatever your field of study is… so I’m not sure how helpful I can be here.

I will say that weighting kappa is useful in any context where you have reason to claim that not all disagreements between ratings are equally indicative of overall disagreement. In nominal measurement, that doesn’t really come up (one person says “A”, the other says “B” – they disagree). But in ordinal measurement, it can be useful (if one person says “1st”, the other saying “2nd” agrees more than the one who says “3rd”).

I don’t really see any reason that the origin of the data makes a difference to the type of kappa you’d want to use, since all three of your measurements appear to be on the same kind of scale (at least if I’m understanding you correctly).

The primary downside to weighted kappa is that you need to create the weighting matrix yourself (or implicitly trust a computer program to decide on your weighting matrix for you, which I wouldn’t do). Then you need to quantify things like “how much worse is a 2-step disagreement than a 1-step disagreement?” which can involve a bit of guesswork. There may be standards for this these days in some contexts, but I don’t use kappa enough to know what they might be. It is simpler to just use kappa, but it is going to give you a more conservative estimate (because all disagreements are “100%” disagreements).
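To make the weighting decision concrete, here is a sketch that computes kappa with three different disagreement-weight schemes: none (plain kappa), linear, and quadratic. Linear and quadratic are two common conventions, not a recommendation for your field, and the ratings and function name are invented for illustration.

```python
def weighted_kappa(r1, r2, categories, weight="linear"):
    """Cohen's kappa with disagreement weights: 'none', 'linear', or 'quadratic'."""
    index = {c: i for i, c in enumerate(categories)}
    m, n = len(categories), len(r1)

    # Observed proportion of cases in each (rater 1, rater 2) cell.
    obs = [[0.0] * m for _ in range(m)]
    for a, b in zip(r1, r2):
        obs[index[a]][index[b]] += 1 / n
    p1 = [sum(row) for row in obs]                              # rater 1 marginals
    p2 = [sum(obs[i][j] for i in range(m)) for j in range(m)]   # rater 2 marginals

    def w(i, j):
        if weight == "none":
            return 0.0 if i == j else 1.0   # every disagreement counts fully
        if weight == "linear":
            return abs(i - j)               # a 2-step miss is twice a 1-step miss
        if weight == "quadratic":
            return (i - j) ** 2             # a 2-step miss is four times a 1-step miss
        raise ValueError(f"unknown weight scheme: {weight}")

    observed = sum(w(i, j) * obs[i][j] for i in range(m) for j in range(m))
    expected = sum(w(i, j) * p1[i] * p2[j] for i in range(m) for j in range(m))
    return 1 - observed / expected


# Hypothetical 6-stage ratings of 10 radiographs; all disagreements are 1 stage apart.
observer_a = [1, 2, 3, 4, 5, 6, 2, 3, 4, 5]
observer_b = [1, 2, 4, 4, 5, 5, 3, 3, 4, 6]
stages = [1, 2, 3, 4, 5, 6]
kappas = {s: weighted_kappa(observer_a, observer_b, stages, s)
          for s in ("none", "linear", "quadratic")}
for scheme, value in kappas.items():
    print(f"{scheme:>9}: {value:.3f}")
```

Because every disagreement in this made-up example is only one stage apart, plain kappa is the lowest of the three and the weighted versions are higher, which is the “more conservative estimate” trade-off described above.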

Thank you! That is very insightful and once again hits the nail on the head for answering my question.

1) What are the shortcomings of using Kappa in comparison to ICC?

2) Do you know any indications to use more than one type of inter-rater reliability (i.e. ICC and Kappa)?

3) I can’t seem to understand when to use Cronbach’s Alpha. Can you provide an example of when it would be appropriate to use?

Kind regards,

R

You might have better luck with a psychometrics course or textbook. I am really only scratching the surface here.

1) I don’t think they can be compared directly like that. If you have interval or ratio level data, you might consider ICC. If you have nominal or ordinal level data, you might consider kappa (or a variant – there are many). There’s no situation where you could use both that I can think of.

2) If you think you need to use more than one, you’re probably not using the right one. The basic concept of reliability is that you are capturing the percentage of observed variance that is “true”. You must choose a type of reliability that makes the right assumptions about what is allowed to vary and what isn’t. In the case of both ICC and kappa, you are assuming that raters don’t fluctuate over time, that raters are all drawn from the same population, and that the variation between raters is “error.” If those aren’t all true, you would want a different reliability estimate (sometimes requiring a different data collection method entirely).

3) That’s a complicated question, but it is most commonly used to assess the reliability of survey scales. Cortina (1993) can explain it better than I can: http://psychweb.psy.umt.edu/denis/datadecision/front/cortina_alpha.pdf

I have appreciated reading the discussion on ICC. I was wondering though, what if there is one consistent rater, and a series of other raters to confirm the reliability of this rater. Would this be calculated as ICC(2,1), ICC(3,1) or something else entirely?

Also would the variability of the data affect the calculation of ICC? The distances I am looking at getting consistent ratings on range from 0-1800 meters and plugging them into SPSS I get an ICC of 1.

Thank you!

Technically speaking, you don’t meet the requirements of any ICC; however, ICC(1) is commonly used in this situation. The reason is that ICC(1) assumes each rater is a random draw from a population of raters; since you may have a rater effect (if your consistent rater is lenient or harsh, for example), ICC(1) will probably be a conservative estimate of reliability.

If you are getting an ICC of 1, that implies you have 100% agreement, in which case you don’t need ICC. You can just report “there was 100% agreement”.

Hi Dr. Landers,

I hope you’re doing well. Thank you for your previous guidance with the ICC situation for my dissertation last year, it was very helpful. You may remember, I conducted an N=1 study where I administered therapy on a participant and was then rated by 2 raters on how well I adhered to the therapy manual. You’d told me I couldn’t use the ICC to describe the IRR between the 2 raters in that scenario because there was only 1 ratee, me. My dissertation chair disagreed, but that’s another story…

I have now completed a follow-up study which repeated the same N=1 design. I used the same adherence rating system, where I had 2 raters rate my adherence to the therapy manual again. I’m wondering how I can describe the IRR between the 2 raters in this study ? If I can’t use the ICC value because there’s only 1 ratee and 2 raters, then what test, if any, can I use to describe the IRR between the 2 raters?

Each rater rated the same 3/10 therapy sessions, chosen at random. Their ratings are here, in case it helps:

Rater 1 Rater 2

How adherent I was in Session 4 0.1875 adherent 0.22159

How adherent I was in Session 5 0.17045 0.21591

How adherent I was in Session 7 0.10227 0.15909

You can see Rater 1’s ratings are consistently 0.04-0.05 units lower than Rater 2’s. Is that the only way I can describe their ratings, or is there another test I can use to formally describe their ratings (e.g., a simple correlation)? The only ratings data I have is what you see here.

Thank you so much,

Dave Juncos

Sorry, the formatting was off in my previous email. Here’s the ratings:

Rater 1

0.1875 adherence for Session 4

0.17045 adherent for Session 5

0.10227 adherent for Session 7

Rater 2

0.22159 adherence for Session 4

0.21591 adherence for Session 5

0.15909 adherence for Session 7

Dave

It’s not that you “can’t” use ICC with these sorts of data; rather, they don’t represent what you probably want them to represent. ICC with N=1 means that you can only generalize to yourself, because you are not a random sample of all possible rating targets. As long as the only question you are concerned about is how well you as an individual can be rated, there is no problem. But that is not really “inter-rater reliability” in the sense that most people would want to know it.

Adding raters doesn’t change this. You don’t have any variance in ratees; thus you violate the assumptions of all reliability tests if you are trying to generalize across other ratees. It is like measuring something at Time 1 and trying to generalize to Time 2. Yes, it _might_ generalize that way. But you have no way to make that claim statistically.

If you’re just interested in comparing raters, that is a different problem. For that, you need variance in raters. You could then use any type of statistical test whose assumptions you are comfortable with, but with N=2, that is somewhat dangerous in terms of the validity of your conclusions. In terms of alternative approaches, a correlation between raters would treat your raters as populations, with a sample of paired sessions. That may or may not be useful to you, interpretively.
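To make the correlation option concrete, here is a minimal sketch using the six ratings posted above. It computes the per-session difference between raters and a Pearson correlation across the three paired sessions. The helper name `pearson_r` is just for illustration, and with only three pairs the result is descriptive at best.

```python
from math import sqrt

# Adherence ratings for Sessions 4, 5 and 7, as posted in the question above.
rater1 = [0.1875, 0.17045, 0.10227]
rater2 = [0.22159, 0.21591, 0.15909]

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Per-session difference (Rater 2 minus Rater 1) and its mean.
differences = [b - a for a, b in zip(rater1, rater2)]
mean_difference = sum(differences) / len(differences)

r = pearson_r(rater1, rater2)
print(f"mean difference = {mean_difference:.4f}, r = {r:.3f}")
```

The correlation only describes whether the two raters ordered the sessions similarly; the mean difference describes how far apart their ratings sit.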

Yes, that makes sense. If both raters are only rating me, then you can’t generalize their pattern of rating, i.e., how consistently or how randomly they rated me, to other ratees. I suppose if I’m only concerned with how well I can be rated, then this information is still useful. I guess it confirms the raters were rating me in a consistent way throughout their rating process. Which is useful because… it suggests they were paying attention throughout their ratings. I can’t think of how else that information is useful though.

But does it confirm I was adequately adhering to the manual? No. What I’ll need to do is simply ask my two raters for their subjective impression of my adherence to the manual. That will give me the information I need most.

You may’ve noticed, my adherence ratings were quite low (they ranged from .10227 to .22159). The problem is the adherence scale they used is too inclusive of ALL possible interventions for this particular therapy. Of course, it’s not possible to administer ALL types of therapeutic interventions from this therapy manual in EACH session. Rather, a good therapist will administer only a handful of interventions each session – there simply isn’t time to administer ALL types of interventions in just one session.

Thanks again for your helpful guidance! It’s always appreciated.

-Dave Juncos

Hi there,

I have read most of the comments on your page so far and still have questions about whether ICC is appropriate for me or not.

I have had children rate 5 dimensions of their quality of life using a scale which on its own was deemed reliable for the population. Their parents then used the same scale to rate their children’s quality of life. Again, Cronbach’s alpha was good.

I am now using ICC to ascertain whether both children and parents were in agreement about their quality of life.

For each dimension I have entered each child’s and parent’s average score. I have run an ICC Two way random and absolute agreement (although I am still a little unsure if the two way random was correct).

In reading the average measures box most of the scores are ok but I have two ICCs which have come out as negative values. Am I right in just assuming there was very little agreement between parent and child over this dimension or have I made an incorrect assumption somewhere? Your advice on this is hugely appreciated!!!

If you always have consistent child-parent pairs, I would probably just use a Pearson’s correlation for ease of interpretation. However, that only assesses consistency – so if you’re interested in absolute agreement, ICC is probably your best option.

You don’t have a consistent population of raters – you have a random sample of raters. So you should be using ICC(1).

ICC is not actually calculated as you would think a reliability estimate would be (i.e., by directly dividing true variance by observed variance). Instead, it estimates the variance attributable to rating targets by subtracting the within rating target mean square from the between rating target mean square as the first step of its calculation. So if your within rating target mean square is larger than your between rating target mean square, you end up with negative values – that will occur when there is more variance within rating targets than between rating targets (which you could interpret as “very little agreement”).

In your case, it means that parents and their children differ from each other (on average) more than each child differs from the other children (on average).
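A small sketch may help show where the negative value comes from. It uses the one-way formulas from Shrout & Fleiss, ICC(1,1) = (BMS − WMS) / (BMS + (k − 1)·WMS) and ICC(1,k) = (BMS − WMS) / BMS, on complete data. The function name and both data sets are hypothetical, made up only to illustrate the two situations.

```python
def one_way_icc(ratings):
    """Return (ICC(1,1), ICC(1,k)) for a complete table of ratings.

    ratings: list of rows, one row per rating target, one column per rater.
    """
    n = len(ratings)       # number of rating targets
    k = len(ratings[0])    # number of raters per target
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Between-target and within-target mean squares from a one-way ANOVA.
    bms = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    wms = sum((x - m) ** 2
              for row, m in zip(ratings, row_means)
              for x in row) / (n * (k - 1))

    single = (bms - wms) / (bms + (k - 1) * wms)
    average = (bms - wms) / bms
    return single, average

# Hypothetical pairs that disagree with each other far more than the
# pair means differ from one another: within > between.
disagreeing = [[1, 5], [5, 2], [2, 4], [4, 2]]
# Hypothetical pairs that agree closely while the pairs differ from each other.
agreeing = [[1, 2], [3, 3], [4, 5], [2, 2]]

print(one_way_icc(disagreeing))  # both values negative
print(one_way_icc(agreeing))     # both values positive
```

Because the numerator is BMS − WMS, the sign of the ICC is decided entirely by which mean square is larger.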

Thank you so much for the speed of your response. In the case of the negative answer would I still report this or would I simply explain here about the between group difference being larger than the within group difference and not report the coefficient.

That depends on the standards of your field. If it were me, I would probably just report that the ICCs were negative and the implication (parents and children differ more from each other than children differ from other children on that scale).

Hi Dr. Landers, I have bought your book, “A step by step…” but I cannot find the part about “Computing Intraclass Correlations (ICC) as Estimates of Interrater Reliability in SPSS”. I would like to use your explanation about ICC and reliability as a reference in my manuscript and I thought it was in your book. Have I missed that part or is it not included in the book? Thanks for the explanation on the website, it’s really great and you made me understand the analysis in SPSS and the theoretical background :).

Sincerely Charlotta

I’m afraid the book doesn’t have anything about ICC! It’s really just intended as a one-semester intro to stats at the undergrad level. This page is the only material on SPSS. I’ve thought about converting it into a journal article and expanding it a bit, but I haven’t done it yet!

Thank you soooooooooooooooooo much Dr. Really need this for my presentation next week! God bless.

Hi Dr. Landers,

I have a study that had 60 participants rate the content of 30 different articles. Each article was rated by 2 different participants (i.e., each participant rated only one article). The articles were rated on 4 questions, but I would like to use the mean of the 4 items. Am I correct to use ICC (1, 2)? And should my SPSS datafile have one column (the mean of the 4 items) and 60 rows (one for each rater)?

Thank you for your helpful article!

It’s definitely ICC(1), although whether you need the single or average measures depends on what you’re going to do with that number. Your data file would have two columns (rater 1 and rater 2), one line for each case (30 rows), consisting of two means (mean for rater 1, mean for rater 2).

That makes things very clear. Thank you!

I will be using the means for subsequent analyses, so I believe I am interested in consistency, and that is why I planned on using the means. Which I believe is ICC (1, 2).

If you want the reliability of the means because you are using them in subsequent analyses, that only implies you need ICC(1,2) instead of ICC(1,1). Consistency vs. agreement is a different issue.

Ah…my mistake, thank you for clarifying. Yes, that is what I am hoping to do. Thanks for your quick and clear responses! It was a great help!

Dear Richard,

thank you for your great resource and willingness to explain.

Would you please consider if my following reasoning is correct?

I have a sample of n raters. Each rater has to evaluate m products along q different aspects. As my goal is to evaluate if the detected “mean” value for each aspect and product is reliable, I have to understand whether raters have reached a sufficient level of inter-rater agreement.

So far, I (mis?)understood that I should apply ICC(2,k) on an n x m matrix of data, for each of the q aspects. If this is correct, which threshold would I conventionally consider sufficient to say, “OK, the raters agreed”?

Would equating ICC (average measures) to an agreement coefficient (like Krippendorff’s alpha) be plainly wrong?

Symmetrically I could also calculate if the n raters agree on the q aspects for each of the m products. And probably this would make more sense.

I am sorry if my ideas are still a little bit confused: could you help me clarify them with respect to ICC and your valuable resource? Thank you.

FC

That sounds right to me – SPSS-wise, you’d want n columns and m rows, replicating that approach for each q. That is inter-rater across-product agreement for each aspect, which is most likely what you want to know.

Flipping it so that you had n columns and q rows would give you inter-rater across-aspect agreement for each product. Q columns and m rows would give you inter-aspect across-product agreement for each rater. M columns and n rows would give you inter-product across-rater agreement for each aspect. Any of these (or any variation) might be useful information, depending on what specifically you want to know about.
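The layouts described above are easy to mix up, so here is a rough sketch of how each one could be sliced out of a single set of ratings. The nested list `ratings`, its sizes, and the function names are all hypothetical; the only rule carried over from the discussion is that columns hold whatever you want agreement among, and rows hold the cases that agreement is assessed across.

```python
# ratings[r][p][a] is the score rater r gave product p on aspect a.
# Hypothetical sizes: n = 3 raters, m = 4 products, q = 2 aspects.
n, m, q = 3, 4, 2
ratings = [[[100 * r + 10 * p + a for a in range(q)] for p in range(m)]
           for r in range(n)]

def products_by_raters(aspect):
    """m rows (products) x n columns (raters): inter-rater, across products."""
    return [[ratings[r][p][aspect] for r in range(n)] for p in range(m)]

def aspects_by_raters(product):
    """q rows (aspects) x n columns (raters): inter-rater, across aspects."""
    return [[ratings[r][product][a] for r in range(n)] for a in range(q)]

def products_by_aspects(rater):
    """m rows (products) x q columns (aspects): inter-aspect, across products."""
    return [[ratings[rater][p][a] for a in range(q)] for p in range(m)]

def raters_by_products(aspect):
    """n rows (raters) x m columns (products): inter-product, across raters."""
    return [[ratings[r][p][aspect] for p in range(m)] for r in range(n)]

table = products_by_raters(aspect=0)
print(len(table), "rows x", len(table[0]), "columns")
```

Each function returns one table per value of its argument, which matches running the analysis once per aspect, product, or rater.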

There’s not really an agreed-upon “threshold”, but the level of reliability that is considered “enough” is going to vary pretty dramatically by field. I would say that the traditional reliability cutoffs – .7 or .8 – are generally “safe” as far as publishing is concerned. Below that, it’s going to vary a bit.

Thank you Richard, for the prompt and clear reply (I admire your diligence and availability). Your answer wrt the threshold opens a reflection that maybe has been already done here, in that case I apologize and accept reference to previous comments. You cited the traditional threshold for interrater agreement (from Krippendorff on): 70% and I don’t have reason to doubt this could also apply to ICC, although I lack some knowledge to understand its plausibility: I trust you here. However the Wikipedia page on ICC as means “in assessing conformity among observers” is pretty vague and avoids speaking of “agreement” (purposely?). It seems that such use of ICC would require observer exchangeability, which is a tough assumption to make, or verify through a test-retest or a two independent sample comparison. Moreover, if I really wanted to distinguish between inter-observer and intra-observer variability, as I want to detect the former one, I should focus on n raters and 1 case, which is a case where ICC is not applicable. Any comment on that would be highly welcome. Thank you very much.

It’s important to remember that .7 has been broadly interpreted as a critical threshold for reliability, and also to remember that all reliability is the same, given a particular measurement. Any given test and population (together) has only one true score reliability – however, the method by which you attempt to assess that reliability makes certain assumptions which may lead to its mismeasurement. So if you say “.7 is enough” for one type of reliability, you’re saying it for all of them. Personally, I would say that naming such thresholds is not meaningful, in general. It is only a shortcut, and sometimes, a harmful one.

Wikipedia is generally not a great source for specific details about statistics. ICC can be used for whatever you want to use it for. It can be used to assess either consistency or agreement, depending on which is meaningful in your situation. You would be much better off reviewing original sources.

You are correct that observer exchangeability is required for ICC(1) or ICC(2). These both assume a population of raters from which you have randomly sampled. In practice, that rarely happens. ICC(3) assumes a population of raters, which is more unusual and less useful conclusion-wise. In most cases, researchers have a convenience sample of raters and must assume they are essentially a random sample to meaningfully use ICC (or any other assessment of reliability). This is a necessary assumption given current assessment approaches in the social sciences.

If you want to assess intra-observer consistency, you should not focus on one case, because then you do not have a random sample of cases over which to assess that statistic. The reliability you obtained would be specific to that one case (a population), which is likely not the intended target when you calculated it.

My English teacher always said to me to avoid “the former” as a pronoun. Actually I was referring to INTER-observer consistency in my last comment. Thus, is your last passage related to that kind of consistency or to the INTRA-observer one? The rest is very clear to me, thank you again for the clarification.

I understood. My response is referring to both – only the last paragraph references intra-observer specifically, because you made an inaccurate statement about it (there is never a situation that I can imagine, intra- or inter-, where you would want to examine a single case – assessment of either requires multiple independent observations by each rater).

I also understood, thank you Richard. As a matter of fact, I said that also to evaluate the inter-rater agreement in 1 case (among n raters) would be interesting (if possible at all), but I admitted that ICC was not applicable. Maybe just the variance, or a Chi-square test on response uniformity could help, or some other measure of inter-rater agreement, but I should check their assumptions to see if they are applicable in such an extreme case. Thank you again!

Dr. Landers,

I would be appreciative if you could confirm that I am using the proper ICC. I have 200 video clips of preschool children interacting with their mothers. Each clip is rated from 1-7 on how compliant the child is (7 = more compliant). 132 of the clips are coded by a single rater (66 by Sarah and a different 66 by Pablo). The remaining 68 clips are coded by both Sarah and Pablo for purposes of assessing interrater reliability. As much as I appreciate their work, I consider my coders to be a random selection of available coders in the world. Clearly not all 200 clips are coded by both coders.

I believe I should use ICC(2, 1) or in SPSS lingo Two-way Random, single measure. It is single measure because when I ultimately analyze these data, such as by correlating child compliance scores with parenting measures, I will not use the average rating from my two coders (since 132 of my clips were not even coded by two coders). I will use a single rating for all 200 clips.

How does this sound to you? If it is correct, do you have any advice on how to pick which coder’s rating to use for the 68 clips where I have two coders’ ratings? Does randomly picking a coder for each of the 68 clips sound right?

THANK YOU!!!!

It’s important to remember that inter-rater reliability can only be determined with multiple raters. So the only cases from which you can determine ICC are the 68 for which you have two raters. So you don’t need to pick anything. On those clips, you are correct: you should calculate ICC(2,1).

But that number may not mean exactly what you think it means. Usually, we calculate ICC on our full sample. In such cases, ICC(2,1) will be an estimate of the reliability of one rater across your sample, but you will only be calculating reliability on a subset of that sample. Thus you must assume that 1) those 68 cases differ only randomly from the remainder of your cases and 2) your raters varied in their rating processes between those 68 cases and the full sample only randomly. Hopefully you selected those 68 cases at random, since that is what you need in order to argue that using a subset doesn’t bias your reliability estimate (or any later stats), but it can be subtle. For example, if you had them rate the 68 cases first, you might have over-time biases (ratings may become more accurate with more practice).

Once you have ICC(2,1), you have a lower-bound estimate of reliability for the full sample if you calculate the means of your 68 cases and use those values as your estimates for those 68 cases alongside the 132 estimates you already have. Alternatively, you can randomly select one of the two coders for those 68 cases and use that, in which case you have an accurate estimate of reliability. But you will also be attenuating any later statistics you calculate (lower reliability means smaller correlations, smaller effects, etc). So I would use the means.

Dr. Landers. Thank you so much for your reply. To clarify, when I asked about picking coder ratings, I did not mean for calculating ICC. Clearly one needs two (or more) sets of ratings to calculate ICC. I meant for subsequent analysis between the variable that was initially used in an ICC and other variables. For example I might code compliance and subject it to ICC and then correlate compliance scores with measures of parenting.

Also, I would absolutely randomly pick which clips are coded by two raters to calculate ICC as I agree that there may be time or other effects.

With respect to your last paragraph. I was under the impression that if only a subset of my video clips were rated by two people, that lumping their average scores for those clips with the single scores of the clips only rated by one person would cause problems. You seem to be suggesting I do that since average scores by two raters are desirable. Is that right?

Finally, I guess I’m confused by your statement that lumping the average scores of two raters with the single scores will produce a lower-bound estimate of reliability. I’m not sure what you mean by lower bound. Are you also saying that randomly picking one of the two raters’ scores is more accurate? Thank you so much again!

Assuming the assumptions of ICC are met, adding additional raters always increases reliability. Where you get into trouble (what I assume you mean by “problems”) is if those assumptions are not met, e.g., if your sample of raters are not a random draw of a population of raters. But in that case, you couldn’t use solo ratings either. The only situation where you might not use the means is if you had Rater 1 rate every case and Rater 2 rate a subset – in which case, you might just stick with Rater 1 for interpretive ease.

You can actually see the reliability of your mean scores by looking at ICC(2,2) in those analyses. It will be higher, and potentially much higher.

Thus, if you combine mean ratings and single ratings, your ICC(2,1) will be an underestimate of the reliability of your sample, which varies by case. But you will still get the effects of that increased reliability in subsequent analyses. Some types of analyses (like HLM) actually give you an overall estimate of reliability regardless of differences in the number of raters, but you need at least 2 raters for every case to do that (most useful when, for example, you have 3 to 5 raters of every case, which is common when studying small groups, teams, etc.).

I thought that the distinction between mean rating and single ratings was not about whether you have single or multiple raters, but about whether single scores from multiple raters or multiple scores (aggregated into a single score) from multiple raters are used to estimate interrater reliability using ICC.

The statistician, Andy Field writes:

So far we have talked about situations in which the measures we’ve used produce single values. However, it is possible that we might have measures that produce an average score. For example, we might get judges to rate paintings in a competition based on style, content, originality, and technical skill. For each judge, their ratings are averaged. The end result is still ratings from a set of judges, but these ratings are an average of many ratings.

Do I have that wrong? I guess I’m having trouble mapping what you said about ICC(2,1) and ICC(2,2). Thanks again!

No, it is about what you want to generalize to, which is what Field is saying. Single measures is “what is the reliability of a single rater?” and average measures is “what is the reliability of the rater mean?” Functionally, when calculating ICC, you use ANOVA to determine the reliability of the rater mean and then use something akin to the Spearman-Brown prophecy formula to determine what that reliability would have been if you’d only had one rater. Once you know one, you know the other.

Last question. If you have 2 raters rate 50 video clips, then which rater (of the two) is the single measures referring to? Thanks!

Neither. You have determined the reliability of their mean rating and then mathematically adjusted that number down, under the assumption that both raters provide the same quality level of information.
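The adjustment between average and single measures discussed above can be sketched with the Spearman-Brown formula. Treat this as an illustration of the idea rather than a description of what SPSS does internally; the function names and the example values are made up.

```python
def step_up(single_icc, k):
    """Reliability of the mean of k raters, given single-rater reliability."""
    return k * single_icc / (1 + (k - 1) * single_icc)

def step_down(average_icc, k):
    """Single-rater reliability, given the reliability of the mean of k raters."""
    return average_icc / (k - (k - 1) * average_icc)

# Hypothetical example: two raters whose mean rating has reliability 0.80.
k = 2
average_measures = 0.80
single_measures = step_down(average_measures, k)
print(round(single_measures, 3))

# Going back up recovers the value we started from.
print(round(step_up(single_measures, k), 3))
```

The same `step_up` function can be used to ask how many raters would be needed to reach a target reliability, by trying larger values of `k`.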

Hi Dr. Landers

Greetings

I am currently analysing the relationship between employee performance (job stress) and customer evaluation of service encounter (service quality). I have collected information from employees of 10 different branches (no. of employees range 5 to 25 from each branch) and from customers of these 10 branches (no. of customers range from 10-60 from each branch). I am trying to provide the justification for aggregating these two datasets to understand the impact of job stress on customer-perceived quality. Since customers interact with employees at the branch level, I wanted to justify aggregation at the branch level. However, I am not sure how to test the interrater agreement and reliabilities. How do we justify aggregation at the branch level? Thanks!

In the terminology of what I wrote above, you can look at ICC by considering each employee to be a “rater”, and each branch to be a target/case – multiple raters (employees) for each case (branch). But the implications and process of aggregation are a bit outside the scope of this particular discussion, primarily because the way you aggregate has theory-building implications. I’ve recommended a few sources elsewhere in this thread that might be helpful. If those don’t help, you’d probably be best off adding a collaborator on your project.

Dear Dr Landers,

Thank you for your very informative article. I am conducting a reliability and validity study and have a few questions that I could not draw definitive answers to from the previous posts.

Reliability

I have 1 measure (leg angle) that is collected on 20 children. This measure is collected by one assessor on two occasions and another assessor on a single occasion. The assessors are only a sample from a population of assessors. From this data I want to calculate intra and inter rater reliability of assessors in performing the measure. My current method is as follows (based on my interpretation from the above post) for both intra and inter rater reliability:

Two-way random with absolute agreement and I interpret the Average measures ICC value.

Validity

I also have a ‘gold standard’ measurement of leg angle from MRI of the actual bones. How would you suggest I assess validity? Would I assess validity against a single assessor from one occasion or would I try to include all 3 assessments (i.e. 2 from one assessor and 1 from a further assessor)?

Best regards,

Chris

You could do that. However, the intra-assessor number may not represent what you think it represents. If you look at two occasions for one assessor to assess intra-rater reliability, your estimate will only generalize to that one assessor. It _may_ generalize to other assessors, but you have no way to know that given that analysis. The only way to do that would be to have two or more assessors both rate over time.

For inter-assessor reliability, you would use what you’re saying (although it assumes your two assessors are a random sample of an assessor population).

For validity, since you have a measure without error (or nearly), you can assess validity in a very practical way – mean deviation across assessors from that error-less measurement, and the standard deviation of that deviation.
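As a rough sketch of that validity check, the snippet below takes each assessment's deviation from the gold standard and summarizes it with a mean and a standard deviation. All of the numbers and names here are hypothetical (five children, two assessors); they are only meant to show the arithmetic.

```python
from statistics import mean, stdev

# Hypothetical leg angles in degrees for five children.
mri = [12.0, 15.5, 9.0, 20.0, 17.5]             # gold standard
assessments = {
    "assessor_a": [13.0, 15.0, 10.5, 19.0, 18.0],
    "assessor_b": [12.5, 16.5, 9.5, 21.0, 17.0],
}

def deviation_summary(measured, gold):
    """Mean and standard deviation of (measured - gold) across cases."""
    deviations = [x - g for x, g in zip(measured, gold)]
    return mean(deviations), stdev(deviations)

for name, measured in assessments.items():
    bias, spread = deviation_summary(measured, mri)
    print(f"{name}: mean deviation = {bias:.2f}, SD = {spread:.2f}")

# Pooling every assessment gives the overall figures across assessors.
pooled = [x - g for measured in assessments.values() for x, g in zip(measured, mri)]
print(f"pooled: mean deviation = {mean(pooled):.2f}, SD = {stdev(pooled):.2f}")
```

A mean deviation far from zero suggests systematic over- or under-measurement, while a large standard deviation suggests inconsistent error from case to case.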

Hi Dr. Landers,

Thank you for this informative article. Please excuse my limited statistical knowledge as I still have some questions about our project. We have a group of 4 assessors conducting balance tests with study participants and we would like to double-check our inter-rater reliability. We only planned to have more than 1 rater per assessment at the beginning of the study. Once we have established that we have relatively good inter-rater reliability, each assessor will then work on his/her own (as we can’t afford to have more than 1 assessor per session). Due to our schedule conflicts, for our 10 completed sessions, we don’t always have all 4 of us there (only one has all 4). However, each session had at least 2 raters (the combination is different each time). I believe that we should do one-way random (and that we have a “population” rather than a “sample” – is this right? I am still quite confused between the 2 terms). However, when I ran the analysis in SPSS, it couldn’t generate any output as I only have one valid case (i.e., with all 4 ratings). How should I get around this?

One more question: we are administering 6 different tests and each has various subtests ranging from 3 to 13. I understand that the ICCs should only be calculated with the “transformed” scores, and each subtest actually has its own transformed score. Is it necessary to calculate ICC for each subtest? Or, is it sufficient to calculate ICC only for the total score for each test?

Sorry for the lengthy post. Thank you for reading!

What you’re describing is a limitation of SPSS, not of ICC. The only way around it is to either 1) calculate ICC in a program that doesn’t have this limitation, such as HLM or R or 2) to calculate an ANOVA in SPSS and then calculate an ICC by hand using the output, based on the formulas provided by Shrout & Fleiss.
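For the by-hand route, here is a minimal sketch of a one-way ICC when sessions have different numbers of raters. One important assumption: Shrout & Fleiss give the formula for a fixed number of raters k, so this sketch swaps in the usual unbalanced one-way ANOVA adjustment, an effective number of raters `k0`, which reduces to k when every session has the same number of raters. The function name and the example ratings are hypothetical.

```python
def one_way_icc_unbalanced(ratings_by_target):
    """Single-measure one-way ICC when targets have unequal numbers of raters.

    ratings_by_target: list of lists; each inner list holds the ratings one
    target received. Targets with no ratings are ignored.
    """
    groups = [g for g in ratings_by_target if g]
    n = len(groups)                          # number of rated targets
    total = sum(len(g) for g in groups)      # total number of ratings
    grand_mean = sum(sum(g) for g in groups) / total
    means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (mu - grand_mean) ** 2 for g, mu in zip(groups, means))
    ss_within = sum((x - mu) ** 2 for g, mu in zip(groups, means) for x in g)
    bms = ss_between / (n - 1)
    wms = ss_within / (total - n)

    # Assumption: effective raters per target for an unbalanced design.
    k0 = (total - sum(len(g) ** 2 for g in groups) / total) / (n - 1)
    return (bms - wms) / (bms + (k0 - 1) * wms)

# Hypothetical balance scores: one session with 4 raters, the rest with 2 or 3.
sessions = [[7, 8, 7, 8], [4, 5], [6, 6, 7], [3, 4], [9, 8, 9]]
print(round(one_way_icc_unbalanced(sessions), 3))
```

Every session with at least two ratings contributes to the within-session mean square, so cases are not dropped just because some raters were absent.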

For your six different tests, it depends on what you want to know. If you want to know consistency by subtest, you’d need an ICC for each subtest. If you aren’t using subtest scores for anything (e.g., if you’re only using the overall test score in later analyses or for decision-making purposes), you don’t need them. You only need ICC for the means you’re going to end up using.

Dear Dr. Landers,

Thank you for your very prompt reply – much appreciated! I just realized that half of the tests are ordinal data, so I guess I am not really supposed to use ICC. Should I do kappa instead? For the crosstab analyses, I can only seem to compare 2 raters at once. Is this another limitation of SPSS?

Is there another resource that you can direct me to, so I can calculate the ICC by hand using ANOVA output in SPSS? Thank you very much!

Yes, kappa would be more appropriate for ordinal data. For kappa, there are actually several different types. Cohen’s kappa, which is what SPSS calculates, is for 2 raters only. You’ll want Fleiss’ kappa, which I believe you need to do by hand. More info here: Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378–382.
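If doing it by hand sounds tedious, the calculation in Fleiss (1971) is short enough to script. The sketch below assumes every subject was rated by the same number of raters and that the categories are treated as nominal; the function name and the small example table are hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa from a subjects-by-categories table of counts.

    counts[i][j] = number of raters who assigned subject i to category j.
    Assumes every subject was rated by the same number of raters (>= 2).
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every subject needs the same number of ratings")
    n_categories = len(counts[0])

    # Proportion of all assignments that fell into each category.
    p_j = [sum(row[j] for row in counts) / (n_subjects * n_raters)
           for j in range(n_categories)]
    # Extent of agreement among raters for each subject.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]

    p_bar = sum(p_i) / n_subjects      # mean observed agreement
    p_e = sum(p * p for p in p_j)      # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical example: 4 subjects, 3 raters, 3 categories.
table = [[3, 0, 0], [0, 3, 0], [1, 2, 0], [0, 1, 2]]
print(round(fleiss_kappa(table), 3))
```

If all ratings fall into a single category, chance agreement is 1 and kappa is undefined, so the function would divide by zero in that case.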

For calculating ICC by hand, all of the formulas are contained in Shrout & Fleiss, which is referenced in the footnote of the article above.

Thank you for taking your time to answer my questions!

Thank you Richard for your nice answers. I have one question. I want to aggregate individual-level data to the organizational level. What statistics can I use to justify that such aggregations can represent an organizational-level construct? Thank you.

That is an enormously complicated question and has a significant theoretical component – in short, it depends on how you conceptualize “organizational level.” I’d recommend Johnson, Rosen & Chang (2011) in Journal of Business and Psychology as a starting point.

Hi Dr. Landers,

Thank you for this article, it’s very clear and now I know that ICCs aren’t what I need. But I’m hoping you can help me figure out what I do need! I have 2 raters each rating about 200 grant proposals on the types of strategies they are using to increase student retention. They had 19 possible strategies to choose from, and could apply as many strategies as they want for any given proposal (so, for example, Coder A may give proposal 1 codes 7, 8 & 4, whereas Coder B may give the same proposal the codes of 4, 8, and 1). These strategies are independent (i.e., not part of a scale) and nominal. Any ideas regarding how I would assess agreement for this type of scenario? I’ve figured I can dichotomize each strategy separately and calculate agreement between the two raters using Cohen’s kappa for each strategy individually (giving me 19 separate kappas, one for each strategy). But I was wondering: Is there a way to get agreement across all 19 strategies? Can you think of a better methods I should be using for this data?

If you have two coders consistently, determining inter-rater reliability for each of the 19 codes with Cohen’s kappa (matching “identified the code” or “did not identify the code”) is the right approach. I don’t know what you mean by “agreement across all 19” – you stated that the strategies were not part of a scale and were independent, and thus the “overall” agreement would not be a meaningful statistic. If you just want to know “on average, what is the reliability”, I would calculate just that – the mean Cohen’s kappa across your 19 scales. Reliabilities are proportions/percentages, so you can just calculate the “average reliability” if that’s what you want to know. But assuming you’re writing this up for publication somewhere (or an internal report, or whatever), you should still report all 19, since each code has its own validity, and differences in reliability/validity across codes is a meaningful concept given what you’ve described (if there is disagreement on one strategy but agreement on 18, this would be washed out in an average yet would be something important to know).
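The per-strategy approach can be sketched as follows: dichotomize each strategy into applied or not applied for each coder, compute Cohen's kappa for that strategy, and then average if a single summary number is wanted. The coded proposals below are invented for illustration, and only five of the 19 strategies are shown.

```python
def cohen_kappa_binary(a, b):
    """Cohen's kappa for two equal-length lists of 0/1 codes."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1:
        return float("nan")  # neither coder varied, so kappa is undefined
    return (observed - expected) / (1 - expected)

# Hypothetical data: the set of strategy codes each coder gave each proposal.
coder_a = [{7, 8, 4}, {1}, {2, 4}, {8}, {1, 2}]
coder_b = [{4, 8, 1}, {1}, {2}, {8, 4}, {1, 2}]
strategies = [1, 2, 4, 7, 8]

kappas = {}
for s in strategies:
    a = [int(s in codes) for codes in coder_a]   # 1 = coder A applied strategy s
    b = [int(s in codes) for codes in coder_b]
    kappas[s] = cohen_kappa_binary(a, b)

defined = [k for k in kappas.values() if k == k]  # drop NaN entries
mean_kappa = sum(defined) / len(defined)
for s, k in kappas.items():
    print(f"strategy {s}: kappa = {k:.3f}")
print(f"mean kappa = {mean_kappa:.3f}")
```

Printing every kappa alongside the mean keeps a strategy with poor agreement visible instead of letting the average hide it.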

That makes sense, thank you so much Dr. Landers!

Dear Dr. Landers,

This is a fabulous explanation of ICC and SPSS. Thank you so much! It’s especially helpful to me since my field, Library Science, does not use statistics & math as heavily as other fields. I have a question about the difference between “single measures” and “average measures” that appear in the output SPSS provides.

I’m currently conducting a rubric assessment to evaluate undergrad senior theses. Three librarians (2 colleagues and myself) scored an initial sample of 20 senior theses on 9 criteria measuring research and citation quality. The 9 rubric criteria were rated on a scale from 1 to 4. All three of us rated the same 20 projects on all 9 criteria, so I had SPSS calculate ICC for Two-Way Random, Absolute Agreement. I calculated ICC for each of the 9 criteria, but reading over your advice to Sofia, I also ran ICC on the total scores and average scores (i.e. I created a sum total score and an average score for each senior project, for each rater).

My Question: Would I want to look at Average Measures or Single Measures? In our case, the average measures in SPSS look much more favorable (higher intraclass correlation) than the single measures.

Our hope with our initial testing of the rubric is to show an acceptable degree of inter-rater reliability (ICC of at least 0.7) so that we can move on to individually scoring a larger sample of senior theses (much less time-consuming!). Thanks again for any light you can shed on this.

Darcy Gervasio,

Reference Librarian, Purchase College Library

If you’re interested in individually scoring the theses, you’re talking about the reliability of a single scorer (i.e., a single measure). If it makes you feel any better, the reliability of individual experts reviewing written student work (e.g., for course grades) is generally pretty terrible. That you’re paying attention to reliability at all puts you far ahead of most people doing this sort of thing.

The easiest way to develop a rubric with high reliability is to place objective criteria on it (e.g., number of citations). The more subjective it gets, the poorer your reliability is going to be.

If that’s not an option, unreliability in your case is going to stem from different mental models among the three of you about what each dimension of the rubric represents. You can also increase reliability by getting a unified mental model for each dimension among the three of you – the best way to do that is to do what’s called frame-of-reference training – identify the five or so papers where you had the greatest degree of disagreement, and then meet to talk about why your scores were different. In that discussion, try to come to a shared understanding/agreement of what score is the “right” score for each of those papers on each dimension. You should then find that your reliability goes up in a second sample of grading if you all commit to that mental model. And you might get it high enough to push that single measures reliability estimate up to a level you are comfortable with.

For the record, for “practical decisions” like grades, the reliability level recommended by Nunnally is 0.8 to 0.9. In practice, that’s obviously quite rare.

Thanks so much for your swift & helpful response! This clarifies a lot and reassures me about the “squishy” nature of grading and reliability. We have conducted one round of frame-of-referencing already on a smaller sample, but I will definitely try your suggestion of looking at the senior projects in this sample where we had the most disagreement. Unfortunately quantitative measures like raw number of citations don’t really get at the QUALITY of the citations (were they correctly formatted? did the student cite consistently?). It’s always tricky to conduct authentic assessments of written work and research skills since grading can be so subjective. Hopefully we can work towards unifying our mental model!

Even with subjective measures, you can usually get better reliability through one of two approaches.

One, you can often find room for a little more objectivity within a subjective measure. For example, correctly formatted citations as a single dimension seems highly objective to me – it only becomes subjective when you combine it with other less objective aspects of overall citation quality. A “consistent citation” dimension can be anchored – what does “highly consistent” mean and what’s an example? What does “moderately consistent” mean and what’s an example?

Two, ensure that your raters aren’t making their own priorities within dimensions. For example, if you have an overall “Citations” dimension, Rater 1 may think formatting is more important than consistency whereas Rater 2 may think the opposite. Even if they have the same judgments about the underlying aspects of “Citations”, that can result in disagreement regarding final scores. Chopping your dimensions up into finer pieces can help reduce that problem. But of course, it also adds more dimensions you’ll need to code!

Hi Dr Landers,

Firstly I’d like to say thanks for writing this, it is certainly more comprehensible than some other pieces I’ve read; however, I still have some residual confusion and wondered if you’d be kind enough to help please?

I’m trying to gauge reliability in decisions to include/exclude abstracts between 2 raters for a number of papers. The same raters rated all of the abstracts and data has been entered as either 1 (include) or 0 (exclude) into SPSS.

I believe I require ICC(2) but was unsure which specificity I require – ICC (2,2)? Therefore, analyse as: model as ‘two-way random’ and type as ‘absolute agreement’. Then read ‘single measures’ line.

Many thanks in advance!

Even if you can represent your outcome as 0 and 1, your outcome must be ratio-level measurement (i.e., can be conceptualized as 0% or 100% of something) in order to use ICC, and even in that case, ICC will be biased low because of the lack of data in the middle of the scale (e.g., no 50%). So I would not use ICC in this case, since your scale is not ratio. A better choice is Cohen’s kappa.

Thank you so much for your swift response!

To be honest, I was somewhat confused why I had been asked to do an ICC (by a supervisor) but assumed my confusion was because of my poor understanding. I feel far more comfortable with Cohen’s kappa. I really appreciate you taking the time to explain why.

Dear Richard

I enjoyed reading your content on the ICC, thanks a lot. However, when I investigate the test-retest reliability of a test that measures “Right Ear Advantage” (a neuropsychological variable) and the rater is the same in both testing sessions, I should use the two-way mixed model. Is that right?

If you have only one rater and always the same rater, I would actually just use a Pearson’s correlation. ICC assumes that your raters are a random sample of a population of raters; if you only have one rater making two ratings, you don’t meet those assumptions.

Greetings Dr. Landers,

I have read several of the questions in your assistance section and you likely have answered mine in the 130 plus pages, so my apologies for a redundant question. I created a global assessment measure for schools. This is a 1-100 scale that is broken into 10 deciles. Each decile includes terms to denote levels of student functioning in school. I contacted school personnel across the state and asked them to use the measure to score five vignettes. 64 of the school personnel met my criteria and rated EACH of the five vignettes. Thanks to your explanation (and with hope that I understand it correctly), I am using a (2,2) as I consider this a sample of the larger population and look for consistency as I am coding for research. Across the top of my SPSS page, I have the 64 columns representing each rater, and my five rows representing the five student vignettes they scored. I ran the analysis and got a crazy high ICC (> .98), so I just wanted to make sure that this is set up correctly. Many thanks.

If you’re looking at average measures, that is actually ICC(2,64), not ICC(2,2). You have a high reliability because you essentially have a 64-item measure, which is a very high number of raters. If you always plan to have 64 people rate every vignette, or if you’re using that mean score as some sort of “standard” for a later purpose, then that’s the correct number. However, if you are in fact looking to use those numbers as standards later (e.g., if you’re planning to use the means you get from the 64 raters as an “official” number representing each vignette), just remember that high reliability 1) does not imply unidimensionality and 2) does not imply construct validity.

But Pearson’s correlation is not sensitive to systematic error. In my study the rater has no effect because the patients’ responses are binary (Correct recall/Not correct recall). In test-retest reliability using ICC, the rater has been replaced with the trial. What is your opinion?

Ahh, so that means you are not assessing inter-rater reliability – you are trying to assess inter-trial consistency. In that case, I am not sure which systematic error you’re concerned with – if you are concerned with error associated with each trial (your “raters”, in this case), you could use ICC(2). This would be most appropriate if the trials were different in some consistent way. If the trials are essentially random (e.g., if the two trials don’t differ in any particular recognizable way and are essentially random with regards to a population of trials), you should use ICC(1). However, you should remember that both of these assume one trial has no effect on the other trial; if they do, you’d still want to use Pearson’s.

The use of ICC further depends upon how you’re going to use the data. If you want to use average measures (e.g., ICC[1,2] or ICC[2,2]), your final rating will take one of three values: 100%, 50%, or 0% (the average number of successes across trials). If you need to use single measures so that all final scores are either “correct” or “not correct” you’d need ICC(1,1).

You might also consider use of binary comparison, which might be more appropriate given your data structure, like Phi coefficients or even chi-square.
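
As a rough illustration of the binary comparison idea, here is a small Python sketch of a phi coefficient between two trials. The recall data are invented for the example, and `phi_coefficient` is a hypothetical helper name; for two 0/1 variables, phi is the same number you would get from a Pearson correlation.

```python
import math

def phi_coefficient(trial1, trial2):
    """Phi coefficient for two paired binary (0/1) variables."""
    pairs = list(zip(trial1, trial2))
    a = sum(1 for x, y in pairs if x == 1 and y == 1)  # correct on both trials
    b = sum(1 for x, y in pairs if x == 1 and y == 0)  # correct on trial 1 only
    c = sum(1 for x, y in pairs if x == 0 and y == 1)  # correct on trial 2 only
    d = sum(1 for x, y in pairs if x == 0 and y == 0)  # incorrect on both trials
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        return float("nan")  # one trial has no variance; phi is undefined
    return (a * d - b * c) / denom

# Hypothetical data: 1 = correct recall, 0 = not correct, for 10 patients.
trial1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
trial2 = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]

phi = phi_coefficient(trial1, trial2)
print(f"phi = {phi:.2f}")
```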

Thanks for your response Dr. Landers,

I actually meant to report this as a single measure (I think). I want to be able to at some point have individuals use the instrument to assess baseline behavior and later use as a progress monitoring tool. Regarding the “crazy good” comment, I didn’t mean to infer that the measure is superior, I simply meant I didn’t expect to get such a high correlation and assumed I did something wrong. Many thanks. You are helping many of us and it is appreciated.

Yes, that is a correct application of single measures. I’m still not quite clear on your design, but if you got .98 for single measures, that is worryingly high – I would suspect inadequate variance in your sample of rating targets. 5 is not very many to believe you have a completely random selection of vignettes of all possible vignettes. You also want to ensure you have adequate variance in those scales – means should be toward the middle, and 2 SDs out on each vignette should still be well within the scale range. If you have floor or ceiling effects, that will also bias ICC.

So this may be the result of a poor design. I asked for multiple teachers to rate each of the 5 vignettes. I struggled to get teachers to complete the survey online, and to increase the number of variables would have likely resulted in even fewer participants. So would I have been better off to have FEWER teachers rate MORE vignettes? How is sample size calculated for a study like mine? Again thanks.

Yes, that would have been better. It is a balancing act though – ideally you want a large sample of rating targets and as many raters as you can get. You can get an estimate of how many raters you actually needed with the data you have – you would use the Spearman-Brown prophecy formula on ICC(2,1) to calculate a hypothetical ICC(2,k) in order to determine how many raters you would have needed for a stable estimate of ICC(2,1) – for example, with reliability = .8.

Conceptually, it’s easiest to think of raters as items, e.g., 5 people making a rating of one target is like a 5-item test. So you could use Spearman-Brown to determine how many people (items) you needed to get a target reliability across the whole scale, which is ICC(2,k) – you are essentially solving for k with a target reliability.
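
Here is a minimal Python sketch of that "solve for k" step, using the standard Spearman-Brown formula. The function names and the example single-rater value of .5 are assumptions for illustration only; plug in your own ICC(2,1).

```python
import math

def spearman_brown(single_rater_icc, k):
    """Projected reliability of the mean of k raters, given single-rater reliability."""
    return k * single_rater_icc / (1 + (k - 1) * single_rater_icc)

def raters_needed(single_rater_icc, target):
    """Smallest whole number of raters whose mean reaches the target reliability."""
    k = target * (1 - single_rater_icc) / (single_rater_icc * (1 - target))
    # Small tolerance so floating-point noise does not round an exact answer up.
    return math.ceil(k - 1e-9)

# Hypothetical example: single-rater reliability of .5, target reliability of .8.
k = raters_needed(0.5, 0.8)
print(k)                                   # 4
print(round(spearman_brown(0.5, k), 2))    # 0.8
```

The same `spearman_brown` function also shows why 64 raters produce such a high average-measures value: even a modest single-rater reliability gets stepped up enormously when k is that large.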

Your sample size is 5. There’s no calculation involved.

Dear Dr. Landers,

many thanks for your article, it is very clear and helpful!

I still have a question and I hope you can help me to understand better how to proceed.

I am developing a questionnaire to assess the awareness of a certain behavior. It will be probably a 20 item questionnaire (more or less) with a 5 point scale answer.

Each person will rate themselves by filling in the questionnaire, and for each person, two people who know him or her very well will also complete the questionnaire. I am interested in looking at the inter-rater reliability, so I was thinking of using an ICC (three raters: self-rating, rater 1 and rater 2).

How can I determine the sample size? What is the best number of ratees (self-raters)?

I was thinking of using a sample of 60 people but I need a clear rationale to do that.

Many thanks,

Mia

That sounds like a situation for ICC. For what sample size you’ll need, it depends on what you want to do with it. If you’re taking the mean and using it in some other stats, you need a power analysis, probably using a program like G*Power: http://www.gpower.hhu.de/

Dear Richard,

could you please elaborate on your comment on power analysis a little further? Do you mean that this kind of analysis can tell when it’s safe to use the mean and related parametric tests from ordinal values (assessments) because it can tell us if the effect is such that also non conservative techniques can be used, even if main assumptions are not met?

Thank you for your support and continuous help.

No, you’re referring to assumption checking.

Power analysis refers to procedures used to determine the appropriate sample size needed to reject a null hypothesis given a particular expected effect size and analytic approach. It is not something I can easily explain in a blog comment. I would suggest you start here: http://www.statsoft.com/Textbook/Power-Analysis

Sure Richard. I just wanted to understand what you were suggesting the power analysis for, and your prompt reply already answers my doubt. As you possess a rare talent for clear explanation, I hope in the future you’ll treat more topics in statistics here (hypothesis testing, power analysis,…) as your blog would soon become a classic in popular (statistical) science… Best

Dr. Landers,

I am conducting something somewhat like a psychophysics experiment. I have numerous participants (50), who will provide ratings of 10 different variables, but they will do so 4 times each. Can ICC be used for INTRArater reliability? I understand its most common use is for interrater reliability but I am not exactly sure of which method to use for measuring how accurate each rater is across the four times they will rate the 10 variables.

Thanks very much for your help.

Sure. Just flip the matrix to whatever configuration you need. Just keep in mind what you’re assuming.

In ICC, you always have three sources of variance: case variance, which is expected to be random, systematic rater variance, which you only assess with ICC(2) and ICC(3), and unsystematic rater variance.

If you flip the matrix to examine intra-rater variance (cases as raters, raters as cases), you are changing this around a bit:

rater variance, which is expected to be random, systematic case variance, which you only assess with ICC(2) and ICC(3), and unsystematic case variance.

Note that this will not tell you how consistent individual raters are – instead, you’re only assessing intrarater reliability.

If what you’re really trying to do is identify “bad” raters, there is no set procedure to do that. But what I usually do is use the reliability analysis tool in SPSS with the “variance if item deleted” function to see if alpha reliability would increase dramatically if a particular rater was deleted – if that effect is consistent across rating targets, then that’s pretty good evidence that there’s something wrong with the rater. You can also look at rating distributions – sometimes for example you can see normal distributions for most raters but then severe skew for one.
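
Outside of SPSS, the "alpha if a rater is deleted" check can be sketched in a few lines of Python. This is only an illustration with invented ratings (four hypothetical raters scoring six targets), not a reproduction of the SPSS output; each rater is treated as an "item" in the usual Cronbach's alpha formula.

```python
from statistics import variance

def cronbach_alpha(columns):
    """Cronbach's alpha, treating each rater (one column of scores) as an item."""
    k = len(columns)
    totals = [sum(scores) for scores in zip(*columns)]  # summed score per target
    item_variance = sum(variance(col) for col in columns)
    return k / (k - 1) * (1 - item_variance / variance(totals))

# Hypothetical ratings: each list holds one rater's scores for the same 6 targets.
raters = {
    "rater1": [1, 2, 3, 4, 5, 6],
    "rater2": [2, 2, 3, 5, 5, 6],
    "rater3": [1, 3, 3, 4, 6, 6],
    "rater4": [4, 3, 5, 2, 3, 3],  # deliberately out of step with the others
}

alpha_all = cronbach_alpha(list(raters.values()))
alpha_if_deleted = {
    name: cronbach_alpha([col for other, col in raters.items() if other != name])
    for name in raters
}
# The rater whose removal raises alpha the most is the one to look at first.
suspect = max(alpha_if_deleted, key=alpha_if_deleted.get)

print(f"alpha with all raters: {alpha_all:.2f}")
for name, value in alpha_if_deleted.items():
    print(f"alpha without {name}: {value:.2f}")
print(f"largest improvement when dropping: {suspect}")
```

A jump like this for one rater is only a flag, not proof; you would still want to see the same pattern across rating dimensions before concluding anything about that rater.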

Dear Dr. Landers,

many thanks for your help!

I will go through Gpower as you suggested.

Many thanks!!

Thank you sir, this is really of great help for me as I was unable to find any source to know how to calculate ICC(1), ICC(2) but now I understand how to calculate it. My research work deals with groups in public sector undertakings. I will be grateful to you if you also tell me how to calculate rwg(J) to justify the aggregation of individual scores, or provide any link related to the process of calculating rwg(J).

I would suggest:

LeBreton, J. M. & Senter, J. L. (2008). Answers to 20 questions about interrater reliability and interrater agreement. Organizational Research Methods, 11, 815-852.

Also keep in mind that in the aggregation literature, ICC(1) refers to Shrout & Fleiss’ ICC(1,1) whereas ICC(2) refers to ICC(1,k).

Could you maybe tell me whether ICC is the appropriate method in the following situation, since I’m not entirely sure.

20 respondents of a questionnaire have been asked to evaluate the importance of a certain matter, on a scale from 0 to 10. 10 respondents belong to group A and the other 10 belong to group B. I want to determine how well respondents in group A agree on the importance of this matter in comparison to group B.

Thank you for your help!

That sounds like an independent-samples t-test to me.

Dear Dr. Landers,

Let me please give some additional information to my previous question. Both Group A and Group B consist of 10 experts, which are samples of the total population. When we add the experts from both groups together, we are consequently speaking of 20 experts in total.

Group A is asked to evaluate the importance of 30 factors/issues on a scale from 0 to 10.

Group B is also asked to evaluate the importance of the same 30 factors/issues.

Now I want to compare these groups with each other. Using Intraclass correlation I want to apply the “Average measures value” to indicate the degree of agreement among the rater groups A and B.

Could you tell me whether ICC is the right test to determine the degree of agreement among groups in the first place? If not, which test can?

In addition I tried to incorporate the above in SPSS. For every factor/issue I run a separate ICC (1,2) test, so 30 tests in total. In the columns (2) I have put the rater groups: Rater group A and Rater group B. The rows (12) represent the expert measures.

Could you maybe explain to me whether this is right?

Thanks in advance!

I’m not clear if your 30 factors are dimensions/scales or cases. If they are cases, I think this is still a t-test problem. If they are dimensions/scales, I’m not clear on what your cases are.

If you are treating them as cases, your 30 issues must be a random sample of all possible issues. If that’s not true, they are scales.

If they are scales, you need to have replication across scales, e.g., each expert in each group rates 20 cases on all 30 issues.

The design approach you are taking determines the analytic approach you would take moving forward.

Dear Dr. Landers

Thank you for this impressive website and your indefatigable work in teaching us statistics. I have a study where we measure lipids in two successive blood samples from a series of 20 patients. What I want to see is how reliable the measurements are for each lipid (some will be good, some not is my guess), i.e. do we get the same value in the second measurement as in the first. They are all analysed at the same time on the same machine. I know that the lipid levels are continuous measures, but they are not normally distributed (positive skew). Is the ICC(1) test appropriate? If so, should I report the single measures or average measures? Many thanks in advance.

ICC is based on ANOVA, so if assumptions of normality (or anything else from ANOVA) don’t hold, you can’t use ICC. There are a couple of possible approaches here. If you think that the data are normal with a positive skew, the easiest approach would be to use transformations to bring it back toward normality – most commonly recommended is a Box-Cox procedure. If your data are not in fact skewed positive and are actually a distribution that simply appears skewed – like a Poisson – then you shouldn’t do that. In that case, you might consider using a rank-order correlation, like Spearman’s. The downside to that approach is that Spearman’s assumes your measure ordering is meaningful – that the first measurement and second measurement are paired – so that could bias your estimate – and it also only gives you the reliability of a single measure – the equivalent of ICC(2,1) – which might not be what you want.

Beyond that, I’m out of ideas.

Many thanks for the speedy reply – much appreciated. I can get the data to pass the D’Agostino and Pearson omnibus normality test when I use log10 values, so I guess I can use those, but I’ll have a crack at Box-Cox since you recommend that. That gets me back to the question about which ICC to use. I had a go and I seem to get the same result in SPSS regardless of the model – is that simply because I have only two measures per patient, rather than a single bunch?

If by models you’re talking about ICC(1) vs ICC(2)/ICC(3), you’ll see little difference between them when you don’t have consistent rater effects – that is essentially what ICC(2)/(3) “correct” for versus ICC(1). Since you’re just doing two random samples (there are no consistent raters), you would not expect a rater effect. If there is no anticipated rater effect (as you’d expect when ICC(1) is appropriate), ICC(2)’s calculation can take advantage of chance rater variation, which would bias ICC(2) upward. That is why you should stick with ICC(1), even if the values end up being similar.

All the Box-Cox procedure does is allow you to see the effect of various lambdas (transformations expressed as data to the lambda power) to see which one approximates normality most closely. If you’ve already found another transformation that minimizes skew, it’s probably not going to make things much better.
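
To show what "trying various lambdas" looks like, here is a small Python sketch. Two caveats: the data are invented (chosen so that a log transform makes them exactly symmetric), and a real Box-Cox procedure picks lambda by maximum likelihood, whereas this sketch simply uses sample skewness as a rough stand-in for "closest to normal".

```python
import math

def box_cox(values, lam):
    """Box-Cox transform of positive values; lambda = 0 is the natural log."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

def skewness(values):
    """Sample skewness (third standardized moment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical positively skewed measurements: exp(-2), exp(-1.5), ..., exp(2).
measurements = [math.exp(v / 2) for v in range(-4, 5)]

# Skewness after each candidate transformation (lambda = 1 leaves the shape as-is).
skew_by_lambda = {
    lam: skewness(box_cox(measurements, lam)) for lam in (-1, -0.5, 0, 0.5, 1)
}
best_lambda = min(skew_by_lambda, key=lambda lam: abs(skew_by_lambda[lam]))

for lam, skew in skew_by_lambda.items():
    print(f"lambda = {lam:>4}: skew = {skew:+.2f}")
print(f"least skewed at lambda = {best_lambda}")
```

With these made-up data the log transform (lambda = 0) wins, which mirrors the situation above where a log10 transform already removed the skew.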

Dear Dr Landers,

On your above comment you say that ICC is based on ANOVA and can therefore only be used on parametric data. I have been struggling to find a quote to this effect so was wondering if you might know of any references I can use to support this?

Many thanks

Catherine

I don’t know where you might find a “quote” to that effect, but Shrout and Fleiss talk to a great extent about the derivation of ICC from ANOVA. You might find a specific quote in there somewhere that meets your needs.

Hello Dr Landers,

I really enjoyed reading all the discussion about the best ways to use ICCs.

In my study I’m comparing the reliability of measures across three different instruments, and trying to infer recommendations on the most reliable one to be used in the population of interest.

The ICCs were all greater than .75, and the 95%CI all overlapped. However, the ICC point estimate of one instrument was not always contained within the 95%CI of the other.

Question:

Is it correct to infer that one ICC was significantly higher/lower than the other if the point estimate fell outside of the 95%CI, even though the CIs overlapped?

Any advice would be much appreciated.

Thank you very much,

Gus

To conclude statistical significance definitively, the confidence intervals cannot overlap at all. However, if two intervals do overlap, I believe that the estimates could still be statistically significantly different, although it is unlikely. At least, that is my impression, since that is the way means work (i.e., for independent-samples t-tests). I’m actually not sure what sort of test would be best for the difference between two ICCs (that’s a very uncommon need, so there is not much literature), but I suspect it might be something like the test between two independent Pearson’s correlations. I’m not positive that the sampling distributions are the same though (for Pearson vs ICC), so don’t take that as a definitive recommendation.

Once again, many thanks – you are a great teacher. I will try to be a good student!

Things are moving on… In my samples I have the 18 cases who did two series of sampling on two separate occasions. Series A: sample – rest – sample. Series B: sample – intervention – sample. For series A, I get an ICC value of 0.63 with 95% confidence limits 0.27 – 0.84. For series B, I get an ICC value of -0.061, with 95% confidence limits -0.48 – 0.39. At first sight, it looks like the intervention affects the test-retest reliability, but that is describing “the difference of the significances”, whereas what I need to do is to measure “the significance of the difference”. The two values are for the same cases, randomised to do either A or B (or vice versa). The mean values do not change, and I don’t see an interaction term in a two-way repeated ANOVA, but you wouldn’t expect to if the intervention is changing the reliability of the measure in a random, rather than a directed, manner. Is there a neat way to compare my two ICC values? Many thanks in advance.

Examination of overlapping confidence intervals is the traditional way to determine if two independent point estimates are statistically significantly different. However, I am not sure of the procedure used for ICC, since that’s not a very common need, and I’m not aware of any research literature on it. In your case, to make this more complicated, you have a within-subjects design, so straight confidence interval comparison won’t actually work either, since that assumes the two ICCs are drawn from different samples. So I don’t have a recommendation I’m afraid. Let me know if you figure it out though!

Many thanks for the input. I’ll do my best!

I have collected questionnaire-based data from around 100 firms. From each firm, between 5 and 30 individuals have responded. For study at the firm level, I intend to create a firm-level score from the individual respondents. Which ICC do I need to calculate and report? And how do I calculate ICC and rwg? I am using SPSS 20 for my research. Please help.

Aggregation is more complex than reliability alone, but you generally need both 1) an estimate of how much an individual’s opinion reflects their group, which is ICC(1,1) and 2) an estimate of how much true score variance is captured by group means, which is ICC(1,k). If your groups range from 5 to 30, you’ll want to either calculate ICC manually using ANOVA and hand-computation or a program that does it for you, like HLM. I don’t know how off-hand to calculate rwg since I don’t do aggregation research, but I’ve cited a few articles in other comments that should help.
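
For what the "manual ANOVA" route might look like with unequal group sizes, here is a hedged Python sketch. The firm names and scores are invented, and the unequal-size adjustment (an "average" group size, called `k0` below) is the one commonly used in the aggregation literature; treat it as one reasonable convention rather than the only option.

```python
def one_way_icc(groups):
    """ICC(1,1) and ICC(1,k) from a one-way ANOVA, allowing unequal group sizes."""
    sizes = [len(scores) for scores in groups.values()]
    total_n = sum(sizes)
    n_groups = len(groups)
    grand_mean = sum(sum(scores) for scores in groups.values()) / total_n

    ss_between = sum(
        len(scores) * (sum(scores) / len(scores) - grand_mean) ** 2
        for scores in groups.values()
    )
    ss_within = sum(
        (x - sum(scores) / len(scores)) ** 2
        for scores in groups.values()
        for x in scores
    )
    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (total_n - n_groups)

    # "Average" group size used when groups are not all the same size.
    k0 = (total_n - sum(n * n for n in sizes) / total_n) / (n_groups - 1)

    icc_single = (ms_between - ms_within) / (ms_between + (k0 - 1) * ms_within)
    icc_mean = (ms_between - ms_within) / ms_between
    return icc_single, icc_mean

# Hypothetical data: individual responses nested within three firms.
firms = {
    "firm_a": [4, 5, 5, 4, 5],
    "firm_b": [2, 3, 2, 3, 2, 3],
    "firm_c": [3, 4, 3, 4, 4, 3, 4],
}

icc_1_1, icc_1_k = one_way_icc(firms)
print(f"ICC(1,1) = {icc_1_1:.2f}, ICC(1,k) = {icc_1_k:.2f}")
```

With real data you would have about 100 firms rather than three; the toy example is only there to keep the arithmetic easy to follow.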

Dear Dr. Richard,

Thank you very much for the informative post. This is very useful because you have a real talent for explaining statistics to people who are from a non-statistical background.

I have computed the ‘resilience level’ of 40 people using three different indices [i.e. methods]. In all three indices, resilience level has been computed as an aggregated value of a set of indicators (e.g. age, sex, eye-sight, pulse rate). The indicators used in one method are not exactly the same as those in the others; some indicators are common and some are different. Each index uses a different method of computing the aggregate value [i.e. resilience level]. I wanted to see how similar the results obtained from the three indices are. For that I have calculated Pearson correlations, but that only compares pairs. As I want to compare the relationship among all three, can I use ICC?

To make the question clearer, I want to say how similar/distant the 3 resilience levels (of the same person) computed by the 3 methods are. (I have a sample of 40 people.) Will ICC help for this?

Thanks in advance for any explanation you can give.

Chet

That’s not reliability, although it is a valid use of ICC. If the items are not identically scaled (e.g., all from 1 to 10), you would need the consistency type (not agreement). The problem, however, is that if there was disagreement, you would not be able to say which of your three indices was causing it. A correlation matrix would still tell you more.

Assuming all of your indices are on the same scale, and if all of your indicators are being used in some combination across your indices (i.e., if each index is made of the same indicators but in different combinations), I would think you’d want to use multiple regression to determine the ideal combination empirically based upon your criterion (whatever you’re trying to predict). If you don’t have a criterion, I would probably just use a 1-way ANOVA with post-hoc tests, defining each group of indices as a condition. Depends a bit on your specific RQs though.

Dear Richard,

Thank you very much for this post. I have read a lot around this topic, including the S&F article, but this was the clearest explanation I found. I don’t have a particular model to test, but I am teaching this topic to PhD students and I would like to be entirely clear before doing so, so I have been simulating different analyses and reached a problem I can’t really make sense of. Any suggestions would be most helpful.

In this case I have 3 raters (who, being fictitious, I considered could be either a sample or the population) rating 8 ratees.

My problem is that regardless of using the RANDOM (which I think one should do if looking for ICC 2,1, k) or the MIXED (for ICC 3, 1, k) options, I obtain the same results. In other words, these syntaxes give me the same results.

RELIABILITY

/VARIABLES=judge1 judge2 judge3

/SCALE('ALL VARIABLES') ALL

/MODEL=ALPHA

/ICC=MODEL(RANDOM) TYPE(CONSISTENCY) CIN=95 TESTVAL=0.

RELIABILITY

/VARIABLES=judge1 judge2 judge3

/SCALE('ALL VARIABLES') ALL

/MODEL=ALPHA

/ICC=MODEL(MIXED) TYPE(CONSISTENCY) CIN=95 TESTVAL=0.

When changing the option from CONSISTENCY to ABSOLUTE (agreement) the results are again the same for these two options (though different from the results obtained with the option CONSISTENCY)

RELIABILITY

/VARIABLES=judge1 judge2 judge3

/SCALE('ALL VARIABLES') ALL

/MODEL=ALPHA

/ICC=MODEL(RANDOM) TYPE(ABSOLUTE) CIN=95 TESTVAL=0.

RELIABILITY

/VARIABLES=judge1 judge2 judge3

/SCALE('ALL VARIABLES') ALL

/MODEL=ALPHA

/ICC=MODEL(MIXED) TYPE(ABSOLUTE) CIN=95 TESTVAL=0.

I ran the same dataset in an online calculator, and the results I obtained with the option (ABSOLUTE) are reported there as ICC (2, 1, k), while the results I obtained with the option (CONSISTENCY) are reported as ICC (3, 1, k). http://department.obg.cuhk.edu.hk/researchsupport/IntraClass_correlation.asp

I was confused with this, as I was expecting the results to differ as a consequence of random/mixed, and not just ABSOLUTE/CONSISTENCY. Does this make sense to you?

The other question is (I hope) simpler – what are the cut off points recommended for ICC (2, 1), ICC (2, k) and ICC (3, 1) ICC (3,K)

Thank you so much for your time and advice!

Claudia

Reporting of ICC is quite inconsistent, so I wouldn’t read much into the particular way anyone reports it unless they give the source they’re basing it on – that is why it is still to this day important to cite Shrout & Fleiss if reporting an ICC based upon Shrout & Fleiss. There are actually types of ICC beyond those of Shrout & Fleiss, so this is just one conceptualization – although it is one of the two most common.

Absolute vs. consistency agreement is not typically part of reporting in the Shrout and Fleiss approach, because Shrout and Fleiss were only interested in agreement. Their work was actually later extended by McGraw & Wong (1996) to include consistency estimates. When agreement is high, consistency and agreement versions will nearly agree, although agreement can be a little higher in certain circumstances. But you can make the difference more obvious in simulation if (for example) you add a large constant to all of the scores from one of the raters only. For example, if you’re simulating a 5-point Likert scale across 3 raters, add 20 to all scores from one rater. Your consistency ICC will stay exactly the same, whereas the agreement ICC will drop dramatically. In practice, you’re only going to see differences here when there are large mean differences between raters but similar rank orderings.
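
Here is a minimal Python sketch of that simulation. The ratings are invented, and the formulas are the standard single-measure two-way ANOVA versions of the consistency and absolute agreement ICCs (not a re-implementation of SPSS itself), so treat it as an illustration of the logic rather than a check on any particular software.

```python
def icc_two_way_single(ratings):
    """Single-measure consistency and absolute-agreement ICCs from a two-way ANOVA.

    `ratings` is a list of rows (rating targets), each a list of scores (one per rater).
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between targets
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    consistency = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    agreement = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    return consistency, agreement

# Hypothetical 5-point ratings: 6 targets (rows) by 3 raters (columns).
original = [[1, 2, 1], [2, 2, 3], [3, 3, 3], [4, 5, 4], [5, 4, 5], [3, 4, 3]]
# Same data, but with 20 added to every score from the third rater only.
shifted = [[a, b, c + 20] for a, b, c in original]

cons_before, agree_before = icc_two_way_single(original)
cons_after, agree_after = icc_two_way_single(shifted)

print(f"original: consistency = {cons_before:.3f}, agreement = {agree_before:.3f}")
print(f"shifted:  consistency = {cons_after:.3f}, agreement = {agree_after:.3f}")
```

Adding the constant changes only the between-rater mean square, which appears in the agreement denominator but not in the consistency one; that is the whole reason the two versions diverge.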

The difference between Shrout & Fleiss’ ICC(2) and ICC(3) is more subtle, although disagreement between them becomes large when agreement is poor but consistency is high.

Your question actually triggered me to hunt down the SPSS documentation on ICC, and it looks like they use the McGraw & Wong conceptualization, although many of McGraw & Wong’s versions of ICC are missing from SPSS. I also ran a few simulations myself and found that ICC(2) and ICC(3) in SPSS agree in situations where they do not agree when I analyze the same dataset in R using the ICC function of the “psych” package. So I’m at a bit of a loss here. There’s not enough detail in the SPSS documentation to be able to hunt down the reason. You can see the full extent of this SPSS documentation here: http://www-01.ibm.com/support/knowledgecenter/SSLVMB_22.0.0/com.ibm.spss.statistics.algorithms/alg_reliability_intraclass.htm

You can also see some discussion of how the SPSS calculations were incorrect for several previous versions of SPSS in the documentation for the R package – perhaps they are wrong again:

http://www.personality-project.org/r/psych/help/ICC.html

Dear Richard,

Thank you so much for your prompt and thorough response, and for the links to the SPSS materials; I will check them.

I went back to the data provided as an example by Shrout & Fleiss and ran it again in SPSS to check if I could replicate the results reported in their ICCs table by using the SPSS – ICC calculations.

Regarding the ICC (1,1,k) and the SPSS one-way random model, the results are exactly the same.

I am able to obtain the results that S&F report for ICC (2,1,k) by selecting as type (ABSOLUTE), regardless of the model being random or mixed;

and I am able to reproduce the results that S&F report for ICC (3,1,k) by selecting as type (CONSISTENCY), again regardless of the model being random or mixed.

So it seems to me that the SPSS nomenclature is probably different from that used by S&F.

I am, however, not sure if I can assume that the pattern of results will be consistent for all cases I may wish to test.

In any case, my best conclusion so far is that if wishing to obtain ICC (2,1,k), I should use model: random (in case results change as agreement/consistency change, as you mention), type: agreement; and if wishing to obtain ICC (3,1,k), one should use model: mixed, type: consistency.

Another, and possibly more trustworthy, option is to run the ANOVAs with random/mixed effects according to the type of ICC required and calculate the formulas.

Again, thank you very much for your post and insights on this; it was most helpful.

Claudia

Dear Richard,

That’s a very informative post. I have a query regarding the usage of Cronbach’s alpha for scales with only two items. Is it appropriate to measure internal consistency using Cronbach’s alpha for scales that have two items whose responses are rated on a five-point Likert scale? How does the number of items in a scale affect its Cronbach’s alpha?

thanks in advance

santosh

As long as the standards of your field allow Likert-type scales to be treated as interval or ratio measurement, then sure. Some fields are more picky about that than others. The relationship between scale length and alpha is complicated – I’d suggest you take a look at the following article dedicated to exploring that idea:

Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98-104.

Thanks a lot, Richard. I will look into the source you suggested.

So if I have an ICC of 0.4 with a 95% CI of 0.37-0.44, is this adequate inter-tester reliability?

Your assistance is greatly appreciated!

That depends on the standards of your field and what you’re doing with that number. If you are trying to assess the reliability of a mean that will be used in other analyses, that is pretty bad, generally. Inter-rater reliability is still reliability, so the traditional standards (e.g., Nunnally’s alpha = .7) still apply if you trust such traditions. However, there are many areas with lower standards of quality for measurement.

Hello Dr. Landers,

Do you provide any examples of the results write-up for the ICC? I have a “two-way random effects, absolute agreement, single measures” ICC(2,1) of .877. What exactly does this indicate?

Thanks in advance

I’m not quite sure what you mean by “results write up”, but assuming that’s a Shrout & Fleiss ICC, it means that 87.7% of the observed variance of a single rater is true variance (i.e., reflecting the construct jointly measured by the population of raters the rater is randomly drawn from) when considering mean differences between raters to be meaningful. You would just report it as ICC(2,1) = .877.

Thanks,

How do I cite you and this website (APA)?

It depends on what part you want to cite. I recommend checking this out for guidance: http://blog.apastyle.org/apastyle/2010/11/how-to-cite-something-you-found-on-a-website-in-apa-style.html

Dear Dr. Landers, how many raters (I know at least 2) and how many ratees do I need to have enough power to determine the reliability of a classification scheme? Subjects are assessed for personality and are given a numerical classification of 0-24.

Thanks Andrea

For how many raters, that depends upon what you want to know. If your goal is to produce a reliable mean score, it depends upon how reliable each person’s individual observations are, which depends upon your coding scheme, and what degree of reliability you are targeting. You can use prior literature to help guide that decision, and then the Spearman-Brown prophecy formula to figure out the specific number. For personality, my intuition is that you’d need 5 to 7 raters to reach a reliability of the mean rating of .7, but that will vary by personality trait.
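As a rough illustration of the Spearman-Brown step, the sketch below solves the prophecy formula for the number of raters needed to reach a target reliability of the mean rating. The single-rater reliabilities passed in are hypothetical placeholders; in practice they would come from prior literature or a pilot study, and the function name is my own.

```python
import math

def raters_needed(single_rater_reliability, target_reliability):
    """Smallest number of raters whose mean rating reaches the target,
    from the Spearman-Brown prophecy formula solved for k."""
    r, t = single_rater_reliability, target_reliability
    if not (0 < r < 1 and 0 < t < 1):
        raise ValueError("both reliabilities must be strictly between 0 and 1")
    k = t * (1 - r) / (r * (1 - t))
    # Subtract a tiny tolerance so floating-point noise on an exact
    # integer answer does not round up to one extra rater.
    return max(1, math.ceil(k - 1e-9))

# Hypothetical single-rater reliabilities, all targeting .70 for the mean.
for r in (0.25, 0.30, 0.35):
    print(f"single-rater reliability {r:.2f}: {raters_needed(r, 0.70)} raters")
```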

For how many ratees, that depends upon how wide of a confidence interval you are willing to accept around ICC, and you can calculate a priori power based upon that model. For t-tests, which are similar, the typical rules of thumb more or less apply if you don’t want to bother with power calculations: at least 20 per measure. So if you end up needing 2 raters to get reliability, I’d get at least 40 people. If you need 6, I’d feel pretty safe with 100. More than 250 is probably enough regardless of how many raters you have, unless your single-measures ICC is exceptionally poor. The only way to be sure is to run through the power calcs for your particular situation.

Hi Dr. Landers,

I’m so happy I came across your site! I am a beginner to ICCs and this site has proved to be very valuable.

I have a question that I was hoping you could help me out on. I am involved in a study that is exploring the reliability of a situational judgment task (basically the development of a behavioral observation scale). The test is composed of 14 questions with scores ranging from 1-3 for each item. We had two independent raters code 69 tapes, with an overall average ICC of .804. The only thing I’m worried about is two of those 69 tapes produced negative ICCs (-0.182 and -0.083) and I’m wondering how those should be interpreted.

Through this thread I learned that these scores might be due to 1) a lack of variability (and you suggested the scale be reanchored to increase variability, but I don’t think that is possible at this point), 2) a small sample size and/or 3) near-perfect agreement. But I’m still wondering what to do with those values when writing up the manuscript. Should they just be forced to 0 since it only happens twice? Or does it even need to be mentioned given the low base rate (only 2 tapes)? And if so, how would you recommend describing and reporting those values?

Thank you so much for your help, I really appreciate it!!

I would actually say that a rating scale of 1-3 makes ICC somewhat inappropriate – that sounds like an ordinal/categorical scale to me, in which case I’d use Cohen’s kappa or simple percentage agreement instead. The further you deviate from the assumptions behind ICC (i.e., normally distributed interval/ratio data), the less interpretable ICC will be. Have you checked whether you have nice, clean normal distributions of the 14-question means, or are they a bit lopsided?

Dr. Landers,

Thanks for the wonderful explanation! ICC is very confusing and I really appreciate this post! Nevertheless, I do have a question about my data: I have a repeated measures dataset with data for protocol compliance for 2 dozen participants over 6 observations using two types of criteria (one is the gold standard, the other is the new method of measurement; both are binary variables, compliance 1 or 0). I am interested in seeing how closely the new method agrees with the gold standard, considering the within-subject correlation (the idea being that compliant participants are likely to be compliant in the subsequent observation, and non-compliant participants are likely to be non-compliant later).

I’ve been told that I can use repeated measure kappa statistics, but I am not quite sure how to make the choice.

I’d really appreciate any of your suggestions and thank you so much in advance!!

Cheng

ICC, as considered by Shrout & Fleiss, isn’t appropriate for data captured over time. However, the more general case of ICC could be, although you’d need a program like HLM to model it (i.e., observations nested within time nested within person). But even if you did that, it would be non-optimal given the binary nature of your ratings. You also have a somewhat more difficult case because you have meaningful ordering – in statistical terms, your gold standard rater is not drawn from the same population as your experimental rater, so you know a priori that the assumptions of any reliability analysis you might conduct do not hold. In such cases, I would not normally recommend looking at reliability across approaches, instead suggesting that you look at agreement within each rating approach and then compare means and covariance (possibly even with a Pearson’s/phi approach) with those numbers. But this depends on your precise research questions and may need a theory-driven solution rather than a statistical one. I don’t know enough about repeated measures kappa to know if I should recommend it or not – I’ve actually never heard of it before now.

Dr. Landers,

Thank you for the response! Let me clarify a bit more: each of my participants was examined by both the gold standard and the standard I am testing. Would this piece of information change anything?

That is how I understood it, so no, it doesn’t change anything.

Dr. Landers,

Thanks for confirming! To be honest, I did not know about repeated measures kappa before I took over this project, but I will definitely try the phi approach!

Dr. Landers,

My question is about an unconventional use for ICCs and any advice for working with multilevel (but not aggregation) data. We want to examine the agreement between two measures of sleep (both interval data). Participants complete a sleep diary of the number of hours slept, and wear an actigraph (a wristwatch-like device) that records their movement. Number of hours slept is computed from this movement. We are conceptualizing these as our “raters” of number of hours slept. First question: is this a reasonable extension of ICCs? Our research question is how well do these two measures (sleep diary and actigraph) agree on how many hours of sleep a person is getting?

The second question is a bit more complicated. Each participant completes the diaries and wears the actigraph for several nights. Am I correct in concluding that because we now have nested data (each “rater” has multiple ratings for each participant) we have violated the independence of ratings assumption and using ICCs for the entire dataset would be inappropriate? If so is there any correction for this or any way for ICCs to handle nested data? We don’t want to aggregate or use the data to answer any multilevel questions so I am struggling to find the appropriate analysis. We simply want to know how these two measures agree but we have nested data. Even if we cannot report an overall ICC for the entire dataset would it still be appropriate to report ICCs for each participant individually or would this violate the independence assumption since the measurements would be coming from the same person?

Any advice is appreciated. Thank you.

I wouldn’t recommend ICC in this case because you have meaningful pairings – rater 1 and rater 2 are not drawn from the same theoretical population of raters – instead they represent two distinct populations. Since you only have two you’re comparing, I would probably just use a Pearson’s correlation (to capture covariance) and also a paired-samples t (to explore mean differences). If you’re just throwing the means into another analysis, you don’t even need the mean differences.
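A minimal sketch of that two-part check follows, using only the Python standard library. It assumes one pair of values per participant (for example, a single night, so the nesting issue raised in the question is set aside), and the hours shown are invented for illustration.

```python
import math
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation: how closely the two measures covary."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def paired_t(x, y):
    """Paired-samples t: is there a systematic mean difference?
    Returns (mean difference, t statistic, degrees of freedom)."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return mean(diffs), t, n - 1

# Hypothetical hours slept for 6 participants on one night.
diary = [7.0, 6.5, 8.0, 5.5, 7.5, 6.0]
actigraph = [6.4, 6.1, 7.2, 5.3, 6.8, 5.9]

r = pearson_r(diary, actigraph)
mean_diff, t, df = paired_t(diary, actigraph)
print(f"r = {r:.3f}; mean difference = {mean_diff:.2f} h, t({df}) = {t:.2f}")
```

A high correlation alongside a clearly nonzero mean difference would be the pattern where the two measures rank people similarly but one systematically reports more sleep than the other.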

You can calculate ICC for nested data, but you’ll need to do multilevel modeling. You probably should not do it in SPSS. I would recommend either R or the program HLM. You could instead determine ICC at each time point (not for each participant – that’s not very useful), but you do lose the ability to examine accuracy over time that way. You’d need, at a minimum, an argument that accuracy shouldn’t change over time for some literature-supported reason.

You might – maybe – be able to calculate ICC by hand using an RM-ANOVA, which could be done in SPSS, but I’ve never seen any work on that specifically.

Thank you for the reply. The reason we even thought about doing ICCs with this data is because other authors did but I wasn’t sure it was appropriate considering the multilevel nature of the data. Is there any way to do some type of modified Pearson correlation with multilevel data that you are aware of or would you recommend trying to run a multilevel model in R and getting ICCs from that?

I wanted to clarify since at first you suggested not running ICCs with this data but then once the multilevel issue came into play it seemed like you were suggesting it might be feasible.

To be clear, it’s feasible with appropriate justification (i.e., that you can argue the two approaches are drawn from the same population of estimators), so that has theoretical and interpretive implications which carry a certain degree of risk. I can’t tell you how risky that is without knowing the project or your research literature better. If prior researchers publishing in decent journals have done it the way you are doing it, it is probably pretty low risk.

I will say, however, that the disadvantage of using a single ICC versus a Pearson’s correlation (or even an ICC) at each time point individually is that if there are any subtle differences over time (e.g., if the actigraph becomes less accurate over time, or if diary entries are biased differently over time, etc), these could be washed out in ICC. If there are any large differences, it will just bias your overall ICC downward – that’s the risk of general summary statistics in data where there may be multiple effects. If you’re confident there aren’t any such effects, then it doesn’t really matter.

If you actually use a multilevel modeling program (e.g., HLM), you could alternatively calculate ICC given a three-level model – hours within measurement approach within time (or hours within time within measurement approach) – which might solve both problems.

Ok. Thank you for the help. It is appreciated.

Dear Dr Landers,

Sincere regards. It would be a great help if you could please help me with this.

I am measuring trust within software development teams using scores on trust factors. Scores are on a scale of 0-5 (poor to excellent). I have 8 teams in total, which will stay the same for the entire exercise, and the items (trust factors) on which individual team members give scores will also stay the same. Scores will be collected in 3 cycles, one every 2 weeks, after incorporating improvements in the trust factors on which scores are low. Can you please suggest whether I should use ICC1 or ICC2 to measure inter-rater reliability, and which technique I should use for data validation?

@sulabh

Just to add a little more information to my previous question: I will be taking scores from 8 teams, each with around 7-8 members (about 62 members in total), giving scores on a 0-5 scale on the trust factors, and I have 35 trust factors in total on which I am seeking trust scores. Can you help me with which of ICC1 or ICC2 I should proceed with, and which technique will best suit data validation?

many thanks

Sulabh

Since you’re aggregating, this is much more complicated than a statistical question alone. The type of ICC you need depends on the goals of your project. I would recommend you take a look at the aggregation article I’ve cited for others in earlier replies. Most critically, remember that ICC(1) and ICC(2) in the teams literature refer to ICC(2,1) and ICC(2,k) in the Shrout & Fleiss framework. You will probably need both, because they tell you different things.

Thanks so much Dr Landers,

Can you please suggest any article or provide any link where I can study how to calculate ICC(2,1) and ICC(2,k)? Also, can I use exploratory factor analysis and Cronbach’s alpha for validating my data?

The Shrout & Fleiss article linked above discusses both versions of ICC you are interested in (I believe as Case 2). An alternative conceptualization is that presented by McGraw & Wong (1996), published in Psychological Methods, which uses the numbering system I’ve described here.

You could use EFA or CFA, but I would probably use CFA if I was going to take that approach. Cronbach’s alpha is the same as a consistency-type ICC(2,k).
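The equivalence between alpha and the consistency-type average-measures ICC can be checked numerically. The sketch below is only an illustration under stated assumptions: alpha is computed from the usual item-variance formula, the ICC from two-way ANOVA mean squares as (MS between cases minus MS residual) divided by MS between cases, and the data and function names are hypothetical.

```python
from statistics import mean, variance

def cronbach_alpha(table):
    """Alpha from item (column) variances and the variance of row totals."""
    k = len(table[0])
    item_vars = [variance(col) for col in zip(*table)]
    total_var = variance([sum(row) for row in table])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def icc_consistency_average(table):
    """Consistency-type, average-measures ICC from two-way mean squares."""
    n, k = len(table), len(table[0])
    grand = mean(x for row in table for x in row)
    ss_rows = k * sum((mean(row) - grand) ** 2 for row in table)
    ss_cols = n * sum((mean(col) - grand) ** 2 for col in zip(*table))
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ms_rows = ss_rows / (n - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows

# Hypothetical data: 6 cases (rows) scored on 4 columns (raters or items).
table = [
    [4, 3, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
    [4, 5, 4, 4],
]

alpha = cronbach_alpha(table)
icc_ck = icc_consistency_average(table)
print(f"alpha = {alpha:.6f}, consistency ICC (average measures) = {icc_ck:.6f}")
```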

Hi Richard,

Thank you for this informative post, it was very helpful!

I have a quick question about assessing the inter-rater reliability of a diagnostic measure (i.e., categorical data).

My sample will include 20 participants, with the same two raters assessing each participant. My variable is a dichotomous one (i.e., Yes/No based on whether or not the rater gave them a diagnosis using the measure).

So far, the two raters have assigned every subject the same rating, and therefore I am getting a warning when I run kappa on SPSS and it won’t provide me with a kappa statistic.

If my raters continue to do this, will I not get a Kappa statistic at all??

Also, other than Kappa, can you recommend another statistical measure to assess inter-rater reliability?

Thanks!

Remember that your goal is high reliability, not a high kappa specifically. Kappa is just an assessment of chance-corrected percentage agreement. In your case, I’d just use percentage agreement – 100% (reliability = 1).
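To see why kappa cannot be computed in this situation, here is a small sketch of Cohen's kappa written out by hand. It assumes that "the same rating" means both raters used a single category for every participant, in which case chance-expected agreement is already 1 and the kappa formula divides zero by zero. The function names and example ratings are my own illustration, not SPSS output.

```python
from collections import Counter

def percent_agreement(r1, r2):
    """Proportion of cases where the two raters gave the same rating."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohen_kappa(r1, r2):
    """Cohen's kappa for two raters; None when it is undefined."""
    n = len(r1)
    p_obs = percent_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    # Agreement expected by chance from each rater's category proportions.
    p_exp = sum(c1[c] * c2[c] for c in set(r1) | set(r2)) / (n * n)
    if p_exp == 1:
        return None  # both raters used one category only: 0 / 0
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical: both raters say "yes" for all 20 participants.
rater_1 = ["yes"] * 20
rater_2 = ["yes"] * 20
print(percent_agreement(rater_1, rater_2), cohen_kappa(rater_1, rater_2))

# Hypothetical: some variation in ratings, so kappa is defined.
print(cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
```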

Hi Richard, thank you for the clear explanations! Already used them several times to my advantage.

I have a particular question for which I don’t find an answer…

Students collected 64 voice samples that were rated for several (6) parameters on 2 occasions by the same group of 5 raters. Overall interrater reliability would use the ICC (2,5) model, no problem there!

Remaining question: the students want to know which individual sample out of the 64 has the highest agreement/reliability on a certain parameter. They are trying to build a collection of ‘ideal’ voice samples for teaching purposes…

Should we calculate ICC between the 5 raters for a single sample and then choose the highest number? I don’t think this is the correct solution, but I’m stuck on this one…

Any ideas would be greatly appreciated!!

Thanks!

Jan

So, if this isn’t for publishing purposes, you have more options than you would otherwise. If you’re ok looking at agreement instead of consistency, I’d probably just calculate the standard deviation of the ratings for each sample (i.e., across the 5 raters) – the samples with the smallest SDs will have the least variance between raters. However, you should keep in mind that there will still be error associated with those SDs – don’t take them as a definitive ordering (this is why I mention publishing), but it should still get you what you need. If you wanted to look at consistency, you could probably modify this approach by converting each rater’s ratings into z-scores and then calculating the standard deviation of the z-scores for each sample (this sort of lazily controls for rater mean differences).
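A rough sketch of both versions of this idea is below. It reads the suggestion as one SD per voice sample, taken across the raters, and for the consistency variant it standardizes within each rater first. The sample names and ratings are hypothetical, and the z-score step assumes every rater's ratings vary at least a little.

```python
from statistics import mean, stdev

# Hypothetical ratings on one parameter: each sample has 5 rater scores.
ratings = {
    "sample_01": [3, 3, 4, 3, 3],
    "sample_02": [1, 4, 2, 5, 3],
    "sample_03": [2, 2, 2, 2, 2],
    "sample_04": [4, 5, 4, 4, 5],
}

def spread_by_sample(table):
    """SD of the ratings for each sample, taken across raters."""
    return {name: stdev(scores) for name, scores in table.items()}

def zscore_within_rater(table):
    """Standardize each rater's column so rater mean differences drop out."""
    names = list(table)
    n_raters = len(table[names[0]])
    z = {name: [] for name in names}
    for j in range(n_raters):
        column = [table[name][j] for name in names]
        m, s = mean(column), stdev(column)  # assumes s > 0 for every rater
        for name in names:
            z[name].append((table[name][j] - m) / s)
    return z

# Agreement-style ordering: smallest SD first.
ranked = sorted(spread_by_sample(ratings).items(), key=lambda kv: kv[1])
# Consistency-style variant: same SD, computed on within-rater z-scores.
z_spread = spread_by_sample(zscore_within_rater(ratings))

for name, sd in ranked:
    print(f"{name}: raw SD = {sd:.2f}, z-score SD = {z_spread[name]:.2f}")
```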

Dear Mr. Landers,

thank you for all your explanations. But as you realized there are still some individual questions;)

So here is mine:

I have a huge dataset with several companies, employees and leaders. I want to analyze the self-other agreement in leadership. To assign the followers to their leaders I generated a new dataset aggregating the mean score of the self perceived leadership, the follower perceived leadership and the follower job satisfaction for each of the 300 companies. (Leadership was computed before out of 10 items of the questionnaire).

So my new dataset consists of 300 companies. For each company I have the three mean scores (self-perceived leadership, follower-perceived leadership and follower job satisfaction). I want to run a polynomial regression using self- and follower-perceived leadership as independent variables and follower job satisfaction as the dependent variable.

In the self-other agreement in leadership literature I read that you should compute an ICC score before you run the regression. So my question is: at what point and which ICC should I compute? Do I just have to compute an ICC between self- and follower-perceived leadership in my new dataset?

I hope my explanation of the problem is not too confusing.

Thank you in advance!

Best wishes

Jannis

I’m afraid I don’t know the leadership literature well enough to have an answer for you. My suspicion is that you’re conflating a few types of ICC – you need ICC to determine how well the individual follower ratings reflect the population individual and mean follower ratings before aggregation, and you’ll need some other type of reliability (although I wouldn’t personally go with ICC) to determine agreement between the mean follower rating and the leader rating. But that’s a bit of a guess. However, I’m confident that the articles on aggregation that I’ve posted in other answers (especially those related to rwg) will get you closer to what you need.

Thank you for your fast response!

I was thinking of the same, first computing ICC within the follower group. But how do I do that? I have about 19.000 followers. How do I compute the ICC within one group???

It’s interesting that they were also referring to the rwg in the literature. I have to look that up! Especially the article you cited: LeBreton, J. M. & Senter, J. L. (2008). Answers to 20 questions about interrater reliability and interrater agreement. Organizational Research Methods, 11, 815-822. Any clue whether there is a free version of this article somewhere? I am a student in Germany and I don’t have free access to this article anywhere.

I am not blessed with that much money to pay 20 dollars for every article;)

Thank you again!

ICC isn’t calculated within one group – it is used to assess the consistency of all groups in predicting their respective leaders. If you have a variable number of followers per group, SPSS can’t do that – you’ll need to either calculate ICC by hand from ANOVA or use another program that can do it for you, like HLM.
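For the "by hand from ANOVA" route, a minimal sketch is shown below. It assumes the one-way random-effects, single-rating form of ICC (leaders as groups, followers as raters nested within leaders) and uses a common average-group-size adjustment for unequal group sizes. The ratings and the function name are hypothetical, and a multilevel program would be the more complete solution.

```python
from statistics import mean

def icc1_unbalanced(groups):
    """One-way random-effects, single-rating ICC for groups of unequal size.

    groups: list of lists, one inner list of follower ratings per leader.
    """
    g = len(groups)
    sizes = [len(x) for x in groups]
    n_total = sum(sizes)
    grand = sum(sum(x) for x in groups) / n_total
    ss_between = sum(len(x) * (mean(x) - grand) ** 2 for x in groups)
    ss_within = sum((v - mean(x)) ** 2 for x in groups for v in x)
    ms_between = ss_between / (g - 1)
    ms_within = ss_within / (n_total - g)
    # Adjusted average group size; equals the common size when balanced.
    k0 = (n_total - sum(s * s for s in sizes) / n_total) / (g - 1)
    return (ms_between - ms_within) / (ms_between + (k0 - 1) * ms_within)

# Hypothetical follower ratings for four leaders with 3, 2, 4 and 3 followers.
followers_by_leader = [
    [4, 5, 4],
    [2, 3],
    [5, 5, 4, 5],
    [3, 2, 3],
]
print(f"ICC = {icc1_unbalanced(followers_by_leader):.3f}")
```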

I don’t know if that article is available for free.

Thank you so much for your help!

But I have one last question. So if I can only calculate ICCs between groups, maybe I don’t need an ICC at all, because for each company I compute the mean of all followers rating leadership in general. Additionally, I compute the mean of all leaders rating leadership in the company in general.

So that I have

| Company | Mean Rating Leadership (Leaders) | Mean Rating Leadership (Followers) |
| --- | --- | --- |
| Company A | 4.3 | 3.9 |
| Company B | 4.1 | 2.8 |
| Company C | 4.9 | 4.5 |
| Company D | 3.8 | 1.9 |
| … | … | … |

Thanks in advance!

Jannis

The problem with ignoring reliability is that any reader/reviewer will not be able to tell how well your follower scores hang together. For example, on a 5-point scale, you might have 2 leaders both with a mean of 4.0. The first leader has scores ranging from 1 to 5 whereas the second has scores ranging from 3 to 4. That means scores of the second leader have greater reliability than those of the first, which is usually interpreted in the leadership literature to mean that the leader construct is more accurately assessed by people with leader 2 than leader 1. ICC is used to assess the degree to which this consistency exists on average across all leaders, across all companies.

At this point, I’d suggest you bring a statistician onto your project; the more you write, the more it sounds like the aggregation strategy of your project is closely related to interpretation of your hypotheses, and if so, you would be better off with someone with expertise in this area (and preferably with leadership too) actually working on your project rather than just asking questions on the internet.

Dear Dr. Landers,

Our group of 3 raters each completed 5 rating scales for 20 patients at 3 visits per patient. The 5 rating scales are diagnostic checklists in a Likert-type format; each has a different number of final categories/diagnoses, and I treated them as ordinal. We are interested in “Absolute Agreement” between the 3 raters for all the visits and particularly at each patient’s last visit. Using an ICC(3,1) – “Two-Way Mixed Model” with “Absolute Agreement”, our ICCs for each scale range from .5 to .6 (single measures) and .7 to .8 (average measures) for all 60 visits.

When we looked at the ICCs for the last visit only (N=20), the ICCs were lower for all 5 scales even though the 3 raters’ ratings are actually in closer agreement at the last visit as the disease progressed. When I looked at the raw ratings for one of the scales, there were only 4 cases (out of 20) of disagreement among the 3 raters (see below), but the ICC coefficient is .36 (single measures) in this case. The lack of variance among these raw ratings, which should indicate “agreement”, does not seem to be reflected in the ICC calculations. Did I do something wrong here?

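Last Visit (N=20)

| Rater1_Checklist 5 | Rater2_Checklist 5 | Rater3_Checklist 5 |
| --- | --- | --- |
| 2 | 1 | 3 |
| 1 | 1 | 1 |
| 1 | 1 | 1 |
| 1 | 1 | 1 |
| 1 | 1 | 1 |
| 1 | 1 | 1 |
| 1 | 1 | 1 |
| 1 | 1 | 2 |
| 1 | 1 | 1 |
| 1 | 1 | 1 |
| 1 | 1 | 1 |
| 1 | 1 | 2 |
| 1 | 1 | 1 |
| 3 | 3 | 1 |
| 1 | 1 | 1 |
| 1 | 1 | 1 |
| 1 | 1 | 1 |
| 1 | 1 | 1 |
| 1 | 1 | 1 |
| 1 | 1 | 1 |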

No, that sounds right. Your raters agree in 80% of cases (also a reliability estimate); however, in the cases where they disagree, the pattern is the opposite of predicted rater effects.

In this case, Rater 3 rates much higher than both Rater 1 and 2 in 75% (3) of cases where there is disagreement, but dramatically lower in 25% (1) of cases where there is disagreement – specifically, in the 3-3-1 case, in which case it is in completely the opposite direction. That harms reliability dramatically. If you change the 3-3-1 case to 3-3-3 (just as a test case), you’ll find that your single-rater reliability increases to .737.

The reason this has such an extreme effect is because you have so little variance. Remember that variance is necessary for ICC to give you reasonable estimates of reliability. You want variance, but you want that variance to be _predictable_. In your case, there just isn’t any variance in the first place, so the “bad” (unpredictable) variance is larger than the “good” (predictable) variance, driving your reliability estimate down.

This is the same principle at play in ANOVA (which ICC is based on) – you want there to be very small differences within groups but very big differences between group means to get a high F statistic. In this case, there are small differences everywhere, so your small inter-rater differences seem huge in comparison to your also-small inter-case differences.

Given your situation, I would probably just use percentage agreement, or even Fleiss’ kappa, as my estimate of interrater reliability with this dataset.
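For anyone who wants to poke at this dataset, the sketch below transcribes the 20 last-visit cases from the question and prints the mean squares next to the percentage of cases with complete agreement. It assumes the standard two-way ANOVA decomposition and the single-measures consistency and absolute-agreement formulas; the function names are mine, and the 3-3-3 substitution is the same test case mentioned above.

```python
from statistics import mean

def mean_squares(cases):
    """Between-case, between-rater and residual mean squares."""
    n, k = len(cases), len(cases[0])
    grand = mean(x for row in cases for x in row)
    ss_cases = k * sum((mean(row) - grand) ** 2 for row in cases)
    ss_raters = n * sum((mean(col) - grand) ** 2 for col in zip(*cases))
    ss_total = sum((x - grand) ** 2 for row in cases for x in row)
    ss_resid = ss_total - ss_cases - ss_raters
    return (ss_cases / (n - 1), ss_raters / (k - 1),
            ss_resid / ((n - 1) * (k - 1)))

def single_iccs(cases):
    """Return (consistency, absolute agreement) single-measures ICCs."""
    n, k = len(cases), len(cases[0])
    msr, msc, mse = mean_squares(cases)
    consistency = (msr - mse) / (msr + (k - 1) * mse)
    agreement = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    return consistency, agreement

def full_agreement(cases):
    """Proportion of cases where all raters gave the identical rating."""
    return sum(len(set(row)) == 1 for row in cases) / len(cases)

# Last-visit ratings transcribed from the question (20 cases x 3 raters).
last_visit = (
    [[2, 1, 3]] + [[1, 1, 1]] * 6 + [[1, 1, 2]] + [[1, 1, 1]] * 3
    + [[1, 1, 2]] + [[1, 1, 1]] + [[3, 3, 1]] + [[1, 1, 1]] * 6
)
# Test case: replace the 3-3-1 row with 3-3-3.
test_case = [list(row) for row in last_visit]
test_case[13] = [3, 3, 3]

for label, data in (("original", last_visit), ("3-3-3 test", test_case)):
    msr, msc, mse = mean_squares(data)
    cons, agree = single_iccs(data)
    print(f"{label}: agreement on {full_agreement(data):.0%} of cases, "
          f"MS cases={msr:.3f}, MS residual={mse:.3f}, "
          f"consistency={cons:.3f}, absolute agreement={agree:.3f}")
```

Comparing the between-case mean square with the residual mean square shows how little "good" variance there is to work with: a single discordant row is enough to move both noticeably.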

Thank you for the clear explanations! It makes perfect sense now. Percentage agreement would be easy to do for this dataset, but I don’t think it takes agreement by chance into consideration? Thanks very much again for your insights, much appreciated!

It doesn’t. If you want to worry about that, I’d use Fleiss’ kappa.

I appreciate the simplicity that you use to explain a fairly complicated analytical method, so thank you! I wanted to clarify that I am using your technique correctly, as I have a team of Patient Navigators (PNs) who collected data from 900 community participants. The number of surveys collected was not equal across PNs, and I’m interested in the variance within each set of data collected by an individual PN as well as the variance between participants of different PNs. If I’m understanding your technique correctly, I would use ICC(2), as I’m interested in the mean difference, and would refer to the “average measures” in the output?

I don’t know the field-specific terms you’re using here, but if you’re saying that you expect the means across PNs to be equal, and if the PNs are each assessing the same group of participants, yes, ICC(2) would be the right approach, although I think you want to know ICC(2,1), single measures, to assess how accurate a single PN is in capturing their assigned group’s mean. If you have different participants by PN, you may want ICC(1). Remember also that you cannot assess ICC when your groups contain different numbers of cases using the built-in SPSS functions; you’ll need to either calculate it by hand or use another program designed for this (e.g., HLM).

I was wondering if you had any suggestion for entering N/A from a global scale that goes from N/A (the target participant did not respond to the other participant because the other participant did not make any suggestions) to 5 (the target participant responded to the other participant’s suggestions a lot). For reliability purposes only, I am wondering if I could enter a zero for N/A or if a value like 6 would be more appropriate.

First of all, there are no “reliability purposes only.” The way you calculate reliability must be matched to the way you actually use those scores – otherwise you have a meaningless estimate of measurement quality. So if you are not using scale scores of N/A in analyses, those cells are essentially missing data. If you do include N/A in your reliability estimate somehow, you must use those values in future calculations as you treated them for reliability (i.e., if you calculate reliability on scale scores with N/As represented as 0, any analyses you do must be on variables including that 0 too).

Whatever you do, you need interval or ratio measurement of the scale to use ICC. Since you are apparently already confident in the interval+ measurement of 1-5 (i.e., at a minimum, the distance between 1 and 2 is the same as the distance between 2 and 3, between 3 and 4, and between 4 and 5), you should consider if the same is true for N/A to 1. If so, you could reasonably recode N/A as 0. If not, you could instead consider the analysis of two different variables, one binary-coded variable distinguishing N/A and not-N/A, and another with 1-5. But you will have missing data in the 1-5 variable that way, so be sure this is theoretically meaningful.
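The two coding options can be sketched as follows. This is purely illustrative: `None` stands in for an N/A code, the scores are hypothetical, and whichever recoding is chosen for the reliability estimate would also have to be the one used in every later analysis.

```python
NA = None  # stand-in for an "N/A" code in the raw data

def recode_na_as_zero(scores):
    """Option 1: treat N/A as the scale point just below 1."""
    return [0 if s is NA else s for s in scores]

def split_na(scores):
    """Option 2: a binary N/A indicator plus a 1-5 variable with gaps."""
    responded = [0 if s is NA else 1 for s in scores]  # 0 = N/A, 1 = rated
    rating = list(scores)                              # N/A stays missing
    return responded, rating

# Hypothetical codes from one rater across six interactions.
rater_a = [NA, 3, 5, 1, NA, 4]

print(recode_na_as_zero(rater_a))
responded, rating = split_na(rater_a)
print(responded, rating)
```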

I was also wondering if you could clarify the difference between using single measures and average measures for your ICC variable. I understand that average measures is typically used in research; however, I also know that you said that single measures is able to tell you how reliable a single rater is on their own. Is it okay to just use one of the ICC values, or is it important to ensure that the ICC value is above .70 for both the single and average measures?

Thank you

It’s only typically used in research because that is what researchers are most often (although not always) interested in. Remember that reliability is conceptually the proportion of true variance to total variance – so a reliability of 0.7 means that 70% of the differences between scores can be attributed to something stable whereas 30% is something that isn’t. In ICC’s case, the 30% is attributed to differences between raters.

If you only have one rater, the rater-contributed variance is much higher because you don’t have many different raters to average across. All of the mismeasurement caused by that rater is present, and your numbers will be randomly weird. When you have two raters, it’s essentially cut in half – Rater 1 is randomly weird and Rater 2 is randomly weird, but because you’re taking an average, a lot of that random weirdness averages out to zero.

So it depends on what you’re trying to do with your numbers. If eventually a solitary person will be making these judgments (or you’ll be using a single instrument, etc), what you want to know is how much of the variance that rater is capturing is “real”. That’s single measures. If you will always have multiple raters and will be taking an average, that’s average measures. If you’ll be using a different number of raters than you actually had available for your reliability study, you can mathematically derive that reliability from either of the two you already have.
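The derivation mentioned in the last sentence is the Spearman-Brown relationship. Here is a short sketch, with function names of my own choosing, that steps an average-measures value down to a single rater and then back up to a different number of raters. The .75 and the rater counts are placeholder values.

```python
def single_from_average(average_reliability, k):
    """Step down: reliability of one rater, given the reliability of
    the mean of k raters."""
    return average_reliability / (k - (k - 1) * average_reliability)

def average_from_single(single_reliability, m):
    """Step up: reliability of the mean of m raters, given the
    reliability of a single rater."""
    return m * single_reliability / (1 + (m - 1) * single_reliability)

# Hypothetical: the reliability study used 3 raters (average measures = .75),
# but only 2 raters will be available going forward.
single = single_from_average(0.75, k=3)
two_raters = average_from_single(single, m=2)
print(f"single rater: {single:.3f}; mean of 2 raters: {two_raters:.3f}")
```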

We have created a composite score of two of our variables and I was wondering if in that case you would use the ICC from the average measure because we added together scores on the rating scale from the two variables OR if we would still look at the single measures because each of the coders will eventually be going on their own to code the videos.

Thank you

Average measure refers to the raters, not the scale the raters used. You want single measures if you want to know how well a person can code a video on their own.

Dear Dr. Landers, I would like to test inter rater reliability of a patient monitoring tool. Both investigators have been asked to monitor 10 patients and identify care issues. I have assigned a score (out of 1) to each rater by seeing how many care issues each identified out of the total care issues identified by both i.e. if 5 issues have been identified 4 of which are common for both and 1 rater identified 4/5 I assigned a score of 0.8 whereas the other rater identified 5/5 therefore I assigned a score of 1. Which test would be most suitable in this case to test for reliability please?

It depends if you have access to the original data or not. It also depends on how many total issues there are. If you have the original data, and hopefully you do, you would probably want to code the agreement across each dimension, preferably coding 0 for absent issue and 1 for present issue, then use a kappa (or even percentage agreement) to assess each agreement on each issue individually.
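As a rough sketch of that per-issue approach, the snippet below computes percentage agreement and Cohen's kappa for one issue coded 0/1 by two raters. The ratings are invented for illustration, and the function names are my own, not anything from SPSS.

```python
def percent_agreement(a, b):
    """Share of cases on which the two raters gave the same code."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa for two raters: agreement corrected for chance."""
    n = len(a)
    observed = percent_agreement(a, b)
    categories = set(a) | set(b)
    # Chance agreement: product of each rater's marginal proportions.
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    return (observed - expected) / (1 - expected)


# Hypothetical codes for one care issue across 10 patients (1 = present).
rater1 = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
rater2 = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

print(percent_agreement(rater1, rater2))
print(round(cohens_kappa(rater1, rater2), 3))
```

Note that kappa is undefined when chance agreement equals 1 (for example, when both raters use only one category), so this sketch would raise a division error in that case.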

However, for all of this, 10 cases is going to produce a massive confidence interval – the reliabilities you find will be highly volatile.

You probably do not want to use ICC on your percentages unless your goal is to assess “in general, how much do the investigators agree”, in which case you’d probably just want to report the mean of those proportions (i.e., mean percentage agreement), but that would camouflage any by-issue disagreement, which may or may not matter to you. You also might be able to justify ICC(2,1) depending upon your hypotheses, but I don’t think ICC(2,k) would be interpretable unless you’re going to use the mean agreement percentage in some sort of follow-up analysis.

Hi Dr Landers,

I am confused how to set my data up. I am looking for rater reliability of a tool to measure healthcare worker competence in evacuation. We have designed a rubric (face/content validity established). The rubric is divided into multiple tasks that are complete, partially complete, not complete (2,1,0). There are also several points where time is measured to complete the task. We recorded the evacuation of 3 patients in 3 different scenarios (Poor, Average, Good). We showed the video to a group of 10 raters. I know that I will have 10 raters for the columns (Hopefully that is right). Do I then put each of the ratings for the 3 groups of videos as rows for the individual items?

What do I do with the time measurements of the task?

Wish I was a statistician…..

From what you’ve described, it doesn’t sound like you have enough cases for an assessment of reliability – it doesn’t sound like you have replication within each condition, which you need. It also sounds like you have multiple measures being taken within each video, possibly, which would violate independence assumptions of any reliability assessment, but it depends on your research goals and influences what sorts of conclusions you can validly draw. I think you’re going to need to bring in someone in a fuller capacity than I can provide via blog replies.

Thanks for the response. Would you please remove my last name from the post or remove the post? I didn’t realize I could post without my full name. Thanks

Dear Dr. Landers,

Thanks for your prompt reply! I do have access to the original data. In these 10 patients I have identified a total of 29 care issues.

When you told me to code agreement across each dimension .. in this case dimension refers to each patient or to each observer?

In this case 1 observer detected 22 issues (in 10 patients) and the other 23 issues (in same 10 patients) however these were not always common for both in fact the total number of issues was 29.

The hypothesis is that the tool is reliable enough for patient monitoring… no follow up analysis as such will be carried out.

Thanks and Regards,

Maria

Dimension refers to each issue. If you have 29 issues that are distinct, you have 29 possible scales on which to agree, which means you need 29 estimates of reliability.

Alternatively, if you are saying all 29 issues are unidimensional (i.e., all measure the same underlying construct), then you should convert all of them into binary indicators (1 for present, 0 for absent), calculate the mean score for each rater, and then assess the reliability of those mean scores.

Hello Dr. Landers,

Thank you for your posting on this topic! I was hoping you can verify that my current calculations on SPSS are correct? For my study, the same two individuals assessed 15 patients each with the same 25 questions. The 25 questions can be grouped into 5 questions each based on concept. The columns are labelled Person 1 and Person 2, while the rows are the 15 patients. Each cell is the mean score of 5 questions in a concept. Since there are 5 concepts each with 5 questions, then there are 5 data sheets. I determined the two-way mixed methods ICC value and Cronbach’s alpha value for the mean scores of each concept. Is that correct?

Thank you for your input!

Maya

Unless you’re interested in conducting scale diagnostics, you don’t really need alpha in this context, since you are already getting an estimate of reliability of the means from ICC. You will probably end up with 5 ICCs – one for each concept – unless you have reason to believe the concepts are unidimensional, in which case you would probably just want 1 ICC. Otherwise this sounds pretty normal to me.

I was also wondering how to go about determining an overall Cronbach’s alpha value and ICC value?

Hi Dr. Landers,

Thank you for being here. Wishing you always good health and prosperity…so we could always have your guidance and expertise.

Dr., I have a few questions. They may have been addressed here, but since I’m using my cell, checking your previous comments is a bit limited. My queries:

1. If I need to check the percentage of raters between two known raters on a test question, what is the least number of students required to be able to be executed in the SPSS? If the number of students is less than 5, can these still be calculated in spss?

2. In another case, there are two known raters, for 10 question test. Will 20 students suffice to determine the percentage of raters?

Looking forward to your help.

Best regards

Aimy

To calculate an ICC, you just need inter-case and inter-rater variance, so I suppose the minimum is 2. But your confidence interval (i.e., the mismeasurement of that ICC) is going to be pretty wide with so few cases.

It’s not really possible to give a specific number of students without knowing what the ICC is ahead of time – what you’re asking about is essentially a power calculation. So the greater agreement you have, the fewer cases you need to detect it.

Dr. Landers,

Thanks very much for this useful information. Two questions: 1) Could you verify that I’ve made the correct choices for decisions 1, 2 and 3? 2) Do you have a citation for the recommendation you posted about using scale means and not individual items (see below)? This makes sense to me and a reference for the paper where we’re doing exactly this would be great.

“A special note for those of you using surveys: if you’re interested in the inter-rater reliability of a scale mean, compute ICC on that scale mean – not the individual items. For example, if you have a 10-item unidimensional scale, calculate the scale mean for each of your rater/target combinations first (i.e. one mean score per rater per ratee), and then use that scale mean as the target of your computation of ICC. Don’t worry about the inter-rater reliability of the individual items unless you are doing so as part of a scale development process, i.e. you are assessing scale reliability in a pilot sample in order to cut some items from your final scale, which you will later cross-validate in a second sample.”

Study Design/Research Question: We have 126 mothers and 126 fathers who each separately rated their assessment of the father’s involvement with their child(ren). Each set of parents rated one or more children on 8 dimensions of father involvement (mean scale scores based on some set of individual continuous items). Our research question focuses on the extent to which the parents agree on their assessment of father involvement for each child (not across children in cases where they report on more than one child). We want to report the ICC’s associated with the various ratings.

Decision 1: one-way random effects or ICC(1)

Decision 2: individual (not average)

Decision 3: absolute agreement (not consistency)

Thank you so much in advance for your time.

PCharles

Since you have meaningful pairs (always one mother and one father), I would probably use a Pearson’s correlation in this context. Using ICC means that you are assuming that mothers and fathers are drawn from the same population of judges about the father’s involvement (i.e., mothers and fathers are drawn from the same population of people with opinions about the father’s involvement). If you expect mothers and fathers to have different perspectives on this, you probably don’t want ICC. But if your goal here is to get the most accurate rating possible of father involvement for later analyses, you could use ICC(2,k) to assess the mean of their two ratings, with mothers and fathers as consistent raters (i.e., mother as rater 1 and father as rater 2).

I don’t have a reference handy for that idea.

Dear Dr. Landers,

I have read through all the responses and still could not find an answer to my question. I have conducted an online questionnaire and asked multiple choice comprehension questions for a case study.

I have 5 comprehension questions and answers are multiple choice (coded as 1-0). There are 85 respondents. My mentor asked me to provide intra-correlation of the comprehension questions.

– I have located the answers from each respondent on the columns (86 columns) and 5 questions on the rows and run ICC (two way mixed). Is this a correct approach to find the intra-correlation of the questions for the reliability?

Many thanks for supporting us!

Kind Regards,

I assume by “intra-correlation” you mean “intraclass correlation.” These are slightly different concepts though.

First, this is an atypical use of ICC. In most cases like this, you would calculate coefficient alpha – although, in this case, alpha is actually a special case called a KR-20, since you are working with binary data.

Second, if you wanted ICC anyway, your rows and columns are reversed, and you probably want two-way random.
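For reference, here is a small sketch of the standard KR-20 formula, with respondents as rows and binary items as columns. The data are invented (a perfectly ordered response pattern) and the function name `kr20` is my own; it uses the population variance of total scores, which is one common convention.

```python
from statistics import pvariance


def kr20(scores):
    """KR-20 for binary items.

    scores: one row per respondent, each row a list of 0/1 item scores.
    """
    n = len(scores)     # respondents
    k = len(scores[0])  # items
    # Proportion answering each item correctly.
    p = [sum(row[i] for row in scores) / n for i in range(k)]
    item_variance = sum(pi * (1 - pi) for pi in p)
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_variance / total_variance)


# Hypothetical data: 6 respondents x 5 items.
data = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

print(round(kr20(data), 3))
```

With real data (85 respondents, 5 items) the layout is the same: one row per respondent, one column per question.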

Thank you very much Richard.

It’s so well-explained! I have used it for my master’s thesis. You’re great!

But, now, I would like to cite this source in my project. Could you provide me with the citation of this (article?)?

Thank you loads!

Gemma.

There is not really any “easy” citation for this page right now. You could cite it as a webpage, I suppose. I will have a DOI and source for it in the next couple of months though and will add a reply then with more info.

Dear Dr. Landers,

many thanks for your comments. I used KR-20 as you have suggested. Kind Regards,

Dear Dr. Landers,

Thank you very much for your contribution!

I am currently conducting a study and I have some problems on the statistical analysis. I am not sure if my problem can be addressed by ICC. Hopefully you can give me some insight.

I have interviewed 3 members (father, mother and child) of a family. All of them answered a 5-item questionnaire, on a 5-point Likert scale, which tested the level of permissiveness of the mother. I would like to know how the patterns of their answers relate to each other.

I would like to know if it is still correct if I do my data set like this:

| | father | mother | child |
|---|---|---|---|
| item 1 | 2 | 3 | 3 |
| item 2 | 1 | 2 | 1 |
| item 3 | 2 | 2 | 1 |
| item 4 | 1 | 1 | 1 |
| item 5 | 2 | 1 | 2 |

I suspect that it twists the usage of ICC and I do not know if it still makes sense statistically.

Thank you very much for your help!!

ICC is going to assume that your three raters are all drawn from the same population. Thus, using ICC means that you expect the father, child, and mother to provide the same sort of information. If you don’t think that’s true – and it doesn’t sound like you do – then you shouldn’t do that. I would instead just use something like ANOVA and calculate an effect size. Eta-squared would tell you the proportion of variance in your ratings explained by their source, so I’d probably recommend that.
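To make the eta-squared idea concrete, here is a minimal sketch using the five item scores from the question above, treating father, mother, and child as three groups in a plain one-way layout. That is a simplifying assumption on my part (it ignores that the same five items are rated by each person), and the function name is hypothetical.

```python
def eta_squared(groups):
    """Eta-squared for a one-way layout: SS_between / SS_total."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    return ss_between / ss_total


# Item scores from the question, one list per source.
father = [2, 1, 2, 1, 2]
mother = [3, 2, 2, 1, 1]
child = [3, 1, 1, 1, 2]

print(round(eta_squared([father, mother, child]), 3))
```

A value near zero means very little of the variance in ratings is explained by who gave them; with only five items per person, the estimate will be unstable.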

Dear Dr. Landers,

Thank you for your prompt and clear explanation!

I am glad that you suggested me an alternative way so that I know the direction to work on!

Millions of thanks!

By popular demand, this article has now been published in the *Winnower* for your citation needs. You can cite it in APA as:

Landers, R.N. (2015). Computing intraclass correlations (ICC) as estimates of interrater reliability in SPSS. *The Winnower 2*:e143518.81744. DOI: 10.15200/winn.143518.81744

You can also download it as a PDF here: https://winnower-production.s3.amazonaws.com/papers/1113/v12/pdf/1113-computing-intraclass-correlations-icc-as-estimates-of-interrater-reliability-in-spss.pdf

Hi Dr Landers,

Thanks very much for the very informative article. I do have some follow on questions, however. I am doing a study attempting to assess inter-rater agreeability/reliability. 3 raters (doctors) were tasked to rate a group of 48 identical ratees (patients) each for 6 variables (A-F) on a 5 point Likert scale measuring stability or progression of different risk factors in a disease. My first question is regarding the type of ICC test chosen in SPSS. Would it make sense in this case that a 2 way mixed type ICC should be chosen due to the fact that raters in this situation are specific to their own experiences as doctors? Or could their ratings and subsequent measure of agreeability be generalized for the population thus making the case for a 2 way random type ICC to be chosen?

Secondly, I’ve been exploring the different kinds of statistical tools available for analyzing inter-rater reliability and was wondering, between the ICC and the relatively newfangled Krippendorff’s alpha – would you recommend one over the other when it comes to assessing ordinal levels of data which potentially should be weighted?

Thanks very much for your time and effort for all the help you’ve put into this post.

Best regards!

It depends on to whom you are trying to generalize. If you want to run statistics generalizing only to those three doctors, 2-way mixed. If you want to run statistics generalizing to doctors, in general, 2-way random.

I believe ICC is a specific case of Krippendorff’s alpha – I believe they will be identical under the conditions both could be calculated. However, ICC can’t be calculated on ordinal data (it relies on meaningful rater means), so I suppose I’d go with alpha in your scenario.

Dear Dr. Landers,

I hope this question has not been asked before. I apologize if I missed it going through the previous posts.

I have the following study design. To examine the reliability of two different methods to assess gait (one based on a human analysis, the other based on a computer) we asked 3 raters (human analysis) to rate gait and had two identical computers. 20 participants walked 6m at their usual speed three times and the three humans and two computers had to assess their gait speed.

This means that each rater measured 20 participants 3 times (trial 1, 2 and 3). I want to compare the ICC for the human raters with the ICC of the computers. I can calculate the ICC for the three human raters for each trial (1, 2, 3) and the two computers for each trial (1, 2, 3) but then have three separate ICC values for the humans and 3 separate values for the computers.

Is there a statistical way to combine the 3 trials so I will only have one ICC for the humans and one ICC for the computers?

Thank you for your advice,

Bjoern

Maybe. ICC assumes that all raters are drawn from a single population of raters, and all cases from a single population of cases. If you have multiple trials, you have a new source of variance: time. If you think that the differences over time potentially create meaningfully new outcomes, you’d want to look at reliability separately for each time point, or model it explicitly somehow. If your three time points are themselves just a sample of a single population of assessments, and you just want to know mean reliability, you can do so – since ICC is a reliability estimate, it’s a percentage, which means you can just calculate a mean. The only problem with doing so is that there’s no established way to calculate a confidence interval around a mean ICC that I know of. So if you want to compare the mean human ICC and mean computer ICC, you’d need to do so based upon the absolute difference between those ICCs alone, versus any sort of hypothesis testing. I don’t know if that meets your needs or not, but that’s the best I can think of.

Professor, I am graduate student studying in Korea.

I found your website by chance while searching for a statistical method. Thankfully, I could learn a lot. I really appreciate you!

Currently, I am doing a research on organizational culture, and planning to aggregate individual responses on organizational culture data to organizational level construct.

Your guidance is very helpful for me, but I wonder whether my data is appropriate.

I measured organizational culture without using a Likert-type scale.

Respondents distributed 100 points across each of the four descriptive statements of organizational culture depending on how well they matched their company. (this way of classifying culture is called competing values framework – hierarchy, market, clan, and innovative culture)

Therefore, the total score distributed across each of the four culture totalled 100, and each person rates each culture such as 25, 50, 15, 10. I would like to create overall organizational culture scores for each company by averaging (aggregating individual responses). One company, for example, will have 30, 15, 20, 35 scores of each culture type.

As I heard that usually the guidance is appropriate for likert-type data, I would like to ask you that

1) whether this type of data is appropriate for your guidance as well.

2) If not, could you please recommend the way how I can justify the use of aggregated individual level scores as higher level data?

3) Someone recommended that I change the responses to Likert-type scales (0~20 -> 1, 20~40 -> 2, etc.). Then, is it possible to follow your guidance?

I appreciate your time for consideration on my questions in advance.

Have a great weekend.

Jeongwon Lee,

You’ve actually combined several different problems into one project, so I don’t have a good answer for you, although I can give you some guidance on the hurdles you need to clear.

Problem 1: Distributing scores among 100 points means you have ordinal data, meaning ICC (and alpha, and rwg, and all the other usual stats used here) cannot be used.

Problem 2: The ordinal data you do have are ipsative, meaning scores cannot be compared across people, since people can assign whatever ranks they want among the four statements. For example, when comparing a person that ranks 90, 10, 5, 5 versus 60, 30, 5, 5, you have no way to know if the first person judged culture on the first dimension dissimilarly from the second. If you’d asked them on a Likert scale, both the 90 and 60 could correspond to “Strongly Agree.”

Problem 3: You need to justify aggregation, not just assess reliability, which is a more complicated process. You need to establish two things: one, that each person’s view of culture maps onto the team’s view of culture to an acceptable degree (the level varies by whom you ask), and two, that the consistency of that view across people is sufficient to get a stable estimate of company culture.

In the presence of Problems 1 and 2, I am unfamiliar with any approach that will solve Problem 3. However, those are the three issues you will need to work through. It will probably involve multi-level structural equation modeling and estimators that can be used with ordinal scales. Good luck!

Dear Dr. Landers,

I really appreciate your comments!

I could figure out problems that my data have. Although it seems not easy to solve, I will try to solve those problems that you help to give me some guidance on.

I would like to thank you again for your kind help.

Jeongwon Lee

Dear Dr. Landers,

Thank you a lot for this post, I found it extremely well-written and useful!

I am dealing now with ICC and I have an issue in my analyses.

I have an experiment with 12 annotators coding a time value (the time of a specific occurrence) for 40 events. In some cases, I do have missing values (the annotator might not have coded the event or might have forgotten to save his answer). If I run the ICC in SPSS, the events with missing values are eliminated from the test. Is there any way to replace those missing values?

If I replace the missing values with “0” or any other continuous value, the analyses are not correct anymore.

I also thought about running 2 analyses:

1) Fleiss kappa with a dataset that contains only categorical variables (“0” for missing values and “1” for annotated values) to check the inter-rater reliability for the missing values.

2) running the ICC on the events that do not have missing values and compute the corresponding coefficient only for those events.

I would really appreciate your help on the topic.

Thanks,

Giulio

There are many approaches to missing values analysis. The easiest option is mean replacement, where you enter the mean of the other raters for your missing value. However, there are many limitations/assumptions to that approach. I would suggest you use the Missing Values Analysis procedure in SPSS and select the EM algorithm for imputation. Also set the option to “save new dataset” (I believe in an Options or Save menu?) – then run your reliability statistics (and all analyses) off of the dataset it imputes.
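To illustrate only the simplest option, mean replacement, here is a short sketch with invented data: each row is one event, each column one annotator, and `None` marks a missing rating. This is not the EM procedure in SPSS, just the "mean of the other raters" idea, and the function name is hypothetical.

```python
def mean_replace(ratings):
    """Fill each missing rating with the mean of the other raters for that case.

    ratings: one row per case, one entry per rater; None marks a missing value.
    Rows with no observed ratings at all are not handled here.
    """
    filled = []
    for row in ratings:
        observed = [x for x in row if x is not None]
        row_mean = sum(observed) / len(observed)
        filled.append([row_mean if x is None else x for x in row])
    return filled


# Hypothetical time codes (seconds) from three annotators for three events.
ratings = [
    [12.0, 12.4, 11.8],
    [30.1, None, 29.9],
    [45.0, 44.0, None],
]

for row in mean_replace(ratings):
    print(row)
```

Because the filled-in values sit exactly at the row mean, this approach makes raters look more consistent than they really were, which is one of the limitations mentioned above.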

The key to making that decision though is what you are going to do with missing values later. If you’re going to use listwise deletion when there aren’t 12 raters (not generally recommended), then you’ll want to use listwise deletion before you calculate reliability, and then you don’t want to do any imputation. If you’re going to use imputed values for reliability, you need to use them for analyses too. Everything needs to match.

If you do want to do missingness imputation, there are procedures to determine if your data are essentially “missing at random” (MAR) (vs. missing not at random; MNAR). Your data must be MAR (or essentially MAR) to justify this sort of analysis in the first place. I don’t have any papers handy, but my recollection is that as long as less than 10% of your data are missing, you are safe assuming MAR and running missing values imputation. But you might want to look into the research lit on that.

Dear Dr. Landers,

I’ve previously written to you to ask you about the use of one-way ICC to derive the intraclass correlations of twins. Can I confirm with you whether the data need to be normally distributed? Also, should I be looking at the ‘single measures’ or the ‘average measures’ row to get the ICC in my case?

Thanks and best regards,

Yi Ting

Yes – ICC is based upon ANOVA, so all of the same assumptions apply – independence of observations (by pair), normality both between and across groups, etc.

For the sake of example, let’s imagine that you’re having twin pairs rate their satisfaction with their parents. A single-measures ICC will tell you the extent to which a single twin’s opinion speaks to population twin satisfaction with parents. An average-measures ICC will tell you the extent to which the mean opinion within each set of twins speaks to population twin satisfaction with parents.
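A sketch of how the two one-way coefficients relate, using the Shrout and Fleiss ICC(1,1) and ICC(1,k) formulas built from one-way ANOVA mean squares. The twin satisfaction scores are invented, and the function name is my own.

```python
def one_way_icc(ratings):
    """Return (ICC(1,1), ICC(1,k)) from a one-way random-effects ANOVA.

    ratings: one row per target (e.g. twin pair), k ratings per row.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Between-targets and within-targets mean squares.
    bms = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    wms = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    single = (bms - wms) / (bms + (k - 1) * wms)
    average = (bms - wms) / bms
    return single, average


# Hypothetical satisfaction scores for 5 twin pairs.
twins = [[4, 5], [2, 3], [5, 5], [1, 2], [3, 3]]
single, average = one_way_icc(twins)
print(round(single, 3), round(average, 3))
```

The average-measures value is always at least as large as the single-measures value when the ICC is positive, since averaging two twins cancels some of the noise in each one.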

Dear Dr. Landers,

Thank you very much for your detailed answer.

I will follow your suggestions.

I do appreciate your help,

Giulio

Thank you, Richard, for this explanation. I am just confused about whether I can use ICC in my situation or not. I am working on my master’s dissertation, which is about sentiment analysis on texts. I have 428 different texts. I applied a code for sentiment analysis and retrieved results from -1 to 1. Then I used the code with the same texts translated into another language and found different results. I also have different results from 3 raters. I just want to figure out whether these sets agree or not. Is there a significant difference between them? If there is, I will look at the data in more detail.

Thanks,

It sounds like you have three numbers to compare – the mean of 3 raters, the result of sentiment analysis, and then the result of sentiment analysis on translated data. If that’s correct, you can certainly use ICC to determine the reliability of your rater mean, assuming that your ratings otherwise meet the assumptions of ICC. To compare the rater mean and two sentiment analysis results, assuming they are all on the same scale, you might want to use an ANOVA.

Thanks for replying,

Sorry for that, but why do you assume that I will compare with the mean of 3 raters? I thought each must be compared as a separate classifier. For that, I was assuming I have 5 rows to compare. One is the result of sentiment analysis on the original text, the second is the result on the translated version, and the last three are the 3 raters. Is calculating the mean an efficient way?

Another thing: I did not think of ANOVA because its assumptions may not be satisfied by my data sets.

Thanks.

The analysis should be driven by your research question. I was assuming that you’re interested in to what degree your raters and the two unique sentiment approaches differ from each other. If that’s not true, for example if the raters are not drawn from a single theoretical population of raters, you should not test it that way.

If your data don’t meet the assumptions of regular ANOVA due to non-normality or the use of rank data, then you might use a Kruskal-Wallis ANOVA, which is non-parametric. But you said your scores varied from -1 to +1, which sounds like ratio-level data, on the surface of it. If you have hierarchical nesting causing non-independence, then things get more complicated.

Thanks Richard for replying. Yes, your assumption is correct. I am interested in to what degree the raters and the two sentiment approaches agree or disagree, and which version is more similar to human judgement. So, do you think it is better to calculate the mean? I was thinking of comparing the two sentiment approaches together with the raters.

I tested ANOVA before with the two approaches and each rater individually. However, the result was always meaningless. For that reason, I assumed a violation of the ANOVA assumptions. In addition, I am interested in showing how much they agree or disagree, not just whether there is a difference or not.

Thank you so much.

You really need to think carefully about what populations you are trying to compare here, because that should drive your analysis. To the extent that you have multiple measurements of a given population, you should assess reliability for those measurements, and then compare the resulting groups. If you want to compare group means, you should take an approach that does that, such as ANOVA (or, if failing normality assumptions, a Kruskal-Wallis ANOVA). If you want to compare ordering, you should take an approach that does that, such as regression (or, if failing normality assumptions, something that compares rank orderings, such as Spearman’s rho or nonparametric regression). All approaches assume you have the same scales across all of your population measurements, which in your case I believe means all scores should be meaningfully anchored between -1 and +1. You also can’t mix and match assumptions – your raters and techniques must all be on the same scale, with the same variance.

I don’t know what you mean by “assumed a violation.” Most violations are themselves testable. You need to figure out what shape your data are and then choose tests appropriate to those data and the assumptions that can or cannot be made. Then you run tests. Finding a result you didn’t expect and then changing your approach solely because you didn’t find what you wanted is not valid.

I’m doing book evaluation research using Bloom’s taxonomy with 4 analysts, including me as the researcher. I use the taxonomy (cognitive level 1 to 6) for every question we find in the book and categorize it into C1, 2, 3, 4, 5 or 6. After all the analysts have finished, I want to check if there is inter-rater reliability (i.e., whether all the analysts categorized all the questions reliably). What formula should I use to check this? Thank you.

I don’t know. You need to determine its scale of measurement first.

HOW CAN I INPUT COMBINE VARIANCES INTO AN ERROR TERM WITH CONSTANT VARIANCE TO REPLACE THE CONSTANT VARIANCE.

Dear Dr. Landers,

Thanks for your informative article. I’m hoping you can confirm I’m on the right track. In my study I have 60 subject dogs. Both the dog’s owner and the dog’s walker each filled out an established personality instrument that rated the dog on 5 dimensions. I’m interested in the inter-rater reliability of the dog walker and owner assessments. I’ve established that ICC(1) is appropriate because each rater only rates one target; however, I’m not sure about the average measure vs single measure. The instrument is a k-item list, so scores from the raters are an average, so I believe average is correct, but I’d like to confirm.

Thanks again.

It depends on why you want to know. If your goal is to compare assessments between the two sources, you shouldn’t be using ICC – I would use a Pearson’s r. If your goal is to interpret the means across the two sources or to use those means in other analyses, you want average measures: ICC(1,k).

Thank you so much for your response. I will be doing both Pearson’s r and ICC to analyze the data. I’ve been using this paper as a framework, which suggests doing both of those analyses.

Stolarova, M., Wolf, C., Rinker, T., & Brielmann, A. (2014). How to assess and compare inter-rater reliability, agreement and correlation of ratings: an exemplary analysis of mother-father and parent-teacher expressive vocabulary rating pairs. *Frontiers in Psychology*.

I am doing an interrater reliability study of the rubrics used for grading by instructors at the American College of Education. I set up 5 “mini-studies.” For each mini-study, 4 instructors are grading the assignment papers of 12 students. Two papers are graded for each of the 12. The rubric has six criteria and a total score, so I am examining reliability for the six criteria scores and the total score. For each criterion, scores from 1 to 5 are issued. In students’ regular classes, they will be graded by one instructor. Therefore, I think I am correct in using the ICC single measures coefficients as my reliability indicators. I’m interested in agreement but also whether or not students are being ranked the same across judges. So I’m running the ICCs for absolute agreement and consistency. The SPSS outputs also give me pairwise correlations among the graders and a mean correlation. I’m not sure if any of the other numbers I can get in the outputs are useful for interpretation. In addition to the ICCs, I’m computing by hand possible agreement versus actual. The maximum pairwise agreement among the four professors is six. I’m defining agreement as exact, within 1 point, within 2 points, and within 3 points for the total scores which are each 30 or less. I could share more detail but would just appreciate knowing if I’m correct in what I’m doing and if there is anything I should be doing that I’m not so far. Thank you for your help.

Single measures sounds right for your purposes, and the difference between consistency and agreement sounds theoretically interesting for you, so you’re all good there. If all you want to know about is reliability, the ICCs are the only numbers you need. Pairwise correlations among graders can be useful if you’re trying to determine why the ICCs are low, e.g., general disagreement or disagreement caused by one particular person.

While you certainly can calculate the extra agreement numbers the way you are describing, that information is contained within the ICC and is not really necessary. If you want to know if a particular rater is causing a problem, there are better ways to do that (such as the pairwise correlations, or also an ANOVA). But I can see why you might do something like that if you’re going to try to translate what you did for a non-statistician audience.
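As a rough illustration of the pairwise-correlation check mentioned above, here is a minimal Python sketch. The grader labels and scores are hypothetical, made up only to show the idea: compute every pairwise Pearson correlation, then look at each grader's mean correlation with the others to see whether one person stands out.

```python
from itertools import combinations
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: four graders each scored the same six papers (1 to 5).
scores = {
    "A": [1, 2, 3, 4, 5, 3],
    "B": [2, 2, 3, 5, 5, 3],
    "C": [1, 3, 3, 4, 4, 2],
    "D": [3, 5, 1, 2, 4, 4],
}

# Correlation for every pair of graders.
pairwise = {
    (a, b): pearson_r(scores[a], scores[b])
    for a, b in combinations(scores, 2)
}

# Mean correlation of each grader with all of the other graders.
mean_r = {
    g: sum(r for pair, r in pairwise.items() if g in pair) / (len(scores) - 1)
    for g in scores
}

# The grader with the lowest mean correlation is the first one to look at.
worst = min(mean_r, key=mean_r.get)
for g in sorted(mean_r, key=mean_r.get):
    print(f"grader {g}: mean r with others = {mean_r[g]:.2f}")
```

In this made-up data, one grader correlates negatively with everyone else, which would point to a single problem rater instead of general disagreement.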

One more quick question if you don’t mind. Can I average the single measures coefficients across the five mini-studies, and do I need to do a Fisher’s z transformation to do so? Thank you again.

An ICC is a reliability estimate, so it is a proportion – essentially on a scale from 0 to 100% – so you can just calculate an average of ICCs, if you need to know mean reliability for some reason.

Dear Dr Landers,

Thanks for your very informative article. Can you please confirm I’m on the right track? In my PhD study I have a total of 251 essays, which are rated by a single rater (myself), but a subsample of them (10) are double-rated by myself. How should I proceed to calculate intra-rater reliability? Which ICC is the most suitable for my case? Should I consider only the 10 essays which are double-rated or the whole sample of essays? I assume that the more essays I double check, the better, am I right?

I hope you can answer my doubts.

Thank you very much in advance.

That’s an unusual situation for ICC since you can’t really be considered a random sample of yourself, which is required of ICC. I suppose you could use ICC(1,1), if you were to assume that you are a totally unbiased source of grading, but there are still some other assumption violations involved. Specifically, you can’t really assume upon second grading that you were unaffected by the first time you graded them, which is a violation of the independence assumption. You really need a second grader using the same criteria you are using. If you don’t have that, I would probably use a Pearson’s correlation instead, since you have meaningful time 1 – time 2 pairs. But that’s not precisely a reliability estimate for your situation.

You can only calculate reliability if you have replication, so it is only calculable on the 10. Just like with study design, the more cases you have two ratings of, the more stable your estimate of ICC will be (i.e., a smaller confidence interval). N=10 is not going to be very stable; if you want to calculate reliability on a subset, I’d personally want at least 25% of my final set.

Thank you for clearing things up and for the opportunity to ask questions, Mr. Landers.

I have a question of my own and I would gladly appreciate if you can answer my question.

My research group developed an instrument that can test the knowledge of students of a certain physics topic. After the development, we asked three experts (i.e. physics professors of our university) to rate the developed instrument using a validation rubric. The validation rubric is a questionnaire where criteria are listed and the expert will rate whether the listed criteria is evident in the developed instrument or not through a Likert scale where 5 means that the criterion is highly evident in the instrument while 1 means that the criterion is not evident in the instrument.

The rubric listed six major groups of criteria which are Objectives, Content, Illustrations, Diagrams and Figures, Language, Usefulness, and Scoring/Assessment Method. The following link shows a chunk of the data that we have gathered [http://imgur.com/wK4RKJI].

Since the data is a Likert scale, do you think it is appropriate for us to use ICC for the reliability of their ratings? If yes, what model should we use? Also, should we get the ICC for each major group (e.g. one ICC for Objectives, one ICC for Content, etc.) or should we get one ICC for all? If we should get one ICC for each major group, then how can we determine the overall ICC?

Thank you very much, Mr. Landers. Again, your response will be a great help for our study.

1. That varies by field. If your field is comfortable calculating means on Likert-type scales, then yes. If they usually use non-parametric tests, then no.

2. If all 3 experts rated every case, probably either ICC(2,1) or ICC(2,k), depending upon your goals. The article above explains this decision.

3. One for each variable you’re going to use somewhere else or interpret.

As the individual is attempting to validate the content of the instrument, would this not be better handled using a content validity index? I don’t think you are looking for reliability of the ratings; you are looking for an overall rating of the validity of the tool.

I wouldn’t recommend that except in very limited circumstances. The CVI is a bit unusual. For the most part, it has only been commonly adopted in nursing research and a little bit in medicine more broadly, although it’s been introduced more broadly than that (including in my own field, industrial psychology). In most fields in the present day, content validity evidence is created through a process in which subject matter experts are consulted at multiple points, including to approve the final scale. In this context, the CVI is not really necessary except as a type of cross-validation (e.g., with different experts). In the OP’s case, even if such a statistic would be relevant, it would likely require a new data collection effort, and I don’t think it would provide much better evidence than what was already described… but it might still be worth looking into, assuming that the OP did not also need to introduce the CVI to their field (which is likely, outside of medicine).

Dear Dr Landers, thanks for providing this very helpful page. Could you please give an example for a case in which the raters are the population (and not just a sample)? Thank you a lot in advance!

Imagine you were running a triage unit at a hospital. Everyone that comes in needs to be triaged, but you’re worried that your triage nurse team (12 people) have different priorities when determining who needs care first. In essence, you want to know how reliable a single nurse is in making that judgment. So to assess this, you ask for pairs of nurses to provide judgments on each person that comes in. Since you’re only interested in the reliability of the 12 nurses you have now, you want ICC(3). Since you want to know how reliable one of those 12 nurses is when making judgments alone, you want ICC(3,1).

If you wanted to be able to generalize from the 12 nurses you have now to any nurse you might ever hire, you would calculate ICC(2,1), but you’d need to assume that your hiring process for future nurses would be the same as past nurses (i.e., ICC(2) requires an additional assumption beyond ICC(3), which is why reliabilities are usually a little lower with ICC(2)).
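To make the ICC(2,1) versus ICC(3,1) comparison concrete, here is a minimal Python sketch using the Shrout and Fleiss mean-squares formulas. It assumes a complete table in which every rater scored every target, which is a simplification of the triage example, and the ratings matrix is purely illustrative.

```python
def mean_squares(ratings):
    """Two-way ANOVA mean squares for a complete targets x raters table."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # between targets
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # between raters
    ss_err = ss_total - ss_rows - ss_cols                   # residual
    bms = ss_rows / (n - 1)
    jms = ss_cols / (k - 1)
    ems = ss_err / ((n - 1) * (k - 1))
    return n, k, bms, jms, ems

def icc_2_1(ratings):
    """ICC(2,1): raters treated as a random sample; rater mean differences count as error."""
    n, k, bms, jms, ems = mean_squares(ratings)
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)

def icc_3_1(ratings):
    """ICC(3,1): raters treated as the whole population of interest."""
    n, k, bms, jms, ems = mean_squares(ratings)
    return (bms - ems) / (bms + (k - 1) * ems)

# Illustrative data: 6 targets (rows) scored by 4 raters (columns).
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]

print(f"ICC(2,1) = {icc_2_1(ratings):.2f}")
print(f"ICC(3,1) = {icc_3_1(ratings):.2f}")
```

Because the raters in this illustrative table differ a lot in their average scores, ICC(2,1) comes out well below ICC(3,1); when rater means are similar, the two values sit much closer together.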

Another example: we’re a party of 6 ENT-surgeons aiming to validate a specific protocol on scoring endoscopic video recordings of people swallowing different foodstuffs. Since there are >50 patients to score (each patient takes about 20 min to score), we were interested to see how well any individual of our ‘peer’ group fared compared to the remaining judges, in order to allow a single judge to score several exams without being controlled by others… (so the judges were the population since nobody else would score the exams). Sadly, results were really bad, but thanks to Dr. Landers we found out! Question remains: how to interpret the single results (1 judge) versus the ‘population’… what constitutes a significant difference to not allow the single judge from scoring future exams…

Yeah, identifying particular raters that are “problems” is a question without a clear answer. You just have to decide how different is different enough, in terms of means, covariances, or both. You might even try an EFA – I think that would be most interesting, personally.

Dear Dr. Landers,

I am supervising a number of Assistant Psychologists responsible for assessing patients taking part in an RCT examining the efficacy of cognitive remediation therapy for schizophrenia. Two of the outcome measures require the APs to make clinical judgements about the severity of patients’ symptoms. The PANSS contains 30 items each rated on a 1-7 scale, and the CAINS has 13 items each rated on a 0-4 scale. I have trained the APs to use these tools and we have randomly selected approx 10% of patients from the RCT to assess our interrater reliability (IRR) on the PANSS and CAINS. For each case selected for IRR there have been two raters (myself and another AP). The purpose is to assess our reliability as raters.

Your advice is much appreciated.

Thanks

Dr. Danny O’Sullivan

Your design implies to me that you are interested in the consistency of ratings between you and the assistant psychologists. In that case, I would probably use a Pearson’s correlation, since you are consistently one case and the AP is consistently the other. If you want to assess the reliability of the scale in general, you appear to be violating the assumptions of all types of ICC; you should really be randomly choosing a pair of ratings from all available raters for every rating task. Or counterbalancing would work too. But having 1 rater consistent and the other rater inconsistent makes it impossible to parcel out the rater-contributed sources of variance, i.e., there’s no way to find out if your ratings are biased, which is a problem because you are 50% of every rating pair.

Having said that, you could still theoretically use ICC(1), but only if you _assume_ yourself to be equally as skilled as the APs. That may be a risky assumption.

Thank you so much for this very helpful article. I just finished a study comparing two methods of diagnosis for autism: a new procedure v. a gold-standard procedure. Both result in a 1 (= autism), or 0 (= not autism). I am comparing the dichotomy across methods (new v. old, 1 rater each) and across two raters (new, rater 1 and rater 2). I see that ICC is not appropriate for dichotomous variables, but is kappa the appropriate statistic for these analyses? I hear conflicting information. Thank you for your help!

Yes, ICC won’t work. Kappa is fine, although conservative. I usually report both proportion agreement and kappa in such cases.
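For readers who want to see what reporting both numbers involves, here is a minimal Python sketch of proportion agreement and Cohen’s kappa for two raters. The diagnoses are hypothetical, and the function names are my own for illustration.

```python
def proportion_agreement(a, b):
    """Share of cases on which the two raters gave the same code."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = proportion_agreement(a, b)
    categories = set(a) | set(b)
    # Chance agreement from each rater's own marginal proportions.
    p_e = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)  # undefined if p_e == 1 (no variability)

# Hypothetical diagnoses for ten children: 1 = autism, 0 = not autism.
rater_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
rater_2 = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0]

print(f"proportion agreement = {proportion_agreement(rater_1, rater_2):.2f}")
print(f"Cohen's kappa        = {cohens_kappa(rater_1, rater_2):.2f}")
```

Kappa is lower than raw agreement here because half of the observed agreement would be expected by chance alone, which is the sense in which kappa is conservative.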

I am doing a patient study that looks at the ICC of 2 observers (stays the same). In my study we make these 2 radiologists look at the MRI images of the same 30 patients.

In this study, since there are no repeat measurements, can we only use the single measurement? Or is there a place available for the average measurement?

I’m not quite sure what you mean. If you have both radiologists rating all 30 patients, you can calculate ICC. If you have each radiologist rating half of the patients, you can’t. I would recommend you follow the steps described above to figure out what kind.

Hey mike, thanks for the speedy response. I have 2 radiologists (don’t change), they each look at all 30 patients independently. None of them looks at the same patient twice, so all radiological imaging assessment takes place once. I know I will be using a two way mixed effect model.

I am just not sure if I can also use the average & single ICC. Since all the assessments were done once, can I even use the average ICC score?

I don’t think you understand the difference between average and single measures. The question you need to ask is if you want to know the reliability of a radiologist’s opinion (single) or if you want to know the reliability of the average of two radiologists’ opinions (average).

Maybe my knowledge in this field is lacking, but how can you use the average of two radiologists? Since there are no repeat measurements, and all measurements are done independently.

The single measurement (i.e. reliability of a radiologist’s opinion) is a relatively straightforward concept to grasp. However, I am in particular struggling with the “average measurement”.

I’m not sure where your confusion is coming from. You have radiologist 1, who made a rating on a patient. You have radiologist 2, who made a rating on the same patient. You add them together and divide by 2 to get an average rating for your two radiologists.

Thanks Mike! That is what I initially thought, but my professor was certain that this was not the case.

He envisioned that you have:

Radiologist 1 makes a rating on all 30 patients & then repeats the rating again.

Radiologist 2 likewise makes a rating on all 30 patients & then repeats the rating again.

You take the average of the repeats of radiologist 1 and the average of the repeats of radiologist 2. The ICC then compares both averages to produce the average measurement.

Since the radiologists rated all the patients once, he believed that only the single measurement was appropriate in this clinical study.

Hey mike,

I have pretty much explained it concisely?

Cheers

I’m not sure whom you mean by Mike, since you’re Mike, and I’m not.

If you made repeat ratings, you would more likely calculate a coefficient of equivalence and stability, since you believe your estimate to be unstable both over time and between raters. You would not use ICC in that case.

“Single measure” does not refer to the number of observations by each rater. It refers to the number of observations from independent sources of which you wish to determine the reliability. If you want to predict the reliability of a single person making a rating from multiple people making ratings, you want “single measure.” If you’re going to take the average of your raters and use them in other analyses, you want “average measures.”
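One way to see the single versus average distinction is that, under the usual Shrout and Fleiss formulas, the average-measures ICC is the single-measures ICC stepped up by the Spearman-Brown formula. A minimal Python sketch follows; the starting value of .60 is hypothetical.

```python
def average_measures_icc(single_icc, k):
    """Spearman-Brown step-up: reliability of the mean of k raters."""
    return k * single_icc / (1 + (k - 1) * single_icc)

single = 0.60  # hypothetical single-measures ICC for one radiologist
for k in (1, 2, 4):
    print(f"mean of {k} rater(s): ICC = {average_measures_icc(single, k):.2f}")
```

Nothing here involves repeated ratings by the same person: k counts independent raters whose scores are averaged for each patient.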

Hi. I am attempting to compute intercoder reliability for content analysis by 3 raters. The same 3 raters will be coding the data for presence of each of the 40 measures. In this case I should compute the intercoder reliability using ICC1. Please advise which statistics I should report as the intercoder reliability and what is an acceptable level. Thank you very much for your advice.

Please! Read the whole topic, including Q&A! I know it takes some time but it is very instructional! Your design topic and suggested answer just show you did not read through ANY of this post… so your questions are gratuitous! This kind of statistics is not a one-stop option, if you think so, please stay away! Happy analyzing in the correct way!

I would actually say that because you are coding “presence” of a measure, you are probably dealing with dichotomous (yes/no) data, in which case you should not use ICC at all – you probably need either simple percentage agreement or Fleiss’ kappa.

Thank you, Assoc Professor Landers, for understanding my problem and J for your advice.

Sorry that I have not elaborated well on the issues that I faced when computing the results of content analysis conducted by three coders.

The 3 coders coded data for presence, with no presence = 0 and presence = 1.

Though we have very high level of agreement for each of the 40 measures on 200 texts (pilot coding), I am getting very strange results after running ICC1 using SPSS. Could this be due to lack of variability in codes?

Hence I sought advice on whether I should compute the intercoder reliability using ICC1 and which statistics I should report as the intercoder reliability in such a situation.

Appreciate your kind guidance.

As I mentioned before, ICC is not appropriate here. You have nominal data, which does not meet the distributional requirements of ICC. You should calculate Fleiss’ kappa (or Krippendorff’s alpha).
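For anyone who wants to work through Fleiss’ kappa by hand, here is a minimal Python sketch, assuming every text was coded by the same number of coders. Both data sets are hypothetical; the second is there only to show what happens when almost every code is the same.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings holds one list per text with each coder's code."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]
    # Agreement within each text: share of coder pairs giving the same code.
    p_items = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall share of each code.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    if p_e == 1:
        return float("nan")  # every code identical: kappa is undefined
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical presence codes (1 = present, 0 = absent), 3 coders, 5 texts.
codes = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 0], [1, 0, 0]]
print(f"kappa = {fleiss_kappa(codes):.2f}")

# Hypothetical low-variability case: 10 texts, only one "present" code in total.
skewed = [[0, 0, 0]] * 9 + [[0, 0, 1]]
print(f"kappa with almost no variability = {fleiss_kappa(skewed):.2f}")
```

In the second data set, raw agreement is very high but kappa is slightly negative, because nearly all of that agreement is expected by chance when one code dominates. That is one plausible source of strange-looking results when codes barely vary.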

To accommodate all variations, please consider Krippendorff’s alpha! Prof. Landers, any ideas about this measure?

My understanding is that Krippendorff’s alpha is a general case of virtually all agreement statistics, including kappas, ICC, etc, for any number of raters and any amount of missing data. I’m not sure if there actually are any common _agreement_ statistics that can’t be expressed as a K’s alpha (importantly, alpha does not replace _consistency_ statistics). There are only two problems I am aware of with it. The first is that its flexibility also makes it quite complicated to calculate, and it has different approaches depending upon scale of measurement and a variety of other factors – that makes it hard to follow when looking at formulas (just compare the formula for Fleiss’ kappa with the formula for Krippendorff’s alpha for nominal three-rater measurement). The second is that it is entirely data-driven, which occasionally causes inaccuracy – so for example, if nominal ratings are made on 4 categories but no rater used them, kappa will capture that (correctly) whereas alpha will be biased downward. But that is only a problem when you happen to have data with that characteristic.

Dear Landers,

Thank you so much for this informative article.

I have a question for you. I have assessed 30 participants on two versions (Hindi and English) of the same 36-item scale (each item has a yes or no response). I need to do reliability analysis.

Can you please suggest how I can do it? Is ICC a suitable method for this?

ICC is almost certainly not appropriate, but it depends what you want to do with the means of those scales. You will most likely need to use structural equation modeling to establish measurement invariance across the two versions if you hope to compare their means directly. See “A Review and Synthesis of the Measurement Invariance Literature: Suggestions, Practices, and Recommendations for Organizational Research” by Vandenberg & Lance (2000).

Hello, thank you so much for this. I am completing a study at university where 7 assessors/judges have scored 8 videos with a colour category (one of four colours) – how is this possible in SPSS. What does the data type and measure need to be in order for this to work?? Thank you so much.

Kind regards,

Rebecca

I don’t know what you’re doing, so I don’t know. If you’re trying to determine the reliability of the color codes, you’ll want Fleiss’ kappa. I would use Excel and work through the formulas – SPSS won’t give that statistic without syntax, and SPSS syntax is clunky.

Hello Dr. Landers,

I found this site very helpful!

I just have one query. For a study I’m working on, I had three coders read a set of stories (N=82) and make judgements on 3 perceived characteristics of the writer for each story. Each coder made the same judgements about the same stories. However, one of the coders didn’t agree well at all with the other two, so I left them out temporarily.

Then, with the remaining two coders, I ran two-way random ICCs, which were .42 and .56 for two of the traits (ps<.01).

Does this indicate moderate reliability?

Is it acceptable to average across these two coders' scores for each variable?

Finally, I also read somewhere that it is not ideal to run ICCs with fewer than 5 coders – is this true?

Many thanks!

Helen

That is pretty poor reliability. Assuming you’re talking about ICC(2,2), that indicates that only 50% of the variance observed in your mean rating comes from “true” variance. You generally want at least .7, preferably closer to .9.

It depends on what you mean by “ideal.” Having a smaller number of coders, assuming all coders are drawn from the same population, will result in lower reliability. So if your ICC(#,1) is already low, that means you’ll need even more raters to make up for the limited information provided by each rater individually.
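As a rough guide to how many raters that might mean, the Spearman-Brown formula can be run in reverse. The sketch below is my own illustration and assumes the reported values of .42 and .56 are average-measures ICCs for two coders, with .7 as the target.

```python
from math import ceil

def single_from_average(icc_avg, k):
    """Back out the single-rater ICC from the ICC of the mean of k raters."""
    return icc_avg / (k - (k - 1) * icc_avg)

def raters_needed(icc_single, target):
    """Smallest number of raters whose mean rating reaches the target ICC."""
    return ceil(target * (1 - icc_single) / (icc_single * (1 - target)))

for icc_avg in (0.42, 0.56):
    single = single_from_average(icc_avg, k=2)
    k_needed = raters_needed(single, target=0.70)
    print(f"ICC(2,2) = {icc_avg:.2f} -> ICC(2,1) = {single:.2f}, "
          f"raters needed for .70: {k_needed}")
```

This projection assumes any added coders come from the same population as the current ones and are about as reliable, so treat the result as a planning estimate only.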

Hi Richard,

I echo all of the previous sentiments regarding your very user-friendly explanation about ICCs. I have used your guide a number of times now. Thank you very much for taking the time to write and post this. In the most recent instance of running these stats, however, I have run into a bit of an anomaly. Perhaps you could help?

We had a number of raters go into the field together and rate the same case using an established scale, so that we could determine the extent to which each rater was reliable against an established rater (before we send them out on their own). To evaluate this, I ran a two-way mixed ICC looking at absolute agreement, adding ratings of the established rater and one trainee as variables (so as to evaluate how consistent the trainee was with this established rater). I then repeated this for each trainee rater. We got good levels of reliability, however one rater with higher levels of agreement (59% absolute agreement across rated items, 14% of ratings out by 2 points from the established rater) actually had a lower ICC (.734) than someone with lower levels of agreement (38% absolute agreement, with 24% of ratings out by 2 points from the established rater…yielding an ICC of .761).

I am not sure why this should be, and if I am perhaps doing something wrong. Can you think of an explanation that could account for this? The data is technically ordinal, but it is a 7-point scale and absolute agreement is not often at very high rates. Your advice would be much appreciated.

Sincerely,

Steven

If you always have a “gold standard” rater, that means you have meaningful pairs – so in that case, it sounds like you should potentially be using a Pearson’s correlation or possibly Spearman’s rho. ICC provides the reliability of your raters _in general_, so you are capturing information about both your “gold standard” rater and your experimental rater with each ICC, which is a contaminated measure of what you seem to want to know (i.e., the reliability of the “new” rater).

As to why you might have seen the pattern you saw, it might be helpful to remember that ICC is a close cousin of