Clustered Standard Errors in R

December 13, 2016 AV 60 Comments

The easiest way to compute clustered standard errors in R is the modified summary(). I added an additional parameter, called cluster, to the conventional summary() function. This parameter allows to specify a variable that defines the group / cluster in your data. The summary output will return clustered standard errors. Here is the syntax:

summary(lm.object, cluster=c("variable"))

Furthermore, I uploaded the function to a github.com repository. This makes it easy to load the function into your R session. The following lines of code import the function into your R session. You can also download the function directly from this post yourself.

# load necessary packages for importing the function
library(RCurl)

# import the function from repository
url_robust <- "https://raw.githubusercontent.com/IsidoreBeautrelet/economictheoryblog/master/robust_summary.R"
eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)),
     envir=.GlobalEnv)

Below you will find a tutorial that demonstrates how to import the modified summary() function into you R session. It also explains the application of the function in greater detail. The tutorial carries out an OLS estimation in R that is based on an fake data that I generate here and which you can download here.

# start with an empty workspace
rm(list=ls())

# load necessary packages for importing the function
library(RCurl)
# load necessary packages for the example
library(gdata) 
library(zoo)

# import the function
url_robust <- "https://raw.githubusercontent.com/IsidoreBeautrelet/economictheoryblog/master/robust_summary.R"
eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)),
     envir=.GlobalEnv)


# download data set for example
url_data <- "https://economictheoryblog.com/wp-content/uploads/2016/12/data.xls"
data <- read.xls(gsub("s:",":",url_data))


# estimate simple linear model
reg <- lm(id_score ~ class_size, 
             data=data)

# use new summary function
summary(reg)
summary(reg,cluster = c("class_id"))

# Note, the standard errors are much 
# larger with clustered standard errors

One can also easily include the obtained clustered standard errors in stargazer and create perfectly formatted tex or html tables. This post describes how one can achieve it.

60 thoughts on “Clustered Standard Errors in R”

Pingback: Clustered Standard Errors | Economic Theory Blog
Pingback: Example data – Clustered Standard Errors | Economic Theory Blog
Ranaivo says:

May 4, 2017 at 12:03 pm

Will this function work with two clustering variables? Something like: summary(lm.object, cluster=c(“variable1”, “variable2”))?

Reply
1. ad says:
  
  May 4, 2017 at 1:46 pm
  
  Yes. This function allows two clustering variables. Cheers.
  
  Reply
  1. Ranaivo says:
    
    May 5, 2017 at 5:12 am
    
    Thank you. One more question: is the function specific to linear models? Or can it work for generalized linear model like logistic regression or other non-linear models?
  2. ad says:
    
    May 5, 2017 at 6:40 am
    
    Currently, the function only works with the lm class in R. I am working on generalizing the function. However, it will still take some time until a general version of the function is available, I suppose.
Ranaivo says:

May 5, 2017 at 7:41 am

Thank you so much. I tried the function and it worked well with a single clustering variable. But it gives an error with two clustering variables. Any clues? Here is what I have done:

> SITE URLdata VarNames test fm url_robust eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)), envir=.GlobalEnv)

# one clustering variable “firmid”
> summary(fm, cluster=c(“firmid”))

Call:
lm(formula = y ~ x, data = test)

Residuals:
Min 1Q Median 3Q Max
-6.7611 -1.3680 -0.0166 1.3387 8.6779

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.02968 0.06701 0.443 0.658
x 1.03483 0.05060 20.453 <2e-16 ***
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.005 on 4998 degrees of freedom
Multiple R-squared: 0.2078, Adjusted R-squared: 0.2076
F-statistic: 418.3 on 1 and 499 DF, p-value: summary(fm, cluster=c(“year”))

Call:
lm(formula = y ~ x, data = test)

Residuals:
Min 1Q Median 3Q Max
-6.7611 -1.3680 -0.0166 1.3387 8.6779

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.02968 0.02339 1.269 0.204
x 1.03483 0.03339 30.993 summary(fm, cluster=c(“firmid”, “year”))
Error in summary.lm(fm, cluster = c(“firmid”, “year”)) :
object ‘M’ not found

Reply
1. ad says:
  
  May 5, 2017 at 8:26 am
  
  You are right. There was a bug in the code. I fixed it. I guess it should work now. However, you should be careful now with interpreting the F-Statistic. I am not sure if I took the right amount of degrees of freedom. The rest of the output should be fine.
  
  Besides the coding, from you code I see that you are working with non-nested clusters. I cannot remember from the top of my head. But should you not be careful with such a structure?
  
  Reply
  1. Pankaj Jindal says:
    
    May 7, 2017 at 6:06 pm
    
    Hi,
    
    Thanks a lot first of all for putting in so much effort to write this function.
    
    I am getting an error for twoway clustering. This is the error I get:
    Error in if (nrow(dat)
    
    Is this error specific to me?
    
    Thanks,
    Pankaj
  2. ad says:
    
    May 8, 2017 at 5:36 am
    
    Hey. Thank you for comment. There was a bug in the code. Basically, not all of your observations have a cluster, i.e. one or both of your cluster variables contain NA’s. The problem was that I did not set-up the warning properly. That is, the warning only worked for the single clustering case, but did not work for twoway clustering. I fixed it and now it should work. Let me know if you encounter any other problems. Cheers.
  3. Pankaj Jindal says:
    
    May 7, 2017 at 6:07 pm
    
    The error didn’t paste properly in the previous comment.
    
    Error in if (nrow(dat)
Ranaivo says:

May 5, 2017 at 7:45 am

I am sorry my comment above is a bit of a mess. It changed when I posted it. But basically when I use two clustering variables [e.g., summary(fm, cluster=c(“firmid”, “year”))], I get the error message: “Error in summary.lm(fm, cluster = c(“firmid”, “year”)) :
object ‘M’ not found”

Reply
1. ad says:
  
  May 5, 2017 at 8:27 am
  
  No worries, in my browser it appears quite clear.
  
  Reply
Ranaivo says:

May 9, 2017 at 4:24 am

Thank you for your response and your great function. It really helps.

Reply
1. ad says:
  
  May 9, 2017 at 5:40 am
  
  I’m glad it helps.
  
  Reply
ct says:

June 12, 2017 at 4:54 pm

Hi! Is there any way to use this code when using weights in your lm model? I get an error telling me that my weights are not recognized : “Error in get(all.vars(object$call)[length(all.vars(object$call))]) : objet ‘yeardif’ introuvable”
Is there anything I can do?

Many thanks for your help!

Reply
1. ad says:
  
  June 13, 2017 at 7:53 am
  
  Hi! Thank you so much for you comment. There was a problem when extracting the data object from the formula when weights were specified. I was able to fix the problem and now it should work fine. Thank you again for your help.
  
  Reply
  1. ct says:
    
    June 13, 2017 at 12:46 pm
    
    Hi again!
    
    Thank you very much for your reply! I tried again, and now I only get NAs in the Standard error, t-value, and p value column, even though I have no missing values in my data… I don’t get it! Do you know what might be going on?
    
    Thanks again!
  2. ad says:
    
    June 13, 2017 at 2:03 pm
    
    Hi! I conducted some additional robustness tests and everything works fine for me. Can you provide a reproducible example?
    
    Cheers!
2. ad says:
  
  January 10, 2018 at 3:34 pm
  
  Hey! Sorry to come back to you after all this time. But I wonder, were you ever able to solve your problem with the function?
  Best, ad.
  
  Reply
3. Thomas says:
  
  November 7, 2018 at 3:25 pm
  
  This is an incredibly helpful function. Thank you! My query is also regarding the use of survey weights. When using survey weights, i get no error warning, but the SEs do not appear to be clustered: they are identical to the unclustered……
  
  Reply
  1. ad says:
    
    November 7, 2018 at 3:32 pm
    
    Thanks a lot. How exactly do you specify the weights? Are you using the weight option of lm? Is there any way to provide a reproducible example?
Ricky says:

January 10, 2018 at 3:08 pm

Hi!
Thanks for the function.
I think I am getting the same problem as ct. Even reproducing the example you provide I get a bunch of NAs.
Is it only me?

Many thanks!

Reply
1. ad says:
  
  January 10, 2018 at 3:27 pm
  
  Hey!
  
  Thank you for you remark. I tried the example and it works fine for me. Could you provide some more details? Do you have the package “sandwich” installed?
  
  Cheers!
  
  Reply
Ricky says:

January 10, 2018 at 7:57 pm

Below a printout of my console.
Maybe I am missing some packages.
Btw, sorry for taking up so much space.
And apologies for I am new to R and probably this is why I am not seeing the obvious.

Reply
1. ad says:
  
  January 11, 2018 at 6:53 am
  
  Thank you for the printout. I now removed it from your comment. Unfortunately, I still cannot find the error. I tried the example with the newest R Version (3.4.3) and went to a completely different PC, in both cases the example worked fine. Can you check if you have the sandwich package installed? Furthermore, I noticed that you download the data differently – not that this should matter – but did the gdata package not work for you? Finally, you might have some packages loaded in your memory that mask other functions. Could you restart R and only run my example? Best, ad.
  
  Reply
Emi says:

January 15, 2018 at 4:04 pm

Hi! Thanks so much for making this available. I am a newbie to R, and I am having some trouble making the modified summary() function work. Although the example you provide in the short tutorial above worked smoothly, I tried to use it with a toy example of mine and I got the error message

“Error in summary.lm(mod, cluster = c(i)) :
The function only allows max. 2 clusters. You provided more.”

Here is a reproducible example (I realize that since each cluster is a singleton, clustering should be irrelevant for the calculation of standard errors; but I don’t see why that should make the function return an error message):

rm(list=ls())
library(RCurl)
url_robust <- "https://raw.githubusercontent.com/IsidoreBeautrelet/economictheoryblog/master/robust_summary.R"
eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)), envir=.GlobalEnv)

i <- seq(1,100,1)
x <- rnorm(100)
y <- 1 + 2*x + rnorm(100)
simpledata <- as.data.frame(cbind(i,x,y))
mod <- lm(y~x, data = simpledata)
summary(mod, cluster = c(i))

Thanks a lot!

Reply
1. ad says:
  
  January 15, 2018 at 6:25 pm
  
  Hi! Thank you for you remark. I am glad to hear that you are using my function. The reason that your example does not work properly has actually nothing to do with the cluster function, but is caused by a small syntax error. Try to put the variable i in last line of you code, i.e. summary(mod, cluster = c(i)), in parentheses such that it looks like this “i”. Your example should work fine then. Let me know if it works. Also, just get in touch in case you encounter any other problems. Cheers
  
  Reply
  1. Emi says:
    
    January 16, 2018 at 3:00 pm
    
    Thanks a lot for the quick reply! It worked perfectly.
cvaccari says:

March 1, 2018 at 8:47 am

Hello, many thanks for creating this useful function. I read in the comments above that you are working to extend it so it works for the the glm family, and let me just add that I would be really, really glad to see it implemented for the glm.nb (negative binomial regression) command. Thanks again for your work!

Reply
1. ad says:
  
  March 5, 2018 at 9:50 am
  
  Thank you for your comment. I’ll try my best. Currently, I am working on a different project. Hence, it will take longer than expected 🙂 Cheers
  
  Reply
Max Pger says:

March 1, 2018 at 10:21 am

Hello ad, thx a lot for this function! I had the same issue than ct and Ricky and after examining the code, I realized that it came from the cluster object. I modified the function accordingly, and it works like a charm :

cluster <- dat[,cluster] #Max P : since dat is a df, cluster will also be a df
require(sandwich, quietly = TRUE)
M <- res_length <- length(unique(cluster[[1]])) #Max P : instead of length(unique(cluster)) , =1
N <- length(cluster[[1]]) #Max P : instead of length(cluster),=1 since cluster is a df

The same modifications should work for the 2 clusters case.

Thank you again for sharing your R thoughts and functions!

Best,
MP

Reply
1. ad says:
  
  March 21, 2018 at 2:39 pm
  
  Hi Max
  
  Thank you for reaching out. The solution that you proposed does not to work properly. The object cluster does contain all possible clusters and you interested in the unique clusters. Using cluster[[1]] you select only the first element of the date.frame.
  
  I did now change the function a little. Maybe this helps to get rid of the NA problem. As I am not able to reproduce this problem, I find it incredibly hard to tackle it. Could you by any chance provide a reproducible example?
  
  Best, ad
  
  Reply
tb says:

March 13, 2018 at 2:48 pm

Hi and thanks for the amazing work!
I think I’ve done everything right, but I’m getting NA’s for Std. error, t value and Pr(>|t|). Do you have any solutions for this?

Reply
1. ad says:
  
  March 21, 2018 at 2:41 pm
  
  Hi tb
  
  Thank you for you comment. Unfortunately, I am not able to reproduce t the NA problem. Can you, by any chance, provide a reproducible example? That would help a lot! Best, ad
  
  Reply
Martin Fernandez says:

March 19, 2018 at 9:12 pm

Hi,
First of all, thank you so much for this fantastic function! It’s been very helpful for my research.
I would like to tell you about a problem I am having when using the clustered robust standard errors while changing regressors in a loop. I will illustrate it with an example:

# Here some sample data
Y <- c(1, 3, 2, 0, 5, 6)
X <- c(2, 4, 3, 2, 10, 8)
ID <- c(0, 0, 0, 1, 1, 1)
dat <- data.frame(Y, X, ID)
# Here some controls which are "outside" the dataset:
C1 <- c(1, 2, 3, 4, 5, 6)
C2 <- c(6, 4, 2, 8, 0, 13)
C <- matrix(NA, 6, 2)
C[ , 1:2] <- t(c(C1, C2))
# A matrix to store the standard errors:
R <- matrix(NA, 2, 1)
# Now I do a loop to regress Y on X adding the controls sequentially and storing s.e.:
for(i in 1:2){
reg <- summary(lm(data=dat, Y ~ X + C[, i]))
R[i,1] <- reg$coefficients[3,2]
}
R
# This produces the following output:
# [,1]
# [1,] 0.4255123
# [2,] 0.1015860

# However, the loop does not work when using the clustered s.e.:
for(i in 1:2){
reg <- summary(lm(data=dat, Y ~ X + C[, i]), cluster=c("ID"))
R[i,1] <- reg$coefficients[2,2]
}
R
# Error in get(paste(object$call$data)) : invalid first argument
# Called from: get(paste(object$call$data))
##

Any idea of why this is happening or how it can be solved?

Thanks a lot in advance.
Martin

Reply
1. ad says:
  
  April 11, 2018 at 5:10 am
  
  Hi Martin
  
  Sorry for my late reply. The problem arises from your loop and is not directly related to the function. An easy way to solve the problem is to estimate each regression separately. I don’t know if this is a practicable solution in your case. Otherwise you could check out alternative ways to estimate clustered standard errors in R.
  
  Best
  ad
  
  Reply
Jen says:

April 3, 2018 at 6:27 pm

How can I cite your function? Is there an official means/way to do so or should I cite the blog?

Reply
1. ad says:
  
  April 11, 2018 at 5:04 am
  
  Hi Jen
  
  You can cite the function as follows
  
  Dibiasi, A. (2016). Clustered Standard Errors in R [Blog post]. Retrieved from https://economictheoryblog.com/2016/12/13/clustered-standard-errors-in-r/
  
  Thank you!
  ad
  
  Reply
Micaela Chavez says:

May 16, 2018 at 1:17 pm

Hey

Thanks so much for the code.
I was wondering if there is a possibility to get the results in a nice table, like with stargazer or something like that.

Thanks 🙂

Reply
1. ad says:
  
  May 18, 2018 at 4:25 am
  
  Hey
  
  Thank you for your comment. Yes, you can do that. Save you summary output and recover the coefficients. Now you can add them to Stargazer. For instance,
  
  summary_save <- summary(reg,cluster = c("class_id"))
  clustered_errors <- as.vector(summary_save$coefficients[,c("Std. Error")])
  
  Hope this helps.
  Cheers, ad
  
  Reply
2. ad says:
  
  May 18, 2018 at 7:11 pm
  
  Hi again,
  
  I prepared a short post that explains how one can obtain nice tables in stargazer with clustered standard errors.
  
  Hope it will help you.
  Kindly, ad
  
  Reply
3. Micaela Chavez says:
  
  May 20, 2018 at 2:11 pm
  
  thank you very much
  I will try this imediatly 🙂
  
  Reply
  1. ad says:
    
    May 20, 2018 at 3:10 pm
    
    Hope it works fine for you!
Pingback: Cluster Robust Standard Errors in Stargazer | Economic Theory Blog
Harun says:

May 19, 2018 at 12:48 pm

Hello, first of all thank you for making all this effort but I get an error when I try to use your function add on:

Error in get(paste(object$call$data))[, c(n_coef, cluster)] :
object of type ‘closure’ is not subsettable
Called from: na.omit(get(paste(object$call$data))[, c(n_coef, cluster)])

Do you know where the problem is?

Best Wishes

Reply
1. ad says:
  
  May 19, 2018 at 2:09 pm
  
  Hi,
  
  Thank you for your comment. This error message arises if we try to index a function. However, without knowing your specific case it is a little difficult to evaluate where the error is caused. Could you provide a reproducible example–a short R code that produces the same error? That will allow me to check where the error is coming from.
  
  Best, ad
  
  Reply
  1. Harun says:
    
    May 20, 2018 at 12:58 pm
    
    Since I can’t provide you the .csv file, imagine something like this:
    
    setwd(“~/R/folder”)
    library(RCurl)
    House1 <- read.csv("House.csv")
    attach(House1 )
    reg1 <- lm(equi ~ dummy + interactions + controls,
    data=subset(House1, money< 100 & debt == 0))
    url_robust <- "https://raw.githubusercontent.com/IsidoreBeautrelet/economictheoryblog/master/robust_summary.R"
    eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)),
    envir=.GlobalEnv)
    
    summary(reg1,cluster = c("PLZ"))
    
    I don't have anything "fancy" installed like perl or something else
    
    Best Wishes
  2. ad says:
    
    May 20, 2018 at 3:16 pm
    
    Hi again,
    
    Thanks a lot for the example. It looks fine to me. The only potential problem that I could detect is that you subset the data within the lm() function. Could you try to subset the data before running your regression. Something like this:
    
    df=subset(House1, money< 100 & debt == 0)
    reg1 <- lm(equi ~ dummy + interactions + controls, data=df)
    
    Let me know if that solves the problem.
    
    Best, ad
  3. Harun says:
    
    May 20, 2018 at 5:29 pm
    
    Hello,
    
    Yes, this solved my problem! Thank you 🙂
    
    Best Wishes
  4. ad says:
    
    May 20, 2018 at 6:08 pm
    
    You’re welcome!
    
    Cheers, ad
MP says:

September 3, 2018 at 5:41 pm

Hello,
Thank you very much for writing this function. I was just stumbling across a potential problem. It seems that your function computes the p value corresponding to the normal distribution (or corresponding to the t distribution with degrees of freedom depending on the number of observations). In Stata, however, I get the same t statistics but different p-values. It seems to be the case that Stata uses the t distribtuion where degrees of freedom depend on the number of clusters rather than on the number of observations! I am quite new to R and also to statistics, could you shed some light on which approach should be used and why?
Best regards!

Reply
1. ad says:
  
  September 7, 2018 at 9:22 am
  
  Hi! This is actually a good point. When having clusters you converge over the number of clusters and not over the number of total observations. Hence, I should adapt the function accordingly. Thank you for that.
  Best, ad
  
  Reply
Nilima Pisharody says:

December 2, 2018 at 3:35 pm

Hi, I am super new to R (like 2 months now) and I’m trying to sort of learn it by myself. And I came across this code and I was happy for it, but I am facing some troubles making it work.

I will try to explain it as simply as I can (because it sounds complicated in my head)

Problem: I don’t have variables for which I want to find correlations hanging around in my global environment. I have data-frames. The particular one I am using now for the regression is called regdata. I am modeling my lm regression like this. The size of the dataframe is 160 x 9, 160 rows and 9 columns.

result 2″ to an “invalid object”. I have tried all of the following and nothing works

summary(result, cluster = c (regdata$x3))
summary(result, cluster = c (“regdata$x3”))
summary(result, cluster = c (x3))
summary(result, cluster = c (“x3”))
summary(result, cluster = c (160, regdata$x3))

In this instance, x1, x2, x3 are all categorical variables with

x1 ranging from 1 to 5
x2 has 3 values 0,1,2
x3 has 4 values ranging from 1 to 4

Can I not cluster if the number of clusters in more than 2?

Reply
1. ad says:
  
  December 5, 2018 at 10:00 am
  
  Hi, thank you for the comment. Your fourth example is the way is should work, i.e. you pass on the variable name to function. Do you pass on the DataFrame in your regression? Unfortunately, the information you give does not provide sufficient information in order for me to really help you. Could you provide a reproducible example? Or at least state the error message? Best, ad
  
  Reply
JAN says:

April 9, 2021 at 3:43 pm

Hi,

I’m relatively new to regression analyses and am currently writing a thesis on the effect of certain ESG date on costs of equity (COE). A prior study that I’m looking at (https://www.sciencedirect.com/science/article/abs/pii/S037842661100081) is using pooled cross-sectional time-series regressions with robust standard errors clustered at the firm level.

I wish to cluster the standard errors at firm level as well, but I don’t understand how.

The model looks somewhat like this:

model1 <- lm( COE ~ ESG1 + ESG2 + ESG3 + controls)

What am i supposed to use as "VARIABLE" in:
"summary.lm(model1, cluster= c("VARIABLE") ?

Regards, JAN

Reply
1. ad says:
  
  September 16, 2021 at 6:08 am
  
  Hey Jan! “VARIABLE” should be the name of the Variable you want to cluster you errors on.
  
  Reply
Dustin says:

June 5, 2021 at 1:09 am

Hi,
Thank you very much for writing and sharing this function. It is amazingly helpful. I realized that the function generates an error, specifically “Error in varcovar[pos_coef, pos_coef] : subscript out of bounds”. If a variable is dropped from the regression. I ran a regression on a subgroup in which there was no variation in a binary variable. It took me some time to figure out what the problem was. I’m commenting mostly to help someone with a error in the future.

Reply
1. ad says:
  
  September 16, 2021 at 6:07 am
  
  Cheers!
  
  Reply