The easiest way to compute clustered standard errors in R is the modified summary()
. I added an additional parameter, called cluster,
to the conventional summary()
function. This parameter allows to specify a variable that defines the group / cluster in your data. The summary output will return clustered standard errors. Here is the syntax:
summary(lm.object, cluster=c("variable"))
Furthermore, I uploaded the function to a github.com repository. This makes it easy to load the function into your R session. The following lines of code import the function into your R session. You can also download the function directly from this post yourself.
# load necessary packages for importing the function library(RCurl) # import the function from repository url_robust <- "https://raw.githubusercontent.com/IsidoreBeautrelet/economictheoryblog/master/robust_summary.R" eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)), envir=.GlobalEnv)
Below you will find a tutorial that demonstrates how to import the modified summary()
function into you R session. It also explains the application of the function in greater detail. The tutorial carries out an OLS estimation in R that is based on an fake data that I generate here and which you can download here.
# start with an empty workspace rm(list=ls()) # load necessary packages for importing the function library(RCurl) # load necessary packages for the example library(gdata) library(zoo) # import the function url_robust <- "https://raw.githubusercontent.com/IsidoreBeautrelet/economictheoryblog/master/robust_summary.R" eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)), envir=.GlobalEnv) # download data set for example url_data <- "https://economictheoryblog.files.wordpress.com/2016/12/data.xls" data <- read.xls(gsub("s:",":",url_data)) # estimate simple linear model reg <- lm(id_score ~ class_size, data=data) # use new summary function summary(reg) summary(reg,cluster = c("class_id")) # Note, the standard errors are much # larger with clustered standard errors
One can also easily include the obtained clustered standard errors in stargazer and create perfectly formatted tex or html tables. This post describes how one can achieve it.
Will this function work with two clustering variables? Something like: summary(lm.object, cluster=c(“variable1”, “variable2”))?
Yes. This function allows two clustering variables. Cheers.
Thank you. One more question: is the function specific to linear models? Or can it work for generalized linear model like logistic regression or other non-linear models?
Currently, the function only works with the lm class in R. I am working on generalizing the function. However, it will still take some time until a general version of the function is available, I suppose.
Thank you so much. I tried the function and it worked well with a single clustering variable. But it gives an error with two clustering variables. Any clues? Here is what I have done:
> SITE URLdata VarNames test fm url_robust eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)), envir=.GlobalEnv)
# one clustering variable “firmid”
> summary(fm, cluster=c(“firmid”))
Call:
lm(formula = y ~ x, data = test)
Residuals:
Min 1Q Median 3Q Max
-6.7611 -1.3680 -0.0166 1.3387 8.6779
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.02968 0.06701 0.443 0.658
x 1.03483 0.05060 20.453 <2e-16 ***
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.005 on 4998 degrees of freedom
Multiple R-squared: 0.2078, Adjusted R-squared: 0.2076
F-statistic: 418.3 on 1 and 499 DF, p-value: summary(fm, cluster=c(“year”))
Call:
lm(formula = y ~ x, data = test)
Residuals:
Min 1Q Median 3Q Max
-6.7611 -1.3680 -0.0166 1.3387 8.6779
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.02968 0.02339 1.269 0.204
x 1.03483 0.03339 30.993 summary(fm, cluster=c(“firmid”, “year”))
Error in summary.lm(fm, cluster = c(“firmid”, “year”)) :
object ‘M’ not found
You are right. There was a bug in the code. I fixed it. I guess it should work now. However, you should be careful now with interpreting the F-Statistic. I am not sure if I took the right amount of degrees of freedom. The rest of the output should be fine.
Besides the coding, from you code I see that you are working with non-nested clusters. I cannot remember from the top of my head. But should you not be careful with such a structure?
Hi,
Thanks a lot first of all for putting in so much effort to write this function.
I am getting an error for twoway clustering. This is the error I get:
Error in if (nrow(dat)
Is this error specific to me?
Thanks,
Pankaj
Hey. Thank you for comment. There was a bug in the code. Basically, not all of your observations have a cluster, i.e. one or both of your cluster variables contain NA’s. The problem was that I did not set-up the warning properly. That is, the warning only worked for the single clustering case, but did not work for twoway clustering. I fixed it and now it should work. Let me know if you encounter any other problems. Cheers.
The error didn’t paste properly in the previous comment.
Error in if (nrow(dat)
I am sorry my comment above is a bit of a mess. It changed when I posted it. But basically when I use two clustering variables [e.g., summary(fm, cluster=c(“firmid”, “year”))], I get the error message: “Error in summary.lm(fm, cluster = c(“firmid”, “year”)) :
object ‘M’ not found”
No worries, in my browser it appears quite clear.
Thank you for your response and your great function. It really helps.
I’m glad it helps.
Hi! Is there any way to use this code when using weights in your lm model? I get an error telling me that my weights are not recognized : “Error in get(all.vars(object$call)[length(all.vars(object$call))]) : objet ‘yeardif’ introuvable”
Is there anything I can do?
Many thanks for your help!
Hi! Thank you so much for you comment. There was a problem when extracting the data object from the formula when weights were specified. I was able to fix the problem and now it should work fine. Thank you again for your help.
Hi again!
Thank you very much for your reply! I tried again, and now I only get NAs in the Standard error, t-value, and p value column, even though I have no missing values in my data… I don’t get it! Do you know what might be going on?
Thanks again!
Hi! I conducted some additional robustness tests and everything works fine for me. Can you provide a reproducible example?
Cheers!
Hey! Sorry to come back to you after all this time. But I wonder, were you ever able to solve your problem with the function?
Best, ad.
This is an incredibly helpful function. Thank you! My query is also regarding the use of survey weights. When using survey weights, i get no error warning, but the SEs do not appear to be clustered: they are identical to the unclustered……
Thanks a lot. How exactly do you specify the weights? Are you using the weight option of lm? Is there any way to provide a reproducible example?
Hi!
Thanks for the function.
I think I am getting the same problem as ct. Even reproducing the example you provide I get a bunch of NAs.
Is it only me?
Many thanks!
Hey!
Thank you for you remark. I tried the example and it works fine for me. Could you provide some more details? Do you have the package “sandwich” installed?
Cheers!
Below a printout of my console.
Maybe I am missing some packages.
Btw, sorry for taking up so much space.
And apologies for I am new to R and probably this is why I am not seeing the obvious.
Thank you for the printout. I now removed it from your comment. Unfortunately, I still cannot find the error. I tried the example with the newest R Version (3.4.3) and went to a completely different PC, in both cases the example worked fine. Can you check if you have the sandwich package installed? Furthermore, I noticed that you download the data differently – not that this should matter – but did the gdata package not work for you? Finally, you might have some packages loaded in your memory that mask other functions. Could you restart R and only run my example? Best, ad.
Hi! Thanks so much for making this available. I am a newbie to R, and I am having some trouble making the modified summary() function work. Although the example you provide in the short tutorial above worked smoothly, I tried to use it with a toy example of mine and I got the error message
“Error in summary.lm(mod, cluster = c(i)) :
The function only allows max. 2 clusters. You provided more.”
Here is a reproducible example (I realize that since each cluster is a singleton, clustering should be irrelevant for the calculation of standard errors; but I don’t see why that should make the function return an error message):
rm(list=ls())
library(RCurl)
url_robust <- "https://raw.githubusercontent.com/IsidoreBeautrelet/economictheoryblog/master/robust_summary.R"
eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)), envir=.GlobalEnv)
i <- seq(1,100,1)
x <- rnorm(100)
y <- 1 + 2*x + rnorm(100)
simpledata <- as.data.frame(cbind(i,x,y))
mod <- lm(y~x, data = simpledata)
summary(mod, cluster = c(i))
Thanks a lot!
Hi! Thank you for you remark. I am glad to hear that you are using my function. The reason that your example does not work properly has actually nothing to do with the cluster function, but is caused by a small syntax error. Try to put the variable i in last line of you code, i.e. summary(mod, cluster = c(i)), in parentheses such that it looks like this “i”. Your example should work fine then. Let me know if it works. Also, just get in touch in case you encounter any other problems. Cheers
Thanks a lot for the quick reply! It worked perfectly.
Hello, many thanks for creating this useful function. I read in the comments above that you are working to extend it so it works for the the glm family, and let me just add that I would be really, really glad to see it implemented for the glm.nb (negative binomial regression) command. Thanks again for your work!
Thank you for your comment. I’ll try my best. Currently, I am working on a different project. Hence, it will take longer than expected 🙂 Cheers
Hello ad, thx a lot for this function! I had the same issue than ct and Ricky and after examining the code, I realized that it came from the cluster object. I modified the function accordingly, and it works like a charm :
cluster <- dat[,cluster] #Max P : since dat is a df, cluster will also be a df
require(sandwich, quietly = TRUE)
M <- res_length <- length(unique(cluster[[1]])) #Max P : instead of length(unique(cluster)) , =1
N <- length(cluster[[1]]) #Max P : instead of length(cluster),=1 since cluster is a df
The same modifications should work for the 2 clusters case.
Thank you again for sharing your R thoughts and functions!
Best,
MP
Hi Max
Thank you for reaching out. The solution that you proposed does not to work properly. The object cluster does contain all possible clusters and you interested in the unique clusters. Using cluster[[1]] you select only the first element of the date.frame.
I did now change the function a little. Maybe this helps to get rid of the NA problem. As I am not able to reproduce this problem, I find it incredibly hard to tackle it. Could you by any chance provide a reproducible example?
Best, ad
Hi and thanks for the amazing work!
I think I’ve done everything right, but I’m getting NA’s for Std. error, t value and Pr(>|t|). Do you have any solutions for this?
Hi tb
Thank you for you comment. Unfortunately, I am not able to reproduce t the NA problem. Can you, by any chance, provide a reproducible example? That would help a lot! Best, ad
Hi,
First of all, thank you so much for this fantastic function! It’s been very helpful for my research.
I would like to tell you about a problem I am having when using the clustered robust standard errors while changing regressors in a loop. I will illustrate it with an example:
# Here some sample data
Y <- c(1, 3, 2, 0, 5, 6)
X <- c(2, 4, 3, 2, 10, 8)
ID <- c(0, 0, 0, 1, 1, 1)
dat <- data.frame(Y, X, ID)
# Here some controls which are "outside" the dataset:
C1 <- c(1, 2, 3, 4, 5, 6)
C2 <- c(6, 4, 2, 8, 0, 13)
C <- matrix(NA, 6, 2)
C[ , 1:2] <- t(c(C1, C2))
# A matrix to store the standard errors:
R <- matrix(NA, 2, 1)
# Now I do a loop to regress Y on X adding the controls sequentially and storing s.e.:
for(i in 1:2){
reg <- summary(lm(data=dat, Y ~ X + C[, i]))
R[i,1] <- reg$coefficients[3,2]
}
R
# This produces the following output:
# [,1]
# [1,] 0.4255123
# [2,] 0.1015860
# However, the loop does not work when using the clustered s.e.:
for(i in 1:2){
reg <- summary(lm(data=dat, Y ~ X + C[, i]), cluster=c("ID"))
R[i,1] <- reg$coefficients[2,2]
}
R
# Error in get(paste(object$call$data)) : invalid first argument
# Called from: get(paste(object$call$data))
##
Any idea of why this is happening or how it can be solved?
Thanks a lot in advance.
Martin
Hi Martin
Sorry for my late reply. The problem arises from your loop and is not directly related to the function. An easy way to solve the problem is to estimate each regression separately. I don’t know if this is a practicable solution in your case. Otherwise you could check out alternative ways to estimate clustered standard errors in R.
Best
ad
How can I cite your function? Is there an official means/way to do so or should I cite the blog?
Hi Jen
You can cite the function as follows
Dibiasi, A. (2016). Clustered Standard Errors in R [Blog post]. Retrieved from https://economictheoryblog.com/2016/12/13/clustered-standard-errors-in-r/
Thank you!
ad
Hey
Thanks so much for the code.
I was wondering if there is a possibility to get the results in a nice table, like with stargazer or something like that.
Thanks 🙂
Hey
Thank you for your comment. Yes, you can do that. Save you summary output and recover the coefficients. Now you can add them to Stargazer. For instance,
summary_save <- summary(reg,cluster = c("class_id"))
clustered_errors <- as.vector(summary_save$coefficients[,c("Std. Error")])
Hope this helps.
Cheers, ad
Hi again,
I prepared a short post that explains how one can obtain nice tables in stargazer with clustered standard errors.
Hope it will help you.
Kindly, ad
thank you very much
I will try this imediatly 🙂
Hope it works fine for you!
Hello, first of all thank you for making all this effort but I get an error when I try to use your function add on:
Error in get(paste(object$call$data))[, c(n_coef, cluster)] :
object of type ‘closure’ is not subsettable
Called from: na.omit(get(paste(object$call$data))[, c(n_coef, cluster)])
Do you know where the problem is?
Best Wishes
Hi,
Thank you for your comment. This error message arises if we try to index a function. However, without knowing your specific case it is a little difficult to evaluate where the error is caused. Could you provide a reproducible example–a short R code that produces the same error? That will allow me to check where the error is coming from.
Best, ad
Since I can’t provide you the .csv file, imagine something like this:
setwd(“~/R/folder”)
library(RCurl)
House1 <- read.csv("House.csv")
attach(House1 )
reg1 <- lm(equi ~ dummy + interactions + controls,
data=subset(House1, money< 100 & debt == 0))
url_robust <- "https://raw.githubusercontent.com/IsidoreBeautrelet/economictheoryblog/master/robust_summary.R"
eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)),
envir=.GlobalEnv)
summary(reg1,cluster = c("PLZ"))
I don't have anything "fancy" installed like perl or something else
Best Wishes
Hi again,
Thanks a lot for the example. It looks fine to me. The only potential problem that I could detect is that you subset the data within the lm() function. Could you try to subset the data before running your regression. Something like this:
df=subset(House1, money< 100 & debt == 0)
reg1 <- lm(equi ~ dummy + interactions + controls, data=df)
Let me know if that solves the problem.
Best, ad
Hello,
Yes, this solved my problem! Thank you 🙂
Best Wishes
You’re welcome!
Cheers, ad
Hello,
Thank you very much for writing this function. I was just stumbling across a potential problem. It seems that your function computes the p value corresponding to the normal distribution (or corresponding to the t distribution with degrees of freedom depending on the number of observations). In Stata, however, I get the same t statistics but different p-values. It seems to be the case that Stata uses the t distribtuion where degrees of freedom depend on the number of clusters rather than on the number of observations! I am quite new to R and also to statistics, could you shed some light on which approach should be used and why?
Best regards!
Hi! This is actually a good point. When having clusters you converge over the number of clusters and not over the number of total observations. Hence, I should adapt the function accordingly. Thank you for that.
Best, ad
Hi, I am super new to R (like 2 months now) and I’m trying to sort of learn it by myself. And I came across this code and I was happy for it, but I am facing some troubles making it work.
I will try to explain it as simply as I can (because it sounds complicated in my head)
Problem: I don’t have variables for which I want to find correlations hanging around in my global environment. I have data-frames. The particular one I am using now for the regression is called regdata. I am modeling my lm regression like this. The size of the dataframe is 160 x 9, 160 rows and 9 columns.
result 2″ to an “invalid object”. I have tried all of the following and nothing works
summary(result, cluster = c (regdata$x3))
summary(result, cluster = c (“regdata$x3”))
summary(result, cluster = c (x3))
summary(result, cluster = c (“x3”))
summary(result, cluster = c (160, regdata$x3))
In this instance, x1, x2, x3 are all categorical variables with
x1 ranging from 1 to 5
x2 has 3 values 0,1,2
x3 has 4 values ranging from 1 to 4
Can I not cluster if the number of clusters in more than 2?
Hi, thank you for the comment. Your fourth example is the way is should work, i.e. you pass on the variable name to function. Do you pass on the DataFrame in your regression? Unfortunately, the information you give does not provide sufficient information in order for me to really help you. Could you provide a reproducible example? Or at least state the error message? Best, ad
Hi,
I’m relatively new to regression analyses and am currently writing a thesis on the effect of certain ESG date on costs of equity (COE). A prior study that I’m looking at (https://www.sciencedirect.com/science/article/abs/pii/S037842661100081) is using pooled cross-sectional time-series regressions with robust standard errors clustered at the firm level.
I wish to cluster the standard errors at firm level as well, but I don’t understand how.
The model looks somewhat like this:
model1 <- lm( COE ~ ESG1 + ESG2 + ESG3 + controls)
What am i supposed to use as "VARIABLE" in:
"summary.lm(model1, cluster= c("VARIABLE") ?
Regards, JAN
Hey Jan! “VARIABLE” should be the name of the Variable you want to cluster you errors on.
Hi,
Thank you very much for writing and sharing this function. It is amazingly helpful. I realized that the function generates an error, specifically “Error in varcovar[pos_coef, pos_coef] : subscript out of bounds”. If a variable is dropped from the regression. I ran a regression on a subgroup in which there was no variation in a binary variable. It took me some time to figure out what the problem was. I’m commenting mostly to help someone with a error in the future.
Cheers!