The easiest way to compute clustered standard errors in R is the modified ` summary()`

. I added an additional parameter, called `cluster,`

to the conventional ` summary() `

function. This parameter allows to specify a variable that defines the group / cluster in your data. The summary output will return clustered standard errors. Here is the syntax:

` summary(lm.object, cluster=c("variable")) `

Furthermore, I uploaded the function to a github.com repository. This makes it easy to load the function into your R session. The following lines of code import the function into your R session. You can also download the function directly from this post yourself.

# load necessary packages for importing the function
library(RCurl)
# import the function from repository
url_robust <- "https://raw.githubusercontent.com/IsidoreBeautrelet/economictheoryblog/master/robust_summary.R"
eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)),
envir=.GlobalEnv)

Below you will find a tutorial that demonstrates how to import the modified ` summary() `

function into you R session. It also explains the application of the function in greater detail. The tutorial carries out an OLS estimation in R that is based on an fake data that I generate here and which you can download here.

# start with an empty workspace
rm(list=ls())
# load necessary packages for importing the function
library(RCurl)
# load necessary packages for the example
library(gdata)
library(zoo)
# import the function
url_robust <- "https://raw.githubusercontent.com/IsidoreBeautrelet/economictheoryblog/master/robust_summary.R"
eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)),
envir=.GlobalEnv)
# download data set for example
url_data <- "https://economictheoryblog.files.wordpress.com/2016/12/data.xls"
data <- read.xls(gsub("s:",":",url_data))
# estimate simple linear model
reg <- lm(id_score ~ class_size,
data=data)
# use new summary function
summary(reg)
summary(reg,cluster = c("class_id"))
# Note, the standard errors are much
# larger with clustered standard errors

One can also easily include the obtained clustered standard errors in stargazer and create perfectly formatted tex or html tables. This post describes how one can achieve it.

### Like this:

Like Loading...

*Related*

Will this function work with two clustering variables? Something like: summary(lm.object, cluster=c(“variable1”, “variable2”))?

Yes. This function allows two clustering variables. Cheers.

Thank you. One more question: is the function specific to linear models? Or can it work for generalized linear model like logistic regression or other non-linear models?

Currently, the function only works with the lm class in R. I am working on generalizing the function. However, it will still take some time until a general version of the function is available, I suppose.

Thank you so much. I tried the function and it worked well with a single clustering variable. But it gives an error with two clustering variables. Any clues? Here is what I have done:

> SITE URLdata VarNames test fm url_robust eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)), envir=.GlobalEnv)

# one clustering variable “firmid”

> summary(fm, cluster=c(“firmid”))

Call:

lm(formula = y ~ x, data = test)

Residuals:

Min 1Q Median 3Q Max

-6.7611 -1.3680 -0.0166 1.3387 8.6779

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.02968 0.06701 0.443 0.658

x 1.03483 0.05060 20.453 <2e-16 ***

—

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.005 on 4998 degrees of freedom

Multiple R-squared: 0.2078, Adjusted R-squared: 0.2076

F-statistic: 418.3 on 1 and 499 DF, p-value: summary(fm, cluster=c(“year”))

Call:

lm(formula = y ~ x, data = test)

Residuals:

Min 1Q Median 3Q Max

-6.7611 -1.3680 -0.0166 1.3387 8.6779

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.02968 0.02339 1.269 0.204

x 1.03483 0.03339 30.993 summary(fm, cluster=c(“firmid”, “year”))

Error in summary.lm(fm, cluster = c(“firmid”, “year”)) :

object ‘M’ not found

You are right. There was a bug in the code. I fixed it. I guess it should work now. However, you should be careful now with interpreting the F-Statistic. I am not sure if I took the right amount of degrees of freedom. The rest of the output should be fine.

Besides the coding, from you code I see that you are working with non-nested clusters. I cannot remember from the top of my head. But should you not be careful with such a structure?

Hi,

Thanks a lot first of all for putting in so much effort to write this function.

I am getting an error for twoway clustering. This is the error I get:

Error in if (nrow(dat)

Is this error specific to me?

Thanks,

Pankaj

Hey. Thank you for comment. There was a bug in the code. Basically, not all of your observations have a cluster, i.e. one or both of your cluster variables contain NA’s. The problem was that I did not set-up the warning properly. That is, the warning only worked for the single clustering case, but did not work for twoway clustering. I fixed it and now it should work. Let me know if you encounter any other problems. Cheers.

The error didn’t paste properly in the previous comment.

Error in if (nrow(dat)

I am sorry my comment above is a bit of a mess. It changed when I posted it. But basically when I use two clustering variables [e.g., summary(fm, cluster=c(“firmid”, “year”))], I get the error message: “Error in summary.lm(fm, cluster = c(“firmid”, “year”)) :

object ‘M’ not found”

No worries, in my browser it appears quite clear.

Thank you for your response and your great function. It really helps.

I’m glad it helps.

Hi! Is there any way to use this code when using weights in your lm model? I get an error telling me that my weights are not recognized : “Error in get(all.vars(object$call)[length(all.vars(object$call))]) : objet ‘yeardif’ introuvable”

Is there anything I can do?

Many thanks for your help!

Hi! Thank you so much for you comment. There was a problem when extracting the data object from the formula when weights were specified. I was able to fix the problem and now it should work fine. Thank you again for your help.

Hi again!

Thank you very much for your reply! I tried again, and now I only get NAs in the Standard error, t-value, and p value column, even though I have no missing values in my data… I don’t get it! Do you know what might be going on?

Thanks again!

Hi! I conducted some additional robustness tests and everything works fine for me. Can you provide a reproducible example?

Cheers!

Hey! Sorry to come back to you after all this time. But I wonder, were you ever able to solve your problem with the function?

Best, ad.

Hi!

Thanks for the function.

I think I am getting the same problem as ct. Even reproducing the example you provide I get a bunch of NAs.

Is it only me?

Many thanks!

Hey!

Thank you for you remark. I tried the example and it works fine for me. Could you provide some more details? Do you have the package “sandwich” installed?

Cheers!

Below a printout of my console.

Maybe I am missing some packages.

Btw, sorry for taking up so much space.

And apologies for I am new to R and probably this is why I am not seeing the obvious.

Thank you for the printout. I now removed it from your comment. Unfortunately, I still cannot find the error. I tried the example with the newest R Version (3.4.3) and went to a completely different PC, in both cases the example worked fine. Can you check if you have the sandwich package installed? Furthermore, I noticed that you download the data differently – not that this should matter – but did the gdata package not work for you? Finally, you might have some packages loaded in your memory that mask other functions. Could you restart R and only run my example? Best, ad.

Hi! Thanks so much for making this available. I am a newbie to R, and I am having some trouble making the modified summary() function work. Although the example you provide in the short tutorial above worked smoothly, I tried to use it with a toy example of mine and I got the error message

“Error in summary.lm(mod, cluster = c(i)) :

The function only allows max. 2 clusters. You provided more.”

Here is a reproducible example (I realize that since each cluster is a singleton, clustering should be irrelevant for the calculation of standard errors; but I don’t see why that should make the function return an error message):

rm(list=ls())

library(RCurl)

url_robust <- "https://raw.githubusercontent.com/IsidoreBeautrelet/economictheoryblog/master/robust_summary.R"

eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)), envir=.GlobalEnv)

i <- seq(1,100,1)

x <- rnorm(100)

y <- 1 + 2*x + rnorm(100)

simpledata <- as.data.frame(cbind(i,x,y))

mod <- lm(y~x, data = simpledata)

summary(mod, cluster = c(i))

Thanks a lot!

Hi! Thank you for you remark. I am glad to hear that you are using my function. The reason that your example does not work properly has actually nothing to do with the cluster function, but is caused by a small syntax error. Try to put the variable

iin last line of you code, i.e. summary(mod, cluster = c(i)), in parentheses such that it looks like this“i”. Your example should work fine then. Let me know if it works. Also, just get in touch in case you encounter any other problems. CheersThanks a lot for the quick reply! It worked perfectly.

Hello, many thanks for creating this useful function. I read in the comments above that you are working to extend it so it works for the the glm family, and let me just add that I would be really, really glad to see it implemented for the glm.nb (negative binomial regression) command. Thanks again for your work!

Thank you for your comment. I’ll try my best. Currently, I am working on a different project. Hence, it will take longer than expected 🙂 Cheers

Hello ad, thx a lot for this function! I had the same issue than ct and Ricky and after examining the code, I realized that it came from the cluster object. I modified the function accordingly, and it works like a charm :

cluster <- dat[,cluster] #Max P : since dat is a df, cluster will also be a df

require(sandwich, quietly = TRUE)

M <- res_length <- length(unique(cluster[[1]])) #Max P : instead of length(unique(cluster)) , =1

N <- length(cluster[[1]]) #Max P : instead of length(cluster),=1 since cluster is a df

The same modifications should work for the 2 clusters case.

Thank you again for sharing your R thoughts and functions!

Best,

MP

Hi Max

Thank you for reaching out. The solution that you proposed does not to work properly. The object cluster does contain all possible clusters and you interested in the unique clusters. Using cluster[[1]] you select only the first element of the date.frame.

I did now change the function a little. Maybe this helps to get rid of the NA problem. As I am not able to reproduce this problem, I find it incredibly hard to tackle it. Could you by any chance provide a reproducible example?

Best, ad

Hi and thanks for the amazing work!

I think I’ve done everything right, but I’m getting NA’s for Std. error, t value and Pr(>|t|). Do you have any solutions for this?

Hi tb

Thank you for you comment. Unfortunately, I am not able to reproduce t the NA problem. Can you, by any chance, provide a reproducible example? That would help a lot! Best, ad

Hi,

First of all, thank you so much for this fantastic function! It’s been very helpful for my research.

I would like to tell you about a problem I am having when using the clustered robust standard errors while changing regressors in a loop. I will illustrate it with an example:

# Here some sample data

Y <- c(1, 3, 2, 0, 5, 6)

X <- c(2, 4, 3, 2, 10, 8)

ID <- c(0, 0, 0, 1, 1, 1)

dat <- data.frame(Y, X, ID)

# Here some controls which are "outside" the dataset:

C1 <- c(1, 2, 3, 4, 5, 6)

C2 <- c(6, 4, 2, 8, 0, 13)

C <- matrix(NA, 6, 2)

C[ , 1:2] <- t(c(C1, C2))

# A matrix to store the standard errors:

R <- matrix(NA, 2, 1)

# Now I do a loop to regress Y on X adding the controls sequentially and storing s.e.:

for(i in 1:2){

reg <- summary(lm(data=dat, Y ~ X + C[, i]))

R[i,1] <- reg$coefficients[3,2]

}

R

# This produces the following output:

# [,1]

# [1,] 0.4255123

# [2,] 0.1015860

# However, the loop does not work when using the clustered s.e.:

for(i in 1:2){

reg <- summary(lm(data=dat, Y ~ X + C[, i]), cluster=c("ID"))

R[i,1] <- reg$coefficients[2,2]

}

R

# Error in get(paste(object$call$data)) : invalid first argument

# Called from: get(paste(object$call$data))

##

Any idea of why this is happening or how it can be solved?

Thanks a lot in advance.

Martin

Hi Martin

Sorry for my late reply. The problem arises from your loop and is not directly related to the function. An easy way to solve the problem is to estimate each regression separately. I don’t know if this is a practicable solution in your case. Otherwise you could check out alternative ways to estimate clustered standard errors in R.

Best

ad

How can I cite your function? Is there an official means/way to do so or should I cite the blog?

Hi Jen

You can cite the function as follows

Dibiasi, A. (2016). Clustered Standard Errors in R [Blog post]. Retrieved from https://economictheoryblog.com/2016/12/13/clustered-standard-errors-in-r/

Thank you!

ad

Hey

Thanks so much for the code.

I was wondering if there is a possibility to get the results in a nice table, like with stargazer or something like that.

Thanks 🙂

Hey

Thank you for your comment. Yes, you can do that. Save you summary output and recover the coefficients. Now you can add them to Stargazer. For instance,

summary_save <- summary(reg,cluster = c("class_id"))

clustered_errors <- as.vector(summary_save$coefficients[,c("Std. Error")])

Hope this helps.

Cheers, ad

Hi again,

I prepared a short post that explains how one can obtain nice tables in stargazer with clustered standard errors.

Hope it will help you.

Kindly, ad

thank you very much

I will try this imediatly 🙂

Hope it works fine for you!

Hello, first of all thank you for making all this effort but I get an error when I try to use your function add on:

Error in get(paste(object$call$data))[, c(n_coef, cluster)] :

object of type ‘closure’ is not subsettable

Called from: na.omit(get(paste(object$call$data))[, c(n_coef, cluster)])

Do you know where the problem is?

Best Wishes

Hi,

Thank you for your comment. This error message arises if we try to index a function. However, without knowing your specific case it is a little difficult to evaluate where the error is caused. Could you provide a reproducible example–a short R code that produces the same error? That will allow me to check where the error is coming from.

Best, ad

Since I can’t provide you the .csv file, imagine something like this:

setwd(“~/R/folder”)

library(RCurl)

House1 <- read.csv("House.csv")

attach(House1 )

reg1 <- lm(equi ~ dummy + interactions + controls,

data=subset(House1, money< 100 & debt == 0))

url_robust <- "https://raw.githubusercontent.com/IsidoreBeautrelet/economictheoryblog/master/robust_summary.R"

eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)),

envir=.GlobalEnv)

summary(reg1,cluster = c("PLZ"))

I don't have anything "fancy" installed like perl or something else

Best Wishes

Hi again,

Thanks a lot for the example. It looks fine to me. The only potential problem that I could detect is that you subset the data within the lm() function. Could you try to subset the data before running your regression. Something like this:

df=subset(House1, money< 100 & debt == 0)

reg1 <- lm(equi ~ dummy + interactions + controls, data=df)

Let me know if that solves the problem.

Best, ad

Hello,

Yes, this solved my problem! Thank you 🙂

Best Wishes

You’re welcome!

Cheers, ad

Hello,

Thank you very much for writing this function. I was just stumbling across a potential problem. It seems that your function computes the p value corresponding to the normal distribution (or corresponding to the t distribution with degrees of freedom depending on the number of observations). In Stata, however, I get the same t statistics but different p-values. It seems to be the case that Stata uses the t distribtuion where degrees of freedom depend on the number of clusters rather than on the number of observations! I am quite new to R and also to statistics, could you shed some light on which approach should be used and why?

Best regards!

Hi! This is actually a good point. When having clusters you converge over the number of clusters and not over the number of total observations. Hence, I should adapt the function accordingly. Thank you for that.

Best, ad