Clustered Standard Errors in R

The easiest way to compute clustered standard errors in R is the modified summary(). I added an additional parameter, called cluster, to the conventional  summary() function. This parameter allows to specify a variable that defines the group / cluster in your data. The summary output will return clustered standard errors. Here is the syntax:

summary(lm.object, cluster=c("variable"))

Furthermore, I uploaded the function to a github.com repository. This makes it easy to load the function into your R session. The following lines of code import the function into your R session. You can also download the function directly from this post yourself.

 

# load necessary packages for importing the function
library(RCurl)

# import the function from repository
url_robust <- "https://raw.githubusercontent.com/IsidoreBeautrelet/economictheoryblog/master/robust_summary.R"
eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)),
     envir=.GlobalEnv)

 

Below you will find a tutorial that demonstrates how to import the modified  summary() function into you R session. It also explains the application of the function in greater detail. The tutorial carries out an OLS estimation in R that is based on an fake data that I generate here and which you can download here.

 

# start with an empty workspace
rm(list=ls())

# load necessary packages for importing the function
library(RCurl)
# load necessary packages for the example
library(gdata) 
library(zoo)

# import the function
url_robust <- "https://raw.githubusercontent.com/IsidoreBeautrelet/economictheoryblog/master/robust_summary.R"
eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)),
     envir=.GlobalEnv)


# download data set for example
url_data <- "https://economictheoryblog.files.wordpress.com/2016/12/data.xls"
data <- read.xls(gsub("s:",":",url_data))


# estimate simple linear model
reg <- lm(id_score ~ class_size, 
             data=data)

# use new summary function
summary(reg)
summary(reg,cluster = c("class_id"))

# Note, the standard errors are much 
# larger with clustered standard errors

One can also easily include the obtained clustered standard errors in stargazer and create perfectly formatted tex or html tables. This post describes how one can achieve it.

Advertisement

60 thoughts on “Clustered Standard Errors in R”

  1. Will this function work with two clustering variables? Something like: summary(lm.object, cluster=c(“variable1”, “variable2”))?

      1. Thank you. One more question: is the function specific to linear models? Or can it work for generalized linear model like logistic regression or other non-linear models?

      2. Currently, the function only works with the lm class in R. I am working on generalizing the function. However, it will still take some time until a general version of the function is available, I suppose.

  2. Thank you so much. I tried the function and it worked well with a single clustering variable. But it gives an error with two clustering variables. Any clues? Here is what I have done:

    > SITE URLdata VarNames test fm url_robust eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)), envir=.GlobalEnv)

    # one clustering variable “firmid”
    > summary(fm, cluster=c(“firmid”))

    Call:
    lm(formula = y ~ x, data = test)

    Residuals:
    Min 1Q Median 3Q Max
    -6.7611 -1.3680 -0.0166 1.3387 8.6779

    Coefficients:
    Estimate Std. Error t value Pr(>|t|)
    (Intercept) 0.02968 0.06701 0.443 0.658
    x 1.03483 0.05060 20.453 <2e-16 ***

    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 2.005 on 4998 degrees of freedom
    Multiple R-squared: 0.2078, Adjusted R-squared: 0.2076
    F-statistic: 418.3 on 1 and 499 DF, p-value: summary(fm, cluster=c(“year”))

    Call:
    lm(formula = y ~ x, data = test)

    Residuals:
    Min 1Q Median 3Q Max
    -6.7611 -1.3680 -0.0166 1.3387 8.6779

    Coefficients:
    Estimate Std. Error t value Pr(>|t|)
    (Intercept) 0.02968 0.02339 1.269 0.204
    x 1.03483 0.03339 30.993 summary(fm, cluster=c(“firmid”, “year”))
    Error in summary.lm(fm, cluster = c(“firmid”, “year”)) :
    object ‘M’ not found

    1. You are right. There was a bug in the code. I fixed it. I guess it should work now. However, you should be careful now with interpreting the F-Statistic. I am not sure if I took the right amount of degrees of freedom. The rest of the output should be fine.

      Besides the coding, from you code I see that you are working with non-nested clusters. I cannot remember from the top of my head. But should you not be careful with such a structure?

      1. Hi,

        Thanks a lot first of all for putting in so much effort to write this function.

        I am getting an error for twoway clustering. This is the error I get:
        Error in if (nrow(dat)

        Is this error specific to me?

        Thanks,
        Pankaj

      2. Hey. Thank you for comment. There was a bug in the code. Basically, not all of your observations have a cluster, i.e. one or both of your cluster variables contain NA’s. The problem was that I did not set-up the warning properly. That is, the warning only worked for the single clustering case, but did not work for twoway clustering. I fixed it and now it should work. Let me know if you encounter any other problems. Cheers.

      3. The error didn’t paste properly in the previous comment.

        Error in if (nrow(dat)

  3. I am sorry my comment above is a bit of a mess. It changed when I posted it. But basically when I use two clustering variables [e.g., summary(fm, cluster=c(“firmid”, “year”))], I get the error message: “Error in summary.lm(fm, cluster = c(“firmid”, “year”)) :
    object ‘M’ not found”

  4. Hi! Is there any way to use this code when using weights in your lm model? I get an error telling me that my weights are not recognized : “Error in get(all.vars(object$call)[length(all.vars(object$call))]) : objet ‘yeardif’ introuvable”
    Is there anything I can do?

    Many thanks for your help!

    1. Hi! Thank you so much for you comment. There was a problem when extracting the data object from the formula when weights were specified. I was able to fix the problem and now it should work fine. Thank you again for your help.

      1. Hi again!

        Thank you very much for your reply! I tried again, and now I only get NAs in the Standard error, t-value, and p value column, even though I have no missing values in my data… I don’t get it! Do you know what might be going on?

        Thanks again!

      2. Hi! I conducted some additional robustness tests and everything works fine for me. Can you provide a reproducible example?

        Cheers!

    2. Hey! Sorry to come back to you after all this time. But I wonder, were you ever able to solve your problem with the function?
      Best, ad.

    3. This is an incredibly helpful function. Thank you! My query is also regarding the use of survey weights. When using survey weights, i get no error warning, but the SEs do not appear to be clustered: they are identical to the unclustered……

      1. Thanks a lot. How exactly do you specify the weights? Are you using the weight option of lm? Is there any way to provide a reproducible example?

  5. Hi!
    Thanks for the function.
    I think I am getting the same problem as ct. Even reproducing the example you provide I get a bunch of NAs.
    Is it only me?

    Many thanks!

    1. Hey!

      Thank you for you remark. I tried the example and it works fine for me. Could you provide some more details? Do you have the package “sandwich” installed?

      Cheers!

  6. Below a printout of my console.
    Maybe I am missing some packages.
    Btw, sorry for taking up so much space.
    And apologies for I am new to R and probably this is why I am not seeing the obvious.

    1. Thank you for the printout. I now removed it from your comment. Unfortunately, I still cannot find the error. I tried the example with the newest R Version (3.4.3) and went to a completely different PC, in both cases the example worked fine. Can you check if you have the sandwich package installed? Furthermore, I noticed that you download the data differently – not that this should matter – but did the gdata package not work for you? Finally, you might have some packages loaded in your memory that mask other functions. Could you restart R and only run my example? Best, ad.

  7. Hi! Thanks so much for making this available. I am a newbie to R, and I am having some trouble making the modified summary() function work. Although the example you provide in the short tutorial above worked smoothly, I tried to use it with a toy example of mine and I got the error message

    “Error in summary.lm(mod, cluster = c(i)) :
    The function only allows max. 2 clusters. You provided more.”

    Here is a reproducible example (I realize that since each cluster is a singleton, clustering should be irrelevant for the calculation of standard errors; but I don’t see why that should make the function return an error message):

    rm(list=ls())
    library(RCurl)
    url_robust <- "https://raw.githubusercontent.com/IsidoreBeautrelet/economictheoryblog/master/robust_summary.R&quot;
    eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)), envir=.GlobalEnv)

    i <- seq(1,100,1)
    x <- rnorm(100)
    y <- 1 + 2*x + rnorm(100)
    simpledata <- as.data.frame(cbind(i,x,y))
    mod <- lm(y~x, data = simpledata)
    summary(mod, cluster = c(i))

    Thanks a lot!

    1. Hi! Thank you for you remark. I am glad to hear that you are using my function. The reason that your example does not work properly has actually nothing to do with the cluster function, but is caused by a small syntax error. Try to put the variable i in last line of you code, i.e. summary(mod, cluster = c(i)), in parentheses such that it looks like this “i”. Your example should work fine then. Let me know if it works. Also, just get in touch in case you encounter any other problems. Cheers

  8. Hello, many thanks for creating this useful function. I read in the comments above that you are working to extend it so it works for the the glm family, and let me just add that I would be really, really glad to see it implemented for the glm.nb (negative binomial regression) command. Thanks again for your work!

    1. Thank you for your comment. I’ll try my best. Currently, I am working on a different project. Hence, it will take longer than expected 🙂 Cheers

  9. Hello ad, thx a lot for this function! I had the same issue than ct and Ricky and after examining the code, I realized that it came from the cluster object. I modified the function accordingly, and it works like a charm :

    cluster <- dat[,cluster] #Max P : since dat is a df, cluster will also be a df
    require(sandwich, quietly = TRUE)
    M <- res_length <- length(unique(cluster[[1]])) #Max P : instead of length(unique(cluster)) , =1
    N <- length(cluster[[1]]) #Max P : instead of length(cluster),=1 since cluster is a df

    The same modifications should work for the 2 clusters case.

    Thank you again for sharing your R thoughts and functions!

    Best,
    MP

    1. Hi Max

      Thank you for reaching out. The solution that you proposed does not to work properly. The object cluster does contain all possible clusters and you interested in the unique clusters. Using cluster[[1]] you select only the first element of the date.frame.

      I did now change the function a little. Maybe this helps to get rid of the NA problem. As I am not able to reproduce this problem, I find it incredibly hard to tackle it. Could you by any chance provide a reproducible example?

      Best, ad

  10. Hi and thanks for the amazing work!
    I think I’ve done everything right, but I’m getting NA’s for Std. error, t value and Pr(>|t|). Do you have any solutions for this?

    1. Hi tb

      Thank you for you comment. Unfortunately, I am not able to reproduce t the NA problem. Can you, by any chance, provide a reproducible example? That would help a lot! Best, ad

  11. Hi,
    First of all, thank you so much for this fantastic function! It’s been very helpful for my research.
    I would like to tell you about a problem I am having when using the clustered robust standard errors while changing regressors in a loop. I will illustrate it with an example:

    # Here some sample data
    Y <- c(1, 3, 2, 0, 5, 6)
    X <- c(2, 4, 3, 2, 10, 8)
    ID <- c(0, 0, 0, 1, 1, 1)
    dat <- data.frame(Y, X, ID)
    # Here some controls which are "outside" the dataset:
    C1 <- c(1, 2, 3, 4, 5, 6)
    C2 <- c(6, 4, 2, 8, 0, 13)
    C <- matrix(NA, 6, 2)
    C[ , 1:2] <- t(c(C1, C2))
    # A matrix to store the standard errors:
    R <- matrix(NA, 2, 1)
    # Now I do a loop to regress Y on X adding the controls sequentially and storing s.e.:
    for(i in 1:2){
    reg <- summary(lm(data=dat, Y ~ X + C[, i]))
    R[i,1] <- reg$coefficients[3,2]
    }
    R
    # This produces the following output:
    # [,1]
    # [1,] 0.4255123
    # [2,] 0.1015860

    # However, the loop does not work when using the clustered s.e.:
    for(i in 1:2){
    reg <- summary(lm(data=dat, Y ~ X + C[, i]), cluster=c("ID"))
    R[i,1] <- reg$coefficients[2,2]
    }
    R
    # Error in get(paste(object$call$data)) : invalid first argument
    # Called from: get(paste(object$call$data))
    ##

    Any idea of why this is happening or how it can be solved?

    Thanks a lot in advance.
    Martin

    1. Hi Martin

      Sorry for my late reply. The problem arises from your loop and is not directly related to the function. An easy way to solve the problem is to estimate each regression separately. I don’t know if this is a practicable solution in your case. Otherwise you could check out alternative ways to estimate clustered standard errors in R.

      Best
      ad

  12. Hey

    Thanks so much for the code.
    I was wondering if there is a possibility to get the results in a nice table, like with stargazer or something like that.

    Thanks 🙂

    1. Hey

      Thank you for your comment. Yes, you can do that. Save you summary output and recover the coefficients. Now you can add them to Stargazer. For instance,

      summary_save <- summary(reg,cluster = c("class_id"))
      clustered_errors <- as.vector(summary_save$coefficients[,c("Std. Error")])

      Hope this helps.
      Cheers, ad

    2. Hi again,

      I prepared a short post that explains how one can obtain nice tables in stargazer with clustered standard errors.

      Hope it will help you.
      Kindly, ad

  13. Hello, first of all thank you for making all this effort but I get an error when I try to use your function add on:

    Error in get(paste(object$call$data))[, c(n_coef, cluster)] :
    object of type ‘closure’ is not subsettable
    Called from: na.omit(get(paste(object$call$data))[, c(n_coef, cluster)])

    Do you know where the problem is?

    Best Wishes

    1. Hi,

      Thank you for your comment. This error message arises if we try to index a function. However, without knowing your specific case it is a little difficult to evaluate where the error is caused. Could you provide a reproducible example–a short R code that produces the same error? That will allow me to check where the error is coming from.

      Best, ad

      1. Since I can’t provide you the .csv file, imagine something like this:

        setwd(“~/R/folder”)
        library(RCurl)
        House1 <- read.csv("House.csv")
        attach(House1 )
        reg1 <- lm(equi ~ dummy + interactions + controls,
        data=subset(House1, money< 100 & debt == 0))
        url_robust <- "https://raw.githubusercontent.com/IsidoreBeautrelet/economictheoryblog/master/robust_summary.R&quot;
        eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)),
        envir=.GlobalEnv)

        summary(reg1,cluster = c("PLZ"))

        I don't have anything "fancy" installed like perl or something else

        Best Wishes

      2. Hi again,

        Thanks a lot for the example. It looks fine to me. The only potential problem that I could detect is that you subset the data within the lm() function. Could you try to subset the data before running your regression. Something like this:

        df=subset(House1, money< 100 & debt == 0)
        reg1 <- lm(equi ~ dummy + interactions + controls, data=df)

        Let me know if that solves the problem.

        Best, ad

  14. Hello,
    Thank you very much for writing this function. I was just stumbling across a potential problem. It seems that your function computes the p value corresponding to the normal distribution (or corresponding to the t distribution with degrees of freedom depending on the number of observations). In Stata, however, I get the same t statistics but different p-values. It seems to be the case that Stata uses the t distribtuion where degrees of freedom depend on the number of clusters rather than on the number of observations! I am quite new to R and also to statistics, could you shed some light on which approach should be used and why?
    Best regards!

    1. Hi! This is actually a good point. When having clusters you converge over the number of clusters and not over the number of total observations. Hence, I should adapt the function accordingly. Thank you for that.
      Best, ad

  15. Hi, I am super new to R (like 2 months now) and I’m trying to sort of learn it by myself. And I came across this code and I was happy for it, but I am facing some troubles making it work.

    I will try to explain it as simply as I can (because it sounds complicated in my head)

    Problem: I don’t have variables for which I want to find correlations hanging around in my global environment. I have data-frames. The particular one I am using now for the regression is called regdata. I am modeling my lm regression like this. The size of the dataframe is 160 x 9, 160 rows and 9 columns.

    result 2″ to an “invalid object”. I have tried all of the following and nothing works

    summary(result, cluster = c (regdata$x3))
    summary(result, cluster = c (“regdata$x3”))
    summary(result, cluster = c (x3))
    summary(result, cluster = c (“x3”))
    summary(result, cluster = c (160, regdata$x3))

    In this instance, x1, x2, x3 are all categorical variables with

    x1 ranging from 1 to 5
    x2 has 3 values 0,1,2
    x3 has 4 values ranging from 1 to 4

    Can I not cluster if the number of clusters in more than 2?

    1. Hi, thank you for the comment. Your fourth example is the way is should work, i.e. you pass on the variable name to function. Do you pass on the DataFrame in your regression? Unfortunately, the information you give does not provide sufficient information in order for me to really help you. Could you provide a reproducible example? Or at least state the error message? Best, ad

  16. Hi,

    I’m relatively new to regression analyses and am currently writing a thesis on the effect of certain ESG date on costs of equity (COE). A prior study that I’m looking at (https://www.sciencedirect.com/science/article/abs/pii/S037842661100081) is using pooled cross-sectional time-series regressions with robust standard errors clustered at the firm level.

    I wish to cluster the standard errors at firm level as well, but I don’t understand how.

    The model looks somewhat like this:

    model1 <- lm( COE ~ ESG1 + ESG2 + ESG3 + controls)

    What am i supposed to use as "VARIABLE" in:
    "summary.lm(model1, cluster= c("VARIABLE") ?

    Regards, JAN

  17. Hi,
    Thank you very much for writing and sharing this function. It is amazingly helpful. I realized that the function generates an error, specifically “Error in varcovar[pos_coef, pos_coef] : subscript out of bounds”. If a variable is dropped from the regression. I ran a regression on a subgroup in which there was no variation in a binary variable. It took me some time to figure out what the problem was. I’m commenting mostly to help someone with a error in the future.

Leave a Reply to Max Pger Cancel reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.