Clustered Standard Errors in R

The easiest way to compute clustered standard errors in R is the modified summary(). I added an additional parameter, called cluster, to the conventional  summary() function. This parameter allows to specify a variable that defines the group / cluster in your data. The summary output will return clustered standard errors. Here is the syntax:

summary(lm.object, cluster=c("variable"))

Furthermore, I uploaded the function to a github.com repository. This makes it easy to load the function into your R session. The following lines of code import the function into your R session. You can also download the function directly from this post yourself.

 

# load necessary packages for importing the function
library(RCurl)

# import the function from repository
url_robust <- "https://raw.githubusercontent.com/IsidoreBeautrelet/economictheoryblog/master/robust_summary.R"
eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)),
     envir=.GlobalEnv)

 

Below you will find a tutorial that demonstrates how to import the modified  summary() function into you R session. It also explains the application of the function in greater detail. The tutorial is based on an fake data that I generate here and which you can download here.

 

# start with an empty workspace
rm(list=ls())

# load necessary packages for importing the function
library(RCurl)
# load necessary packages for the example
library(gdata) 
library(zoo)

# import the function
url_robust <- "https://raw.githubusercontent.com/IsidoreBeautrelet/economictheoryblog/master/robust_summary.R"
eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)),
     envir=.GlobalEnv)


# download data set for example
url_data <- "https://economictheoryblog.files.wordpress.com/2016/12/data.xls"
data <- read.xls(gsub("s:",":",url_data))


# estimate simple linear model
reg <- lm(id_score ~ class_size, 
             data=data)

# use new summary function
summary(reg)
summary(reg,cluster = c("class_id"))

# Note, the standard errors are much 
# larger with clustered standard errors
Advertisements
This entry was posted in Computing and Others, Econometrics and tagged , , , . Bookmark the permalink.

20 Responses to Clustered Standard Errors in R

  1. Pingback: Clustered Standard Errors | Economic Theory Blog

  2. Pingback: Example data – Clustered Standard Errors | Economic Theory Blog

  3. Ranaivo says:

    Will this function work with two clustering variables? Something like: summary(lm.object, cluster=c(“variable1”, “variable2”))?

    • ad says:

      Yes. This function allows two clustering variables. Cheers.

      • Ranaivo says:

        Thank you. One more question: is the function specific to linear models? Or can it work for generalized linear model like logistic regression or other non-linear models?

      • ad says:

        Currently, the function only works with the lm class in R. I am working on generalizing the function. However, it will still take some time until a general version of the function is available, I suppose.

  4. Ranaivo says:

    Thank you so much. I tried the function and it worked well with a single clustering variable. But it gives an error with two clustering variables. Any clues? Here is what I have done:

    > SITE URLdata VarNames test fm url_robust eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)), envir=.GlobalEnv)

    # one clustering variable “firmid”
    > summary(fm, cluster=c(“firmid”))

    Call:
    lm(formula = y ~ x, data = test)

    Residuals:
    Min 1Q Median 3Q Max
    -6.7611 -1.3680 -0.0166 1.3387 8.6779

    Coefficients:
    Estimate Std. Error t value Pr(>|t|)
    (Intercept) 0.02968 0.06701 0.443 0.658
    x 1.03483 0.05060 20.453 <2e-16 ***

    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 2.005 on 4998 degrees of freedom
    Multiple R-squared: 0.2078, Adjusted R-squared: 0.2076
    F-statistic: 418.3 on 1 and 499 DF, p-value: summary(fm, cluster=c(“year”))

    Call:
    lm(formula = y ~ x, data = test)

    Residuals:
    Min 1Q Median 3Q Max
    -6.7611 -1.3680 -0.0166 1.3387 8.6779

    Coefficients:
    Estimate Std. Error t value Pr(>|t|)
    (Intercept) 0.02968 0.02339 1.269 0.204
    x 1.03483 0.03339 30.993 summary(fm, cluster=c(“firmid”, “year”))
    Error in summary.lm(fm, cluster = c(“firmid”, “year”)) :
    object ‘M’ not found

    • ad says:

      You are right. There was a bug in the code. I fixed it. I guess it should work now. However, you should be careful now with interpreting the F-Statistic. I am not sure if I took the right amount of degrees of freedom. The rest of the output should be fine.

      Besides the coding, from you code I see that you are working with non-nested clusters. I cannot remember from the top of my head. But should you not be careful with such a structure?

      • Pankaj Jindal says:

        Hi,

        Thanks a lot first of all for putting in so much effort to write this function.

        I am getting an error for twoway clustering. This is the error I get:
        Error in if (nrow(dat)

        Is this error specific to me?

        Thanks,
        Pankaj

      • ad says:

        Hey. Thank you for comment. There was a bug in the code. Basically, not all of your observations have a cluster, i.e. one or both of your cluster variables contain NA’s. The problem was that I did not set-up the warning properly. That is, the warning only worked for the single clustering case, but did not work for twoway clustering. I fixed it and now it should work. Let me know if you encounter any other problems. Cheers.

      • Pankaj Jindal says:

        The error didn’t paste properly in the previous comment.

        Error in if (nrow(dat)

  5. Ranaivo says:

    I am sorry my comment above is a bit of a mess. It changed when I posted it. But basically when I use two clustering variables [e.g., summary(fm, cluster=c(“firmid”, “year”))], I get the error message: “Error in summary.lm(fm, cluster = c(“firmid”, “year”)) :
    object ‘M’ not found”

  6. Ranaivo says:

    Thank you for your response and your great function. It really helps.

  7. ct says:

    Hi! Is there any way to use this code when using weights in your lm model? I get an error telling me that my weights are not recognized : “Error in get(all.vars(object$call)[length(all.vars(object$call))]) : objet ‘yeardif’ introuvable”
    Is there anything I can do?

    Many thanks for your help!

    • ad says:

      Hi! Thank you so much for you comment. There was a problem when extracting the data object from the formula when weights were specified. I was able to fix the problem and now it should work fine. Thank you again for your help.

      • ct says:

        Hi again!

        Thank you very much for your reply! I tried again, and now I only get NAs in the Standard error, t-value, and p value column, even though I have no missing values in my data… I don’t get it! Do you know what might be going on?

        Thanks again!

      • ad says:

        Hi! I conducted some additional robustness tests and everything works fine for me. Can you provide a reproducible example?

        Cheers!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s