# Robust Standard Errors in R – Function

One can calculate robust standard errors easily in STATA. However, one can easily reach its limit when calculating robust standard errors in R. Although there exist several possibilities to calculate heteroscedasticity consistent standard errors most of them are not easy to implement, especially for beginners. I modified the ` summary() ` function in R so that it replicates the simple way of STATA. You can find the new ` summary() ` function below. Furthermore, I uploaded the function to a github.com repository. This makes it easy to load the function into your R session. In order to see how you can import the new ` summary() ` function into your R session and how you can use it see this post here.

More details
To be precise, I adjusted the ` summary.lm() ` function. The function ` summary() ` calls ` summary.lm() ` if applied on a ` lm ` object. I added the parameter ` robust ` to the ` summary.lm() ` function that calculates robust standard errors if one sets the parameter to true. Ultimately, one can calculate robust standard errors by applying the following code:

` summary(lm.object, robust=T) `

```summary.lm <- function (object, correlation = FALSE,
symbolic.cor = FALSE, robust=FALSE,
cluster=c(NULL,NULL),...) {
# add extension for robust standard errors
if(robust==TRUE){
# save variable that are necessary to calcualte robust sd
X <- model.matrix(object)
u2 <- residuals(object)^2
XDX <- 0

## One needs to calculate X'DX. But due to the fact that
## D is huge (NxN), it is better to do it with a cycle.
for(i in 1:nrow(X)) {
XDX <- XDX + u2[i]*X[i,]%*%t(X[i,])
}

# inverse(X'X)
XX1 <- solve(t(X)%*%X,tol = 1e-100)

varcovar <- XX1 %*% XDX %*% XX1

dfc_r <- sqrt(nrow(X))/sqrt(nrow(X)-ncol(X))

# Standard errors of the coefficient estimates are the
# square roots of the diagonal elements
rstdh <- dfc_r*sqrt(diag(varcovar))
}
# add extension for clustered standard errors
if(!is.null(cluster)&robust==T){warning("Robust standard errors are calculated. Set robust=F to calculate clustered standard errors.")}
if(!is.null(cluster)&robust==F){
if(""%in%cluster){stop("No variable for clustering provided.")}
if(length(cluster)>2){stop("The function only allows max. 2 clusters. You provided more.")}
n_coef <- all.vars(object\$call\$formula)
if(length(cluster)==1){
dat <- na.omit(get(paste(object\$call\$data))[,c(n_coef,cluster)])
if(nrow(dat)<nrow(object\$model)){stop("Not all observation have a cluster.")}
cluster_vector <- dat[,cluster]
require(sandwich, quietly = TRUE)
M <- res_length <- length(unique(cluster_vector))
N <- length(cluster_vector)
K <- object\$rank
dfc <- (M/(M-1))*((N-1)/(N-K))
uj  <- na.omit(apply(estfun(object),2, function(x) tapply(x, cluster_vector, sum)));
varcovar <- dfc*sandwich(object, meat=crossprod(uj)/N)
rstdh <- sqrt(diag(varcovar))
}
if(length(cluster)==2){
dat_1 <- na.omit(get(paste(object\$call\$data))[,c(n_coef,cluster)])
if(nrow(dat_1)<nrow(object\$model)){stop("Not all observation have a cluster.")}
dat_2 <- na.omit(get(paste(object\$call\$data))[,c(n_coef,cluster)])
if(nrow(dat_2)<nrow(object\$model)){stop("Not all observation have a cluster.")}

dat <- na.omit(get(paste(object\$call\$data))[,c(n_coef,cluster)])
library(sandwich,quietly = TRUE)
cluster1 <- dat[,cluster]
cluster2 <- dat[,cluster]
cluster12 = paste(cluster1,cluster2, sep="")
M1  <- length(unique(cluster1))
M2  <- length(unique(cluster2))
M12 <- res_length <-length(unique(cluster12))
N   <- length(cluster1)
K   <- object\$rank
dfc1  <- (M1/(M1-1))*((N-1)/(N-K))
dfc2  <- (M2/(M2-1))*((N-1)/(N-K))
dfc12 <- (M12/(M12-1))*((N-1)/(N-K))
u1j   <- apply(estfun(object), 2, function(x) tapply(x, cluster1,  sum))
u2j   <- apply(estfun(object), 2, function(x) tapply(x, cluster2,  sum))
u12j  <- apply(estfun(object), 2, function(x) tapply(x, cluster12, sum))
vc1   <-  dfc1*sandwich(object, meat=crossprod(u1j)/N )
vc2   <-  dfc2*sandwich(object, meat=crossprod(u2j)/N )
vc12  <- dfc12*sandwich(object, meat=crossprod(u12j)/N)
varcovar <- vc1 + vc2 - vc12
rstdh <- sqrt(diag(varcovar))
}

}
z <- object
p <- z\$rank
rdf <- z\$df.residual
if (p == 0) {
r <- z\$residuals
n <- length(r)
w <- z\$weights
if (is.null(w)) {
}
else {
r <- sqrt(w) * r
}
ans <- z[c("call", "terms", if (!is.null(z\$weights)) "weights")]
class(ans) <- "summary.lm"
ans\$aliased <- is.na(coef(object))
ans\$residuals <- r
ans\$df <- c(0L, n, length(ans\$aliased))
ans\$coefficients <- matrix(NA, 0L, 4L)
dimnames(ans\$coefficients) <- list(NULL, c("Estimate",
"Std. Error", "t value", "Pr(>|t|)"))
ans\$sigma <- sqrt(resvar)
return(ans)
}
if (is.null(z\$terms))
stop("invalid 'lm' object:  no 'terms' component")
if (!inherits(object, "lm"))
warning("calling summary.lm(<fake-lm-object>) ...")
Qr <- stats:::qr.lm(object)
n <- NROW(Qr\$qr)
if (is.na(z\$df.residual) || n - p != z\$df.residual)
warning("residual degrees of freedom in object suggest this is not an \"lm\" fit")
r <- z\$residuals
f <- z\$fitted.values
w <- z\$weights
if (is.null(w)) {
mss <- if (attr(z\$terms, "intercept"))
sum((f - mean(f))^2)
else sum(f^2)
}
else {
mss <- if (attr(z\$terms, "intercept")) {
m <- sum(w * f/sum(w))
sum(w * (f - m)^2)
}
else sum(w * f^2)
r <- sqrt(w) * r
}
if (is.finite(resvar) && resvar < (mean(f)^2 + var(f)) *
1e-30)
warning("essentially perfect fit: summary may be unreliable")
p1 <- 1L:p
R <- chol2inv(Qr\$qr[p1, p1, drop = FALSE])
se <- sqrt(diag(R) * resvar)

if(robust==T){se <- rstdh}
if(!is.null(cluster)&robust==F){se <- rstdh}
est <- z\$coefficients[Qr\$pivot[p1]]
tval <- est/se
ans <- z[c("call", "terms", if (!is.null(z\$weights)) "weights")]
ans\$residuals <- r
pval <- 2 * pt(abs(tval),
rdf, lower.tail = FALSE)
ans\$coefficients <- cbind(est, se, tval, pval)
dimnames(ans\$coefficients) <- list(names(z\$coefficients)[Qr\$pivot[p1]],
c("Estimate", "Std. Error", "t value", "Pr(>|t|)"))
ans\$aliased <- is.na(coef(object))
ans\$sigma <- sqrt(resvar)
ans\$df <- c(p, rdf, NCOL(Qr\$qr))
if (p != attr(z\$terms, "intercept")) {
df.int <- if (attr(z\$terms, "intercept"))
1L
else 0L
ans\$adj.r.squared <- 1 - (1 - ans\$r.squared) * ((n -
df.int)/rdf)
ans\$fstatistic <- c(value = (mss/(p - df.int))/resvar,
numdf = p - df.int, dendf = rdf)
if(robust==T|(!is.null(cluster))){
if(!is.null(cluster)){rdf <- res_length -1}
pos_coef <- match(names(z\$coefficients)[-match("(Intercept)",
names(z\$coefficients))],
names(z\$coefficients))

P_m <- matrix(z\$coefficients[pos_coef])

R_m <- diag(1,
length(pos_coef),
length(pos_coef))

ans\$fstatistic <- c(value = t(R_m%*%P_m)%*%
(solve(varcovar[pos_coef,pos_coef],tol = 1e-100))%*%
(R_m%*%P_m)/(p - df.int),
numdf = p - df.int, dendf = rdf)

}

}
else ans\$r.squared <- ans\$adj.r.squared <- 0
ans\$cov.unscaled <- R
dimnames(ans\$cov.unscaled) <- dimnames(ans\$coefficients)[c(1,
1)]
if (correlation) {
ans\$correlation <- (R * resvar)/outer(se, se)
dimnames(ans\$correlation) <- dimnames(ans\$cov.unscaled)
ans\$symbolic.cor <- symbolic.cor
}
if (!is.null(z\$na.action))
ans\$na.action <- z\$na.action
class(ans) <- "summary.lm"
ans
}
```

## 8 thoughts on “Robust Standard Errors in R – Function”

1. Frikster says:

Hey,
thanks for the function. I used it and it seems to be working finde. I just noticed some issue, that the absolute value of the F-Statistics are extremely huge but the sign fluctuates heavily.
I have a specification with a F-Value of 5.984e+14, dropping one variable my F-value is -1.527e+14.

Did anybody have the same “issue”?

1. ad says:

Hey,

Thank you for your comment. First of all, the function actually returns an F-Statistic that is based on a Wald test instead of sum of squares. In that sense it is a little different than the usual F-Test. The reason for that is that one should not use the sums of squares to calculate the F-statistic if data are heteroscedastic.

Nevertheless, I have never experienced heavy fluctuations of the F-Statistic. Would it be possible to provide a reproducible example? Generally, I estimated several models using different R functions and other statistical programs to cross-validate the function and always found that the function returned identical results.

HTH

2. SI says:

Hi, many thanks for sharing your work and the function. With one way clustering, I get an error “object ‘M12’ not found.” I see “if(!is.null(cluster)){rdf <- M12-1}" is giving it, as M12 is defined only for two way clustering, and changing M12 to M resolved it after changing length(cluster)==1 to if conditioning. I.e.,
if (length(cluster) == 1) {rdf <- M-1} else
if (length(cluster) == 2) {rdf <- M12-1}
(length(NULL) is zero, so is.null is dropped in the above.) I hope I got it right. Otherwise, please ignore me. Cheers.

1. ad says:

Hi! You are right. There was a bug in the function posted here. I corrected the mistake already months ago on github, but never updated the blogpost. I now corrected the bug also here. Thanks a lot for your comment. Let me know if it still does not work.

This site uses Akismet to reduce spam. Learn how your comment data is processed.