Calculate OLS estimator manually in R

This post shows how to manually construct the OLS estimator in R (see this post for the exact mathematical derivation of the OLS estimator). The code will go through each single step of the calculation and estimate the coefficients, standard errors and p-values.  In case you are interested the coding an OLS function rather than in the step wise calculation of the estimation itself I recommend you to have a look at this post

For demonstration purpose, we will construct a fake data set with simulated height and weight data. We will calculate the relationship between height and weight using the $lm()$ function of R. The $lm()$ function is the build-in OLS estimator of R. We will then continue to construct the OLS estimator ourselves and estimate the coefficients ($\beta$) of the relationship between height and weight and calculate the standard errors ($s.e.$) around the estimated betas. Finally, we will compare the output of $lm()$ and our manual constructed estimator and show that they are equivalent.

The following code simulates the data we are going to use

## start with clean work-space
rm(list=ls())

## create artificial data set
#  set seed for reproducibility
set.seed(501)

#  height in cm
height <- rnorm(200,175,sd = 15)

#  weight in kilogram (the relationship between height
#  and weight is completely random)
weight <- height-80+1.02*(height)^0.01*
rnorm(200,0,sd = 15)*
rnorm(200,1.1,sd = 0.2)

#  join height and weight in data frame
df <- data.frame(height,weight)


We will regress height on weight, after the construction of the data set.

## use R build-in OLS estimaor (lm())
reg = lm(weight ~ height, data=df)
summary(reg)


Finally, we will construct the OLS estimator manually and compare the results to the lm() output. You will see that they are equivalent.

## build OLS estimator manually

#  define X matrix and y vector
X <- as.matrix(cbind(1,df$height)) y <- as.matrix(df$weight)

#  estimate the coeficients beta
#  beta = ((X'X)^(-1))X'y
beta <- solve(t(X)%*%X)%*%t(X)%*%y

## calculate residuals
#  res = y - beta1 - beta2*X2
res <- as.matrix(y-beta[1]-beta[2]*X[,2])

## define the number of observations (n) and the number of
#  parameters (k)
n <- nrow(df)
k <- ncol(X)

## calculate the Variance-Covariance matrix (VCV)
#  VCV = (1/(n-k))res'res(X'X)^(-1)
VCV <- 1/(n-k) * as.numeric(t(res)%*%res) * solve(t(X)%*%X)

## calculate standard errors (se) of coefficients
se <- sqrt(diag(VCV))

## calculate the p-values
p_value <- rbind(2*pt(abs(beta[1]/se[1]), df=n-k,
lower.tail= FALSE),
2*pt(abs(beta[2]/se[2]), df=n-k,
lower.tail= FALSE))

## combine all necessary information
output <- as.data.frame(cbind(c("(Intercept)","height"),
beta,se,p_value))
names(output) <- c("Coefficients:","Estimate",
"Std. Error","Pr(>|t|)")

#compare automatic (lm) and manual output
output
summary(reg)


This post showed how to compute the OLS estimator in R by are going slowly through each step of the calculation. The following post takes OLS estimation in R to the next level and wraps a function around the presented code. Ultimately, it is far more convenient to conduct analysis when using function rather than consistently going through each step of the calculation.

4 thoughts on “Calculate OLS estimator manually in R”

1. sam says:

# beta = ((X’X)^(-1))X’y
beta <- solve(t(X)%*%X)%*%t(X)%*%y

where is ^(-1) in the code?