Calculate OLS estimator manually in R

This post shows how to manually construct the OLS estimator in R (see this post for the exact mathematical derivation of the OLS estimator). The code will go through each single step of the calculation and estimate the coefficients, standard errors and p-values.  In case you are interested the coding an OLS function rather than in the step wise calculation of the estimation itself I recommend you to have a look at this post

For demonstration purpose, we will construct a fake data set with simulated height and weight data. We will calculate the relationship between height and weight using the lm() function of R. The lm() function is the build-in OLS estimator of R. We will then continue to construct the OLS estimator ourselves and estimate the coefficients (\beta) of the relationship between height and weight and calculate the standard errors (s.e.) around the estimated betas. Finally, we will compare the output of lm() and our manual constructed estimator and show that they are equivalent.

The following code simulates the data we are going to use

## start with clean work-space
rm(list=ls())

## create artificial data set
#  set seed for reproducibility
set.seed(501)

#  height in cm
height <- rnorm(200,175,sd = 15)

#  weight in kilogram (the relationship between height
#  and weight is completely random)
weight <- height-80+1.02*(height)^0.01*
  rnorm(200,0,sd = 15)*
  rnorm(200,1.1,sd = 0.2)

#  join height and weight in data frame
df <- data.frame(height,weight)

We will regress height on weight, after the construction of the data set.

## use R build-in OLS estimaor (lm())
reg = lm(weight ~ height, data=df)
summary(reg)

Finally, we will construct the OLS estimator manually and compare the results to the lm() output. You will see that they are equivalent.

## build OLS estimator manually

#  define X matrix and y vector
X <- as.matrix(cbind(1,df$height))
y <- as.matrix(df$weight)

#  estimate the coeficients beta
#  beta = ((X'X)^(-1))X'y
beta <- solve(t(X)%*%X)%*%t(X)%*%y

## calculate residuals
#  res = y - beta1 - beta2*X2
res <- as.matrix(y-beta[1]-beta[2]*X[,2])

## define the number of observations (n) and the number of
#  parameters (k)
n <- nrow(df)
k <- ncol(X)

## calculate the Variance-Covariance matrix (VCV)
#  VCV = (1/(n-k))res'res(X'X)^(-1)
VCV <- 1/(n-k) * as.numeric(t(res)%*%res) * solve(t(X)%*%X)

## calculate standard errors (se) of coefficients
se <- sqrt(diag(VCV))

## calculate the p-values
p_value <- rbind(2*pt(abs(beta[1]/se[1]), df=n-k,
                      lower.tail= FALSE),
                 2*pt(abs(beta[2]/se[2]), df=n-k,
                      lower.tail= FALSE))

## combine all necessary information
output <- as.data.frame(cbind(c("(Intercept)","height"),
                                beta,se,p_value))
names(output) <- c("Coefficients:","Estimate", 
                   "Std. Error","Pr(>|t|)")

#compare automatic (lm) and manual output
output
summary(reg)

This post showed how to compute the OLS estimator in R by are going slowly through each step of the calculation. The following post takes OLS estimation in R to the next level and wraps a function around the presented code. Ultimately, it is far more convenient to conduct analysis when using function rather than consistently going through each step of the calculation.

Advertisement

4 thoughts on “Calculate OLS estimator manually in R”

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.