This post shows how to manually construct the OLS estimator in R (see this post for the exact mathematical derivation of the OLS estimator). The code will go through each single step of the calculation and estimate the coefficients, standard errors and p-values. In case you are interested the coding an OLS function rather than in the step wise calculation of the estimation itself I recommend you to have a look at this post.
For demonstration purpose, we will construct a fake data set with simulated height and weight data. We will calculate the relationship between height and weight using the function of R. The
function is the build-in OLS estimator of R. We will then continue to construct the OLS estimator ourselves and estimate the coefficients (
) of the relationship between height and weight and calculate the standard errors (
) around the estimated betas. Finally, we will compare the output of
and our manual constructed estimator and show that they are equivalent.
The following code simulates the data we are going to use
## start with clean work-space rm(list=ls()) ## create artificial data set # set seed for reproducibility set.seed(501) # height in cm height <- rnorm(200,175,sd = 15) # weight in kilogram (the relationship between height # and weight is completely random) weight <- height-80+1.02*(height)^0.01* rnorm(200,0,sd = 15)* rnorm(200,1.1,sd = 0.2) # join height and weight in data frame df <- data.frame(height,weight)
We will regress height on weight, after the construction of the data set.
## use R build-in OLS estimaor (lm()) reg = lm(weight ~ height, data=df) summary(reg)
Finally, we will construct the OLS estimator manually and compare the results to the lm() output. You will see that they are equivalent.
## build OLS estimator manually # define X matrix and y vector X <- as.matrix(cbind(1,df$height)) y <- as.matrix(df$weight) # estimate the coeficients beta # beta = ((X'X)^(-1))X'y beta <- solve(t(X)%*%X)%*%t(X)%*%y ## calculate residuals # res = y - beta1 - beta2*X2 res <- as.matrix(y-beta[1]-beta[2]*X[,2]) ## define the number of observations (n) and the number of # parameters (k) n <- nrow(df) k <- ncol(X) ## calculate the Variance-Covariance matrix (VCV) # VCV = (1/(n-k))res'res(X'X)^(-1) VCV <- 1/(n-k) * as.numeric(t(res)%*%res) * solve(t(X)%*%X) ## calculate standard errors (se) of coefficients se <- sqrt(diag(VCV)) ## calculate the p-values p_value <- rbind(2*pt(abs(beta[1]/se[1]), df=n-k, lower.tail= FALSE), 2*pt(abs(beta[2]/se[2]), df=n-k, lower.tail= FALSE)) ## combine all necessary information output <- as.data.frame(cbind(c("(Intercept)","height"), beta,se,p_value)) names(output) <- c("Coefficients:","Estimate", "Std. Error","Pr(>|t|)") #compare automatic (lm) and manual output output summary(reg)
This post showed how to compute the OLS estimator in R by are going slowly through each step of the calculation. The following post takes OLS estimation in R to the next level and wraps a function around the presented code. Ultimately, it is far more convenient to conduct analysis when using function rather than consistently going through each step of the calculation.
# beta = ((X’X)^(-1))X’y
beta <- solve(t(X)%*%X)%*%t(X)%*%y
where is ^(-1) in the code?
Hi Sam, thank you for your comment. In R, the function solve() inverts a matrix. Best, ad