Derivation of the Least Squares Estimator for Beta in Matrix Notation

The following post derives the least squares estimator for $\beta$ , which we will denote as $b$ . In general, we start by mathematically formalizing the relationships we think are present in the real world and writing them down as a formula.

(1) $y= X\beta +\epsilon$

Formula (1) depicts such a model, where $\beta$ represents the true relationship between variables in our population. However, it is rare that we observe the whole population and with it the true relationship $\beta$ . Most of the time we observe just a small fraction of what is really going on in the world. Nevertheless, even if we observe just a fraction, it is our job to estimate the true value $\beta$ as well as possible. One way to estimate the value of $\beta$ is the Ordinary Least Squares (OLS) estimator. In the following we are going to derive an estimator for $\beta$ . The estimated values for $\beta$ will be called $b$ .

Assume we collected some data and have a dataset which represents a sample of the real world. Let the following equation (2) represent the mathematical model of relationships we presume to exist in the real world and consequently in our sample.

(2) $y= Xb + e$

Equation (3) presents equation (2) in a more intuitively accessible way for those of you who are still getting used to reading matrix notation; it is, however, exactly the same as equation (2).

(3) $\begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{N} \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1K} \\ x_{21} & x_{22} & \cdots & x_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{NK} \end{bmatrix} \begin{bmatrix} b_{1} \\ b_{2} \\ \vdots \\ b_{K} \end{bmatrix} + \begin{bmatrix} e_{1} \\ e_{2} \\ \vdots \\ e_{N} \end{bmatrix}$

The idea of the ordinary least squares estimator (OLS) consists of choosing $b$ in such a way that the sum of squared residuals (i.e. $\sum_{i=1}^{N} e_{i}^{2}$ ) in the sample is as small as possible. Mathematically, this means that in order to estimate $b$ we have to minimize $\sum_{i=1}^{N} e_{i}^{2}$ , which in matrix notation is nothing other than $e'e$ .

(4) $e'e = \begin{bmatrix} e_{1} & e_{2} & \cdots & e_{N} \end{bmatrix} \begin{bmatrix} e_{1} \\ e_{2} \\ \vdots \\ e_{N} \end{bmatrix} = \sum_{i=1}^{N}e_{i}^{2}$
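To make equation (4) concrete, here is a minimal NumPy sketch. The residual vector `e` is a made-up example and not taken from any dataset; the point is only that the inner product $e'e$ and the explicit sum of squares are the same number.

```python
import numpy as np

# Hypothetical residual vector, chosen only for illustration.
e = np.array([1.0, -2.0, 0.5])

# e'e written as the inner product of the vector with itself ...
inner_product = e @ e

# ... and the same quantity written as an explicit sum of squares.
sum_of_squares = np.sum(e ** 2)

print(inner_product, sum_of_squares)
```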

In order to estimate $b$ we need to minimize $e'e$ . By definition we know that $e = y - Xb$ , which follows directly from formula (2). Consequently we can write $e'e$ as $(y-Xb)'(y-Xb)$ by simply plugging the expression $e = y - Xb$ into $e'e$ . This leaves us with the following minimization problem:

(5) $\min_{b} \; e'e = (y-Xb)'(y-Xb)$

(6) $\min_{b} \; e'e = (y'-b'X')(y-Xb)$

(7) $\min_{b} \; e'e = y'y - b'X'y - y'Xb + b'X'Xb$

(8) $\min_{b} \; e'e = y'y - 2b'X'y + b'X'Xb$

It is important to understand that $b'X'y=(b'X'y)'=y'Xb$ . Both terms are scalars, meaning of dimension 1×1, so transposing the term leaves it unchanged. This is what allows the two middle terms in (7) to be combined into $-2b'X'y$ in (8).
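The scalar argument can be checked numerically. The sketch below uses randomly generated `X`, `y` and `b` (illustrative values only, with assumed sizes `N = 5` and `K = 2`) to confirm that $b'X'y$ and $y'Xb$ are both 1×1 and equal, and that the expanded form in (8) matches $(y-Xb)'(y-Xb)$ from (5).

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 5, 2                      # assumed sample size and number of regressors

X = rng.normal(size=(N, K))      # N x K data matrix
y = rng.normal(size=(N, 1))      # N x 1 column vector
b = rng.normal(size=(K, 1))      # K x 1 column vector

left = b.T @ X.T @ y             # (1 x K)(K x N)(N x 1) -> 1 x 1
right = y.T @ X @ b              # (1 x N)(N x K)(K x 1) -> 1 x 1

# Expression (5) and its expanded form (8) should agree.
ssr_direct = (y - X @ b).T @ (y - X @ b)
ssr_expanded = y.T @ y - 2 * b.T @ X.T @ y + b.T @ X.T @ X @ b

print(left.shape, right.shape)
print(np.allclose(left, right), np.allclose(ssr_direct, ssr_expanded))
```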

In order to minimize the expression in (8), we have to differentiate it with respect to $b$ and set the derivative equal to zero. In order to be able to do that we make use of the following mathematical statements:

$\frac{\partial b'X'y}{\partial b}=X'y$
$\frac{\partial b'X'Xb}{\partial b} =2X'Xb$ (proof)
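If the two derivative rules look unfamiliar, a numerical sanity check may help. The sketch below compares both rules against a central finite-difference approximation on random data. The helper `numerical_gradient` is a hypothetical function written for this illustration, and the check is not a proof.

```python
import numpy as np

def numerical_gradient(f, b, h=1e-6):
    """Central finite-difference approximation of the gradient of f at b."""
    grad = np.zeros_like(b)
    for k in range(b.size):
        step = np.zeros_like(b)
        step[k] = h
        grad[k] = (f(b + step) - f(b - step)) / (2 * h)
    return grad

rng = np.random.default_rng(1)
N, K = 6, 3                      # assumed sizes, for illustration only
X = rng.normal(size=(N, K))
y = rng.normal(size=N)
b = rng.normal(size=K)

# Statement 1: the derivative of b'X'y with respect to b is X'y.
grad_linear = numerical_gradient(lambda v: v @ X.T @ y, b)

# Statement 2: the derivative of b'X'Xb with respect to b is 2X'Xb.
grad_quadratic = numerical_gradient(lambda v: v @ X.T @ X @ v, b)

print(np.allclose(grad_linear, X.T @ y, atol=1e-5))
print(np.allclose(grad_quadratic, 2 * X.T @ X @ b, atol=1e-5))
```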

Using the two statements allows us to minimize expression (8).

(8) $\min_{b} \; e'e = y'y - 2b'X'y + b'X'Xb$

(9) $\frac{\partial(e'e)}{\partial b} = -2X'y + 2X'Xb \stackrel{!}{=} 0$

(10) $X'Xb=X'y$

Finally, to solve expression (10) for $b$ it is necessary to pre-multiply both sides of (10) by $(X'X)^{-1}$ . This gives us the least squares estimator for $\beta$ .

(11) $b=(X'X)^{-1}X'y$
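As a rough illustration of formula (11), the following sketch simulates a small dataset and computes $b$ from it. The "true" coefficients `beta`, the noise level and the sample size are assumptions made up for this example. Instead of forming $(X'X)^{-1}$ explicitly, the code solves the normal equations (10) with `np.linalg.solve`, which is mathematically equivalent, and compares the result with NumPy's own least squares routine `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(42)
N, K = 100, 3                                # assumed sample size and regressors

# Data matrix with a constant in the first column (simulated, not real data).
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
beta = np.array([1.0, 2.0, -0.5])            # hypothetical "true" coefficients
y = X @ beta + rng.normal(scale=0.1, size=N)

# Solve the normal equations X'X b = X'y from (10) for b.
b = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least squares solver.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# At the solution the residuals are orthogonal to the columns of X: X'e = 0.
e = y - X @ b

print(b)
print(np.allclose(b, b_lstsq), np.allclose(X.T @ e, 0.0, atol=1e-8))
```

Because the simulated noise is small, `b` should come out close to the assumed `beta`, though it will not match it exactly.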

One last mathematical point: the second-order condition for a minimum requires that the matrix $X'X$ is positive definite. This requirement is fulfilled if $X$ has full column rank, which also guarantees that $(X'X)^{-1}$ exists.
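The rank condition can be illustrated with two simulated data matrices (both made up for this sketch). In the first, the columns are linearly independent. In the second, the third column is an exact multiple of the second, so $X$ does not have full column rank and $X'X$ is only positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 50                                       # assumed sample size
x1 = rng.normal(size=N)
x2 = rng.normal(size=N)

# Full column rank: constant, x1 and x2 are linearly independent.
X_full = np.column_stack([np.ones(N), x1, x2])

# Rank deficient: the third column is exactly twice the second.
X_deficient = np.column_stack([np.ones(N), x1, 2 * x1])

rank_full = np.linalg.matrix_rank(X_full)
rank_deficient = np.linalg.matrix_rank(X_deficient)

# X'X is symmetric, so eigvalsh applies; positive definite means all eigenvalues > 0.
min_eig_full = np.linalg.eigvalsh(X_full.T @ X_full).min()
min_eig_deficient = np.linalg.eigvalsh(X_deficient.T @ X_deficient).min()

print(rank_full, min_eig_full)
print(rank_deficient, min_eig_deficient)
```

In the rank-deficient case the smallest eigenvalue is zero up to floating-point rounding, so the normal equations no longer have a unique solution.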

Congratulations, you just derived the least squares estimator $b$ .

23 thoughts on “Derivation of the Least Squares Estimator for Beta in Matrix Notation”

Pingback: Deriving the Least Squares Estimates - Page 2 - Math Help Forum

Pingback: Proof Gauss Markov Theorem | Economic Theory Blog

Pingback: Violation of CLRM – Assumption 4.1: Consequences when the expected value of the error term is non-zero | Economic Theory Blog

Pingback: Calculate OLS estimator manually in R | Economic Theory Blog

Pingback: Construct the OLS estimator as a function in R | Economic Theory Blog

Equation 3 has a typo:

You have put it as y = xb * error

It should be y = xb + error

ad says:

December 5, 2016 at 6:19 am

You are right. Thank you!


Pingback: Linear model and least squares in matrix notation | Data driven nerd out

Pingback: Regression (1) – Linear Regression – kerneleconomics

Hi there! the matrix representation of the multivariable linear regression is clear thanks a lot for the post. I just wonder if the error vector should have (K,1) dimension instead of (N,1) dimension.
Thanks.
E.NGONGA

ad says:

September 8, 2017 at 12:21 pm

Hi Eric! Thank you for your comment. You are right, the vector should have (K,1) dimension instead of (N,1) dimension. I updated the post. Cheers!


Hello, great post!

“In order to be able to do that we make use of the following mathematical statements”: statement 2. isn’t clear to me, could you provide a brief explanation? Thanks!

ad says:

March 5, 2018 at 10:12 am

Hi, thanks!

The two statements are the outcome of matrix derivation rules. Does that help you? If not, I can deliver a short mathematical proof that shows how to derive these two statements.

Cheers

1. hieuttbk says:
  
  October 16, 2018 at 3:34 pm
  
  Can you show me the derivation of 2nd statements or document having matrix derivation rules.
  Tks !
2. ad says:
  
  October 17, 2018 at 8:02 am
  
  Thank you for your message.
  
  I published the derivation of the 2nd statement in a separate post.
  
  Hope it helps.
  
  Cheers, ad

Pingback: Linear Regression | Economic Theory Blog

Pingback: Derivation of the Least Squares Estimator for Beta in Matrix Notation – Proof Nr. 1 | Economic Theory Blog

Very clearly written; logical and easy to follow. Thank you very much.

ad says:

January 9, 2020 at 6:54 am

Thanks for the praise!


Very helpful. I understood much of this, which says a lot given my weak Linear Algebra.

I was surprised by this statement: “It is important to understand that b’X’y=(b’X’y)’=y’Xb.” … because “both terms are scalars.” Is b not a vector, y a vector, and X a matrix of observed data with d-dimensions or features?

ad says:

November 20, 2020 at 9:13 pm

Thanks for this comment. You really made me think there. The term b’X’y is a scalar, because (1 x K) * (K x N) * (N x 1) is 1 x 1. Hope this helps. Cheers, ad


Pingback: Speed up Linear Regression with Matrix Math – Data Science Austria

Economic Theory Blog


“In God we trust; all others must bring data.” W. Edwards Deming
