# Derivation of the Least Squares Estimator for Beta in Matrix Notation

The following post is going to derive the least squares estimator for $\beta$, which we will denote as $b$. In general, we start by mathematically formalizing the relationships we think are present in the real world and writing them down in a formula.

(1) $y= X\beta +\epsilon$

Formula (1) depicts such a model, where $\beta$ represents the true relationship between the variables in our population. However, it is rare that we observe the whole population and with it the true relationship $\beta$. Most of the time we observe just a small fraction of what is really going on in the world. Nevertheless, even if we only observe a fraction, it is our job to estimate the true value $\beta$ as well as possible. One way to estimate $\beta$ is the Ordinary Least Squares (OLS) estimator. In the following we are going to derive this estimator; the estimated values for $\beta$ will be called $b$.

Assume we collected some data and have a dataset which represents a sample of the real world. Let the following equation (2) represent the sample analogue of model (1), where the unknown $\beta$ is replaced by its estimate $b$ and the unobserved errors $\epsilon$ by the residuals $e$.

(2) $y = Xb + e$

Equation (3) presents equation (2) in a more intuitively accessible way for those of you who still need some practice in reading matrix notation; it is, however, exactly the same as equation (2).

(3) $\begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{N} \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1K} \\ x_{21} & x_{22} & \cdots & x_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{NK} \end{bmatrix} \begin{bmatrix} b_{1} \\ b_{2} \\ \vdots \\ b_{K} \end{bmatrix} + \begin{bmatrix} e_{1} \\ e_{2} \\ \vdots \\ e_{N} \end{bmatrix}$
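
To make the dimensions concrete, here is a minimal NumPy sketch of such a sample (the names `N`, `K`, `X`, `b`, `e` and the chosen sizes are illustrative assumptions, picked to mirror the notation above):

```python
import numpy as np

rng = np.random.default_rng(0)

N, K = 5, 2                    # N observations, K regressors
X = rng.normal(size=(N, K))    # design matrix of dimension (N, K)
b = np.array([2.0, -1.0])      # coefficient vector of length K
e = rng.normal(size=N)         # residual vector of length N

y = X @ b + e                  # the sample model of equations (2)/(3)
```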

The idea of the ordinary least squares (OLS) estimator consists in choosing $b$ in such a way that the sum of squared residuals in the sample, $\sum_{i=1}^{N} e_{i}^{2}$, is as small as possible. Mathematically this means that in order to estimate $b$ we have to minimize $\sum_{i=1}^{N} e_{i}^{2}$, which in matrix notation is nothing else than $e'e$.

(4) $e'e = \begin{bmatrix} e_{1} & e_{2} & \cdots & e_{N} \end{bmatrix} \begin{bmatrix} e_{1} \\ e_{2} \\ \vdots \\ e_{N} \end{bmatrix} = \sum_{i=1}^{N}e_{i}^{2}$
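
Reusing the arrays from the sketch above, the equivalence in (4) can be checked numerically; `e @ e` computes the inner product $e'e$:

```python
# e'e, an inner product, equals the sum of squared residuals
assert np.isclose(e @ e, np.sum(e**2))
```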

In order to estimate $b$ we need to minimize $e'e$. This is what we are going to do. By definition, $e = y - Xb$, which follows directly from equation (2). Consequently we can write $e'e$ as $(y-Xb)'(y-Xb)$ by simply plugging the expression $e = y - Xb$ into $e'e$. This leaves us with the following minimization problem:

(5) $\min_{b} \; e'e = (y-Xb)'(y-Xb)$

(6) $\min_{b} \; e'e = (y'-b'X')(y-Xb)$

(7) $\min_{b} \; e'e = y'y - b'X'y - y'Xb + b'X'Xb$

(8) $\min_{b} \; e'e = y'y - 2b'X'y + b'X'Xb$

It is important to understand that $b'X'y=(b'X'y)'=y'Xb$. As both terms are scalars, i.e. of dimension 1×1, transposing them leaves them unchanged. This is why the two middle terms in (7) can be collected into $-2b'X'y$ in (8).
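
With the arrays from the sketch above, this identity is easy to verify numerically; both products collapse to the same scalar:

```python
# b'X'y and y'Xb are the same 1x1 quantity, i.e. a scalar
assert np.isclose(b @ X.T @ y, y @ X @ b)
```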

In order to minimize the expression in (8), we have to differentiate it with respect to $b$ and set the derivative equal to zero. To be able to do that, we make use of the following matrix differentiation rules:

1. $\frac{\partial b'X'y}{\partial b}=X'y$
2. $\frac{\partial b'X'Xb}{\partial b} = 2X'Xb$ (a short proof sketch follows below)
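
A short sketch for rule 2: for any square matrix $A$ the general rule is $\frac{\partial b'Ab}{\partial b} = (A + A')b$, and since $A = X'X$ is symmetric, i.e. $A' = A$, this reduces to $2X'Xb$. Rule 1 follows analogously from $\frac{\partial a'b}{\partial b} = a$ with $a = X'y$.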

Using these two rules allows us to minimize expression (8).

(9) $\frac{\partial(e'e)}{\partial b} = -2X'y + 2X'Xb \stackrel{!}{=} 0$

(10) $X'Xb=X'y$

Expression (10) is known as the system of normal equations. Finally, to solve it for $b$, we pre-multiply both sides of expression (10) by $(X'X)^{-1}$. This gives us the least squares estimator for $\beta$.

(11) $b=(X'X)^{-1}X'y$
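
As a sanity check, here is a minimal NumPy sketch (reusing `X` and `y` from the example above) that computes (11) and compares it against NumPy's own least squares routine:

```python
# OLS estimator b = (X'X)^{-1} X'y, computed via the normal equations (10).
# Solving the linear system is numerically more stable than forming
# the inverse of X'X explicitly.
b_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in least squares solver
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(b_hat, b_lstsq)
```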

One last mathematical remark: the second order condition for a minimum requires that the matrix $X'X$ is positive definite. This requirement is fulfilled whenever $X$ has full column rank, because then $v'X'Xv = (Xv)'(Xv) = \lVert Xv \rVert^{2} > 0$ for every $v \neq 0$.

Congratulations, you have just derived the least squares estimator $b$.

## Comments

1. Deborah Digges says:

Equation 3 has a typo:

You have put it as y = xb * error

It should be y = xb + error

1. ad says:

You are right. Thank you!

2. Eric NGONGA says:

Hi there! The matrix representation of the multivariable linear regression is clear, thanks a lot for the post. I just wonder if the error vector should have (K,1) dimension instead of (N,1) dimension.
Thanks.
E.NGONGA

1. ad says:

Hi Eric! Thank you for your comment. Actually, the error vector does have (N,1) dimension: there is one error term for each of the N observations. It is the coefficient vector $b$ that has (K,1) dimension. Cheers!

3. Tom says:

Hello, great post!

“In order to be able to do that we make use of the following mathematical statements”: statement 2. isn’t clear to me, could you provide a brief explanation? Thanks!

1. ad says:

Hi, thanks!

The two statements are the outcome of matrix differentiation rules. Does that help you? If not, I can deliver a short mathematical proof that shows how to derive these two statements.

Cheers

1. hieuttbk says:

Can you show me the derivation of the 2nd statement, or a document with the matrix derivation rules?
Thanks!

2. ad says:

Thank you for your message.

I published the derivation of the 2nd statement in a separate post.

Hope it helps.
