Omitted Variable Bias: Violation of CLRM–Assumption 3: Explanatory Variables must be exogenous

One reason why omitting a relevant variable leads to biased estimates is that it violates assumption 3 of the classical linear regression model (CLRM), which states that all explanatory variables must be exogenous, i.e.

$E(\epsilon_{i}|X)=0$

From this post, we know that omitting a relevant variable from the regression causes the error term and the explanatory variables to be correlated.

Suppose that the data generating process in the population is as follows:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon$

However, suppose we omit $X_2$ and instead estimate the following regression model:

$Y = b_0 + b_1 X_1 + u$ ,

where $\epsilon \sim iid(0,\sigma^{2})$ and $u = \epsilon + \beta_2 X_2$

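Note that if $X_1$ and $X_2$ are correlated, i.e. $Cov(X_1, X_2) \neq 0$, and $\beta_2 \neq 0$, then $u$ is correlated with $X_1$, which implies that $E(u|X_1) \neq 0$. Hence, we violate assumption 3. That is, the error term of the estimated model is correlated with the included explanatory variable $X_1$.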
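
To make this step explicit, here is a short derivation of the covariance between $X_1$ and the new error term $u$. It assumes, as in the true model, that $X_1$ is uncorrelated with $\epsilon$:

$Cov(X_1, u) = Cov(X_1, \epsilon + \beta_2 X_2) \\ = Cov(X_1, \epsilon) + \beta_2 Cov(X_1, X_2) \\ = \beta_2 Cov(X_1, X_2)$

The covariance is zero only if $\beta_2 = 0$ (the omitted variable is irrelevant) or $Cov(X_1, X_2) = 0$ (the omitted variable is unrelated to the included one).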

Now, how can we prove mathematically that omitting $X_2$ indeed biases the OLS estimator? To see this, let's start from the expectation of the OLS estimator. Let $X$ denote the full matrix of included explanatory variables, in our case $X = [1,X_1]$ , and let $u$ be the error term containing $\beta_2 X_2$ , that is $u = \epsilon + \beta_2 X_2$ . Additionally, let $\beta$ be the vector of parameters that we want to estimate, i.e. $\beta = (\beta_0,\beta_1)'$ .

$\mathbb{E}(\hat{\beta}) = \mathbb{E}\left[ (X'X)^{-1}X'Y \right] \\ = \mathbb{E}\left[ (X'X)^{-1}X'(X\beta + u) \right] \\ = \mathbb{E}\left[ (X'X)^{-1}X'X\beta \right] + \mathbb{E}\left[ (X'X)^{-1}X'u \right] \\ = \mathbb{E}\left[ (X'X)^{-1}X'X \right] \beta + \mathbb{E}\left[ (X'X)^{-1}X'(\beta_2 X_2 + \epsilon) \right] \\ = \beta + \beta_2 \mathbb{E}\left[ (X'X)^{-1}X' X_2 \right] + \mathbb{E}\left[ (X'X)^{-1}X'\epsilon \right] \\ = \beta + \beta_2 \mathbb{E}\left[ (X'X)^{-1}X' X_2 \right] \\ = \beta + \beta_2 \mathbb{E}(X'X)^{-1} \mathbb{E}(X' X_2)$

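Immediately, we see that $\mathbb{E}(\hat{\beta}) \ne \beta$ whenever $\beta_2 \ne 0$ and $\mathbb{E}(X'X_2) \ne 0$ . For the slope coefficient, this is the case whenever $X_1$ and $X_2$ are correlated. Strictly speaking, the factorization in the last line of the derivation holds for probability limits rather than for finite-sample expectations, but the conclusion is the same: the second term does not vanish.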
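
The bias can also be checked by simulation. The following is a minimal sketch, not part of the original derivation: the function name, the parameter values ($\beta_0 = 1$, $\beta_1 = 2$, $\beta_2 = 3$), the normal distributions, and the construction $X_2 = \rho X_1 + v$ are all assumptions chosen for illustration. Under this construction, the average slope estimate from the short regression should be close to $\beta_1 + \beta_2 \rho$ rather than $\beta_1$.

```python
import numpy as np


def mean_short_regression_estimates(n=500, reps=2000, beta=(1.0, 2.0, 3.0),
                                    rho=0.5, seed=0):
    """Average OLS estimates of (b0, b1) from regressing Y on [1, X1] only.

    Illustrative setup (assumed, not from the text): X1 ~ N(0, 1),
    X2 = rho * X1 + v with v ~ N(0, 1), and eps ~ N(0, 1).
    """
    rng = np.random.default_rng(seed)
    beta0, beta1, beta2 = beta
    estimates = np.empty((reps, 2))
    for r in range(reps):
        x1 = rng.normal(size=n)
        # X2 is correlated with X1 whenever rho != 0.
        x2 = rho * x1 + rng.normal(size=n)
        eps = rng.normal(size=n)
        # True data generating process includes X2.
        y = beta0 + beta1 * x1 + beta2 * x2 + eps
        # Misspecified design matrix: constant and X1 only, X2 is omitted.
        X = np.column_stack([np.ones(n), x1])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates[r] = coef
    # Monte Carlo approximation of E(b0_hat), E(b1_hat).
    return estimates.mean(axis=0)


avg_b0, avg_b1 = mean_short_regression_estimates()
print(f"average b0: {avg_b0:.3f}")
print(f"average b1: {avg_b1:.3f}  (true beta1 = 2.0, beta1 + beta2 * rho = 3.5)")
```

Calling the function with `rho=0.0` removes the correlation between $X_1$ and $X_2$, and the average slope should then be close to the true $\beta_1$ even though $X_2$ is still omitted.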

If we specify our model correctly, so that $X$ also contains $X_2$ , the second term $\mathbb{E}\left[ (X'X)^{-1}X'u \right]$ in the third row becomes $\mathbb{E}\left[ (X'X)^{-1}X'\epsilon \right]$ and collapses to zero. That is, $\mathbb{E}\left[ (X'X)^{-1}X'\epsilon \right] = 0$ . This happens because, by the law of iterated expectations, $\mathbb{E}\left[ (X'X)^{-1}X'\epsilon \right] = \mathbb{E}\left[ (X'X)^{-1}X' \mathbb{E}(\epsilon|X) \right] = 0$ , which holds because of the original assumption that the explanatory variables are exogenous, $\mathbb{E}(\epsilon|X) = 0$ , and hence uncorrelated with the error term $\epsilon$ .
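
The contrast between the two error terms can be illustrated numerically through the sample moments $X'\epsilon/n$ and $X'u/n$ with $X = [1, X_1]$. This is a rough sketch under assumed values (the function name, $\beta_2 = 3$, $\rho = 0.5$, and normally distributed variables are hypothetical choices, not taken from the text). The first moment should be close to zero in a large sample, while the second should not.

```python
import numpy as np


def sample_moments(n=100_000, beta2=3.0, rho=0.5, seed=1):
    """Return (X'eps / n, X'u / n) for X = [1, X1] and u = eps + beta2 * X2.

    Illustrative setup (assumed): X1 ~ N(0, 1), X2 = rho * X1 + v,
    with v and eps independent N(0, 1) draws.
    """
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = rho * x1 + rng.normal(size=n)
    eps = rng.normal(size=n)
    # Error term of the misspecified model absorbs the omitted variable.
    u = eps + beta2 * x2
    X = np.column_stack([np.ones(n), x1])
    moment_eps = X.T @ eps / n  # sample analogue of E(X'eps), near zero
    moment_u = X.T @ u / n      # sample analogue of E(X'u), nonzero if rho != 0
    return moment_eps, moment_u


moment_eps, moment_u = sample_moments()
print("X'eps / n:", np.round(moment_eps, 3))
print("X'u / n:  ", np.round(moment_u, 3))
```

In this setup the second element of `moment_u` estimates $\beta_2 Cov(X_1, X_2) = \beta_2 \rho$, which is the quantity that drives the bias in the slope.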