Omitted Variable Bias: Violation of CLRM–Assumption 3: Explanatory Variables must be exogenous

One reason why omitting a relevant variable leads to biased estimates is that it violates assumption 3 of the classical linear regression model (CLRM), which states that all explanatory variables must be exogenous, i.e.

E(\epsilon_{i}|X)=0

From this post, we know that omitting a relevant variable from the regression causes the error term and the explanatory variables to be correlated.

Suppose that the data generating process in the population is as follows:

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon

However, we omit a variable, in this case X_2, and estimate the following regression model:

Y = b_0 + b_1 X_1 + u,

where \epsilon \sim iid(0,\sigma^{2}) and u = \epsilon + \beta_2 X_2

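Note that if X_1 and X_2 are correlated, i.e. \mathrm{Cov}(X_1, X_2) \neq 0, and \beta_2 \neq 0, then the error term u is correlated with X_1, so that E(u|X_1) \neq 0. Hence, we violate assumption 3. That is, the error term of the estimated model is correlated with the included explanatory variable.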
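
The link between correlated regressors and a correlated error term can be spelled out with covariances. The following is a short sketch, assuming that \epsilon is uncorrelated with X_1, as the exogeneity of the true model implies:

\mathrm{Cov}(X_1, u) = \mathrm{Cov}(X_1, \epsilon + \beta_2 X_2) = \mathrm{Cov}(X_1, \epsilon) + \beta_2 \mathrm{Cov}(X_1, X_2) = \beta_2 \mathrm{Cov}(X_1, X_2)

This is nonzero whenever \beta_2 \neq 0 and X_1 and X_2 are correlated.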

Now, how can we prove mathematically that omitting X_2 indeed causes endogeneity? To prove this, let's start from the expected value of the OLS estimator. Let X denote the full matrix of explanatory variables of the estimated model, in our case X = [1,X_1], and let u be the error term containing \beta_2 X_2, that is u = \epsilon + \beta_2 X_2. Additionally, let \beta be the vector of parameters that we want to estimate, i.e. \beta = (\beta_0,\beta_1).

\begin{aligned}
\mathbb{E}(\hat{\beta}) &= \mathbb{E}\left[ (X'X)^{-1}X'Y \right] \\
&= \mathbb{E}\left[ (X'X)^{-1}X'(X\beta + u) \right] \\
&= \mathbb{E}\left[ (X'X)^{-1}X'X\beta \right] + \mathbb{E}\left[ (X'X)^{-1}X'u \right] \\
&= \mathbb{E}\left[ (X'X)^{-1}X'X \right] \beta + \mathbb{E}\left[ (X'X)^{-1}X'(\beta_2 X_2 + \epsilon) \right] \\
&= \beta + \beta_2 \mathbb{E}\left[ (X'X)^{-1}X' X_2 \right] + \mathbb{E}\left[ (X'X)^{-1}X'\epsilon \right] \\
&= \beta + \beta_2 \mathbb{E}\left[ (X'X)^{-1}X' X_2 \right] \\
&= \beta + \beta_2 [\mathbb{E}(X'X)]^{-1} \mathbb{E}(X' X_2)
\end{aligned}

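We see that \mathbb{E}(\hat{\beta}) \ne \beta whenever \beta_2 \ne 0 and \mathbb{E}(X'X_2) \ne 0. For the slope coefficient, this is the case whenever X_1 and X_2 are correlated. Strictly speaking, the last step, which splits the expectation of a product into a product of expectations, holds exactly only asymptotically (as a probability limit); in finite samples the bias is given by the second-to-last row.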
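
A small simulation can make the size of the bias tangible. The sketch below is illustrative only: the function name `simulate_short_regression`, the parameter values, the correlation `rho`, and the normal distributions are assumptions chosen for this example, not part of the derivation. It estimates the regression without X_2 and compares the estimated slope with \beta_1 + \beta_2 \mathrm{Cov}(X_1,X_2)/\mathrm{Var}(X_1), which is what the bias term reduces to for the slope in the single-regressor case.

```python
import numpy as np


def simulate_short_regression(n=200_000, beta=(1.0, 2.0, 3.0), rho=0.5, seed=0):
    """Simulate Y = b0 + b1*X1 + b2*X2 + eps, then regress Y on X1 only.

    All parameter values are hypothetical and chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    beta0, beta1, beta2 = beta

    # X1 and X2 both have unit variance and correlation rho.
    x2 = rng.normal(size=n)
    x1 = rho * x2 + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
    eps = rng.normal(size=n)
    y = beta0 + beta1 * x1 + beta2 * x2 + eps

    # Misspecified ("short") regression: X = [1, X1], X2 is omitted.
    X = np.column_stack([np.ones(n), x1])
    b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Slope implied by the bias formula, using sample moments.
    cov = np.cov(x1, x2)
    predicted_slope = beta1 + beta2 * cov[0, 1] / cov[0, 0]
    return b_hat, predicted_slope


b_hat, predicted_slope = simulate_short_regression()
print("estimated slope on X1:      ", b_hat[1])
print("slope implied by the formula:", predicted_slope)
```

With these assumed values the true slope is 2, but the estimate settles near 2 + 3 * 0.5 = 3.5, so the bias does not vanish even in a very large sample.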

If we specify the model correctly, i.e. include X_2 among the explanatory variables, the error term is simply \epsilon, so the second term in the third row, \mathbb{E}\left[ (X'X)^{-1}X'u \right], becomes \mathbb{E}\left[ (X'X)^{-1}X'\epsilon \right] and collapses to zero. This follows from the exogeneity assumption: by the law of iterated expectations, \mathbb{E}\left[ (X'X)^{-1}X'\epsilon \right] = \mathbb{E}\left[ (X'X)^{-1}X' \mathbb{E}(\epsilon|X) \right] = 0, since \mathbb{E}(\epsilon|X)=0. In particular, each of the explanatory variables is uncorrelated with the error term \epsilon, i.e. \mathbb{E}(X'\epsilon)=0.
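
The contrast between the two cases can also be checked numerically. The following sketch is an illustration under assumed values (the coefficients, sample size, and normal distributions are hypothetical). It computes the sample analogue of \mathbb{E}(X'\epsilon) for the correctly specified model and of \mathbb{E}(X'u) for the model that omits X_2, and it fits the full regression.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
beta0, beta1, beta2 = 1.0, 2.0, 3.0  # hypothetical true parameters

# X1 depends on X2, so the two regressors are correlated.
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)
eps = rng.normal(size=n)
y = beta0 + beta1 * x1 + beta2 * x2 + eps

# Correctly specified model: X = [1, X1, X2], error term is eps.
X_full = np.column_stack([np.ones(n), x1, x2])
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
moment_full = X_full.T @ eps / n  # sample analogue of E(X'eps)

# Misspecified model: X = [1, X1], error term u absorbs beta2 * X2.
X_short = np.column_stack([np.ones(n), x1])
u = eps + beta2 * x2
moment_short = X_short.T @ u / n  # sample analogue of E(X'u)

print("full-model estimates:", b_full)
print("X'eps / n (full):    ", moment_full)
print("X'u / n (short):     ", moment_short)
```

In this setup the moments for the full model are all close to zero and the estimates are close to the assumed parameters, while the second entry of `moment_short` stays close to \beta_2 \mathbb{E}(X_1 X_2) = 3 * 0.5 = 1.5.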

 

