One reason why an omitted variable leads to biased estimates is that omitting a relevant variable violates assumption 3 of the classical linear regression model (CLRM), which states that all explanatory variables must be exogenous, i.e. uncorrelated with the error term.
From this post, we know that omitting a relevant variable from the regression causes the error term and the explanatory variables to be correlated.
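Suppose that the data generating process in the population is as follows:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$$

However, we omit a variable, in this case $x_2$, and estimate the following regression model:

$$y = \beta_0 + \beta_1 x_1 + \epsilon,$$

where $\epsilon = \beta_2 x_2 + u$ and $\beta_2 \neq 0$.

Note that if $Cov(x_1, x_2) \neq 0$, it implies that $Cov(x_1, \epsilon) \neq 0$. Hence, we violate assumption three. That is, the error term will be correlated with the included explanatory variable $x_1$.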
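To make the covariance argument concrete, here is a minimal simulation sketch. The coefficient values, the sample size, and the way $x_2$ is tied to $x_1$ are all assumptions chosen for illustration; they do not come from any particular data set.

```python
import numpy as np

# Hypothetical setup: all numbers below are illustrative assumptions.
rng = np.random.default_rng(0)
n = 100_000
beta2 = 2.0                          # coefficient on the omitted variable

x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)   # x2 is correlated with x1 by construction
u = rng.normal(size=n)               # true error, independent of x1 and x2

# Error term of the misspecified model: it absorbs the omitted variable.
eps = beta2 * x2 + u

cov_x1_x2 = np.cov(x1, x2)[0, 1]
cov_x1_u = np.cov(x1, u)[0, 1]
cov_x1_eps = np.cov(x1, eps)[0, 1]

print(f"Cov(x1, x2)  = {cov_x1_x2:.3f}")
print(f"Cov(x1, u)   = {cov_x1_u:.3f}")
print(f"Cov(x1, eps) = {cov_x1_eps:.3f}")
```

In this sketch the sample covariance between `x1` and `u` should be close to zero, while the covariance between `x1` and `eps` should be close to `beta2` times the covariance between `x1` and `x2`, which is exactly the channel through which the omitted variable makes $x_1$ endogenous.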
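Now, how can we prove mathematically that omitting $x_2$ indeed causes endogeneity? To prove this, let's start from the expected value of the OLS estimator. Let $X$ denote the full matrix of included explanatory variables, in our case a constant and $x_1$, and let $\epsilon$ be the error term containing $x_2$, that is $\epsilon = \beta_2 x_2 + u$. Additionally, let $\beta$ be the vector of parameters that we want to estimate, i.e. $\beta = (\beta_0, \beta_1)'$.

$$\begin{aligned}
\mathbb{E}(\hat{\beta}) &= \mathbb{E}\left[(X'X)^{-1}X'y\right] \\
&= \mathbb{E}\left[(X'X)^{-1}X'(X\beta + \epsilon)\right] \\
&= \beta + \mathbb{E}\left[(X'X)^{-1}X'\epsilon\right]
\end{aligned}$$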
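Immediately, we see that $\mathbb{E}(\hat{\beta}) \neq \beta$, i.e. the OLS estimator $\hat{\beta}$ is biased. This is always the case whenever $\mathbb{E}\left[(X'X)^{-1}X'\epsilon\right] \neq 0$, i.e. whenever we have a correlation between $X$ and $\epsilon$.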
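The bias can also be seen numerically. The following Monte Carlo sketch repeatedly draws data from an assumed population model, estimates the regression without $x_2$ using $(X'X)^{-1}X'y$, and averages the slope estimates. The parameter values, the sample size, and the number of replications are hypothetical choices made for this illustration.

```python
import numpy as np

# Hypothetical population parameters (assumptions for illustration only).
beta0, beta1, beta2 = 1.0, 1.0, 2.0
n, reps = 500, 2000
rng = np.random.default_rng(42)

short_slopes = np.empty(reps)
for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.5 * x1 + rng.normal(size=n)   # omitted variable, correlated with x1
    u = rng.normal(size=n)
    y = beta0 + beta1 * x1 + beta2 * x2 + u

    # Misspecified model: constant and x1 only, x2 is left out.
    X = np.column_stack([np.ones(n), x1])
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^{-1} X'y
    short_slopes[r] = beta_hat[1]

mean_short_slope = short_slopes.mean()
print(f"true beta1 = {beta1:.2f}, average estimate = {mean_short_slope:.3f}")
```

Under these assumptions the average estimate should settle well above the true `beta1` (near `beta1 + 0.5 * beta2`), and drawing more replications does not bring it back: the gap is bias, not sampling noise.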
If we specify our model correctly, so that $X$ also includes $x_2$, the second term in the third row would be $\mathbb{E}\left[(X'X)^{-1}X'u\right]$ and collapse to zero. That is, $\mathbb{E}(\hat{\beta}) = \beta$. This happens because $\mathbb{E}\left[(X'X)^{-1}X'u\right] = 0$, since $\mathbb{E}(u \mid X) = 0$, which holds by the original assumption that each of the explanatory variables is exogenous, i.e. unrelated to the error term $u$.
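As a counterpart, a short Monte Carlo sketch of the correctly specified model is given here. It assumes the same kind of hypothetical population as before (illustrative coefficients, correlated regressors, independent error $u$) and uses NumPy's least-squares routine to fit the regression that includes $x_2$.

```python
import numpy as np

# Hypothetical population parameters (assumptions for illustration only).
beta0, beta1, beta2 = 1.0, 1.0, 2.0
n, reps = 500, 2000
rng = np.random.default_rng(7)

full_slopes = np.empty(reps)
for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.5 * x1 + rng.normal(size=n)   # still correlated with x1
    u = rng.normal(size=n)               # exogenous error term
    y = beta0 + beta1 * x1 + beta2 * x2 + u

    # Correctly specified model: constant, x1 and x2 are all included.
    X_full = np.column_stack([np.ones(n), x1, x2])
    beta_hat, *_ = np.linalg.lstsq(X_full, y, rcond=None)
    full_slopes[r] = beta_hat[1]         # coefficient on x1

mean_full_slope = full_slopes.mean()
print(f"true beta1 = {beta1:.2f}, average estimate = {mean_full_slope:.3f}")
```

Here the average estimate should sit close to the true `beta1` even though $x_1$ and $x_2$ remain correlated, because the remaining error term $u$ is unrelated to the regressors.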