The second part of the series on the Omitted Variable Bias aims to deepen the reader’s understanding of the bias. Let’s continue with the example from the Introduction. Let the dependent variable be the price of a car and the explanatory variables be the car’s mileage and the car’s age. In our case, both mileage and age are important factors that determine the price of a car.

The Venn diagram below illustrates the problems that arise when we omit an important variable from our regression analysis. Note that the overlap of miles and price (area C) is the true impact of the variable miles on price. The overlap of age and price (area D) is the true impact of the variable age on price. Now, assume that you include mileage in your regression analysis but omit age. By doing so, you are estimating the impact of miles on price from areas C and B and not just area C. What can you say about the estimate of miles in the regression? What will be the consequences of neglecting age? Here are some general statements on what will happen if you neglect an important variable, in our case age.

- The coefficient of miles is biased because area B, which it now absorbs, is variation in price that belongs to both miles and age.
- Since the coefficient of miles is now estimated from areas B and C rather than from area C alone, its variance is reduced.
- Finally, the unexplained variance of price (the dependent variable) increases because you have omitted an important variable.
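The three statements above can be checked with a small simulation. The following is only a minimal sketch under stated assumptions, not part of the original example: the true coefficients, the noise level, the sample size, and the way mileage depends on age are all hypothetical numbers chosen for illustration. The regressors are held fixed across replications and only the error term is redrawn, which is the setting in which the variance statement is usually made.

```python
import numpy as np

# Hypothetical "true" model (all numbers are illustrative assumptions):
# price = 30000 + BETA_MILES * miles + BETA_AGE * age + noise
BETA_MILES = -50.0   # price change per 1,000 miles
BETA_AGE = -500.0    # price change per year of age
SIGMA = 1000.0       # standard deviation of the error term


def ols(X, y):
    """Return OLS coefficients and the estimated residual variance."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    # Degrees-of-freedom correction: n minus number of estimated coefficients.
    return coef, resid.var(ddof=X.shape[1])


def simulate(n=200, reps=2000, seed=0):
    rng = np.random.default_rng(seed)

    # Fixed design: older cars tend to have more miles (in thousands),
    # so miles and age are positively correlated.
    age = rng.uniform(1, 15, size=n)
    miles = 12 * age + rng.normal(0, 20, size=n)

    ones = np.ones(n)
    X_full = np.column_stack([ones, miles, age])   # age included
    X_short = np.column_stack([ones, miles])       # age omitted

    full, short = [], []
    for _ in range(reps):
        # Only the error term changes between replications.
        noise = rng.normal(0, SIGMA, size=n)
        price = 30000 + BETA_MILES * miles + BETA_AGE * age + noise
        coef_f, var_f = ols(X_full, price)
        coef_s, var_s = ols(X_short, price)
        full.append((coef_f[1], var_f))
        short.append((coef_s[1], var_s))

    def summarize(rows):
        rows = np.array(rows)
        return {
            "mean_coef": rows[:, 0].mean(),      # average miles coefficient
            "sd_coef": rows[:, 0].std(ddof=1),   # its sampling spread
            "resid_var": rows[:, 1].mean(),      # average unexplained variance
        }

    return {"full": summarize(full), "short": summarize(short)}


results = simulate()
for name, stats in results.items():
    print(name, {k: round(float(v), 2) for k, v in stats.items()})
```

Under these assumptions, the regression that omits age should report a miles coefficient that is centered away from the hypothetical true value, with a smaller spread across replications and a larger residual variance than the regression that includes age.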

The following post further explains the nature of the omitted variable bias. In particular, it discusses the effect of the bias on individual coefficients.

###### Omitted Variable Bias

- Overview
- Introduction
- Understanding the Bias
- Explanation and Example
- Consequences
- What can we do about it?
- Concluding Remarks

Very nice article! I’m struggling to understand why the variance of the coefficient on miles would be smaller if we omit the variable age. Could someone clarify this, please?

Thank you! Omitting the variable age from our regression reduces the variance of the estimated coefficient of miles because we attribute both areas B and C to the variable miles. Hence, we get a much stronger signal from the data, even though it is a wrong signal, which leads to a more precise estimate of the coefficient. Your question is somewhat related to the problem of multicollinearity. Hope this helps.

Cheers, ad