From this post, we know that omitting a relevant variable from the regression causes the error term and the explanatory variables to be correlated.

Suppose that the data generating process in the population is as follows:

$y = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + u$

However, we omit a variable, in this case $x_{2}$, and estimate the following regression model:

$y = \beta_0 + \beta_1 x_{1} + \epsilon$,

where $\epsilon = \beta_2 x_{2} + u$ and $\mathbb{E}(u \mid x_{1}, x_{2}) = 0$.

Note that, if $\mathrm{Cov}(x_{1}, x_{2}) \neq 0$, it implies that $\mathrm{Cov}(x_{1}, \epsilon) \neq 0$. Hence, we violate assumption three. That is, the error term will be correlated with another explanatory variable.

Now, how can we mathematically prove that omitting $x_{2}$ indeed causes endogeneity? To prove this, let's start from the limit of the OLS estimator. Let $X$ denote the full matrix of explanatory variables, in our case $X = [\mathbf{1} \; x_{1}]$, and let $\epsilon$ be the error term containing $x_{2}$, that is $\epsilon = \beta_2 x_{2} + u$. Additionally, let $\beta$ be the vector of parameters that we want to estimate, i.e. $\beta = (\beta_0, \beta_1)'$.

$\hat{\beta} = (X'X)^{-1}X'y$
$\phantom{\hat{\beta}} = (X'X)^{-1}X'(X\beta + \epsilon)$
$\phantom{\hat{\beta}} = \beta + (X'X)^{-1}X'\epsilon$

Immediately, we see that $\mathbb{E}(\hat{\beta}) \neq \beta$. This is always the case whenever $\mathrm{Cov}(x_{1}, x_{2}) \neq 0$, i.e. whenever we have a correlation between $x_{1}$ and $x_{2}$.

In case we specify our model correctly, the second term in the third row would be $\mathbb{E}[(X'X)^{-1}X'u]$ and collapse to zero. That is, $\mathbb{E}(\hat{\beta}) = \beta$. This happens because $\mathbb{E}[(X'X)^{-1}X'u] = 0$, since $\mathbb{E}(X'u) = 0$, which holds because of the original assumption that each of the explanatory variables is uncorrelated with the error term $u$.
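The argument can also be checked numerically. The following Python sketch (all parameter values are purely illustrative, not taken from the data used in this post) simulates the data generating process, estimates both the misspecified and the correctly specified model, and shows that the slope on $x_{1}$ in the short regression is off by exactly $\beta_2 \cdot \mathrm{Cov}(x_{1}, x_{2})/\mathrm{Var}(x_{1})$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Data generating process: y = b0 + b1*x1 + b2*x2 + u, with x1 and x2 correlated.
# All parameter values are illustrative assumptions.
b0, b1, b2 = 1.0, 2.0, 3.0
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)   # Cov(x1, x2) != 0
u = rng.normal(size=n)
y = b0 + b1 * x1 + b2 * x2 + u

# Short regression: omit x2, so the error eps = b2*x2 + u is correlated with x1.
X_short = np.column_stack([np.ones(n), x1])
beta_short = np.linalg.lstsq(X_short, y, rcond=None)[0]

# Long (correctly specified) regression including x2.
X_long = np.column_stack([np.ones(n), x1, x2])
beta_long = np.linalg.lstsq(X_long, y, rcond=None)[0]

# beta_short[1] is pushed away from b1 = 2 by b2 * Cov(x1,x2)/Var(x1) = 3 * 0.5 = 1.5,
# while beta_long[1] recovers b1.
```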

Our dependent variable will be my weekly average weight, the explanatory variable $x_{1}$ is my calorie intake during the previous week, and $x_{2}$ is a binary variable that takes a value of 1 in case I was cycling the week before and 0 otherwise. For a more detailed description of the data see here.

```julia
# load a couple of packages
using Distributions
using GLM
using DataFrames
using DataArrays

# load Taro - Pkg to read Excel data
using Taro
Taro.init()

# get data
path = "https://economictheoryblog.files.wordpress.com/2016/08/data.xlsx"
data = Taro.readxl(download(path), "data", "A1:C357")
data = deleterows!(data, find(isna(data[:,1]) | isna(data[:,2])))
data[:,1] = convert(DataArrays.DataArray{Float64,1}, data[:,1])
data[:,2] = convert(DataArrays.DataArray{Float64,1}, data[:,2])
data[:,3] = convert(DataArrays.DataArray{Float64,1}, data[:,3])

# estimate the linear regression model
glm(@formula(weight ~ lag_calories + lag_cycling), data, Normal(), IdentityLink())
```

All in all, the omitted variable bias is a severe problem. Neglecting a relevant variable leads to biased and inconsistent estimates. Hence, as a general piece of advice, when you are working with linear regression models, you should pay close attention to potentially omitted variables. In particular, you should ask yourself the following questions:

- What variables might potentially impact the dependent variable but are not (yet) included in the model?
- Out of the variables identified in question one, what variables are likely to be correlated with other explanatory variables included in the model?
- For those omitted variables that are likely to be correlated with the dependent variable and at least one other explanatory variable, what is the expected sign of the correlation — positive or negative?
- Based on the sign of the correlation, what bias — upward or downward — are the estimates suffering from?
- Finally, you should ask yourself what the magnitude of the bias is. Could it be strong enough to completely invalidate your regression results?

Generally, with time (and experience) it will become easier to determine what variables are important and relevant and which variables are not.

Finally, from the previous posts, you should know that the omitted variable bias not only leads to biased estimates, but also to a decrease in variance. In certain cases, one might want to weigh one against the other, i.e. the increase in bias versus the decrease in variance. That is, sometimes it can be better to have biased but precise estimates rather than unbiased but imprecise ones. Thus, if you want to decrease the variance, the trade-off is an increase in bias, and if you want to decrease the bias, the trade-off is an increase in variance.
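This bias-variance trade-off can be illustrated with a small Monte Carlo experiment. The Python sketch below (parameter values are purely illustrative) repeatedly estimates a short regression that omits $x_{2}$ and a long regression that includes it: the short regression is biased, but its slope estimates vary less across samples.

```python
import numpy as np

rng = np.random.default_rng(1)
b0, b1, b2 = 0.0, 1.0, 1.0
n, reps = 200, 2000

short_est, long_est = [], []
for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)      # correlated with x1
    y = b0 + b1 * x1 + b2 * x2 + rng.normal(size=n)
    Xs = np.column_stack([np.ones(n), x1])        # short: x2 omitted
    Xl = np.column_stack([np.ones(n), x1, x2])    # long: correctly specified
    short_est.append(np.linalg.lstsq(Xs, y, rcond=None)[0][1])
    long_est.append(np.linalg.lstsq(Xl, y, rcond=None)[0][1])

short_est, long_est = np.array(short_est), np.array(long_est)
# Short regression: biased (mean near b1 + b2*0.8 = 1.8) but smaller sampling variance.
# Long regression: unbiased (mean near 1.0) but larger sampling variance.
```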

- Overview
- Introduction
- Understanding the Bias
- Explanation and Example
- Consequences
- Concluding Remarks


In order to understand the consequences of the omitted variable bias, we first have to understand what is needed to obtain good estimates. When studying linear regression models, you necessarily come across the Gauss-Markov theorem. This theorem states that if your regression model fulfills a set of assumptions (the assumptions of the classical linear regression model), then you will obtain the best linear unbiased estimates (BLUE). One important assumption of this set states that the error term of the regression model must be uncorrelated with the explanatory variables. However, as you will see in a minute, omitting a relevant variable introduces a correlation between the explanatory variables and the error term.

What happens when you omit an important variable? From the introductory post, you should know that one of the conditions for an omitted variable bias to exist is that the omitted variable is correlated with the dependent variable and with at least one other explanatory variable. Now, when omitting a variable, it will show up in the residual, i.e. it will show up in the error term. Thus, the error term and the explanatory variables are necessarily going to be correlated. This clearly violates the assumption that the error term and the explanatory variables must be uncorrelated. A violation of this assumption causes the OLS estimator to be biased and inconsistent. For a mathematical proof of this statement see this post.

Furthermore, when looking at the discussion using the Venn diagram, note that omitting a variable causes the unexplained variance of Y (the dependent variable) to increase as well as the variance of the estimated coefficient to decrease. This might lead to a situation in which you reject the null-hypothesis and believe that your coefficients are statistically significant at a given significance level although they are not.
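The spurious-significance effect is easy to see in a small example. The Python sketch below (with made-up parameter values) compares the conventional standard error of the coefficient of $x_{1}$ in the misspecified and the correctly specified model: the short regression reports a smaller standard error even though its coefficient is badly biased, which is exactly what inflates apparent significance.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)          # correlated with x1
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

def ols(X, y):
    """OLS estimates and conventional standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se

Xs = np.column_stack([np.ones(n), x1])            # short: x2 omitted
Xl = np.column_stack([np.ones(n), x1, x2])        # long: correctly specified
bs, ses = ols(Xs, y)
bl, sel = ols(Xl, y)
# ses[1] < sel[1]: the short regression looks more precise, yet bs[1] is biased upward.
```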

**How serious is the omitted variable bias…**

The problem of the omitted variable bias is pretty serious. An omitted variable leads to biased and inconsistent coefficient estimates. And as we all know, biased and inconsistent estimates are not reliable. From our previous post, you might remember how omitting a variable can change the signs of the coefficients, depending on the correlation of the omitted variable with the dependent variable and the explanatory variables. Thus, even the signs of the coefficients become unreliable. Hence, the regression model will fail completely.

**… And what can be done about it?**

Dealing with an omitted variable bias is not easy. However, one can try several things.

First, if the required data are available, you can try to include as many relevant variables as you can in the regression model. Of course, this has other possible implications that you have to consider carefully. First, you need a sufficient number of data points to include additional explanatory variables, or else you will not be able to estimate your model. Second, depending on how many extra variables you include, the issue of including unnecessary variables may arise and start to seriously influence your estimates. However, additional explanatory variables can help to mitigate the problems associated with the omitted variable bias. That is, additional control variables can lower the bias.

Second, if you think that a variable is important and leaving it out of your regression model could cause an omitted variable bias, but at the same time you do not have data for it, you can look for proxies or find instrumental variables for the omitted variables. For instance, in the car price example that we discussed earlier, the omitted variable was the age of the car. Suppose you do not have data on the age of the car, but you know how long the last owner was in possession of the car; then the time the car was owned by the last owner can be taken as a proxy for the age of the car. Note, however, that using proxies and instrumental variables comes with a whole set of additional assumptions and problems, most of which are quite complicated and not easily met.

Third, if you cannot resolve the omitted variable bias, you can at least try to predict the direction in which your estimates are biased.
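For the simple two-regressor case, the direction of the bias follows directly from the population formula $\mathrm{bias} = \beta_2 \cdot \mathrm{Cov}(x_{1}, x_{2})/\mathrm{Var}(x_{1})$. A minimal Python helper (the numbers plugged in are illustrative) makes the sign logic explicit:

```python
def omitted_variable_bias(b2, cov_x1x2, var_x1):
    """Population bias of the short-regression slope: b2 * Cov(x1, x2) / Var(x1)."""
    return b2 * cov_x1x2 / var_x1

# The sign of the bias is the product of the signs of b2 and Cov(x1, x2):
upward = omitted_variable_bias(b2=3.0, cov_x1x2=0.5, var_x1=1.0)     # positive: upward bias
downward = omitted_variable_bias(b2=-3.0, cov_x1x2=0.5, var_x1=1.0)  # negative: downward bias
```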


Let the data generating process be as follows:

$price = \beta_0 + \beta_1 \, miles + \beta_2 \, age + u$

This implies that the price of a car is determined by its mileage and its age. However, for whatever reason you omit the variable age in your regression model and estimate the following reduced regression model:

$price = \beta_0 + \beta_1 \, miles + \epsilon$, where $\epsilon = \beta_2 \, age + u$.

What are the effects of omitting the variable age? What will happen to the coefficient of miles? What sign do you expect for $\hat{\beta}_1$?

Let’s elaborate on these questions. A priori, one would expect that a higher mileage lowers the price of a car. Hence, we would expect $\beta_1$ to have a negative sign, i.e. $\beta_1 < 0$. One would further expect that an older car is cheaper and hence traded at a lower price. Also, one would expect that an older car has more miles. We can therefore conclude that,

- Price and miles are negatively correlated
- Miles and age are positively correlated.

What does this imply for our regression analysis? We know now that a large number of miles lowers the price of a car. But, if a car has many miles it tends to be older. Thus, when omitting the variable age, the variable miles may actually be accounting also for the effects of age and not only miles.

Thus, $\hat{\beta}_1$ suffers from a bias.

But can we say something more about the bias? Yes. We know that $\hat{\beta}_1$ suffers from a downward bias. This is because both age and miles have a negative effect on the price. Leaving out age lets the coefficient of miles pick up part of the negative effect of age.

Hence, it follows that $\mathbb{E}(\hat{\beta}_1) < \beta_1$, i.e. the estimated effect of miles is more negative than the true effect. This implies that if $\hat{\beta}_1 < 0$, it is not necessarily true that $\beta_1 < 0$.
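The downward bias can be reproduced in a short simulation. In the Python sketch below all numbers (true coefficients, miles per year of age, noise levels) are hypothetical; the point is only that the estimated miles coefficient comes out more negative than the true one once age is omitted.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Hypothetical car market: both miles and age lower the price, and older cars
# tend to have more miles (positive correlation between the two regressors).
age = rng.uniform(0, 15, size=n)
miles = 10_000 * age + rng.normal(0, 20_000, size=n)
price = 30_000 - 0.05 * miles - 500 * age + rng.normal(0, 1_000, size=n)

# Short regression: age omitted, so its negative effect loads onto miles.
X = np.column_stack([np.ones(n), miles])
beta_hat = np.linalg.lstsq(X, price, rcond=None)[0]
# beta_hat[1] is more negative than the true coefficient -0.05: downward bias.
```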

The illustration below summarizes the direction of the omitted variable bias. Let Y be the dependent variable, A and B the explanatory variables, and B the omitted variable.

| | Corr(A, B) positive | Corr(A, B) negative |
| --- | --- | --- |
| Effect of B on Y positive | upward bias | downward bias |
| Effect of B on Y negative | downward bias | upward bias |

The Venn diagram below illustrates the problems that arise when we omit an important variable from our regression analysis. Note that the overlap of miles and price (area C) is the true impact of the variable miles on price. The overlap of age and price (area D) is the true impact of the variable age on price. Now, assume that you include mileage in your regression analysis, but you omit age. By doing so, you are estimating the impact of miles on price by areas C and B and not just area C. What can you say about the estimate of miles in the regression? What will be the consequences of neglecting age? Here are some general statements on what will happen if you neglect an important variable, in our case age.

- The coefficient of miles is biased because area B actually belongs to both variables miles and age.
- Since the coefficient of the variable miles is estimated by both areas B and C, i.e. by more variation, its variance is reduced.
- Finally, the unexplained variance of price (the dependent variable) increases because you have omitted an important variable.

The following post further explains the nature of the omitted variable bias. Particularly, the post discusses the effects of the omitted variable bias on single coefficients.


The omitted variable bias occurs because of a misspecification of the linear regression model. The problem can arise for various reasons, either because the effect of the omitted variable on the dependent variable is unknown or because the variable is simply not available. In the latter case, you might be forced to omit that variable from your model. However, one needs to be aware that omitting a variable might lead to an over-estimation (upward bias) or under-estimation (downward bias) of the coefficient of one or more explanatory variables.

In order for the omitted variable to bias your coefficients, two requirements must be fulfilled:

- The omitted variable must be correlated with the **dependent** variable.
- The omitted variable must be correlated with *one or more other explanatory* variables.

In our example, the age of the car is negatively correlated with the price of the car and positively correlated with the car's mileage. Hence, omitting age in your regression results in an omitted variable bias.
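Both conditions are easy to check on simulated car data. The Python sketch below (all numbers are hypothetical) confirms that age is negatively correlated with the dependent variable price and positively correlated with the regressor miles, so omitting it biases the regression:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000

# Hypothetical car data, same structure as in the example.
age = rng.uniform(0, 15, size=n)
miles = 10_000 * age + rng.normal(0, 20_000, size=n)
price = 30_000 - 0.05 * miles - 500 * age + rng.normal(0, 1_000, size=n)

# Condition 1: the omitted variable (age) is correlated with the dependent variable (price).
r_price_age = np.corrcoef(price, age)[0, 1]
# Condition 2: the omitted variable is correlated with another regressor (miles).
r_miles_age = np.corrcoef(miles, age)[0, 1]
```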

Part three of the series on the omitted variable bias intends to increase the reader's understanding of the bias.

Generally, one can use various libraries to read Excel files, including XLSXReader, ExcelReaders, or Taro. This tutorial will focus on Taro as it created the fewest problems and provides – at least in my eyes – an easy-to-understand syntax. In order to download and read an Excel file into Julia, it is sufficient to execute the following lines of code. This script downloads some sample data provided by this blog and reads it into Julia.

```julia
# in case you have not installed the package yet
Pkg.add("Taro")

# load Taro - Pkg to read Excel data
using Taro
Taro.init()

# get data
path = "https://economictheoryblog.files.wordpress.com/2016/08/data.xlsx"
df = Taro.readxl(download(path), "data", "A1:C357")

# the dataframe df now contains the downloaded data
```

This routine will read the selected data fields, i.e. field A1 to field C357, of the sheet "data" into a dataframe in Julia. You can also directly consult the documentation of the Taro package here.


The following code produces the Venn diagram used in the post explaining the omitted variable bias.

```r
# start with an empty workspace
rm(list = ls())

# load eulerr pkg
library(eulerr)

# create sets
fit <- euler(c(Price = 500, Miles = 500, Age = 500,
               "Price&Miles" = 200, "Price&Age" = 200, "Miles&Age" = 200))

png(filename = "venn.png")
plot(fit, fill_opacity = 0.3)
dev.off()
```

In order to read an Excel file from your computer into Julia, it is sufficient to execute the following lines of code. However, if you want to download the Excel file directly from the web and read it into Julia, you should look at this post.

```julia
# in case you have not installed the package yet
Pkg.add("Taro")

# load Taro - Pkg to read Excel Data
using Taro
Taro.init()

# get data
df = Taro.readxl("path to Excel", "sheet1", "A1:C357")
```

This code will read the selected data fields, i.e. field A1 to field C357, of sheet1 into a dataframe in Julia. You can also directly consult the documentation of the Taro Package here.
