# Violation of CLRM – Assumption 4.1: Consequences when the expected value of the error term is non-zero

Violating assumption 4.1 of the OLS assumptions, i.e. $E(\epsilon_i|X) = 0$, can affect our estimation in various ways. How exactly a violation affects our estimates depends on the way in which $E(\epsilon_i|X) = 0$ is violated. This post looks at different cases and elaborates on the consequences of each violation. We start with a less severe case and then move on to a far more serious violation of assumption 4.1.

## 1st case of violating OLS assumption 4.1

The first case we consider is when the conditional expected value of the error term $\epsilon_i$ given $X$ is not zero, but a non-zero constant $c$, i.e. $E(\epsilon_i|X) = c$. One can show that violating $E(\epsilon_i|X) = 0$ in this way leads "only" to a wrongly estimated intercept, while the other beta coefficients are not affected. We demonstrate this with a simple example. Assume that $\epsilon_i = c + \mu_i$, where $c$ is a constant and $\mu_i$ fulfills the usual error term assumptions, i.e. $\mu_{i} \sim iid(0,\sigma^{2})$. In this case we can rewrite the model $y_{i} = \beta_0 + \beta_{1}x_{i}+\epsilon_{i}$ as

$y_{i} = (\beta_0 + c) + \beta_{1}x_{i} + \mu_{i}$

Note that because $c$ is a constant, it is simply added to the intercept. This follows from the fact that

$E(\epsilon_i|X) = E(c + \mu_i|X) = c + E(\mu_i|X) = c$

In this case the OLS estimation of $\beta_{1}$ is not affected by the violation of the assumption. However, we are not able to estimate $\beta_{0}$ and $c$ separately. The estimated intercept rather captures the sum of the two, $\beta_{0} + c$. In this case the data does not contain enough information to fully identify $\beta_{0}$.
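To make this concrete, here is a small simulation sketch (not from the original post; the true coefficients and the value of $c$ below are assumptions chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data for the model y_i = beta_0 + beta_1 * x_i + eps_i
# with eps_i = c + mu_i and mu_i ~ iid N(0, 1); all numbers are assumptions.
n, beta0, beta1, c = 100_000, 2.0, 3.0, 5.0
x = rng.normal(size=n)
y = beta0 + beta1 * x + (c + rng.normal(size=n))

# OLS with an intercept: regress y on Z = [1, x]
Z = np.column_stack([np.ones(n), x])
intercept_hat, slope_hat = np.linalg.lstsq(Z, y, rcond=None)[0]

print(slope_hat)      # close to beta_1 = 3: the slope is unaffected
print(intercept_hat)  # close to beta_0 + c = 7, not beta_0 = 2
```

The slope estimate recovers $\beta_1$, while the estimated intercept picks up $\beta_0 + c = 7$ rather than $\beta_0 = 2$, exactly as the algebra above predicts.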

## 2nd case of violating OLS assumption 4.1

Things turn messy if the conditional expected value of the error term is not a constant but a function of the explanatory variables, i.e. $X$.

Including an intercept does not solve the problem in this case. If we include an intercept, it will only pick up a magnitude that depends on the specific sample and the realizations of $X$. Put differently, when the conditional expected value of the error term is not a constant but a function of $X$, the estimated intercept is no longer the coefficient of a constant. It rather depends on $X$ through the non-constant conditional mean of the error term, i.e. $E(\epsilon_i|X) = f(X)$, where $f()$ is a non-constant function.

We will show that the OLS estimator is no longer unbiased if the conditional expected value of the error term is a function of the explanatory variables. In the following, you will find a mathematical demonstration of this statement. Assume that $E(\epsilon_i|X) = f(X)$, i.e. the conditional expected value of the error term is a function of $X$. We now derive the OLS estimator (you can find a more detailed derivation of the OLS estimator here) under this assumption. We start one step ahead of the usual notation of the basic OLS model by separating the intercept from the rest of the explanatory variables:

$y = \iota \alpha + X \beta + \epsilon$

where $\iota$ is the intercept column, i.e. simply a vector of ones. We now join $\iota$ and $X$ into the matrix $Z$, where $Z$ is nothing else than $X$ with an additional first column full of ones. Further, we stack $\alpha$ and $\beta$ into $\gamma$. This leaves us with

$y = Z \gamma + \epsilon$

We derive the OLS estimator for the model above which leaves us with

$\hat{\gamma} = \gamma + (Z'Z)^{-1} Z'\epsilon$
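Spelling out the step leading to this expression: substitute $y = Z\gamma + \epsilon$ into the least squares formula,

$\hat{\gamma} = (Z'Z)^{-1}Z'y = (Z'Z)^{-1}Z'(Z\gamma + \epsilon) = \gamma + (Z'Z)^{-1}Z'\epsilon$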

where $(Z'Z)^{-1}Z'\epsilon$ is the bias term, which would vanish in expectation if $E(\epsilon_i|X) = 0$, or affect only the intercept if $E(\epsilon_i|X) = c$. However, in our case we have

$E(\epsilon_i|X) = f(X)$

which by assumption is non-zero for all $i$.

Consequently, if the conditional expected value of the error term is a function of $X$, the OLS estimator is biased, even if we include an intercept in the regression. Furthermore, this also means that we lose the Gauss-Markov efficiency result. Finally, in the case that the conditional expected value of the error term is a function of $X$, the following holds true

$E(\epsilon_i|X) = f(x_i) \neq f(x_j) = E(\epsilon_j|X) \quad \text{for } x_i \neq x_j$

which means that $\epsilon_i$ has a different mean (and, in general, a different variance) for each $i$. In other words, the distribution of $\epsilon$ is conditional on $X$ and consequently varies across $i$.

This has several implications for our estimator. First, we face heteroscedasticity. Second, we obtain biased estimates of the coefficients, and there is no way to tell in which direction the bias goes. Note that not even the assumption of normally distributed error terms solves this problem. Hence, hypothesis testing is no longer valid if the conditional expected value of the error term is a function of $X$. Put slightly differently, but with the same meaning: we lose all finite sample properties.
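A simulation sketch illustrates this bias (again, not from the original post; the data-generating process $f(x) = x^2$ and all numbers are assumptions chosen so that $x$ and $f(x)$ are correlated):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup where E(eps|x) = f(x) = x^2, as with an omitted quadratic term;
# the true coefficients below are assumptions for the simulation.
n, beta0, beta1 = 100_000, 1.0, 2.0
x = rng.uniform(0.0, 2.0, size=n)   # asymmetric support, so x and x^2 are correlated
eps = x**2 + rng.normal(size=n)     # conditional mean of the error depends on x
y = beta0 + beta1 * x + eps

# OLS with an intercept does not rescue us here
Z = np.column_stack([np.ones(n), x])
intercept_hat, slope_hat = np.linalg.lstsq(Z, y, rcond=None)[0]

# The slope estimate is pulled away from beta_1 = 2; for this design the
# asymptotic bias is Cov(x, x^2)/Var(x) = 2, so slope_hat tends towards 4.
print(slope_hat)
```

Even with the intercept included, the slope estimate converges to roughly 4 instead of the true value 2, and nothing in the fitted regression warns us about this.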

Finally, you had better be careful when interpreting your findings if the conditional expected value of the error term is a function of the explanatory variables $X$. Nevertheless, if that is all you have, you might still want to draw asymptotically valid inferences. In order to do so, you have to fulfill several additional assumptions.