# Graphically Illustrate Multicollinearity: Venn Diagram

Multicollinearity is a common problem in econometrics. As explained in a previous post, multicollinearity arises when we have too few observations to precisely estimate the effects of two or more highly correlated variables on the dependent variable. This post illustrates the problem graphically using Venn diagrams. All of the Venn diagrams below represent the following regression model:

$y =x_1 + x_2 + \epsilon$

Each circle depicts the variance of one variable in the regression model. That is, the circle $y$ depicts the variance of the dependent variable $y$, the circle $x_1$ depicts the variance of $x_1$, and the circle $x_2$ depicts the variance of $x_2$. Overlapping areas show variation that variables have in common. For instance, the area where $y$ and $x_1$ overlap represents the variation in $y$ that can be explained by $x_1$.
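One rough way to put numbers on these overlaps is the squared sample correlation between two variables. The sketch below is an illustration only and not part of the original figures: the helper name `shared_variation`, the simulated data, and the sample size are all assumptions, and a Venn diagram remains a heuristic rather than an exact geometric picture of variance.

```python
import numpy as np

def shared_variation(a, b):
    """Squared sample correlation: share of the variance of a
    that is linearly associated with b (and vice versa)."""
    r = np.corrcoef(a, b)[0, 1]
    return r ** 2

# Simulated data (assumption): independent standard normal regressors
# and noise, combined as in the model y = x1 + x2 + eps.
rng = np.random.default_rng(0)
n = 1000
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
y = x1 + x2 + rng.standard_normal(n)

print("overlap of y and x1: ", round(shared_variation(y, x1), 3))
print("overlap of y and x2: ", round(shared_variation(y, x2), 3))
print("overlap of x1 and x2:", round(shared_variation(x1, x2), 3))
```

With this setup each regressor should share roughly a third of its variation with $y$, while the two regressors share almost none with each other.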

In Figure 1, the circles $x_1$ and $x_2$ both intersect with the circle $y$, but there is no overlap between the circle $x_1$ and the circle $x_2$. In this case, $x_1$ and $x_2$ are both correlated with $y$, but the two explanatory variables are uncorrelated with each other. Thus, one can precisely identify the effect of each explanatory variable ($x_1$ and $x_2$) on the dependent variable ($y$).
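The no-overlap case can be checked numerically. The following sketch assumes simulated data and a model without an intercept, so that "no overlap" corresponds to the two regressors being exactly orthogonal in the sample. Under that assumption, the coefficients from the joint regression coincide with those from regressing $y$ on each variable separately.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.standard_normal(n)
z = rng.standard_normal(n)

# Remove from z its projection on x1, so that x1 and x2 are exactly
# orthogonal in this sample (the "no overlap" case).
x2 = z - (z @ x1) / (x1 @ x1) * x1
y = x1 + x2 + rng.standard_normal(n)

# Joint regression of y on x1 and x2 (no intercept, by assumption).
X = np.column_stack([x1, x2])
beta_joint, *_ = np.linalg.lstsq(X, y, rcond=None)

# Separate one-variable regressions.
beta_x1_alone = (x1 @ y) / (x1 @ x1)
beta_x2_alone = (x2 @ y) / (x2 @ x2)

print("joint:   ", beta_joint)
print("separate:", beta_x1_alone, beta_x2_alone)
```

Because nothing is shared between $x_1$ and $x_2$, leaving one of them out does not change the estimate for the other.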

Figure 2 shows a case in which the two explanatory variables are somewhat correlated. The circles $x_1$ and $x_2$ now overlap, meaning that the two variables have some variation in common. Part of this shared area also overlaps with $y$, so there is variation in $y$ that cannot be attributed to one explanatory variable alone, and it becomes harder to determine the effect of each one. Still, although $x_1$ and $x_2$ are correlated, there is enough independent variation left to determine the effects of $x_1$ and $x_2$ rather precisely.

Moderate multicollinearity is not much of a concern. However, if the correlation between two or more explanatory variables is very strong, it gets increasingly hard to precisely estimate the pure effect of one explanatory variable on the dependent variable. Figure 3 depicts a case in which the variables $x_1$ and $x_2$ are strongly correlated. Less and less variation is left that can be associated with only one explanatory variable and $y$. In this case we need more data to precisely estimate the effect of one explanatory variable on the dependent variable. In general, multicollinearity makes our estimates less precise.
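A small simulation can make the loss of precision concrete. This is a sketch under stated assumptions, not a result from the figures: the regressors are standard normal, the true coefficients are both one as in the model above, there is no intercept, and the helper names `ols_with_se` and `simulate` are made up for this illustration. The parameter `rho` controls the correlation between $x_1$ and $x_2$.

```python
import numpy as np

def ols_with_se(X, y):
    """OLS coefficients and classical standard errors (no intercept)."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)          # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(XtX_inv))   # standard errors of beta
    return beta, se

def simulate(rho, n=200, seed=0):
    """Draw x1, x2 with population correlation rho and fit y = x1 + x2 + eps."""
    rng = np.random.default_rng(seed)
    x1 = rng.standard_normal(n)
    z = rng.standard_normal(n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * z   # correlated with x1 by construction
    eps = rng.standard_normal(n)
    y = x1 + x2 + eps
    return ols_with_se(np.column_stack([x1, x2]), y)

for rho in (0.0, 0.5, 0.95):
    beta, se = simulate(rho)
    print(f"rho={rho:.2f}  beta={np.round(beta, 2)}  se={np.round(se, 3)}")
```

Since every value of `rho` reuses the same random draws, the residuals are identical across runs, and any change in the standard errors comes purely from how much independent variation $x_1$ and $x_2$ have left. Raising `n` in `simulate` shrinks the standard errors again, which is the sense in which more data helps.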

Finally, as already stated in this post, multicollinearity does not cause problems from a mathematical point of view as long as we do not have perfect multicollinearity. In a Venn diagram, perfect multicollinearity between $x_1$ and $x_2$ would mean that the circle of $x_1$ and the circle of $x_2$ are identical, i.e. the two circles overlap perfectly. One variable is then an exact linear function of the other. There is no independent variation left with which to estimate the separate effects, and the estimator breaks down because we violate the full rank assumption, the second of the Gauss-Markov assumptions.
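The full rank violation is easy to see numerically. In the sketch below, the choice $x_2 = 2 x_1$ is an assumed example of perfect multicollinearity, and the data are simulated. The design matrix has two columns but only rank one, so $X'X$ is singular (or numerically indistinguishable from singular) and cannot be reliably inverted.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.standard_normal(100)
x2 = 2.0 * x1                      # assumption: x2 is an exact multiple of x1
X = np.column_stack([x1, x2])

rank = np.linalg.matrix_rank(X)    # number of linearly independent columns
cond = np.linalg.cond(X.T @ X)     # huge or infinite when X'X is singular

print("columns:", X.shape[1])
print("rank:   ", rank)
print("condition number of X'X:", cond)
```

A rank below the number of columns means the OLS normal equations have no unique solution, which is what the identical circles represent.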