Multicollinearity, or collinearity, refers to a situation where two or more explanatory variables of a regression model are highly correlated. Because of the high correlation, it is difficult to disentangle the pure effect of a single explanatory variable on the dependent variable. From a mathematical point of view, multicollinearity only becomes an issue when we face perfect multicollinearity, that is, when one explanatory variable is an exact linear combination of the others (for example, when the same variable enters the regression model twice).
However, when talking about multicollinearity we rarely refer to the case of perfect multicollinearity. Much more often we refer to the case where two or more variables are highly, but not perfectly, correlated. This kind of multicollinearity does not violate any of the Gauss-Markov assumptions, and the OLS estimator is still BLUE. That is, under multicollinearity the OLS estimator is still unbiased and has the lowest variance among all linear unbiased estimators. Moreover, not only are the coefficients estimated efficiently, the standard errors are also estimated correctly. Thus, all confidence intervals and test statistics remain valid.
Although the OLS estimator remains efficient under multicollinearity, the resulting estimates are not very good: even though it is possible to estimate all coefficients and standard errors, the obtained estimates are very imprecise, which shows up as large standard errors and wide confidence intervals. The problem of multicollinearity can best be thought of as a small sample size problem. That is, if we have only a few observations, we cannot precisely estimate certain relationships. A similar problem occurs when we have multicollinearity in our data. If we have variables that are strongly correlated with one another, they often do not contain enough independent information to allow precise estimates. That is, it is difficult to disentangle the separate effects of two highly collinear variables on the dependent variable.
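The loss of precision can be illustrated with a small numerical sketch. The example below is only an illustration under stated assumptions: the function name `slope_standard_errors`, the sample size, the error variance, and the simulated regressors are all hypothetical and not taken from the text. It computes the standard errors implied by the usual OLS variance formula, sigma^2 (X'X)^(-1), for two regressors whose correlation is set to different values.

```python
import numpy as np


def slope_standard_errors(rho, n=200, sigma=1.0, seed=0):
    """Standard errors of the two slope coefficients implied by
    sigma^2 * (X'X)^(-1), for two simulated regressors whose
    population correlation is rho (all settings are illustrative)."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    x = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
    X = np.column_stack([np.ones(n), x])           # constant plus two regressors
    var_beta = sigma**2 * np.linalg.inv(X.T @ X)   # OLS coefficient covariance
    return np.sqrt(np.diag(var_beta))[1:]          # drop the constant


for rho in (0.0, 0.9, 0.99):
    se = slope_standard_errors(rho)
    print(f"rho = {rho:4.2f}  standard errors = {se}")
```

With the sample size held fixed, the standard errors grow as the correlation approaches one, which is the sense in which multicollinearity behaves like a shortage of data.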
Perfect Multicollinearity and the Dummy Variable Trap
The Gauss-Markov assumptions require the matrix of explanatory variables, $X$, to have full column rank. In the case of perfect multicollinearity, we violate assumption 2 of the Gauss-Markov assumptions, as at least one variable can be represented as a linear combination of one or more of the other variables. In this case, $X$ does not have full rank. Hence, under perfect multicollinearity the matrix $X'X$ is singular and cannot be inverted, and the OLS estimator is not defined.
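A minimal sketch of why the estimator is not defined in this case, using made-up data: the variables `x1`, `x2`, `x3` and the coefficient vectors below are purely hypothetical. When one regressor is an exact linear combination of others, the regressor matrix loses rank, and two different coefficient vectors produce exactly the same fitted values, so the data cannot single out one of them.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2.0 * x1 - x2                       # exact linear combination of x1 and x2
X = np.column_stack([np.ones(n), x1, x2, x3])

rank = np.linalg.matrix_rank(X)
print("columns:", X.shape[1], "rank:", rank)   # rank is smaller than the number of columns

# Since 2*x1 - x2 - x3 = 0, moving the coefficients along (0, 2, -1, -1)
# leaves the fitted values unchanged.
beta_a = np.array([1.0, 0.5, 0.5, 1.0])
null_direction = np.array([0.0, 2.0, -1.0, -1.0])
beta_b = beta_a + 3.0 * null_direction

same_fit = np.allclose(X @ beta_a, X @ beta_b)
print("identical fitted values:", same_fit)
```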
The most common case of perfect multicollinearity occurs when we specify binary variables. Assume, for instance, the following: we are interested in demography and would like to know whether men live longer than women. To answer this question, we gather all necessary data, i.e. age of death, sex, occupation, marital status, and several other variables, and specify a binary variable that takes a value of one if an observation refers to a man and zero otherwise. Additionally, we specify a second binary variable that takes a value of one in the case of a woman and zero otherwise. We then try to explain our variable of interest, i.e. age of death, using the two binary variables and additional controls. In this case, our regression model would look something like this:

$$\text{age of death}_i = \beta_0 + \beta_1 \, \text{male}_i + \beta_2 \, \text{female}_i + \dots + \varepsilon_i$$

where $\text{male}_i$ and $\text{female}_i$ are the two binary variables and the dots stand for the additional controls.
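Written in matrix form, the regressor matrix of our OLS estimator would then be defined as

$$X = \begin{pmatrix} 1 & \text{male}_1 & \text{female}_1 & \cdots \\ 1 & \text{male}_2 & \text{female}_2 & \cdots \\ \vdots & \vdots & \vdots & \\ 1 & \text{male}_n & \text{female}_n & \cdots \end{pmatrix}$$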
Note that the first three columns of the matrix above are linearly dependent. That is, the first column of the matrix, the regression constant, can be expressed as the sum of column 2 (the binary variable for men) and column 3 (the binary variable for women). In this case, the regressor matrix $X$ does not have full rank and we cannot compute estimates for the coefficients. This problem is also referred to as the dummy variable trap. The way to solve this problem is simply to omit one of the binary variables. In practice, perfect multicollinearity is rarely an issue and is easily detected, because the estimator cannot be computed. Most statistical software packages automatically detect perfect multicollinearity and issue a warning or simply drop one variable.
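The dummy variable trap can be reproduced in a few lines. The sketch below uses synthetic data: the sample size, the variable names, and the numbers used to generate `age_at_death` are arbitrary assumptions for illustration and are not demographic results. It shows that the constant equals the sum of the two binary variables, that the matrix with both of them is rank deficient, and that dropping one restores full rank so that OLS can be computed.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
male = rng.integers(0, 2, size=n)        # 1 for a man, 0 otherwise
female = 1 - male                        # 1 for a woman, 0 otherwise
const = np.ones(n)

# Synthetic outcome; the values 80, -5 and 3 are arbitrary illustration values.
age_at_death = 80.0 - 5.0 * male + rng.normal(scale=3.0, size=n)

X_trap = np.column_stack([const, male, female])   # constant and both binary variables
X_ok = np.column_stack([const, male])             # one binary variable omitted

trap_identity = np.allclose(const, male + female) # constant = male + female
rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)
print("trap: columns", X_trap.shape[1], "rank", rank_trap)
print("ok:   columns", X_ok.shape[1], "rank", rank_ok)

# With full column rank, the normal equations have a unique solution.
beta_hat = np.linalg.solve(X_ok.T @ X_ok, X_ok.T @ age_at_death)
print("estimates (constant, male):", beta_hat)
```

With one binary variable omitted, the coefficient on the remaining one measures the difference relative to the omitted group, which here is women.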