Sunday, October 10, 2010

My Understanding about Linear Regression - Part IV

Multicollinearity


Multicollinearity is a condition in which the independent variables included in a model are correlated with each other. The real damage it causes is inflated standard errors for the estimated coefficients. In simpler terms, it deflates the estimated t-statistics of the correlated (multicollinear) variables, so that genuinely significant variables appear insignificant. Multicollinearity can be identified using the Variance Inflation Factor (VIF), a statistic calculated for each variable in a model. A VIF greater than 2 may suggest that the variable concerned is collinear with others in the model and may need to be dropped. The VIF equals 1/Tolerance, so it is always greater than or equal to 1.


A variance inflation factor is attached to each variable in the model and measures the severity of multicollinearity for that variable. Statistically, the VIF for the ith variable is defined as

VIF_i = 1 / (1 − R²_i)

where R²_i is the R² value obtained by regressing the ith predictor on the remaining predictors. Note that a variance inflation factor exists for each of the i predictors in a multiple regression model.



X_i = β_0 + β_1X_1 + β_2X_2 + … + β_kX_k, with the term for k = i omitted from the right-hand side
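As a sketch of the definition above, each VIF can be computed by regressing one predictor on the rest and plugging the resulting R² into 1/(1 − R²). The data and variable names here are made up purely for illustration (x2 is built to be nearly x0 + x1, so all three should show large VIFs):

```python
import numpy as np

# Toy data: x2 is almost a linear combination of x0 and x1 (collinear by construction).
rng = np.random.default_rng(0)
n = 200
x0 = rng.normal(size=n)
x1 = rng.normal(size=n)
x2 = x0 + x1 + rng.normal(scale=0.05, size=n)
X = np.column_stack([x0, x1, x2])

def vif(X):
    """VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing
    predictor i on the remaining predictors (with an intercept)."""
    n, p = X.shape
    out = []
    for i in range(p):
        y = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

print(vif(X))  # all three VIFs are far above the usual rule-of-thumb cutoffs
```

With uncorrelated predictors the same function returns values close to 1, which is one quick way to sanity-check it.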



To determine which variables are collinear with each other, we need to look at the collinearity diagnostics. For this, we perform a factor analysis on the independent variables, in which factor loadings are computed for each variable on each factor. If two variables are collinear, their loadings on a particular factor will be high compared with those of the other variables. We start with the most important factor. Once two variables with high collinearity between them are located, one of them has to be removed from the model. The condition index is calculated from this same analysis of the independent variables: values of 10-30 indicate moderate multicollinearity among the regression variables, while values above 30 indicate strong multicollinearity.
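Condition indices of the kind described above are commonly computed from the singular values of the column-scaled design matrix (an eigen-decomposition rather than a full factor-analysis routine). A minimal sketch, with made-up data in which one predictor is a near-copy of the sum of the other two:

```python
import numpy as np

# Toy design matrix: x2 is nearly x0 + x1, i.e. a strong near-dependence.
rng = np.random.default_rng(1)
n = 200
x0 = rng.normal(size=n)
x1 = rng.normal(size=n)
x2 = x0 + x1 + rng.normal(scale=0.05, size=n)
X = np.column_stack([x0, x1, x2])

def condition_indices(X):
    """Scale each column to unit length, take the singular values of the
    scaled matrix, and report sigma_max / sigma_k for each k.  The largest
    value is the condition number of the scaled design matrix."""
    Xs = X / np.linalg.norm(X, axis=0)
    s = np.linalg.svd(Xs, compute_uv=False)
    return s.max() / s

print(condition_indices(X))  # the largest index exceeds 30 for this data
```

By construction the smallest singular value reflects the near-dependence x2 ≈ x0 + x1, which pushes the largest condition index well past the "strong multicollinearity" threshold of 30 quoted above.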



Once a linear regression model is created, various validation techniques are used to quantify the effectiveness of the model. A good model should pass these tests as well as show similar patterns in the modeling and validation data.
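One common check along these lines is to fit the model on a modeling (training) sample and compare a fit statistic such as R² on a held-out validation sample; similar values on both suggest the model generalizes. A sketch with synthetic data (the coefficients, sample sizes, and split here are all illustrative assumptions):

```python
import numpy as np

# Synthetic data from a known linear model with modest noise.
rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 2))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Split into a modeling (training) sample and a validation (holdout) sample.
train, valid = slice(0, 200), slice(200, n)
A = np.column_stack([np.ones(n), X])  # add an intercept column

# Fit by ordinary least squares on the modeling sample only.
beta, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)

def r_squared(A, y, beta):
    resid = y - A @ beta
    return 1 - resid.var() / y.var()

r2_train = r_squared(A[train], y[train], beta)
r2_valid = r_squared(A[valid], y[valid], beta)
print(r2_train, r2_valid)  # the two values should be close for a stable model
```

A large gap between the two R² values (training much higher than validation) would be a warning sign of overfitting, which is exactly the kind of dissimilar pattern the paragraph above cautions against.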


