Saturday, October 16, 2010

My Understanding about Linear Regression - Part V

Model Validation Statistics


Once a linear regression model is built, it is necessary to validate the performance of the model. There are various validation techniques. Some of the widely used techniques are listed below:

R-Square

Any model is only as good as its ability to predict the actual outcome with accuracy. R-Square measures the proportion of the variation in the actual data that is explained by the model. R-Square ranges between 0 and 1, and values over roughly 0.7 are generally taken to indicate a good fit between the predictions and the actual data.

Mean Absolute Percent Error (MAPE)

MAPE measures how far, on average, the predictions deviate from the actual data in percentage terms. For example, a MAPE of 10% means that, on average, the predictions from the model are 10% higher or lower than the actuals.

MAPE is defined by the formula:

MAPE = (1/n) * Σ ( |a − f| / a ) * 100

where a is the actual value, f is the predicted (forecast) value, and the sum runs over the n observations.
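
A minimal sketch of the MAPE calculation in SAS, assuming a scored dataset named scored with columns actual and predicted (hypothetical names):

data ape;
   set scored;
   /* absolute percentage error for each observation */
   if actual ne 0 then ape = abs(actual - predicted) / abs(actual) * 100;
run;

proc means data=ape mean;
   var ape;   /* the mean of ape across observations is the MAPE */
run;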

Rank Ordering


This is an initial validation step, performed once the predicted values have been created. The observations are sorted in descending order of the predicted value and grouped into deciles. The actual values are then summarized for each decile; generally, the mean actual value is used for rank ordering. The model is said to rank order if these means decrease monotonically, i.e., the average in the 1st decile should be strictly higher than the average in the 2nd decile, and so on.

The equation estimated on the training data should be applied, unchanged, to the validation data when creating the deciles there.
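
A minimal sketch of the rank ordering check in SAS, assuming a scored dataset named scored with a predicted value pred and an actual value actual (hypothetical names):

proc rank data=scored out=ranked groups=10 descending;
   var pred;
   ranks decile;   /* decile 0 holds the highest predicted values */
run;

proc means data=ranked mean;
   class decile;
   var actual;     /* the mean actuals should decrease monotonically across deciles */
run;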

Sign Check

This is more of a check from a business point of view. The signs (+/-) of the parameters corresponding to each independent variable are checked to see whether they make business sense. For example, if a variable is supposed to have a positive relation with the predicted value, then its coefficient should carry a positive sign.

Once the model is finalized on the training data, the same model is fitted on the validation data and the signs of the coefficients of the independent variables are checked again.
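
A minimal sketch of this check, reusing the variable list from the model shown further down and assuming the validation sample sits in a dataset called valid (a hypothetical name):

proc reg data=valid;
   /* compare the signs of these coefficients with the training fit */
   /* and with business expectations                                */
   model ln_gross_rev = Tot_unit_AnyProd Flag_PC Flag_other
                        hh_size online_ordr_amt_avg age;
run;
quit;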

Lift Curve


A lift curve is a graphical representation of the cumulative percentage of the dependent variable (for example, if the dependent variable is revenue, the cumulative percentage of revenue in each decile) captured at a specific cut-off. The cut-off can be a particular decile or percentile. Similar to the rank ordering procedure, the data is sorted in descending order of the predicted value and then grouped into deciles/percentiles. The cumulative sum of the dependent variable is then computed for each decile/percentile. Taking the example from rank ordering, a lift curve for the same data can be built as sketched below.
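
A minimal sketch of the computation behind the lift curve, assuming the decile-ranked dataset ranked from the rank ordering sketch above, with actual as the dependent variable (for example, revenue):

proc means data=ranked noprint;
   class decile;
   var actual;
   output out=dec_sum sum=dec_rev;   /* total of the dependent variable per decile */
run;

data lift;
   set dec_sum;
   where _type_ = 1;                 /* keep only the per-decile rows */
   retain cum_rev 0;
   cum_rev + dec_rev;                /* running total across deciles */
run;

Dividing cum_rev by the overall total of the dependent variable gives the cumulative percentage that is plotted against the decile cut-off.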



Tests for Normality of Residuals


One of the assumptions of linear regression analysis is that the residuals are normally distributed. This assumption assures that the p-values for the t-tests will be valid. As before, we will generate the residuals (called r) and predicted values (called fv) and put them in a dataset (called elem1res). We will also keep the other independent variables in that dataset.

proc reg data=best_model;
   model ln_gross_rev = Tot_unit_AnyProd
                        Flag_PC
                        Flag_other
                        hh_size
                        online_ordr_amt_avg
                        age;
   /* write the residuals (r) and predicted values (fv) to ELEM1RES, */
   /* keeping the dependent and independent variables as well        */
   output out=elem1res (keep= ln_gross_rev Tot_unit_AnyProd Flag_PC Flag_other
                              hh_size online_ordr_amt_avg age r fv)
          residual=r predicted=fv;
run;
quit;

/* kernel density estimate of the residuals */
proc kde data=elem1res out=den;
   var r;
run;

proc sort data=den;
   by r;
run;

/* plot the estimated density against the residuals */
goptions reset=all;
symbol1 c=blue i=join v=none height=1;
proc gplot data=den;
   plot density*r=1;
run;
quit;

QQ plot of the residuals

PROC UNIVARIATE will produce a normal quantile graph. The qqplot statement plots the quantiles of a variable against the quantiles of a normal distribution. The QQ plot is most sensitive to non-normality near the two tails.

goptions reset=all;
proc univariate data=elem1res normal;
   var r;
   qqplot r / normal(mu=est sigma=est);
run;

Some cosmetic treatments of a model – lessons I have learnt from experience


Sometimes I have seen that a model does not rank order. There are several methods to check and correct a rank ordering problem:

a. We may exclude one or more variables at a time and rebuild the linear regression model to see when we get rank ordering in both the modeling and validation samples.

b. One can plot each of the independent variables against the deciles created for rank ordering. Each independent variable should follow a trend.

i. One can verify the sign of the coefficient of each independent variable.

ii. In case an independent variable does not follow a trend, one can use different transformations, such as those listed below (a sketch follows this list):

1. Linear transformations (for example, for an independent continuous variable, one can create an ordinal variable: if recency is between 3.1 and 4.33 years then recency_1 = 3; if recency is < 3.1 but >= 2.5 then recency_1 = 2; else recency_1 = 1).

2. Quadratic transformations are also useful at times.

3. One can use parabolic and hyperbolic transformations too, but those are difficult to explain.
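
A minimal sketch of these transformations in SAS, assuming a training dataset named train with a continuous variable recency (hypothetical names):

data train2;
   set train;
   /* ordinal recoding of recency, as in point 1 */
   if 3.1 <= recency <= 4.33 then recency_1 = 3;
   else if 2.5 <= recency < 3.1 then recency_1 = 2;
   else recency_1 = 1;
   /* quadratic term, as in point 2 */
   recency_sq = recency**2;
run;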

Autocorrelation


Another way in which the assumption of independence can be broken is when data are collected on the same variables over time. Let's say that we collect medicine usage data every quarter for 12 years. In this situation it is likely that the errors for observations in adjacent quarters will be more highly correlated than those for observations more separated in time. This is known as autocorrelation. When you have data that can be considered time-series, you should use the dw option, which performs a Durbin-Watson test for correlated residuals.

Durbin-Watson Statistic:

One peculiar feature of data recorded over time, like monthly sales, is that it tends to be correlated over time. For example, high sales months tend to be followed by high sales months, and low sales months by more low sales months. This may be caused by seasonal/cyclical trends or by seasonal promotion, marketing or competitive effects. Whatever the factor causing this correlation, correlated errors violate one of the fundamental assumptions needed for least squares regression: independence of errors, in other words random errors. The Durbin-Watson statistic is a measure used to detect such correlations; every model has a single Durbin-Watson statistic. It ranges in value from 0 to 4, with an ideal value of 2 indicating that the errors are not correlated (although values from 1.75 to 2.25 may be considered acceptable). A value significantly below 2 indicates positive correlation, and a value significantly above 2 suggests negative correlation. In either case the model specification needs to be reviewed to identify potentially omitted or redundant variables.
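
A minimal sketch of requesting the Durbin-Watson statistic with the dw option on the model statement of PROC REG, reusing the model specification shown earlier (this assumes the observations are ordered by time):

proc reg data=best_model;
   /* the dw option prints the Durbin-Watson statistic for the residuals */
   model ln_gross_rev = Tot_unit_AnyProd Flag_PC Flag_other
                        hh_size online_ordr_amt_avg age / dw;
run;
quit;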

References:

http://www.ats.ucla.edu/stat/sas/library/SASReg_mf.htm

http://www.ats.ucla.edu/stat/sas/webbooks/reg/chapter2/sasreg2.htm

http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#/documentation/cdl/en/statug/63033/HTML/default/statug_reg_sect007.htm

http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm

http://www.sfu.ca/sasdoc/sashtml/stat/chap55/sect38.htm


Sunday, October 10, 2010

Why use multiple linear regression?

- To investigate a collection of factors for their potential association with the outcome of interest.

- To investigate a collection of known relevant factors for their ability to predict the outcome of interest.

The GOAL

- To obtain a parsimonious set of variables that efficiently predicts the response variable of interest.

Model Selection

- PROC REG supports a variety of model selection methods but does not support a CLASS statement.

- PROC GLM supports the CLASS statement but does not include the model selection methods.

- PROC GLMSELECT supports the CLASS statement and includes model selection methods, but does not include regression diagnostics, hypothesis testing, LS-means, etc. (see the sketch after this list)

  - Only available in SAS 9.2

  - Can be downloaded from the SAS website for 9.1

  - http://support.sas.com/rnd/app/da/glmselect.html
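
A minimal sketch of PROC GLMSELECT combining a CLASS statement with stepwise selection; the dataset train, the class variable region and the model variables below are placeholder names:

proc glmselect data=train;
   class region;                            /* categorical predictor handled via CLASS */
   model y = region x1 x2 x3
         / selection=stepwise(select=sl);   /* stepwise selection by significance level */
run;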

My Understanding about Linear Regression - Part IV

Multicollinearity


Multicollinearity is a condition in which independent variables included in a model are correlated with each other. The real damage caused by multicollinearity is that it inflates the standard errors of the estimated coefficients. In simpler terms, it depresses the estimated t-statistics for correlated (multicollinear) variables, so that variables which are actually significant appear to be insignificant. Multicollinearity can be identified by the Variance Inflation Factor (VIF), a statistic calculated for each variable in a model. A VIF greater than 2 may suggest that the concerned variable is multicollinear with others in the model and may need to be dropped. The VIF is 1/Tolerance, and it is always greater than or equal to 1.


A variance inflation factor is attached to each variable in the model and measures the severity of multicollinearity for that variable. Statistically, the VIF for the ith variable is defined as

VIFi = 1 / (1 − R²i)

where R²i is the R² value obtained by regressing the ith predictor on the remaining predictors, i.e. from the auxiliary regression

xi = β0 + β1x1 + β2x2 + … + βkxk ,  k ≠ i

Note that a variance inflation factor exists for each of the predictors in a multiple regression model.



To determine which variables are collinear with each other, we need to look at the collinearity diagnostics. For this, we can do a factor analysis in which factor loadings corresponding to each variable are computed for each factor. If two variables are collinear, their loadings on a particular factor will be higher than those of the other variables. We start with the most important factor. Once two variables with high collinearity between them are located, one of them is removed from the model. The condition index is calculated using a factor analysis on the independent variables: values of 10-30 indicate moderate multicollinearity among the regression variables, and values above 30 indicate strong multicollinearity.
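
A minimal sketch of requesting these diagnostics in SAS: the vif, tol and collin options on the model statement of PROC REG print the variance inflation factors, tolerances and condition indices (reusing the model from the Part V post above):

proc reg data=best_model;
   model ln_gross_rev = Tot_unit_AnyProd Flag_PC Flag_other
                        hh_size online_ordr_amt_avg age / vif tol collin;
run;
quit;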



Once a linear regression model is created, various validation techniques are used to quantify the effectiveness of the model. A good model should pass these tests as well as show similar patterns in the modeling and validation data.