Saturday, October 16, 2010

My Understanding about Linear Regression - Part V

Model Validation Statistics


Once a linear regression model is built, its performance needs to be validated. There are various validation techniques; some of the widely used ones are listed below:

R-Square

Any model is only as good as its ability to predict the actual outcome accurately. R-Square is a measure of how well the model explains the variation in the actual data. R-Square ranges between 0 and 1, with values over 0.7 generally indicating a good fit between the predictions and the actual data.
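For reference, R-Square compares the model's squared prediction errors with the total variation in the actual data; the usual definition is:

R-Square = 1 - Sigma(a - f)^2 / Sigma(a - abar)^2

Where a is the actual value, f is the predicted value and abar is the mean of the actual values.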

Mean Absolute Percent Error (MAPE)

MAPE is a measure of how far, on average, the predictions are from the actual data in percentage terms. For example, a MAPE of 10% means that, on average, the predictions from the model will be 10% higher or lower than the actual values.

MAPE is defined by the formula:

MAPE = (1/n) * Sigma( |a - f| / a ) * 100

Where a is the actual value, f is the predicted value and n is the number of observations.
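A minimal SAS sketch of this calculation, assuming a hypothetical scored dataset (called scored here) that holds the actual value in a variable named actual and the model prediction in a variable named predicted:

/* Absolute percent error per record; dataset and variable names are placeholders */
data ape;
   set scored;
   ape = abs((actual - predicted) / actual) * 100;
run;

/* The mean of ape across all records is the MAPE of the model */
proc means data=ape mean;
   var ape;
run;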

Rank Ordering


This is an initial validation step, carried out once the predicted values are created. The predicted values are sorted in descending order and grouped into deciles. The actual values are then summarized for each decile; generally, the mean of the actual values is used for rank ordering. The model is said to rank order if these means follow a monotonically decreasing pattern, i.e., the average in the 1st decile is strictly higher than the average in the 2nd decile, and so on.

The equation estimated on the training data should also be used to score the validation data and create the deciles there.
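A rough SAS sketch of this decile construction, assuming a hypothetical scored dataset (scored) with the actual value (actual) and the model prediction (predicted):

/* Decile 0 holds the highest predicted values because of the DESCENDING option */
proc rank data=scored out=ranked groups=10 descending;
   var predicted;
   ranks decile;
run;

/* Mean actual value per decile; these means should decrease monotonically */
proc means data=ranked mean;
   class decile;
   var actual;
run;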

Sign Check

This is more of a check from a business point of view. The signs (+/-) of the parameters corresponding to each independent variable are checked to see whether they make business sense. For example, if a variable is supposed to have a positive relationship with the predicted value, then its sign should be positive.

Once the model is finalized on the training data, we fit the same model on the validation data and check the signs of the coefficients of the independent variables.
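One way to carry out this check in SAS is to write the estimated coefficients to a dataset with the OUTEST= option and inspect their signs, once on the training sample and once on a hypothetical validation sample (called valid_sample below):

/* One row of coefficient estimates whose signs can be compared against business logic */
proc reg data=valid_sample outest=valid_coefs;
   model ln_gross_rev = Tot_unit_AnyProd Flag_PC Flag_other
                        hh_size online_ordr_amt_avg age;
run;
quit;

proc print data=valid_coefs;
run;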

Lift Curve


A lift curve is a graphical representation of the cumulative percentage of the dependent variable (for example, if the dependent variable is revenue, the cumulative percentage of revenue in each decile) captured at a specific cut-off. The cut-off can be a particular decile or a percentile. Similar to the rank ordering procedure, the data is sorted in descending order of the predicted value and then grouped into deciles/percentiles. The cumulative sum of the dependent variable is then computed for each decile/percentile. Taking the example from rank ordering, the lift curve would plot these cumulative percentages against the deciles.
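A minimal sketch of the underlying lift table in SAS, reusing the hypothetical ranked dataset (with variables actual and decile) from the rank ordering sketch above:

/* Grand total and per-decile totals of the dependent variable */
proc sql noprint;
   select sum(actual) into :grand_total from ranked;
   create table lift as
   select decile, sum(actual) as dec_sum
   from ranked
   group by decile
   order by decile;
quit;

/* Cumulative % of the dependent variable captured through each decile */
data lift_curve;
   set lift;
   retain cum_sum 0;
   cum_sum + dec_sum;
   cum_pct = 100 * cum_sum / &grand_total;
run;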



Tests for Normality of Residuals


One of the assumptions of linear regression analysis is that the residuals are normally distributed. This assumption ensures that the p-values for the t-tests will be valid. Below, we generate the residuals (called r) and predicted values (called fv) and put them in a dataset (called elem1res). We also keep the other independent variables in that dataset.

/* Fit the model and output the residuals (r) and predicted values (fv) */
proc reg data=best_model;
   model ln_gross_rev = Tot_unit_AnyProd Flag_PC Flag_other
                        hh_size online_ordr_amt_avg age;
   output out=elem1res (keep= ln_gross_rev Tot_unit_AnyProd Flag_PC Flag_other
                              hh_size online_ordr_amt_avg age r fv)
          residual=r predicted=fv;
run;
quit;

/* Kernel density estimate of the residuals */
proc kde data=elem1res out=den;
   var r;
run;

proc sort data=den;
   by r;
run;

/* Plot the estimated density against the residuals */
goptions reset=all;
symbol1 c=blue i=join v=none height=1;
proc gplot data=den;
   plot density*r=1;
run;
quit;

QQ Plot of Residuals

PROC UNIVARIATE will produce a normal quantile graph. The QQPLOT statement plots the quantiles of a variable against the quantiles of a normal distribution and is most sensitive to non-normality near the two tails.

goptions reset=all;
proc univariate data=elem1res normal;
var r;
qqplot r / normal(mu=est sigma=est);
run;

Some cosmetic treatment of a model – lessons I have learnt from my experience


Sometimes the model does not rank order. There are several methods to check and correct the rank ordering problem:

a. We may exclude one or more variables at a time and rebuild the linear regression model to see when we get rank ordering in both the modeling and validation samples.

b. One can plot each of the independent variables against the deciles created for rank ordering. Each of the independent variables should follow a trend across the deciles.

i. One can verify the sign of the coefficient of each of the independent variables.

ii. In case an independent variable is not following a trend, one can use different transformations such as:

1. Linear transformation (for example, for an independent continuous variable, one can create a categorical variable: if recency is between 3.1 and 4.33 years then recency_1 = 3; if recency is < 3.1 but >= 2.5 then recency_1 = 2; else recency_1 = 1) – see the sketch after this list.

2. Quadratic transformations are also useful sometimes

3. One can use parabolic and hyperbolic transformations too but those are difficult to explain.
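A small SAS data step illustrating the recency binning from point 1, using the same hypothetical scored dataset and the cut-points quoted above:

/* Bin the continuous recency variable into an ordered categorical variable */
data scored_binned;
   set scored;
   if 3.1 <= recency <= 4.33 then recency_1 = 3;
   else if 2.5 <= recency < 3.1 then recency_1 = 2;
   else recency_1 = 1;
run;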

Autocorrelation


Another way in which the assumption of independence can be broken is when data are collected on the same variables over time. Let's say that we collect medicine usage data every quarter for 12 years. In this situation it is likely that the errors for observations in adjacent quarters will be more highly correlated than for observations further separated in time. This is known as autocorrelation. When you have data that can be considered a time series, you should use the DW option, which performs a Durbin-Watson test for correlated residuals.
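In SAS, the DW option on the MODEL statement of PROC REG requests this test; a sketch using the model from earlier (the data must be in time order for the statistic to be meaningful):

proc reg data=best_model;
   model ln_gross_rev = Tot_unit_AnyProd Flag_PC Flag_other
                        hh_size online_ordr_amt_avg age / dw;
run;
quit;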

Durbin-Watson Statistic:

One peculiar feature of data recorded over time, like monthly sales, is that it tends to be correlated over time. For example, high-sales months may tend to be followed by high-sales months and low-sales months by more low-sales months. This may be caused by seasonal/cyclical trends, seasonal promotions, or marketing or competitive effects. Whatever the factor causing this correlation, correlated errors violate one of the fundamental assumptions needed for least squares regression: independence of errors, or in other words random errors. The Durbin-Watson statistic is a measure used to detect such correlations; each fitted model has a single Durbin-Watson statistic. It ranges in value from 0 to 4, with an ideal value of 2 indicating that the errors are not correlated (although values from 1.75 to 2.25 may be considered acceptable). A value significantly below 2 indicates positive correlation and a value significantly above 2 suggests negative correlation. In either case the model specification needs to be reviewed to identify potentially omitted or redundant variables.

References:

http://www.ats.ucla.edu/stat/sas/library/SASReg_mf.htm

http://www.ats.ucla.edu/stat/sas/webbooks/reg/chapter2/sasreg2.htm

http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#/documentation/cdl/en/statug/63033/HTML/default/statug_reg_sect007.htm

http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm

http://www.sfu.ca/sasdoc/sashtml/stat/chap55/sect38.htm

