Thursday, August 12, 2010

My Understanding about Linear Regression - Part III

Outlier Treatment


A single observation that is substantially different from all other observations can make a large difference in the results of the regression analysis. If a single observation (or small group of observations) substantially changes the results, one would want to know about this and investigate further.

In linear regression, an outlier is an observation with large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity or may indicate a data entry error or other problem.

High values are known as upper outliers and low ones as lower outliers. Such values should be modified, or else they will bias the estimation of the model parameters. The simplest and most commonly used outlier treatment is to cap values above the 99th percentile or below the 1st percentile of the population. This means that if a value is above the 99th percentile, it is replaced by the value corresponding to the 99th percentile; values below the 1st percentile are capped the same way.
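As a minimal sketch of this capping in SAS (the dataset name raw and the variable x1 are illustrative, not taken from the examples below):

proc univariate data = raw noprint;
   var x1;
   output out = pctl p1 = p1_x1 p99 = p99_x1;   /* 1st and 99th percentile of x1 */
run;

data capped;
   if _n_ = 1 then set pctl;                    /* attach the cut-off values to every row */
   set raw;
   if x1 > p99_x1 then x1 = p99_x1;             /* cap upper outliers at the 99th percentile */
   else if x1 < p1_x1 then x1 = p1_x1;          /* cap lower outliers at the 1st percentile */
run;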

Leverage: An observation with an extreme value on a predictor variable is called a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean. These leverage points can have an effect on the estimate of regression coefficients.

Influence: An observation is said to be influential if removing the observation substantially changes the estimate of coefficients. Influence can be thought of as the product of leverage and outlierness.

Detecting Unusual and Influential Data

o Scatterplots of the dependent variable versus the independent variables

We need to examine the relationship between the dependent variable, Y1, and a continuous predictor, X1. We first look at a scatterplot with a regression line included, to see the relationship between Y1 and X1 and decide whether it appears to be linear (degree=1 is used for the regression line). We also look for any outliers. Here is the SAS code:

title "Scatter Plot with Regression Line";

proc sgplot data=a;
reg y=y1 x=x1 / degree=1;
run;

o Looking at the largest values of the studentized residuals, leverage, Cook's D, DFFITS and DFBETAs. Here is the code:

proc reg data = best_model;
   model ln_gross_rev = Tot_unit_AnyProd Flag_PC Flag_other hh_size
                        online_ordr_amt_avg age
                        / vif collin;
   /* save predictions, residuals, Cook's D, DFFITS and leverage for each observation */
   output out = predicted_output p = Predicted r = Error cookd = CooksD
          dffits = dffit h = lev;
run;
ods graphics off;

data a;
   set predicted_output;
   /* quantile of the F distribution with (K, n-K) df; here K = 7 parameters and n = 518,433 */
   cov = finv(0.05, 7, 518433);
   if CooksD > cov then flag_cook = 1;
   else flag_cook = 0;
run;

proc freq data = a;
   tables flag_cook;
run;

Ideally, flag_cook = 0 should be 100%. flag_cook = 1 indicates the presence of outliers. In case of unusual data points, we need to apply the proper treatments or remove those data points from the model data (if the percentage of unusual data points is not very high).
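If the share of flagged points is small, they can simply be dropped before refitting; a minimal sketch using the flag_cook variable created above (the dataset name model_clean is illustrative):

data model_clean;
   set a;
   if flag_cook = 0;   /* keep only observations not flagged by the Cook's D rule */
run;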

Cook's D is conventionally compared against an F distribution with (K, n-K) degrees of freedom, where n = number of observations and K = number of parameters in the model (including the intercept).

The lowest value that Cook's D can assume is zero, and the higher the Cook's D, the more influential the point. The conventional cut-off point is 4/n. We can either follow the method mentioned above or list any observation above the cut-off point as follows. Here is the code:

data xxx (keep = cust_id Tot_unit_AnyProd Flag_PC Flag_other hh_size
                 online_ordr_amt_avg age_n CooksD);
   set predicted_output;
   where CooksD > (4 / 518433);   /* conventional cut-off 4/n */
run;

(Note: In this example the model's dependent variable was revenue per customer, so the data is at the customer level, i.e. all the variables are summarized at the customer level and each customer has one row. That is why I have kept the cust_id variable in the above output; it helps identify which customers have the most influential data points.)

Plot the DFFITS statistic by observation number. Observations whose DFFITS statistic is greater in magnitude than 2*sqrt(k/n), where n is the number of observations used and k is the number of regressors, are deemed to be influential.

DFFITS can be either positive or negative, with numbers close to zero corresponding to points with small or zero influence.

data yyy (keep = cust_id Tot_unit_AnyProd Flag_PC Flag_other hh_size
                 online_ordr_amt_avg age dffit);
   set predicted_output;
   where abs(dffit) > (2 * sqrt(6 / 518433));   /* cut-off 2*sqrt(k/n) with k = 6 */
run;

The DFFITS statistic is a scaled measure of the change in the predicted value for the ith observation and is calculated by deleting the ith observation. A large value indicates that the observation is very influential in its neighborhood of the X space.


The above measures are general measures of influence. One can also consider more specific measures of influence that assess how each coefficient is changed by deleting the observation. This measure is called DFBETA, and one is created for each of the predictors. This is more computationally intensive than summary statistics such as Cook's D, because the more predictors a model has, the more computation it involves. We can restrict our attention to only those predictors that we are most concerned with and see how well behaved those predictors are. In SAS, we use the ODS OUTPUT OutputStatistics statement to produce the DFBETAs for each of the predictors. The names of the new variables are chosen by SAS automatically and begin with DFB_.

/* Capture the influence diagnostics (including the DFBETAs) in a dataset */
ods output OutputStatistics = revbetas;
proc reg data = best_model;
   model ln_gross_rev = Tot_unit_AnyProd Flag_PC Flag_other hh_size
                        online_ordr_amt_avg age_n / influence;
   id cust_id;
run;

(Note: for large datasets the HTML output will be huge; once the code has run there is no need to save the printed ODS output, as all the DFBETAs are stored in the output dataset.)

This creates six variables: DFB_Tot_unit_AnyProd, DFB_Flag_PC, DFB_Flag_other, DFB_hh_size, DFB_online_ordr_amt_avg and DFB_age.

The DFBETAS statistics are scaled measures of the change in each parameter estimate and are calculated by deleting the ith observation:

DFBETAS_j(i) = (b_j - b_j(i)) / ( s_(i) * sqrt((X'X)_jj) )

where b_j is the jth coefficient estimate, b_j(i) and s_(i) are the corresponding estimates computed with the ith observation deleted, and (X'X)_jj is the (j,j)th element of (X'X)^-1.

In general, large values of DFBETAS indicate observations that are influential in estimating a given parameter. Belsley, Kuh, and Welsch recommend 2 as a general cutoff value to indicate influential observations and 2/sqrt(n) as a size-adjusted cutoff.
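A minimal sketch of applying the size-adjusted cutoff to the revbetas dataset created above (the flag variable names are illustrative, and the DFB_ column names should be checked against what the procedure actually generated):

data dfbeta_flags;
   set revbetas;
   cutoff = 2 / sqrt(518433);                    /* size-adjusted cutoff 2/sqrt(n) */
   flag_hh_size = (abs(DFB_hh_size) > cutoff);   /* influence on the hh_size coefficient */
   flag_age     = (abs(DFB_age)     > cutoff);   /* influence on the age coefficient */
run;

proc freq data = dfbeta_flags;
   tables flag_hh_size flag_age;
run;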

If a point lies far from the other data in the horizontal direction, it can become an influential observation. The reason for this distinction is that such points may have a significant impact on the slope of the regression line. We need to remove influential data points or apply the proper treatments to them.

Now let's look at the leverage values to identify observations that could have a potentially great influence on the regression coefficient estimates. Here is the sample SAS code:

proc univariate data = predicted_output plots plotsize = 30;
   var lev;   /* leverage values saved by the earlier OUTPUT statement */
run;

Generally, a point with leverage greater than (2k+2)/n should be carefully examined, where k is the number of predictors and n is the number of observations.
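As a sketch, this rule can be applied to the lev variable saved earlier, with k = 6 predictors and n = 518,433 as in the running example:

data lev_flags;
   set predicted_output;
   flag_lev = (lev > (2*6 + 2) / 518433);   /* leverage cut-off (2k+2)/n with k = 6 */
run;

proc freq data = lev_flags;
   tables flag_lev;
run;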

The following table summarizes the general rules of thumb we use for these measures to identify observations worthy of further investigation (where k is the number of predictors and n is the number of observations).

Measure        Cut-off
Leverage       > (2k+2)/n
abs(rstu)      > 2
Cook's D       > 4/n
abs(DFFITS)    > 2*sqrt(k/n)
abs(DFBETA)    > 2/sqrt(n)
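The earlier OUTPUT statement did not save studentized residuals; a minimal sketch of the abs(rstu) > 2 check, assuming PROC REG's RSTUDENT= output keyword and the dataset names used above:

proc reg data = best_model;
   model ln_gross_rev = Tot_unit_AnyProd Flag_PC Flag_other hh_size
                        online_ordr_amt_avg age;
   output out = resid_out rstudent = rstu;   /* externally studentized residuals */
run;

data resid_flags;
   set resid_out;
   flag_rstu = (abs(rstu) > 2);              /* rule of thumb: |studentized residual| > 2 */
run;

proc freq data = resid_flags;
   tables flag_rstu;
run;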

For more details, please refer to

http://www.ats.ucla.edu/stat/sas/webbooks/reg/chapter2/sasreg2.htm

Model building


Once the data has been cleaned through missing-value imputation and outlier treatment, the initial model is built on it. For the initial/first-cut model, all the independent variables are put into the model. The objective is to finally have a limited number of independent variables (5-10) which are significant in all respects, without sacrificing too much of the model performance. The reason for not including too many variables is that the model would be overfitted and would become unstable when tested on the validation sample. The variable reduction is done using forward, backward or stepwise variable selection procedures. These procedures are described below:

Forward Selection - In a forward selection analysis we start out with no predictors in the model. Each of the available predictors is evaluated with respect to how much R2 would be increased by adding it to the model. The one which will most increase R2 will be added if it meets the statistical criterion for entry. With SAS the statistical criterion is the significance level for the increase in the R2 produced by addition of the predictor. If no predictor meets that criterion, the analysis stops. If a predictor is added, then the second step involves re-evaluating all of the available predictors which have not yet been entered into the model. If any satisfy the criterion for entry, the one which most increases R2 is added. This procedure is repeated until there remain no more predictors that are eligible for entry.

Backwards Elimination - In a backwards elimination analysis we start out with all of the predictors in the model. At each step we evaluate the predictors which are in the model and eliminate any that meet the criterion for removal.

Stepwise Selection - With fully stepwise selection we start out just as in forward selection, but at each step variables that are already in the model are first evaluated for removal, and if any are eligible for removal, the one whose removal would least lower R2 is removed. You might wonder why a variable would enter at one point and leave later: a variable might enter early, being well correlated with the criterion variable, but later become redundant with the predictors that follow it into the model.

An entry significance level of 0.15, specified in the slentry=0.15 option, means a variable must have a p-value < 0.15 in order to enter the model during forward selection and stepwise regression. An exit significance level of 0.05, specified in the slstay=0.05 option, means a variable must have a p-value > 0.05 in order to leave the model during backward selection and stepwise regression. One can change the entry and exit criteria based on the situation and requirement.

The following SAS code performs the forward selection method by specifying the option selection=forward.

proc reg data = a outest = est1;
   model y = x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 ... xn / slstay=0.15 slentry=0.15
         selection=forward ss2 sse aic;
   output out = out1 p = p r = r;
run;

The following SAS code performs the backward elimination method by specifying the option selection=backward.

proc reg data = a outest = est2;
   model y = x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 ... xn / slstay=0.05 slentry=0.15
         selection=backward ss2 sse aic;
   output out = out1 p = p r = r;
run;

The following SAS code performs stepwise regression by specifying the option selection=stepwise.

proc reg data = a outest = est3;
   model y = x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 ... xn / slstay=0.05 slentry=0.15
         selection=stepwise ss2 sse aic;
   output out = out3 p = p r = r;
run;

Once the first-cut model is ready, one might have to do a couple of manual iterations based on business logic. After that, one needs to check for multicollinearity. One can use the following code for the same:

proc reg data = a outest = parameter;
   model y = x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 ... xn / vif collin;
   output out = out3 p = p r = r;
run;
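To screen the VIFs programmatically rather than reading the printed output, one possible sketch (assuming PROC REG's ParameterEstimates ODS table, whose VarianceInflation column is populated when the vif option is used; the cutoff of 10 is a common rule of thumb, and the variable list is illustrative):

ods output ParameterEstimates = vif_table;   /* capture estimates and VIFs in a dataset */
proc reg data = a;
   model y = x1 x2 x3 x4 x5 / vif;
   run;
quit;

data vif_check;
   set vif_table;
   flag_vif = (VarianceInflation > 10);      /* flag predictors with high multicollinearity */
run;

proc print data = vif_check;
   var Variable VarianceInflation flag_vif;
run;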

3 comments:

  1. Cook's D does not follow any known distribution. It's a popular notion that it follows the F distribution, but it's wrong. Check out Wiki... nowhere is the distribution mentioned.

  2. nice & crisp...keep it up Anjanita!

  3. @Argha: That is a new learning for me but it would be great if you can check the following links:

    http://people.virginia.edu/~der/pdf/der61.pdf

    http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_cook_distance.htm

    http://webscripts.softpedia.com/script/Scientific-Engineering-Ruby/Statistics-and-Probability/Cookdist-35791.html
