Wednesday, December 29, 2010

A Powerful Classification Technique in Data Mining - Discriminant Analysis(part – V)

Data preparation for the Analysis

Preparing data for Discriminant Analysis is an important step.

Missing Imputation and normality of dependent variable

All the independent variables and the dependent variable that go into the model should not have any missing values. If the dependent variable has a missing value for an observation, which is rarely the case, then that observation should be discarded.

For the independent variables, the missing values are replaced / imputed. Some of the commonly used imputation techniques include

  1. Replacing by the median,
  2. Replacing by the mode,
  3. Replacing by 0,
  4. Replacing by other logical values.


Median is preferred to mean because it is not impacted by extreme values. Mode is used when the variable is discrete or categorical, where the mode is the most frequently occurring value. 0 is mostly used for indicator/dummy/binary variables. Other logical values can also be used based on their business implications.
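
A minimal sketch of median imputation using PROC STDIZE (the dataset model_data and the variables x1-x3 are assumed names for illustration):

proc stdize data=model_data out=model_data_imp
            reponly          /* replace only the missing values, leave non-missing values unchanged */
            method=median;   /* ...with each variable's median */
var x1 x2 x3;
run;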

Outlier Treatment

A single observation that is substantially different from all other observations can make a large difference in the results of the Discriminant analysis.  If a single observation (or small group of observations) substantially changes the results, one would want to know about this and investigate further. 
High values are known as upper outliers and low ones are known as lower outliers. Such values should be modified, or else they would bias the estimation of the model parameters. The simplest and most commonly used outlier treatment technique is capping the values that are above the 99th percentile or below the 1st percentile of the population. This means that if a value is above the 99th percentile then it is replaced by the value corresponding to the 99th percentile. Similar capping is done for values below the 1st percentile.
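
A minimal sketch of this percentile capping (the dataset model_data and the variable x1 are assumed names for illustration):

proc univariate data=model_data noprint;
var x1;
output out=pct pctlpts=1 99 pctlpre=p_;   /* creates p_1 and p_99 */
run;

data model_data_cap;
if _n_ = 1 then set pct;          /* makes p_1 and p_99 available on every observation */
set model_data;
if x1 > p_99 then x1 = p_99;      /* cap upper outliers at the 99th percentile */
else if x1 < p_1 then x1 = p_1;   /* cap lower outliers at the 1st percentile */
run;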

Use the entire data for identifying attributes pertaining to group differences and comparison.

Training & Validation data

Once the data preparation is done, the entire population is split into modeling and validation population. This split is done in a random way, so that the distribution of dependent/class variable for both these samples is roughly the same. It is assumed that the characteristics of the independent variables would be similar in the two samples, as it is a random split. The modeling population is used to build the model and then it is implemented on the validation population. The performance of a model should be similar in both modeling and validation samples.

  • Split the Sample data into two data sets viz. Training sample and validation/Holdout sample
  • Split ratio can be 50-50 or 60-40 depending on the sample size.
  • Ensure that an equal proportion of the grouping variable (categorical variable) has been put into the Training and Holdout samples (a sketch of such a stratified split follows the definitions below).
Training sample: The data set used to compute the Discriminant function.
Validation/Holdout sample: The data set used to validate the accuracy of classification (prediction) based on the function computed using the training data set.
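
A minimal sketch of such a stratified random split using PROC SURVEYSELECT (the dataset da_data, the grouping variable group_var, the 60-40 ratio and the seed are assumed for illustration):

proc surveyselect data=da_data out=da_split
                  samprate=0.6 seed=1234 outall;   /* OUTALL keeps all rows and adds a Selected flag */
strata group_var;   /* preserves the proportion of each group in both samples */
run;

data training validation;
set da_split;
if Selected = 1 then output training;   /* roughly 60% of each group */
else output validation;                 /* the remaining 40% */
run;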

Selection of variables for analysis(reduction of variables)

  • Dependent variable: The grouping variable can have two or more groups
  • Independent variable: Selected based on
a)      Apriori Business understanding or logic.
b)      Ensure there is no multicollinearity among the selected independent variables.
c)       Stepwise Discriminant Analysis
a.       Could be performed to identify the subset of independent variables (discriminators) with high discriminating power from a large set of candidate independent variables.
b.      The selection of the independent variables into the function is based on the individual F-value. Hence, check for independent variable(s) that have been excluded from the Discriminant function because the data did not support them but that still make logical sense.
c.       SAS Code:

PROC STEPDISC data = <libname.data set-name>;
CLASS <grouping variable>;
VAR <independent variables>;
RUN;

For details about the STEPDISC procedure options, please see the SAS/STAT documentation for the STEPDISC procedure.


Discriminant Analysis (Direct Method)

Once the first cut variable selection has been done using STEPDISC, one can use PROC DISCRIM to compute the Discriminant function; a sketch showing how to score the holdout sample follows the option list below.

PROC DISCRIM DATA= <libname.data set-name>   POOL= YES CANONICAL CROSSVALIDATE ANOVA MANOVA LIST  OUTSTAT = <filename>;
CLASS < grouping variable>;
VAR  <Independent variables>;
PRIORS PROP;
RUN;

Some useful options and their meaning:

  • PROC DISCRIM = Procedure for Linear Discriminant Analysis
  • PROC STEPDISC = Procedure for Stepwise Discriminant Analysis
  • DATA = Data set
  • POOL = YES : To use the pooled covariance matrix in calculating the Discriminant function
POOL=NO | TEST | YES
Determines whether the pooled or within-group covariance matrix is the basis of the measure of the squared distance. If you specify POOL=YES, PROC DISCRIM uses the pooled covariance matrix in calculating the (generalized) squared distances. Linear discriminant functions are computed. If you specify POOL=NO, the procedure uses the individual within-group covariance matrices in calculating the distances. Quadratic discriminant functions are computed. The default is POOL=YES. When you specify METHOD=NORMAL, the option POOL=TEST requests Bartlett’s modification of the likelihood ratio test (Morrison 1976; Anderson 1984) of the homogeneity of the within-group covariance matrices. The test is unbiased (Perlman 1980). However, it is not robust to non-normality. If the test statistic is significant at the level specified by the SLPOOL= option, the within-group covariance matrices are used. Otherwise, the pooled covariance matrix is used. The discriminant function coefficients are displayed only when the pooled covariance matrix is used.

  • CANONICAL : Performs canonical Discriminant analysis
  • CROSSVALIDATE: Classifies each observation in the data set by computing a Discriminant function from the other observation in the data set, excluding the observation being classified (i.e. leave one-out classification)
  • ANOVA: Displays univariate statistics
  • MANOVA: Displays multivariate statistics
  • LIST: Displays the Hit ratio (correct classification) based on Discriminant function computed using the entire data set.
  • OUTSTAT: To save the calibration information (Discriminant function)
  • CLASS:  Grouping variable
  • VAR: Set of independent variables
  • PRIORS PROP: specifies the prior probabilities of group membership based on probabilities proportional to the sample size.
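
A minimal sketch of scoring the validation/holdout sample with the function computed on the training sample, using the TESTDATA= option (the dataset names training and validation, the grouping variable group_var and the predictors x1-x3 are assumed names for illustration):

PROC DISCRIM DATA=training TESTDATA=validation TESTOUT=scored_validation TESTLIST;
CLASS group_var;   /* grouping variable */
VAR x1 x2 x3;      /* independent variables */
PRIORS PROP;       /* prior probabilities proportional to the training sample sizes */
RUN;

TESTOUT= saves the posterior probabilities and the assigned group for each holdout observation, and TESTLIST lists the classification of each observation in the test data.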

Thursday, December 2, 2010

A Powerful Classification Technique in Data Mining - Discriminant Analysis(part – IV)

Two Goals for Discriminant Analysis

• Interpretation: “How are the groups different?” Find and interpret linear combinations of variables that optimally predict group differences

• Classification: “How accurately can observations be classified into groups?” Using functions of variables to predict group membership for a data set and evaluate expected error rates

Steps involved in Discriminant Analysis Process

• Specify the dependent & the predictor variables

• Test the model’s assumptions a priori

• Determine the method for selection and criteria for entering the predictor variables into the model

• Estimate the parameters of the model

• Determine the goodness-of-fit of the model and examine the residuals

• Determine the significance of the predictors

• Test the assumptions

• Validate the results

Sunday, November 21, 2010

A Powerful Classification Technique in Data Mining - Discriminant Analysis(part – III)

Terminologies:

F value: The ratio of the between-groups sum of squares to the within-groups sum of squares for a variable.

Wilks’ Lambda: The ratio of the within-groups sum of squares to the total sum of squares for the entire set of variables in the analysis. Wilks’ Lambda varies between 0 and 1. Also called the U statistic.

Classification matrix: Is a matrix that contains the number of correctly classified and misclassified cases.

Hit Ratio: Percentage of cases correctly classified by the discriminant function.

The DISCRIM Procedure

• PROC DISCRIM can be used for many different types of analysis including

• canonical discriminant analysis

• assessing and confirming the usefulness of the functions (empirical validation and crossvalidation)

• predicting group membership on new data using the functions (scoring)

• linear and quadratic discriminant analysis

• nonparametric discriminant analysis

Discriminant Function



Linear discriminant analysis constructs one or more discriminant equations (linear combinations of the predictor variables Xk) such that the different groups differ as much as possible on Z:

Z = a + W1X1 + W2X2 + … + WkXk

Where,


Z = Discriminant score, a number used to predict group membership of a  case

a = Discriminant constant

Wk = Discriminant weight or coefficient, a measure of the extent to which variable Xk discriminates among the groups of the DV

Xk = An Independent Variable or Predictor variable. Can be metric or non-metric.

Number of discriminant functions = min (number of groups – 1, k).

k = Number of predictor variables.

Discriminant Function : Interpretation
 

• The weights are chosen so that one will be able to compute a discriminant score for each subject and then do an ANOVA on Z.
• More precisely, the weights of the discriminant function are calculated in such a way, that the ratio (between groups SS)/(within groups SS) is as large as possible.
• The value of this ratio is the eigenvalue
• First discriminant function Z1 distinguishes first group from groups 2,3,..N.
• Second discriminant function Z2 distinguishes second group from groups 3, 4…,N. etc
Note : Discriminant analysis uses OLS to estimate the values of the parameters (a) and Wk that minimize the Within Group SS.

Partitioning Sums of Squares in Discriminant Analysis


In Linear Regression:

• Total sums of squares are partitioned into Regression sums of squares and Residual sums of squares.

• And Goal is to estimate parameters that minimize the Residual SS.

In Discriminant Analysis:

• The Total sums of squares is partitioned into Between Group sums of squares and Within Groups sums of squares:

Total SS = Between-groups SS + Within-groups SS
Σi (Zi - Z)^2 = Σj Σ(i in group j) (Zj - Z)^2 + Σj Σ(i in group j) (Zi - Zj)^2

Where,
i = an individual case,

j = group j

Zi = individual discriminant score

Z = grand mean of the discriminant scores

Zj = mean discriminant score for group j

Here, Goal is to estimate parameters that minimize the Within Group Sums of Squares


Thursday, November 11, 2010

A Powerful Classification Technique in Data Mining - Discriminant Analysis(part – II)

Discriminant Analysis attempts to find a rule that separates clusters to the maximum possible extent.


Discriminant Analysis - Assumptions

The underlying assumptions of Discriminant Analysis (DA) are:

– Each group is normally distributed; Discriminant Analysis is relatively robust to departures from normality.

– The groups defined by the dependent variable exist a priori.

– The predictor variables Xk are multivariate normally distributed, independent, and non-collinear

– The variance/covariance matrices of the predictor variables across the various groups are the same in the population (i.e. homogeneous)

– The relationship is linear in its parameters

– Absence of leverage point outliers

– The sample is large enough: Unequal sample sizes are acceptable. The sample size of the smallest group needs to exceed the number of predictor variables. As a “rule of thumb”, the smallest sample size should be at least 20 for a few (4 or 5) predictors. The maximum number of independent variables is n - 2, where n is the sample size. While such a low sample size may work, it is not encouraged; generally it is best to have 4 or 5 times as many observations as independent variables

– Errors are randomly distributed

Drawback of Discriminant Analysis

– An important drawback of discriminant analysis is its dependence on a relatively equal distribution of group membership. If one group within the population is substantially larger than the other group, as is often the case in real life, Discriminant analysis might classify all observations in only one group. An equal good-bad sample should be chosen for building the discriminant analysis model.

– Another significant restriction of discriminant analysis is that it can’t handle categorical independent variables.

– Discriminant analysis is more rigid than logistic regression in its assumptions. In contrast to ordinary linear regression, discriminant analysis does not have unique coefficients. Each of the coefficients depends on the other coefficients in the estimation and therefore there is no way of determining the absolute value of any coefficient.

Discriminant Analysis Vs Logistic Regression



Similarity: Both techniques examine an entire set of interdependent relationships


Discriminant Analysis Vs ANOVA

Similarity: Both techniques examine an entire set of interdependent relationships

Difference: In Discriminant analysis, the independent variables are metric, whereas in ANOVA they are categorical.

Reference:

http://userwww.sfsu.edu/~efc/classes/biol710/discrim/discrim.pdf
www.shsu.edu/~icc_cmf/cj_742/stats7.doc

Saturday, November 6, 2010

A Powerful Classification Technique in Data Mining - Discriminant Analysis(part – I)

Classification is a data mining technique used to predict group membership for data instances. In predictive customer analytics, classification techniques are deployed frequently across most applications, including acquisition, cross-sell, attrition, credit scoring, collections and classifying first-time buyers. The objective of any classification model is to classify customers into two or more groups based on a predicted outcome associated with each customer, e.g. responder or non-responder, defaulter or non-defaulter, churner or non-churner, valuable or non-valuable customers. Businesses are interested in predicting the likelihood of each customer behaving in a particular fashion, and classification techniques provide them with predictive models for the same.


Various parametric and non-parametric methods are used to solve classification problems. Traditional statistical methods are parametric in nature: they are based on assumptions about the underlying distributions and estimate the parameters of those distributions to solve the problem. Non-parametric methods, on the other hand, make no assumptions about the specific distributions involved, and are therefore distribution-free.

Discriminant analysis is a technique for classifying a set of observations into two or more predefined classes. The purpose is to determine the class of an observation based on a set of variables known as predictors or input variables (analogous to independent variables in regression). The model is built based on a set of observations for which the classes are known. This set of observations is sometimes referred to as the training set. Based on the training set, the technique constructs a set of linear functions of the predictors, known as discriminant functions, such that

L = b1x1 + b2x2 + …… + bnxn + c , where the b's are discriminant coefficients, the x's are the input variables or predictors and c is a constant.

These discriminant functions are used to predict the class of a new observation with unknown class. For a k class problem k discriminant functions are constructed. Given a new observation, all the k discriminant functions are evaluated and the observation is assigned to class i if the ith discriminant function has the highest value.

Discriminant Analysis (DA), a multivariate statistical technique, is commonly used to build a predictive / descriptive model of group discrimination based on observed predictor variables and to classify each observation into one of the groups. In DA, multiple quantitative attributes are used to discriminate a single classification variable. DA is different from cluster analysis because prior knowledge of the classes, usually in the form of a sample from each class, is required.

The common objectives of DA are

i. To investigate differences between groups

ii. To discriminate groups effectively;

iii. To identify important discriminating variables;

iv. To perform hypothesis testing on the differences between the expected groupings

v. To classify new customers into pre-existing groups.

Commonly used DA techniques available in the SAS systems are :

DISCRIM: Computes various discriminant functions for classifying observations. Linear or quadratic discriminant functions can be used for data with approximately multivariate normal within-class distributions. Nonparametric methods can be used without making any assumptions about these distributions.

CANDISC: Performs a canonical analysis to find linear combinations of the quantitative variables that best summarize the differences among the classes.

STEPDISC: It uses forward selection, backward elimination, or stepwise selection to try to find a subset of quantitative variables that best reveals differences among the classes.
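
As an illustration of these procedures, here is a minimal sketch of a PROC CANDISC call (the dataset da_data, the grouping variable group_var and the predictors x1-x3 are assumed names):

proc candisc data=da_data out=can_scores anova;
class group_var;   /* the grouping variable */
var x1 x2 x3;      /* quantitative predictors */
run;

The OUT= data set contains the canonical variable scores, which can be plotted to visualize the separation among the classes.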

Reference:
http://www2.sas.com/proceedings/sugi27/p247-27.pdf

Saturday, October 16, 2010

My Understanding about Linear Regression - Part V

Model Validation Statistics


Once a linear regression model is built, it is necessary to validate the performance of the model. There are various validation techniques. Some of the widely used techniques are listed below:

R-Square

Any model is only as good as its ability to predict the actual outcome with accuracy. R-Square is a measure of how well the model is able to predict the changes in the actual data. R-Square ranges between 0 and 1, with values over 0.7 generally indicating a good fit between the predictions and the actual data.

Mean Absolute Percent Error (MAPE)

MAPE is a measure of how far, on average, the predictions are from the actual data. For example, a 10% MAPE means that on average the predictions from a model will be 10% higher or lower than the actuals.

MAPE is defined by the formula:

MAPE = (1/n) * Sigma( |a - f| / a ) * 100

Where a is the Actual value and f is the predicted value.
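
A minimal sketch of computing MAPE in SAS (the dataset pred_data and its variables actual and predicted are assumed names for illustration):

data ape_calc;
set pred_data;
if actual ne 0 then ape = abs((actual - predicted) / actual) * 100;   /* absolute percent error per record */
run;

proc means data=ape_calc mean;
var ape;   /* the mean of ape is the MAPE */
run;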

Rank Ordering


This is an initial validation process, used once the predicted values are created. The predicted values are sorted in descending order and are grouped into deciles. Then the average of the actual values is computed for each decile. Generally, the mean of the actual values is used for rank ordering. The model is said to rank order if these averages follow a monotonically decreasing order, i.e., the average in the 1st decile should be strictly higher than the average in the 2nd decile, and so on.

The decile equation created on the training data should also be used on the validation data for creating the deciles there.
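
A minimal sketch of creating deciles and checking rank ordering (the dataset scored_data and its variables predicted and actual are assumed names for illustration):

proc rank data=scored_data out=ranked groups=10 descending;
var predicted;
ranks decile;   /* decile 0 holds the highest predicted values */
run;

proc means data=ranked mean;
class decile;
var actual predicted;   /* the mean of actual should decrease monotonically across deciles */
run;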

Sign Check

This is more from a business point of view. The signs (+/-) of the parameters corresponding to each independent variable are checked to see if they make business sense. For example, if a variable is supposed to have a positive relationship with the predicted value, then its sign should be positive.

Once we finalize the model on the training data, we fit the same model on the validation data and check the signs of the coefficients of the independent variables.

Lift Curve


A lift curve is a graphical representation of the percentage of the cumulative dependent variable (for example, if the dependent variable is revenue, then the cumulative percentage of revenue in each decile) captured at a specific cut-off. The cut-off can be a particular decile or percentile. Similar to the rank ordering procedure, the data is sorted in descending order of the predicted value and is then grouped into deciles/percentiles. The cumulative sum of the dependent variable is then computed for each decile/percentile. Taking the example from rank ordering, a lift curve for the same would be as follows:



Tests for Normality of Residuals


One of the assumptions of linear regression analysis is that the residuals are normally distributed. This assumption assures that the p-values for the t-tests will be valid. As before, we will generate the residuals (called r) and predicted values (called fv) and put them in a dataset (called elem1res). We will also keep the other independent variables in that dataset.

proc reg data=best_model;
model ln_gross_rev =
Tot_unit_AnyProd
Flag_PC
Flag_other
hh_size
online_ordr_amt_avg
age;
output out=elem1res (keep= ln_gross_rev Tot_unit_AnyProd
Flag_PC
Flag_other
hh_size
online_ordr_amt_avg
age r fv) residual=r predicted=fv;
run;
quit;

proc kde data=elem1res out=den;
var r ;
run;

proc sort data=den;
by r;
run;

goptions reset=all;
symbol1 c=blue i=join v=none height=1;
proc gplot data=den;
plot density*r=1;
run;
quit;

qqplot of residual

Proc univariate will produce a normal quantile graph. The qqplot statement plots the quantiles of a variable against the quantiles of a normal distribution. The qqplot is most sensitive to non-normality near the two tails.

goptions reset=all;
proc univariate data=elem1res normal;
var r;
qqplot r / normal(mu=est sigma=est);
run;

Some cosmetic treatments of a model – things I learnt from my experience


Sometimes I have seen that the model does not rank order. There are several methods to check/correct the rank ordering problem:

a. One may exclude one or more variables at a time and rebuild the linear regression model to see when rank ordering is obtained in both the modeling and validation samples.

b. One can plot each of the independent variables against the deciles created for rank ordering. Each of the independent variables should follow a trend.

i. One can verify the sign of the coefficient of each of the independent variables.

ii. In case an independent variable is not following a trend, one can use different transformations such as:

1. Linear transformation (for example, for a continuous independent variable, one can create a categorical variable: if recency is between 3.1 and 4.33 years then recency_1 = 3; if recency is <3.1 but >=2.5 then recency_1 = 2; else recency_1 = 1); see the sketch after this list.

2. Quadratic transformations are also useful sometimes

3. One can use parabolic and hyperbolic transformations too but those are difficult to explain.
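
A minimal sketch of the binning transformation mentioned in point 1 above (the dataset cust_data and the variable recency, in years, are assumed names for illustration):

data cust_data_t;
set cust_data;
if 3.1 <= recency <= 4.33 then recency_1 = 3;
else if 2.5 <= recency < 3.1 then recency_1 = 2;
else recency_1 = 1;
run;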

Autocorrelation


Another way in which the assumption of independence can be broken is when data are collected on the same variables over time. Let's say that we collect medicine usage data every quarter for 12 years. In this situation it is likely that the errors for observations in adjacent quarters will be more highly correlated than the errors for observations more separated in time. This is known as autocorrelation. When you have data that can be considered to be a time series, you should use the DW option, which performs a Durbin-Watson test for correlated residuals.

Durbin-Watson Statistic:

One peculiar feature of data recorded over time, like monthly sales, is that it tends to be correlated over time. For example, high sales months may tend to be followed by high sales months and low sales months by more low sales months. This may be caused either by seasonal/cyclical trends or by seasonal promotion, marketing or competitive effects. Whatever the factor causing this correlation, correlated errors violate one of the fundamental assumptions needed for least squares regression: independence of errors, or in other words random errors. The Durbin-Watson statistic is a measure used to detect such correlations; every model has one Durbin-Watson statistic. The Durbin-Watson statistic ranges in value from 0 to 4, with an ideal value of 2 indicating that errors are not correlated (although values from 1.75 to 2.25 may be considered acceptable). A value significantly below 2 indicates a positive correlation and a value significantly greater than 2 suggests a negative correlation. In either case the model specification needs to be reviewed to identify potentially omitted or redundant variables.
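
A minimal sketch of requesting the Durbin-Watson statistic in PROC REG (the dataset sales_data and the variables sales and promo_spend are assumed names for illustration):

proc reg data=sales_data;
model sales = promo_spend / dw;   /* the DW option requests the Durbin-Watson statistic */
run;
quit;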

References:

http://www.ats.ucla.edu/stat/sas/library/SASReg_mf.htm

http://www.ats.ucla.edu/stat/sas/webbooks/reg/chapter2/sasreg2.htm

http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#/documentation/cdl/en/statug/63033/HTML/default/statug_reg_sect007.htm

http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm

http://www.sfu.ca/sasdoc/sashtml/stat/chap55/sect38.htm


Sunday, October 10, 2010

Why use multiple linear regression?

• To investigate a collection of factors for their potential association with the outcome of interest

• To investigate a collection of known relevant factors for their ability to predict the outcome of interest.

The GOAL

• To obtain a parsimonious set of variables that efficiently predicts the response variable of interest.

Model Selection

• PROC REG supports a variety of model selection methods but does not support a CLASS statement.

• PROC GLM supports the CLASS statement but does not include the model selection methods.

• PROC GLMSELECT supports the CLASS statement and includes model selection methods, but does not include regression diagnostics, hypothesis testing, LS-means, etc. (a minimal example follows the notes below)

o Only available in SAS 9.2

o Can download from SAS website for 9.1

o http://support.sas.com/rnd/app/da/glmselect.html
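
A minimal sketch of PROC GLMSELECT with a CLASS variable (the dataset a, the class variable region, the response y and the predictors x1-x2 are assumed names for illustration):

proc glmselect data=a;
class region;                                 /* categorical predictor, supported directly */
model y = region x1 x2 / selection=stepwise;  /* stepwise model selection */
run;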

My Understanding about Linear Regression - Part IV

Multicollinearity


Multicollinearity is a condition in which independent variables included in a model are correlated with each other. The real damage caused by multicollinearity is that it inflates the standard errors of the estimated coefficients. In simpler terms, it causes the estimated t-statistics for correlated (multicollinear) variables to be insignificant, thus making significant variables appear to be insignificant. Multicollinearity can be identified by the Variance Inflation Factor (VIF), a statistic calculated for each variable in a model. A VIF greater than 2 may suggest that the concerned variable is multicollinear with others in the model and may need to be dropped. The Variance Inflation Factor (VIF) is 1/Tolerance, and it is always greater than or equal to 1.


A variance inflation factor is attached to each variable in the model and it measures the severity of multicollinearity for that variable. Statistically, the VIF for the ith variable is defined as

VIFi = 1 / (1 - R2i)

Where R2i is the R2 value obtained by regressing the ith predictor on the remaining predictors, i.e. from the auxiliary regression

Xi = β0 + β1X1 + β2X2 + … + βkXk ,  k ≠ i

Note that a variance inflation factor exists for each of the predictors in a multiple regression model.



To determine which variables are collinear with each other, we need to look at the collinearity diagnostics. For this, we can do a factor analysis, where factor loadings corresponding to each variable are computed for each factor. If two variables are collinear, then their loadings on a particular factor will be higher compared to the other variables. We select the most important factor to begin with. Once two variables with high collinearity between them are located, we have to remove one of them from the model. The condition index is calculated using a factor analysis on the independent variables. Values of 10-30 indicate moderate multicollinearity in the linear regression variables; values > 30 indicate strong multicollinearity.



Once a linear regression model is created, various validation techniques are used to quantify the effectiveness of the model. A good model should pass these tests as well as show similar patterns in the modeling and validation data.



Thursday, August 12, 2010

My Understanding about Linear Regression - Part III

Outlier Treatment


A single observation that is substantially different from all other observations can make a large difference in the results of the regression analysis. If a single observation (or small group of observations) substantially changes the results, one would want to know about this and investigate further.

In linear regression, an outlier is an observation with large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity or may indicate a data entry error or other problem.

High values are known as upper outliers and low ones are known as lower outliers. Such values should be modified, or else they would bias the estimation of the model parameters. The simplest and most commonly used outlier treatment technique is capping the values that are above the 99th percentile or below the 1st percentile of the population. This means that if a value is above the 99th percentile then it is replaced by the value corresponding to the 99th percentile. Similar capping is done for values below the 1st percentile.

Leverage: An observation with an extreme value on a predictor variable is called a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean. These leverage points can have an effect on the estimate of regression coefficients.

Influence: An observation is said to be influential if removing the observation substantially changes the estimate of coefficients. Influence can be thought of as the product of leverage and outlierness.

Detecting Unusual and Influential Data

o Scatterplots of the dependent variable versus the independent variables

We need to examine the relationship between the dependent variable, Y1, and a continuous predictor, X1. We first look at a scatterplot, with a regression line included to see the relationship between Y and X and decide if it appears to be linear (degree = 1 is used for the regression line). We also look for any outliers. Here is the SAS code:

title "Scatter Plot with Regression Line";

proc sgplot data=a;
reg y=y1 x=x1 / degree=1;
run;

o looking at the largest values of the studentized residuals, leverage, Cook's D, DFFITS and DFBETAs . Here is the code:

proc reg data = best_model ;
model ln_gross_rev =
Tot_unit_AnyProd
Flag_PC
Flag_other
hh_size
online_ordr_amt_avg
age
/ vif collin ;

output out = predicted_output p=Predicted r=Error cookd = CooksD dffits=dffit h=lev;
run;
ods graphics off ;

data a;
set predicted_output ;
cov = finv(0.05,7,518433);
if CooksD > cov then flag_cook = 1;
else flag_cook = 0;
run;
proc freq data = a; tables flag_cook; run;

Ideally, flag_cook = 0 should be 100%. flag_cook = 1 indicates the presence of outliers. In case of unusual data points, we need to apply proper treatments or remove those data points from the model data (if the percentage of unusual data points is not very high).

Cook's D follows an F distribution with (K, n-K) degrees of freedom, where n = number of observations and K = number of parameters in the model (including the intercept).

The lowest value that Cook's D can assume is zero, and the higher the Cook's D, the more influential the point. The conventional cut-off point is 4/n. We can either follow the above-mentioned method or list any observation above the cut-off point by doing the following. Here is the code for the same:

data xxx(keep = cust_id Tot_unit_AnyProd Flag_PC Flag_other hh_size
online_ordr_amt_avg age_n CooksD); set predicted_output;
where CooksD > (4/518433);
run;

(Note: In this example the model's dependent variable was revenue per customer, so the data is at the customer level, i.e. all the variables are summarized at the customer level and each customer has one row. That is the reason I have kept the cust_id variable in the above output. It helps identify which customers have the most influential data points.)

Plot the DFFITS statistic by observation number. Observations whose DFFITS statistic is greater in magnitude than 2*sqrt(k/n), where n is the number of observations used and k is the number of regressors, are deemed to be influential.

DFFITS can be either positive or negative, with numbers close to zero corresponding to points with small or zero influence.

data yyy(keep = cust_id Tot_unit_AnyProd
Flag_PC
Flag_other
hh_size
online_ordr_amt_avg
age dffit) ; set predicted_output;
where abs(dffit)> (2*sqrt(6/518433));
run;

The DFFITS statistic is a scaled measure of the change in the predicted value for the ith observation and is calculated by deleting the ith observation. A large value indicates that the observation is very influential in its neighborhood of the X space.


The above measures are general measures of influence. One can also consider more specific measures of influence that assess how each coefficient is changed by deleting the observation. This measure is called DFBETA and is created for each of the predictors. This is more computationally intensive than summary statistics such as Cook's D because the more predictors a model has, the more computation it involves. We can restrict our attention to only those predictors that we are most concerned with and see how well behaved those predictors are. In SAS, we need to use the ODS OUTPUT OutputStatistics statement to produce the DFBETAs for each of the predictors. The names of the new variables are chosen by SAS automatically and begin with DFB_.

proc reg data = best_model ;
model ln_gross_rev= Tot_unit_AnyProd
Flag_PC Flag_other hh_size online_ordr_amt_avg age_n/influence;
ods Output OutputStatistics= revbetas;
id Cust_cid;
run;

(note: for bigger datasets the HTML output will be huge; since all the DFBETAs are stored in the output dataset, there is no need to keep the ODS output once the code has run)

This created six variables, DFB_Tot_unit_AnyProd, DFB_Flag_PC , DFB_Flag_other, DFB_hh_size, DFB_online_ordr_amt_avg and DFB_age

The DFBETAS statistics are the scaled measures of the change in each parameter estimate and are calculated by deleting the ith observation:

DFBETASj(i) = ( bj - bj(i) ) / ( s(i) * sqrt( (X'X)jj ) )

where bj is the jth parameter estimate from the full data, bj(i) is the same estimate computed without the ith observation, s(i) is the root mean square error computed without the ith observation, and (X'X)jj is the (j,j)th element of (X'X)-1.

In general, large values of DFBETAS indicate observations that are influential in estimating a given parameter. Belsley, Kuh, and Welsch recommend 2 as a general cutoff value to indicate influential observations and (2/sqrt n) as a size-adjusted cutoff.

If a point lies far from the other data in the horizontal direction, it is known as an influential observation. The reason for this distinction is that these points may have a significant impact on the slope of the regression line. We need to remove influential data points or apply proper treatments to them.

Now let's look at the leverage's to identify observations that will have potential great influence on regression coefficient estimates. Here is the sample SAS code:

proc univariate data= predicted_output plots plotsize=30;
var lev;
run;

Generally, a point with leverage greater than (2k+2)/n should be carefully examined, where k is the number of predictors and n is the number of observations.

The following table summarizes the general rules of thumb we use for these measures to identify observations worthy of further investigation (where k is the number of predictors and n is the number of observations).

Measure        Value
leverage       > (2k+2)/n
abs(rstu)      > 2
Cook's D       > 4/n
abs(DFFITS)    > 2*sqrt(k/n)
abs(DFBETA)    > 2/sqrt(n)
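
A minimal sketch that applies these cut-offs to the predicted_output dataset created earlier (k = 6 predictors and n = 518433 observations, as in the examples above):

data influential;
set predicted_output;
k = 6; n = 518433;
if lev > (2*k + 2)/n
   or CooksD > 4/n
   or abs(dffit) > 2*sqrt(k/n) then flag_influential = 1;
else flag_influential = 0;
run;

proc freq data=influential;
tables flag_influential;
run;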

For more details, please refer

http://www.ats.ucla.edu/stat/sas/webbooks/reg/chapter2/sasreg2.htm

Model building


Once the data has been cleaned by missing imputation and outlier treatment, the initial model is built on it. For the initial / first cut model, all the independent variables are put into the model. The objective is to finally have a limited number of independent variables (5-10) which are significant in all respects, without sacrificing too much of the model performance. The reason for not including too many variables is that the model would be overfitted and would become unstable when tested on the validation sample. The variable reduction is done using forward, backward or stepwise variable selection procedures. These procedures are described below:

Forward Selection - In a forward selection analysis we start out with no predictors in the model. Each of the available predictors is evaluated with respect to how much R2 would be increased by adding it to the model. The one which will most increase R2 will be added if it meets the statistical criterion for entry. With SAS the statistical criterion is the significance level for the increase in the R2 produced by addition of the predictor. If no predictor meets that criterion, the analysis stops. If a predictor is added, then the second step involves re-evaluating all of the available predictors which have not yet been entered into the model. If any satisfy the criterion for entry, the one which most increases R2 is added. This procedure is repeated until there remain no more predictors that are eligible for entry.

Backwards Elimination - In a backwards elimination analysis we start out with all of the predictors in the model. At each step we evaluate the predictors which are in the model and eliminate any that meet the criterion for removal.

Stepwise Selection - With fully stepwise selection we start out just as in forward selection, but at each step variables that are already in the model are first evaluated for removal, and if any are eligible for removal, the one whose removal would least lower R2 is removed. You might wonder why a variable would enter at one point and leave later: a variable might enter early, being well correlated with the criterion variable, but later become redundant with the predictors that follow it into the model.

An entry significance level of 0.15, specified in the slentry=0.15 option, means a variable must have a p-value < 0.15 in order to enter the model during forward selection and stepwise regression. An exit significance level of 0.05, specified in the slstay=0.05 option, means a variable must have a p-value > 0.05 in order to leave the model during backward selection and stepwise regression. One can change the entry and exit criteria based on the situation and requirement.

The following SAS code performs the forward selection method by specifying the option selection=forward.

proc reg data=a outest=est1;
model y=x1 x2 x3 x4 x5 x6 x7 x8 x9 x10…..Xn / slstay=0.15 slentry=0.15
selection=forward ss2 sse aic;
output out=out1 p=p r=r; run;

The following SAS code performs the backward elimination method by specifying the option selection=backward.

proc reg data=a outest=est2;
model y=x1 x2 x3 x4 x5 x6 x7 x8 x9 x10….Xn / slstay=0.05 slentry=0.15
selection=backward ss2 sse aic;
output out=out1 p=p r=r; run;

The following SAS code performs stepwise regression by specifying the option selection=stepwise.

proc reg data=a outest=est3;
model y=x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 …Xn/ slstay=0.05 slentry=0.15
selection=stepwise ss2 sse aic;
output out=out3 p=p r=r; run;

Once the first cut model is ready, one might have to do a couple of manual iterations based on business logic. One also needs to check for multicollinearity. One can use the following code for the same:

proc reg data=a outest=parameter;
model y=x1 x2 x3 x4 x5 x6 x7 x8 x9 x10…..xn / vif collin;
output out=out3 p=p r=r;
run;

Saturday, August 7, 2010

My Understanding about Linear Regression - Part II

Important terminologies


Variables: Variables are measurements of occurrences of a recurring event taken at regular intervals or measurements of different instances of similar events that can take on different possible values. E.g. the price of gasoline recorded at monthly intervals, the height of children between the age of 6 and 10, Revenue per customer.

Dependent Variable: A variable whose value depends on the value of other variables in a model. E.g. revenue from customer, which is directly dependent on their purchase quantity and price of the product/s.

Independent Variables: Variables whose values are not dependent on other variables in a model. E.g. the price of corn would be one of the independent variables driving the price of corn oil. An independent variable is specific to a model, and a variable that is independent in one model can be dependent in another.

Lurking Variables: If non-linear trends are visible in the relationship between an explanatory and dependent variable, there may be other influential variables to consider. A lurking variable exists when the relationship between two variables is significantly affected by the presence of a third variable which has not been included in the modeling effort. Since such a variable might be a factor of time (for example, the effect of political or economic cycles), a time series plot of the data is often a useful tool in identifying the presence of lurking variables.

Residual: Once a regression model has been fit to a group of data, examination of the residuals (the deviations from the fitted line to the observed values) allows the modeler to investigate the validity of his or her assumption that a linear relationship exists. Plotting the residuals on the y-axis against the explanatory variable on the x-axis reveals any possible non-linear relationship among the variables, or might alert the modeler to investigate lurking variables.

Extrapolation: Whenever a linear regression model is fit to a group of data, the range of the data should be carefully observed. Attempting to use a regression equation to predict values outside of this range is often inappropriate, and may yield incredible answers. This practice is known as extrapolation. Consider, for example, a linear model which relates weight gain to age for young children. Applying such a model to adults, or even teenagers, would be absurd, since the relationship between age and weight gain is not consistent for all age groups.

Modeling and Validation Population
Once the dependent variable has been defined, the entire population is split into modeling and validation population. This split is done in a random way, so that the average value of dependent variable for both these samples is roughly the same. It is assumed that the characteristics of the independent variables would be similar in the two samples, as it is a random split. The modeling population is used to build the model and then it is implemented on the validation population. The performance of a model should be similar in both modeling and validation samples.

Here is the sample SAS code:

data wl_model(drop=x) wl_validation(drop=x);
set vintage24_48_all_cust_1;
x=ranuni(2345);
if x < 0.5 then output wl_model;
else output wl_validation;
run;

proc means data = vintage24_48_all_cust_1 mean; var ln_gross_rev; run;
proc means data = wl_model mean; var ln_gross_rev; run;
proc means data = wl_validation mean; var ln_gross_rev; run;

Instead of 50:50 split, one can use 60:40 or 70:30 split for Modeling and validation dataset.

Missing Imputation and normality of dependent variable

All the independent variables and the dependent variable that go into the model should not have any missing values. If the dependent variable has a missing value for an observation, which is rarely the case, then it should be discarded. For Linear regression, if the dependent variable is skewed, one can use the log transformation of the dependent variable so that the dependent variable is normally distributed. Here is the code to perform log transformation:

data dep_var_1;
set dep_var;
ln_gross_rev =log(gross_rev);
run;

One can use proc univariate to plot the dependent variable. The following code can be used:

proc univariate data= dep_var_1;
var ln_gross_rev;
histogram / normal
midpoints = 1 2 3 4 5 6 7 8 9 10
ctext = blue;
run;

For the independent variables, the missing values are replaced / imputed. Some of the commonly used imputation techniques include :

1. Replacing by the median,
2. Replacing by the mode,
3. Replacing by 0,
4. Replacing by any other logical value.

Median is preferred to mean because it is not impacted by extreme values. Mode is used when the variable is discrete or categorical, where the mode is the most frequently occurring value. 0 is mostly used for indicator/dummy/binary variables. Other logical values can also be used based on their business implications.

OPTIONS ON THE PROC SQL STATEMENT

Restricting Row Processing with the INOBS= and OUTOBS= Options

There are several useful options that can be used in the PROC SQL statement to help control the appearance of the report. Note that once coded, these options will apply to all SELECT statements within the PROC SQL step unless a RESET statement is used:

PROC SQL INOBS=5 OUTOBS=4 DOUBLE;
SELECT STORE_NAME, (GROSS_SALES * .05) AS TAX
FROM Online_Sales;
QUIT;

INOBS=n
Restricts the number of rows (observations) that PROC SQL retrieves from any single source. This is very useful for debugging queries on large tables.
For example, if you specify INOBS=10, then PROC SQL uses only 10 rows from any table or view that is specified in a FROM clause. If you specify INOBS=10 and join two tables without using a WHERE clause, then the resulting table (Cartesian product) contains a maximum of 100 rows. The INOBS= option is similar to the SAS system option OBS=.

OUTOBS=n
Restricts the number of rows (observations) in the output. For example, if you specify OUTOBS=10 and insert values into a table by using a query, then PROC SQL inserts a maximum of 10 rows into the resulting table. Please refer output #1.

For example, we have one dataset called ONLINE_SALES:



 output #1


THE CALCULATED OPTION


The CALCULATED component refers to a previously calculated variable so recalculation is not necessary. The CALCULATED component must refer to a variable created within the same SELECT statement. Please refer output #2.

PROC SQL ;
SELECT STORE_NAME, (GROSS_SALES * .05) AS TAX,
CALCULATED TAX * .01 AS REBATE
FROM Online_Sales;
QUIT;

Output #2

Thursday, August 5, 2010

Use of LIKE in proc sql

Look at the Customer table, and say you wanted to see all people whose last names start with "m". The syntax would be:

data customer;
input Cust_F_name $5. Cust_L_Name $4. id;
cards;
anjo bana 201
minu mana 211
sinu shan 444
sinu malu 442
;run;

proc sql;
SELECT id, Cust_F_name
FROM customer
WHERE Cust_L_Name LIKE 'm%';
quit;

The percent sign (%) is used to represent any possible character (number, letter, or punctuation) or set of characters that might appear after the "m". To find people with last names ending in "m", use '%m', or if you want the "m" in the middle of the word, try '%m%'. The '%' can be used for any characters in the same position relative to the given characters. NOT LIKE displays rows not fitting the given description.

Tuesday, August 3, 2010

My Understanding about Linear Regression - Part I

What is linear regression?


Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable. A two-variable linear regression model is one in which the dependent variable is expressed as a linear function of only a single explanatory variable. The most common method for fitting a regression line is the method of least squares. This method calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line (if a point lies on the fitted line exactly, then its vertical deviation is 0).

Multivariate linear regression


Multivariate regression takes into account several predictive variables simultaneously.

The model is now expressed as

yi = b0 + b1x1i + b2x2i + … + bnxni

Where n = the number of independent variables

yi = expected or predicted value of the dependent variable for case i

b0 = the intercept (if all x's are zero, the expected value of y is b0)

bj = the slope (for every one unit increase in xj, y is expected to change by bj units, given that the other independent variables are held constant)

xji = the value of the jth independent variable for case i

Assumptions of Linear Regression


1. Homoscedasticity – the variance of the error terms is constant for each value of x.

2. Linearity – the relationship between each x and y is linear. To check this, look at the plot(s) of the residuals versus the X value(s). You don’t want to see a clustering of positive residuals or a clustering of negative residuals.

3. Normally Distributed Error Terms – the error terms follow the normal distribution.

4. Independence of Error Terms – successive residuals are not correlated; if they are correlated, it is known as autocorrelation. If possible, use the Durbin-Watson statistic to check this.

Frequently used forms of Linear Regression


(i) Log linear Model

Log linear model is useful in case we need to measure the elasticity

lnY = β0 + β1(lnX1) + β2(lnX2) + … + βn(lnXn) + Error.

Here, β1 gives the elasticity of Y with respect to X1, i.e., the percentage change in Y for a percentage change in X1. This equation is also known as the constant elasticity form because the elasticity of Y with respect to changes in Xn, δlnY/δlnXn = βn, does not vary with Xn. This log-linear form is often used in models of demand and production.

(ii) Semilog Model

ln(Y) = β0 + β1X + Error.

Here, β1 gives the relative change in Y for an absolute change in the value of X.

A semilog model is often used to model growth rates

Derivation: Yt = Y0(1+r)^t, which is the compound interest formula.

Taking logs, log(Yt) = log(Y0) + t*log(1+r), which is of the form β0 + β1(X) with β0 = log(Y0), β1 = log(1+r) and X = t.

β1 = relative change in regressand / absolute change in regressor

If we multiply the relative change in Y by 100, it gives the percentage change, or the growth rate, in Y for an absolute change in X, the regressor. That is, 100 times β1 gives the growth rate in Y.

Another type of semi log model

Y = β0 + β1(lnX) + Error.

Unlike the growth model, in which we are interested in finding the percent growth in Y for an absolute change in X, here we want to find the absolute change in Y for a percent change in X.

β1 = change in Y/change in ln X
     = change in Y/relative change in X


 

Feedback option in Proc sql

The FEEDBACK option expands a SELECT * (ALL) statement into the list of columns that the statement represents.

The FEEDBACK option also displays the resolved values of macros and macro variables.


For example, the following query is expanded in the SAS log:

PROC SQL FEEDBACK;
CREATE TABLE vintage24_48_all_tran AS
SELECT b.HH_id, a.*
FROM online_base_data AS a
INNER JOIN demographic AS b
ON (b.cust_no = a.cust_id)
ORDER BY a.cust_id;
QUIT;

Expanded SELECT * Statement

NOTE: Statement transforms to:

select B.HH_id, A. cust_id, A.product_no, A.serial_no,
A.purch_dt, A.order_no, A.discount_amt, A._unit_price, A.tax_amt,
A.channel, A.coupon, A.payment_amt, A.Last_dt_anyProd, A.First_dt_anyProd, A.customer, A.gross_value_cust
from work.online_base_data A inner join WORK.demographic B on B. cust_no = A.cust_id
order by A.cust_id asc;

Thursday, July 29, 2010

Some useful tips on proc sql

PROC SQL (Structured Query Language) is a powerful data analysis tool. It can perform many of the same operations as found in traditional SAS code, but can often be more efficient because of its dense language structure.

JOINING THREE OR MORE TABLES

PROC SQL can be an effective tool for joining data, particularly when doing associative or three-way joins. Performing this operation using traditional SAS code would require several PROC SORTs and several DATA step merges. The same result can be achieved with one PROC SQL. For example, we need to join three tables, viz. CUSTOMER, SALES and RETURN, keeping only rows where return > 10.


The syntax for getting the table:

PROC SQL; create table customer_data as
SELECT B.Cust_F_name, B.Cust_L_Name, B.return, 
E.id, F.STATE, F.Sales
FROM Return as B, customer as E, SALES as F 
WHERE B.Cust_F_name=E.Cust_F_name AND 
B.Cust_L_name=E.Cust_L_name AND 
E.id=F.id AND 
return > 10; 
QUIT;

How to use drop/keep in proc sql:

One can use drop and keep within proc sql. Here is the syntax:

proc sql; 
create table Scores1 as 
select * 
from customer_data(drop=Cust_L_name Cust_F_name); 
quit; 

Proc sql can be used for creating a macro value list as well; a minimal sketch follows the link below. For more details, please refer to the following link:

http://www2.sas.com/proceedings/sugi29/042-29.pdf
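
A minimal sketch of building a macro value list with SELECT ... INTO (using the customer table created above; the macro variable name lname_list is an assumption):

proc sql noprint;
select distinct Cust_L_Name
  into :lname_list separated by ' '   /* builds a space-separated list of last names */
  from customer;
quit;

%put &lname_list;   /* writes the list to the log */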

Tuesday, July 27, 2010

Effective use of Retain in SAS

1. Holding values of variables across iterations: The RETAIN statement is used to hold the values of variables across iterations of the data step. Normally, all variables in the data step are set to missing at the start of each iteration of the data step. The statement "retain x y" retains the values of the variables x and y across data step iterations, while a RETAIN statement with no variable list retains the values of all variables used in the data step across iterations.


For example, suppose we would like to compute values of y(n) = 2*y(n-1) with y(1) = 1.

data one;

if _n_=1 then y=1;

else y=2*y1;

y1=y;

if _n_ =100 then stop;

run;

Since values are set to missing at the start of data step iteration, the data set “one” will contain one value 1 and the other 99 values of y will be missing.

The following program produces the desired data set.

data two;

retain y1;

if _n_=1 then y=1;

else y=2*y1;

y1=y;

if _n_ =100 then stop;

run;

Reference: http://javeeh.net/sasintro/intro84.html

2. Create time interval using Retain Statement: The RETAIN statement causes a variable to retain its value from one iteration of the DATA step to the next. RETAIN is useful when calculating these time intervals between visits because each unique visit is in different records within the same data set. For more details please check the following link:

http://www2.sas.com/proceedings/sugi25/25/cc/25p100.pdf

3. To create count/order variables: A counter variable can be created to identify the sequential number of the visits. Here is the syntax:

proc sort data=bbb out=xxx;
by cust_id purch_dt purchase_id;
run;

data xxx1;
set xxx;
by cust_id purch_dt purchase_id;
Lpurch = lag(purchase_id);   /* previous purchase id */
Ldt = lag(purch_dt);         /* previous purchase date */
if first.cust_id then do;    /* no previous purchase for a new customer */
Lpurch = .;
Ldt = .;
end;
retain cnt 0;
if first.cust_id then cnt = 1;
else if (purch_dt = Ldt and purchase_id = Lpurch) then cnt = cnt;   /* same purchase: keep the count */
else cnt = cnt + 1;
run;

For more details, please visit the following link:

http://www.wuss.org/proceedings07/Posters/POS_Worden_DatumToRemember.pdf

4. To have all the variables in the dataset in a particular order (re-ordering variables): Any statement that lists the variables in the desired order before any other statement will reorder the variables in the newly created dataset. The most common are the RETAIN, LENGTH, ATTRIB, LABEL, and FORMAT statements. The RETAIN statement is considered the safest to use. The reason for this is that all variables coming from an input dataset are automatically retained, so using a RETAIN statement to reorder variables in a dataset has no unintended side effects. All other statements require the programmer to specify some attribute of each variable. Here is the syntax:

data high_perf_model_score_1;

retain TITLE_CODE DATA_TYPE_NAME Customer_ID MODEL_NAME MODEL_RUN_DATE

MODEL_LEVEL_CODE SCORE RANK;

set high_perf_model_score;

run;

If we need to use the file as an input for some automated process, the order of the variables is very important. The RETAIN statement is very helpful in this situation.

For more details, please check the link:

http://www.sascommunity.org/wiki/Re-ordering_variables

Saturday, July 24, 2010

Some more details on Hash Table join:

Hash Table Join is very useful and effective, though quite complicated; a minimal sketch is given after the links below.

http://www2.sas.com/proceedings/sugi30/236-30.pdf

Some error messages might appear while joining two datasets using Hash Table. It is quite well explained in support.sas.com. Please check the link below:

http://support.sas.com/kb/39/621.html
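
As a minimal sketch of a hash-object lookup join (the dataset names demographic and online_base_data and the variables cust_no, cust_id and HH_id follow the earlier PROC SQL FEEDBACK example; the structure is an illustration, not the paper's exact code):

data joined;
  if 0 then set demographic(keep=cust_no HH_id);   /* defines the host variables for the hash object */
  if _n_ = 1 then do;
    declare hash h(dataset: 'demographic(keep=cust_no HH_id)');
    h.defineKey('cust_no');
    h.defineData('HH_id');
    h.defineDone();
  end;
  set online_base_data;
  cust_no = cust_id;             /* align the key names before the lookup */
  if h.find() = 0 then output;   /* keep only customers found in the demographic table */
run;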

Friday, July 23, 2010

Some useful information about Proc sort in SAS

Proc sort in SAS is generally used to remove unnecessary duplicate records from datasets. PROC SORT sorts data so that further analysis, such as "BY" variable processing, can be performed.


Difference between NODUP and NODUPKEY options

The NODUP (or NODUPRECS) option checks for and eliminates duplicate observations. If one specifies this option, PROC SORT compares all variable values for each observation to those for the previous observation that was written to the output data set. If an exact match is found, the observation is not written to the output data set.

The NODUPKEY option checks for and eliminates observations with duplicate BY variable values. If you specify this option, PROC SORT compares all BY variable values for each observation to those for the previous observation written to the output data set. If an exact match using the BY variable values is found, the observation is not written to the output data set. One can specify multiple BY variables in either ascending or descending order.
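
A minimal sketch contrasting the two options (the dataset trans and its variables cust_id and purch_dt are assumed names for illustration):

proc sort data=trans out=trans_norecdup noduprecs;
by cust_id purch_dt;   /* drops observations that are exact duplicates across all variables */
run;

proc sort data=trans out=trans_nokeydup nodupkey;
by cust_id;            /* keeps only the first observation for each cust_id */
run;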

For more details, please refer to the following link:

http://www2.sas.com/proceedings/sugi30/037-30.pdf

DUPOUT Option in Proc sort

When the data is unfamiliar, exploration often requires searching for duplicates. SAS version 9 contains a new SORT procedure option named DUPOUT= that puts all deleted duplicate observations into a separate data set.



Here is the syntax:

proc sort data = dataset out = out_dataset dupout = DUP_dataset nodupkey ;

by cust_id;

run;

For more details, please refer to the following link:

http://www2.sas.com/proceedings/sugi31/164-31.pdf

Thursday, July 22, 2010

Different ways of Merging/Joining two datasets

There are different ways of merging/joining two datasets like:
1. Using the DATA step MERGE statement
a. It is necessary to sort the datasets before merging, and sorting can take a long time when the datasets are big.
b. It is necessary for the joining variable(s) to have the same name in both datasets.
2. Using PROC SQL
a. There is no need to sort the datasets, and the joining variables can even have different names in the two datasets.
b. It uses a large amount of work space and creates problems when the datasets are big and there is a work space limitation.
3. Merge using PROC FORMAT
a. This is a really quick way of merging. It is used when key values from one file are needed to extract records from another file containing the same key, or BY variable (see the sketch after the links below).
4. For syntax, more details and more ways of merging, you can refer to the following links:
http://www2.sas.com/proceedings/sugi30/054-30.pdf
http://www.nesug.org/Proceedings/nesug09/po/po13.pdf
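
A minimal sketch of the PROC FORMAT merge mentioned in point 3 above (the datasets demographic and online_base_data and the variables cust_no, cust_id and HH_id follow the earlier examples; the keys and HH_id are assumed to be character variables):

data fmt_cntl;
  set demographic(keep=cust_no HH_id rename=(cust_no=start HH_id=label));
  retain fmtname '$hhfmt' type 'C';   /* control data set for a character format */
run;

proc format cntlin=fmt_cntl;
run;

data joined;
  set online_base_data;
  HH_id = put(cust_id, $hhfmt.);   /* the format lookup acts as the merge */
run;

Keys that are not present in the lookup table fall through as their own value unless an OTHER range is added to the format.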