Saturday, August 7, 2010

My Understanding about Linear Regression - Part II

Important terminologies


Variables: Variables are measurements of occurrences of a recurring event taken at regular intervals or measurements of different instances of similar events that can take on different possible values. E.g. the price of gasoline recorded at monthly intervals, the height of children between the age of 6 and 10, Revenue per customer.

Dependent Variable: A variable whose value depends on the value of other variables in a model. E.g. revenue per customer, which depends directly on the purchase quantity and the price of the product(s).

Independent Variables: Variables whose value does not depend on other variables in a model. E.g. in the above example, purchase quantity and product price would be the independent variables driving revenue per customer. An independent variable is specific to a model: a variable that is independent in one model can be dependent in another.

Lurking Variables: If non-linear trends are visible in the relationship between an explanatory and dependent variable, there may be other influential variables to consider. A lurking variable exists when the relationship between two variables is significantly affected by the presence of a third variable which has not been included in the modeling effort. Since such a variable might be a factor of time (for example, the effect of political or economic cycles), a time series plot of the data is often a useful tool in identifying the presence of lurking variables.

Residual: Once a regression model has been fit to a group of data, examination of the residuals (the deviations from the fitted line to the observed values) allows the modeler to investigate the validity of his or her assumption that a linear relationship exists. Plotting the residuals on the y-axis against the explanatory variable on the x-axis reveals any possible non-linear relationship among the variables, or might alert the modeler to investigate lurking variables.
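Since the post's code is in SAS, here is a minimal pure-Python sketch of the residual idea, using hypothetical data. It fits an ordinary least squares line by hand and computes the residuals (observed minus fitted values); these are the quantities one would plot against the explanatory variable to look for non-linear patterns.

```python
# Hypothetical data: x is the explanatory variable, y the dependent variable.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares slope and intercept.
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
     / sum((xi - mean_x) ** 2 for xi in x)
b0 = mean_y - b1 * mean_x

# Residuals: observed values minus the values on the fitted line.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# With an intercept in the model, the residuals sum to (numerically) zero.
assert abs(sum(residuals)) < 1e-9
```

A residual plot (residuals on the y-axis, x on the x-axis) that shows a curve or a trend, rather than a random scatter around zero, signals that the linearity assumption is questionable.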

Extrapolation: Whenever a linear regression model is fit to a group of data, the range of the data should be carefully observed. Attempting to use a regression equation to predict values outside this range is often inappropriate and may yield implausible answers. This practice is known as extrapolation. Consider, for example, a linear model that relates weight gain to age for young children. Applying such a model to adults, or even teenagers, would be absurd, since the relationship between age and weight gain is not consistent across all age groups.
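The weight-gain example above can be sketched in pure Python with hypothetical growth data. A line fit on children aged 2 to 10 predicts sensibly within that range, but extrapolating the same line to age 80 gives an absurd weight, because children's growth rate does not continue through adulthood.

```python
# Hypothetical child growth data: age in years vs. weight in kg.
age    = [2, 3, 4, 5, 6, 7, 8, 9, 10]
weight = [12.5, 14.5, 16.5, 18.5, 21.0, 23.0, 25.5, 28.0, 31.0]

n = len(age)
mean_a = sum(age) / n
mean_w = sum(weight) / n
b1 = sum((a - mean_a) * (w - mean_w) for a, w in zip(age, weight)) \
     / sum((a - mean_a) ** 2 for a in age)
b0 = mean_w - b1 * mean_a

# Prediction inside the observed range is reasonable (close to ~21 kg).
print(b0 + b1 * 6)

# Extrapolating far outside the range predicts roughly 190 kg at age 80,
# which is clearly absurd for the relationship being modeled.
print(b0 + b1 * 80)
```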

Modeling and Validation Population
Once the dependent variable has been defined, the entire population is split into modeling and validation population. This split is done in a random way, so that the average value of dependent variable for both these samples is roughly the same. It is assumed that the characteristics of the independent variables would be similar in the two samples, as it is a random split. The modeling population is used to build the model and then it is implemented on the validation population. The performance of a model should be similar in both modeling and validation samples.

Here is the sample SAS code:

/* Randomly split the customer base into modeling and validation samples */
data wl_model(drop=x) wl_validation(drop=x);
   set vintage24_48_all_cust_1;
   x = ranuni(2345);             /* uniform random number, seed 2345 */
   if x < 0.5 then output wl_model;
   else output wl_validation;
run;

/* Compare the mean of the dependent variable across the three datasets */
proc means data=vintage24_48_all_cust_1 mean;
   var ln_gross_rev;
run;

proc means data=wl_model mean;
   var ln_gross_rev;
run;

proc means data=wl_validation mean;
   var ln_gross_rev;
run;

Instead of a 50:50 split, one can use a 60:40 or 70:30 split for the modeling and validation datasets (change the 0.5 cutoff accordingly).
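For readers without SAS, the same random split can be sketched in pure Python on hypothetical data. A fixed seed plays the role of `ranuni(2345)`, and `split_fraction` is the 0.5 cutoff; because assignment is random, the mean of the dependent variable should be roughly equal in the two samples.

```python
import random

# Hypothetical log gross revenue values, one per customer.
data_rng = random.Random(0)
ln_gross_rev = [data_rng.gauss(5.0, 1.0) for _ in range(10_000)]

split_rng = random.Random(2345)    # fixed seed, analogous to ranuni(2345)
split_fraction = 0.5               # use 0.6 or 0.7 for a 60:40 or 70:30 split

model, validation = [], []
for value in ln_gross_rev:
    (model if split_rng.random() < split_fraction else validation).append(value)

def mean(xs):
    return sum(xs) / len(xs)

# Because the assignment is random, the two sample means should be close.
print(mean(model), mean(validation))
```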

Missing Imputation and normality of dependent variable

All the independent variables and the dependent variable that go into the model should be free of missing values. If the dependent variable is missing for an observation, which is rarely the case, that observation should be discarded. For linear regression, if the dependent variable is skewed, one can use a log transformation so that the dependent variable is approximately normally distributed (note that the log is defined only for positive values, so zero or negative revenue records need separate handling). Here is the code to perform the log transformation:

data dep_var_1;
   set dep_var;
   ln_gross_rev = log(gross_rev);   /* defined only for gross_rev > 0 */
run;

One can use proc univariate to plot the distribution of the dependent variable. The following code can be used:

proc univariate data= dep_var_1;
var ln_gross_rev;
histogram / normal
midpoints = 1 2 3 4 5 6 7 8 9 10
ctext = blue;
run;
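The effect of the log transformation can also be checked numerically. The pure-Python sketch below (hypothetical lognormal-like revenue data) computes the sample skewness coefficient, which is near zero for a symmetric distribution: raw revenue is strongly right-skewed, while its log is close to symmetric.

```python
import math
import random

# Hypothetical right-skewed revenue figures (lognormal-like).
rng = random.Random(0)
gross_rev = [math.exp(rng.gauss(5.0, 1.0)) for _ in range(5000)]

def skewness(xs):
    """Sample skewness: approximately 0 for a symmetric distribution."""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 3 for x in xs) / (n * s2 ** 1.5)

ln_gross_rev = [math.log(x) for x in gross_rev]

# Raw revenue is strongly right-skewed; the log transform removes most of it.
print(skewness(gross_rev), skewness(ln_gross_rev))
```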

For the independent variables, missing values are replaced/imputed. Some commonly used imputation techniques include:

1. Replacing by the median,
2. Replacing by the mode,
3. Replacing by 0,
4. Replacing by any other logical value.

The median is preferred to the mean because it is not affected by extreme values. The mode, the most frequently occurring value, is used when the variable is discrete or categorical. Zero is typically used for indicator/dummy/binary variables. Other logical values can also be used based on their business implications.
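The three common cases above can be sketched in pure Python with hypothetical data, using `None` to stand in for missing values. Note how the 250000 outlier would distort a mean-based fill but leaves the median untouched.

```python
from statistics import median, mode

# Hypothetical raw values with missing entries represented as None.
income   = [52000, 48000, None, 61000, None, 55000, 250000]   # continuous
segment  = ["A", "B", "A", None, "A", "C", "A"]               # categorical
has_card = [1, 0, None, 1, 0, None, 1]                        # binary flag

def impute(values, fill):
    return [fill if v is None else v for v in values]

# Median for continuous variables: robust to the 250000 outlier.
income_filled = impute(income, median([v for v in income if v is not None]))

# Mode for categorical variables: the most frequent level.
segment_filled = impute(segment, mode([v for v in segment if v is not None]))

# Zero for indicator/dummy/binary variables.
has_card_filled = impute(has_card, 0)

print(income_filled, segment_filled, has_card_filled)
```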

1 comment:

  1. Hi Anjanita,
    Niladry here...I hope you remember me :)
    This is really a nice Article...

    A few other missing-imputation techniques we can use, in addition to the ones mentioned above, are:
    1) Regression Technique - Basically we use this technique when the variable with the missing values is correlated with another variable that is fully populated. We can then fit an equation between the two variables and use it to impute the missing values.
    2) Class Mean Substitution - This method uses the mean values within subgroups of other variables or combinations of variables.

    Share your thoughts on the same.
