Thursday, June 16, 2011
What is the difference between linear regression and logistic regression?
Linear regression analyzes the relationship between two or more variables. For each subject (or experimental unit), one knows the independent variables (X's) and the dependent variable (Y), and one wants to find the best straight line through the data points. In some situations the slope and/or intercept have a scientific meaning. For example, in a log-linear model such as ln Y = β0 + β1 ln X1 + β2 ln X2 + … + βn ln Xn + Error, β1 gives the elasticity of Y with respect to X1, i.e., the percentage change in Y associated with a one percent change in X1, holding the other X's fixed. This equation is also known as the constant elasticity form, because the elasticity of Y with respect to Xn, ∂ ln Y / ∂ ln Xn = βn, does not vary with Xn. This log-linear form is often used in models of demand and production. In other cases, one uses the linear regression line as a standard curve to find values of Y from the X's.
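A quick numerical check of the constant elasticity property, as a minimal sketch. The scale `a` and exponent `beta` below are made-up values chosen for illustration, not estimates from any data set.

```python
import math

def constant_elasticity(x, a=2.0, beta=0.7):
    """Y = a * X**beta, i.e. ln Y = ln a + beta * ln X (illustrative values)."""
    return a * x ** beta

def log_log_slope(x, pct=0.01):
    """Change in ln Y divided by change in ln X for a small move in X."""
    y0 = constant_elasticity(x)
    y1 = constant_elasticity(x * (1.0 + pct))
    return (math.log(y1) - math.log(y0)) / math.log(1.0 + pct)

# The slope in log-log space is the same at every starting level of X.
elasticities = [log_log_slope(x) for x in (1.0, 10.0, 250.0)]
print(elasticities)
```

Whatever the starting value of X, the ratio comes back as `beta`, which is what "constant elasticity" means.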
With a single explanatory variable, a linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of Y when X = 0).
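As a rough illustration of fitting Y = a + bX, the sketch below uses the standard closed-form least-squares estimates. The five data points are invented for the example.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the line Y = a + b*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept: value of Y when X = 0
    return a, b

# Invented data, roughly following Y = 1 + 2X.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

a, b = fit_line(xs, ys)
print(f"Y = {a:.2f} + {b:.2f} X")
```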
Logistic regression differs from linear regression in the kind of response it models: the predictors may be continuous or categorical, but the response is categorical, most often dichotomous (only two options). Logistic regression gives the probability of an event occurring. Each one-unit increase in a predictor changes the log-odds of the event by that predictor's coefficient, so the resulting change in probability depends on the starting point.
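To make the effect of a one-unit increase concrete, here is a minimal sketch. The coefficients `b0` and `b1` are hypothetical. In a logistic model the odds are multiplied by the same factor, e^b1, for every one-unit step, while the change in probability varies with the starting value.

```python
import math

def logistic(z):
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical intercept and slope on the log-odds scale.
b0, b1 = -4.0, 0.5

def prob(x):
    return logistic(b0 + b1 * x)

def odds(x):
    p = prob(x)
    return p / (1.0 - p)

# Odds ratio for a one-unit increase is exp(b1), wherever we start.
odds_ratio_low = odds(5.0) / odds(4.0)
odds_ratio_high = odds(11.0) / odds(10.0)

# The change in probability is not constant.
prob_change_low = prob(5.0) - prob(4.0)
prob_change_mid = prob(9.0) - prob(8.0)
print(odds_ratio_low, odds_ratio_high, prob_change_low, prob_change_mid)
```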
For example, one might be interested in the relationship between body size and weight of different mammals, so one would use linear regression (please note that the relationship might not actually be linear). However, one might instead want to look at the relationship between the body size of different birds and their chance of surviving a winter storm. In that case one measures the body size of birds that do and don't survive the storm and fits a logistic regression that gives the probability that a bird will survive if it has a certain body size. One can also fit a multiple logistic regression, to see the probability that a bird will survive at a given body size and a given wing span.
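The bird example can be sketched in code. The body sizes (expressed as deviations from an average size) and survival outcomes below are invented, and plain gradient ascent on the log-likelihood stands in for whatever fitting routine a statistics package would actually use.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Maximise the Bernoulli log-likelihood by gradient ascent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            resid = y - sigmoid(b0 + b1 * x)  # observed minus predicted
            g0 += resid
            g1 += resid * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Invented data: body size (deviation from average) and survival (1 = survived).
sizes    = [-3, -2, -2, -1, -1, 0, 0, 1, 1, 2, 2, 3]
survived = [ 0,  0,  0,  0,  1, 0, 1, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(sizes, survived)

def survival_probability(size):
    return sigmoid(b0 + b1 * size)

print(survival_probability(-2), survival_probability(2))
```

With these made-up numbers the fitted slope is positive, so larger birds get a higher predicted chance of survival.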
Logistic regression belongs to the class of 'discrete analysis' techniques. Predicting outcomes that are yes/no or one of several categories (a, b, or c) is very different from predicting outcomes that are simple measures, such as revenue.
For modeling percentage data, logistic regression is the better option. A percentage is simply a convenient way to summarize binomial data (the number of successes out of a number of trials), and logistic regression, not linear regression, should be used for binomial data.
When linear regression is used for binary data, there are three problems:
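1. The variance of the error term is not constant.
2. The error term is not normally distributed.
3. The fitted values are not constrained to lie between 0 and 1, even though they are meant to be probabilities.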
The first problem can be handled by using weighted least-squares regression. The second matters less when the sample size is very large, because the method of least squares provides estimators that are asymptotically normal under fairly general regularity conditions, even when the distribution of the error term is far from normal. But the third problem is insurmountable.
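The difficulties with fitting a straight line to 0/1 data can be seen numerically. The sketch below uses invented data and the ordinary closed-form least-squares line. It shows fitted values falling outside the 0 to 1 range, and it shows that the variance of a 0/1 response depends on its mean.

```python
def ols_line(xs, ys):
    """Closed-form least-squares intercept and slope for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up predictor values and 0/1 outcomes, for illustration only.
xs = [-3, -2, -2, -1, -1, 0, 0, 1, 1, 2, 2, 3]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]

intercept, slope = ols_line(xs, ys)
fitted = [intercept + slope * x for x in xs]

# Fitted "probabilities" escape [0, 1] at both ends of the range.
out_of_range = [f for f in fitted if f < 0.0 or f > 1.0]

# A 0/1 response with mean p has variance p * (1 - p), which changes with p.
bernoulli_variance = {p: p * (1 - p) for p in (0.1, 0.5, 0.9)}
print(out_of_range, bernoulli_variance)
```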
Logistic regression may seem much more complicated than its linear counterpart. Most statistical software packages can fit a logistic regression with no more effort than a linear regression, but interpreting the coefficients and testing the goodness of fit of a logistic model is not as easy or straightforward.
Saturday, January 1, 2011
A Powerful Classification Technique in Data Mining - Discriminant Analysis (Part VI)
Computing scores and posterior probabilities using the linear discriminant function coefficients
t = subscript to distinguish the groups
n = number of quantitative variables
C = constant
L = n-by-1 vector of coefficients for the linear discriminant function
X = n-by-1 observation vector
The equation to obtain the score for Group t on a new observation X using METHOD=NORMAL and POOL=YES, or POOL=TEST with an insignificant p-value, is:
Score_t = C_t + L_t' * X
For example, here are the linear discriminant function coefficients (please refer to the link: http://support.sas.com/onlinedoc/913/getDoc/en/statug.hlp/discrim_sect29.htm)
Linear Discriminant Function for Crop:
For example, say a new set of values for x1 through x4 is (16 30 35 31). To classify this observation into a group, you compute the posterior probability of this observation belonging to each group. The observation is classified into the group for which it has the largest probability. To compute the probabilities, you first need to compute the score on each group.
For example, the score on the Clover group is
Clover_score = -10.98457 + 0.08907*16 + 0.17379*30 + 0.11899*35 + 0.15637*31
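The same arithmetic can be checked in a few lines of Python. This is only a sketch of the calculation above. The helper name `linear_score` is made up for the example and is not part of SAS.

```python
def linear_score(constant, coefficients, observation):
    """Constant plus the dot product of coefficients and observation."""
    return constant + sum(c * x for c, x in zip(coefficients, observation))

# Clover constant and coefficients, copied from the equation above.
clover_constant = -10.98457
clover_coefficients = [0.08907, 0.17379, 0.11899, 0.15637]

# New observation (x1, x2, x3, x4).
observation = [16, 30, 35, 31]

clover_score = linear_score(clover_constant, clover_coefficients, observation)
print(round(clover_score, 5))
```

The score works out to roughly 4.666. The scores for the other groups are obtained the same way from their own constants and coefficients.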
PROC DISCRIM doesn't report the scores anywhere because their magnitudes are not interpretable on their own. The LIST and TESTLIST options print the posterior probabilities in the printed output, and the OUT= and TESTOUT= options save them to a data set.
The posterior probabilities are computed as
P(Clover | X) = e^(Clover_score) / denominator
where denominator = e^(Clover_score) + e^(Corn_score) + e^(Cotton_score) + e^(Soybeans_score) + e^(Sugarbeets_score), and similarly for each of the other groups.
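A minimal sketch of this step in Python follows. The five scores are placeholders chosen for illustration, not values taken from the SAS output; only the way they are turned into probabilities follows the text.

```python
import math

def posterior_probabilities(scores):
    """Exponentiate each group score and divide by the sum over all groups."""
    denominator = sum(math.exp(s) for s in scores.values())
    return {group: math.exp(s) / denominator for group, s in scores.items()}

# Placeholder scores (NOT from the SAS example).
scores = {
    "Clover": 2.0,
    "Corn": 0.5,
    "Cotton": 1.5,
    "Soybeans": 1.0,
    "Sugarbeets": 1.2,
}

posteriors = posterior_probabilities(scores)

# Classify into the group with the largest posterior probability.
predicted_group = max(posteriors, key=posteriors.get)
print(posteriors, predicted_group)
```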
Computing scores and posterior probabilities using the quadratic discriminant function coefficients
t = subscript to distinguish the groups
n = number of quantitative variables
C = constant
L = n-by-1 vector of coefficients for the linear discriminant function
Q = n-by-n matrix of coefficients for the quadratic discrim function
X = n-by-1 observation vector
The equation to obtain the score for Group t on a new observation X using METHOD=NORMAL and POOL=NO, or POOL=TEST with a significant p-value, is:
Score_t = C_t + L_t' * X + X' * Q_t * X
For example, here is how to obtain the score on one group (please refer to the link: http://support.sas.com/onlinedoc/913/getDoc/en/statug.hlp/discrim_sect28.htm)
For example, say there is a new observation with these values:
SepalLength = 80
SepalWidth = 30
PetalLength = 25
PetalWidth = 16
In PROC IML, it would look like
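proc iml;
/* Print every matrix as it is created */
reset print;
/* Constant term C for the group */
const = -75.8208;
/* Linear coefficients L, stored as a 1-by-4 row vector */
linear = {0.737 1.325 0.623 0.966};
/* Quadratic coefficients Q, a symmetric 4-by-4 matrix */
quadratic = {-0.053 0.017 0.050 -0.009,
0.017 -0.079 -0.006 0.042,
0.050 -0.006 -0.067 0.014,
-0.009 0.042 0.014 -0.097};
/* New observation: SepalLength SepalWidth PetalLength PetalWidth */
obs = {80 30 25 16};
/* Score = C + L*x + x'*Q*x; the backtick transposes the row vector */
score = const + linear*obs` + obs*quadratic*obs`;
quit;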
Note: if you don't have IML, you need to write out the equation manually to compute the scores.
For more details, please refer to the link: http://support.sas.com/onlinedoc/913/getDoc/en/statug.hlp/discrim_sect17.htm
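For readers without IML, the same score can be written out by hand. The sketch below uses plain Python with the constant, linear coefficients, quadratic coefficients, and observation copied from the IML step above. The function name `quadratic_score` is made up for this example.

```python
def quadratic_score(const, linear, quadratic, obs):
    """const + linear . obs + obs' * quadratic * obs, written out explicitly."""
    linear_part = sum(l * x for l, x in zip(linear, obs))
    # quadratic * obs, one row at a time
    q_times_obs = [sum(q * x for q, x in zip(row, obs)) for row in quadratic]
    quadratic_part = sum(x * qx for x, qx in zip(obs, q_times_obs))
    return const + linear_part + quadratic_part

const = -75.8208
linear = [0.737, 1.325, 0.623, 0.966]
quadratic = [
    [-0.053,  0.017,  0.050, -0.009],
    [ 0.017, -0.079, -0.006,  0.042],
    [ 0.050, -0.006, -0.067,  0.014],
    [-0.009,  0.042,  0.014, -0.097],
]
obs = [80, 30, 25, 16]  # SepalLength, SepalWidth, PetalLength, PetalWidth

score = quadratic_score(const, linear, quadratic, obs)
print(round(score, 4))
```

With these inputs the linear part is 129.741 and the quadratic part is -175.927, giving a score of about -122.007.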
The posterior probability of x belonging to group t is then equal to
p(t | x) = exp(-0.5 * D_t^2(x)) / sum over all groups u of exp(-0.5 * D_u^2(x))
The discriminant scores are -0.5 * D_u^2(x).
So to compute the three posterior probabilities for this new observation on the three groups, Setosa, Versicolor, and Virginica (please refer to the example at: http://support.sas.com/onlinedoc/913/getDoc/en/statug.hlp/discrim_sect28.htm), first compute the score for each group as above. Each posterior probability is then computed by dividing the exponentiated score by the sum of all three exponentiated scores.
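Quadratic scores can be large negative numbers, so when coding this step by hand it can help to subtract the largest score before exponentiating. That leaves the probabilities unchanged but avoids underflow. The sketch below assumes three placeholder scores that are not taken from the SAS output.

```python
import math

def stable_posteriors(scores):
    """exp(score) / sum(exp(scores)), computed after shifting by the max score."""
    top = max(scores.values())
    shifted = {group: math.exp(s - top) for group, s in scores.items()}
    total = sum(shifted.values())
    return {group: value / total for group, value in shifted.items()}

# Placeholder scores for the three iris groups (NOT from the SAS example).
scores = {"Setosa": -250.0, "Versicolor": -122.0, "Virginica": -125.0}

posteriors = stable_posteriors(scores)
predicted_group = max(posteriors, key=posteriors.get)

# Shifting every score by the same amount does not change the result,
# even when exp() of the raw scores would underflow to zero.
far_scores = {group: s - 1000.0 for group, s in scores.items()}
far_posteriors = stable_posteriors(far_scores)
print(posteriors, predicted_group)
```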