Thursday, June 16, 2011

What is the difference between linear regression and logistic regression?


Linear regression analyzes the relationship between two or more variables. For each subject (or experimental unit), one knows the independent variables (X's) and the dependent variable (Y), and one wants to find the best straight line through the data points. In some situations, the slope and/or intercept have a scientific meaning. For example, in a log-linear model such as

lnY = β0 + β1(lnX1) + β2(lnX2) + … + βn(lnXn) + error,

β1 gives the elasticity of Y with respect to X1, i.e., the percentage change in Y for a one-percent change in X1. This is also known as the constant-elasticity form, because the elasticity of Y with respect to changes in Xn, ∂lnY/∂lnXn = βn, does not vary with Xn. The log-linear form is often used in models of demand and production. In other cases, one simply uses the fitted regression line as a standard curve to read off values of Y from the X's.
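A quick numerical sketch of the constant-elasticity idea, using made-up data generated with a known elasticity of 0.7 (all numbers here are hypothetical):

```python
import numpy as np

# Hypothetical data generated from Y = 2 * X**0.7, so the true elasticity is 0.7.
X = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
Y = 2.0 * X ** 0.7

# Fit the log-linear form lnY = b0 + b1*lnX; the slope b1 recovers the elasticity.
b1, b0 = np.polyfit(np.log(X), np.log(Y), 1)
print(b1)  # ≈ 0.7: a 1% change in X gives about a 0.7% change in Y
```

Because the data were generated exactly from the constant-elasticity form, the fitted slope matches the true elasticity.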

Generally, a linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of y when x = 0).
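A minimal sketch of fitting the line Y = a + bX by ordinary least squares, using hypothetical data:

```python
import numpy as np

# Hypothetical data: X is the explanatory variable, Y the dependent variable.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit Y = a + bX by least squares; polyfit returns the slope first for degree 1.
b, a = np.polyfit(X, Y, 1)

print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
print(a + b * 6.0)  # predicted Y at X = 6
```

The slope b and intercept a are exactly the quantities described above: b is the change in Y per unit change in X, and a is the value of Y when X = 0.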
Logistic regression is quite different from linear regression: the predictors can be continuous, but the response is categorical or dichotomous (only two options). Logistic regression provides the probability of an event occurring; a one-unit increase in a predictor changes the log-odds of the event by that predictor's coefficient, and hence changes the probability of the event.

For example, one might be interested in the relationship between body size and weight of different mammals, so one would use linear regression (noting that the relationship might not actually be linear). However, one might want to look at the relationship between the body size of different birds and their chance of surviving a winter storm; here one records the body sizes of birds that do and don't survive the storm, and fits a logistic regression that gives the probability that a bird of a given body size will survive. One can also fit a multiple logistic regression to see the probability that a bird will survive at a given body size and a given wing span.
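The bird-survival example can be sketched with entirely hypothetical data: body size as the predictor, survival (1) or not (0) as the response. This fits the logistic model by Newton's method (iteratively reweighted least squares), which is one standard way to maximize the logistic log-likelihood:

```python
import numpy as np

# Hypothetical bird data: body size (cm) and whether the bird survived (1) or not (0).
size     = np.array([10., 11., 12., 13., 14., 15., 16., 17., 18., 19.])
survived = np.array([0,   0,   0,   0,   1,   0,   1,   1,   1,   1])

# Model: P(survive) = 1 / (1 + exp(-(b0 + b1*size))).
X = np.column_stack([np.ones_like(size), size])  # intercept column + predictor
beta = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))          # current fitted probabilities
    W = p * (1.0 - p)                            # IRLS weights
    H = X.T @ (W[:, None] * X)                   # Hessian of the log-likelihood
    beta += np.linalg.solve(H, X.T @ (survived - p))  # Newton step

b0, b1 = beta
# Probability that a bird of body size 16 survives:
p16 = 1.0 / (1.0 + np.exp(-(b0 + b1 * 16.0)))
print(f"P(survive | size=16) = {p16:.2f}")
```

The positive slope b1 means larger birds have higher survival probability; the model returns a probability between 0 and 1 for any body size, which is exactly what the linear model cannot guarantee.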

Logistic regression belongs to the class of 'discrete analysis' techniques. Predicting outcomes that are yes/no or one of (a, b, or c) is very different from predicting outcomes that are simple measures, such as revenue.

For modeling percentage data, logistic regression is the better option. This is because a percentage is simply a convenient way to represent binomial data, and logistic regression (not linear regression) should be used for binomial data.

When linear regression is used for binary data, there are three problems:
- The variance of the error term is not constant,
- The error term is not normally distributed,
- There is no restriction requiring the prediction to fall between 0 and 1.


The first problem can be handled by using weighted least-squares regression. When the sample size is very large, the method of least squares provides estimators that are asymptotically normal under fairly general conditions, even when the distribution of the error term is far from normal. But the third problem is insurmountable.
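A quick numerical illustration of the third problem, using hypothetical 0/1 data: a straight-line fit happily predicts values below 0 and above 1.

```python
import numpy as np

# Hypothetical binary outcomes: a linear fit to 0/1 data is unbounded.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

slope, intercept = np.polyfit(x, y, 1)  # ordinary least-squares line

# Predictions at the extremes fall outside [0, 1]:
print(intercept + slope * 0.0)   # below 0
print(intercept + slope * 10.0)  # above 1
```

No rescaling of the line can fix this: a non-constant straight line is unbounded, so for extreme enough x it must leave [0, 1], whereas the logistic curve stays inside it by construction.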

Logistic regression may seem much more complicated than its linear counterpart. Although most statistical software packages can fit a logistic regression with no more effort than a linear one, interpreting the coefficients and testing the goodness of fit of logistic models is not as easy and straightforward.


