Thursday, November 11, 2010

A Powerful Classification Technique in Data Mining - Discriminant Analysis (Part II)

Discriminant Analysis attempts to find a rule that separates the predefined groups to the maximum possible extent.
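
For example, here is a minimal sketch of fitting a linear discriminant rule with scikit-learn (an illustrative library choice, not one used in this post) on the classic Iris data:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Metric predictors and a priori group labels.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit the linear discriminant functions (equal covariance matrices are assumed).
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

print("Held-out accuracy:", lda.score(X_test, y_test))
print("Group membership probabilities for the first test case:", lda.predict_proba(X_test[:1]))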


Discriminant Analysis - Assumptions

The underlying assumptions of Discriminant Analysis (DA) are:

– Each group is normally distributed; Discriminant Analysis is, however, relatively robust to departures from normality.

– The groups defined by the dependent variable exist a priori.

– The predictor variables Xk are multivariate normally distributed, independent, and non-collinear

– The variance/covariance matrices of the predictor variables are the same across the various groups in the population (i.e., homogeneous); a quick check is sketched after this list

– The relationship is linear in its parameters

– Absence of high-leverage outliers

– The sample is large enough: Unequal sample sizes are acceptable. The sample size of the smallest group needs to exceed the number of predictor variables. As a “rule of thumb”, the smallest sample size should be at least 20 for a few (4 or 5) predictors. The maximum number of independent variables is n - 2, where n is the sample size. While such a small sample may work, it is not encouraged; generally it is best to have 4 or 5 times as many observations as independent variables

– Errors are randomly distributed
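
Two of the assumptions above (within-group normality and homogeneity of the covariance matrices) can be spot-checked informally. The sketch below assumes a NumPy array X of metric predictors and a group label vector y (both names are hypothetical):

import numpy as np
from scipy import stats

def spot_check_assumptions(X, y):
    # Informal checks only; formal tests (e.g., Box's M) are not covered here.
    for g in np.unique(y):
        Xg = X[y == g]
        # Shapiro-Wilk on each predictor within the group, a univariate
        # proxy for the multivariate normality assumption.
        for j in range(Xg.shape[1]):
            _, p = stats.shapiro(Xg[:, j])
            print(f"group {g}, predictor {j}: Shapiro-Wilk p-value = {p:.3f}")
        # Group covariance matrix; large differences across groups suggest
        # the homogeneity assumption is violated.
        print(f"group {g} covariance matrix:\n{np.cov(Xg, rowvar=False)}")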

Drawback of Discriminant Analysis

– An important drawback of discriminant analysis is its dependence on a relatively equal distribution of group membership. If one group within the population is substantially larger than the other group, as is often the case in real life, discriminant analysis might classify all observations into only one group. An equal good-bad sample should be chosen for building the discriminant analysis model (one way to compensate via class priors is sketched after this list).

– Another significant restriction of discriminant analysis is that it can’t handle categorical independent variables.

– Discriminant analysis is more rigid than logistic regression in its assumptions. In contrast to ordinary linear regression, discriminant analysis does not have unique coefficients. Each of the coefficients depends on the other coefficients in the estimation and therefore there is no way of determining the absolute value of any coefficient.
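
One way to soften the imbalance problem noted above is to supply the class priors explicitly rather than let them be estimated from an unbalanced sample. The sketch below uses scikit-learn's LinearDiscriminantAnalysis on simulated data, purely for illustration:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Simulated unbalanced two-group sample: 950 "good" vs 50 "bad" cases.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(950, 3)),
               rng.normal(1.0, 1.0, size=(50, 3))])
y = np.array([0] * 950 + [1] * 50)

# Default: priors estimated from the sample, so the large group dominates.
lda_default = LinearDiscriminantAnalysis().fit(X, y)

# Equal priors, mimicking an "equal good-bad" development sample.
lda_equal = LinearDiscriminantAnalysis(priors=[0.5, 0.5]).fit(X, y)

print("Cases assigned to the small group (default priors):", int((lda_default.predict(X) == 1).sum()))
print("Cases assigned to the small group (equal priors):  ", int((lda_equal.predict(X) == 1).sum()))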

Discriminant Analysis Vs Logistic Regression



Similarity: Both techniques examine an entire set of interdependent relationships

Difference: Discriminant analysis assumes multivariate normality and homogeneous covariance matrices of the independent variables, whereas logistic regression makes no such distributional assumptions and can accommodate categorical independent variables.


Discriminant Analysis Vs ANOVA

Similarity: Both techniques examine an entire set of interdependent relationships

Difference: In discriminant analysis the independent variables are metric, whereas in ANOVA they are categorical.

Reference:

http://userwww.sfsu.edu/~efc/classes/biol710/discrim/discrim.pdf
www.shsu.edu/~icc_cmf/cj_742/stats7.doc

2 comments:

  1. "An important drawback of discriminant analysis is its dependence on a relatively equal distribution of group membership. If one group within the population is substantially larger than the other group, as is often the case in real life, Discriminant analysis might classify all observations in only one group. An equal good-bad sample should be chosen for building the discriminant analysis model."


    In some problems, one class is simply more prevalent than the other, over a large fraction of the input space. Consequently, it may be helpful to consider class probabilities rather than simple class predictions. The linear discriminant functions can be converted to estimated probabilities by applying the softmax transform (exponentiate everything and normalize to a sum of 1.0). See, for instance:

    http://matlabdatamining.blogspot.com/2010/12/linear-discriminant-analysis-lda.html
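
    For instance, a minimal NumPy sketch of that softmax conversion (the score values below are made up for illustration):

    import numpy as np

    # Hypothetical linear discriminant scores for one observation, one per class.
    scores = np.array([2.3, 0.7, -1.1])

    # Exponentiate and normalize to a sum of 1.0; subtracting the maximum
    # first improves numerical stability without changing the result.
    exp_scores = np.exp(scores - scores.max())
    probabilities = exp_scores / exp_scores.sum()

    print(probabilities)  # estimated class membership probabilities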

    -Will Dwinnell

  2. Thanks a lot Will for sharing your views.....
