Sunday, November 21, 2010

A Powerful Classification Technique in Data Mining - Discriminant Analysis (Part III)

Terminology:

F value: The ratio of the between-groups sum of squares to the within-groups sum of squares for a variable.

Wilks’ Lambda: The ratio of the within-groups sum of squares to the total sum of squares for the entire set of variables in the analysis. Wilks’ Lambda varies between 0 and 1; it is also called the U statistic.

Classification matrix: A matrix that contains the numbers of correctly classified and misclassified cases.

Hit ratio: The percentage of cases correctly classified by the discriminant function.
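
To make the last two terms concrete, here is a minimal sketch in Python; the actual and predicted group labels are invented purely for illustration.

import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted group memberships for 10 cases
actual    = np.array([1, 1, 1, 2, 2, 2, 1, 2, 1, 2])
predicted = np.array([1, 1, 2, 2, 2, 2, 1, 1, 1, 2])

# Classification matrix: rows = actual group, columns = predicted group
cmat = confusion_matrix(actual, predicted)
print(cmat)

# Hit ratio: percentage of cases on the diagonal (correctly classified)
hit_ratio = 100.0 * np.trace(cmat) / cmat.sum()
print(f"Hit ratio: {hit_ratio:.1f}%")   # 80.0% for this made-up example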

The DISCRIM Procedure

• PROC DISCRIM can be used for many different types of analysis, including the following (see the sketch after this list):

• canonical discriminant analysis

• assessing and confirming the usefulness of the functions (empirical validation and cross-validation)

• predicting group membership on new data using the functions (scoring)

• linear and quadratic discriminant analysis

• nonparametric discriminant analysis
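
As a rough, non-SAS analogue of these capabilities, the sketch below uses Python's scikit-learn (LinearDiscriminantAnalysis and QuadraticDiscriminantAnalysis) rather than PROC DISCRIM itself, to illustrate fitting linear and quadratic functions, cross-validating them, and scoring new data. The data values are invented for illustration.

import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score

# Toy training data: two predictors, two groups (values invented for illustration)
X = np.array([[2.0, 3.1], [1.8, 2.6], [2.4, 3.3],
              [5.1, 6.0], [4.7, 5.5], [5.3, 6.4]])
y = np.array([1, 1, 1, 2, 2, 2])

lda = LinearDiscriminantAnalysis()        # linear discriminant functions
qda = QuadraticDiscriminantAnalysis()     # quadratic discriminant functions

# Empirical validation via cross-validation: hit ratio on each held-out fold
print(cross_val_score(lda, X, y, cv=3))

# Fit on the training data, then score (predict group membership for) new cases
lda.fit(X, y)
qda.fit(X, y)
X_new = np.array([[2.1, 3.0], [5.0, 5.9]])
print(lda.predict(X_new), qda.predict(X_new))   # both should assign group 1, then group 2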

Discriminant Function



Linear discriminant analysis constructs one or more discriminant functions (linear combinations of the predictor variables Xk) such that the different groups differ as much as possible on the resulting score Z:

Z = a + W1X1 + W2X2 + … + WkXk

Where,


Z = Discriminant score, a number used to predict group membership of a case

a = Discriminant constant

Wk = Discriminant weight or coefficient, a measure of the extent to which variable Xk discriminates among the groups of the DV

Xk = An Independent Variable or Predictor variable. Can be metric or non-metric.

Number of discriminant functions = min (number of groups – 1, k).

k = Number of predictor variables.
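
As a small worked example, here is the score computation for one case; the constant and weights below are hypothetical values, not results from any real analysis.

import numpy as np

# Hypothetical fitted discriminant constant and weights for three predictors
a = -4.0
W = np.array([0.8, -0.5, 1.2])      # W1, W2, W3

# Predictor values X1, X2, X3 for one case
X = np.array([3.0, 1.5, 2.0])

# Discriminant score: Z = a + W1*X1 + W2*X2 + W3*X3
Z = a + W @ X
print(Z)    # -4.0 + 2.4 - 0.75 + 2.4 = 0.05

With, say, three groups and these three predictors, min(3 – 1, 3) = 2 discriminant functions would be estimated.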

Discriminant Function: Interpretation
 

• The weights are chosen so that one can compute a discriminant score for each subject and then do an ANOVA on Z.
• More precisely, the weights of the discriminant function are calculated in such a way that the ratio (between-groups SS)/(within-groups SS) is as large as possible.
• The value of this ratio is the eigenvalue.
• The first discriminant function Z1 distinguishes the first group from groups 2, 3, …, N.
• The second discriminant function Z2 distinguishes the second group from groups 3, 4, …, N, and so on.
Note: In the same spirit as OLS regression, discriminant analysis estimates the parameters a and Wk so that the Within-Groups SS is as small as possible relative to the Between-Groups SS.

Partitioning Sums of Squares in Discriminant Analysis


In Linear Regression:

• The Total sum of squares is partitioned into the Regression sum of squares and the Residual sum of squares.

• The goal is to estimate parameters that minimize the Residual SS.

In Discriminant Analysis:

• The Total sum of squares is partitioned into the Between-Groups sum of squares and the Within-Groups sum of squares:

Σi (Zi − Z)² = Σj Σi∈j (Zj − Z)² + Σj Σi∈j (Zi − Zj)²

Where,
i = an individual case,

j = group j

Zi = individual discriminant score

Z = grand mean of the discriminant scores

Zj = mean discriminant score for group j

Here, the goal is to estimate parameters that minimize the Within-Groups Sum of Squares.
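
Below is a minimal numeric sketch of this partition, using made-up discriminant scores for two groups.

import numpy as np

# Hypothetical discriminant scores Z for the cases in two groups
Z_by_group = {
    "group 1": np.array([0.9, 1.1, 1.3, 0.7]),
    "group 2": np.array([-0.8, -1.2, -1.0, -1.0]),
}

all_Z = np.concatenate(list(Z_by_group.values()))
grand_mean = all_Z.mean()                     # Z, the grand mean of the scores

# Total SS: sum over all cases of (Zi - grand mean)^2
ss_total = ((all_Z - grand_mean) ** 2).sum()

# Between-Groups SS: for each group, group size times (group mean - grand mean)^2
ss_between = sum(len(z) * (z.mean() - grand_mean) ** 2 for z in Z_by_group.values())

# Within-Groups SS: for each group, the sum of (Zi - group mean)^2
ss_within = sum(((z - z.mean()) ** 2).sum() for z in Z_by_group.values())

print(ss_total, ss_between + ss_within)   # 8.28 and 8.28: Total = Between + Within
print(ss_between / ss_within)             # the ratio that the discriminant weights maximize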



Compare and Contrast

Canonical Discriminant Analysis

Canonical discriminant analysis is a dimension-reduction technique related to principal component analysis and canonical correlation. Given a nominal classification variable and several interval variables, canonical discriminant analysis derives canonical variables (linear combinations of the interval variables) that summarize between-class variation in much the same way that principal components summarize total variation.

Canonical discriminant analysis is equivalent to canonical correlation analysis between the quantitative variables and a set of dummy variables coded from the classification variable.

Given two or more groups of observations with measurements on several interval variables, canonical discriminant analysis derives a linear combination of the variables that has the highest possible multiple correlation with the groups. This maximal multiple correlation is called the first canonical correlation. The coefficients of the linear combination are the canonical coefficients. The variable defined by the linear combination is the first canonical variable. The second canonical correlation is obtained by finding the linear combination, uncorrelated with the first canonical variable, that has the highest possible multiple correlation with the groups. The process of extracting canonical variables can be repeated until the number of canonical variables equals the number of original variables or the number of classes minus one, whichever is smaller. Canonical variables are also called canonical components.
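
As a short sketch of this extraction, the snippet below uses scikit-learn's LinearDiscriminantAnalysis, whose transform produces the canonical (discriminant) variables; the data are invented, with three groups measured on three interval variables, so at most min(3, 3 − 1) = 2 canonical variables can be extracted.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Three groups measured on three interval variables (values invented for illustration)
X = np.array([[1.0, 2.0, 0.5], [1.2, 1.8, 0.7], [0.9, 2.1, 0.6],
              [3.0, 0.5, 2.0], [3.2, 0.4, 2.2], [2.9, 0.7, 1.9],
              [5.0, 4.0, 4.5], [5.2, 4.1, 4.3], [4.8, 3.9, 4.6]])
y = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])

# Number of canonical variables = min(number of variables, number of classes - 1) = 2
cda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)

# Canonical coefficients (one column per canonical variable)
print(cda.scalings_[:, :2])

# Canonical variables: each case's scores on the two canonical components
print(cda.transform(X))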

Fisher Linear Discriminant Analysis

The most famous example of dimensionality reduction is “principal components analysis” (PCA).

This technique searches for the directions in the data that have the largest variance and subsequently projects the data onto them. In this way, we obtain a lower-dimensional representation of the data that removes some of the “noisy” directions. There are many difficult issues around how many directions one needs to choose, but that is beyond the scope of this note.



PCA is an unsupervised technique and as such does not use the label information in the data. For instance, imagine two cigar-like clusters in two dimensions, where one cigar has y = 1 and the other y = -1. The cigars are positioned in parallel and very close together, such that the variance in the total data set, ignoring the labels, is in the direction of the cigars.

For classification, this would be a terrible projection, because all labels get evenly mixed and we destroy the useful information. A much more useful projection is orthogonal to the cigars, i.e. in the direction of least overall variance, which would perfectly separate the data-cases (obviously, we would still need to perform classification in this 1-D space). So the question is, how do we utilize the label information in finding informative projections?

To that end, Fisher LDA maximizes the following objective:

J(w) = (wᵀ SB w) / (wᵀ SW w)

where SB is the “between classes scatter matrix” and SW is the “within classes scatter matrix”. Note that, because the scatter matrices are proportional to the covariance matrices, we could equally have defined J using covariance matrices; the proportionality constant would have no effect on the solution.
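
Below is a numeric sketch of this objective on the two-cigar picture described above. The data are synthetic, the scatter matrices are formed directly from their usual definitions, and the Fisher direction is obtained as the leading eigenvector of SW⁻¹ SB; this is an illustration, not the only way to compute it.

import numpy as np

rng = np.random.default_rng(0)

# Two parallel, cigar-shaped clusters: long axis along x, separated slightly in y
X1 = rng.normal(size=(200, 2)) * np.array([3.0, 0.3]) + np.array([0.0,  0.6])   # label y = +1
X2 = rng.normal(size=(200, 2)) * np.array([3.0, 0.3]) + np.array([0.0, -0.6])   # label y = -1
X  = np.vstack([X1, X2])

m1, m2, m = X1.mean(axis=0), X2.mean(axis=0), X.mean(axis=0)

# Within-classes scatter matrix SW and between-classes scatter matrix SB
SW = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
SB = len(X1) * np.outer(m1 - m, m1 - m) + len(X2) * np.outer(m2 - m, m2 - m)

# Direction of largest total variance (what PCA would project onto): along the cigars
pca_dir = np.linalg.eigh(np.cov(X.T))[1][:, -1]
print("PCA direction:   ", pca_dir)

# Fisher direction: maximize J(w) = (wᵀ SB w) / (wᵀ SW w),
# i.e. the leading eigenvector of inv(SW) @ SB (here, roughly orthogonal to the cigars)
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(SW) @ SB)
w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
print("Fisher direction:", w / np.linalg.norm(w))

For data like this, the PCA direction comes out along the cigars' long axis, while the Fisher direction is nearly orthogonal to it, which is exactly the projection that separates the two labels.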





 
