Saturday, January 1, 2011

A Powerful Classification Technique in Data Mining - Discriminant Analysis (Part VI)

Computing scores and posterior probabilities using the linear discriminant function coefficients
t = subscript to distinguish the groups
n = number of quantitative variables
C = constant
L = n-by-1 vector of coefficients for the linear discriminant function
X = n-by-1 observation vector

The equation to obtain the score for Group t on a new observation X, using METHOD=NORMAL with POOL=YES (or POOL=TEST with an insignificant p-value), is:

$$ \mathrm{score}_t = C_t + L_t' X $$

For example, here are the linear discriminant function coefficients from the SAS documentation example (see http://support.sas.com/onlinedoc/913/getDoc/en/statug.hlp/discrim_sect29.htm):

             Linear Discriminant Function for Crop
   (coefficient table not reproduced here; see the SAS documentation linked above)


For example, say a new set of values for x1 through x4 is (16 30 35 31). To classify this observation into a group, you compute the posterior probability of the observation belonging to each group. The observation is classified into the group for which it has the largest posterior probability. To compute the probabilities, you first need to compute the score for each group.

For example, the score on the Clover group is

Clover_score = -10.98457 + 0.08907*16 + 0.17379*30 + 0.11899*35 + 0.15637*31
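
The same arithmetic can be checked in a DATA step. This is only a minimal sketch: the data set name clover_check and the variable names are illustrative choices, and the coefficients are the ones quoted in the equation above.

```sas
/* Sketch: evaluate the Clover score for the new observation (16 30 35 31). */
data clover_check;
  x1 = 16; x2 = 30; x3 = 35; x4 = 31;
  clover_score = -10.98457 + 0.08907*x1 + 0.17379*x2
                           + 0.11899*x3 + 0.15637*x4;
  put clover_score=;   /* writes the value to the SAS log */
run;
```

With these coefficients the score works out to about 4.66637. The scores for the other four groups would be computed the same way from their own columns of coefficients.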

PROC DISCRIM doesn't report the scores anywhere because their magnitudes are not interpretable on their own. The LIST and TESTLIST options print the posterior probabilities in the output; the OUT= and TESTOUT= options save them to a data set.

The posterior probabilities are computed as

$$ \Pr(t \mid X) = \frac{\exp(\mathrm{score}_t)}{\text{denominator}} $$

where denominator = exp(Clover_score) + exp(Corn_score) + exp(Cotton_score) + exp(Soybeans_score) + exp(Sugarbeets_score).
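
As a rough illustration of this step, the DATA step below turns five group scores into posterior probabilities. It is a sketch under assumptions: only the Clover score (about 4.66637, evaluated from the Clover equation above) comes from this example, while the other four scores are made-up placeholders, and all data set, array, and variable names are illustrative.

```sas
/* Sketch: posterior probabilities from five group scores.
   Only the Clover score is taken from the worked example;
   the other four values are placeholders, not real results. */
data posterior_check;
  array grp_score{5} clover corn cotton soybeans sugarbeets
                     (4.66637 4.0 4.5 3.5 3.0);
  array grp_post{5}  p_clover p_corn p_cotton p_soybeans p_sugarbeets;

  /* denominator = sum of the exponentiated scores */
  denominator = 0;
  do t = 1 to 5;
    denominator = denominator + exp(grp_score{t});
  end;

  /* posterior for each group = exp(score) / denominator */
  do t = 1 to 5;
    grp_post{t} = exp(grp_score{t}) / denominator;
  end;

  put p_clover= p_corn= p_cotton= p_soybeans= p_sugarbeets=;
run;
```

The five posterior probabilities sum to 1, and the observation is assigned to the group with the largest one.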

Computing scores and posterior probabilities using the quadratic discriminant function coefficients

t = subscript to distinguish the groups
n = number of quantitative variables
C = constant
L = n-by-1 vector of coefficients for the linear discriminant function
Q = n-by-n matrix of coefficients for the quadratic discriminant function
X = n-by-1 observation vector

The equation to obtain the score for Group t on a new observation X, using METHOD=NORMAL with POOL=NO (or POOL=TEST with a significant p-value), is:

$$ \mathrm{score}_t = C_t + L_t' X + X' Q_t X $$
  
For example, here is how to obtain the score for one group (see http://support.sas.com/onlinedoc/913/getDoc/en/statug.hlp/discrim_sect28.htm).

 

For example, suppose the new observation is

SepalLength = 80
SepalWidth  = 30
PetalLength = 25
PetalWidth  = 16                                   



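In PROC IML, the computation looks like this:

proc iml;
  reset print;                          /* print each matrix as it is created */
  const=-75.8208;                       /* C: constant for the group */
  linear = {0.737 1.325 0.623 0.966};   /* L, stored here as a 1-by-4 row vector */
  /* Q: 4-by-4 matrix of quadratic coefficients */
  quadratic = {-0.053  0.017  0.050 -0.009,
                0.017 -0.079 -0.006  0.042,
                0.050 -0.006 -0.067  0.014,
               -0.009  0.042  0.014 -0.097};
  obs = {80 30 25 16};                  /* X, stored here as a 1-by-4 row vector */
  /* score = C + L'X + X'QX; the backtick (`) is the transpose operator */
  score = const + linear*obs` + obs*quadratic*obs`;
quit;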

Note: if you don't have SAS/IML, you have to write out the equation by hand to compute the scores.
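
One possible way to write the equation out without IML is a DATA step with arrays, sketched below. The constant, linear coefficients, quadratic coefficients, and observation are copied from the PROC IML example above; the data set name quad_score and the array names are illustrative choices.

```sas
/* Sketch: score = C + L'X + X'QX computed with DATA step loops. */
data quad_score;
  array x{4}   x1-x4  (80 30 25 16);              /* observation X */
  array lin{4} l1-l4  (0.737 1.325 0.623 0.966);  /* linear coefficients L */
  /* quadratic coefficients Q, filled row by row */
  array q{4,4} q1-q16 (-0.053  0.017  0.050 -0.009
                        0.017 -0.079 -0.006  0.042
                        0.050 -0.006 -0.067  0.014
                       -0.009  0.042  0.014 -0.097);
  const = -75.8208;                               /* constant C */

  score = const;
  do i = 1 to 4;
    score = score + lin{i}*x{i};                  /* linear term L'X */
    do j = 1 to 4;
      score = score + x{i}*q{i,j}*x{j};           /* quadratic term X'QX */
    end;
  end;

  put score=;    /* writes the score to the SAS log */
  keep score;
run;
```

For these rounded coefficients the result is about -122.0068, which should match the value of score printed by the PROC IML version.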


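The posterior probability of x belonging to group t is then equal to

$$ \Pr(t \mid x) = \frac{\exp(-0.5\, D_t^2(x))}{\sum_u \exp(-0.5\, D_u^2(x))} $$

where D_u^2(x) is the generalized squared distance from x to group u. The discriminant scores are -0.5 D_u^2(x).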

So, to compute the three posterior probabilities for this new observation on the three groups Setosa, Versicolor, and Virginica (see the example at http://support.sas.com/onlinedoc/913/getDoc/en/statug.hlp/discrim_sect28.htm), first compute the score for each group as above. Each posterior probability is then obtained by dividing the exponentiated score by the sum of all three exponentiated scores.
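
This last step might be sketched in PROC IML as follows. The three scores are hypothetical placeholders, not values from the Iris example, and the matrix names are illustrative. Subtracting the largest score before exponentiating is an extra precaution that is not part of the description above: it leaves the probabilities unchanged but keeps exp() from underflowing when the scores are large negative numbers, which quadratic scores often are.

```sas
proc iml;
  /* hypothetical scores for Setosa, Versicolor, Virginica (placeholders) */
  score = {-25.0 -3.2 -4.1};

  /* shift so the largest score is 0; the ratios exp(s)/sum(exp(s)) are unchanged */
  shifted = score - max(score);

  /* exponentiate each score and divide by the sum of the exponentiated scores */
  posterior = exp(shifted) / sum(exp(shifted));

  print posterior[colname={"Setosa" "Versicolor" "Virginica"}];
quit;
```

The observation would be classified into the group whose column shows the largest posterior probability.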

Reference: