Saturday, November 6, 2010

A Powerful Classification Technique in Data Mining - Discriminant Analysis (Part I)

Classification is a data mining technique used to predict group membership for data instances. In predictive customer analytics, classification techniques are deployed frequently across most applications, including acquisition, cross-sell, attrition, credit scoring, collections and identifying first-time buyers. The objective of any classification model is to assign customers to two or more groups based on a predicted outcome associated with each customer, e.g. responder or non-responder, defaulter or non-defaulter, churner or non-churner, valuable or non-valuable customer. Businesses are interested in predicting the likelihood of each customer behaving in a particular fashion, and classification techniques provide them with predictive models for this purpose.


Both parametric and non-parametric methods are used to solve classification problems. Traditional statistical methods are parametric: they assume a particular form for the underlying distributions and estimate the parameters of those distributions to solve the problem. Non-parametric methods, on the other hand, make no assumptions about the specific distributions involved, and are therefore distribution-free.

Discriminant analysis is a technique for classifying a set of observations into two or more predefined classes. The purpose is to determine the class of an observation based on a set of variables known as predictors or input variables (analogous to independent variables in regression). The model is built from a set of observations for which the classes are known, sometimes referred to as the training set. Based on the training set, the technique constructs a set of linear functions of the predictors, known as discriminant functions, of the form

L = b1x1 + b2x2 + ... + bnxn + c, where the b's are the discriminant coefficients, the x's are the input variables or predictors, and c is a constant.

These discriminant functions are used to predict the class of a new observation whose class is unknown. For a k-class problem, k discriminant functions are constructed. Given a new observation, all k discriminant functions are evaluated and the observation is assigned to class i if the i-th discriminant function has the highest value.
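The assignment rule above can be sketched in a few lines of Python. This is only an illustration: the class labels, the coefficients and the constants below are made up for the example and do not come from any fitted model.

```python
# Hypothetical discriminant functions for a 3-class problem with two predictors.
# Each entry holds (b, c): the discriminant coefficients and the constant.
# All numbers are invented for illustration only.
DISCRIMINANT_FUNCTIONS = {
    "responder": ([0.8, -0.2], -1.0),
    "non_responder": ([0.1, 0.5], -0.5),
    "undecided": ([0.4, 0.1], -0.6),
}


def score(coefficients, constant, x):
    """Evaluate L = b1*x1 + ... + bn*xn + c for one observation x."""
    return sum(b * xi for b, xi in zip(coefficients, x)) + constant


def classify(x, functions=DISCRIMINANT_FUNCTIONS):
    """Evaluate every discriminant function and pick the class with the highest value."""
    scores = {label: score(b, c, x) for label, (b, c) in functions.items()}
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    label, scores = classify([3.0, 1.0])
    print(label, scores)
```

With k classes the dictionary simply holds k entries, one discriminant function per class, and the rule stays the same: evaluate all of them and take the largest.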

Discriminant Analysis (DA), a multivariate statistical technique, is commonly used to build a predictive or descriptive model of group discrimination based on observed predictor variables and to classify each observation into one of the groups. In DA, multiple quantitative attributes are used to discriminate a single classification variable. DA differs from cluster analysis in that prior knowledge of the classes, usually in the form of a sample from each class, is required.

The common objectives of DA are:

i. To investigate differences between groups;

ii. To discriminate groups effectively;

iii. To identify important discriminating variables;

iv. To perform hypothesis testing on the differences between the expected groupings;

v. To classify new customers into pre-existing groups.

Commonly used DA procedures available in the SAS system are:

DISCRIM: Computes various discriminant functions for classifying observations. Linear or quadratic discriminant functions can be used for data with approximately multivariate normal within-class distributions. Nonparametric methods can be used without making any assumptions about these distributions.
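To make the linear case concrete, here is a minimal Python sketch (not SAS code, and not a description of how DISCRIM is implemented) of how linear discriminant functions can be estimated from a training set. It assumes the classical textbook form: within-class distributions that are approximately multivariate normal with a common covariance matrix, estimated by pooling across classes, and prior probabilities proportional to class sizes. It is also restricted to two predictors so that the matrix inverse can be written by hand. The function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def fit_linear_discriminants(X, y):
    """Estimate (coefficients, constant) per class from a labelled training set.

    Assumptions (for illustration): exactly two predictors, a pooled
    within-class covariance matrix, and priors proportional to class sizes.
    """
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)

    n, k = len(X), len(groups)
    means = {g: [sum(col) / len(rows) for col in zip(*rows)]
             for g, rows in groups.items()}

    # Pooled within-class covariance matrix (2x2), with divisor n - k.
    s11 = s12 = s22 = 0.0
    for g, rows in groups.items():
        m1, m2 = means[g]
        for x1, x2 in rows:
            s11 += (x1 - m1) ** 2
            s12 += (x1 - m1) * (x2 - m2)
            s22 += (x2 - m2) ** 2
    s11, s12, s22 = s11 / (n - k), s12 / (n - k), s22 / (n - k)

    # Inverse of the 2x2 pooled covariance matrix.
    det = s11 * s22 - s12 * s12
    inv = [[s22 / det, -s12 / det], [-s12 / det, s11 / det]]

    functions = {}
    for g, (m1, m2) in means.items():
        # Coefficients b = S^-1 * mean; constant c = -0.5 * mean' S^-1 mean + ln(prior).
        b = [inv[0][0] * m1 + inv[0][1] * m2, inv[1][0] * m1 + inv[1][1] * m2]
        prior = len(groups[g]) / n
        c = -0.5 * (b[0] * m1 + b[1] * m2) + math.log(prior)
        functions[g] = (b, c)
    return functions


def predict(functions, x):
    """Assign x to the class whose discriminant function has the highest value."""
    scores = {g: sum(bi * xi for bi, xi in zip(b, x)) + c
              for g, (b, c) in functions.items()}
    return max(scores, key=scores.get)


# Made-up toy training set: two predictors, two known classes.
X_TRAIN = [[1, 2], [2, 1], [2, 3], [3, 2], [6, 6], [7, 5], [7, 7], [8, 6]]
Y_TRAIN = ["non_defaulter"] * 4 + ["defaulter"] * 4

if __name__ == "__main__":
    model = fit_linear_discriminants(X_TRAIN, Y_TRAIN)
    print(model)
    print(predict(model, [2, 2]), predict(model, [7, 6]))
```

Each fitted pair (b, c) is one discriminant function of the form L = b1x1 + b2x2 + c, so classification of a new observation again reduces to evaluating every function and choosing the largest value.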

CANDISC: Performs a canonical analysis to find linear combinations of the quantitative variables that best summarize the differences among the classes.

STEPDISC: Uses forward selection, backward elimination, or stepwise selection to try to find a subset of quantitative variables that best reveals differences among the classes.

Reference:
http://www2.sas.com/proceedings/sugi27/p247-27.pdf
