Friday, July 23, 2010

Some useful information about Proc sort in SAS

Proc sort in SAS generally used to remove the unnecessary duplicate records from the datasets. Proc SORT sorts data in order that further analysis can be performed, such as "BY" variable processing.


Difference between NODUP and NODUPKEY options

The NODUP (or NODUPRECS) option checks for and eliminates duplicate observations. If one specifies this option, PROC SORT compares all variable values for each observation to those for the previous observation that was written to the output data set. If an exact match is found, the observation is not written to the output data set.

The NODUPKEY option checks for and eliminates observations with duplicate BY variable values. If you specify this option, PROC SORT compares all BY variable values for each observation to those for the previous observation written to the output data set. If an exact match using the BY variable values is found, the observation is not written to the output data set. One can specify multiple BY variables in either ascending or descending order.

For more details, please refer to the following link:

http://www2.sas.com/proceedings/sugi30/037-30.pdf

DUPOUT Option in Proc sort

The data is unfamiliar; the exploration requires searching for duplicates. SAS version 9 contains a new SORT procedure option named DUPOUT= that puts all deleted duplicate observations into a data set.



Here is the syntax:

proc sort data = dataset out = out_dataset dupout = DUP_dataset nodupkey ;

by cust_id;

run;

For more details, please refer to the following link:

http://www2.sas.com/proceedings/sugi31/164-31.pdf

1 comment:

  1. Two more very important tips to use Proc SORT:
    1. In most of the cases we need only very few variables to be sorted or to have unique. In that case use "Keep" statement in Data statement. NOT in out statement. This optimizes the code.
    2. You can also use "where" statement to sort a subset of the data set. Again it saves another step of sub-setting a data set when its a large file.
    3. Combining 1 and 2 in most cases gives desired result in much lesser steps/time.

    ReplyDelete