Analytics Knowledge Sharing Forum: July 2010

Thursday, July 29, 2010

Some useful tips on proc sql

PROC SQL (Structured Query Language) is a powerful data analysis tool. It can perform many of the same operations as found in traditional SAS code, but can often be more efficient because of its dense language structure.

JOINING THREE OR MORE TABLES

PROC SQL can be an effective tool for joining data, particularly when doing associative or three-way joins. Performing this operation using traditional SAS code would require several PROC SORTs and several DATA step merges. The same result can be achieved with one PROC SQL. For example, we need to join three tables viz. CUSTOMER, SALES and RETURN and where return > 10.

The syntax for getting the table:

PROC SQL; create table customer_data as
SELECT B.Cust_F_name, B.Cust_L_Name, B.return,
E.id, F.STATE, F.Sales
FROM Return as B, customer as E, SALES as F
WHERE B.Cust_F_name=E.Cust_F_name AND
B.Cust_L_name=E.Cust_L_name AND
E.id=F.id AND
return > 10;
QUIT;

How to use drop/keep in proc sql:

One can use drop and keep within proc sql. Here is the syntax:

proc sql;
create table Scores1 as
select *
from customer_data(drop=Cust_L_name Cust_F_name);
quit;

Proc sql can be used for creating macro value list also. For more details please refer the following link:

http://www2.sas.com/proceedings/sugi29/042-29.pdf

Tuesday, July 27, 2010

Effective use of Retain in SAS

1. Holding values of the variables across iterations: The retain statement is used to hold the values of variables across iterations of the data step. Normally, all variables in the data step are set to missing at the start of each of the iteration of the data step. The use of “retain” x y retains the values of the variables x and y across data step iterations. The use of “retain” will retain the values of all variables used in the data step across iterations of the data step.

For example, if we would like to compute values of y(n)=2*y(n-1) with y(1)=1.

data one;

if _n_=1 then y=1;

else y=2*y1;

y1=y;

if _n_ =100 then stop;

run;

Since values are set to missing at the start of data step iteration, the data set “one” will contain one value 1 and the other 99 values of y will be missing.

The following program produces the desired data set.

data two;

retain y1;

if _n_=1 then y=1;

else y=2*y1;

y1=y;

if _n_ =100 then stop;

run;

Reference: http://javeeh.net/sasintro/intro84.html

2. Create time interval using Retain Statement: The RETAIN statement causes a variable to retain its value from one iteration of the DATA step to the next. RETAIN is useful when calculating these time intervals between visits because each unique visit is in different records within the same data set. For more details please check the following link:

http://www2.sas.com/proceedings/sugi25/25/cc/25p100.pdf

3. To Create count/order variables: A counter variable can be created to identify the sequential number of the visits. Here is the syntax:

data aaa; set bbb;

Lpurch = lag(purchase_id);

Ldt = lag(purch_dt);

if first. cust_id then Lpurch = .;

if first. cust_id then Ldt = .;

run;

proc sort data=aaa out= xxx;by cust_id purch_dt purchase_id;

run;

data xxx1; set xxx; by cust_id purch_dt purchase_id;

retain cnt 0;

if first. cust_id then cnt=1;

else if (purch_dt=Ldt and purchase_id=Lpurch) then cnt=cnt;else cnt=cnt+1;

run;

For more details, please visit the following link:

http://www.wuss.org/proceedings07/Posters/POS_Worden_DatumToRemember.pdf

4. To have all the variables in the dataset in a particular order(Re-ordering variables): Any Statement that lists the variables in the desired order before any other Statement will reorder the variables in the newly created Dataset. The most common are the Retain, Length, Attrib, Label, and Format Statements. Retain statement is considered the safest to use. The reason for this is all variables coming from an input Dataset are automatically Retained. As such using a Retain Statement to reorder variables in a Dataset has no unintended side effect. All other Statements require the programmer to specify some attribute of each variable. Here is the syntax:

data high_perf_model_score_1;

retain TITLE_CODE DATA_TYPE_NAME Customer_ID MODEL_NAME MODEL_RUN_DATE

MODEL_LEVEL_CODE SCORE RANK;

set high_perf_model_score;

run;

If we need to use the file as a input for some automated process, that time the order of the variables are very important. Retain Statement is very helpful in this situation.

For more details, please check the link:

http://www.sascommunity.org/wiki/Re-ordering_variables

Saturday, July 24, 2010

Some more details on Hash Table join:

Hash Table Join is very useful and effective though quite complicated.

http://www2.sas.com/proceedings/sugi30/236-30.pdf

Some error messages might appear while joining two datasets using Hash Table. It is quite well explained in support.sas.com. Please check the link below:

http://support.sas.com/kb/39/621.html

Friday, July 23, 2010

Some useful information about Proc sort in SAS

Proc sort in SAS generally used to remove the unnecessary duplicate records from the datasets. Proc SORT sorts data in order that further analysis can be performed, such as "BY" variable processing.

Difference between NODUP and NODUPKEY options

The NODUP (or NODUPRECS) option checks for and eliminates duplicate observations. If one specifies this option, PROC SORT compares all variable values for each observation to those for the previous observation that was written to the output data set. If an exact match is found, the observation is not written to the output data set.

The NODUPKEY option checks for and eliminates observations with duplicate BY variable values. If you specify this option, PROC SORT compares all BY variable values for each observation to those for the previous observation written to the output data set. If an exact match using the BY variable values is found, the observation is not written to the output data set. One can specify multiple BY variables in either ascending or descending order.

For more details, please refer to the following link:

http://www2.sas.com/proceedings/sugi30/037-30.pdf

DUPOUT Option in Proc sort

The data is unfamiliar; the exploration requires searching for duplicates. SAS version 9 contains a new SORT procedure option named DUPOUT= that puts all deleted duplicate observations into a data set.

Here is the syntax:

proc sort data = dataset out = out_dataset dupout = DUP_dataset nodupkey ;

by cust_id;

run;

For more details, please refer to the following link:

http://www2.sas.com/proceedings/sugi31/164-31.pdf

Thursday, July 22, 2010

Different ways of Merging/Joining two datasets

There are different ways of merging/joining two datasets like:
1. Using data merge Statements
a. It is necessary to sort the datasets before merging and sorting used to take longer time incase datasets are big
b. It is necessary to have the joining variable/s in the same name in two datasets.
2. Using proc sql
a. There is no need to sort datasets and even the joining variable can have different names in two datasets.
b. Use huge amount of work space and creates problem when the dataset/s are big and there is work space limitation.
3. Merge Using proc format
a. This is a really quick way of merging. Sort/Merge is used here when key values from one file are needed to extract records from another file containing the same key, or BY variable.
4. For Syntax, more details and more ways of merging you can refer the following links:
http://www2.sas.com/proceedings/sugi30/054-30.pdf
http://www.nesug.org/Proceedings/nesug09/po/po13.pdf