SAS
Introduction¶
A SAS node allows the integration of a batch SAS program with other VOR Stream nodes. Like other VOR Stream nodes, a SAS node takes input streams of data and produces streams of data. If the data being fed to the SAS program is partitionable, then the SAS node can be partitioned in three ways:
1. Break the input data into pieces and process the pieces sequentially. This allows subsequent nodes to begin working on the output while the SAS node processes the next partitions.
2. Break the input data into pieces and process them in parallel. This allows for the use of multiple threads or multiple machines. As individual partitions are completed, the results are passed to subsequent nodes without waiting for other threads to complete.
3. A combination of 1 and 2 can be used.
Partitioning helps a SAS node behave less sequentially and improves parallel processing. Other nodes downstream from a SAS node can work on completed partitions in parallel with the SAS node.
Partition Requirements¶
To be partitionable, the SAS program must meet the following criteria:
- The SAS program cannot require all the input data to be present at once. This applies only to the partitioned data and not to the other input data. This criterion is necessary for both 1 and 2 above.
- For parallel processing (2), intermediate and final data must be written to and read from the WORK library. Any program that locks data that another process needs to read or write precludes parallel processing. A minimal sketch of a partition-friendly program is shown after this list.
- The order of the result data must not be required. This is only a restriction for parallel processing.
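As a minimal sketch of a partition-friendly program (assuming the data set names a and outds from the stream file example in the next section, and a hypothetical input variable named value), everything is read from and written to the WORK library:

/* Partition-friendly sketch: read the partitioned input from WORK and  */
/* write the result back to WORK. No permanent libraries are locked, so */
/* multiple SAS sessions can run this program in parallel.              */
data work.outds;
  set work.a;               /* partitioned input delivered by the node */
  adjusted = value * 1.05;  /* placeholder computation on a hypothetical variable */
run;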
Creating SAS Nodes¶
SAS nodes are created through Stream Files. The stream file requires the typical process name, input node, and output node declarations, in addition to the sas node declaration. The SAS node sections are detailed in the Process - SAS Statements section.
An example of a SAS stream file is:
name sas
in input.csv -> input
// This is a SAS Node
sas input -> (ds = a part=10)
(ds=outds part) -> outputsas
sasWork = ( "/tmp/saswork", "/mount2/saswork" )
sasFile= "part.sas"
getdyn = fact1, fact2
getsig = signal1
scenariods = work.scen
framework = PD
name=saspart
out outputsas -> ans.csv
where
- ds - Source or destination SAS data set. Multiple source and destination data sets can be specified. Two-level SAS names are allowed, but the use of the WORK libref is recommended.
- part - Designates an input or output partitioned data set. For input data sets, only one can be partitioned. The optional number specified with the partition indicates how many observations to include in each partition. For output data sets, the partition option designates that the output should be appended to the other partitions.
- saswork - Optional list of locations for SAS work. Each thread will be assigned a work directory in a round-robin fashion.
- sasfile - The SAS program to run. The sasFile option can have an explicit path to the SAS program or a path relative to the playpen directory.
- getdyn - Optional list of dynamic facts the SAS node needs. The node will wait for the facts to be available before running. The facts will be assigned to macro variables of the same name.
- getsig - Optional list of signals that the SAS node waits on. The node will wait for the signals to be sent before running. This can be used to synchronize the SAS node with other nodes.
- scenariods - Optional name of the data set in which to put the current scenario set. The scenario data set will have a scenario variable (the scenario name) and a date variable, as well as all of the provided risk factors. Some of the risk factor names could have embedded spaces; these variables should be accessed as n-literals in SAS, as shown in the sketch after this list.
- framework - Optional name of the modeling framework to use.
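As referenced above, here is a hedged sketch of how a SAS program might use the getdyn macro variables and the scenario data set from the example stream file; the risk factor name 'USD 3M Rate' and the computation are illustrative only:

options validvarname=any;  /* allow variable names with embedded spaces */

/* Dynamic facts from getdyn arrive as macro variables of the same name */
%put NOTE: fact1=&fact1 fact2=&fact2;

/* Risk factor names in the scenario data set may contain spaces, */
/* so they are referenced as n-literals.                          */
data work.shocked;
  set work.scen;                        /* scenariods = work.scen in the stream file */
  shocked_rate = 'USD 3M Rate'n * 1.1;  /* hypothetical risk factor */
run;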
The SAS command is obtained from the SASCMD environment variable:
export SASCMD="/opt/sas/install/SASFoundation/9.4/sas -config /opt/sas/install/SASFoundation/9.4/sasv9.cfg"
The SASCMD value will need to be adjusted depending on where SAS is installed.
If there is a static source for input data, the static data can be read directly by the SAS program as long as it is opened read-only. The static input should not be listed as one of the input queues.
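A hedged sketch of reading a static source directly from the SAS program; the library path, static data set, and join key are hypothetical:

/* Open the static library read-only so no locks interfere with other processes */
libname static "/data/static" access=readonly;  /* hypothetical path */

proc sql;
  create table work.enriched as
  select a.*, r.rate
  from work.a as a             /* partitioned input from the node */
  left join static.rates as r  /* hypothetical static data set    */
    on a.rateid = r.rateid;    /* hypothetical join key           */
quit;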
Only one of the input queues can be partitioned. Multiple output queues can be marked as partitioned.
When the SAS node completes, it will automatically send a signal that is the name of the node. In the example above, the node will send the signal saspart when it completes.
Partitioning Input Data¶
The part option specifies that an input data set should be partitioned. If a number is specified with the part= option, it sets the observations per thread, that is, the number of observations read from the input queue before launching a SAS session to process those observations. The default value for observations per thread is 10,000.
The input data set can also be partitioned by a group key. Specify a group key on the partitioned input queue. Each partition will then be the size of one outer group key group. This guarantees that all of a specific group is included in a single partition sent to SAS. Partitioning in this way has some restrictions on the size of the groups: VOR Stream keeps all observations from the same group in a single message, which limits the partition size to at most around a thousand observations. Note that VOR Stream handles BY processing differently than SAS, as the by groups (group key) in VOR Stream do not have to be sorted.
sas Null -> (ds = a)
(ds=outds) -> output
sasCMD = "/opt/app/sas/SASFoundation/9.4/sas -config /opt/app/sas/SASFoundation/9.4/sasv9.cfg"
sasFile= "/home/sasuser/src/sasfile.sas"
name=sastest
Automatic SAS Macro variables¶
There are several macro variables that are made available for use in your SAS program; a usage sketch follows the list:
- threadNum - An integer, starting at zero, that indicates the thread number.
- partNum - An integer, starting at 1, that indicates the partition number.
- jobID - An integer-valued job ID.
- rootDir - The path to the temporary directory used to run the SAS node.
- VOR_STREAM_PLAY - The path to the playpen directory.
- VOR_STREAM_OUTPUT_PATH - The path to the output directory.
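As noted above, a hedged usage sketch; the thread and part variables added in the data step are illustrative:

%put NOTE: running partition &partNum on thread &threadNum for job &jobID;
%put NOTE: rootDir=&rootDir playpen=&VOR_STREAM_PLAY output=&VOR_STREAM_OUTPUT_PATH;

data work.outds;
  set work.a;
  thread = &threadNum;  /* record which thread produced the row */
  part   = &partNum;    /* record the partition number          */
run;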
Process Options¶
Process options, if they exist, will show up in a data set called work.processoptions. The format of the data set is:
| category | subcategory | name | value |
|---|---|---|---|
| reporting | format | page | landscape |
| scenario | info | asofdate | 2022-06-01 |
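A short sketch of pulling one of these options into a macro variable; the option looked up follows the sample table above:

/* Look up the as-of date from the process options and store it in a macro variable */
data _null_;
  set work.processoptions;
  where category = 'scenario' and subcategory = 'info' and name = 'asofdate';
  call symputx('asofdate', value);
run;

%put NOTE: asofdate=&asofdate;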
Debugging a SAS Node¶
A directory containing the intermediate files is created in /tmp/<name of node>-<userID>-<internaljobID>. Perform the following export before running to turn off automatic removal of the files:
export LOG_LEVEL=debug
In that directory, there will be a SAS file called <node name>.sas. This is the program that is run for each thread, and it is safe to run it on its own to investigate problems. The SAS log and listing file for the node will be in the playpen/output/jobName/logs directory.