Processes¶
Introduction¶
Processes are made of either nodes or sub-processes (more processes). This allows for the expression of a complex process that can be visualized at different levels of detail. Defining a process involves specifying the process name, its children (either nodes or sub-processes), and a description. The description is optional and only appears in the UI view of the process and the documentation.
Processes are created using stream files.
Using Stream Files¶
With a stream file, you can:
-
Provide a name and description of the process
-
Automatically create nodes declared in the process
-
Add comments to the process
General Format¶
Stream files accept single-line comments starting with // and multiline comments beginning with /* and ending with */. All statements and option keywords are case-insensitive:
Node == noDe
Statements can flow over multiple lines.
The name of the process is provided with the NAME statement. The NAME statement is required. The DESCR statement is optional and provides a description for the documentation and the UI.
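The comment and case rules can be pictured with a small Python sketch. This is not VOR Stream's parser: the names COMMENT, strip_comments, and keyword are hypothetical, multiline comments are assumed to be written /* ... */, and the regular expression is naive in that it does not recognize comment markers inside quoted strings.

```python
import re

# Matches a /* ... */ comment (possibly spanning lines) or a // comment to end of line.
COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def strip_comments(stream_text: str) -> str:
    """Remove both comment styles from stream-file text (naive: ignores quoting)."""
    return COMMENT.sub("", stream_text)


def keyword(statement: str) -> str:
    """Return the leading keyword in upper case, so NAME, name and nAmE compare equal."""
    return statement.split()[0].upper()


text = """// My first process
NAME firstprocess /* the process name
                     is required */
Node usernode(input)(output)
"""

# Keep only the lines that still contain text once comments are removed.
lines = [line for line in strip_comments(text).splitlines() if line.strip()]
print([keyword(line) for line in lines])
```

Upper-casing the keyword before comparing is one simple way to treat Node and noDe as the same statement.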
Examples¶
An example of a simple stream file was presented in the Getting Started section:
// My first process
name firstprocess
// read from the first.csv file
in first.csv -> input
// create a computational node
node usernode(input)(output)
// write out the results
out output -> output.csv
This stream file creates a process called firstprocess and three nodes:
-
An input node that reads from a file called first.csv. The name of the node is automatically generated.
-
A computational node named usernode. This node has input from a queue named input and puts the results into a queue named output.
-
An output node that writes the contents of output to output.csv.
Creating the process from this file is done by running the vor create process command:
vor create process <stream file>
A more verbose form of the same file is the following:
// My first process
name firstprocess
descr "This is my first process"
// read from the first.csv file
in first.csv -> input name=inputnode descr= "My input node"
// create a computational node
node usernode(input)(output) descr= "My computational node"
// write out the results
out output -> output.csv name=outputnode descr= "My output node"
The verbose form adds a description for the process and nodes as well as node names for the input and output nodes.
Statements¶
IN and OUT Statements¶
IN and OUT statements are used to define I/O nodes in a process. The syntax for an IN statement is:
( IN | INPUT ) (csvfile | databaseTable) -> queueName [ select="SELECT statement"] [ name= nodeName] [descr="quoted description"][db= PG | MSSQL | CSV | SAS]
When reading a database table, the default database type is PG (PostgreSQL).
The following are examples of IN statements:
in fact.csv -> fact // name and description are optional
in input.csv -> input name=inputnode descr= "My input node"
in schema.table -> fact select="select * from schema.table"
The syntax for the OUT statement is:
( OUT | OUTPUT ) queueName -> (csvfile | databaseTable) [compress] [name=nodeName] [descr="quoted description"] [db= PG | MSSQL]
The following are examples of OUT statements:
out output -> output.csv name=outputnode descr= "My output node"
output agg_fact_dsn -> agg_fact_out.csv compress
out fact -> schema.out
SQL Statements¶
SQL statements are used to define SQL nodes in a process. The syntax for a SQL statement is:
SQL SELECT statement ; [name=nodeName] [descr="quoted description"] [predict=predictValue] [ (setdyn | setfact) = comma-separated list of dynamic facts] [ (getdyn | getfact) = comma-separated list of dynamic facts] [minimize= memory | time ] [syntax_version= 1 | 2]
SQL Options:
name=nodeName --> Specifies the name of the node. This is optional.
descr="quoted description" --> Specifies a long form description of the node's purpose or value.
predict=predictValue --> Specifies the prediction queue for the node.
setdyn | setfact=comma-separated list of dynamic facts --> Declares a dynamic fact that will be set in the SQL node. The node must still set the dynamic fact using the DynFactSet() function.
getdyn | getfact=comma-separated list of dynamic facts --> Declares a dynamic fact that will be consumed in the SQL node. The node must still get the dynamic fact using the DynFactGet() function.
minimize=memory | time --> Specifies whether the node should minimize memory or time. The default is time.
syntax_version=1 | 2 --> Specifies the version of the SQL syntax to use. The default is 2.
The following are examples of SQL statements:
sql select *
from fact into agg_fact_dsn
group by city_nm,
scenario_nm,
rpt_hierarchy_nm_12,
region_nm,
product_nm,
productcategory_nm,
portfolio_nm,
instrumenttype_nm,
country_nm,
scenariotype_nm,
stateprovince_nm,
productline_nm,
date;
name=fact_agg_node
descr="aggregate fact table by all the class vars"
sql select *
from types
left join
typesin using (date) into typesout;
SAS Statements¶
SAS statements are used to define SAS nodes in a process. The syntax for a SAS statement is:
SAS queueName1 -> (data set info) queuename2 -> (data set info) (data set info) -> outQueue
SASCMD="command to execute SAS"
SASFILE="Full path to SAS file to run" [name=nodeName] [descr="quoted description"]
Where data set info is:
DS | DSN | DATASET = WORK.memname PART | PARTITIONED [= number of obs per thread]
The SASCMD only needs to be set on the first SAS node. The remaining SAS nodes will inherit this option.
Here is an example of a SAS statement:
sas input -> ( ds=work.name part=10) creditport -> ( ds=work.b )
(ds=work.ans part) -> output
sascmd="/opt/app/sas/SASFoundation/9.4/sas -config /opt/app/sas/SASFoundation/9.4/sasv9.cfg"
sasfile="/home/user/playpen/src/saspgm.sas"
name= testsas2 descr="Test running a sas process"
Node Statements¶
NODE statements are used to declare computational nodes. These can be Python or Golang nodes. The syntax for the NODE statement is:
NODE nodeName(inputQueue1, inputQueue2, … )(outputQueue1, … outputQueueN) NodeOptions
Or
NODE nodeName(inputQueue=<namedParameter>)(outputQueue1=<namedParameter>, … outputQueueN=<namedParameter>) NodeOptions
Where NodeOptions are:
-
Lang= --> Specifies the programming language the node should be written in.
-
Descr= --> Specifies a long form description of the node's purpose or value.
-
Setsig= --> Declares a signal name that will be set in the computational node. The node must still send the signal using the SendSignal() function.
-
Getsig= --> Declares a signal name that will be consumed in the computational node. The node must still wait for the signal using the WaitSignals() function.
-
Setfact= | setdyn= --> Declares a dynamic fact that will be set in the computational node. The node must still set the dynamic fact using the DynFactSet() function.
-
Getfact= | getdyn= --> Declares a dynamic fact that will be consumed in the computational node. The node must still get the dynamic fact using the DynFactGet() function.
Examples of node statements are as follows:
node usernode(input)(output)
node mynode(inqueue)(queue1,queue2)
node otherNode(q1)(q2,q3) descr="Python node"
lang=Python getfact=InMat
Process Statements¶
The PROCESS or SUBPROCESS statement is used to declare a subprocess within a process. To declare a subprocess, the process it is referring to must exist. Unlike the command line syntax for declaring a process, you can declare both nodes and subprocesses at the same time. The syntax for declaring a subprocess is:
SUBPROCESS | PROCESS process-name [descr=<"description">] [ { <process declaration inline> }]
Or
SUBPROCESS | PROCESS process-name(input-1<=input-parm-1> <,input-n<=input-parm-n>>)(output-1<=output-parm-1> <,output-n<=output-parm-n>>) [descr=<"description">] [ { <process declaration inline> }]
The specification of input and output queues, signals, and dynamic facts is optional unless you are using named parameters. Note that, unlike other statements, the SUBPROCESS statement does not create the process unless the process is explicitly created inline.
If you don't use named parameters, the queue names listed in the parentheses are for documentation purposes only. Queues can be both input and output for a subprocess. You are allowed to:
- Not specify queue names.
- Specify queue names in both the first and second set of parentheses.
- Specify queue names in only the first set of parentheses.
Mapping Options¶
All processes can be run independently of their parent processes. Usually, the parent processes provide some of the inputs, queues, signals, and dynamic facts, and consume some of the output queues from the process. To satisfy these edge connections, you can wrap your process in a driver process that provides the missing connections. The run process will inform you of missing inputs either when you start the process or when you halt the process.
Other VOR Process Commands¶
To delete a process, use the vor delete process command:
vor delete process <process-name...>
The command vor show process is used to print all processes created in the playpen in an order that would allow you to recreate the processes in one pass. To generate the script for a single process, add the <process-name> to the command.
Running Processes¶
To run a process, use the vor run command. The process name is required. Optionally, you can specify a unique name for the running job instance and its results; if you do not, the process name is used by default.
vor run <process name> [-n <job instance name>]
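When runs are launched from a script, the command line can be assembled in Python before it is executed. The sketch below uses only the vor run form shown above; the function names vor_run_command and launch are hypothetical, and nothing is executed unless launch is called on a machine where vor is installed.

```python
import subprocess
from typing import List, Optional


def vor_run_command(process_name: str, job_name: Optional[str] = None) -> List[str]:
    """Build the argument list for: vor run <process name> [-n <job instance name>]."""
    command = ["vor", "run", process_name]
    if job_name is not None:
        # -n gives the running job instance (and its results) a unique name.
        command += ["-n", job_name]
    return command


def launch(process_name: str, job_name: Optional[str] = None) -> None:
    """Run the job and raise CalledProcessError if vor exits with a non-zero status."""
    subprocess.run(vor_run_command(process_name, job_name), check=True)


print(vor_run_command("firstprocess"))
print(vor_run_command("firstprocess", "nightly"))
```

Passing the arguments as a list avoids shell quoting problems when a job instance name contains spaces.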
Several checks are performed before a job starts running:
- The queues in a process must have at least one input and one output node.
- The graph of the nodes cannot contain a cycle.
- The nodes themselves are compiled to check for syntax errors.
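The first two checks can be sketched in Python. This is not VOR Stream's implementation: the dictionary model of a process and the names dangling_queues and has_cycle are hypothetical, and the first check is read here as "every queue has at least one node writing to it and at least one node reading from it".

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical model of a process: node name -> (input queues, output queues).
first_process = {
    "inputnode": ([], ["input"]),
    "usernode": (["input"], ["output"]),
    "outputnode": (["output"], []),
}


def dangling_queues(process):
    """Queues that are written but never read, or read but never written."""
    written = {q for _, outputs in process.values() for q in outputs}
    read = {q for inputs, _ in process.values() for q in inputs}
    return written ^ read


def has_cycle(process):
    """True if following queues from producer to consumer ever loops back."""
    producers = {}
    for node, (_, outputs) in process.items():
        for queue in outputs:
            producers.setdefault(queue, set()).add(node)
    # Each node depends on every node that writes one of its input queues.
    depends_on = {
        node: {p for queue in inputs for p in producers.get(queue, ())}
        for node, (inputs, _) in process.items()
    }
    try:
        TopologicalSorter(depends_on).prepare()
    except CycleError:
        return True
    return False


print(dangling_queues(first_process), has_cycle(first_process))
```

The sketch relies on graphlib from the Python standard library (Python 3.9 or later), whose prepare() method raises CycleError when the dependency graph is not acyclic.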
To stop a running job, run the vor stop command:
vor stop <process name>
Alternatively, press Ctrl+C in the terminal where vor run is running.
By default, all the output goes into the <playpen>/output/<process name> directory. If the usedate option is set to true (see the System Options section), the output goes into the directory <playpen>/output/<process name>YYYY-MM-DD hh:mm:tt. You can specify an alternate location using the joboptions.json file in the input directory. The configurations used to run the job are saved in the output directory along with the logs of the run.
See vor run for command line options.
Reusing Processes¶
A common programming paradigm is reusability. Processes can be reused in other processes simply by including the process as a subprocess in a new process. Reusing a process in this way requires the use of the same queue names in the new process.
Processes and nodes can be viewed as functional components. The arguments to the process “function” would be the queues, signals, and dynamic facts consumed and produced by the function. VOR Stream allows for named parameters on processes and nodes.
process-name|node-name(parameter-name-1=parameter-value-1<...,parameter-name-N=parameter-value-N>)
The named parameter assignment is specified at process instantiation time. Only queues, signals, and dynamic facts are supported as named parameters. A node or a process with input or output queue names that match its input or output dynamic fact names cannot use named parameters.
Given a node, chain, that has an input queue named input and an output queue named output, the following are examples of valid syntax using this node in a process:
// Chain nodes together using named parameters
name subchain2
in input.csv -> input
node chain(input=input)(output=queue1) descr="Test using a node more than once in a process"
node chain(input=queue1)(output=queue2)
node chain(input=queue2)(output=output)
output output -> output.csv
The first instance of the node chain assigns the queue input to the input queue input and assigns queue1 to the output queue output. Similarly, the second instance of the node chain assigns the queue queue1 to the input queue input and assigns queue2 to the output queue output.
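The binding described above can be traced with a short Python sketch. It is only an illustration of the mapping, not how VOR Stream resolves parameters: the function name bind is hypothetical, and the fallback to the node's own queue name when no named parameter is given is an assumption.

```python
def bind(formal_queues, named_parameters):
    """Map each queue name used inside the node to the process queue bound to it."""
    # Assumption: a queue without a named parameter keeps its own name.
    return {name: named_parameters.get(name, name) for name in formal_queues}


# The three instances of chain in subchain2, as (input bindings, output bindings).
instances = [
    ({"input": "input"}, {"output": "queue1"}),
    ({"input": "queue1"}, {"output": "queue2"}),
    ({"input": "queue2"}, {"output": "output"}),
]

# For each instance, look up the process queue behind the node's input and output.
wiring = [
    (bind(["input"], ins)["input"], bind(["output"], outs)["output"])
    for ins, outs in instances
]

for source, target in wiring:
    print(f"chain reads {source} and writes {target}")
```

Each instance writes the queue that the next instance reads, which is what links the three copies of chain into a pipeline.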
Note that queue1 and queue2 are not required to have all matching fields. Only the fields that match input will be seen and consumed by the algorithms in chain. Similarly, only matching fields in queue2 and output will be seen by the consuming node. Furthermore, if queue1 and queue2 aren't declared in tables.csv, they will assume the form of the input and output queues.
Constant Dynamic Facts¶
To help identify programmatically which instance of a subprocess is running, constant dynamic facts can be used. A constant dynamic fact is a fact that is assigned a quoted string in the named parameter list. Consider the following example subprocess:
name loop
sql select "${dyn.instance}" as instance from ininst into outinst;
name=instance getdyn=instance
This is a simple SQL node that assigns the value of the dynamic fact instance to a string field named instance. This process can now be used in another process as follows:
name useloop
in ininst.csv -> ininst
subprocess loop(ininst, instance="instance1")(outinst)
subprocess loop(ininst, instance="instance2")(outinst)
subprocess loop(ininst, instance="instance3")(outinst)
out outinst -> outinst.csv
Now the data coming from each instance of loop is identified by the value of instance in the output data.
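One way to picture the effect is to treat the ${dyn.instance} reference as a placeholder that is filled with the constant supplied by each subprocess instance. The Python sketch below mimics that with plain text substitution; whether VOR Stream resolves dynamic facts this way is an assumption, and the function name expand_dyn is hypothetical.

```python
import re


def expand_dyn(sql: str, facts: dict) -> str:
    """Replace each ${dyn.<name>} placeholder with the value of that fact."""
    return re.sub(r"\$\{dyn\.(\w+)\}", lambda m: facts[m.group(1)], sql)


statement = 'select "${dyn.instance}" as instance from ininst into outinst;'

# One expansion per subprocess instance declared in useloop.
for value in ("instance1", "instance2", "instance3"):
    print(expand_dyn(statement, {"instance": value}))
```

Each expanded statement carries a different literal, so rows written by the three instances can be told apart by the instance field.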