
Data Sources

Introduction

Data is the core of a VOR Stream process. Like all VOR Stream components, data is strongly typed, so the data used by a process must be described to the system in detail. In VOR Stream, data sources include:

  • Streaming sources: CSV, S3, PostgreSQL, MS SQL, SAS dataset
  • XLSX
  • Scenario
  • Simulation
  • Matrix Map
  • Risk Factor Curves
  • Job Options

Streaming Data

Input / Output CSV Files

The first line in a CSV file is a comma-delimited list of field names. The following is an example of an input/output CSV file:

id,class1,class2,value
instid_1,Chicago,Illinois,86229
instid_2,Chicago,Illinois,14461
instid_3,Chicago,Illinois,161430
instid_4,New York,New York,118647
instid_5,Raleigh,North Carolina,11596
instid_6,Los Angeles,California,102604
instid_7,Seattle,Washington,151177
instid_8,Chicago,Illinois,49967
instid_9,Los Angeles,California,70632
instid_10,Seattle,Washington,128297

Input file fields are based on the fields listed in the tables.csv file. Input CSV files are located in the root of the playpen input directory or any of its subdirectories.

Output CSV files are put in a subdirectory of the playpen output directory. The order of the fields for an output CSV file is determined by the variable declaration order for the specific table in the tables.csv file.
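The read/write conventions above can be sketched in Python with the standard csv module. The sample data is the example file from above; in a real playpen, the field names and their output order would come from tables.csv rather than being hard-coded as they are here.

```python
import csv
import io

# The example input file from above, inlined as a string for illustration;
# in a real playpen this would live under the input directory.
sample = """id,class1,class2,value
instid_1,Chicago,Illinois,86229
instid_2,Chicago,Illinois,14461
"""

# Field names are taken from the first line of the file.
rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]["class1"])  # Chicago

# When writing, emit fields in a fixed order, mirroring how VOR Stream
# orders output columns by the declaration order in tables.csv.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["id", "class1", "class2", "value"])
writer.writeheader()
writer.writerows(rows)
```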

Input and Output from S3

Data is written to and read from S3 in the form of CSV files.

Input and Output from PostgreSQL, MS SQL

Depending on which database is set up by the administrators, data can be streamed to or from PostgreSQL or MS SQL. Note that only PostgreSQL supports the static array type; MS SQL does not.

Input from a SAS Dataset

SAS datasets are supported for direct reading only; reading a SAS dataset does not require a SAS installation or license. Writing to SAS datasets requires a licensed SAS product called through a SAS node.

Tables.csv

There are two supported forms of table description files: the preferred data dictionary form and the legacy tables.csv form. Input table fields are based on the fields listed in the tables.csv file.

Data Dictionary

Data dictionaries are an important way to document field names and data types and to ensure their consistency and accuracy. VOR Stream supports describing fields and tables in a data dictionary format using two CSV files, dictionary.csv and tables.csv, located in the root of the src directory of a playpen. The data dictionary, dictionary.csv, has the following fields:

Field Description
name The field name.
type The type of the field: num, int, char, array, bool, date, datetime.
descr The description of the field.
arraylen The length of the array, if the field is an array.
genFormat How to generate test data. Not required; for more details, see Generating Test Data.
group User-provided group/category for the variable. Not required.

Each field should be described only once in this file. To associate fields with tables, use the tables.csv file, which has the following fields:

Field Description
name This is the name of the input, output, or queue for this table. This name is case insensitive and should follow the typical variable naming rules of most computer languages, i.e., start with a letter and contain only letters, numbers, and the underscore character.
type This is the type of table. Valid table types are blank, nofanout, or fanout. The default is fanout.
descr A description of the table.
inherit This field refers to the table or tables to inherit fields from. The inheritance is recursive.
groupkey This field specifies the field to be used as a grouping key. This field is optional - for more details see the Grouping Keys section.

The list of fields belonging to the table is provided on the lines following the table entry. A blank line, a comment, or another table entry with the description filled in marks the end of the list of variables. For tables that inherit from other tables, drop or keep can be appended to a variable name to change which variables are inherited.

Here is an example:

name,type,descr,inherit,groupkey
class,,Common Classifiers for Portfolio tables
instid
nettingsetid
region
city
state
country
GL_account
insttype

mortport,,Mortgage Portfolio table,class
maturity_date
origination_date
principal
segment
float_ref_curve
months_since_default
cleanmort,,Clean Mortgage Portfolio table,mortport
float_ref_curve drop

In this example, the mortport table has all the fields of the class table plus six additional fields. The cleanmort table has all the fields of the mortport table except float_ref_curve.
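The resolution of recursive inheritance with drop can be sketched as follows. This is an illustrative Python model of the semantics described above, not VOR Stream's actual implementation.

```python
# The tables from the example above, as plain data.
TABLES = {
    "class": {"inherit": [], "fields": [
        "instid", "nettingsetid", "region", "city", "state",
        "country", "GL_account", "insttype"], "drop": []},
    "mortport": {"inherit": ["class"], "fields": [
        "maturity_date", "origination_date", "principal", "segment",
        "float_ref_curve", "months_since_default"], "drop": []},
    "cleanmort": {"inherit": ["mortport"], "fields": [],
                  "drop": ["float_ref_curve"]},
}

def resolve(name):
    """Return the full field list for a table, applying inheritance and drop."""
    table = TABLES[name]
    fields = []
    for parent in table["inherit"]:
        fields.extend(resolve(parent))      # inheritance is recursive
    fields.extend(table["fields"])
    return [f for f in fields if f not in table["drop"]]

print(len(resolve("mortport")))  # 14 (8 inherited + 6 of its own)
```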

Legacy Tables.csv form

Tables.csv describes the queues and the fields in those queues. The structure of the tables.csv file allows for inheritance, dropping, and keeping fields. The tables.csv file is located in the root of the src directory of a playpen. This file has the following required fields:

Field Description
tablename This is the name of the input, output, or queue for this table. This name is case insensitive and should follow the typical variable naming rules of most computer languages, i.e., start with a letter and contain only letters, numbers, and the underscore character.
varname This is the name of a field in the table. This name is case insensitive and should follow the typical naming rules of variables in most computer languages. If this name is blank, the current row is a description of the table itself.
type This is the type of the variable, or of the table if the varname field is blank. Valid variable types are num, char, bool, date, datetime, int, array, keep, or drop. Valid table types are blank, nofanout, or fanout. The default is fanout.
descr A description of the field name or the table if the varname field is blank.
arraylen For variables that are array types, specify the array length.
inherit This field is valid on the table description row and refers to the table or tables to inherit fields from. The inheritance is recursive.
groupkey This field is valid on the table description row and specifies the field to be used as a grouping key. This field is optional - for more details see the Grouping Keys section.

The following table details variable types:

Type Description Example CSV form
num This is a double precision floating point value 1.5e-15, 22, NaN
char This is a variable length string of reasonable length “I am a character string”
bool A Boolean variable is either true or false. T, F, true, false, 0, 1
date A date variable represented as days since the Epoch. In input files, dates should be in ISO or RFC3339 date format although many other formats are parsed. 1985/04/01, 2020-01-01
datetime This is a datetime variable. Times are assumed to be UTC unless otherwise specified. 1994-11-05T13:15:30Z
int This is a 64-bit signed integer. -32764
array This is a double precision floating point array of values with length arraylen. In input and output tables, arrays are represented by sequentially named variables with numeric suffixes from 0 to arraylen-1. myarry0, myarry1, myarry2 or 3.1415, 63, -200.1
drop Drop this variable from the table. This is used to drop unwanted variables from inherited tables. n/a
keep Keep this variable from the table and drop all others that are inherited and aren't marked for keeping. This is used to keep variables from inherited tables. n/a

The following table is an example of tables.csv:

tablename,varname,type,inherit,descr,arraylen
input,,,,Input table,
input,id,char,
input,class,char,
input,value,num,
output,,,input,Output table,
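The rule that a blank varname marks a table description row can be sketched with a small parser. This is illustrative Python, not VOR Stream's implementation; the short rows from the example are padded with trailing commas here so every row has all six fields.

```python
import csv
import io

# The legacy example above, with trailing commas padded for uniform rows.
legacy = """tablename,varname,type,inherit,descr,arraylen
input,,,,Input table,
input,id,char,,,
input,class,char,,,
input,value,num,,,
output,,,input,Output table,
"""

tables = {}
for row in csv.DictReader(io.StringIO(legacy)):
    if not row["varname"]:                  # blank varname: table description row
        tables[row["tablename"]] = {"inherit": row["inherit"], "fields": []}
    else:                                   # otherwise: a field of that table
        tables[row["tablename"]]["fields"].append((row["varname"], row["type"]))
```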

Matrix Maps

Matrix maps are used for data not ideal for processing one observation at a time. A matrix map is a two-dimensional matrix of floating-point values, accessible by a row key and a column key; the keys are strings. The input files for matrix maps are in a subdirectory of the playpen input directory called matrices. The matrix input file needs to be present at the time the node requests the matrix map (see Sending and Waiting on Signals).

Here is an example input matrix map for foreign exchange rates:

currency,USD,EUR,JPY,GBP,CHF,CAD,AUD,HKD
USD,1.0,1.1229,0.0094,1.2450,1.0527,0.7366,0.6860,0.1290
EUR,0.8906,1.0,0.0083,1.1087,0.9375,0.6560,0.6109,0.1149
JPY,106.8000,119.9257,1.0,132.9660,112.4329,78.6682,73.2648,13.7805
GBP,0.8032,0.9019,0.0075,1.0,0.8456,0.5916,0.5510,0.1036
CHF,0.9499,1.0666,0.0089,1.1826,1.0,0.6997,0.6516,0.1226
CAD,1.3576,1.5244,0.0127,1.6902,1.4292,1.0,0.9313,0.1752
AUD,1.4577,1.6369,0.0137,1.8149,1.5346,1.0738,1.0,0.1881
HKD,7.7501,8.7026,0.0726,9.6489,8.1589,5.7087,5.3166,1.0

Note that the field names in this file do not have to comply with standard variable name rules.
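Conceptually, a matrix map is a lookup keyed by (row key, column key). A minimal Python sketch of loading a truncated version of the FX file above:

```python
import csv
import io

# A truncated version of the FX matrix map above.
fx = """currency,USD,EUR,JPY
USD,1.0,1.1229,0.0094
EUR,0.8906,1.0,0.0083
JPY,106.8000,119.9257,1.0
"""

reader = csv.reader(io.StringIO(fx))
col_keys = next(reader)[1:]                 # first header cell is just a label
matrix = {}
for row in reader:
    row_key, values = row[0], row[1:]
    for col_key, value in zip(col_keys, values):
        matrix[(row_key, col_key)] = float(value)

print(matrix[("USD", "EUR")])  # 1.1229
```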

Scenarios

As with matrix maps, scenario data is not ideal for a single-observation pass of data. Typically, a scenario is processed for each financial product in a portfolio. The provided scenario file can contain one or more scenarios. Scenario input data is provided as a CSV file with two required fields:

  • scenario – The name of the scenario.
  • date – An ISO formatted date or similar.

For each scenario, the dates should be in ascending order. There is no requirement that the different scenarios have the same number of dates or the same date values. Like matrix maps, the names of the other fields do not have to comply with standard variable name rules. The remaining fields are expected to contain floating point values. Here is a sample scenario file:

scenario, date, Real GDP growth, Nominal GDP growth, Real disposable income growth, Nominal disposable income growth, Unemployment rate, CPI inflation rate, 3-month Treasury rate, 5-year Treasury yield, 10-year Treasury yield, BBB corporate yield, Mortgage rate, Prime rate, Dow Jones Total Stock Market Index, House Price Index, Commercial Real Estate Price Index, Market Volatility Index
baseline,Q1 2020,1.6,3.7,2.2,4.1,3.6,2.2,1.6,1.7,1.8,3.3,3.6,4.8,33,381,213319,22.8
baseline,Q2 2020,1.9,4.0,2.0,3.8,3.6,2.1,1.5,1.7,1.9,3.4,3.6,4.7,33,754,214323,24.5
baseline,Q3 2020,1.8,3.9,1.9,3.6,3.6,2.0,1.5,1.7,1.9,3.5,3.6,4.7,34,123,216327,25.3
baseline,Q4 2020,1.9,4.0,2.1,3.7,3.7,1.9,1.4,1.8,2.0,3.5,3.5,4.6,34,508,217331,25.8
baseline,Q1 2021,1.9,4.0,2.2,4.0,3.7,2.1,1.4,1.8,2.0,3.6,3.6,4.6,34,895,218335,25.9
severely adverse,Q1 2020,-5.3,-3.8,-5.5,-4.2,4.5,1.7,0.1,0.5,0.7,5.2,3.9,3.4,22,262,205308,69.1
severely adverse,Q2 2020,-9.9,-8.7,-7.3,-6.6,6.1,1.1,0.1,0.6,0.9,6.1,4.2,3.4,18,623,198299,70.0
…
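The ascending-dates rule can be checked per scenario with a simple grouping pass. This sketch uses ISO dates (which sort lexicographically) rather than the quarter labels of the sample above, to keep the comparison trivial.

```python
import csv
import io
from itertools import groupby

# A small scenario file with ISO dates; two scenarios, two dates each.
scen = """scenario,date,gdp_growth
baseline,2020-01-01,1.6
baseline,2020-04-01,1.9
severely adverse,2020-01-01,-5.3
severely adverse,2020-04-01,-9.9
"""

rows = list(csv.DictReader(io.StringIO(scen)))
# Within each scenario the dates must be ascending; scenarios need not
# share the same number of dates or the same date values.
for name, group in groupby(rows, key=lambda r: r["scenario"]):
    dates = [r["date"] for r in group]
    assert dates == sorted(dates), f"{name}: dates not ascending"
```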

Alternatively, scenarios can be uploaded in the UI and served from the database. Scenarios are uploaded in an Excel file, with each individual scenario on its own worksheet. The Excel file must also contain an info worksheet, formatted as follows:

SCENARIO_NAME SCENARIO_BASE_DATE SCENARIO_FREQUENCY STRESS_TYPE SCENARIO_CATEGORY
S1 5/31/2022 Monthly Optimistic Q22022_IFSR9
S3 5/31/2022 Monthly Pessimistic Q22022_IFSR9
Baseline 5/31/2022 Monthly Base Q22022_IFSR9

Scenario weights can optionally be specified either in the UI as part of a study or as a file in the local playpen. When scenarios are obtained locally, an optional file containing the scenario weights, scenario_set_name_weights.csv, can be added to the scenario directory. If the file isn't present or there are significant errors in the file, equal weights are used.

The weights CSV file has two fields: scenario and weight. The weights must be positive and are normalized for the user. Here is an example scenario weight file:

scenario, weight
adverse, 10
baseline, 20
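The normalization described above can be sketched in a couple of lines. This is the assumed behavior (positive weights scaled to sum to one), not VOR Stream's actual code.

```python
# The example weights above: adverse 10, baseline 20.
weights = {"adverse": 10.0, "baseline": 20.0}
assert all(w > 0 for w in weights.values())          # weights must be positive
total = sum(weights.values())
normalized = {name: w / total for name, w in weights.items()}
# adverse becomes 1/3 and baseline 2/3 of the total weight.
```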

Simulations

Simulations are like scenarios but are usually much more numerous, and each simulation draw is unnamed. Both a scenario and a simulation can be used in a run. Simulation input data is provided as a CSV file with two required fields:

  • simnum – The number of the simulation draw.
  • date – An ISO formatted date or similar.

The draws should be in numerical order and, for each draw, the dates should be in ascending order. There is no requirement that the different simulations have the same number of dates or the same date values. Like matrix maps, the names of the other fields do not have to comply with standard variable name rules. The remaining fields are expected to contain floating point values. Here is a sample simulation file:

simnum, date, num1, num2
1, "2020-10-01", 1, 2
1, "2020-10-02", 3, 4
1, "2020-10-03", 5, 6
2, "2020-10-01", 1, 2
2, "2020-10-02", 3, 4
2, "2020-10-03", 5, 6
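Because draws arrive in numerical order, the sample file above can be split into one record list per draw with a single consecutive grouping pass, sketched here in Python:

```python
import csv
import io
from itertools import groupby

# The sample simulation file from above.
sim = """simnum,date,num1,num2
1,2020-10-01,1,2
1,2020-10-02,3,4
1,2020-10-03,5,6
2,2020-10-01,1,2
2,2020-10-02,3,4
2,2020-10-03,5,6
"""

rows = list(csv.DictReader(io.StringIO(sim)))
# Consecutive grouping by simnum yields each complete draw in turn.
draws = {int(num): list(group)
         for num, group in groupby(rows, key=lambda r: r["simnum"])}
print(sorted(draws))  # [1, 2]
```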

Risk Factor Curves

Risk factor curves are two-dimensional arrays of dates (maturities) and values (supplied by risk factors). Risk factor curves are used to represent forward rates for prices, discounting rates, exchange rates, and benchmarks.

Risk factor curves in VOR Stream are defined in three steps:

  1. Curve descriptions are provided before a run. The curve description provides, along with other information, the grid of maturities used relative to an evaluation date and the names of the risk factor variables that make up a curve.
  2. Risk factor values are input using a scenario or simulation. These values are combined with the date tenors that have been created relative to the base date.
  3. Bootstrapping is performed to create the final curve values.

The format for a risk factor curve is a CSV file with the following fields. All fields are required unless otherwise specified. This file should be placed in the playpen/input/curves directory or in the look-through directory.

Field Description
name Name of the curve
maturity Relative date, i.e., 1 day, 5 months, 2 years
rfName Name of the risk factor for the given maturity
descr Description of the curve
group Classification of the curve, for UI
currency Currency, i.e., USD, EUR, JPY
fx Two-part name representing the exchange rate, i.e., USD:EUR
bootstrapZero Boolean flag indicating whether to create a zero curve from the values. For forward and FX curves, this should be false
bootstrapOIS Uses Overnight Index Swaps (OIS) to create a discount curve. This is the name of a curve that contains the OIS term structure. The LIBOR/swaps term structure should be referenced by the current curve description. Creates both curves simultaneously.
index This field is optional. Name of an index/benchmark to create: Euribor, Aonia, Eonia. This is also the name of the variable used to satisfy historical queries if indexVariable isn't supplied. Bootstrapping is done using the OIS method.
indexPeriod Index period, if applicable, i.e., 1 week, 6 months. Aonia and Eonia do not have periods.
indexVariable The name of the variable that is used to satisfy historical queries. This field is required if an index curve is used.
calendar Specifies which holiday calendar to use when determining dates.
dayCountConv Specifies which day count convention to use, i.e., actual360
compoundingFrequency Specifies the compounding frequency for bootstrapping

Job Options

Job options can be specified in the joboptions.json file in the input directory of the playpen; this file is not required to run a process. When you run a process, an override JSON file can be specified, which overrides any matching option in the joboptions.json file and adds non-matching options. The resulting job options are written to the output directory as a reference for auditing purposes.
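The override semantics described above amount to a recursive merge of two JSON objects. The following is a sketch of those assumed semantics, not VOR Stream's actual code:

```python
# Matching options in the base are replaced by the override; non-matching
# options are added; nested objects are merged recursively.
def merge(base, override):
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)   # merge nested options
        else:
            merged[key] = value                       # replace or add
    return merged

base = {"system": {"scenario": "dfast", "output": {"usedate": True}}}
override = {"system": {"scenario": "other", "debug": {"trace": True}}}
result = merge(base, override)
```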

There are two parts to the job options specifications:

  1. System options
  2. Process options

System Options

These are options the VOR Stream system knows how to use. The structure of these options is fixed. The following is an example JSON for the system options:

{
   "system": {
      "machines": {
         "localhost": "box.corp.com",
         "othermachine": "1.1.1.1"
      },
      "nodes": {
         "nodename@localhost": 1,
         "node2": 5,
         "node6@othermachine": 30
      },
      "lookthroughpath":"",
      "output": {
         "usedate": true
      },
      "scenario":"dfast",
      "debug": {
         "trace": true,
         "sniff": {
            "nobs": 10,
            "queues": [
               "queue1",
               "queue2",
               "queuen"
            ]
         }
      }
   }
}

Process Options

User-provided process options are ignored by VOR Stream and passed to computational nodes for user-written code to consume. These options are user specified and contain a category and sub-category, followed by name-value pairs. The following is an example of a process options JSON:

{
   "processoptions": {
      "category1": {
         "subcat1": {
            "name": "value",
            "name2": "value2"
         },
         "subcatbar": {
            "name": "value",
            "name2": [
               1,
               2,
               "3.14"
            ]
         }
      },
      "categorymine": {
         "subcat1": {
            "name": "value",
            "name2": "value2"
         },
         "subcatbar": {
            "name": "value",
            "name2": "value2"
         }
      }
   }
}

The processoptions tag is the only required tag.
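On the consuming side, user node code simply walks the category and sub-category structure. A hypothetical Python sketch (the category and option names are from the example above):

```python
import json

# Process options as passed through by VOR Stream: category -> sub-category
# -> name-value pairs.
doc = json.loads("""
{
   "processoptions": {
      "category1": {
         "subcat1": {
            "name": "value",
            "name2": "value2"
         }
      }
   }
}
""")

options = doc["processoptions"]["category1"]["subcat1"]
print(options["name2"])  # value2
```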