Debugging & Testing

Debugging Options

Add print statements to the node files:
- log.Println() for Golang nodes. Note that you need to add "log", or log "github.com/sirupsen/logrus", to the import list.
- logging.info() for Python nodes. Note that you need to add import logging to the top of the file. The print() function in Python does not work well in parallel.
Run a node independently in a debugger.
Use the --trace <fields> option on the vor run command.
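
As a minimal sketch of the Python logging option, the snippet below configures the standard logging module so each message carries the name of the thread that wrote it, which helps when output from parallel workers is interleaved. The process_observation function and its doubled field are hypothetical stand-ins for the body of a node; they are not part of the VOR Stream API.

```python
import logging
import threading

# Include the thread name so interleaved messages from parallel workers
# can be told apart.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(threadName)s %(levelname)s %(message)s",
)


def process_observation(obs):
    """Hypothetical stand-in for the body of a node (not a VOR Stream API)."""
    logging.info("received observation: %s", obs)
    result = {**obs, "doubled": obs["value"] * 2}
    logging.info("emitting observation: %s", result)
    return result


if __name__ == "__main__":
    # Run the same function on a few threads to show the thread-tagged output.
    workers = [
        threading.Thread(
            target=process_observation, args=({"value": v},), name=f"worker-{v}"
        )
        for v in range(3)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```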

Using the --trace <fields> option, select one or more fields used in a process. The selected fields are traced for 10 observations. Tracing produces a trace.csv file in the output/<jobID>/trace directory with the following form:

order,obs,thread,node,queue,fieldName,...
1,1,1,firstNode,input,NaN
2,2,1,firstNode,input,1.5
3,3,1,firstNode,input,-14
4,1,1,secondNode,output,NaN
5,4,1,firstNode,input,56
6,2,1,secondNode,output,1.5
...

Where:

order acts as a timestamp indicating when the field was processed. This is a useful column for sorting. For example, sort by obs and order to get the sequence of values for each observation.
obs is the observation number for the named queue. The observation number may not correspond across nodes/queues.
thread is the thread number for the node feeding the queue. Disabling threading makes the trace easier to follow.
node is the name of the node that is the source of the value.
queue is the name of the queue the field was observed on.
fieldName is the value of the field as observed in the queue. When multiple fields are selected, <NA> will indicate the field is not present in the current queue.

Note: any field longer than 100 chars will be truncated.
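
The sorting suggestion above can be sketched with the Python standard library. The sample rows are copied from the trace layout shown earlier, with a single traced column literally named fieldName; a real trace would be read from output/<jobID>/trace/trace.csv and would use the actual field names as column headers.

```python
import csv
import io

# Sample rows in the trace.csv layout shown above. A real file would be
# opened from output/<jobID>/trace/trace.csv instead.
SAMPLE_TRACE = """\
order,obs,thread,node,queue,fieldName
1,1,1,firstNode,input,NaN
2,2,1,firstNode,input,1.5
3,3,1,firstNode,input,-14
4,1,1,secondNode,output,NaN
5,4,1,firstNode,input,56
6,2,1,secondNode,output,1.5
"""


def sort_trace(text):
    """Return trace rows sorted by observation number, then processing order."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return sorted(rows, key=lambda r: (int(r["obs"]), int(r["order"])))


if __name__ == "__main__":
    for row in sort_trace(SAMPLE_TRACE):
        print(row["obs"], row["order"], row["node"], row["queue"], row["fieldName"])
```

Because observation numbers may not correspond across nodes and queues, it can also be worth adding node or queue to the sort key when comparing values between nodes.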

Generating Test Data

Testing nodes and processes is an important activity in VOR Stream. The vor generate command makes generating test data easy, both as input data to a process and as input data for testing nodes individually.

By default, vor generate produces very bland data. If you are creating demo data or testing specific features like missing values, vor generate can use a format specification for field names. The format specification is placed in the genFormat field of the data dictionary, dictionary.csv. Format is one of the following:

Format	Description	Allowed Field type	Example output
Percent	Generates a number between 0.01 and 100.	num	99.5, 88.87
Prob	Generates a number in [0,1)	num	.55555, .922
currencySmall	Generates a number with two decimal places between 0 and 10,000	num	199.52
currencyMedium	Generates a number with two decimal places between 0 and 1,000,000	num	1000.50
currencyBig	Generates a number with two decimal places between 0 and 100,000,000	num	20,223,423.56
missingN	Generates missing values/NaNs	num	NaN
Int	Generates non-negative integers in [0,1000). This is for the int variable type.	int	247
City	Generates city names from the US. This is coupled with the state format so city and state match up.	char	Tampa, Raleigh
State	Generates state names from the US. This is coupled with the city format so city and state match up.	char	Florida, North Carolina
Region	Generates one of the five regions of the US.	char	Midwest
Country	Generate a name of a country	char	Albania
list string1@string2@string3	Generate strings from a list. Items can be repeated in the list to increase the frequency that they show up in the data	char	String1
list 1.5@2.5@3.5	Generate numbers from a list	num	2.5
list -1@2@3	Generate integers from a list	int	-1
list true@true@false	Generate a boolean from a list	bool	true
list 2020-01-01@2021-01-01@2022-01-01	Generate a date from a list	date	date
LOB	Generate a line of business for a bank.	char	Commercial Lending
name	Generate a name of a person.	char	Olive Yew
color	Generate a color	char	scarlet
companyFake	Generate a fictional company name	char	Polly Pipe
stock	Generate a stock ticker from the S&P 500. This is linked to the company and sector formats.	char	ABT
company	Generate a real company name from the S&P 500. This is linked to the stock and sector formats.	char	Abbott Laboratories
sector	Generate a sector from the S&P 500. This is linked to the company and stock formats.	char	Health Care
stringSmall	Generate a 16 character string	char
stringMedium	Generate a 64 character string	char
stringBig	Generate a 256 character string	char
missingC	Generate a blank	char
sequence	Uses the name of the field and adds a sequential numeric suffix	char	instid1, instid2, instid3, …
sequencen	Produce a numeric sequence starting with 0	int	0, 1, 2, 3 …
dateFuture	Generate a date in the coming year.	date
dateFarFuture	Generate a date in the next ten years.	date
datePast	Generate a date in the past year.	date
dateFarPast	Generate a date in the past ten years.	date
datetime	Generate a datetime variable value.	datetime
uniform min max <num decimals>	Generate numbers uniformly distributed between two numbers. The optional number of decimals is an integer between 0-10 (10 is the default) that specifies the number of digits preserved after the decimal point.	num	5

The order of the fields in the generated data is determined by the order the fields appear in the dictionary.csv file.
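
To illustrate why repeating an item in a list format raises its frequency, here is a small sketch that draws values uniformly from the items of a list specification. It only mimics the behavior described in the table; the function name, the seed handling, and the uniform draw are assumptions and do not reflect how vor generate is implemented.

```python
import random
from collections import Counter


def sample_from_list_format(spec, nobs, seed=0):
    """Draw nobs values from a 'list a@b@c' style spec, uniformly over its items.

    Illustration only: this mimics the documented behavior and is not the
    vor generate implementation.
    """
    keyword, _, items = spec.partition(" ")
    if keyword != "list" or not items:
        raise ValueError(f"not a list format: {spec!r}")
    choices = items.split("@")
    rng = random.Random(seed)
    return [rng.choice(choices) for _ in range(nobs)]


if __name__ == "__main__":
    # "true" is listed twice, so it should appear about twice as often as "false".
    values = sample_from_list_format("list true@true@false", nobs=3000)
    print(Counter(values))
```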

To generate sample data for a table named mine, use the following command:

vor generate mine

This uses the definition of the table to generate valid data to load as a CSV input. The output is placed in the <playpen>/input directory and is called mine.csv. By default, 100 observations are produced. To change how much data is produced, use the --nobs <int> option. If the file already exists, it is not overwritten unless you use the -f (force) option.

To create JSON input for individual node testing, use the --json option. This creates files in the <playpen>/test directory with the extension, .json. You can change the output file name by using the -o <name> option. You can change the seed used in the random number generator by specifying the --seed <int> option.

Options	Description
-f	Forces the program to overwrite the already existing file
--json	Create data in <playpen>/test directory with the extension, .json. Otherwise, the data is created in the input directory as a CSV
--nobs <int>	Specifies the number of observations to create (default 100)
-o <name>	Change output table name (default <table>.csv or <table>.json)
--seed <int>	Change the seed used in the random number generator
--test	Allows you to provide a quoted, comma separated list of nodes to test/debug
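
When test data has to be regenerated repeatedly, for example from a test script, the options above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of VOR Stream. It assumes the options may follow the table name on the command line and that the command is run from inside the playpen; it only invokes vor when the executable is found on the PATH.

```python
import shutil
import subprocess


def generate_command(table, nobs=100, seed=None, as_json=False, force=False):
    """Build the argument list for a vor generate call from the documented options."""
    cmd = ["vor", "generate", table, "--nobs", str(nobs)]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if as_json:
        cmd.append("--json")
    if force:
        cmd.append("-f")
    return cmd


if __name__ == "__main__":
    # 500 reproducible observations for table "mine", overwriting any old file.
    cmd = generate_command("mine", nobs=500, seed=42, force=True)
    print(" ".join(cmd))
    if shutil.which("vor"):
        subprocess.run(cmd, check=True)
```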

End-to-End Process Tests

VOR Stream has built-in capabilities to perform End-to-End (E2E) testing of a process. E2E testing is performed in a playpen and run on one or more stream files. The stream files should be in the src directory of the playpen. In the src directory, create a JSON file called tests.json with the following form:

[
  {
    "streamfile": "<stream-file-name>",
    "logsToComp": [
      "<log-file-1>", "<log-file-2>", ...
    ],
    "ResultsToComp": [
      {
        "name": "<output>.csv",
        "ignore": [
            "<column-name-1>", "<column-name-2>", ...
        ],
        "significance": 4,
        "ids": [
            "<column-name-1>", "<column-name-2>", ...
        ]
      }
    ],
    "WantCreateProcessError": false,
    "JobOptionsFile" : "Optional joboptions in src directory"
  }
]

tests.json contains a list of one or more stream files to test. It has the following rules:

The stream file name must be the same as the process name.
The stream files must be in the source directory.

Below are descriptions of the fields for a test suite entry in the JSON file:

streamfile: The name of the stream file to test.
logsToComp: A list of log files to compare. This is optional.
WantCreateProcessError: A Boolean indicating whether the process creation is expected to fail. This is optional. By default, process creation is expected to succeed.
JobOptionsFile: Name of a job options file in the src directory to use with the current test. This is optional.
ResultsToComp: A list of output CSV files to compare. This is optional.
- name: The name of the CSV file to compare.
- ignore: A list of columns to ignore in the comparison. This is optional.
- significance: The number of significant digits to compare. Only applies to numeric columns. A negative significance implies digits to the left of the decimal. This is optional.
- ids: A list of columns to use as identifiers in the comparison. This signals to use csvdiff to compare the files. csvdiff allows you to compare the semantic contents of two CSV files, ignoring things like row and column ordering in order to get to what’s actually changed.¹ If ids is not specified, a traditional diff is run, and the test output file must match the benchmark file exactly, i.e., the order of rows and columns must be the same. Also, if ids is not specified, the ignore and significance fields are not used.
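
The difference between an exact diff and an ids-based comparison can be sketched as follows. This is not how csvdiff or vor test is implemented; it is a simplified illustration in which rows are indexed by their id columns, so row order and column order no longer matter and ignored columns are dropped. The significance setting is not modeled, and the balance and runDate columns are made up for the example.

```python
import csv
import io


def keyed_rows(text, ids, ignore=()):
    """Index CSV rows by their id columns, dropping ignored columns."""
    index = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = tuple(row[c] for c in ids)
        index[key] = {c: v for c, v in row.items() if c not in ignore}
    return index


def compare(bench_text, output_text, ids, ignore=()):
    """Return (added, removed, changed) keys, independent of row and column order."""
    bench = keyed_rows(bench_text, ids, ignore)
    out = keyed_rows(output_text, ids, ignore)
    added = sorted(out.keys() - bench.keys())
    removed = sorted(bench.keys() - out.keys())
    changed = sorted(k for k in bench.keys() & out.keys() if bench[k] != out[k])
    return added, removed, changed


if __name__ == "__main__":
    # Same rows in a different row and column order; only A1's balance differs.
    bench = "instid,balance,runDate\nA1,100.0,2024-01-01\nA2,250.5,2024-01-01\n"
    output = "balance,instid,runDate\n250.5,A2,2024-02-02\n101.0,A1,2024-02-02\n"
    print(compare(bench, output, ids=["instid"], ignore=["runDate"]))
```

A traditional diff would report every line of these two files as different, while the keyed comparison reports only the row whose non-ignored values changed.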

Note

The tests are run using a generic option which removes the datetime and other non-repeatable information from the logs.

To run a test suite use:

vor test

To run a single test use:

vor test -s <streamfile>

You can optionally specify a directory to store the diff output:

vor test --diff-output-dir <output-dir>

If a relative path is given, the directory is created within the playpen directory. For each failed test with differences, the output file, benchmark file (prefixed with bench_), and the diff (prefixed with diff_) are copied into this specified directory.

The example below shows the directory structure of a diff output directory named diff after the tests alias and distinct have failed:

[Image: test_diff_dir]

Here is an example of a tests.json file:

```JSON
[
  {
    "streamfile": "baseball",
    "logsToComp": [],
    "ResultsToComp": [
      {
        "name": "output_bpf.csv",
        "ids": [
          "playerID",
          "yearID",
          "stint",
          "teamID",
          "lgID",
          "G"
        ]
      }
    ]
  },
  {
    "streamfile": "distinct",
    "logsToComp": [
      "distinct"
    ],
    "ResultsToComp": []
  },
  {
    "streamfile": "orderby",
    "logsToComp": [],
    "ResultsToComp": [
      {
        "name": "test.csv",
        "ids": [
          "str",
          "n1"
        ]
      }
    ]
  },
  {
    "streamfile": "cycle",
    "logsToComp": [],
    "ResultsToComp": [
      {
        "name": "results.csv",
        "ids": [
          "instid"
        ]
      }
    ]
  }
]
```

¹ https://pypi.org/project/csvdiff/