
Debugging & Testing

Debugging Options

  1. Add print statements to the node files:
    • log.Println() for Go nodes. Note that you need to add "log" (or the aliased import log "github.com/sirupsen/logrus") to the import list.
    • logging.info() for Python nodes. Note that you need to add import logging to the top of the file. Python's print() function does not work well in parallel.
  2. Run a node independently in a debugger
  3. Use the --trace <fields> option on the vor run command.

Using the --trace <fields> option, select one or more fields used in a process. The selected fields will be traced for 10 observations. Tracing produces a trace.csv file in the output/<jobID>/trace directory with the following form:

order,obs,thread,node,queue,fieldName,...
1,1,1,firstNode,input,NaN
2,2,1,firstNode,input,1.5
3,3,1,firstNode,input,-14
4,1,1,secondNode,output,NaN
5,4,1,firstNode,input,56
6,2,1,secondNode,output,1.5
...

Where:

  • order is essentially a timestamp of when the field was processed, which makes it a useful column for sorting. For example, sort by obs and order to get a sequence of values for each observation.
  • obs is the observation number for the named queue. The observation number may not correspond across nodes/queues.
  • thread is the thread number for the node feeding the queue. Disabling threading produces a clearer trace.
  • node is the name of the node that is the source of the value.
  • queue is the name of the queue the field was observed on.
  • fieldName is the value of the field as observed in the queue; the column is named after the traced field. When multiple fields are selected, <NA> indicates that a field is not present in the current queue.

Note: any field value longer than 100 characters will be truncated.
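
As a concrete illustration of the sorting suggestion above, the snippet below reads trace-shaped rows and orders them by obs, then order, so each observation's values appear together (the field name amount and the sample values are made up; only the column layout comes from the example above):

```python
import csv
from io import StringIO

# Rows in the trace.csv layout: the last column is named after the traced field.
sample = """order,obs,thread,node,queue,amount
1,1,1,firstNode,input,NaN
2,2,1,firstNode,input,1.5
4,1,1,secondNode,output,NaN
6,2,1,secondNode,output,1.5
"""

rows = list(csv.DictReader(StringIO(sample)))
# Sort by obs, then order, to follow each observation through the process.
rows.sort(key=lambda r: (int(r["obs"]), int(r["order"])))
for r in rows:
    print(r["obs"], r["node"], r["queue"], r["amount"])
```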

Generating Test Data

Testing nodes and processes is an important and efficient activity in VOR Stream. The vor generate command makes it easy to generate test data, both as input data for a process and as input for testing nodes individually.

By default, vor generate produces very bland data. If you are creating demo data or testing specific features like missing values, vor generate can use a format specification for field names. The format specification is placed in the genFormat field of the data dictionary, dictionary.csv. The format is one of the following:

| Format | Description | Allowed field type | Example output |
|---|---|---|---|
| Percent | Generates a number between 0.01 and 100. | num | 99.5, 88.87 |
| Prob | Generates a number in [0,1). | num | .55555, .922 |
| currencySmall | Generates a number with two decimal places between 0 and 10,000. | num | 199.52 |
| currencyMedium | Generates a number with two decimal places between 0 and 1,000,000. | num | 1000.50 |
| currencyBig | Generates a number with two decimal places between 0 and 100,000,000. | num | 20,223,423.56 |
| missingN | Generates missing values/NaNs. | num | NaN |
| Int | Generates positive integers in [0,1000). This is for the int variable type. | int | 247 |
| City | Generates US city names. Coupled with the state format so city and state match up. | char | Tampa, Raleigh |
| State | Generates US state names. Coupled with the city format so city and state match up. | char | Florida, North Carolina |
| Region | Generates one of the five regions of the US. | char | Midwest |
| Country | Generates the name of a country. | char | Albania |
| list string1@string2@string3 | Generates strings from a list. Items can be repeated in the list to increase how often they appear in the data. | char | String1 |
| list 1.5@2.5@3.5 | Generates numbers from a list. | num | 2.5 |
| list -1@2@3 | Generates integers from a list. | int | -1 |
| list true@true@false | Generates a boolean from a list. | bool | true |
| list 2020-01-01@2021-01-01@2022-01-01 | Generates a date from a list. | date | |
| LOB | Generates a line of business for a bank. | char | Commercial Lending |
| name | Generates the name of a person. | char | Olive Yew |
| color | Generates a color. | char | scarlet |
| companyFake | Generates a fictional company name. | char | Polly Pipe |
| stock | Generates a stock ticker from the S&P 500. Linked to the company and sector formats. | char | ABT |
| company | Generates a real company name from the S&P 500. Linked to the stock and sector formats. | char | Abbott Laboratories |
| sector | Generates a sector from the S&P 500. Linked to the company and stock formats. | char | Health Care |
| stringSmall | Generates a 16-character string. | char | |
| stringMedium | Generates a 64-character string. | char | |
| stringBig | Generates a 256-character string. | char | |
| missingC | Generates a blank. | char | |
| sequence | Uses the name of the field and adds a sequential numeric suffix. | char | instid1, instid2, instid3, … |
| sequencen | Produces a numeric sequence starting at 0. | int | 0, 1, 2, 3, … |
| dateFuture | Generates a date in the coming year. | date | |
| dateFarFuture | Generates a date in the next ten years. | date | |
| datePast | Generates a date in the past year. | date | |
| dateFarPast | Generates a date in the past ten years. | date | |
| datetime | Generates a datetime value. | datetime | |
| uniform min max <num decimals> | Generates numbers uniformly distributed between two numbers. The optional number of decimals is an integer between 0 and 10 (default 10) specifying the number of digits kept after the decimal point. | num | 5 |
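
As an aside, the rounding behavior of the uniform format can be sketched in a few lines of Python (this mimics the description above; it is not the actual vor generate implementation):

```python
import random

def gen_uniform(lo, hi, decimals=10, rng=random):
    """Draw a number uniformly in [lo, hi) and keep `decimals`
    digits after the decimal point (default 10), mirroring the
    'uniform min max <num decimals>' format description."""
    return round(rng.uniform(lo, hi), decimals)

rng = random.Random(42)  # a fixed seed, analogous to --seed <int>
print(gen_uniform(0, 5, 0, rng))
```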

The order of the fields in the generated data is determined by the order the fields appear in the dictionary.csv file.
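
For illustration, a dictionary.csv fragment using genFormat might look like the following (the column names other than genFormat are an assumption here; consult your actual data dictionary layout):

```csv
name,type,genFormat
instid,char,sequence
city,char,City
state,char,State
lob,char,LOB
balance,num,currencyMedium
rate,num,uniform 0 0.25 4
```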

To generate sample data for a table named mine, use the following command:

vor generate mine

This will use the definition of the table to generate valid data to load as a CSV input. The output will be placed in the <playpen>/input directory and be called mine.csv. By default, 100 observations will be produced. To change how much data is produced, use the --nobs <int> option. If the file already exists, it will not be overwritten unless you use the -f (force) option.

To create JSON input for individual node testing, use the --json option. This creates files in the <playpen>/test directory with the extension, .json. You can change the output file name by using the -o <name> option. You can change the seed used in the random number generator by specifying the --seed <int> option.

| Option | Description |
|---|---|
| -f | Forces the program to overwrite an already existing file |
| --json | Creates data in the <playpen>/test directory with the .json extension. Otherwise, the data is created in the input directory as a CSV |
| --nobs <int> | Specifies the number of observations to create (default 100) |
| -o <name> | Changes the output table name (default <table>.csv or <table>.json) |
| --seed <int> | Changes the seed used in the random number generator |
| --test | Provides a quoted, comma-separated list of nodes to test/debug |

End-to-End Process Tests

VOR Stream has built-in capabilities to perform End-to-End (E2E) testing of a process. E2E testing is performed in a playpen and run on one or more stream files. The stream files should be in the src directory of the playpen. In the src directory, create a JSON file called tests.json with the following form:

[
  {
    "streamfile": "<stream-file-name>",
    "logsToComp": [
      "<log-file-1>", "<log-file-2>", ...
    ],
    "ResultsToComp": [
      {
        "name": "<output>.csv",
        "ignore": [
            "<column-name-1>", "<column-name-2>", ...
        ],
        "significance": 4,
        "ids": [
            "<column-name-1>", "<column-name-2>", ...
        ]
      }
    ],
    "WantCreateProcessError": false,
    "JobOptionsFile" : "Optional joboptions in src directory"
  }
]

tests.json contains a list of one or more stream files to test. It has the following rules:

  1. The stream file name must be the same as the process name.
  2. The stream files must be in the source directory.

Below are descriptions of the fields for a test suite entry in the JSON file:

  • streamfile: The name of the stream file to test.
  • logsToComp: A list of log files to compare. This is optional.
  • WantCreateProcessError: A Boolean indicating whether process creation is expected to fail. This is optional; by default, process creation is expected to succeed.
  • JobOptionsFile: The name of a job options file in the src directory to use with the current test. This is optional.
  • ResultsToComp: A list of output CSV files to compare. This is optional.
    • name: The name of the CSV file to compare.
    • ignore: A list of columns to ignore in the comparison. This is optional.
    • significance: The number of significant digits to compare. Only applies to numeric columns. A negative significance implies digits to the left of the decimal. This is optional.
    • ids: A list of columns to use as identifiers in the comparison. This signals the test to use csvdiff to compare the files. csvdiff compares the semantic contents of two CSV files, ignoring things like row and column ordering to get to what has actually changed. If ids is not specified, a traditional diff is run, and the test output file must match the benchmark file exactly, i.e., the order of rows and columns must be the same. Also, if ids is not specified, the ignore and significance fields are not used.
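
To make the significance rules concrete, here is a sketch of comparing two numeric values at a given significance (an interpretation of the description above, not csvdiff's actual algorithm):

```python
import math

def equal_at_significance(a, b, sig):
    """Round both values per `sig` and compare.
    A positive sig keeps that many significant digits; a negative sig
    rounds to digits left of the decimal point, as described above."""
    def sig_round(x):
        if sig > 0:
            if x == 0:
                return 0.0
            exp = math.floor(math.log10(abs(x)))
            return round(x, sig - 1 - exp)
        return round(x, sig)  # sig <= 0: round left of the decimal
    return sig_round(a) == sig_round(b)

print(equal_at_significance(3.14159, 3.14162, 4))  # agree to 4 significant digits
print(equal_at_significance(1510.0, 1490.0, -2))   # agree at the hundreds place
```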

Note

The tests are run using a generic option which removes the datetime and other non-repeatable information from the logs.

To run a test suite use:

vor test

To run a single test use:

vor test -s <streamfile>

You can optionally specify a directory to store the diff output:

vor test --diff-output-dir <output-dir>

If a relative path is given, the directory is created within the playpen directory. For each failed test with differences, the output file, the benchmark file (prefixed with bench_), and the diff (prefixed with diff_) are copied into the specified directory.

The example below shows the directory structure of the diff output directory, specified as diff, for the failed tests alias and distinct:

(figure: test_diff_dir — directory structure of the diff output directory)

Here is an example of a *tests.json* file:

```json
[
  {
    "streamfile": "baseball",
    "logsToComp": [],
    "ResultsToComp": [
      {
        "name": "output_bpf.csv",
        "ids": [
          "playerID",
          "yearID",
          "stint",
          "teamID",
          "lgID",
          "G"
        ]
      }
    ]
  },
  {
    "streamfile": "distinct",
    "logsToComp": [
      "distinct"
    ],
    "ResultsToComp": []
  },
  {
    "streamfile": "orderby",
    "logsToComp": [],
    "ResultsToComp": [
      {
        "name": "test.csv",
        "ids": [
          "str",
          "n1"
        ]
      }
    ]
  },
  {
    "streamfile": "cycle",
    "logsToComp": [],
    "ResultsToComp": [
      {
        "name": "results.csv",
        "ids": [
          "instid"
        ]
      }
    ]
  }
]