Debugging & Testing¶
Debugging Options¶
- Add print statements to the node files:
  - log.Println() for Golang nodes. Note that you need to add "log", or log "github.com/sirupsen/logrus", to the import list.
  - logging.info() for Python nodes. Note that you need to add import logging to the top of the file. The print() function in Python does not work well in parallel.
- Run a node independently in a debugger
- Use the --trace <fields> option on the vor run command.
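As a minimal sketch of the first option for a Python node, the snippet below logs a field before and after it is changed. The function name scale_balance and the fields instid and balance are hypothetical and only serve the illustration; how VOR Stream itself configures logging is not covered here, so the basicConfig() call is an assumption for running the function on its own.

```python
import logging

# Assumption: standalone run. basicConfig() does nothing if the root logger
# already has handlers, so an existing logging setup is left untouched.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(threadName)s %(levelname)s %(message)s",
)


def scale_balance(row, factor=1.1):
    """Hypothetical node logic: scale one numeric field and log the change."""
    before = row["balance"]
    after = before * factor
    # Lazy %-style arguments avoid building the message when INFO is disabled.
    logging.info("instid=%s balance %s -> %s", row["instid"], before, after)
    return {**row, "balance": after}


if __name__ == "__main__":
    scale_balance({"instid": "a1", "balance": 100.0})
```

Including the thread name in the log format helps tell apart messages from nodes that run in parallel.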
Using the --trace <fields> option, select one or more fields used in a process. The selected fields are traced for 10 observations. Tracing produces a trace.csv file in the output/<jobID>/trace directory with the following form:
order,obs,thread,node,queue,fieldName,...
1,1,1,firstNode,input,NaN
2,2,1,firstNode,input,1.5
3,3,1,firstNode,input,-14
4,1,1,secondNode,output,NaN
5,4,1,firstNode,input,56
6,2,1,secondNode,output,1.5
...
Where:
- order acts as a timestamp of when the field was processed. This is a useful column for sorting. For example, sort by obs and then by order to get the sequence of values for each observation.
- obs is the observation number for the named queue. The observation number may not correspond across nodes/queues.
- thread is the thread number for the node feeding the queue. Disabling threading makes the trace easier to read.
- node is the name of the node that is the source of the value.
- queue is the name of the queue the field was observed on.
- fieldName is the value of the field as observed in the queue. When multiple fields are selected, <NA> will indicate the field is not present in the current queue.
Note: any field value longer than 100 characters is truncated.
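The sorting suggested for the order column can be sketched in Python as follows. This is only an illustration that reuses the sample rows shown above as an inline string; for a real job you would read the trace.csv file from the output/<jobID>/trace directory instead. The function name sort_trace is hypothetical.

```python
import csv
import io

# Sample rows copied from the trace layout above (the "..." lines are omitted).
SAMPLE = """order,obs,thread,node,queue,fieldName
1,1,1,firstNode,input,NaN
2,2,1,firstNode,input,1.5
3,3,1,firstNode,input,-14
4,1,1,secondNode,output,NaN
5,4,1,firstNode,input,56
6,2,1,secondNode,output,1.5
"""


def sort_trace(text):
    """Return trace rows sorted by observation number, then processing order."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return sorted(rows, key=lambda r: (int(r["obs"]), int(r["order"])))


if __name__ == "__main__":
    for row in sort_trace(SAMPLE):
        print(row["obs"], row["order"], row["node"], row["queue"], row["fieldName"])
```

Because observation numbers may not correspond across nodes or queues, it can also help to filter on node or queue before reading the sorted sequence.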
Generating Test Data¶
Testing nodes and processes is an important and efficient activity in VOR Stream. The vor generate command was created to make generating test data easy, both as input data to a process, and input data for testing nodes individually.
By default, vor generate produces very bland data. If you are creating demo data or testing specific features like missing values, vor generate can use a format specification for field names. The format specification is placed in the genFormat field of the data dictionary, dictionary.csv. Format is one of the following:
Format | Description | Allowed Field type | Example output |
---|---|---|---|
Percent | Generates a number between 0.01 and 100. | num | 99.5, 88.87 |
Prob | Generates a number in [0,1) | num | .55555, .922 |
currencySmall | Generate a number with two decimal places between 0 and 10,000 | num | 199.52 |
currencyMedium | Generate a number with two decimal places between 0 and 1,000,000 | num | 1000.50 |
currencyBig | Generate a number with two decimal places between 0 and 100,000,000 | num | 20,223,423.56 |
missingN | Generates missing values/NaNs | num | NaN |
Int | Generate non-negative integers in [0,1000). This is for the int variable type. | int | 247 |
City | Generates city names from the US. This is coupled with the state format so city and state match up. | char | Tampa, Raleigh |
State | Generates state names from the US. This is coupled with the city format so city and state match up. | char | Florida, North Carolina |
Region | Generates one of the five regions of the US. | char | Midwest |
Country | Generate a name of a country | char | Albania |
list string1@string2@string3 | Generate strings from a list. Items can be repeated in the list to increase how often they appear in the data | char | string1 |
list 1.5@2.5@3.5 | Generate numbers from a list | num | 2.5 |
list -1@2@3 | Generate integers from a list | int | -1 |
list true@true@false | Generate a boolean from a list | bool | true |
list 2020-01-01@2021-01-01@2022-01-01 | Generate a date from a list | date | date |
LOB | Generate a line of business for a bank. | char | Commercial Lending |
name | Generate a name of a person. | char | Olive Yew |
color | Generate a color | char | scarlet |
companyFake | Generate a fictional company name | char | Polly Pipe |
stock | Generate a stock ticker from the S&P 500. This is linked to the company and sector formats. | char | ABT |
company | Generate a real company name from the S&P 500. This is linked to the stock and sector formats. | char | Abbott Laboratories |
sector | Generate a sector from the S&P 500. This is linked to the company and stock formats. | char | Health Care |
stringSmall | Generate a 16 character string | char | |
stringMedium | Generate a 64 character string | char | |
stringBig | Generate a 256 character string | char | |
missingC | Generate a blank | char | |
sequence | Uses the name of the field and adds a sequential numeric suffix | char | instid1, instid2, instid3, … |
sequencen | Produce a numeric sequence starting with 0 | int | 0, 1, 2, 3 … |
dateFuture | Generate a date in the coming year. | date | |
dateFarFuture | Generate a date in the next ten years. | date | |
datePast | Generate a date in the past year. | date | |
dateFarPast | Generate a date in the past ten years. | date | |
datetime | Generate a datetime variable value. | datetime | |
uniform min max <num decimals> | Generate numbers uniformly distributed between two numbers. The optional number of decimals is an integer between 0-10 (10 is the default) that specifies the number of digits preserved after the decimal point. | num | 5 |
The order of the fields in the generated data is determined by the order the fields appear in the dictionary.csv file.
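To make the list and uniform formats more concrete, here is a minimal Python sketch of what such specifications describe. It is not how vor generate is implemented; the function generate_value and its parsing rules are assumptions made only for illustration.

```python
import random


def generate_value(spec, rng):
    """Illustrative interpretation of two format specs (not vor's own code)."""
    parts = spec.split()
    kind = parts[0]
    if kind == "list":
        # Items are separated by "@"; repeating an item raises its frequency.
        items = spec[len("list"):].strip().split("@")
        return rng.choice(items)
    if kind == "uniform":
        low, high = float(parts[1]), float(parts[2])
        # The optional third argument is the number of decimals (default 10).
        decimals = int(parts[3]) if len(parts) > 3 else 10
        return round(rng.uniform(low, high), decimals)
    raise ValueError(f"format not covered by this sketch: {kind}")


if __name__ == "__main__":
    rng = random.Random(42)  # a fixed seed makes the output repeatable
    print([generate_value("list red@red@red@blue", rng) for _ in range(8)])
    print(generate_value("uniform 1 5 2", rng))
```

In the list example, red is listed three times and blue once, so red is drawn about three times as often.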
To generate sample data for a table named mine, use the following command:
vor generate mine
This uses the definition of the table to generate valid data to load as a CSV input. The output is placed in the <playpen>/input directory and is called mine.csv. By default, 100 observations are produced. To change how much data is produced, use the --nobs <int> option. If the file already exists, it is not overwritten unless you use the -f (force) option.
To create JSON input for individual node testing, use the --json option. This creates files in the <playpen>/test directory with the extension .json. You can change the output file name by using the -o <name> option. You can change the seed used in the random number generator by specifying the --seed <int> option.
Options | Description |
---|---|
-f | Forces the program to overwrite the already existing file |
--json | Create data in the <playpen>/test directory with the extension .json. Otherwise, the data is created in the input directory as a CSV |
--nobs <int> | Specifies the number of observations to create (default 100) |
-o <name> | Change output table name (default <table>.csv or <table>.json) |
--seed <int> | Change the seed used in the random number generator |
--test | Allows you to provide a quoted, comma-separated list of nodes to test/debug |
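If test data is regenerated from a script, the options in the table can be assembled programmatically. The sketch below only builds the argument list and does not run anything. The helper build_generate_command is hypothetical, and placing the options after the table name is an assumption.

```python
def build_generate_command(table, nobs=None, seed=None, as_json=False,
                           force=False, output=None):
    """Build an argument list for vor generate from the documented options."""
    cmd = ["vor", "generate", table]
    if nobs is not None:
        cmd += ["--nobs", str(nobs)]   # number of observations to create
    if seed is not None:
        cmd += ["--seed", str(seed)]   # seed for the random number generator
    if as_json:
        cmd.append("--json")           # JSON output for individual node tests
    if force:
        cmd.append("-f")               # overwrite an existing file
    if output is not None:
        cmd += ["-o", output]          # change the output table name
    return cmd


if __name__ == "__main__":
    print(" ".join(build_generate_command("mine", nobs=500, seed=42, force=True)))
```

The resulting list can be passed to subprocess.run() from the playpen directory.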
End-to-End Process Tests¶
VOR Stream has built-in capabilities to perform End-to-End (E2E) testing of a process. E2E testing is performed in a playpen and run on one or more stream files. The stream files should be in the src directory of the playpen. In the src directory, create a JSON file called tests.json with the following form:
[
  {
    "streamfile": "<stream-file-name>",
    "logsToComp": [
      "<log-file-1>", "<log-file-2>", ...
    ],
    "ResultsToComp": [
      {
        "name": "<output>.csv",
        "ignore": [
          "<column-name-1>", "<column-name-2>", ...
        ],
        "significance": 4,
        "ids": [
          "<column-name-1>", "<column-name-2>", ...
        ]
      }
    ],
    "WantCreateProcessError": false,
    "JobOptionsFile": "Optional joboptions in src directory"
  }
]
tests.json contains a list of one or more stream files to test. It has the following rules:
- The stream file name must be the same as the process name.
- The stream files must be in the source directory.
Below are descriptions of the fields for a test suite entry in the JSON file:
streamfile
: The name of the stream file to test.
logsToComp
: A list of log files to compare. This is optional.
wantCreateProcessError
: A Boolean indicating whether the process creation is expected to fail. This is optional. By default, process creation is expected to succeed.
jobOptionsFile
: Name of a job options file in the src directory to use with the current test. This is optional.
ResultsToComp
: A list of output CSV files to compare. This is optional.
    name
    : The name of the CSV file to compare.
    ignore
    : A list of columns to ignore in the comparison. This is optional.
    significance
    : The number of significant digits to compare. Only applies to numeric columns. A negative significance implies digits to the left of the decimal. This is optional.
    ids
    : A list of columns to use as identifiers in the comparison. This signals to use csvdiff to compare the files. csvdiff allows you to compare the semantic contents of two CSV files, ignoring things like row and column ordering in order to get to what’s actually changed. [1] If ids is not specified, a traditional diff is run, and the test output file must match the benchmark file exactly, i.e., the order of rows and columns must be the same. Also, if ids is not specified, the ignore and significance fields are not used.
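The effect of ids and ignore can be illustrated with a small key-based comparison in Python. This is a sketch of the general idea only, not the csvdiff implementation, and it leaves out significance. The functions index_rows and keyed_diff and the columns value and runtime are hypothetical.

```python
import csv
import io


def index_rows(text, ids, ignore=()):
    """Map each row's identifier tuple to its remaining, non-ignored columns."""
    rows = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = tuple(row[col] for col in ids)
        rows[key] = {col: val for col, val in row.items() if col not in ignore}
    return rows


def keyed_diff(bench_text, out_text, ids, ignore=()):
    """Compare two CSV texts by key, independent of row and column order."""
    bench = index_rows(bench_text, ids, ignore)
    out = index_rows(out_text, ids, ignore)
    shared = bench.keys() & out.keys()
    return {
        "added": sorted(out.keys() - bench.keys()),
        "removed": sorted(bench.keys() - out.keys()),
        "changed": sorted(k for k in shared if bench[k] != out[k]),
    }


if __name__ == "__main__":
    bench = "instid,value,runtime\n1,10,0.5\n2,20,0.7\n"
    # Same data with rows and columns reordered and a different runtime.
    out = "value,instid,runtime\n20,2,0.9\n10,1,0.4\n"
    print(keyed_diff(bench, out, ids=["instid"], ignore=["runtime"]))
```

With instid as the identifier and runtime ignored, the two files compare as equal even though a line-by-line diff would report every row as different.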
Note
The tests are run using a generic option which removes the datetime and other non-repeatable information from the logs.
To run a test suite use:
vor test
To run a single test use:
vor test -s <streamfile>
You can optionally specify a directory to store the diff output:
vor test --diff-output-dir <output-dir>
If a relative path is given, the directory is created within the playpen directory. For each failed test with differences, the output file, benchmark file (prefixed with bench_), and the diff (prefixed with diff_) are copied into this specified directory.
The example below shows the directory structure of the diff output directory specified as diff for failed tests alias and distinct:
Here is an example of a *tests.json* file:
```JSON
[
  {
    "streamfile": "baseball",
    "logsToComp": [],
    "ResultsToComp": [
      {
        "name": "output_bpf.csv",
        "ids": [
          "playerID",
          "yearID",
          "stint",
          "teamID",
          "lgID",
          "G"
        ]
      }
    ]
  },
  {
    "streamfile": "distinct",
    "logsToComp": [
      "distinct"
    ],
    "ResultsToComp": []
  },
  {
    "streamfile": "orderby",
    "logsToComp": [],
    "ResultsToComp": [
      {
        "name": "test.csv",
        "ids": [
          "str",
          "n1"
        ]
      }
    ]
  },
  {
    "streamfile": "cycle",
    "logsToComp": [],
    "ResultsToComp": [
      {
        "name": "results.csv",
        "ids": [
          "instid"
        ]
      }
    ]
  }
]