CSV File Reader Input Adapter

Introduction

The CSV File Reader is an embedded adapter that reads comma-separated value (CSV) files.

An embedded adapter is an adapter that runs in the same process as a StreamBase Server. The CSV File Reader reads records from a CSV file, creates tuples from these records, then sends these tuples to the operator downstream from it in its StreamBase application. A record typically consists of a line in the CSV file. If quoted, however, a record can span more than one line in the file.
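The multi-line record rule can be illustrated with a short Java sketch. This is not the adapter's implementation: the class and method names (RecordSplitter, readRecords) are hypothetical, and the sketch assumes that quote state can be tracked by simple toggling, so that a line break inside an open quoted string stays part of the record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the StreamBase API.
class RecordSplitter {
    // Groups physical lines into logical records. A record ends at the first
    // line break that does not fall inside a quoted string.
    static List<String> readRecords(List<String> lines, char quote) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (String line : lines) {
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == quote) {
                    // A doubled quote toggles twice, leaving the state unchanged.
                    inQuotes = !inQuotes;
                }
            }
            current.append(line);
            if (inQuotes) {
                current.append('\n'); // the line break is part of the quoted field
            } else {
                records.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString()); // unterminated quote: keep the remainder
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1,\"first line", "second line\"", "2,plain");
        System.out.println(readRecords(lines, '"').size()); // prints 2
    }
}
```

Here three physical lines yield two records, because the first record's quoted field contains a line break.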

The CSV File Reader is similar to an input stream that supplies its own input from a CSV file. As with an input stream, a schema needs to be specified for the CSV File Reader. The schema used by the CSV File Reader is specified in the Edit Schema tab of the Properties View in StreamBase Studio.

An embedded adapter that reads from a CSV file differs from an external data source in that it consumes its input file as rapidly as it can. The rate at which it consumes records and produces tuples is governed only by the speed at which it can read records from disk and create tuples from them. This is not typically true of an external data source, and it may not be the desired behavior. The Period property of the CSV File Reader governs the rate at which the adapter consumes records. The period is the amount of time that the CSV File Reader pauses between consuming records. That is, the CSV File Reader reads one record, processes it to completion, pauses for the specified period, and then reads another record.
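The read, process, pause cycle can be sketched in Java as follows. The names PacedReader and emitPaced are hypothetical, and the sketch assumes the pause is a plain sleep of the period in milliseconds after each record. It illustrates the pacing rule only and is not the adapter's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch, not the adapter's code.
class PacedReader {
    // Sends each record downstream, then pauses for periodMillis before
    // taking the next one. Returns the number of records sent.
    static int emitPaced(List<String> records, long periodMillis, Consumer<String> downstream) {
        int sent = 0;
        for (String record : records) {
            downstream.accept(record); // process the record to completion
            sent++;
            if (periodMillis > 0) {
                try {
                    Thread.sleep(periodMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and stop early
                    return sent;
                }
            }
        }
        return sent;
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        int sent = emitPaced(Arrays.asList("a,1", "b,2", "c,3"), 100, out::add);
        System.out.println(sent + " records sent"); // prints "3 records sent"
    }
}
```

Under this model, a period of 100 milliseconds caps throughput at roughly ten records per second, regardless of how fast the file can be read.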

The name of the CSV file is specified as a property of the CSV File Reader. If you use the File Name field without the Start Control Port option, the specified file must exist in the same project folder in StreamBase Studio, or in a referenced project's folder. If you use the File Name field in conjunction with the Start Control Port option, you can specify a relative or absolute path to the CSV file. If you specify a relative path, the named file is searched for in the directory specified in the StreamBase Server sbd.sbconf configuration file. In the global section, look for the operator-resource-search element. By default, it is commented out, as seen in this example:

<global>
 ...
<!-- The following optional element is used to load any
     operator/adapter resources required by this application. You can
     have as many <operator-resource-search> elements as you like. Each
     one must have a "directory" attribute, which will be scanned for
     any resources referenced by operators and adapters. -->
<!-- <operator-resource-search directory="${STREAMBASE_HOME}/resources"/> -->
</global>

Uncomment the element and specify a path. For example:

<global>
  <operator-resource-search directory="/home/sbuser/mysbapps/resources"/>
</global>

The size of a CSV file may also be limited in practice, so it may not be feasible to provide the desired amount of data in a single file. One solution is to iterate over the CSV file a number of times, which the Repeat property provides for. If 0 is specified for Repeat, the CSV File Reader iterates over the CSV file indefinitely.
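The Repeat rule, including the special meaning of 0, can be expressed as a small Java sketch. The names RepeatSketch, shouldReadAgain, and replay are hypothetical, and the maxTuples cap exists only so that the indefinite case terminates in a demonstration. The adapter itself has no such cap described here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the Repeat semantics, not the adapter's code.
class RepeatSketch {
    // A Repeat value of 0 means "iterate indefinitely".
    static boolean shouldReadAgain(int repeat, long passesCompleted) {
        return repeat == 0 || passesCompleted < repeat;
    }

    // Replays the records according to repeat. maxTuples is a demo-only
    // safety cap so that repeat == 0 does not loop forever.
    static List<String> replay(List<String> records, int repeat, int maxTuples) {
        List<String> out = new ArrayList<>();
        if (records.isEmpty()) {
            return out; // nothing to replay
        }
        long passes = 0;
        while (shouldReadAgain(repeat, passes) && out.size() < maxTuples) {
            for (String record : records) {
                if (out.size() == maxTuples) {
                    break;
                }
                out.add(record);
            }
            passes++;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(replay(Arrays.asList("a", "b"), 2, 100)); // prints [a, b, a, b]
    }
}
```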

Note that the CSV file can be either imported into your project, or created and edited in StreamBase Studio. To create a new one, select File > New > File. In the New File dialog, specify the file's name and project. A new, empty file is opened in a text editor, where you can edit and save it.

The CSV File Reader allows you to specify a string that, when encountered in an incoming CSV field, will be translated into a null tuple field value. The default string is null, but you can specify any string in the NULL String property.

Properties

File Name

The name of the CSV file to read, without any path. The specified file must be in the current project folder, or in a referenced project's folder. You must enter a file name in this field, or enable the Start Control Port, or both. If Start Control Port is disabled, the file specified in this field is the only file to be read by the current adapter instance. If Start Control Port is enabled, a file specified in this field is the default file to be read, as described below.

Field Delimiter

The delimiter used to separate tokens. The default is a comma.

String Quote Character

The optional quote character used (in pairs) to delimit string constants. The default is the double-quote character.

Timestamp Format

The format string used to parse timestamp fields extracted from the input file. The default, and the recommended form, is a pattern in the syntax expected by the Java SimpleDateFormat class. For more information, see the com.streambase.sb.adapter.Adapter class description in the StreamBase Java API documentation.
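As an illustration of the SimpleDateFormat pattern syntax, the following Java sketch parses a timestamp field with a given pattern. The class name TimestampFormatExample, the sample pattern, and the fixed UTC time zone are assumptions made for a reproducible example. How the adapter itself handles time zones is not stated here.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Hypothetical example of the pattern syntax, not the adapter's code.
class TimestampFormatExample {
    // Parses text with the given SimpleDateFormat pattern and returns
    // milliseconds since the epoch.
    static long parseMillis(String pattern, String text) {
        // The constructor throws IllegalArgumentException for an invalid pattern.
        SimpleDateFormat format = new SimpleDateFormat(pattern);
        format.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: fixed zone for the demo
        format.setLenient(false); // reject out-of-range values such as month 13
        try {
            return format.parse(text).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Field does not match pattern: " + text, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseMillis("yyyy-MM-dd HH:mm:ss", "1970-01-01 00:00:01")); // prints 1000
    }
}
```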

Period

The time, in milliseconds, to wait between the processing of records.

Repeat

The number of times to iterate over the CSV file. 0 specifies iterating indefinitely.

Start Control Port

Enable this checkbox to give this adapter instance an input port that you can use to control which CSV files to read, and in which order. The input schema for the Start Control Port must have a single field, a string of size 10 or greater. The schema is typechecked as you define it.

If the File Name property is empty, the adapter begins reading when it receives a control tuple on this port. The path to the CSV file to be read is specified in the only field of the tuple. The path can be absolute, or relative to the working directory of the StreamBase Server process.

If the File Name property specifies a file name, there are two cases:

  1. If a control tuple received on this port has an empty or null string, the file specified in the File Name property is read or re-read.

  2. If a control tuple contains the path to a CSV file, then that specified file is read, as above, ignoring the File Name field.
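The file selection rules for the Start Control Port can be summarized in a short Java sketch. The names ControlPortSketch and fileToRead are hypothetical. An empty result stands for the situation in which neither the control tuple nor the File Name property names a file, which is the case that produces a warning.

```java
import java.util.Optional;

// Hypothetical sketch of the selection rules, not the adapter's code.
class ControlPortSketch {
    // Chooses the file to read when a control tuple arrives.
    //   fileNameProperty: value of the File Name property (may be empty)
    //   controlField:     the single string field of the control tuple
    static Optional<String> fileToRead(String fileNameProperty, String controlField) {
        if (controlField != null && !controlField.isEmpty()) {
            return Optional.of(controlField); // the tuple's path overrides File Name
        }
        if (fileNameProperty != null && !fileNameProperty.isEmpty()) {
            return Optional.of(fileNameProperty); // read or re-read the default file
        }
        return Optional.empty(); // nothing to read
    }

    public static void main(String[] args) {
        System.out.println(fileToRead("default.csv", null).get());        // prints default.csv
        System.out.println(fileToRead("default.csv", "other.csv").get()); // prints other.csv
        System.out.println(fileToRead("", null).isPresent());             // prints false
    }
}
```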

Start Event Port

Enable this checkbox to create an output port that emits an informational tuple each time a CSV file is opened or closed. The informational tuple schema has five fields: three strings, an int, and a string.

For a file open event, the event port tuple's first string is set to "Open", while the second string is set to the path name of the CSV file being opened.

For a file close event, the event port tuple's first string is set to "Close", the second string is set to the path name of the CSV file being closed, and the int is set to the number of rows that were read from the CSV file.

NULL String

The string which, if encountered in a CSV field when reading a file, is to be translated as a null tuple field value for the corresponding tuple field. If unspecified, the default string is null. You can designate any string to be considered the null value string.
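A minimal Java sketch of the NULL String translation follows. The names NullStringSketch and toFieldValue are hypothetical, and the sketch assumes an exact, case-sensitive comparison of the whole field. Whether the adapter compares case-sensitively is not stated here.

```java
// Hypothetical sketch of the NULL String rule, not the adapter's code.
class NullStringSketch {
    // Returns null when the CSV field equals the configured NULL String,
    // otherwise returns the field unchanged.
    static String toFieldValue(String csvField, String nullString) {
        return csvField.equals(nullString) ? null : csvField;
    }

    public static void main(String[] args) {
        System.out.println(toFieldValue("null", "null") == null); // prints true
        System.out.println(toFieldValue("N/A", "null"));          // prints N/A
        System.out.println(toFieldValue("N/A", "N/A") == null);   // prints true
    }
}
```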

Header Type

The type of header used in the CSV file. Choose one of the following:

No header

The CSV file contains no header and is to be parsed without a header.

Ignore header

The first line of the CSV file is to be considered the header. The first line is skipped and not read into the adapter as a tuple.

Read header

The first line of the CSV file is treated as the header and compared against the schema used in your StreamBase application. Header fields that do not match the schema are not parsed, and neither are the corresponding values in subsequent rows. Values that fall outside the range of the header are also not parsed. Field order does not matter, because the adapter reorders the CSV fields to fit the schema of the StreamBase application.
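The Read header behavior can be pictured as a mapping from schema fields to CSV columns, as in this Java sketch. The names HeaderMapper, columnForField, and toTuple are hypothetical, field names are assumed to match exactly, and schema fields missing from the header are assumed to be left null. The source does not state either of these details.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of header-to-schema mapping, not the adapter's code.
class HeaderMapper {
    // For each schema field, finds its column index in the header, or -1 if absent.
    static int[] columnForField(List<String> schemaFields, List<String> header) {
        int[] columns = new int[schemaFields.size()];
        for (int i = 0; i < columns.length; i++) {
            columns[i] = header.indexOf(schemaFields.get(i));
        }
        return columns;
    }

    // Builds a tuple in schema order. Columns that no schema field maps to,
    // including values beyond the header's width, are never consulted.
    static String[] toTuple(int[] columns, List<String> row) {
        String[] tuple = new String[columns.length];
        for (int i = 0; i < columns.length; i++) {
            int c = columns[i];
            if (c >= 0 && c < row.size()) {
                tuple[i] = row.get(c);
            }
        }
        return tuple;
    }

    public static void main(String[] args) {
        int[] columns = columnForField(Arrays.asList("symbol", "price"),
                                       Arrays.asList("price", "ignored", "symbol"));
        String[] tuple = toTuple(columns, Arrays.asList("10.5", "x", "IBM", "extra"));
        System.out.println(Arrays.toString(tuple)); // prints [IBM, 10.5]
    }
}
```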

All Truncation Warnings

If checked (default), all truncation events during the CSV file read result in warning messages written to stderr. In StreamBase Studio, these messages appear in the Console View. If unchecked, only the first truncation event per field is reported.

Typechecking and Error Handling

Typechecking fails if the schema does not have at least one parameter, if the Delimiter is not a single-character string, if the QuoteChar is longer than one character, or if the TimestampFormat is malformed. The File Name field fails to typecheck only if it is blank and you have not enabled the Start Control Port option.
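The typecheck conditions above can be collected into one validation routine, sketched here in Java. The names TypecheckSketch and check are hypothetical, the error messages are invented for illustration, and "at least one parameter" is interpreted as a schema field count of at least one. A malformed TimestampFormat is assumed to mean a pattern that SimpleDateFormat rejects.

```java
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the typecheck rules, not StreamBase's implementation.
class TypecheckSketch {
    // Returns one message per violated rule; an empty list means the
    // configuration passes these checks.
    static List<String> check(int schemaFieldCount, String delimiter, String quoteChar,
                              String timestampFormat, String fileName, boolean startControlPort) {
        List<String> errors = new ArrayList<>();
        if (schemaFieldCount < 1) {
            errors.add("schema must have at least one field");
        }
        if (delimiter == null || delimiter.length() != 1) {
            errors.add("Delimiter must be a single character");
        }
        if (quoteChar != null && quoteChar.length() > 1) {
            errors.add("QuoteChar must be at most one character");
        }
        try {
            new SimpleDateFormat(timestampFormat); // rejects invalid pattern letters
        } catch (IllegalArgumentException | NullPointerException e) {
            errors.add("TimestampFormat is malformed");
        }
        boolean blankFile = fileName == null || fileName.trim().isEmpty();
        if (blankFile && !startControlPort) {
            errors.add("File Name is blank and Start Control Port is disabled");
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println(check(2, ",", "\"", "yyyy-MM-dd", "in.csv", false).isEmpty()); // prints true
        System.out.println(check(0, ";;", "''", "bogus", "", false).size());              // prints 5
    }
}
```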

A warning is emitted if the File Name property is empty and a null control tuple is received on the Start Control Port.

Suspend/Resume Behavior

On suspend, the CSV File Reader finishes processing the current record, outputs the tuple, and then pauses. The input file remains open and the adapter retains its position in the file. The adapter stays paused until it is either shut down or resumed.

On resumption, the CSV File Reader continues processing with the next record in the input file.