The StreamBase Regular Expression File Reader Input Adapter allows StreamBase applications to read custom-formatted text input files, parsed with regular expressions.
The application specifies an input file, the regular expression used to parse lines of the input file, options for how to time and repeat tuples, how to deal with malformed records, and the target output schema. The input file must be a text file with newlines delimiting records. The adapter parses each line of the file using the provided Java regular expression. Each capture group of the regular expression must correspond to a field of the output schema (the first capture group corresponds to the first schema field and so forth). The fields extracted from the file are coerced to the correct data types according to the schema and tuples are emitted.
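The mapping from capture groups to schema fields can be illustrated with plain `java.util.regex`. The following is a minimal sketch, not the adapter's implementation: the class name `CaptureGroupDemo`, the `SYMBOL,PRICE,VOLUME` line layout, and the three-field schema (string, double, int) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureGroupDemo {
    // Hypothetical line layout: SYMBOL,PRICE,VOLUME (illustration only).
    // Three capture groups, one per field of the assumed output schema.
    static final Pattern FORMAT = Pattern.compile("^([A-Z]+),(\\d+\\.\\d+),(\\d+)$");

    // Returns the coerced field values in schema order,
    // or null if the line does not match the expression.
    static List<Object> parseLine(String line) {
        Matcher m = FORMAT.matcher(line);
        if (!m.matches()) {
            return null;
        }
        List<Object> fields = new ArrayList<>();
        fields.add(m.group(1));                      // group 1 -> string field
        fields.add(Double.parseDouble(m.group(2)));  // group 2 -> double field
        fields.add(Integer.parseInt(m.group(3)));    // group 3 -> int field
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("IBM,142.5,300"));
    }
}
```

Group numbering starts at 1 because group 0 always refers to the entire match.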
Because the input source of this adapter is finite and has no natural timing, this adapter allows the input file to be repeated and the inter-tuple timing to be specified.
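The interaction between repetition and timing can be sketched as a simple replay loop. This is an illustrative model only, assuming the semantics given for the Period and Repeat properties (a period of 0 means no delay, a repeat count of 0 means repeat indefinitely). The class `RepeatPeriodDemo` and the `maxEmits` cap are hypothetical; the cap exists only so that the indefinite case terminates in a demonstration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RepeatPeriodDemo {
    // Replays 'lines' for 'repeat' passes (0 = indefinitely), sleeping
    // 'periodMs' milliseconds after each emitted line (0 = no delay).
    // 'maxEmits' is a hypothetical safety cap, not an adapter property.
    static List<String> replay(List<String> lines, int repeat, long periodMs, int maxEmits) {
        List<String> emitted = new ArrayList<>();
        if (lines.isEmpty()) {
            return emitted;
        }
        for (int pass = 0; repeat == 0 || pass < repeat; pass++) {
            for (String line : lines) {
                if (emitted.size() >= maxEmits) {
                    return emitted;
                }
                emitted.add(line);
                if (periodMs > 0) {
                    try {
                        Thread.sleep(periodMs);
                    } catch (InterruptedException e) {
                        // Preserve the interrupt flag and stop replaying.
                        Thread.currentThread().interrupt();
                        return emitted;
                    }
                }
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(replay(Arrays.asList("a", "b"), 2, 0, 100));
    }
}
```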
The Regular Expression File Reader can read files compressed in the zip or gzip formats, automatically extracting the file to be read from the zip or gzip archive file. For this to work, the adapter requires the target file to have the extension .gz, and expects to find exactly one text file inside each compressed file. This feature allows the adapter to read market data files provided by a market data vendor in compressed format, without needing to uncompress the files in advance.
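Reading a compressed file line by line without extracting it first can be done with the standard `java.util.zip` stream classes. The sketch below shows one way this might look. It is not the adapter's code: the class `CompressedLineSource`, the choice to select the decompressor by file extension (including the `.zip` branch), and the UTF-8 charset are all assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;
import java.util.zip.ZipInputStream;

class CompressedLineSource {
    // Opens 'file' as a line reader, decompressing on the fly.
    // Assumption: the compression format is chosen by file extension.
    static BufferedReader open(Path file) throws IOException {
        String name = file.getFileName().toString().toLowerCase();
        InputStream in = Files.newInputStream(file);
        if (name.endsWith(".gz")) {
            in = new GZIPInputStream(in);
        } else if (name.endsWith(".zip")) {
            ZipInputStream zip = new ZipInputStream(in);
            // Position the stream at the first (and, by assumption, only) entry.
            if (zip.getNextEntry() == null) {
                zip.close();
                throw new IOException("zip archive contains no entries: " + file);
            }
            in = zip;
        }
        // Assumption: UTF-8 text; a real reader would use the configured charset.
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: CompressedLineSource <file>");
            return;
        }
        try (BufferedReader reader = open(Paths.get(args[0]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```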
|File Name||FileName||none||This control is a drop-down list showing eligible files in the current project. Use the drop-down selector to select the file to read and parse. The file is read one line at a time; each line is parsed using the Format property and yields at most one tuple.|
|Use Default Charset||UseDefaultCharset||Selected||If selected, the Java platform default character set is used. If cleared, a valid character set name must be specified for the Character Set property.|
|Character Set||Charset||None||The name of the character set encoding that the adapter is to use to read input or write output.|
The regular expression used to parse the input file. This must be a Java
regular expression as expected by the java.util.regex.Pattern class. For example,
|Period||Period||0||An integer specifying the rate, in milliseconds, at which to read lines from the specified file and emit tuples. Specify 0 or omit this property to emit tuples as quickly as possible.|
|Repeat||Repeat||1||An integer specifying the number of times to repeat the input file. If omitted or 1, this reads the input file once and then stops emitting tuples. If set to 0, this repeats the input file indefinitely.|
|Drop Mismatches||DropMismatches||checked (true)||If selected, records that do not match the regular expression in the Format field are ignored and the next record is immediately examined. Otherwise, a tuple with all fields set to null is emitted when a non-matching input line is encountered.|
|Timestamp Format||TimestampFormat||MM/dd/yyyy hh:mm:ss aa||
Specifies the format used to parse timestamp fields extracted from the input
file. Specify a string in the form expected by the
|Start Control Port||StartControlPort||Cleared||
Select this check box to give this adapter instance an input port that you can use to control which files to read, and in which order. The input schema for the Start Control Port must have a single field of type string. The schema is typechecked as you define it.
If the File Name property is empty, the adapter begins reading when it receives a control tuple on this port. The path to the file to be read is specified in the only field of the tuple. The path can be absolute, or relative to the working directory of the StreamBase Server process.
If the File Name property specifies a file name, there are two cases:
|Log Level||LogLevel||INFO||Controls the level of verbosity the adapter uses to issue informational traces to the console. This setting is independent of the containing application's overall log level. Available values, in increasing order of verbosity, are: OFF, ERROR, WARN, INFO, DEBUG, TRACE, and ALL.|
Use the Edit Schemas tab to specify the schema to output from the adapter.
Typechecking fails if the Format property contains an invalid regular expression, if the number of fields in the output schema does not match the number of capture subexpressions in the Format property, or if the Timestamp Format is malformed.
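The three typecheck conditions above can be approximated with standard Java classes. The following is a rough sketch of equivalent checks, not the adapter's actual typecheck logic. The class `FormatCheck` and its `check` method are hypothetical, and the use of `java.text.SimpleDateFormat` to validate the timestamp format is an assumption suggested by the form of the default Timestamp Format value.

```java
import java.text.SimpleDateFormat;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class FormatCheck {
    // Returns null when the settings look consistent,
    // otherwise a description of the first problem found.
    static String check(String format, int schemaFieldCount, String timestampFormat) {
        Pattern pattern;
        try {
            pattern = Pattern.compile(format);
        } catch (PatternSyntaxException e) {
            return "invalid regular expression: " + e.getDescription();
        }
        // groupCount() reports the number of capturing groups in the pattern.
        int groups = pattern.matcher("").groupCount();
        if (groups != schemaFieldCount) {
            return "schema has " + schemaFieldCount + " fields but the expression has "
                    + groups + " capture groups";
        }
        try {
            // Assumption: timestamp formats follow SimpleDateFormat conventions.
            new SimpleDateFormat(timestampFormat);
        } catch (IllegalArgumentException e) {
            return "malformed timestamp format: " + e.getMessage();
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(check("(\\w+),(\\d+)", 2, "MM/dd/yyyy hh:mm:ss aa"));
        System.out.println(check("(\\d+", 1, "MM/dd/yyyy hh:mm:ss aa"));
    }
}
```

Non-capturing groups such as `(?:...)` are not counted by `groupCount()`, so they do not need a corresponding schema field in this sketch.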
Malformed records (lines that do not match the Format regular expression) cause the adapter either to ignore the input line or to emit a tuple with all fields set to null, depending on the value of the Drop Mismatches property.
If a field extracted from the file cannot be coerced into the type specified for that field in the schema (for example, if "abc" is extracted where an int field is expected), that field is set to null in the output tuple. Likewise, if a capture group in the Format expression fails to match, but the overall regular expression does match, the corresponding field in the output tuple is set to null.
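The null-handling rules described above can be modeled in a few lines of Java. This is a hedged sketch of the described behavior rather than the adapter's code: the class `NullHandlingDemo`, the two-field schema (string name, int quantity), and the sample line format are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NullHandlingDemo {
    // Hypothetical format: a name, optionally followed by ",<quantity>".
    // The second capture group sits inside an optional non-capturing group.
    static final Pattern FORMAT = Pattern.compile("^(\\w+)(?:,(\\w+))?$");

    // Returns {name, quantity} for a line, or null when the line is dropped.
    static Object[] parse(String line, boolean dropMismatches) {
        Matcher m = FORMAT.matcher(line);
        if (!m.matches()) {
            // Non-matching line: skip it, or emit a tuple of all nulls.
            return dropMismatches ? null : new Object[] {null, null};
        }
        return new Object[] {m.group(1), toInt(m.group(2))};
    }

    // Coerces text to an int field; yields null when coercion is not possible.
    static Integer toInt(String text) {
        if (text == null) {
            return null; // the capture group did not participate in the match
        }
        try {
            return Integer.valueOf(text);
        } catch (NumberFormatException e) {
            return null; // for example "abc" where an int is expected
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("widget,12", true)));
        System.out.println(Arrays.toString(parse("widget,abc", true)));
        System.out.println(Arrays.toString(parse("widget", true)));
        System.out.println(Arrays.toString(parse("???", false)));
    }
}
```

In `java.util.regex`, `Matcher.group(n)` returns null when the overall expression matched but group `n` did not participate, which is what makes the second case distinguishable from a failed coercion.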
When the adapter is suspended, the input file remains open and the adapter retains its position in the file. On resume, the adapter continues consuming lines from the input file and emitting tuples.