This sample demonstrates StreamBase support for UTF-16 character data. UTF-16 is the 16-bit Unicode Transformation Format. It is a variable-length character encoding for Unicode, capable of encoding the entire Unicode set.
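To make "variable-length" concrete, here is a minimal Java sketch that is independent of the sample's code. It compares a Hebrew letter from the Basic Multilingual Plane, which occupies one 16-bit code unit, with a supplementary character, which occupies two. The class name Utf16Units and its method names are illustrative assumptions and are not part of StreamBase.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: shows why UTF-16 is a variable-length encoding.
final class Utf16Units {
    // Number of 16-bit code units in the string (what String.length() reports).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Size in bytes when encoded as UTF-16BE without a Byte Order Mark.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String alef = "\u05D0";        // HEBREW LETTER ALEF (Basic Multilingual Plane)
        String clef = "\uD834\uDD1E";  // U+1D11E MUSICAL SYMBOL G CLEF (supplementary)
        System.out.println("alef: " + codeUnits(alef) + " code unit(s), "
                + codePoints(alef) + " code point(s), " + utf16Bytes(alef) + " bytes");
        System.out.println("clef: " + codeUnits(clef) + " code unit(s), "
                + codePoints(clef) + " code point(s), " + utf16Bytes(clef) + " bytes");
    }
}
```

Hebrew letters all fall in the Basic Multilingual Plane, so in text like this sample's documents each letter is a single code unit.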
The sample includes several short documents that contain non-Latin characters. We chose Hebrew characters as an example. The StreamBase application includes custom functions that manipulate and search for the Hebrew characters.
To follow along with the sample, the high-level steps are:
-
Use a Java client to load sample Hebrew character keywords into a StreamBase application's Query Table.
-
Notice how UTF-16 custom Java functions are used in the application's expressions to manipulate the keywords.
-
Use the Java client to load a Hebrew document into the application.
-
Notice how UTF-16 functions are used to find matches between the Hebrew keywords and text in the Hebrew document.
This topic also contains a list of all the UTF-16 functions provided by StreamBase.
In StreamBase Studio, import this sample with the following steps:
-
From the top menu, click → .
-
Select this sample from the Extending StreamBase list.
-
Click OK.
StreamBase Studio creates a project for the sample.
-
On Windows:
C:\Program Files\StreamBase Systems\StreamBase.n.m\sample\utf-16
-
On UNIX:
/opt/streambase/sample/utf-16
When you load the sample into StreamBase Studio, Studio copies the
sample project's files to your Studio workspace. StreamBase Systems
recommends that you use the workspace copy of the sample, especially on UNIX, where
you may not have write access to /opt/streambase. In
the default installation, the path to this sample in your Studio workspace is:
UNIX: ~/streambase-studio-n.m-workspace/sample_utf-16
Windows XP: C:\Documents and Settings\username\My Documents\StreamBase Studio n.m Workspace\sample_utf-16
Windows Vista: C:\Users\username\Documents\StreamBase Studio n.m Workspace\sample_utf-16
The UTF-16 sample consists of:
-
Two StreamBase applications: utf16-search.sbapp and utf16-regexpsearch.sbapp.
-
Two documents that contain Hebrew text: doc_utf_16_big_endian.txt and doc_utf_16_little_endian.txt.
-
Two data files that contain Hebrew text keywords, which you will load into the StreamBase application's Query Table. The data files are: keywords_utf_16_big_endian.txt and keywords_utf_16_littleendian.txt. The instructions in this topic explain how you can load the keywords.
-
A StreamBase Java client, UTF16Client, that will load the keywords and (in a separate step) load the document into the StreamBase application.
-
A set of StreamBase custom Java functions implemented in the utf16 class, used by the StreamBase applications to manipulate the non-Latin keywords and then search for matches in a separately loaded non-Latin document. For example, one expression is:
calljava("com.streambase.utf16", "indexOf", text, keyword, start)
-
A JAR file that contains the Java classes for a client (UTF16Client) and the custom Java function (utf16).
-
The Java source code for the client and the custom utf16 functions, in streambase-install-dir/sample/utf-16/*.java.
-
The ant build.xml file, if you want to rebuild the class files from the provided sources and generate an updated utf16.jar.
You can and should open this sample's application files in StreamBase Studio to study how the applications are assembled. However, this sample is designed to be run in UNIX terminal windows or Windows command prompt windows. On Windows, be sure to use the StreamBase Command Prompt from the Start menu as described in the Test/Debug Guide, not a standard command prompt.
-
Open three terminal windows on UNIX, or three StreamBase Command Prompts on Windows. In each window, navigate to the directory where the sample is installed, or to your workspace copy of the sample, as described above.
-
In window 1, launch an instance of StreamBase Server on utf16-search.sbapp:
sbd -f sbd.sbconf utf16-search.sbapp
-
In window 2, enter the following command to dequeue tuples from the keywords_found output stream:
sbc dequeue keywords_found
Note: Initially, no output is displayed in the dequeue window. We must complete the next few steps before results are dequeued. The results will be messages confirming that matches were found in the StreamBase application between the Hebrew keywords and the Hebrew document's text.
-
In window 3, set the CLASSPATH, and double-check the other environment settings. Make sure the STREAMBASE_HOME environment variable is set to your StreamBase installation directory. These examples also assume a supported JDK is on your PATH.
Set the CLASSPATH to include this sample's JAR file as well as the standard StreamBase Client library.
On UNIX:
export CLASSPATH=$STREAMBASE_HOME/sample/utf-16/utf16.jar:$STREAMBASE_HOME/lib/sbclient.jar:$CLASSPATH
On Windows:
When you use a StreamBase Command Prompt, the CLASSPATH variable is pre-set to include the path to sbclient.jar for the current release. Thus, you only need to add utf16.jar to the CLASSPATH:
set CLASSPATH=%STREAMBASE_HOME%sample\utf-16\utf16.jar;%CLASSPATH%
-
In window 3, run the sample's Java client, which loads Unicode Hebrew keywords into the application's Query Table.
On UNIX and Windows:
To load UTF-16BE (big-endian) keywords into the running StreamBase application:
java com.streambase.UTF16Client sb://localhost:10000 bigend_keywords keywords_utf_16_big_endian.txt big
To load UTF-16LE (little-endian) keywords into the running StreamBase application:
java com.streambase.UTF16Client sb://localhost:10000 littleend_keywords keywords_utf_16_littleendian.txt little
These commands display the length of the keywords being loaded into the running StreamBase application. For example:
Line len: 22
Line len: 16
Line len: 12
Line len: 10
Line len: 14
Line len: 6
Line len: 14
Line len: 16
A total of eight keywords were loaded.
-
Now that the Hebrew keywords have been loaded, you can run the same Java client with different command line parameters, this time to load a Hebrew document.
On UNIX and Windows:
To load the UTF-16BE (big-endian) document into the StreamBase application:
java com.streambase.UTF16Client sb://localhost:10000 bigend_text doc_utf_16_big_endian.txt big
To load the UTF-16LE (little-endian) document into the StreamBase application:
java com.streambase.UTF16Client sb://localhost:10000 littleend_text doc_utf_16_little_endian.txt little
In the running StreamBase application, a custom Java function (utf16) is used to find matches between the Hebrew keywords and the just-loaded Hebrew document.
-
Now look at window 2, the one running sbc dequeue. Look for evidence that the Hebrew characters were recognized properly, by virtue of the matches between the Hebrew keywords and the just-loaded Hebrew document. For example:
sbc dequeue keywords_found
1,5
2,7
1,1
2,6
1,1
1,4
1,2
1,1
5,1
1,3
In the dequeue results, you are seeing the count,id fields that comprise each tuple on the keywords_found Output Stream.
-
In window 3, type the following command to terminate the server and dequeuer:
sbadmin shutdown
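The client commands in the steps above end with a big or little argument that names the byte order of the input file. The sketch below suggests how a Java program could map such a flag to a charset and decode the text line by line. It is not the source of the sample's UTF16Client: the class EndianAwareReader and its methods are hypothetical, and the lengths it prints are counts of 16-bit code units, which may not be the quantity the sample reports as Line len.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; NOT the sample's UTF16Client.
final class EndianAwareReader {
    // Maps an endianness flag ("big" or "little") to the matching Java charset.
    static Charset charsetFor(String endian) {
        switch (endian) {
            case "big":
                return StandardCharsets.UTF_16BE;
            case "little":
                return StandardCharsets.UTF_16LE;
            default:
                throw new IllegalArgumentException("expected \"big\" or \"little\": " + endian);
        }
    }

    // Decodes the bytes with the given byte order and returns the non-empty lines.
    static List<String> decodeLines(byte[] data, String endian) {
        String text = new String(data, charsetFor(endian));
        // UTF_16BE and UTF_16LE keep a leading BOM as U+FEFF, so drop it explicitly.
        if (text.startsWith("\uFEFF")) {
            text = text.substring(1);
        }
        List<String> lines = new ArrayList<>();
        for (String line : text.split("\\R")) {
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a keywords file; a real program could read
        // the bytes with java.nio.file.Files.readAllBytes instead.
        byte[] data = "\uFEFF\u05E9\u05DC\u05D5\u05DD\n\u05D0\u05D1\n"
                .getBytes(StandardCharsets.UTF_16LE);
        for (String line : decodeLines(data, "little")) {
            System.out.println("code units: " + line.length());
        }
    }
}
```

Naming the byte order explicitly, rather than relying on a Byte Order Mark, means the decoder gives the same result whether or not the file starts with one.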
This section describes the UTF-16 functions available from StreamBase. The general format is:
calljava("com.streambase.utf16", "function-name", [arg0] [,...])
Here are the functions:
- append(str1, str2)
-
Returns a new string that is str1 appended to str2.
- endianSwap(str)
-
Converts from big-endian to little-endian or vice versa.
- indexOf(haystack, needle)
-
Returns the first position (0-indexed) at which the string needle occurs within haystack. This is written for efficiency, so the two strings must have the same endianness. If they do not have the same Unicode form, a Unicode normalizer should be used.
- indexOf(haystack, needle, start)
-
Returns the first position (0-indexed) after the start index at which the string needle occurs within haystack. This is written for efficiency, so the two strings must have the same endianness. If they do not have the same Unicode form, a Unicode normalizer should be used.
- lastIndexOf(haystack, needle)
-
Returns the last position (0-indexed) at which the string needle occurs within haystack. This is written for efficiency, so the two strings must have the same endianness. If they do not have the same Unicode form, a Unicode normalizer should be used.
- lastIndexOf(haystack, needle, lastStart)
-
Returns the last position (0-indexed) before the lastStart index at which the string needle occurs within haystack. This is written for efficiency, so the two strings must have the same endianness. If they do not have the same Unicode form, a Unicode normalizer should be used.
- stripBOM(str)
-
Removes the Byte Order Mark from the beginning of str, if present. Returns a string that has no BOM. The string may be the original string.
- strlen(str)
-
Returns the number of characters in str. This method assumes there are no supplementary characters in str. Use strlenBigEndian or strlenLittleEndian if there may be supplementary characters. This method is considerably faster.
- strlenBigEndian(str), strlenLittleEndian(str)
-
Returns the number of characters in str. If there are no supplementary characters in str, strlen is considerably faster.
- substr(str, start, len)
-
Returns the portion of str that starts at index start (0-indexed) and contains len characters.
- regexp(haystack, needle, start, charset), regexpBigEndian(haystack, needle, start), regexpLittleEndian(haystack, needle, start)
-
Decodes the strings into UTF-16 and returns the position of the first match of the regular expression (needle) in the text (haystack) after the start index.
While the string is decoded into UTF-16, it is possible that composed characters will not match their equivalent code points. To prevent this, the text may be normalized before using the regexp* functions.
Note
A freely-available Unicode normalizer can be found at http://icu-project.org.
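As a rough illustration of the normalization step, the sketch below uses java.text.Normalizer, which ships with the JDK, in place of ICU. It shows a precomposed Hebrew letter and its equivalent two-code-point sequence failing a raw comparison and then matching once both are put into the same normalization form. The class NormalizeBeforeSearch and its methods are hypothetical names, and the choice of NFD is an assumption made for the example; the sample does not prescribe a form.

```java
import java.text.Normalizer;

// Illustrative only: canonical normalization applied before a search.
final class NormalizeBeforeSearch {
    // Converts text to NFD so canonically equivalent sequences compare equal.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // indexOf over normalized copies of both strings. The returned index
    // refers to the normalized haystack, not to the original one.
    static int normalizedIndexOf(String haystack, String needle) {
        return nfd(haystack).indexOf(nfd(needle));
    }

    public static void main(String[] args) {
        String precomposed = "\uFB31";     // HEBREW LETTER BET WITH DAGESH, one code point
        String sequence = "\u05D1\u05BC";  // HEBREW LETTER BET + HEBREW POINT DAGESH
        System.out.println("raw equal: " + precomposed.equals(sequence));
        System.out.println("normalized equal: " + nfd(precomposed).equals(nfd(sequence)));
        System.out.println("index: " + normalizedIndexOf("\u05D0" + precomposed, sequence));
    }
}
```

Normalizing both the keywords and the document with the same form before loading them would keep any index returned by a later search consistent with the stored text.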
To learn more about how the utf16 custom Java
functions are used in the StreamBase applications:
-
See the custom function source code, installed in streambase-install-dir/sample/utf-16/com/streambase/utf16.java
-
In StreamBase Studio, see the Properties view for the components in the utf16-search.sbapp and utf16-regexpsearch.sbapp application diagrams. Look for the expressions in the operators, such as:
strresize(calljava("com.streambase.utf16", "stripBOM", text), 2048)
In the expression above, the function is used to remove the Byte Order Mark (BOM) value from each tuple's text string. This is an example of how we can use the provided utf16 function to properly manipulate portions of the Hebrew strings. In this example, we can remove the BOM because the big-endian or little-endian encoding of the characters is already known, because of the command-line parameters we used in the Java examples.
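To give a feel for what byte-level functions such as stripBOM, endianSwap, and indexOf have to do, here is a hedged sketch that operates on raw UTF-16 byte arrays. It is not the sample's com.streambase.utf16 source: the class Utf16Bytes is hypothetical, the byte[] representation is an assumption, and this indexOf treats start as inclusive, which may differ from the provided function.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical byte-level sketches; NOT the sample's com.streambase.utf16 source.
final class Utf16Bytes {
    // Removes a leading BOM (FE FF or FF FE) if present; otherwise returns the input unchanged.
    static byte[] stripBOM(byte[] s) {
        if (s.length >= 2) {
            int b0 = s[0] & 0xFF;
            int b1 = s[1] & 0xFF;
            if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE)) {
                return Arrays.copyOfRange(s, 2, s.length);
            }
        }
        return s;
    }

    // Swaps the two bytes of every 16-bit code unit (big-endian <-> little-endian).
    static byte[] endianSwap(byte[] s) {
        byte[] out = Arrays.copyOf(s, s.length);
        for (int i = 0; i + 1 < out.length; i += 2) {
            byte t = out[i];
            out[i] = out[i + 1];
            out[i + 1] = t;
        }
        return out;
    }

    // Index, in 16-bit code units, of the first occurrence of needle at or after
    // start, or -1 if there is none. It compares raw bytes on code-unit
    // boundaries, so both arrays must use the same byte order.
    static int indexOf(byte[] haystack, byte[] needle, int start) {
        for (int i = Math.max(start, 0) * 2; i + needle.length <= haystack.length; i += 2) {
            int j = 0;
            while (j < needle.length && haystack[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return i / 2;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Java's UTF_16 encoder writes a big-endian BOM followed by big-endian text.
        byte[] doc = "\u05E9\u05DC\u05D5\u05DD \u05E2\u05D5\u05DC\u05DD".getBytes(StandardCharsets.UTF_16);
        byte[] body = stripBOM(doc);
        byte[] keywordLE = "\u05E2\u05D5".getBytes(StandardCharsets.UTF_16LE);
        System.out.println("mismatched byte order: " + indexOf(body, keywordLE, 0));
        System.out.println("after endianSwap: " + indexOf(body, endianSwap(keywordLE), 0));
    }
}
```

The raw comparison is what makes this kind of search cheap, and it is also why a keyword in the wrong byte order is never found until it is swapped.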
The examples in this topic were from the utf16-search.sbapp application, which does
exact string matches. The utf16-regexpsearch.sbapp
application is similar, except that it uses regular expressions (for example, using a
wildcard * character in a string search).
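A regular-expression search of this kind can be sketched with java.util.regex once the UTF-16 bytes are decoded. The following is an assumption-laden illustration, not the code behind the sample's regexp functions: the class Utf16RegexSearch and the method firstMatch are hypothetical, and start is treated as inclusive and measured in 16-bit code units.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; NOT the sample's regexp implementation.
final class Utf16RegexSearch {
    // Decodes both byte arrays with the given charset, then returns the index
    // (in 16-bit code units) of the first regex match at or after start, or -1.
    static int firstMatch(byte[] haystack, byte[] needle, int start, Charset charset) {
        String text = new String(haystack, charset);
        String regex = new String(needle, charset);
        if (start < 0 || start > text.length()) {
            return -1;  // Matcher.find(int) would throw for an out-of-range start
        }
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find(start) ? m.start() : -1;
    }

    public static void main(String[] args) {
        Charset be = StandardCharsets.UTF_16BE;
        byte[] text = "\u05E9\u05DC\u05D5\u05DD \u05E2\u05D5\u05DC\u05DD".getBytes(be);
        // Ayin, any single character, then lamed.
        byte[] regex = "\u05E2.\u05DC".getBytes(be);
        System.out.println("first match at: " + firstMatch(text, regex, 0, be));
    }
}
```

Because the pattern is matched against decoded text rather than raw bytes, byte order only matters at the decoding step, but the normalization caveat described for the regexp* functions still applies.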
On the StreamBase Developer Zone website, see this article: Handling 16-bit Character Streams.
