UTF-16 Sample

This sample demonstrates StreamBase support for UTF-16 character data. UTF-16 is the 16-bit Unicode Transformation Format. It is a variable-length character encoding for Unicode, capable of encoding the entire Unicode set.

The sample includes several short documents that contain non-Latin characters. We chose Hebrew characters, as an example. The StreamBase application includes custom functions that manipulate and search for the Hebrew characters.
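
To make the variable-length point concrete, here is a minimal Java sketch that uses only the JDK. It is not part of the sample, and the class name Utf16WidthSketch is made up for illustration. It shows that a Hebrew letter fits in a single 16-bit code unit, while a character outside the Basic Multilingual Plane needs two code units (a surrogate pair).

```java
import java.nio.charset.StandardCharsets;

class Utf16WidthSketch {
    /** Number of bytes the text occupies when encoded as UTF-16BE without a BOM. */
    static int utf16ByteLength(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // U+05D0 HEBREW LETTER ALEF lies in the Basic Multilingual Plane.
        String alef = "\u05D0";
        // U+1D11E MUSICAL SYMBOL G CLEF is a supplementary character.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(utf16ByteLength(alef)); // prints 2
        System.out.println(utf16ByteLength(clef)); // prints 4
    }
}
```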

To follow along with the sample, the high-level steps are:

  • Use a Java client to load sample Hebrew character keywords into a StreamBase application's Query Table.

  • Notice how UTF-16 custom Java functions are used in the application's expressions to manipulate the keywords.

  • Use the Java client to load a Hebrew document into the application.

  • Notice how UTF-16 functions are used to find matches between the Hebrew keywords and text in the Hebrew document.

This topic also contains a list of all the UTF-16 functions provided by StreamBase.

Importing This Sample into StreamBase Studio

In StreamBase Studio, import this sample with the following steps:

  • From the top menu, click File → Load StreamBase Sample.

  • Select this sample from the Extending StreamBase list.

  • Click OK.

StreamBase Studio creates a project for the sample.

Sample Location

  • On Windows: C:\Program Files\StreamBase Systems\StreamBase.n.m\sample\utf-16

  • On UNIX: /opt/streambase/sample/utf-16

When you load the sample into StreamBase Studio, Studio copies the sample project's files to your Studio workspace. StreamBase Systems recommends that you use the workspace copy of the sample, especially on UNIX, where you may not have write access to /opt/streambase. In the default installation, the path to this sample in your Studio workspace is:

UNIX:       
  ~/streambase-studio-n.m-workspace/sample_utf-16
Windows XP:
  C:\Documents and Settings\username\My Documents\StreamBase Studio n.m Workspace\
      sample_utf-16
Windows Vista:
  C:\Users\username\Documents\StreamBase Studio n.m Workspace\
      sample_utf-16

This Sample's Files

The UTF-16 sample consists of:

  • Two StreamBase applications: utf16-search.sbapp and utf16-regexpsearch.sbapp.

  • Two documents that contain Hebrew text: doc_utf_16_big_endian.txt and doc_utf_16_little_endian.txt.

  • Two data files that contain Hebrew text keywords, which you will load into the StreamBase application's Query Table. The data files are: keywords_utf_16_big_endian.txt and keywords_utf_16_littleendian.txt. The instructions in this topic explain how you can load the keywords.

  • A StreamBase Java client, UTF16Client, that will load the keywords and (in a separate step) load the document into the StreamBase application.

  • A set of StreamBase custom Java functions implemented in the utf16 class, used by the StreamBase applications to manipulate the non-Latin keywords and then search for matches in a separately loaded non-Latin document. For example, one expression is:

    calljava("com.streambase.utf16", "indexOf", text, keyword, start)
    
  • A JAR file, utf16.jar, that contains the Java classes for the client (UTF16Client) and the custom Java functions (utf16).

  • The Java source code for the client and the custom utf16 functions, in streambase-install-dir/sample/utf-16/*.java.

  • The ant build.xml file, if you want to rebuild the class files from the provided sources and generate an updated utf16.jar.

Running the Sample

You can and should open this sample's application files in StreamBase Studio to study how the applications are assembled. However, this sample is designed to be run in UNIX terminal windows or Windows command prompt windows. On Windows, be sure to use the StreamBase Command Prompt from the Start menu as described in the Test/Debug Guide, not a standard command prompt.

  1. Open three terminal windows on UNIX, or three StreamBase Command Prompts on Windows. In each window, navigate to the directory where the sample is installed, or to your workspace copy of the sample, as described above.

  2. In window 1, launch an instance of StreamBase Server on utf16-search.sbapp:

    sbd -f sbd.sbconf utf16-search.sbapp
    
  3. In window 2, enter the following command to dequeue tuples from the keywords_found output stream:

    sbc dequeue keywords_found
    

    Note: Initially, no output is displayed in the dequeue window. We must complete the next few steps before results are dequeued. The results will be messages confirming that matches were found in the StreamBase application between the Hebrew keywords and the Hebrew document's text.

  4. In window 3, set the CLASSPATH, and double-check the other environment settings. Make sure the STREAMBASE_HOME environment variable is set to your StreamBase installation directory. These examples also assume a supported JDK is on your PATH.

    Set the CLASSPATH to include this sample's JAR file as well as the standard StreamBase Client library.

    On UNIX:

    export CLASSPATH=$STREAMBASE_HOME/sample/utf-16/utf16.jar:$STREAMBASE_HOME/lib/sbclient.jar:$CLASSPATH
    

    On Windows:

    When you use a StreamBase Command Prompt, the CLASSPATH variable is pre-set to include the path to sbclient.jar for the current release. Thus, you only need to add utf16.jar to the CLASSPATH:

    set CLASSPATH=%STREAMBASE_HOME%\sample\utf-16\utf16.jar;%CLASSPATH%
    
  5. In window 3, run the sample's Java client, which loads Unicode Hebrew keywords into the application's Query Table.

    On UNIX and Windows:

    To load UTF-16BE (big-endian) keywords into the running StreamBase application:

    java com.streambase.UTF16Client sb://localhost:10000 bigend_keywords keywords_utf_16_big_endian.txt big
    

    To load UTF-16LE (little-endian) keywords into the running StreamBase application:

    java com.streambase.UTF16Client sb://localhost:10000 littleend_keywords keywords_utf_16_littleendian.txt little
    

    These commands display the length of the keywords being loaded into the running StreamBase application. For example:

    Line len: 22
    Line len: 16
    Line len: 12
    Line len: 10
    Line len: 14
    Line len: 6
    Line len: 14
    Line len: 16
    

    A total of eight keywords were loaded.

  6. Now that the Hebrew keywords have been loaded, you can run the same Java client with different command line parameters, this time to load a Hebrew document.

    On UNIX and Windows:

    To load the UTF-16BE (big-endian) document into the StreamBase application:

    java com.streambase.UTF16Client sb://localhost:10000 bigend_text doc_utf_16_big_endian.txt big
    

    To load the UTF-16LE (little-endian) document into the StreamBase application:

    java com.streambase.UTF16Client sb://localhost:10000 littleend_text doc_utf_16_little_endian.txt little
    

    In the running StreamBase application, custom Java functions in the utf16 class are used to find matches between the Hebrew keywords and the just-loaded Hebrew document.

  7. Now look at window 2, the one running sbc dequeue. Matches between the Hebrew keywords and the just-loaded Hebrew document are evidence that the Hebrew characters were recognized properly. For example:

    sbc dequeue keywords_found
    1,5
    2,7
    1,1
    2,6
    1,1
    1,4
    1,2
    1,1
    5,1
    1,3
    

    Each line of the dequeue results shows the count,id fields that make up one tuple on the keywords_found Output Stream.

  8. In window 3, type the following command to terminate the server and dequeuer:

    sbadmin shutdown

UTF-16 Functions

This section describes the UTF-16 functions available from StreamBase. The general format is:

calljava("com.streambase.utf16", "function-name", [arg0] [,...])

Here are the functions:

append(str1, str2)

Returns a new string that is str1 appended to str2.

endianSwap(str)

Converts from big-endian to little-endian or vice-versa.
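
As a rough illustration of what an endian swap involves, the following Java sketch exchanges the two bytes of every 16-bit code unit. It is not the sample's actual source: representing the string as a byte array, the class name EndianSwapSketch, and rejecting odd-length input are all assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;

class EndianSwapSketch {
    /** Returns a copy of utf16 with the two bytes of each 16-bit code unit exchanged. */
    static byte[] endianSwap(byte[] utf16) {
        if (utf16.length % 2 != 0) {
            // Assumption for this sketch: reject data that is not whole code units.
            throw new IllegalArgumentException("UTF-16 data must have an even byte count");
        }
        byte[] swapped = new byte[utf16.length];
        for (int i = 0; i < utf16.length; i += 2) {
            swapped[i] = utf16[i + 1];
            swapped[i + 1] = utf16[i];
        }
        return swapped;
    }

    public static void main(String[] args) {
        // U+05E9 U+05DC U+05D5 U+05DD, a four-letter Hebrew word.
        String word = "\u05E9\u05DC\u05D5\u05DD";
        byte[] bigEndian = word.getBytes(StandardCharsets.UTF_16BE);
        byte[] littleEndian = endianSwap(bigEndian);
        // Decoding the swapped bytes as little-endian recovers the original word.
        System.out.println(new String(littleEndian, StandardCharsets.UTF_16LE).equals(word)); // prints true
    }
}
```

Because the swap is applied per code unit, applying it twice returns the original bytes, and surrogate pairs keep their order.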

indexOf(haystack, needle)

Returns the first position (0-indexed) at which the string needle occurs within haystack. The function is written for efficiency, so the two strings must have the same endianness. If they are not in the same Unicode normalization form, run a Unicode normalizer on them first.

indexOf(haystack, needle, start)

Returns the first position (0-indexed) after the start index at which the string needle occurs within haystack. The function is written for efficiency, so the two strings must have the same endianness. If they are not in the same Unicode normalization form, run a Unicode normalizer on them first.

lastIndexOf(haystack, needle)

Returns the last position (0-indexed) at which the string needle occurs within haystack. The function is written for efficiency, so the two strings must have the same endianness. If they are not in the same Unicode normalization form, run a Unicode normalizer on them first.

lastIndexOf(haystack, needle, lastStart)

Returns the last position (0-indexed) before the lastStart index at which the string needle occurs within haystack. The function is written for efficiency, so the two strings must have the same endianness. If they are not in the same Unicode normalization form, run a Unicode normalizer on them first.
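
The same-endianness requirement is easier to see in code. The Java sketch below imitates the indexOf(haystack, needle, start) form with a plain byte comparison that advances one 16-bit code unit at a time. It is an illustration, not the sample's implementation: the byte-array representation, the class name, returning -1 when nothing matches, and treating start as inclusive are all assumptions.

```java
import java.nio.charset.StandardCharsets;

class Utf16IndexOfSketch {
    /**
     * Returns the index, counted in 16-bit code units, of the first occurrence of
     * needle in haystack at or after start, or -1 if there is none.
     */
    static int indexOf(byte[] haystack, byte[] needle, int start) {
        int from = Math.max(start, 0) * 2;
        // Step by two bytes so a match can never straddle two code units.
        for (int i = from; i + needle.length <= haystack.length; i += 2) {
            if (matchesAt(haystack, needle, i)) {
                return i / 2;
            }
        }
        return -1;
    }

    private static boolean matchesAt(byte[] haystack, byte[] needle, int offset) {
        for (int j = 0; j < needle.length; j++) {
            if (haystack[offset + j] != needle[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Two four-letter Hebrew words separated by a space.
        String text = "\u05E9\u05DC\u05D5\u05DD \u05E2\u05D5\u05DC\u05DD";
        String keyword = "\u05E2\u05D5\u05DC\u05DD";
        byte[] haystack = text.getBytes(StandardCharsets.UTF_16BE);

        // Same byte order: the keyword is found at code unit 5.
        System.out.println(indexOf(haystack, keyword.getBytes(StandardCharsets.UTF_16BE), 0)); // prints 5
        // Mixed byte order: the raw bytes differ, so nothing matches.
        System.out.println(indexOf(haystack, keyword.getBytes(StandardCharsets.UTF_16LE), 0)); // prints -1
    }
}
```

A comparison like this never decodes the text, which is why it is fast and also why it cannot see that two differently ordered or differently normalized byte sequences represent the same characters.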

stripBOM(str)

Removes the Byte Order Mark from the beginning of str, if present. Returns a string that has no BOM. The string may be the original string.
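
For reference, the UTF-16 Byte Order Mark is the code point U+FEFF, which appears as the bytes FE FF in big-endian data and FF FE in little-endian data. The following Java sketch shows one way such a check could look. It is a hypothetical stand-in rather than the sample's code, and the byte-array representation and class name are assumptions.

```java
import java.util.Arrays;

class StripBomSketch {
    /** Returns utf16 without a leading BOM, or the same array if no BOM is present. */
    static byte[] stripBOM(byte[] utf16) {
        if (utf16.length >= 2) {
            int first = utf16[0] & 0xFF;
            int second = utf16[1] & 0xFF;
            boolean bigEndianBom = first == 0xFE && second == 0xFF;
            boolean littleEndianBom = first == 0xFF && second == 0xFE;
            if (bigEndianBom || littleEndianBom) {
                return Arrays.copyOfRange(utf16, 2, utf16.length);
            }
        }
        // No BOM: hand back the original array unchanged.
        return utf16;
    }

    public static void main(String[] args) {
        // Big-endian BOM followed by U+05D0 HEBREW LETTER ALEF.
        byte[] withBom = {(byte) 0xFE, (byte) 0xFF, 0x05, (byte) 0xD0};
        byte[] withoutBom = {0x05, (byte) 0xD0};

        System.out.println(stripBOM(withBom).length);            // prints 2
        System.out.println(stripBOM(withoutBom) == withoutBom);  // prints true
    }
}
```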

strlen(str)

Returns the number of characters in str. This function assumes there are no supplementary characters in str. Use strlenBigEndian or strlenLittleEndian if there may be supplementary characters; when there are none, strlen is considerably faster than those functions.

strlenBigEndian(str), strlenLittleEndian(str)

Returns the number of characters in str. If there are no supplementary characters in str, strlen is considerably faster.
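
The speed difference follows from how supplementary characters are stored: each one occupies two 16-bit code units (a surrogate pair), so an exact count has to inspect every code unit, while a count that ignores them is just the byte length divided by two. The Java sketch below illustrates both approaches under stated assumptions: the data is a byte array without a BOM, and the class and method bodies are written for this example rather than taken from the sample.

```java
import java.nio.charset.StandardCharsets;

class Utf16StrlenSketch {
    /** Fast count: every 16-bit code unit is assumed to be one character. */
    static int strlen(byte[] utf16) {
        return utf16.length / 2;
    }

    /** Exact count for big-endian data: a surrogate pair counts as one character. */
    static int strlenBigEndian(byte[] utf16) {
        int count = 0;
        for (int i = 0; i + 1 < utf16.length; i += 2) {
            int highByte = utf16[i] & 0xFF;
            // Low surrogates are U+DC00..U+DFFF; skip them so each pair counts once.
            boolean lowSurrogate = highByte >= 0xDC && highByte <= 0xDF;
            if (!lowSurrogate) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // U+05D0 HEBREW LETTER ALEF followed by U+1D11E MUSICAL SYMBOL G CLEF.
        String text = "\u05D0" + new String(Character.toChars(0x1D11E));
        byte[] bigEndian = text.getBytes(StandardCharsets.UTF_16BE);

        System.out.println(strlen(bigEndian));          // prints 3
        System.out.println(strlenBigEndian(bigEndian)); // prints 2
    }
}
```

A little-endian variant would read the high-order byte from utf16[i + 1] instead of utf16[i].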

substr(str, start, len)

Returns the portion of str that starts at index start (0-indexed) and contains len characters.
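
A character-indexed substring over UTF-16 bytes amounts to doubling the indexes, as in this hedged Java sketch. It assumes a byte-array representation with no supplementary characters, and it clamps out-of-range arguments, which is a choice made for this example and not documented behavior of the sample's substr.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf16SubstrSketch {
    /** Returns len 16-bit code units of utf16, beginning at code unit index start. */
    static byte[] substr(byte[] utf16, int start, int len) {
        // Each character index corresponds to two bytes; clamp to the available data.
        int from = Math.min(Math.max(start, 0) * 2, utf16.length);
        int to = Math.min(from + Math.max(len, 0) * 2, utf16.length);
        return Arrays.copyOfRange(utf16, from, to);
    }

    public static void main(String[] args) {
        // Two four-letter Hebrew words separated by a space.
        String text = "\u05E9\u05DC\u05D5\u05DD \u05E2\u05D5\u05DC\u05DD";
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16LE);

        byte[] secondWord = substr(bytes, 5, 4);
        System.out.println(new String(secondWord, StandardCharsets.UTF_16LE).equals("\u05E2\u05D5\u05DC\u05DD")); // prints true
    }
}
```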

regexp(haystack, needle, start, charset), regexpBigEndian(haystack, needle, start), regexpLittleEndian(haystack, needle, start)

Decodes the strings into UTF-16 and returns the position of the first match of the regular expression (needle) in the text (haystack) after the start index.
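
The decode-then-match approach can be approximated with the JDK alone, as in the Java sketch below. This is not the sample's source: the byte-array parameters, the class name, returning -1 when there is no match, and treating start as inclusive are assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Utf16RegexpSketch {
    /**
     * Decodes both arguments with the named charset and returns the index of the
     * first regular expression match at or after start, or -1 if there is none.
     */
    static int regexp(byte[] haystack, byte[] needle, int start, String charset) {
        Charset cs = Charset.forName(charset);
        String text = new String(haystack, cs);
        Pattern pattern = Pattern.compile(new String(needle, cs));
        if (start < 0 || start > text.length()) {
            return -1; // Matcher.find(int) would throw for an out-of-range start
        }
        Matcher matcher = pattern.matcher(text);
        return matcher.find(start) ? matcher.start() : -1;
    }

    public static void main(String[] args) {
        // Two four-letter Hebrew words separated by a space.
        String text = "\u05E9\u05DC\u05D5\u05DD \u05E2\u05D5\u05DC\u05DD";
        // U+05E2, any one character, then U+05DC.
        String regex = "\u05E2.\u05DC";

        byte[] haystack = text.getBytes(StandardCharsets.UTF_16LE);
        byte[] needle = regex.getBytes(StandardCharsets.UTF_16LE);
        System.out.println(regexp(haystack, needle, 0, "UTF-16LE")); // prints 5
    }
}
```

Unlike the byte-level indexOf functions, a search like this pays the cost of decoding both strings before matching.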

Even after the strings are decoded into UTF-16, composed characters might not match their equivalent decomposed sequences of code points. To prevent this, normalize the text before using the regexp* functions.

Note

A freely-available Unicode normalizer can be found at http://icu-project.org.
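
The mismatch between composed and decomposed forms can be reproduced with the JDK's built-in java.text.Normalizer, which is used here in place of ICU only to keep the example self-contained. The class name NormalizeSketch and the choice of NFC as the target form are assumptions for this illustration.

```java
import java.text.Normalizer;

class NormalizeSketch {
    /** Converts text to Unicode Normalization Form C (canonical composition). */
    static String toNfc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String composed = "\u00E9";    // U+00E9, e with acute as a single code point
        String decomposed = "e\u0301"; // e followed by U+0301 COMBINING ACUTE ACCENT

        // The two strings look identical when rendered but differ code point by code point.
        System.out.println(composed.equals(decomposed));               // prints false
        // After normalizing both to the same form, they compare equal.
        System.out.println(toNfc(composed).equals(toNfc(decomposed))); // prints true
    }
}
```

Normalizing both the keywords and the document to the same form before loading them lets the search functions compare like with like.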

Next Steps

To learn more about how the utf16 custom Java functions are used in the StreamBase applications:

  • See the custom function source code, installed in streambase-install-dir/sample/utf-16/com/streambase/utf16.java

  • In StreamBase Studio, see the Properties view for the components in the utf16-search.sbapp and utf16-regexpsearch.sbapp application diagrams. Look for the expressions in the operators, such as:

    strresize(calljava("com.streambase.utf16", "stripBOM", text), 2048)
    

    In the expression above, the stripBOM function removes the Byte Order Mark (BOM) from each tuple's text string. This shows how the provided utf16 functions can manipulate portions of the Hebrew strings. The BOM can be removed here because the byte order of the characters is already known from the big or little command-line parameter passed to the Java client.

The examples in this topic come from the utf16-search.sbapp application, which performs exact string matching. The utf16-regexpsearch.sbapp application is similar, except that it uses regular expressions (for example, using a wildcard * character in a string search).

Related Topic

On the StreamBase Developer Zone website, see this article: Handling 16-bit Character Streams.
