Developers: Handling 16-bit Character Streams

Handling 16-bit Character Streams

Authors: Hayden Schultz, Dr. John Lifter
Contributor: John Smart
StreamBase Systems
2-April-2007

Applicable To: StreamBase 3.7.1, 5.0

 

Introduction

StreamBase encodes character strings using the 8-bit clean paradigm, which means that all 8 bits in a byte are used to store the character and that there are no assumptions regarding the interpretation and/or printability of the character. Consequently, a StreamBase string variable may be composed of characters using UTF-8, UTF-16, or any other character set required by your application.

Within the generated Java code that corresponds to a StreamBase EventFlow operator or StreamSQL statement, or when displaying character strings in StreamBase Studio, StreamBase initializes instances of the java.lang.String class using the platform's default java.nio.charset.Charset mapping. Consequently, only 8-bit encoding is properly displayed by Studio or manipulated by the built-in string handling functions.
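The effect of decoding 16-bit data with a single-byte mapping can be seen in a few lines of plain Java. This is a minimal sketch, not StreamBase code: the class name `CharsetMismatchDemo` and its helper are hypothetical, and ISO-8859-1 is assumed as a stand-in for a platform default charset that maps one byte to one character.

```java
import java.nio.charset.StandardCharsets;

final class CharsetMismatchDemo {
    /** Decodes raw bytes with a single-byte charset: one char per byte. */
    static String decodeAsLatin1(byte[] raw) {
        // Assumption: ISO-8859-1 stands in for an 8-bit platform default.
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "Hi" in big-endian UTF-16 is the four bytes 00 48 00 69.
        byte[] utf16 = "Hi".getBytes(StandardCharsets.UTF_16BE);

        String wrong = decodeAsLatin1(utf16);
        String right = new String(utf16, StandardCharsets.UTF_16BE);

        System.out.println(wrong.length()); // 4: each byte became a character
        System.out.println(right.length()); // 2: the original two characters
    }
}
```

The mis-decoded string contains NUL characters between the letters, which is why length and search functions that assume 8-bit characters give misleading results on UTF-16 data.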

If you want to use alternative character sets in your applications, you must also implement string handling functions for strings encoded in these character sets. This article discusses how you can work with UTF-16 encoded character strings in your StreamBase applications.

Byte Order Mark

The Byte Order Mark (BOM) is a sequence of two bytes that may be prepended to a stream of characters to indicate that the stream consists of Unicode characters and whether those characters are serialized in big-endian or little-endian order. When the BOM is the byte 0xFE followed by 0xFF, the stream uses big-endian ordering, whereas the byte 0xFF followed by 0xFE indicates little-endian ordering. A BOM is not required; when it is absent, the ordering must be known from the character set name or assumed by convention.

Character streams labeled with the "UTF-16BE" or "UTF-16LE" character set have a fixed byte order (big-endian and little-endian, respectively) and therefore do not need a BOM. Character streams labeled with the "UTF-16" character set may optionally begin with the bytes 0xFE 0xFF or 0xFF 0xFE; if neither BOM sequence is present, your application code should interpret the character stream as big-endian.
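The detection rule described above can be sketched as a small helper. The class `BomDetect` and the method `byteOrderOf` are hypothetical names used for illustration only; the fallback to big-endian when no BOM is present follows the convention stated in the text.

```java
import java.nio.charset.StandardCharsets;

final class BomDetect {
    /**
     * Returns the charset name implied by a leading BOM, or "UTF-16BE"
     * when no BOM is present (big-endian by convention).
     */
    static String byteOrderOf(byte[] s) {
        if (s.length >= 2) {
            // Mask to 0..255 so the comparison is not affected by sign.
            int b0 = s[0] & 0xff;
            int b1 = s[1] & 0xff;
            if (b0 == 0xfe && b1 == 0xff) return "UTF-16BE";
            if (b0 == 0xff && b1 == 0xfe) return "UTF-16LE";
        }
        return "UTF-16BE";
    }

    public static void main(String[] args) {
        // The JDK's "UTF-16" encoder writes a big-endian BOM: FE FF 00 41.
        byte[] withBom = "A".getBytes(StandardCharsets.UTF_16);
        System.out.println(byteOrderOf(withBom)); // UTF-16BE

        byte[] littleEndian = {(byte) 0xff, (byte) 0xfe, 0x41, 0x00};
        System.out.println(byteOrderOf(littleEndian)); // UTF-16LE
    }
}
```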

Further information on the UTF-16 encoding standard is available in Network Working Group RFC 2781.

Applications that use UTF-16 encoding may use the BOM to determine the serialization order but, because the BOM is a signature rather than part of the text, these bytes must be removed before the remainder of the character stream is processed. The following method illustrates how to accomplish this task. You could incorporate this method into StreamBase as a simple custom Java function, which could then be invoked using the calljava() function.

    /** The Byte Order Mark for big-endian UTF-16 */
    public static final byte [] BIGEND_BOM = {(byte) 0xfe, (byte) 0xff};

    /** The Byte Order Mark for little-endian UTF-16 */
    public static final byte [] LITTLEEND_BOM = {(byte) 0xff, (byte) 0xfe};
    
    /**
     * If present, remove the Byte Order Mark (BOM) for either 
     * big-endian or little-endian forms.
     * 
     * @param s The string
     * @return the string without BOM (possibly the original string)
     */
    public static byte [] stripBOM(byte [] s) {
        // A stream shorter than two bytes cannot begin with a BOM.
        if(s.length < 2) {
            return s;
        }

        if((s[0] == BIGEND_BOM[0] && s[1] == BIGEND_BOM[1])
           || (s[0] == LITTLEEND_BOM[0] && s[1] == LITTLEEND_BOM[1])) {

            byte [] result = new byte[s.length -2];
            
            System.arraycopy(s, 2, result, 0, s.length-2);
            return result;
        } else {
            return s;
        }
    }

Note that the above method handles the incoming and outgoing character streams as byte[] even though the StreamBase variable passed as the argument s is a string data type.

Changing the Serialization Order

Because you will need to write the functions that manipulate UTF-16 encoded streams, you will probably want to avoid implementing each method for both big-endian and little-endian ordering. A convenience method that reverses the ordering of the character stream allows you to convert all streams to a common ordering and, if necessary, reverse the ordering again prior to output. The following method illustrates how to accomplish this task. You could incorporate this method into StreamBase as a simple custom Java function, which could then be invoked using the calljava() function.

    /**
     * Change big-endian to little-endian and vice versa 
     */
    public static byte [] endianSwap(byte [] s) {
        if(s.length % 2 == 1) {
            throw new Error
              ("UTF-16 strings must have an even number of bytes");
        }
        
        byte [] result = new byte[s.length];
        
        for(int i=0; i < s.length; i += 2) {
            result[i] = s[i+1];
            result[i+1] = s[i];
        }
        return result;
    }
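One way to gain confidence in a byte-swapping routine like the one above is to compare its output with the JDK's own encoders. The following is a sketch under that assumption; `EndianSwapCheck` and `swapPairs` are hypothetical names, and `swapPairs` simply repeats the same pairwise swap so that the example is self-contained.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class EndianSwapCheck {
    /** Swaps the two bytes of every 16-bit unit. */
    static byte[] swapPairs(byte[] s) {
        if (s.length % 2 != 0) {
            throw new IllegalArgumentException(
                "UTF-16 strings must have an even number of bytes");
        }
        byte[] result = new byte[s.length];
        for (int i = 0; i < s.length; i += 2) {
            result[i] = s[i + 1];
            result[i + 1] = s[i];
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "caf\u00e9";
        byte[] le = text.getBytes(StandardCharsets.UTF_16LE);
        byte[] be = text.getBytes(StandardCharsets.UTF_16BE);

        // Swapping the little-endian bytes should give the big-endian bytes.
        System.out.println(Arrays.equals(swapPairs(le), be)); // true
        // Swapping twice should restore the original stream.
        System.out.println(Arrays.equals(swapPairs(swapPairs(le)), le)); // true
    }
}
```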

Supplementary Characters

Characters in the UTF-16 encoding are represented either as 16 bits in two sequential bytes or as 32 bits in four sequential bytes. Supplementary characters are characters in the Unicode standard whose values are above hexadecimal 0xFFFF and which are therefore encoded in four bytes. The functions you write to process UTF-16 character strings must be capable of determining whether a specific character is represented by two or four 8-bit bytes.

The hexadecimal values from 0xD800 to 0xDFFF (decimal 55296 through 57343) are reserved for representing supplementary characters. Values from 0xD800 through 0xDBFF are the only valid entries for the first (high-order) two bytes of a supplementary character; values from 0xDC00 through 0xDFFF are the only valid entries for the last (low-order) two bytes. The following method illustrates how to examine each two-byte unit to determine whether it is part of a four-byte supplementary character, in which case its value lies between 0xD800 and 0xDFFF. You could incorporate this method into StreamBase as a simple custom Java function, which could then be invoked using the calljava() function.
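As a worked example of the four-byte form, the sketch below splits a supplementary code point into its two 16-bit halves using the standard UTF-16 arithmetic and compares the result with the JDK's `Character.toChars`. The class `SurrogateDemo` and the method `surrogatesFor` are hypothetical names chosen for illustration.

```java
final class SurrogateDemo {
    /** Splits a supplementary code point into its two 16-bit UTF-16 units. */
    static int[] surrogatesFor(int codePoint) {
        if (codePoint <= 0xFFFF || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary character");
        }
        int v = codePoint - 0x10000;          // 20 significant bits remain
        int high = 0xD800 + (v >> 10);        // top 10 bits
        int low = 0xDC00 + (v & 0x3FF);       // bottom 10 bits
        return new int[] {high, low};
    }

    public static void main(String[] args) {
        int gClef = 0x1D11E; // U+1D11E MUSICAL SYMBOL G CLEF
        int[] pair = surrogatesFor(gClef);
        System.out.printf("%04X %04X%n", pair[0], pair[1]); // D834 DD1E

        // The JDK computes the same two units.
        char[] units = Character.toChars(gClef);
        System.out.println(units[0] == pair[0] && units[1] == pair[1]); // true
    }
}
```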

    public static int strlen_utf16(byte [] s) {
        int len = 0;
        
        if(s.length % 2 == 1) {
            throw new Error
              ("UTF-16 strings must have an even number of bytes");
        }
        
        for(int i=0; i < s.length; i += 2) {
            // Little-endian input: the high-order byte is the second one.
            // Mask each byte so that sign extension cannot corrupt the value.
            int c = ((s[i+1] & 0xff) << 8) | (s[i] & 0xff);
            
            // Is this the first word of a 2-word surrogate pair?
            if(c >= 0xd800 && c <= 0xdbff) {
                i += 2;
            }
            ++len;
        }
        
        return len;
    }

Notice how this implementation of the string length method counts both a four-byte supplementary character and a two-byte basic character as a single unit in the length of the string: the index i is incremented twice within an iteration of the for loop, so the additional two bytes of the supplementary character are skipped.
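A hand-written length function can be cross-checked against the JDK, because `java.lang.String` can decode the bytes and count code points itself. The sketch below is only a reference for testing, not a replacement for the method above; `CodePointCount` and `countViaString` are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CodePointCount {
    /** Decodes the bytes with the given charset and counts code points. */
    static int countViaString(byte[] s, Charset cs) {
        String decoded = new String(s, cs);
        return decoded.codePointCount(0, decoded.length());
    }

    public static void main(String[] args) {
        // 'a', U+1D11E (a supplementary character), 'b'
        String text = "a\uD834\uDD1Eb";
        byte[] le = text.getBytes(StandardCharsets.UTF_16LE);

        System.out.println(le.length / 2); // 4 sixteen-bit units
        System.out.println(countViaString(le, StandardCharsets.UTF_16LE)); // 3 characters
    }
}
```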

For a more complete discussion of supplementary characters see the Sun Developer Network article Supplementary Characters in the Java Platform.

UTF-16 Encoded Character Stream Manipulation

StreamBase 3.7.1 and later releases include a sample application that illustrates how to implement methods that manipulate UTF-16 encoded character streams. In this example, a Java class that includes multiple methods was written, packaged into a JAR file, and imported into a project within StreamBase Studio. Each of the methods can be invoked through the calljava() function.

In addition to the methods detailed earlier in this article, this Java class includes methods to find specific substrings of characters, determine the position of a specific string of characters or regular expression within the larger character stream, and to append one UTF-16 character stream to another. In most cases the implementations of these methods are similar to the processing logic used to manipulate streams of 8-bit characters except that indices are incremented or decremented in units of 2 rather than 1. If desired, you can use this class as a template for writing a class capable of manipulating UTF-32 encoded character strings.

    package com.streambase;

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class utf16 {
        /** The Byte Order Mark for bigendian UTF-16 */
        public static byte [] BIGEND_BOM = {(byte) 0xfe, (byte) 0xff};
        /** The Byte Order Mark for littleendian UTF-16 */
        public static byte [] LITTLEEND_BOM = {(byte) 0xff, (byte) 0xfe};

        /**
         * If present, remove the Byte Order Mark (BOM)
         * for either big-endian or little-endian forms.
         *
         * @param s The string
         * @return the string without BOM (possibly the original string)
         */
        public static byte [] stripBOM(byte [] s) {
            // A stream shorter than two bytes cannot begin with a BOM.
            if(s.length < 2) {
                return s;
            }

            if((s[0] == BIGEND_BOM[0] && s[1] == BIGEND_BOM[1])
               || (s[0] == LITTLEEND_BOM[0] && s[1] == LITTLEEND_BOM[1])) {

                byte [] result = new byte[s.length -2];
            
                System.arraycopy(s, 2, result, 0, s.length-2);
                return result;
            } else {
                return s;
            }
        }
    
        /**
         * This strlen method should only be used if there are no surrogate
         * values in the string. If there are, or you aren't sure, use 
         * strlenBigEndian or strlenLittleEndian.
         */
        public static int strlen(byte [] s) {return s.length/2;} 
    
        public static int strlenBigEndian(byte [] s) {
            int len = 0;
        
            if(s.length % 2 == 1) {
                throw new Error
                  ("UTF-16 strings must have an even number of bytes");
            }
        
            for(int i=0; i < s.length; i += 2) {
                // Big-endian: the high-order byte comes first. Mask each
                // byte so that sign extension cannot corrupt the value.
                int c = ((s[i] & 0xff) << 8) | (s[i+1] & 0xff);
            
                // Is this the first word of a 2-word surrogate pair?
                if(c >= 0xd800 && c <= 0xdbff) {
                    i += 2;
                }
                ++len;
            }
        
            return len;
        }
    
        public static int strlenLittleEndian(byte [] s) {
            int len = 0;
        
            if(s.length % 2 == 1) {
                throw new Error
                  ("UTF-16 strings must have an even number of bytes");
            }
        
            for(int i=0; i < s.length; i += 2) {
                // Little-endian: the high-order byte comes second. Mask each
                // byte so that sign extension cannot corrupt the value.
                int c = ((s[i+1] & 0xff) << 8) | (s[i] & 0xff);
            
                // Is this the first word of a 2-word surrogate pair?
                if(c >= 0xd800 && c <= 0xdbff) {
                    i += 2;
                }
                ++len;
            }
        
            return len;
        }
    
        public static byte [] substr(byte [] s, int start, int length) {
            byte [] result = new byte[length*2];
            int begin = start*2;
        
            for(int i=0; i < length*2; ++i) {
                result[i] = s[begin+i];
            }
            return result;
        }

        public static int indexOf(byte [] haystack, byte [] needle) {
            return indexOf(haystack, needle, 0);
        }
    
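        public static int indexOf
          (byte [] haystack, byte [] needle, int start) {
            // Try each 16-bit position in turn, beginning at 'start', so
            // that no candidate position is skipped after a partial match.
            for(int pos = start;
                haystack.length - pos*2 >= needle.length; ++pos) {
                int offset = pos*2;
                boolean match = true;

                for(int i=0; i < needle.length; ++i) {
                    if(haystack[i+offset] != needle[i]) {
                        match = false;
                        break;
                    }
                }
                if(match)
                    return pos;
            }
            return -1;
        }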

        public static byte [] append(byte [] head, byte [] tail) {
            byte [] result = new byte[head.length + tail.length];
        
            for(int i=0; i < head.length; ++i) {
                result[i] = head[i];
            }
        
            for(int i=0; i < tail.length; ++i) {
                result[i+head.length] = tail[i]; 
            }
        
            return result;
        }

        public static int lastIndexOf(byte [] haystack, byte [] needle) {
            return lastIndexOf
              (haystack, needle, (haystack.length-needle.length)/2);
        }
    
        public static int lastIndexOf
          (byte [] haystack, byte [] needle, int lastStart) {
            return lastIndexOf(haystack, needle, 0, lastStart*2);
        }

        private static int lastIndexOf
          (byte [] haystack, byte [] needle, int start, int cur) {
            if(cur < start)
                return -1;
        
            for(int i=0; i < needle.length; ++i) {
                if(haystack[cur + i] != needle[i]) {
                    return lastIndexOf(haystack, needle, start, cur-2);
                }
            }
            return cur/2;
        }
    
        /**
         * Change big-endian to little-endian and vice versa 
         */
        public static byte [] endianSwap(byte [] s) {
            if(s.length % 2 == 1) {
                throw new Error
                  ("UTF-16 strings must have an even number of bytes");
            }
        
            byte [] result = new byte[s.length];
    
            for(int i=0; i < s.length; i += 2) {
                result[i] = s[i+1];
                result[i+1] = s[i];
            }
            return result;
        }
    
        public static int regexpBigEndian
          (byte [] haystack, byte [] needle, int start) {
            return regexp(haystack, needle, start, "UTF-16BE");
        }

        public static int regexpLittleEndian
          (byte [] haystack, byte [] needle, int start) {
            return regexp(haystack, needle, start, "UTF-16LE");
        }

        public static int regexp
          (byte [] haystack, byte [] needle, int start, String charSet) {
            try {
                Map charsets = Charset.availableCharsets();
                Charset cs = (Charset) charsets.get(charSet);
                CharsetDecoder decoder = cs.newDecoder();
                String regexp =
                  decoder.decode(ByteBuffer.wrap(needle)).toString();
                Pattern pattern = Pattern.compile(regexp, Pattern.CANON_EQ);

                decoder = cs.newDecoder();
         
                Matcher m =
                  pattern.matcher(decoder.decode(ByteBuffer.wrap(haystack)));
        
                if(m.find(start)) {
                    return m.start();
                } else {
                    return -1;
                }
            } catch (CharacterCodingException e) {
                throw new Error(e);
            }
        }
    }
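Because a java.lang.String is itself a sequence of 16-bit units, the indices returned by byte-oriented search methods such as those in the class above should agree with the indices that String reports after the bytes are decoded. The following sketch uses only the JDK and could serve as a reference when testing; `IndexCrossCheck` and `unitIndexOf` are hypothetical names, and the input is assumed to be big-endian UTF-16 without a BOM.

```java
import java.nio.charset.StandardCharsets;

final class IndexCrossCheck {
    /**
     * Returns the index, in 16-bit units, of the first occurrence of
     * needle in haystack, or -1 if there is none.
     * Assumption: both arrays hold big-endian UTF-16 without a BOM.
     */
    static int unitIndexOf(byte[] haystack, byte[] needle) {
        String h = new String(haystack, StandardCharsets.UTF_16BE);
        String n = new String(needle, StandardCharsets.UTF_16BE);
        return h.indexOf(n);
    }

    public static void main(String[] args) {
        byte[] haystack = "aab".getBytes(StandardCharsets.UTF_16BE);
        byte[] needle = "ab".getBytes(StandardCharsets.UTF_16BE);

        // The match begins at the second 16-bit unit (byte offset 2).
        System.out.println(unitIndexOf(haystack, needle)); // 1
    }
}
```

A search that restarts from the wrong position after a partial match would miss this case, so overlapping prefixes such as "aab" and "ab" make useful test inputs.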
