Java: Understanding Streams, Bytes and Character sets

Posted: June 19, 2011 in Flex, Java

I had to evolve myself back to Java, one of the first languages I learned to program in, and boy was that a fallback. Having been in PHP so long, used to the OOP capabilities it brings along, I noticed the side effects of using PHP’s flawed and missing OOP principles.

The project I was working on needed a Java program that communicated with hardware devices under Linux. My main development environment has been OSX + Netbeans for the last year, so, abusing Java’s claim of being OS independent, I developed the Java program for both OSX and Linux.

One of the requirements for the Java program was a communication bridge between Java and a Flex program (designated to be the interface). So the first thing we (my colleague and I) looked at was Merapi. After some hours of trial and error, when things started to communicate with each other, the work required to get it communicating was just way too much for the little information being passed through Merapi (though Merapi is superb for building a sophisticated communication bridge). So we decided to switch to a simpler approach, a socket connection.

A socket connection is, simply said, “binding a port to a socket to retrieve and send streams of data from and to the designated client(s)”. So being the lazy programmers we are, and having tight deadlines, we started searching for some ready-to-use server/client code, and I started to wonder if any of the authors even understood the full picture of streams, bytes and charsets, which in the end are needed to convert those pretty little bytes back to useful characters. And hopefully the characters that were meant to be sent from the server and received by the client.

So I found myself some time to write this little guide on what they do, and why it matters to understand them for your own sake 😉

There are various ways to read a stream’s input, or send information back to a stream’s output. The following code is based upon one method that reads the input stream, read().
example 1.0

      byte[] readBuffer;
      // try to set the byte array at the correct size
      try {
        readBuffer = new byte[this.in.available()];
      } catch(Exception e) {
        // if any exception occurs make sure there is enough
        // room to read the stream; if only 14 bytes arrive,
        // there will be 34 zeros (0) at the end.
        readBuffer = new byte[48];
      }

      int length = 0;
      try {
        // dump all available bytes into the readBuffer byte array,
        // appending after the bytes that were already read
        while (length < readBuffer.length && this.in.available() > 0) {
          int count = this.in.read(readBuffer, length, readBuffer.length - length);
          if (count < 0) {
            break; // end of stream
          }
          length += count;
        }
      } catch(Exception e) {
        // something happened while reading
      }

      // here the byte array converts to a string, using the
      // platform default charset because none is given
      String output = new String(readBuffer);


The output variable at the end now contains a string of the information that was sent to the input stream. “So I’ve got my String, what do I care?” Well, unless you’re getting a completely different string than the string originally sent, you should from now on 😉
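For comparison, here is a minimal sketch of a more defensive way to collect the bytes. It assumes the sender closes the stream when it is done, and it relies on the fact that read() reports how many bytes it stored, or -1 at the end of the stream. The names ReadAllSketch and readAll are made up for this example, and a ByteArrayInputStream stands in for the socket's input stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class ReadAllSketch {

    // Reads until end of stream and returns exactly the bytes received.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        byte[] chunk = new byte[48];
        int count;
        // read() returns how many bytes it stored in chunk, or -1 at end of stream
        while ((count = in.read(chunk)) != -1) {
            // copy only the part of chunk that was filled by this call
            collected.write(chunk, 0, count);
        }
        return collected.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // stand-in for the socket stream (an assumption for this sketch)
        InputStream in = new ByteArrayInputStream("HELLO ".getBytes(StandardCharsets.UTF_8));
        byte[] received = readAll(in);
        System.out.println(received.length + " bytes: " + new String(received, StandardCharsets.UTF_8));
    }
}
```

Because only the filled part of each chunk is copied, there are no trailing zeros to worry about, whatever the size of the message.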

So what does the read() method actually do? It gets passed a parameter, which in our example is a fixed-size byte array that tries to match the number of available bytes from the input stream and, if that fails, falls back to a hardcoded size of 48. Internally, the read() method does something similar to the following:

example 1.1

  // simplified sketch; hasNextByte() and getByte() are made-up helpers
  public int read(byte[] putBytesHere) {
    int index = 0;
    // keep going while bytes are waiting and there is room left
    while (this.hasNextByte() && index < putBytesHere.length) {
      // if there are 14 bytes available, index will stop
      // at 13 (index starts with zero (0))
      putBytesHere[index] = this.getByte();
      index++;
    }
    // bytes that did not fit into putBytesHere stay in the
    // stream and can be picked up by the next call

    // the real read() reports how many bytes it stored
    return index;
  }


Every single byte gets iterated upon and put at a unique index in the putBytesHere array starting from zero (0), and the method reports how many bytes it stored. So if you would loop over the byte array readBuffer before passing it into the String constructor:

example 1.2

  for(int i = 0; i < readBuffer.length; i++) {
    System.out.println("byte " + readBuffer[i] + " @ index "+i);
  }


you would see some output similar to:
example 1.3

byte 72 @ index 0
byte 69 @ index 1
byte 76 @ index 2
byte 76 @ index 3
byte 79 @ index 4
byte 32 @ index 5
.. etc

Now these byte numbers, 72, 69, etc. are decimal numbers and get converted into a String. But what the heck do these decimal numbers stand for, and how does the String constructor know how to convert decimal number 72 or 69 to a character? Well, that’s where the character encoding comes in later, but first let me clarify where these decimal byte numbers come from.

A byte is nothing more than 8 bits. A bit is a digit that is either 1 or 0 (e.g. 1 for ON and 0 for OFF, or 1 for TRUE and 0 for FALSE). So a bit combination of 01011010 represents ONE byte. This combination can also be represented as the decimal number 90 (or hex 0x5A). The lowest value of a decimal byte number is 0 (0x00) and the highest is 255 (0xFF). You most likely recognize the decimal byte number 255 from color schemes, e.g. RGB(255, 123, 34) where, to be honest, I have no idea what color it represents. The byte sequence 255, 123, 34 used in the RGB color scheme is equivalent to the hex value 0xFF7B22. The decimal byte number 255 is not some magical number that fell from a rooftop. It originates from the number of possible combinations a byte with a length of 8 can have when each position can contain 2 different values (1 and 0).
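The different notations for the same byte can be checked with a few lines of Java. This is only a sketch; the class ByteNotationSketch and its helpers are names I made up. One thing to be aware of: Java's byte type is signed (-128 to 127), so the sketch masks with 0xFF to get the 0 to 255 value used in this article.

```java
final class ByteNotationSketch {

    // Java's byte is signed (-128..127); masking with 0xFF gives the 0..255 value.
    static int unsigned(byte b) {
        return b & 0xFF;
    }

    // Formats a byte as an 8-character string of 0s and 1s.
    static String toBits(byte b) {
        String bits = Integer.toBinaryString(unsigned(b));
        // left-pad with zeros up to eight positions
        return "00000000".substring(bits.length()) + bits;
    }

    public static void main(String[] args) {
        byte b = 0b01011010;                          // the bit pattern from the text
        System.out.println(b);                        // decimal notation
        System.out.println(Integer.toHexString(unsigned(b)));
        System.out.println(toBits(b));

        byte max = (byte) 0xFF;                       // all eight bits set
        System.out.println(max + " signed, " + unsigned(max) + " unsigned");

        int rgb = (255 << 16) | (123 << 8) | 34;      // pack three bytes into one number
        System.out.println(Integer.toHexString(rgb));
    }
}
```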

So to calculate all the different combinations of 0’s and 1’s with a length of 8 ..

bit 1 | 0+1 = 1 .. ( can have 0 or 1; that makes 2 )
bit 2 | 1+1 = 2 .. ( can have 0 or 1; that makes 4 )
bit 3 | 2+2 = 4 .. ( can have 0 or 1; that makes 8 )
bit 4 | 4+4 = 8 .. ( can have 0 or 1; that makes 16 )
bit 5 | 8+8 = 16 .. ( can have 0 or 1; that makes 32 )
bit 6 | 16+16 = 32 .. ( can have 0 or 1; that makes 64 )
bit 7 | 32+32 = 64 .. ( can have 0 or 1; that makes 128 )
bit 8 | 64+64 = 128 .. ( can have 0 or 1; that makes 256 )

or

2^8 (2×2×2×2×2×2×2×2) = 256

.. brings up the number 256, 256 different values. The reason why 255 is the maximum number representing a byte is that counting starts from 0, not from 1. In mathematics, 0 (zero) is a real number. It is a number between -1 and 1 and therefore it exists. Here it is not the number 0 used to specify quantity. Likewise, an array starts counting from 0 and not 1 to identify a location, not the quantity or the amount it holds at its current position. Still, a lot of programmers scream that it is “not logical” to start an array from index 0; it should start from index 1 because that is the first position in an array.
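The doubling from the table above, and the difference between a quantity (256) and the highest position (255), can be sketched like this. BitCombinationsSketch and combinations are hypothetical names used only for illustration.

```java
final class BitCombinationsSketch {

    // Number of different values that fit in the given number of bits.
    static int combinations(int bits) {
        int count = 1;
        for (int i = 0; i < bits; i++) {
            count *= 2; // every extra bit doubles the number of combinations
        }
        return count;
    }

    public static void main(String[] args) {
        for (int bits = 1; bits <= 8; bits++) {
            System.out.println("bit " + bits + " | " + combinations(bits) + " combinations");
        }
        // quantity versus location: 256 slots, numbered 0 up to and including 255
        int[] table = new int[combinations(8)];
        System.out.println("slots: " + table.length + ", first index: 0, last index: " + (table.length - 1));
    }
}
```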

If we could only remember the reason why some idiot, at some point in history, came up with the idea of starting an array from the 0 (zero) index. Well, you can probably figure out his reasoning for yourself.

The number 0 (zero) represents the byte 00000000. The number 255 represents the byte 11111111. In this case, again, 0 (zero) or 255 is not to be mistaken for a quantity, as in “There are no (zero) cars in this parking lot”, but is a location, which is represented as byte zero. It is the first combination of all combinations of a byte. The second combination is 7 zeros and a 1, as in 00000001, the third is 00000010, and so on.

So our previous byte output from example 1.3 contains the following 8-bit combinations
example 1.3.1

byte 72 (01001000) @ index 0 (00000000)
byte 69 (01000101) @ index 1 (00000001)
byte 76 (01001100) @ index 2 (00000010)
byte 76 (01001100) @ index 3 (00000011)
byte 79 (01001111) @ index 4 (00000100)
byte 32 (00100000) @ index 5 (00000101)
.. etc
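A listing like example 1.3.1 can be produced with a small loop. The following is a sketch under the assumption that the message is “HELLO ” encoded as UTF-8; ByteDumpSketch, toBits and describe are names I picked for this illustration.

```java
import java.nio.charset.StandardCharsets;

final class ByteDumpSketch {

    // Eight-character binary form of a value between 0 and 255.
    static String toBits(int value) {
        String bits = Integer.toBinaryString(value & 0xFF);
        return "00000000".substring(bits.length()) + bits;
    }

    // One line of the listing: the byte and its index, each with its bit pattern.
    static String describe(byte[] bytes, int index) {
        return "byte " + (bytes[index] & 0xFF) + " (" + toBits(bytes[index]) + ")"
                + " @ index " + index + " (" + toBits(index) + ")";
    }

    public static void main(String[] args) {
        byte[] readBuffer = "HELLO ".getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < readBuffer.length; i++) {
            System.out.println(describe(readBuffer, i));
        }
    }
}
```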

So how does the String convert these decimal byte numbers into a character? Now I can finally start explaining about the character map.

A character map is a set of characters bound to another set of unique numbers (or identifiers). Through the history of computing, numerous character maps have been deployed for different languages, each holding the most used characters and signs needed to write that particular language. UTF-8 is one of the encodings built on such a map and stands for UCS Transformation Format, 8-bit (often read as Unicode Transformation Format). UTF-8 is an encoding of the Universal Character Set (Unicode), and I will use UTF-8 and its Basic Latin range, which matches ASCII, to show how the String is able to convert the decimal byte numbers to useful characters and how it can convert each character to a byte.

If you take a look at the general UTF-8 encoding table you’ll see a table holding all the characters meant to be displayed when using the UTF-8 character encoding. Some parts of the table are empty, meant to be filled with different kinds of characters for different languages (different encodings). To make this wiki page less complicated than it looks, you could imagine the UTF-8 encoding table as a chessboard. If I would like to move a piece on the chessboard, I could simply say “I’m moving my knight on A-3 to C-2”. A-3 and C-2 (starting from the top to the left) are coordinates on a 2-dimensional board meant to specify a location (the same technique is used for arrays to specify the value at a particular location).

The UTF-8 encoding table can do just the same for decimal byte numbers (or the same numbers written in another notation, like hex). If you look at the encoding table and find number 72 (8-4) from our byte output in examples 1.3 and 1.3.1, you’ll notice that it represents the capital letter H. Number 69 (5-4) represents the capital letter E, 76 (C-4) represents the capital letter L, 79 (F-4) represents the capital letter O and 32 (0-2) stands for a space. Putting these characters all together will make “HELLO “.
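The chessboard coordinates can be derived from the number itself. The sketch below assumes the table is 16 columns wide, so that the column is the low hex digit and the row the high hex digit of the byte (72 is 0x48, so column 8, row 4). CodeTableSketch, column and row are hypothetical names, and the plain cast to char only works because every code here is below 128.

```java
final class CodeTableSketch {

    // Column of the code table: the low four bits, written as one hex digit.
    static char column(int code) {
        return Character.toUpperCase(Character.forDigit(code & 0x0F, 16));
    }

    // Row of the code table: the high four bits, written as one hex digit.
    static char row(int code) {
        return Character.toUpperCase(Character.forDigit((code >> 4) & 0x0F, 16));
    }

    public static void main(String[] args) {
        int[] codes = {72, 69, 76, 76, 79, 32};
        StringBuilder message = new StringBuilder();
        for (int code : codes) {
            char character = (char) code; // valid because all codes are below 128
            System.out.println(code + " (" + column(code) + "-" + row(code) + ") = '" + character + "'");
            message.append(character);
        }
        System.out.println("\"" + message + "\"");
    }
}
```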

The InputStream object in Java received a byte stream (or a message) “HELLO “, which in bytes says 72,69,76,76,79,32.

Now the String constructor can convert these bytes into characters. In a very simplified form (one byte per character, which only holds for the Basic Latin range), it does something like this:

   public String( byte[] byteArray ) {
     // collect the characters one by one
     StringBuilder output = new StringBuilder();
     for(int i = 0; i < byteArray.length; i++) {
       // look up the character that belongs to this byte value
       // in the system's default character set
       // (characterSet is a made-up lookup table)
       output.append(characterSet.get(byteArray[i]));
     }
     // store the collected characters as the value of this String
     this.value = output.toString();
   }


 

From this point on, I can only guess it was the message “HELLO ” because I automatically used the system default charset of Java, which is in my case the UTF-8 encoding table.
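To take the guessing out, the charset can be named explicitly on both sides. A minimal sketch, assuming both ends agree on UTF-8; ExplicitCharsetSketch, decode and encode are names invented for this example, while the String constructor and getBytes overloads that take a Charset are part of the standard library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ExplicitCharsetSketch {

    // Bytes to characters; naming the charset makes the result
    // independent of the machine's default.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Characters to bytes, using the same agreed charset.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] received = {72, 69, 76, 76, 79, 32};
        System.out.println("platform default: " + Charset.defaultCharset());
        System.out.println("\"" + decode(received) + "\"");
        System.out.println(Arrays.toString(encode("HELLO ")));
    }
}
```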

If the person giving the input uses a different encoding table for its characters than the person reading the output, everything can be mixed up, or just partially, where only a few characters end up being displayed wrong, e.g. as a question mark (?) or a square block (઀), because that is the character retrieved at the specified location.
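Such a mix-up is easy to reproduce. In this sketch the sender and the receiver each pick a charset, and the hypothetical helper transfer (in the made-up class CharsetMismatchSketch) encodes with one and decodes with the other. Plain Basic Latin text like “HELLO ” survives most combinations, which is exactly why the problem often stays hidden until an accented character shows up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetMismatchSketch {

    // Encodes with the sender's charset and decodes with the receiver's.
    static String transfer(String text, Charset sender, Charset receiver) {
        byte[] wire = text.getBytes(sender);
        return new String(wire, receiver);
    }

    public static void main(String[] args) {
        // "café", written with an escape so the source file encoding does not matter
        String message = "caf\u00e9";

        // both sides agree: the text arrives intact
        System.out.println(transfer(message, StandardCharsets.UTF_8, StandardCharsets.UTF_8));
        // UTF-8 bytes read as ISO-8859-1: the last letter turns into two wrong characters
        System.out.println(transfer(message, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
        // ISO-8859-1 bytes read as UTF-8: the last byte is invalid and gets replaced
        System.out.println(transfer(message, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));
        // US-ASCII cannot encode the last letter, so the sender already emits a question mark
        System.out.println(transfer(message, StandardCharsets.US_ASCII, StandardCharsets.UTF_8));
    }
}
```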

Cheers 😉
