Friday, September 14, 2007

DIY Protocols are evil!

A recent project at work opened my eyes as to why you really have no business designing your own low-level interprocess protocol. Given that there are so many good protocols already available (HTTP, SOAP, etc.), why would anyone feel compelled to write their own? I should clarify what I mean by low-level in this context: a protocol implemented directly on top of TCP/IP. I am not arguing against higher-level protocols or the APIs that applications expose; those are necessary nowadays unless you live in a bubble.

Anyway... I was faced with modifying an existing protocol in which a client and a server communicated with each other by passing a C structure back and forth. The protocol went something like this:

Client:
connect to server
build struct and insert some 'request' data
write struct on socket
wait for response...

Server:
Listen/accept/fork for each client connection
read struct from socket
parse request
write response in same struct's 'response' fields
write struct on socket
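
To make the steps above concrete, here is a minimal sketch of what "write struct on socket" and "read struct from socket" amount to. The struct layout and the names message, send_message and recv_message are made up for illustration; the real project's fields are not shown in this post. The point is only that the raw in-memory bytes go out exactly as the compiler laid them out.

```c
#include <stdint.h>
#include <unistd.h>

/* Hypothetical layout: the real struct's fields are not given in this post. */
struct message {
    uint8_t  request_type;      /* filled in by the client */
    uint32_t request_id;        /* filled in by the client */
    uint16_t response_code;     /* filled in by the server */
    char     response_text[32]; /* filled in by the server */
};

/* Write the struct's raw in-memory bytes to fd, looping over short writes. */
static int send_message(int fd, const struct message *m)
{
    const unsigned char *p = (const unsigned char *)m;
    size_t left = sizeof *m;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n <= 0)
            return -1;
        p += n;
        left -= (size_t)n;
    }
    return 0;
}

/* Read exactly sizeof(struct message) raw bytes from fd into *m. */
static int recv_message(int fd, struct message *m)
{
    unsigned char *p = (unsigned char *)m;
    size_t left = sizeof *m;
    while (left > 0) {
        ssize_t n = read(fd, p, left);
        if (n <= 0)
            return -1;
        p += n;
        left -= (size_t)n;
    }
    return 0;
}
```

This round-trips fine when both ends are the same program built for the same machine, which is exactly why the approach looks harmless at first.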

Well, to some of you this may seem OK: it is simple enough and very straight to the point. The problem lies in the underlying representation of the structure. When the code is compiled for a particular machine, let's say a big-endian system, you end up with a specific byte ordering for any structure members which are integers. The trouble starts when your server and your client run on machines with different byte orderings. One side sends what it thinks is a perfectly valid integer value, and the other side reads the same bytes in the opposite order (in some cases your number gets really big, or really small, depending on the signedness). Wikipedia has a good example of what all this byte ordering means.
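
A small sketch of how wrong the number can get: the helper below (the name seen_by_opposite_endian is mine, purely for illustration) reverses the four bytes of a 32-bit value, which is effectively what a receiver with the opposite byte order sees when it reads the sender's raw bytes.

```c
#include <stdint.h>

/* Reverse the four bytes of a 32-bit value. A receiver whose byte order is
   the opposite of the sender's ends up interpreting the raw bytes this way. */
static uint32_t seen_by_opposite_endian(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}
```

Send the value 1 and the other side reads 16777216 (0x01000000). Send 255 and the other side gets the bit pattern 0xFF000000, which is a large positive number if the field is unsigned and a large negative one if it is signed.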

Anyone who has developed a network application is probably already aware of this byte-ordering difference that you must account for. That is why the de facto byte ordering for network communication, usually called network byte order, is big-endian. All good network programs should explicitly convert their integer data to network byte order before sending and then back to the native ordering on receiving (see the htons/htonl and ntohs/ntohl library calls).
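
As a rough sketch of that conversion, assuming a POSIX system where htonl and ntohl come from <arpa/inet.h>: the helper names put_u32 and get_u32 are my own, not something from the project.

```c
#include <arpa/inet.h> /* htonl, ntohl (POSIX) */
#include <stdint.h>
#include <string.h>

/* Store a 32-bit host value into buf in network (big-endian) byte order. */
static void put_u32(unsigned char *buf, uint32_t host_value)
{
    uint32_t net = htonl(host_value);
    memcpy(buf, &net, sizeof net);
}

/* Load a 32-bit value stored in network byte order and return it in host order. */
static uint32_t get_u32(const unsigned char *buf)
{
    uint32_t net;
    memcpy(&net, buf, sizeof net);
    return ntohl(net);
}
```

Whatever the native ordering of the machine, put_u32(buf, 0x12345678) leaves the bytes 12 34 56 78 in buf, and get_u32 on the receiving side recovers 0x12345678.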

The second problem, which actually turns out to be the more painful, is that of byte alignment. Different compilers and machine architectures may pad the members of our C structure differently so that the integers (and shorts) in it land on addresses the hardware prefers. A very detailed explanation is also available on Wikipedia. What can happen is that you manually add up the size of each member of your structure and get a sum; you think, great, I can tell the client that the server will read exactly X bytes and write back exactly X bytes. But sometimes you would be very wrong. So to check your math you apply the sizeof operator to a variable of your structure's type and get a _completely_ different number! The reason is that your compiler has inserted padding so that each member lines up on its required boundary. So you actually end up reading and writing X + y bytes of data, where y is the number of padding bytes. Those extra bytes are not accessible when you reference the structure members, so you never know they are there.
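
Here is a sketch of that mismatch, using a hypothetical struct (not the project's real one) and two small helpers of my own naming. The manual sum is fixed by the member types, while sizeof and offsetof report what the compiler actually did.

```c
#include <stddef.h> /* offsetof, size_t */
#include <stdint.h>

/* Hypothetical layout, chosen so that padding is likely to appear. */
struct message {
    uint8_t  request_type;      /* 1 byte   */
    uint32_t request_id;        /* 4 bytes  */
    uint16_t response_code;     /* 2 bytes  */
    char     response_text[32]; /* 32 bytes */
};

/* The "manual" total: 1 + 4 + 2 + 32 = 39 bytes. */
static size_t sum_of_members(void)
{
    return sizeof(uint8_t) + sizeof(uint32_t) + sizeof(uint16_t) + 32;
}

/* Number of invisible padding bytes the compiler added to the struct. */
static size_t padding_bytes(void)
{
    return sizeof(struct message) - sum_of_members();
}
```

On many common platforms sizeof(struct message) comes out as 44 rather than 39, with offsetof(struct message, request_id) equal to 4 instead of 1, but the exact numbers are up to the compiler and target, which is precisely the problem.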

The project I mentioned earlier actually had the luxury of always being deployed on systems whose byte alignment agreed and where both the client and server were implemented in C. Then the day came when a .Net application needed to communicate across the wire with the server. Another team worked on the .Net implementation of a client, and it was a struggle to make sure that we properly communicated the exact layout of the structure, including the padding bytes. A second problem arose when the .Net side also wanted to send over its version of a 'Character' array. What we got was an array with 16 bits per character and not the old-standard 8 bits per character that the C side expected.

In the end it all worked out and the systems are communicating nicely. In hindsight it probably would have been less time-consuming to just toss the old DIY protocol out the door and use an established standard protocol.