
Sunday, February 28, 2010

Unix Incompatibility Notes: Byte ordering and how to find machine endianness

FYI,

http://unixpapa.com/incnote/byteorder.html

 

Jan Wolter

Any program that writes binary files that may have to be read by another computer needs to be concerned about byte order issues. Different processors write integers differently.

There is a minority view that says if you code properly then you never need to know the endianness of your machine. You should certainly consider carefully if you can do so in your application.
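To illustrate that view, here is a minimal sketch in C that decodes a four-byte integer from a buffer using only shifts, so the result is the same whatever the host's byte order. The function names read_u32_msb_first() and read_u32_lsb_first() are invented for this example, and it assumes 8-bit bytes and a compiler that provides <stdint.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode four bytes that were written most significant byte first.
   Only the file format's byte order matters here, not the host's. */
uint32_t read_u32_msb_first(const unsigned char *buf)
{
    assert(buf != NULL);
    return ((uint32_t)buf[0] << 24) |
           ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |
            (uint32_t)buf[3];
}

/* Decode four bytes that were written least significant byte first. */
uint32_t read_u32_lsb_first(const unsigned char *buf)
{
    assert(buf != NULL);
    return  (uint32_t)buf[0]        |
           ((uint32_t)buf[1] << 8)  |
           ((uint32_t)buf[2] << 16) |
           ((uint32_t)buf[3] << 24);
}
```

Because the code never looks at how the host lays out an integer in memory, it needs no endianness test at all. The cost is that the file format has to fix one byte order in advance.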

Terminology

Let's suppose we are writing out a four byte long integer 67305985. In hexadecimal, this is 0x04030201, so the most significant byte contains the hexadecimal value 04 and the least significant byte contains the hexadecimal value 01. Suppose this is written out to memory address x. The value will actually be written to four consecutive addresses, x through x+3. Which byte of data goes in which memory location? It depends on the processor. The alternatives are named after Lilliputian political parties:

  • Big-Endian systems save the most significant byte first. Sun and Motorola processors, IBM-370s and PDP-10s are big-endian. JPEG images contain big-endian values.

      Address:  x     x+1   x+2   x+3
      Value:    04    03    02    01

  • Little-Endian systems save the least significant byte first. The entire Intel x86 family, Vaxes, Alphas and PDP-11s are little-endian. GIF images contain little-endian values.

      Address:  x     x+1   x+2   x+3
      Value:    01    02    03    04

  • Middle-Endian or PDP-Endian systems save the most significant word first, with each word having the least significant byte first. For developers of new software, it is not only perfectly reasonable, but strongly recommended, to ignore this possibility. I don't think there ever was a processor that stored 32-bit integer values to memory in a middle-endian format, though middle-endianness has occasionally appeared in things like packed-decimal formats, floating-point formats, and obscure communications protocols (it's used for the length of TCP/IP packets in Visa's "Visa Base I" protocol).

      Address:  x     x+1   x+2   x+3
      Value:    03    04    01    02

Some processors (PowerPC, MIPS, DEC Alpha) can be either big-endian or little-endian depending on software settings.
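The three layouts described above can be reproduced with a short C sketch that writes the example value 0x04030201 into a four-byte buffer in each order. The names store_big(), store_little() and store_pdp() are hypothetical, chosen only for this illustration, and 8-bit bytes are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Big-endian: most significant byte at the lowest address. */
void store_big(uint32_t v, unsigned char out[4])
{
    assert(out != NULL);
    out[0] = (unsigned char)((v >> 24) & 0xFF);
    out[1] = (unsigned char)((v >> 16) & 0xFF);
    out[2] = (unsigned char)((v >> 8) & 0xFF);
    out[3] = (unsigned char)(v & 0xFF);
}

/* Little-endian: least significant byte at the lowest address. */
void store_little(uint32_t v, unsigned char out[4])
{
    assert(out != NULL);
    out[0] = (unsigned char)(v & 0xFF);
    out[1] = (unsigned char)((v >> 8) & 0xFF);
    out[2] = (unsigned char)((v >> 16) & 0xFF);
    out[3] = (unsigned char)((v >> 24) & 0xFF);
}

/* Middle-endian (PDP): most significant 16-bit word first,
   but each word stored least significant byte first. */
void store_pdp(uint32_t v, unsigned char out[4])
{
    assert(out != NULL);
    out[0] = (unsigned char)((v >> 16) & 0xFF);
    out[1] = (unsigned char)((v >> 24) & 0xFF);
    out[2] = (unsigned char)(v & 0xFF);
    out[3] = (unsigned char)((v >> 8) & 0xFF);
}
```

Calling each function with 0x04030201 fills the buffer with 04 03 02 01, 01 02 03 04 and 03 04 01 02 respectively, matching the layouts listed for x through x+3.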

Network byte order is the standard used in packets sent over the internet. It is big-endian (except that technically it refers to the order in which bytes are transmitted, not the order in which they are stored). If you are going to choose an arbitrary order to standardize on, network byte order is a sensible choice.

The Unix functions htonl(), htons(), ntohl(), and ntohs() convert longs and shorts back and forth between the host byte order and network byte order. However, though they are widely available, they are not universally available.
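As a rough sketch of how these functions are used, the C code below converts a value with htonl() and copies the result into a byte buffer, which then holds the bytes in network order on any host. It assumes a POSIX system where <arpa/inet.h> declares htonl(). The helper names to_network_bytes() and fallback_htonl() are invented for this example; the second is only one possible substitute for systems that lack htonl().

```c
#include <arpa/inet.h>  /* htonl(), ntohl() on POSIX systems */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy v into out in network (big-endian) byte order using htonl(). */
void to_network_bytes(uint32_t v, unsigned char out[4])
{
    uint32_t n = htonl(v);

    assert(out != NULL);
    memcpy(out, &n, sizeof n);
}

/* Hypothetical stand-in for htonl(): builds the big-endian byte
   sequence with shifts, then reinterprets those bytes as an integer. */
uint32_t fallback_htonl(uint32_t v)
{
    unsigned char b[4];
    uint32_t n;

    b[0] = (unsigned char)((v >> 24) & 0xFF);
    b[1] = (unsigned char)((v >> 16) & 0xFF);
    b[2] = (unsigned char)((v >> 8) & 0xFF);
    b[3] = (unsigned char)(v & 0xFF);
    memcpy(&n, b, sizeof n);
    return n;
}
```

On a big-endian host both functions leave the value unchanged; on a little-endian host they reverse its bytes. Either way the bytes in memory come out as 04 03 02 01 for the value 0x04030201.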

Compile-time Tests

We'd usually prefer to determine endianness at compile time. Most modern Unix systems define the byte order in the sys/param.h include file. Some code I've seen references the endian.h or machine/endian.h files instead, but I think that if those exist, then sys/param.h always pulls the appropriate ones in. Note, however, that some older systems (including SunOS 4.1) have a sys/param.h that does not define any byte order information.

The sys/param.h header normally defines the symbols __BYTE_ORDER, __BIG_ENDIAN, __LITTLE_ENDIAN, and __PDP_ENDIAN. You can test endianness by doing something like:

   #include <sys/param.h>

   #ifdef __BYTE_ORDER
   # if __BYTE_ORDER == __LITTLE_ENDIAN
   #  define I_AM_LITTLE_ENDIAN
   # else
   #  if __BYTE_ORDER == __BIG_ENDIAN
   #   define I_AM_BIG_ENDIAN
   #  else
   #   error unknown byte order!
   #  endif
   # endif
   #endif
   

2 comments:

  1. If __BYTE_ORDER is not defined, you may want to test for the existence of BYTE_ORDER, BIG_ENDIAN and LITTLE_ENDIAN. Linux defines these as synonyms of the versions with underscores, apparently in an attempt to be compatible with BSD Unix.
    If that is not defined, you might try things like:
    #if defined (i386) || defined (__i386__) || defined (_M_IX86) || \
    defined (vax) || defined (__alpha)
    # define I_AM_LITTLE_ENDIAN
    #endif
    However trying to cover all bases with this sort of thing seems futile, and may be complicated by architectures that can work either way. Ultimately, it is better to fall back to a run-time test.
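The fallbacks described so far can be chained into one sketch. The C code below tries __BYTE_ORDER first, then the BSD-style names, then the processor macros, and reports "unknown" if none apply. It assumes sys/param.h exists on the system; the enum, compile_time_byte_order() and check_compile_time_answer() are names made up for this illustration.

```c
#include <sys/param.h>
#include <assert.h>

#if defined(__BYTE_ORDER) && defined(__LITTLE_ENDIAN) && defined(__BIG_ENDIAN)
# if __BYTE_ORDER == __LITTLE_ENDIAN
#  define I_AM_LITTLE_ENDIAN
# elif __BYTE_ORDER == __BIG_ENDIAN
#  define I_AM_BIG_ENDIAN
# endif
#elif defined(BYTE_ORDER) && defined(LITTLE_ENDIAN) && defined(BIG_ENDIAN)
# if BYTE_ORDER == LITTLE_ENDIAN
#  define I_AM_LITTLE_ENDIAN
# elif BYTE_ORDER == BIG_ENDIAN
#  define I_AM_BIG_ENDIAN
# endif
#elif defined(i386) || defined(__i386__) || defined(_M_IX86) || \
      defined(vax) || defined(__alpha)
# define I_AM_LITTLE_ENDIAN
#endif

enum ct_order { CT_UNKNOWN = 0, CT_LITTLE = 1, CT_BIG = 2 };

/* Report what the preprocessor tests above concluded. */
enum ct_order compile_time_byte_order(void)
{
#if defined(I_AM_LITTLE_ENDIAN)
    return CT_LITTLE;
#elif defined(I_AM_BIG_ENDIAN)
    return CT_BIG;
#else
    return CT_UNKNOWN;
#endif
}

/* Sanity check: a compile-time answer should agree with a run-time probe. */
void check_compile_time_answer(void)
{
    long one = 1;
    int low_byte_is_one = (*(unsigned char *)&one == 1);
    enum ct_order o = compile_time_byte_order();

    if (o == CT_LITTLE)
        assert(low_byte_is_one);
    if (o == CT_BIG)
        assert(!low_byte_is_one);
}
```

When the result is CT_UNKNOWN, the caller still has to decide at run time, which is part of why this approach never feels complete.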
    Run-time Tests
    It's easy enough to write code to check if you are big or little endian. The following function returns true if we are big endian.
    int am_big_endian(void)
    {
        long one = 1;
        /* The lowest-addressed byte is 0 unless the least significant byte comes first. */
        return !(*((char *)(&one)));
    }
    Or an alternate version using unions (based on Harbison & Steele):
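    int am_big_endian(void)
    {
        /* View the same long both as an integer and as its individual bytes. */
        union { long l; char c[sizeof (long)]; } u;
        u.l = 1;
        /* On a big-endian machine the least significant byte is stored last. */
        return (u.c[sizeof (long) - 1] == 1);
    }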
    I suspect that these run-time tests are the better solution.
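A run-time test can also tell all three layouts apart, not just big from little. The following is a minimal sketch, assuming 8-bit bytes and a compiler that provides uint32_t in <stdint.h>; the enum and the functions classify_layout() and host_byte_order() are hypothetical names used only here. It stores the example value 0x04030201 and compares its bytes in memory against each known layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum byte_order { ORDER_UNKNOWN, ORDER_LITTLE, ORDER_BIG, ORDER_PDP };

/* Match the in-memory bytes of 0x04030201 against the known layouts. */
enum byte_order classify_layout(const unsigned char bytes[4])
{
    static const unsigned char little[4] = {0x01, 0x02, 0x03, 0x04};
    static const unsigned char big[4]    = {0x04, 0x03, 0x02, 0x01};
    static const unsigned char pdp[4]    = {0x03, 0x04, 0x01, 0x02};

    assert(bytes != NULL);
    if (memcmp(bytes, little, 4) == 0) return ORDER_LITTLE;
    if (memcmp(bytes, big, 4) == 0)    return ORDER_BIG;
    if (memcmp(bytes, pdp, 4) == 0)    return ORDER_PDP;
    return ORDER_UNKNOWN;
}

/* Store the test value in memory and classify what the host wrote. */
enum byte_order host_byte_order(void)
{
    uint32_t v = 0x04030201u;
    unsigned char bytes[4];

    memcpy(bytes, &v, sizeof v);
    return classify_layout(bytes);
}
```

Using memcpy() to look at the bytes avoids the pointer cast in the first version and the union in the second, at the price of a few more lines.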

  2. Myths
    Discussions of endian-ness on the web seem to contain quite a lot of bogus information. This includes:
    • "Most UNIX machines are big endian. Whereas most PCs are little endian machines." - Fanning Consulting
    Byte order is not determined by the OS you are running, but some OSes have limitations on what processors they will run on. Microsoft's operating systems run almost exclusively on little-endian machines, while Apple's OS 9 ran almost exclusively on big-endian machines. Unix (including OS X) works perfectly fine either way, which is why detecting endianness is something Unix programmers occasionally need to know how to do. For what it's worth, I'd guess most Unixes these days are on x86 processors, so Unix is probably more commonly little-endian than big-endian.
    • "These two phrases are derived from 'Big End In' and 'Little End In.'" - Microsoft Support
    Creative, but wrong. The names are derived from Jonathan Swift's book Gulliver's Travels, where they describe Lilliputian political parties who disagree vehemently over which end to start eating an egg from. This terminology was popularized for byte order by a less than completely serious paper authored by Danny Cohen which appeared on April 1, 1980 and was entitled "On Holy Wars and a Plea for Peace" (google the title to find a copy).
    • PDP-11's were "middle-endian".
    Only sort of. The PDP-11 didn't have instructions to store 32-bit values to memory, so the particular weird "middle-endian" value couldn't possibly apply to the way it stored values in memory. It stored 16-bit values to memory in the usual little-endian manner. It could do 32-bit arithmetic, storing the values in pairs of 16-bit CPU registers (not memory). The most significant word went into the lower numbered register, but within each register values were stored little-endian. So that could be viewed as "middle-endian", but in a sense that could only matter to assembly language programmers and compiler writers, whose code could never hope to be portable anyway.
