
Sunday, September 18, 2011

IPv4 addressing

Introduction


This section looks at IP addressing, subnet masking, Private and Special addresses. Examples are provided to illustrate the methodology when setting up an IP network addressing scheme. We also look at Wildcard masks and Directed Broadcasts.

IP Address Classes


Unique IP (Internet Protocol) addresses are assigned to each physical connection of a device to a network, therefore if a device (host) has more than one connection to a network or networks, then it will have more than one IP address.

An IP address is represented as four decimal integers, each corresponding to one byte, which means an IP address is 32 bits long, as in the following example:

dotted decimal:  162.      146.      93.       14
binary:          10100010. 10010010. 01011101. 00001110
IP addresses are divided into two parts, a Network ID and a Host ID each of which can be of varying bit lengths but always making 32 bits altogether.

Hint:- Use the Windows calculator to convert binary to decimal and vice versa.
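If you prefer code to a calculator, the following minimal C sketch prints the example address above in binary; it is purely illustrative (the hard-coded array is just the 162.146.93.14 example):

#include <stdio.h>
#include <stdint.h>

/* Print one octet (0-255) as eight binary digits. */
static void print_octet_binary(uint8_t octet)
{
    for (int bit = 7; bit >= 0; bit--)
    {
        putchar(((octet >> bit) & 1) ? '1' : '0');
    }
}

int main(void)
{
    uint8_t addr[4] = { 162, 146, 93, 14 };   /* the 162.146.93.14 example above */

    for (int i = 0; i < 4; i++)
    {
        print_octet_binary(addr[i]);
        putchar((i < 3) ? '.' : '\n');
    }
    return 0;
}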

There are five primary classes of IP addresses, and it is the high-order bits of the address which identify the class, as shown below:

Class     Leading Bits   First Octet   Example Network   Host
Class A   0xxxxxxx       1-127         25.234.45.0       1
Class B   10xxxxxx       128-191       140.250.43.0      1
Class C   110xxxxx       192-223       192.2.3.0         1
Class D   1110xxxx       224-239       232.56.4.0        1
Class E   1111xxxx       240-254       242.5.7.0         1
Class A addresses contain 7 bits in the network portion, giving 2^7 - 2 = 126 possible networks, since all 1's and all 0's are not allowed. Consequently 24 bits remain for the host portion, allowing a total of 2^24 - 2 = 16,777,214 hosts. 127.0.0.0/8 is reserved for loopback purposes, where normally just 127.0.0.1 is used. The address 255.255.255.255 is used as the limited broadcast address and 0.0.0.0 as the default route address, meaning any network. The address 0.0.0.0 is also sometimes used by hosts that have yet to receive an IP address, e.g. a DHCP client awaiting an address from the DHCP server.

Class B addresses contain 14 bits in the network portion, allowing 2^14 = 16,384 possible networks, and 16 bits for the host portion, allowing a total of 2^16 - 2 = 65,534 hosts.

Class C addresses contain 21 bits for the network portion, giving a possible total of 2^21 = 2,097,152 networks, and 8 bits for the host portion, giving a possible 2^8 - 2 = 254 hosts.

Class D addresses are used for multicasting and Class E addresses are used in research.

Historically, a company may have been allocated just one Class A, B or C IP address by the Network Information Centre (NIC). Currently, all Class A addresses have been allocated and most if not all of the Class B addresses have gone. If a company has a number of networks to manage, then the network administrator may wish to subnet the network, that is, create subnet addresses within the scope of the IP address that the administrator has been given.

Subnets


Subnetting Example


A customer has been given an IP address of 128.100.0.0 (a Class B address) for his company. He has specified that he requires 3 separate networks with the maximum possible number of host connections on each network.

The first two octets 128.100 are fixed since these are given by NIC as the Class B address, therefore we have the last two octets to play with. Let us examine the possibilities more closely:
  1. The address given:

     Octet 1    Octet 2    Octet 3    Octet 4
     10000000   01100100   00000000   00000000
     128.       100.       0.         0
  2. We need to create a minimum of 3 different subnets, but not at the expense of the number of host addresses available to us. The following process would seem to give us 4 permutations of subnets. Looking at octet 3 specifically in binary, let us just use the first 2 bits for a subnet address:

     128  64  32  16   8   4   2   1
      1    1   0   0   0   0   0   0

     The possible combinations for the first two bits are:

     11 = 192 -> 128.100.192.0
     10 = 128 -> 128.100.128.0
     01 = 64  -> 128.100.64.0
     00 = 0   -> 128.100.0.0

     However, all 1's and all 0's used to be disallowed for a subnet. These subnets are called the All-Ones Subnet and Subnet Zero. The reason for this was that older software found it difficult to distinguish between the network 128.100.0.0/16 and the all-zeros subnet 128.100.0.0/18, and the same was true of the all-ones subnet. RFC 950 therefore rules out '11' and '00' as usable subnets, so we are left with only two subnet addresses instead of the 3 we require.
  3. Let us try and use an extra bit in octet 3:

     128  64  32  16   8   4   2   1
      1    1   1   0   0   0   0   0

     The possible combinations are now:

     111 = 224 -> 128.100.224.0
     110 = 192 -> 128.100.192.0
     101 = 160 -> 128.100.160.0
     011 = 96  -> 128.100.96.0
     001 = 32  -> 128.100.32.0
     010 = 64  -> 128.100.64.0
     100 = 128 -> 128.100.128.0
     000 = 0   -> 128.100.0.0

     As before, all 1's and all 0's are not permitted for subnets, therefore we are left with 6 possible subnets (2^3 - 2):

     128.100.32.0, 128.100.64.0, 128.100.96.0, 128.100.128.0, 128.100.160.0, 128.100.192.0
  4. This leaves the rest of the bits (from the 16's bit downwards) in octet 3 and all the bits in octet 4 to construct the individual host addresses; the permutations amount to many thousands of hosts, which should be plenty. Below is an example of a host address in subnet 128.100.192.0:

     128.100.194.23

     On first inspection it would appear that address 128.100.194.23 has nothing to do with the subnet 128.100.192.0, so let us look a little more closely at the final two octets of the host address:

     Octet 3 = 194:   128  64  32  16   8   4   2   1
                       1    1   0   0   0   0   1   0

     Octet 4 = 23:    128  64  32  16   8   4   2   1
                       0    0   0   1   0   1   1   1

     As we can see, we are indeed part of the 128.100.192.0 subnet, since it is only the first three bits of octet 3 which are used for the subnet address. All the bits from the 16's bit downwards are allocated to the host address, so the 2's bit just turns octet 3 from decimal 192 to decimal 194. Confusion frequently arises in this situation where the dividing line between the network portion of the IP address and the host portion rests part way through an octet (in this case between the 32's bit and the 16's bit of octet 3). Often it is possible to make the network/host dividing line between octets so that you can easily tell which host address belongs to which subnet.

Routers are used to minimise unnecessary traffic, and when running IP it is important to tell the router which subnet an address belongs to. This is done at configuration time by entering a 'subnet mask'.

Nowadays the All-Zeros and All-Ones subnets are allowed, according to RFC 1878, because modern software understands how to distinguish between these subnets and the main network.

Subnet masks
The subnet mask specifies the portion of the IP address that is going to be used for subnetworks (as opposed to hosts). For every bit position in the IP address that is part of the network ID or subnetwork ID, a '1' is set, and for every bit position in the IP address that is part of the host id portion, a '0' is set. The router uses the boolean AND operation with an incoming IP address to 'lose' the host portion of the IP address i.e. the bits that are '0', and match the network portion with its routing table. From this, the router can determine out of which interface to send the datagram. This means that the 'Don't care bits' are represented by binary 0's whilst the 'Do care bits' are represented by binary 1's.

For our example above, because we used the first three bits in octet 3 for our subnet addressing, the subnet mask would be:

Octet 1    Octet 2    Octet 3    Octet 4
11111111   11111111   11100000   00000000
255.       255.       224.       0
What is important is that the same mask is applied throughout the physical networks that share the same subnet part of the IP address. All devices connected to the networks that compose the subnet must have the same mask.
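To make the masking operation concrete, here is a minimal C sketch of the AND described above; it is only the arithmetic, not router code, and uses the 128.100.194.23 host and 255.255.224.0 mask from this example:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t address = 0x8064C217;   /* 128.100.194.23 */
    uint32_t mask    = 0xFFFFE000;   /* 255.255.224.0  */

    uint32_t network = address & mask;   /* the '0' (don't care) bits drop the host portion */

    printf("Network: %u.%u.%u.%u\n",
           (unsigned)((network >> 24) & 0xFF), (unsigned)((network >> 16) & 0xFF),
           (unsigned)((network >> 8) & 0xFF),  (unsigned)(network & 0xFF));
    return 0;
}

Running this prints 128.100.192.0, confirming which subnet the host belongs to.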

A Broadcast Address for a subnet is formed when all 1's are used in the host portion of the IP address. For example, for the IP address 10.17.20.4 and a mask of 255.255.255.0, the subnet is 10.17.20.0 and the host id is 4. The broadcast address within the 10.17.20.0 subnet is when the host id portion of the address is made up of all binary 1's. In this example the host portion is the last octet, and if these 8 bits are set to 1 we have a broadcast address of 10.17.20.255. You can ping this, send messages to this and so on: a single address to serve a multitude of end stations.

Often you will see the network mask represented as a number of bits e.g. for the above example address of 10.17.20.4 with a mask of 255.255.255.0, this can also be represented as 10.17.20.4/24, where the 24 represents 24 bits (3 octets) set to 1.
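Going the other way, from a prefix length such as /24 back to a dotted-decimal mask, is a single shift. A small illustrative C sketch (the function name prefix_to_mask is just an assumption for the example):

#include <stdio.h>
#include <stdint.h>

/* Convert a prefix length such as 24 (the /24 above) into a 32-bit mask. */
static uint32_t prefix_to_mask(unsigned int prefixLen)
{
    return (prefixLen == 0) ? 0 : (0xFFFFFFFFu << (32 - prefixLen));
}

int main(void)
{
    uint32_t mask = prefix_to_mask(24);

    printf("/24 = %u.%u.%u.%u\n",
           (unsigned)((mask >> 24) & 0xFF), (unsigned)((mask >> 16) & 0xFF),
           (unsigned)((mask >> 8) & 0xFF),  (unsigned)(mask & 0xFF));
    return 0;
}

This prints /24 = 255.255.255.0, matching the mask used in the example above.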

Another Subnetting Example


Study the schematic below:

[Schematic: Subnets]

The network drawing above shows the IP address map for a WAN installation carried out for a large financial institution. The customer had installed 'Windows NT' servers at a number of sites and required an ISDN link, star-wired out from the main office server room to each of the sites. The IP addressing scheme had to take into account the following factors:-
  • Up to 30 more sites may be added to the WAN in the near future.
  • Each site could have up to 50 host connections.
  • The customer had already assigned IP addresses to some of the servers and site PCs on the local LANs.
The IP address given to this company was 146.162.0.0 (which is a Class B address), and the decision was made to use the whole of octet 3 for the subnet addresses, leaving octet 4 for the host addresses. This made assigning IP addresses easier to carry out and gave a maximum of 254 hosts per subnet and a maximum of 254 subnets, thus satisfying the customer's requirements. The subnet mask for each subnet (whether LAN or WAN) was consequently 255.255.255.0. It is important to design the addressing scheme such that the subnet mask is common to all LANs/WANs throughout the network unless a routing protocol such as OSPF is to be used, since OSPF allows variable-length subnet masking.

Whilst studying the schematic you will note that the WAN links are 146.162.90.0 to 146.162.94.0 and the router ISDN interfaces are .20 at the main office end and .10 at the remote office end. Also you will note that the server IP addresses are all .5 and the ethernet hubs are all .8 while the router ethernet interfaces are all .6. Organising addressing like this can make life much easier especially when you are hopping from site to site.

RFC 950 and RFC 1812 describe IP subnetting, whereas RFC 1009 defines Variable Length Subnet Masking.

Quick tricks to find subnets and broadcast addresses


If you have a subnet mask, then it is possible to quickly list out the possible subnets and broadcast addresses.

The number by which subnets increment for a given mask is calculated by subtracting the last numbered octet in decimal from 256. For example, given the subnet 10.1.0.0 255.255.248.0, the last numbered octet is 248, therefore 256 - 248 = 8, so subnets jump up in 8's i.e. 10.1.8.0, 10.1.16.0, 10.1.24.0 etc.

Once you have found out by how much subnets jump, finding the broadcast address for each subnet is quickly done by subtracting 1 from the increment and adding the result to each subnet. Using the above example for subnet 10.1.8.0: the subnets jump in 8's, 8 - 1 = 7 and 8 + 7 = 15, so, taking it as given that the final octet will be all 1's for the broadcast, the broadcast address is 10.1.15.255.
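The same trick is easy to code up. The sketch below is plain illustrative C (not taken from any router) and lists the 255.255.248.0 subnets of 10.1.0.0 and their broadcast addresses using exactly the 256 - 248 = 8 increment described above:

#include <stdio.h>

int main(void)
{
    /* Mask 255.255.248.0 from the example above: the interesting octet is 248. */
    int maskOctet = 248;
    int step = 256 - maskOctet;                /* subnets jump in steps of 8 */

    for (int subnet = 0; subnet < 256; subnet += step)
    {
        printf("subnet 10.1.%d.0   broadcast 10.1.%d.255\n", subnet, subnet + step - 1);
    }
    return 0;
}

The first few lines of output are 10.1.0.0 / 10.1.7.255, 10.1.8.0 / 10.1.15.255, 10.1.16.0 / 10.1.23.255, and so on.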

Wildcard Masks


You will often come across Wildcard masks, particularly if you work with OSPF and/or Cisco routers. The use of wildcard masks is most prevalent when building Access Control Lists (ACLs) on Cisco routers. ACLs are filters and make use of wildcard masks to define the scope of the address filter. Although ACL wildcard masks are used with other protocols, we will concentrate on IP here.

Let us first take a simple example. We may want to filter a sub-network 10.1.1.0 which has a Class C mask (24-bit) 255.255.255.0. The ACL will require the scope of the addresses to be defined by a wildcard mask which, in this example is 0.0.0.255. This means that the 'Don't care bits' are represented by binary 1's whilst the 'Do care bits' are represented by binary 0's. You will note that this is the exact opposite to subnet masks!

Taking a more complex example. Say we wish to filter out a subnet which is given by 10.1.1.32 having a mask of 255.255.255.224 i.e. 10.1.1.32/27. How do we find the wildcard mask for this? Well to help us, concentrating on the 4th octet, let us first look at the binary for this network and subnet mask. Then we reverse the binary bits to get the wildcard bits and then convert back to decimal to obtain the wildcard mask for the 4th octet:

4th octet in decimal:            32
4th octet in binary:             0 0 1 0 0 0 0 0
4th octet mask in decimal:       224
4th octet mask in binary:        1 1 1 0 0 0 0 0
4th octet wildcard in binary:    0 0 0 1 1 1 1 1
4th octet wildcard in decimal:   31

The wildcard bits are the bit-for-bit inverse of the mask bits, and this shows that the wildcard mask for the network 10.1.1.32/27 is 0.0.0.31.

The following table should help in seeing a pattern between the number of bits used for the mask in a particular octet, the subnet mask in decimal and the equivalent wildcard mask:

No. of Network Bits Set to 1:   0          1          2          3          4          5          6          7          8
Subnet Mask (binary):           00000000   10000000   11000000   11100000   11110000   11111000   11111100   11111110   11111111
Subnet Mask (decimal):          0          128        192        224        240        248        252        254        255
Wildcard Mask (binary):         11111111   01111111   00111111   00011111   00001111   00000111   00000011   00000001   00000000
Wildcard Mask (decimal):        255        127        63         31         15         7          3          1          0

The binary for the wildcard mask is the exact reverse, bit for bit, of the subnet mask. You then calculate the decimal from the reversed binary bits to obtain the dotted decimal wildcard mask.
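Since the wildcard mask is just the bitwise inverse of the subnet mask, it can be computed with a single NOT. A minimal C sketch, using the /27 mask from the example above:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t mask     = 0xFFFFFFE0;   /* 255.255.255.224, the /27 mask from the example */
    uint32_t wildcard = ~mask;        /* bit-for-bit inverse */

    printf("Wildcard: %u.%u.%u.%u\n",
           (unsigned)((wildcard >> 24) & 0xFF), (unsigned)((wildcard >> 16) & 0xFF),
           (unsigned)((wildcard >> 8) & 0xFF),  (unsigned)(wildcard & 0xFF));
    return 0;
}

This prints 0.0.0.31, the wildcard mask derived by hand above.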

Private Addresses


One of the ways to combat the fast reduction in available IP address space was to introduce the concept of private addresses and the use of Network Address Translator (NAT) to allow many organisations to use the same address space but not have this space visible on the Internet i.e. to use address translation on the edge of the networks.

The Class A network address range 10.0.0.0 to 10.255.255.255 (10.0.0.0/8) is designated for private use only. This address range cannot be used on the Internet as every ISP will automatically drop the address. This address is becoming very popular as its use in conjunction with Network Address Translation (NAT) has meant that large corporations can make use of the Class A address space available within 10.0.0.0 for their own private use internally and just use NAT for those relatively few addresses that do need to operate on the Internet. This is one reason why the immediate need for IP version 6 has been diminished.

There is also the private address range 172.16.0.0 to 172.31.255.255 (172.16.0.0/12) which is the CIDR block of 16 x Class B addresses 172.16.0.0, 172.17.0.0, .... ,172.31.0.0.

The network address range 192.168.0.0 to 192.168.255.255 (192.168.0.0/16) is also for private use and is a CIDR block of 256 x Class C addresses 192.168.0.0, 192.168.1.0, .... ,192.168.255.0.

Examine RFC 1918 for more information on address allocation for private networks.

Other Special addresses


The address range 0.0.0.0/8 is currently considered throughout the Internet as for special use. Note that this is different from the host address 0.0.0.0/32 which means 'default'. You can have legitimate addresses in the range 0.0.0.0/16, e.g. 0.0.123.95/16.

The address range 192.0.2.0/24 is called the Test Net and is reserved for use in testing examples and documentation.

The address range 169.254.0.0/16 is used for auto-configuration of IP addresses if a DHCP server should fail and there is no backup for the DHCP Clients. This is described in RFC 2563 Stateless Auto-configuration.

Directed Broadcasts


RFC 1812 sets out the requirements for routers running IPv4. One of the requirements is that routers MUST, by default, accept Directed Broadcasts (although it is allowable to have a switch that turns this off). A directed broadcast is one where the IP broadcast has been sent to a destination prefix (a net or subnet). A directed broadcast destined for the network 10.20.20.0/24 would be 10.20.20.255, for example.
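Numerically, the directed broadcast address is simply the network address with every host bit set, i.e. network OR NOT(mask). A small illustrative C sketch using the 10.20.20.0/24 example:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t network = 0x0A141400;    /* 10.20.20.0 */
    uint32_t mask    = 0xFFFFFF00;    /* /24        */

    uint32_t directed = network | ~mask;   /* set every host bit to 1 */

    printf("Directed broadcast: %u.%u.%u.%u\n",
           (unsigned)((directed >> 24) & 0xFF), (unsigned)((directed >> 16) & 0xFF),
           (unsigned)((directed >> 8) & 0xFF),  (unsigned)(directed & 0xFF));
    return 0;
}

As expected, this prints 10.20.20.255.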

C's volatile usage

Have you experienced any of the following in your C or C++ embedded code?

  • Code that works fine--until you enable compiler optimizations
  • Code that works fine--until interrupts are enabled
  • Flaky hardware drivers
  • RTOS tasks that work fine in isolation--until some other task is spawned

If you answered yes to any of the above, it's likely that you didn't use the C keyword volatile. You aren't alone. The use of volatile is poorly understood by many programmers. Unfortunately, most books about the C programming language dismiss volatile in a sentence or two.

C's volatile keyword is a qualifier that is applied to a variable when it is declared. It tells the compiler that the value of the variable may change at any time--without any action being taken by the code the compiler finds nearby. The implications of this are quite serious. However, before we examine them, let's take a look at the syntax.

volatile keyword syntax

To declare a variable volatile, include the keyword volatile before or after the data type in the variable definition. For instance both of these declarations will declare foo to be a volatile integer:

volatile int foo;
int volatile foo;

Now, it turns out that pointers to volatile variables are very common, especially with memory-mapped I/O registers. Both of these declarations declare pReg to be a pointer to a volatile unsigned 8-bit integer:

volatile uint8_t * pReg;
uint8_t volatile * pReg;

Volatile pointers to non-volatile data are very rare (I think I've used them once), but I'd better go ahead and give you the syntax:

int * volatile p; 

And just for completeness, if you really must have a volatile pointer to a volatile variable, you'd write:

int volatile * volatile p; 

Incidentally, for a great explanation of why you have a choice of where to place volatile and why you should place it after the data type (for example, int volatile * foo), read Dan Saks' column "Top-Level cv-Qualifiers in Function Parameters" (Embedded Systems Programming, February 2000, p. 63).

Finally, if you apply volatile to a struct or union, the entire contents of the struct/union are volatile. If you don't want this behavior, you can apply the volatile qualifier to the individual members of the struct/union.
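The sketch below shows both options; the register block, its layout, and the address 0x40001000 are purely hypothetical and only illustrate where the qualifier goes:

#include <stdint.h>

/* A hypothetical two-register peripheral block; the layout and the
 * address 0x40001000 are made up purely for illustration. */
typedef struct
{
    uint8_t status;
    uint8_t data;
} UartRegs;

/* Qualify the whole struct: every member access through pUart is volatile. */
UartRegs volatile * const pUart = (UartRegs volatile *) 0x40001000;

/* Or qualify only the members that really can change behind the code's back. */
typedef struct
{
    uint8_t volatile status;   /* updated asynchronously by hardware or an ISR */
    uint8_t          shadow;   /* ordinary RAM bookkeeping                     */
} UartState;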

Proper use of volatile

A variable should be declared volatile whenever its value could change unexpectedly. In practice, only three types of variables could change:

1. Memory-mapped peripheral registers

2. Global variables modified by an interrupt service routine

3. Global variables accessed by multiple tasks within a multi-threaded application

We'll talk about each of these cases in the sections that follow.

Peripheral registers

Embedded systems contain real hardware, usually with sophisticated peripherals. These peripherals contain registers whose values may change asynchronously to the program flow. As a very simple example, consider an 8-bit status register that is memory mapped at address 0x1234. It is required that you poll the status register until it becomes non-zero. The naive and incorrect implementation is as follows:

uint8_t * pReg = (uint8_t *) 0x1234;

// Wait for register to become non-zero
while (*pReg == 0) { }

// Do something else

This will almost certainly fail as soon as you turn compiler optimization on, since the compiler will generate assembly language that looks something like this:

       mov   ptr, #0x1234
       mov   a, @ptr
loop:  bz    loop

The rationale of the optimizer is quite simple: having already read the variable's value into the accumulator (on the second line of assembly), there is no need to reread it, since the value will always be the same. Thus, in the third line, we end up with an infinite loop. To force the compiler to do what we want, we modify the declaration to:

uint8_t volatile * pReg = (uint8_t volatile *) 0x1234; 

The assembly language now looks like this:

       mov   ptr, #0x1234
loop:  mov   a, @ptr
       bz    loop

The desired behavior is achieved.

Subtler problems tend to arise with registers that have special properties. For instance, a lot of peripherals contain registers that are cleared simply by reading them. Extra (or fewer) reads than you are intending can cause quite unexpected results in these cases.

Interrupt service routines

Interrupt service routines often set variables that are tested in mainline code. For example, a serial port interrupt may test each received character to see if it is an ETX character (presumably signifying the end of a message). If the character is an ETX, the ISR might set a global flag. An incorrect implementation of this might be:
int etx_rcvd = FALSE;

void main()
{
    ...
    while (!etx_rcvd)
    {
        // Wait
    }
    ...
}

interrupt void rx_isr(void)
{
    ...
    if (ETX == rx_char)
    {
        etx_rcvd = TRUE;
    }
    ...
}

With compiler optimization turned off, this code might work. However, any half decent optimizer will "break" the code. The problem is that the compiler has no idea that etx_rcvd can be changed within an ISR. As far as the compiler is concerned, the expression !etx_rcvd is always true, and, therefore, you can never exit the while loop. Consequently, all the code after the while loop may simply be removed by the optimizer. If you are lucky, your compiler will warn you about this. If you are unlucky (or you haven't yet learned to take compiler warnings seriously), your code will fail miserably. Naturally, the blame will be placed on a "lousy optimizer."

The solution is to declare the variable etx_rcvd to be volatile. Then all of your problems (well, some of them anyway) will disappear.

Multi-threaded applications

Despite the presence of queues, pipes, and other scheduler-aware communications mechanisms in real-time operating systems, it is still fairly common for two tasks to exchange information via a shared memory location (that is, a global). Even as you add a preemptive scheduler to your code, your compiler has no idea what a context switch is or when one might occur. Thus, another task modifying a shared global is conceptually identical to the problem of interrupt service routines discussed previously. So all shared global variables should be declared volatile. For example, this is asking for trouble:

int cntr;

void task1(void)
{
    cntr = 0;

    while (cntr == 0)
    {
        sleep(1);
    }
    ...
}

void task2(void)
{
    ...
    cntr++;
    sleep(10);
    ...
}

This code will likely fail once the compiler's optimizer is enabled. Declaring cntr to be volatile is the proper way to solve the problem.

How offsetof() works

The offsetof() macro is an ANSI-required macro that should be found in stddef.h. Simply put, the offsetof() macro returns the number of bytes of offset before a particular element of a struct or union.

The declaration of the macro varies from vendor to vendor and depends upon the processor architecture. Browsing through the compilers on my computer, I found the example declarations shown in Listing 1. As you can see, the definition of the macro can get complicated.

// Keil 8051 compiler
#define offsetof(s,m)     (size_t)&(((s *)0)->m)

// Microsoft x86 compiler (version 7)
#define offsetof(s,m)     (size_t)(unsigned long)&(((s *)0)->m)

// Diab Coldfire compiler
#define offsetof(s,memb)  ((size_t)((char *)&((s *)0)->memb - (char *)0))

Listing 1. A representative set of offsetof() macro definitions

Regardless of the implementation, the offsetof() macro takes two parameters. The first parameter is the structure name; the second, the name of the structure element. (I apologize for using a term as vague as "structure name." I'll refine this shortly.) A straightforward use of the macro is shown in Listing 2.

#include <stddef.h>
#include <stdio.h>

typedef struct
{
    int   i;
    float f;
    char  c;
} SFOO;

void main(void)
{
    printf("Offset of 'f' is %u", offsetof(SFOO, f));
}

Listing 2. A straightforward use of offsetof()

To better understand the magic of the offsetof() macro, consider the details of Keil's definition. The various operators within the macro are evaluated in an order such that the following steps are performed:

  • ((s *)0): takes the integer zero and casts it as a pointer to s.
  • ((s *)0)->m: dereferences that pointer to point to structure member m.
  • &(((s *)0)->m): computes the address of m.
  • (size_t)&(((s *)0)->m): casts the result to an appropriate data type.

By definition, the structure itself resides at address 0. It follows that the address of the field pointed to (Step 3 above) must be the offset, in bytes, from the start of the structure. At this point, we can make several observations:

  • We can be a bit more specific about the term "structure name." In a nutshell, if the structure name you use, call it s, results in a valid C expression when written as ((s *)0)->m, you can use s in the offsetof() macro. The examples shown in Listings 3 and 4 will help clarify that point.
  • The member expression, m, can be of arbitrary complexity; indeed, if you have nested structures, then the member field can be an expression that resolves to a parameter deeply nested within a structure
  • It's easy enough to see why this macro also works with unions
  • The macro won't work with bitfields, as you can't take the address of a bitfield member of a structure or union

Listings 3 and 4 contain simple variations on the usage of this macro. These should help you get comfortable with the offsetof() basics.

#include <stddef.h>
#include <stdio.h>

struct sfoo
{
    int   i;
    float f;
    char  c;
};

void main(void)
{
    printf("Offset of 'f' is %u", offsetof(struct sfoo, f));
}

Listing 3. A struct without a typedef

#include <stddef.h>
#include <stdio.h>

typedef struct
{
    long  l;
    short s;
} SBAR;

typedef struct
{
    int   i;
    float f;
    SBAR  b;
} SFOO;

void main(void)
{
    printf("Offset of 'l' is %u", offsetof(SFOO, b.l));
}

Listing 4. Nested structs

Now that you understand the semantics of the macro, it's time to take a look at a few use examples.

Struct padding bytes

Most 16-bit and larger processors require that data structures in memory be aligned on a multibyte (for example, 16-bit or 32-bit) boundary. Sometimes the requirement is absolute, and sometimes it's merely recommended for optimal bus throughput. In the latter case, the flexibility is offered because the designers recognized that you may wish to trade off memory access time with other competing issues such as memory size and the ability to transfer (perhaps via a communications link or direct memory access) the memory contents directly to another processor that has a different alignment requirement.

For cases such as these, it's often necessary to resort to compiler directives to achieve the required level of packing. As the C structure declarations can be quite complex, working out how to achieve this can be daunting. Furthermore, after poring over the compiler manuals, I'm always left with a slight sense of unease about whether I've really achieved what I set out to do.

The most straightforward solution to this problem is to write a small piece of test code. For instance, consider the moderately complex declaration given in Listing 5.

#include <stddef.h>
#include <stdio.h>

typedef union
{
    int   i;
    float f;
    char  c;
    struct
    {
        float  g;
        double h;
    } b;
} UFOO;

void main(void)
{
    printf("Offset of 'h' is %u", offsetof(UFOO, b.h));
}

Listing 5. A union containing a struct

If you need to know where field b.h resides in the structure, then the simplest way to find out is to write some test code such as that shown in Listing 5.

This is all well and good, but what about portability? Writing code that relies on offsets into structures can be risky—particularly if the code gets ported to a new target at a later date. Adding a comment is of course a good idea—but what one really needs is a means of forcing the compiler to tell you if the critical members of a structure are in the wrong place. Fortunately, one can do this using the offsetof() macro and the technique in Listing 6.

typedef union
{
    int   i;
    float f;
    char  c;
    struct
    {
        float  g;
        double h;
    } b;
} UFOO;

static union
{
    char wrong_offset_i[offsetof(UFOO, i) == 0];
    char wrong_offset_f[offsetof(UFOO, f) == 0];
    ...
    char wrong_offset_h[offsetof(UFOO, b.h) == 2];   // Error
};

Listing 6. An anonymous union to check struct offsets

The technique works by attempting to declare a union of one-char arrays. If any test evaluates to false, its array will be of zero size, and a compiler error will result. One compiler I tested generated the specific error "Invalid dimension size [0]" on the line defining array wrong_offset_h[].
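The same trick can be wrapped in a general-purpose macro so that any compile-time assumption can be checked. The sketch below is a generalisation of the listing above, not part of the original article; the macro name CT_ASSERT is purely illustrative, it reuses the UFOO type from Listing 6, and it uses a negative array size so that a failed check is an error on every compiler:

#include <stddef.h>

/* Fails to compile when expr is false, because the array size becomes -1. */
#define CT_ASSERT(expr)  typedef char ct_assert_failed[(expr) ? 1 : -1]

CT_ASSERT(offsetof(UFOO, f) == 0);          /* true for a union member, so this builds    */
/* CT_ASSERT(offsetof(UFOO, b.h) == 2); */  /* would stop the build, just as in Listing 6 */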

Thus the offsetof() macro can be used both to determine and to validate the packing of elements within C structs.

Nonvolatile memory layouts

Many embedded systems contain some form of nonvolatile memory, which holds configuration parameters and other device-specific information. One of the most common types of nonvolatile memory is serial EEPROM. Normally, such memories are byte addressable. The result is often a serial EEPROM driver that provides an API that includes a read function that looks like this:

ee_rd(uint16_t offset, uint16_t nBytes, uint8_t * dest)

In other words, read nBytes from offset offset in the EEPROM and store them at dest. The problem is knowing what offset in EEPROM to read from and how many bytes to read (in other words, the underlying size of the variable being read). The most common solution to this problem is to declare a data structure that represents the EEPROM and then declare a pointer to that struct and assign it to address 0x0000000. This technique is shown in Listing 7.

typedef struct
{
    int   i;
    float f;
    char  c;
} EEPROM;

EEPROM * const pEE = 0x0000000;

ee_rd(&(pEE->f), sizeof(pEE->f), dest);

Listing 7. Accessing data in serial EEPROM via a pointer

This technique has been in use for years. However, I dislike it precisely because it does create an actual pointer to a variable that supposedly resides at address 0. In my experience, this can create a number of problems including:

  • Someone maintaining the code can get confused into thinking that the EEPROM data structure really does exist
  • You can write perfectly legal code (for example, pEE->f = 3.2) and get no compiler warnings that what you're doing is disastrous
  • The code doesn't describe the underlying hardware well

A far better approach is to use the offsetof() macro. An example is shown in Listing 8.

typedef struct
{
    int   i;
    float f;
    char  c;
} EEPROM;

ee_rd(offsetof(EEPROM, f), 4, dest);

Listing 8. Use offsetof() to access data stored in serial EEPROM

However, there's still a bit of a problem. The size of the parameter has been entered manually (in this case "4"). It would be a lot better if we could have the compiler work out the size of the parameter as well. No problem, you say, just use the sizeof() operator. However, the sizeof() operator doesn't work the way we would like it to. That is, we cannot write sizeof(EEPROM.f) because EEPROM is a data type and not a variable.

Normally, one always has at least one instance of a data type so that this is not a problem. In our case, the data type EEPROM is nothing more than a template for describing how data are stored in the serial EEPROM. So, how can we use the sizeof() operator in this case? Well, we can simply use the same technique used to define the offsetof() macro. Consider the definition:

#define SIZEOF(s,m) ((size_t) sizeof(((s *)0)->m))

This looks a lot like the offsetof() definitions in Listing 1. We take the value 0 and cast it to "pointer to s." This gives us a variable to point to. We then point to the member we want and apply the regular sizeof() operator. The net result is that we can get the size of any member of a typedef without having to actually declare a variable of that data type.

Thus, we can now refine our read from the serial EEPROM as follows:

ee_rd(offsetof(EEPROM, f), SIZEOF(EEPROM, f), &dest);

At this point, we're using two macros in the function call, with both macros taking the same two parameters. This leads to an obvious refinement that cuts down on typing and errors:

#define EE_RD(M,D) \
    ee_rd(offsetof(EEPROM,M), SIZEOF(EEPROM,M), D)

Now our call to the EEPROM driver becomes much more intuitive:

EE_RD(f, &dest);

That is, read f from the EEPROM and store its contents at location dest. The location and size of the parameter is handled automatically by the compiler, resulting in a clean, robust interface.

Protect nonvolatile memory

Many embedded systems contain directly addressable nonvolatile memory, such as battery-backed SRAM. It's usually important to detect if the contents of this memory have been corrupted. I usually group the data into a structure, compute a CRC (cyclic redundancy check) over that structure, and append it to the data structure. Thus, I often end up with something like this:

struct nv
{
    short    param_1;
    float    param_2;
    char     param_3;
    uint16_t crc;
} nvram;

The intent of the crc field is that it contain a CRC of all the parameters in the data structure with the exception of itself. This seems reasonable enough. Thus, the question is, how does one compute the CRC? If we assume we have a function, crc16( ), that computes the CRC-16 over an array of bytes, then we might be tempted to use the following:

nvram.crc =      crc16((char *) &nvram, sizeof(nvram)-sizeof(nvram.crc)); 

This code will only definitely work with compilers that pack all data on byte boundaries. For compilers that don't do this, the code will almost certainly fail. To see that this is the case, let's look at this example structure for a compiler that aligns everything on a 32-bit boundary. The effective structure could look like that in Listing 9.

struct nv
{
    short    param_1;   // offset = 0
    char     pad1[2];   // 2 byte pad
    float    param_2;   // offset = 4
    char     param_3;   // offset = 8
    char     pad2[3];   // 3 byte pad
    uint16_t crc;       // offset = 12
    char     pad3[2];   // 2 byte pad
} nvram;

Listing 9. An example struct for a compiler that aligns everything on a 32-bit boundary

The first two pads are expected. However, why is the compiler adding two bytes onto the end of the structure? It does this because it has to handle the case when you declare an array of such structures. Arrays are required to be contiguous in memory, too. So to meet this requirement and to maintain alignment, the compiler pads the structure out as shown.

On this basis, we can see that sizeof(nvram) is 16 bytes. Now our naive CRC computation above covers sizeof(nvram) - sizeof(nvram.crc) = 16 - 2 = 14 bytes. Thus the stored crc field now includes its own previous value in its computation! We certainly haven't achieved what we set out to do.

struct nv
{
    struct data
    {
        short param_1;  // offset = 0
        float param_2;  // offset = 4
        char  param_3;  // offset = 8
    } data;
    uint16_t crc;       // offset = 12
} nvram;

Listing 10. Nested data structures

The most common solution to this problem is to nest data structures as shown in Listing 10. Now we can compute the CRC using:

nvram.crc =      crc16((uint8_t *) &nvram.data, sizeof(nvram.data)); 

This works well and is indeed the technique I've used over the years. However, it introduces an extra level of nesting within the structure—purely to overcome an artifact of the compiler. Another alternative is to place the CRC at the top of the structure. This overcomes most of the problems but feels unnatural to many people. On the basis that structures should always reflect the underlying system, a technique that doesn't rely on artifice is preferable—and that technique is to use the offsetof() macro.

Using the offsetof() macro, one can simply use the following (assuming the original structure definition):

nvram.crc =      crc16((uint8_t *) &nvram, offsetof(struct nv, crc));

How to detect memory leaks

Most programmers use the heap via the malloc() and free() functions provided by their compiler's library, so heap fragmentation is largely outside their control. Fragmentation, however, is very different from a memory leak.

A memory leak is a block of memory that was allocated, but will never be freed. A leak is always a bug in the application. If all pointers to that block have gone out of scope or were assigned to point elsewhere, the application will never be able to free that piece of memory. A small leak may be acceptable in a desktop application that will exit at some point, since an exiting process returns all memory to the operating system. But in a long running embedded system, it is often necessary to guarantee that absolutely no leaks exist—a non-trivial challenge.

In this article, we'll discuss the nature of memory leaks, review some commercial tools for tracking memory allocation, and look at ways to develop your own tools to track and analyze memory usage in embedded C and C++ programs.

The trouble with memory leaks

Avoiding leaks is difficult. To ensure that all allocated memory is subsequently freed, we must establish a clear set of rules about who owns the memory. To track memory, we could use a class, an array of pointers, or a linked list. Lists are commonly used since, by the nature of dynamic memory allocation, a programmer doesn't know ahead of time how many blocks will be allocated at any given time.

For example, imagine that one task is receiving messages from a communications channel. It allocates space for the message and that space will not be freed until the message is completely processed. Since the messages may not be processed in the order in which they were received, some may live longer than others. All pending messages live on a list whose length varies according to the number of messages being processed at any given time. Let's say that this embedded system must forward the messages to another device, and the message cannot be deleted until delivery is confirmed. Since the messages are going to many different destinations, and some destinations may have some down-time, causing retries, it would not be feasible to process the messages in first-in-first-out manner.

In problem areas such as these, dynamic memory management enables more efficient use of the available RAM than a predefined number of buffers. When the memory is not being used by one message queue, it can be used by another queue, or by a completely different part of the program.

One particular problem often arises when there is more than one pointer to a specific block. If the first entity owns the memory and wants to free it, then it must first consider whether any other pointers point at that location. If any do, and the first entity frees the memory, those other pointers become dangling pointers—pointers that point to space that is no longer valid. When the dangling pointer is used, you may happen to get the correct data, but eventually the memory will be reused (via another call to malloc()), leading to unpleasant interactions between the dangling pointer and the new owner of that piece of memory. A dangling pointer is the opposite of a leak.

A leak occurs when you fail to free something; a dangling pointer occurs when you free something that was not yet ready to be freed.

Memory leaks are similar to race conditions in a number of ways. The misbehavior they cause may occur far from where the bug was caused. As a result, these problems are difficult to resolve by stepping through the code with a debugger. For both memory leaks and race conditions, code inspections sometimes catch these problems more quickly than any technical solution.

Adding debug code to generate output is often a better alternative than a source code debugger, but in the case of race conditions, it could alter the behavior enough to disguise the problem. With memory leaks, adding debug code can change the shape of the memory layout, which means that dangling pointer bugs may exhibit different behavior. Another drawback is that if the debugging code consumes memory, you may run out of RAM sooner in the debug version than you would in the production version. Still, a leak is a leak and should remain detectable regardless of these side effects of the debug code.
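As a concrete example of such debug code, here is a minimal sketch in C of the list-based tracking idea mentioned earlier. All of the names (debug_malloc, debug_free, debug_report_leaks, the record layout) are illustrative rather than any particular commercial tool's API; the wrapper simply records each outstanding allocation on a linked list so that whatever remains on the list at shutdown can be reported as a candidate leak:

#include <stdio.h>
#include <stdlib.h>

/* One record per outstanding allocation (all names here are illustrative). */
typedef struct AllocRecord
{
    void               *ptr;
    size_t              size;
    const char         *file;
    int                 line;
    struct AllocRecord *next;
} AllocRecord;

static AllocRecord *allocList = NULL;

void *debug_malloc(size_t size, const char *file, int line)
{
    void *p = malloc(size);

    if (p != NULL)
    {
        AllocRecord *rec = malloc(sizeof(*rec));

        if (rec != NULL)
        {
            rec->ptr  = p;
            rec->size = size;
            rec->file = file;
            rec->line = line;
            rec->next = allocList;
            allocList = rec;
        }
    }
    return p;
}

void debug_free(void *p)
{
    AllocRecord **link;

    for (link = &allocList; *link != NULL; link = &(*link)->next)
    {
        if ((*link)->ptr == p)
        {
            AllocRecord *rec = *link;
            *link = rec->next;          /* unlink the record */
            free(rec);
            break;
        }
    }
    free(p);
}

/* Anything still on the list at shutdown is a candidate leak. */
void debug_report_leaks(void)
{
    AllocRecord *rec;

    for (rec = allocList; rec != NULL; rec = rec->next)
    {
        printf("LEAK: %lu bytes allocated at %s:%d\n",
               (unsigned long) rec->size, rec->file, rec->line);
    }
}

#define MALLOC(n)  debug_malloc((n), __FILE__, __LINE__)
#define FREE(p)    debug_free(p)

A real tool would also protect the list against concurrent access from multiple tasks and account for the RAM the records themselves consume, which is exactly the debug-code overhead discussed above.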

Automatic memory management

Some programming languages provide intrinsic memory management capabilities that can reduce the possibility of memory leaks. Java, for example, has automatic memory management in the form of garbage collection. Java programmers do not have to worry about freeing allocated memory. If you ever program in Java, you will appreciate just how much time and effort other languages require to keep track of memory.

In Java, you pay a runtime cost in exchange for programming simplicity. This is because manually managing the memory produces a more efficient implementation. When programs become large, however, manual management breaks down. While good manual management will always minimize total heap size, it might not always be easy to implement. In a large program that involves dozens of programmers, human error will introduce enough leaks to tip the performance scales in favor of an automatic solution.

I read a report years ago about how bar code readers in supermarkets identified the wrong product 1% of the time. The article seemed to imply that this meant that the supermarkets were ripping us off. What the article failed to point out was a comparison with the alternative, where the person on the till has to manually type in the price of each item, which, I am guessing, would have had a higher error rate. The other consideration is that shop assistants, like programmers, vary widely in their skill levels. If the automatic system has a measurable error level, then at least management can allow for it in their plans. If the volume of customers and groceries is high, the supermarket and the customer are better off in the long run with an automatic system.

A similar logic applies to automatic garbage collection. An excellent individual programmer will do far better than an automatic collector, but in a large project, with a large number of programmers, it may be impossible to find and plug all of the leaks. Choosing an automatic system may mean that you compromise performance. You also have to accept that the automatic collector will occasionally mess up. The article "Java Q&A: How Do You Plug Java Memory Leaks?" includes a list of ways to trick Java garbage collectors.

how to calculate IP header checksum

Establishing correctness is more difficult for computers than humans. At the lowest level, communication between computers consists of nothing but a stream of binary digits. Meaning is only assigned to that particular sequence of bits at higher levels. We call that meaningful sequence of bits the message; it is analogous to a spoken or written phrase. If one or more bits within the message are inverted (a logic one becomes a logic zero, or vice versa) as it travels between computers, the receiver has no way to detect the error. No environmental or syntactical context is available to the receiver, since it cannot understand the message in its transmitted form.

Achieving parity

If we want communicating computers to detect and correct transmission errors automatically, we must provide a replacement for context. This usually takes the form of an error correction code or error detection code. A simple type of error detection code that you are probably already familiar with is called a parity bit. A parity bit is a single, extra binary digit that is appended to the message by the sender and transmitted along with it. Depending on the type of parity used, the parity bit ensures that the total number of logic ones sent is even (even parity) or odd (odd parity). For an example of even parity, consider the sequence:

10101110 1

in which the eight-bit message contains five ones and three zeros. The parity bit is one in this case to force the total number of ones in the transmitted data stream to be even.

When a parity bit is appended to a message, one additional bit of data must be sent between the computers. So there must be some benefit associated with the lost bandwidth. The most obvious benefit is that if any single bit in the message is inverted during transmission, the number of ones will either increase or decrease by one, thus making the received number of ones odd and the parity incorrect. The receiver can now detect such an error automatically and request a retransmission of the entire stream of bits. Interestingly, the receiver can also now detect any odd number of bit inversions. (Note that all of this still applies even when the parity bit is one of the inverted bits.)
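Computing the parity bit itself takes only a few lines of code. A small illustrative C sketch for even parity, in the spirit of the example above:

#include <stdint.h>

/* Return the even-parity bit for one byte: 1 if the byte contains an odd
 * number of 1's, so that appending the bit makes the total count even. */
uint8_t EvenParityBit(uint8_t b)
{
    uint8_t parity = 0;

    while (b != 0)
    {
        parity ^= (uint8_t)(b & 1);
        b >>= 1;
    }
    return parity;
}

For the example byte 10101110 (0xAE), which contains five 1's, EvenParityBit() returns 1, matching the transmitted parity bit shown above.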

However, as you may have already noticed, if two bits are inverted (two ones become zeros, for example), the parity of the stream of bits will not change; a receiver will not be able to detect such errors. In fact, the same is true for any even number of bit inversions. A parity bit is a weak form of error detection code. It has a small cost (one bit per message), but it is unable to detect many types of possible errors. (By the way, odd parity has the same costs, benefits, and weaknesses as even parity.)

Perhaps this is not acceptable for your application. An alternative that might occur to you is to send the entire message twice. If you're already spending bandwidth sending an error detection code, why not spend half of the bandwidth? The problem is that you could actually be spending much more than half of the available bandwidth. If a computer receives two copies of a message and the bits that comprise them aren't exactly the same, it cannot tell which one of the two is correct. In fact, it may be that both copies contain errors. So the receiver must request a retransmission whenever there is a discrepancy between the message and the copy sent as an error detection code.

With that in mind, let's compare the bandwidth costs of using a parity bit vs. resending the entire message. We'll assume that all messages are 1,000-bits long and that the communications channel has a bit error rate (average number of inverted bits) of one per 10,000 bits sent. Using a parity bit, we'd spend 0.1% of our bandwidth sending the error detection code (one bit of error protection for 1,000 bits of message) and have to retransmit one out of every 10 messages due to errors. If we send two complete copies of each message instead, the smallest unit of transmission is 2,000 bits (50% of our bandwidth is now spent sending the error detection code). In addition, we'll have to retransmit one out of every five messages. Therefore, the achievable bandwidths are approximately 90% and 40%, respectively. As the width of the code increases, the message plus code lengthens and becomes more vulnerable to bit errors and, as a result, expensive retransmissions.

From this type of analysis it should be clear that keeping the size of an error detection code as small as possible is a good thing, even if it does mean that some types of errors will be undetectable. (Note that even sending two copies of the message is not a perfect solution. If the same bit errors should happen to occur in both copies, the errors will not be detected by the receiver.) In addition to reducing the bandwidth cost of transmitting the error detection code itself, this also increases the message throughput. But we still don't want to go so far as using a one-bit error detection scheme like parity. That would let too many possible errors escape detection.

In practice, of course, we can achieve far better error detection capabilities than just odd numbers of inverted bits. But in order to do so we have to use error detection codes that are more complex than simple parity, and also contain more bits. I'll spend the remainder of this column and most of the next two discussing the strengths and weaknesses of various types of checksums, showing you how to compute them, and explaining how each can be best employed for purposes of error detection.

Checksums

As the name implies, a checksum usually involves a summation of one sort or another. For example, one of the most common checksum algorithms involves treating the message like a sequence of bytes and summing them. Listing 1 shows how this simple algorithm might be implemented in C.

uint8_t Sum(uint8_t const message[], int nBytes)
{
    uint8_t sum = 0;

    while (nBytes-- > 0)
    {
        sum += *(message++);
    }

    return (sum);
}   /* Sum() */

Listing 1. A sum-of-bytes checksum

The sum-of-bytes algorithm is straightforward to compute. However, understanding its strengths and weaknesses as a checksum is more difficult. What types of errors does it detect? What errors is it unable to detect? These are important factors to consider whenever you are selecting a checksum algorithm for a particular application. You want the algorithm you choose to be well matched to the types of errors you expect to occur in your transmission environment. For example, a parity bit would be a poor choice for a checksum if bit errors will frequently occur in pairs.

A noteworthy weakness of the sum-of-bytes algorithm is that no error will be detected if the entire message and checksum are somehow received as a string of all zeros. (A message of all zeros is a possibility, and the sum of a large block of zeros is still zero.) The simplest way to overcome this weakness is to add a final step to the checksum algorithm: invert the bits of the final sum. The new proper checksum for a message of all zeros would be FFh. That way, if the message and checksum are both all zeros, the receiver will know that something's gone terribly wrong. A modified version of the checksum implementation is shown in Listing 2.

uint8_t SumAndInvert(uint8_t const message[], int nBytes)
{
    uint8_t sum = 0;

    while (nBytes-- > 0)
    {
        sum += *(message++);
    }

    return (~sum);
}   /* SumAndInvert() */

Listing 2. A slightly more robust sum-of-bytes checksum algorithm

This final inversion does not affect any of the other error detection capabilities of this checksum, so let's go back to discussing the basic sum-of-bytes algorithm in Listing 1. First, it should be obvious that any single bit inversion in the message or checksum will be detectable. Such an error will always affect at least one bit within the checksum. (It could affect more, of course, but not less.) Observe that the sum-of-bytes is performed by essentially lining up all of the bytes that comprise the message and performing addition on them, like this:

  10110101
  11000000
  00001101
+ 11100011
----------
  01100101

Because of this mathematical arrangement, simultaneous single-bit inversions could occur in any number of these "columns" and at least one bit of the checksum would still be affected. No matter how the inversions are distributed, the lowest-order column that contains an error will alter the checksum. This is an important point, because it helps us to understand the broader class of errors that can be detected by this type of checksum. Two or more bit inversions in the same column can cancel each other out, but as long as the lowest-order column containing an error has only one inverted bit, it doesn't matter what other errors may be hidden within the message: the checksum will change and the corruption will be detected.

Now let's step back for a minute and talk about errors more formally. What exactly does a transmission error look like? Well, the first and most obvious type of error is a single bit that is inverted. That happens sometimes and is easy to detect (even a parity bit will do the job). Other times, bit inversions are part of an error burst. Not all of the bits within an error burst must be inverted. The defining characteristic is simply an inverted bit on each end. So an error burst may be 200 bits long, but contain just two inverted bits--one at each end.

A sum-of-bytes checksum will detect the vast majority of error bursts, no matter what their length. However, describing exactly which ones is generally difficult. Only one class of error bursts is always detectable: those of length eight bits or less. As long as the two inverted bits that bound the error burst have no more than six bits between them, the error will always be detected by this algorithm. That's because our previous requirement for error detection--that the lowest-order column with an error have only one error--is always met when the length of the error burst is no longer than the width of the checksum. We can know with certainty that such errors will always be detected.

Of course, many longer error bursts will also be detected. The probability of detecting a random error burst longer than eight bits is 99.6%. Errors will only be overlooked if the modified message has exactly the same sum as the original one, for which there is a 1 in 2^8 (about 0.4%) chance. That's much better error detection than a simple parity bit, and for not too much more cost.

The sum-of-bytes algorithm becomes even more powerful when the width of the checksum is increased. In other words, increasing the number of bits in the checksum causes it to detect even more types of errors. A 16-bit sum-of-words checksum will detect all single bit errors and all error bursts of length 16 bits or fewer. It will also detect 99.998% of longer error bursts. A 32-bit sum will detect even more errors. In practice, this increase in performance must be weighed against the increased cost of sending the extra checksum bits as part of every exchange of data.

Internet checksum

Many protocol stacks include some sort of a checksum within each protocol layer. The TCP/IP suite of protocols is no exception in this regard. In addition to a checksum at the lowest layer (within Ethernet packets, for example), checksums also exist within each IP, UDP, and TCP header. Figure 1 shows what some of these headers look like in the case of some data sent via UDP/IP. Here the fields of the IP header are summed to generate the 16-bit IP checksum and the data, fields of the UDP header, and certain fields of the IP header are summed to generate the 16-bit UDP checksum.

Figure 1. UDP and IP headers with checksum fields highlighted

A function that calculates the IP header checksum is shown in Listing 3. This function can be used before sending an IP packet or immediately after receiving one. If the packet is about to be sent, the checksum field should be set to zero before calculating the checksum and filled with the returned value afterward. If the packet has just been received, the checksum routine is expected to return a value of 0xFFFF to indicate that the IP header was received correctly. This result is a property of the type of addition used to compute the IP checksum.

uint16_t NetIpChecksum(uint16_t const ipHeader[], int nWords)
{
    uint32_t sum = 0;    /* 32 bits wide, so carries out of bit 15 accumulate in the upper half */

    /*
     * IP headers always contain an even number of bytes.
     */
    while (nWords-- > 0)
    {
        sum += *(ipHeader++);
    }

    /*
     * Use carries to compute 1's complement sum.
     */
    sum = (sum >> 16) + (sum & 0xFFFF);
    sum += sum >> 16;

    /*
     * Return the inverted 16-bit result.
     */
    return ((uint16_t) ~sum);
}   /* NetIpChecksum() */

Listing 3. IP header checksum calculation

The IP header to be checksummed should be passed to NetIpChecksum() as an array of 16-bit words. Since the length of an IP header is always a multiple of four bytes, it is sufficient to provide the header length as a number of 16-bit words. The checksum is then computed by the function and returned to the caller, either for insertion into the header of an outgoing packet or for validation of the contents of a received one.

When you first look at this function, you may be overcome with a feeling of deja vu. The IP checksum algorithm begins in much the same way as the sum-of-bytes algorithm I discussed earlier. However, this algorithm is actually different. First, of course, we're computing a 16-bit checksum instead of an eight-bit one, so we're summing words rather than bytes. That difference is obvious. Less obvious is that we're actually computing a ones complement sum.

Recall that most computers store integers in a twos complement representation. Simply adding two integers, as we did in the previous algorithm, will therefore result in a twos complement sum. In order to compute the ones complement sum, we need to perform our addition with "end around carry." This means that carries out of the most significant bit (MSB) are not discarded, as they were previously. Instead, carries are added back into the checksum at the least significant bit (LSB). This could be done after each addition, but testing for a carry after each addition is expensive in C. A faster way is to let the carries accumulate in the upper half of a 32-bit accumulator. Once the sum-of-words is complete, the upper half of the 32-bit accumulator is turned into a 16-bit value and added to the 16-bit twos complement sum (the lower half). One final carry is possible at that point, and must be included in the final sum. The IP checksum is the inverted ones complement sum of all of the words in the IP header.

For checksum purposes, ones complement arithmetic has an important advantage over twos complement arithmetic. Recall that the biggest weakness of a parity bit is that it can't detect a pair of bit errors or any even number of errors, for that matter. A twos complement sum suffers from a similar weakness. Only one of the bits in the sum (the MSB) is affected by changes in the most significant of the 16 columns. If an even number of bit errors occurs within that column, the checksum will appear to be correct and the error will not be detected. A ones complement sum does not suffer this weakness.

Because carries out of the MSB are added back into the LSB during a ones complement sum, errors in any one column of the data will be reflected in more than one bit of the checksum. So a ones complement sum is a stronger checksum (that is, it will detect more errors) than a twos complement sum, yet only slightly more expensive to compute. Hence, it is chosen for use in a lot of different situations.

The checksums within the UDP and TCP headers are computed in exactly the same way as the IP checksum. In fact, the only major difference is the set of words over which the sum is calculated. (A minor difference is that UDP checksums are optional.) In both cases, these layer 4 protocols include the message, their own header fields, and a few important fields of the IP header in the checksum calculation. The inclusion of some IP header fields in the UDP and TCP checksums is one of the biggest reasons that these layers of the protocol stack cannot be completely separated from the IP layer in a software implementation. The reason that IP checksums include only the IP header is that many intermediate computers must validate, read, and modify the contents of the IP header as a packet travels from its source to the destination. By reducing the number of words involved in the calculation, the speed with which all IP packets move across the Internet is improved. Including a larger set of words in the UDP and TCP checksums does not slow down communications; rather, it increases the likelihood that the message received is the same as the message sent.
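To make that last point concrete, here is a rough sketch of my own (not part of the article's code) of a UDP checksum computed over the pseudo-header plus the datagram. The function and parameter names are invented, the addresses are assumed to be passed as plain 32-bit numbers (for example 0xAC100A63 for 172.16.10.99), and the UDP checksum field is assumed to have been zeroed beforehand:

#include <stdint.h>
#include <stddef.h>

/* Add a byte buffer to the running sum as big-endian 16-bit words,
   padding a trailing odd byte with zero as the UDP rules require. */
static uint32_t SumWords(uint32_t sum, const uint8_t *data, size_t len)
{
    while (len > 1)
    {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len  -= 2;
    }
    if (len == 1)
    {
        sum += (uint32_t)data[0] << 8;
    }
    return sum;
}

/* udpDatagram covers the UDP header plus payload, with its checksum field
   already set to zero; udpLength is its length in bytes. */
uint16_t UdpChecksum(uint32_t srcIp, uint32_t dstIp,
                     const uint8_t *udpDatagram, uint16_t udpLength)
{
    uint32_t sum = 0;

    /* Pseudo-header: source address, destination address, a zero byte,
       the protocol number (17 for UDP), and the UDP length. */
    sum += (srcIp >> 16) + (srcIp & 0xFFFF);
    sum += (dstIp >> 16) + (dstIp & 0xFFFF);
    sum += 17;
    sum += udpLength;

    /* UDP header and data. */
    sum = SumWords(sum, udpDatagram, udpLength);

    /* Fold the carries back in and invert, exactly as for the IP header. */
    sum  = (sum >> 16) + (sum & 0xFFFF);
    sum += sum >> 16;

    /* RFC 768: a computed checksum of zero is transmitted as all ones. */
    uint16_t result = (uint16_t) ~sum;
    return (result != 0) ? result : 0xFFFF;
}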

Tuesday, September 13, 2011

16-bit one's complement sum means

1. Add the 16-bit values up. Each time a carry-out (a 17th bit) is produced, swing that bit around and add it back into the LSb (the one's digit). This is somewhat erroneously referred to as "one's complement addition."
2. Once all the values are added in this manner, invert all the bits in the result. A binary value that has all the bits of another binary value inverted is called its "one's complement," or simply its "complement."

IP header checksum calculation by example

checksum code:
unsigned short checksum(unsigned short *ptr, int len)
{
    register long sum = 0;              /* 32-bit accumulator for the carries */
    unsigned short answer = 0;
    register unsigned short *w = ptr;
    register int nleft = len;           /* length of the header in bytes */

    /* Sum the buffer one 16-bit word at a time; an IP header always
       contains an even number of bytes, so no odd byte is left over. */
    while (nleft > 1)
    {
        sum += *w++;
        nleft -= 2;
    }

    /* Fold the carries back into the low 16 bits, then invert. */
    sum = (sum >> 16) + (sum & 0xFFFF);
    sum += (sum >> 16);
    answer = ~sum;
    return (answer);
}


In theory, the IP checksum is the 16-bit one's complement of the one's complement sum of all 16-bit words in the header. As you may know, not all of the IP fields are exactly 16 bits long, so we first have to group them into 16-bit words before adding.

We have the following IP packet:

Code:
TS: 18:45:33.398596  IP: 172.16.10.99 > 172.16.10.12
Offset:  Hexadecimal dump                         :  Char dump
------:--------------------------------------------:-----------------
0x0000:  4500 003c 1c46 4000 4006 b1e6 ac10 0a63   :  E..<.F@.@......c
0x0010:  ac10 0a0c                                  :  ..

Fine, now let's analyze it:

The first byte (45) corresponds to the first two fields of the IP header, IP Version and Internet Header Length (IHL). These values tell us that the IP version used is 4 and that the IHL is 5, which actually means 20 bytes, because this field is measured in 32-bit (4-byte) words.

The second byte (00) corresponds to the Type Of Service field (ToS), which means that NORMAL PRECEDENCE is set in this packet.

The next two bytes (003C) correspond to the Total Length field of the IP header, which tells us that the total length of the packet is 60 (0×16^3 + 0×16^2 + 3×16^1 + 12×16^0 = 60).

The next two bytes (1C46) correspond to the Identification field, which in this packet is 7238 (1×16^3 + 12×16^2 + 4×16^1 + 6×16^0 = 7238).

The next two bytes (4000) correspond to the Flags and Fragment Offset fields, which are divided into 3 bits for the flags and 13 bits for the fragment offset. The top 3 bits here are 010, meaning the Don't Fragment flag is set (which is why the first hex digit reads 4), and the fragment offset is obviously zero.

The next byte (40) corresponds to the Time To Live (TTL) field, which is 64 (4×16^1 + 0×16^0 = 64).

The next byte (06) corresponds to the IP Protocol field, which is set to 6, so the packet carries a TCP segment in its payload.

The next two bytes (B1E6) correspond to the IP header checksum of the packet. We'll calculate this "manually" below, so for us this field's value is treated as zero, because we're going to calculate the checksum just as the sender did. (A receiver instead sums the header with this field left in place and checks that the result comes to all ones.)

Phew... the next four bytes (ac10 0a63) correspond to the Source IP address field, which is "172.16.10.99", and the next four bytes (ac10 0a0c) correspond to the Destination IP address field, which is "172.16.10.12".

Right, we now need to group all of these fields into 16-bit words and convert them to binary, which gives us the following:

HEX BINARY
4500 0100010100000000
003c 0000000000111100
1c46 0001110001000110
4000 0100000000000000
4006 0100000000000110
0000 0000000000000000 <- The checksum is set to zero.
ac10 1010110000010000
0a63 0000101001100011
ac10 1010110000010000
0a0c 0000101000001100


Okay, let's add all these numbers one by one:

4500 0100010100000000
003c 0000000000111100
453C 0100010100111100 <-- This is the 1st result.


453C 0100010100111100 <-- First result plus next 16-bit word.
1c46 0001110001000110
6182 0110000110000010 <-- This is the 2nd result.

6182 0110000110000010 <-- Second result plus next 16-bit word.
4000 0100000000000000
A182 1010000110000010 <-- This is the 3rd result.

A182 1010000110000010 <-- Third result plus next 16-bit word.
4006 0100000000000110
E188 1110000110001000 <-- This is the 4th result.

..E188 1110000110001000 <--Fourth result plus next 16-bit word.
..AC10 1010110000010000
18D98 11000110110011000 <-- Here a 17th bit (a carry) appears, but the checksum has to stay within 16 bits, so we wrap that carry around and add it back into the result.

18D98 11000110110011000
.8D99 1000110110011001 <--This is the 5th result.

8D99 1000110110011001 <-- Fifth result plus next 16-bit word.
0A63 0000101001100011
97FC 1001011111111100 <--This is the 6th result.

..97FC 1001011111111100 <-- Sixth result plus next 16-bit word.
..AC10 1010110000010000
1440C 10100010000001100 <-- Again, there is a carry, so we add it.

1440C 10100010000001100
.440D 0100010000001101 <-- This is the 7th result.

440D 0100010000001101 <-- Seventh result plus next 16-bit word
0A0C 0000101000001100
4E19 0100111000011001 <-- Last result.

We're not quite done yet: we still have to apply the last binary operation, the one's complement (inverting every bit), and the result is the checksum itself:

4E19 0100111000011001
B1E6 1011000111100110 <-- The IP header checksum.
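As a cross-check (a small sketch of my own), the ten words from the table, with the checksum field zeroed, can be fed straight to the checksum() routine shown earlier. Passing the numeric word values directly keeps byte order out of the picture:

#include <stdio.h>

unsigned short checksum(unsigned short *ptr, int len);   /* from the code above */

int main(void)
{
    /* The ten 16-bit words of the example header, checksum field set to zero. */
    unsigned short header[10] = {
        0x4500, 0x003C, 0x1C46, 0x4000, 0x4006,
        0x0000, 0xAC10, 0x0A63, 0xAC10, 0x0A0C
    };

    /* Prints B1E6, the same value seen in the captured packet. */
    printf("%04X\n", checksum(header, (int) sizeof(header)));
    return 0;
}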

Friday, September 9, 2011

Spanning tree protocol(STP)

The Spanning Tree Protocol (STP) is a network protocol that ensures a loop-free topology for any bridged Ethernet local area network. The basic function of STP is to prevent bridge loops and ensuing broadcast radiation. Spanning tree also allows a network design to include spare (redundant) links to provide automatic backup paths if an active link fails, without the danger of bridge loops, or the need for manual enabling/disabling of these backup links.

STP is a Data Link Layer protocol. It is standardized as IEEE 802.1D. As the name suggests, it creates a spanning tree within a mesh network of connected layer-2 bridges (typically Ethernet switches), and disables those links that are not part of the spanning tree, leaving a single active path between any two network nodes.

STP is based on an algorithm invented by Radia Perlman while working for Digital Equipment Corporation.
Protocol operation

The collection of bridges in a local area network (LAN) can be considered a graph whose nodes are bridges and LAN segments (or cables), and whose edges are the interfaces connecting the bridges to the segments. To break loops in the LAN while maintaining access to all LAN segments, the bridges collectively compute a spanning tree. The spanning tree is not necessarily a minimum cost spanning tree. A network administrator can reduce the cost of a spanning tree, if necessary, by altering some of the configuration parameters in such a way as to affect the choice of the root of the spanning tree. The spanning tree that the bridges compute using the Spanning Tree Protocol can be determined using the following rules; an example network is used to illustrate them.
Select a root bridge. The root bridge of the spanning tree is the bridge with the smallest (lowest) bridge ID. Each bridge has a unique identifier (ID) and a configurable priority number; the bridge ID contains both numbers. To compare two bridge IDs, the priority is compared first. If two bridges have equal priority, then the MAC addresses are compared. For example, if switches A (MAC=0200.0000.1111) and B (MAC=0200.0000.2222) both have a priority of 10, then switch A will be selected as the root bridge. If the network administrators would like switch B to become the root bridge, they must set its priority to be less than 10.
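The comparison just described is easy to express in code. A small sketch (the struct and function names are mine) that decides which of two bridge IDs is better, priority first and MAC address as the tie-breaker:

#include <stdint.h>
#include <string.h>

/* A bridge ID as described above: a configurable priority plus the
   bridge's MAC address. */
struct BridgeId
{
    uint16_t priority;
    uint8_t  mac[6];
};

/* Returns a negative value if a is the better (lower) bridge ID, a
   positive value if b is better, and zero if they are identical.
   Priority is compared first; the MAC address breaks ties. */
int CompareBridgeIds(const struct BridgeId *a, const struct BridgeId *b)
{
    if (a->priority != b->priority)
    {
        return (a->priority < b->priority) ? -1 : 1;
    }
    return memcmp(a->mac, b->mac, sizeof(a->mac));
}

With the switches in the example above, A and B tie on priority (10), so the MAC comparison decides and A becomes the root.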

Determine the least cost paths to the root bridge. The computed spanning tree has the property that messages from any connected device to the root bridge traverse a least cost path, i.e., a path from the device to the root that has minimum cost among all paths from the device to the root. The cost of traversing a path is the sum of the costs of the segments on the path. Different technologies have different default costs for network segments. An administrator can configure the cost of traversing a particular network segment. The property that messages always traverse least-cost paths to the root is guaranteed by the following two rules.

Least cost path from each bridge. After the root bridge has been chosen, each bridge determines the cost of each possible path from itself to the root. From these, it picks one with the smallest cost (a least-cost path). The port connecting to that path becomes the root port (RP) of the bridge.

Least cost path from each network segment. The bridges on a network segment collectively determine which bridge has the least-cost path from the network segment to the root. The port connecting this bridge to the network segment is then the designated port (DP) for the segment.

Disable all other root paths. Any active port that is not a root port or a designated port is a blocked port (BP).

Modifications in case of ties. The above rules over-simplify the situation slightly, because it is possible that there are ties, for example, two or more ports on a single bridge are attached to least-cost paths to the root, or two or more bridges on the same network segment have equal least-cost paths to the root. To break such ties:

Breaking ties for root ports. When multiple paths from a bridge are least-cost paths, the chosen path uses the neighbor bridge with the lower bridge ID. The root port is thus the one connecting to the bridge with the lowest bridge ID. For example, in figure 3, if switch 4 were connected to network segment d, there would be two paths of length 2 to the root, one path going through bridge 24 and the other through bridge 92. Because there are two least cost paths, the lower bridge ID (24) would be used as the tie-breaker in choosing which path to use.

Breaking ties for designated ports. When more than one bridge on a segment leads to a least-cost path to the root, the bridge with the lower bridge ID is used to forward messages to the root. The port attaching that bridge to the network segment is the designated port for the segment. In figure 4, there are two least cost paths from network segment d to the root, one going through bridge 24 and the other through bridge 92. The lower bridge ID is 24, so the tie breaker dictates that the designated port is the port through which network segment d is connected to bridge 24. If bridge IDs were equal, then the bridge with the lowest MAC address would have the designated port. In either case, the loser sets the port as being blocked.

The final tie-breaker. In some cases, there may still be a tie, as when two bridges are connected by multiple cables. In this case, multiple ports on a single bridge are candidates for root port. In this case, the path which passes through the port on the neighbor bridge that has the lowest port priority is used.
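Putting the cost comparison and the tie-breakers together, root-port selection can be sketched as one ordered comparison (again my own illustration; a real bridge fills these fields in from received BPDUs):

#include <stdint.h>

/* What a bridge knows about one of its ports after processing BPDUs. */
struct PortCandidate
{
    uint32_t rootPathCost;       /* cost of reaching the root via this port   */
    uint64_t neighborBridgeId;   /* neighbor's priority + MAC packed together */
    uint16_t neighborPortId;     /* neighbor's port priority and port number  */
};

/* Returns nonzero if candidate a should be preferred over b as the root
   port: lowest cost first, then lowest neighbor bridge ID, then lowest
   neighbor port ID. */
int PreferAsRootPort(const struct PortCandidate *a,
                     const struct PortCandidate *b)
{
    if (a->rootPathCost != b->rootPathCost)
        return a->rootPathCost < b->rootPathCost;
    if (a->neighborBridgeId != b->neighborBridgeId)
        return a->neighborBridgeId < b->neighborBridgeId;
    return a->neighborPortId < b->neighborPortId;
}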
Data rate and STP path cost

The table below shows the default cost of an interface for a given data rate.
Data rate     STP Cost (802.1D-1998)     STP Cost (802.1t-2001)
4 Mbit/s     250     5,000,000
10 Mbit/s     100     2,000,000
16 Mbit/s     62     1,250,000
100 Mbit/s     19     200,000
1 Gbit/s     4     20,000
2 Gbit/s     3     10,000
10 Gbit/s     2     2,000
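The 802.1t column, incidentally, is consistent with a simple formula: cost = 20,000,000,000 divided by the link speed in kbit/s. A small sketch of my own:

#include <stdint.h>

/* Recommended 802.1t path cost for a link of the given speed in kbit/s. */
uint32_t Dot1tPathCost(uint64_t linkSpeedKbps)
{
    return (uint32_t)(20000000000ULL / linkSpeedKbps);
}

/* Dot1tPathCost(100000)   returns 200000  (100 Mbit/s)
   Dot1tPathCost(1000000)  returns 20000   (1 Gbit/s)
   Dot1tPathCost(10000000) returns 2000    (10 Gbit/s) */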

Bridge Protocol Data Units (BPDUs)

The above rules describe one way of determining what spanning tree will be computed by the algorithm, but the rules as written require knowledge of the entire network. The bridges have to determine the root bridge and compute the port roles (root, designated, or blocked) with only the information that they have. To ensure that each bridge has enough information, the bridges use special data frames called Bridge Protocol Data Units (BPDUs) to exchange information about bridge IDs and root path costs.

A bridge sends a BPDU frame using the unique MAC address of the port itself as a source address, and a destination address of the STP multicast address 01:80:C2:00:00:00.

There are three types of BPDUs:

    Configuration BPDU (CBPDU), used for Spanning Tree computation
    Topology Change Notification (TCN) BPDU, used to announce changes in the network topology
    Topology Change Notification Acknowledgment (TCA)



BPDUs are exchanged regularly (every 2 seconds by default) and enable switches to keep track of network changes and to start and stop forwarding at ports as required.
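For reference, the layout of a Configuration BPDU can be sketched as a struct (my own sketch; the field order follows IEEE 802.1D, all multi-byte fields are big-endian on the wire, and a real implementation should parse them byte by byte rather than rely on struct packing):

#include <stdint.h>

struct ConfigBpdu
{
    uint16_t protocolId;      /* 0x0000 for Spanning Tree                     */
    uint8_t  version;         /* 0 for classic STP                            */
    uint8_t  bpduType;        /* 0x00 = Configuration, 0x80 = TCN             */
    uint8_t  flags;           /* Topology Change / TC Acknowledgment bits     */
    uint8_t  rootId[8];       /* bridge ID of the bridge believed to be root  */
    uint32_t rootPathCost;    /* sender's cost to reach that root             */
    uint8_t  bridgeId[8];     /* sender's own bridge ID                       */
    uint16_t portId;          /* sender's port priority and port number       */
    uint16_t messageAge;      /* timers are carried in 1/256ths of a second   */
    uint16_t maxAge;
    uint16_t helloTime;
    uint16_t forwardDelay;
};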

When a device is first attached to a switch port, it will not immediately start to forward data. It will instead go through a number of states while it processes BPDUs and determines the topology of the network. When a host such as a computer, printer or server is attached, the port will always go into the forwarding state, albeit after a delay of about 30 seconds while it goes through the listening and learning states (see below). The time spent in the listening and learning states is determined by a value known as the forward delay (default 15 seconds, set by the root bridge). However, if another switch is connected instead, the port may remain in blocking mode if it is determined that it would cause a loop in the network.

Topology Change Notification (TCN) BPDUs are used to inform other switches of port changes. TCNs are injected into the network by a non-root switch and propagated to the root. Upon receipt of the TCN, the root switch will set a Topology Change flag in its normal BPDUs. This flag is propagated to all other switches to instruct them to rapidly age out their forwarding table entries.

STP switch port states:

    Blocking - A port that would cause a switching loop, no user data is sent or received but it may go into forwarding mode if the other links in use were to fail and the spanning tree algorithm determines the port may transition to the forwarding state. BPDU data is still received in blocking state.
    Listening - The switch processes BPDUs and awaits possible new information that would cause it to return to the blocking state.
    Learning - While the port does not yet forward frames (packets) it does learn source addresses from frames received and adds them to the filtering database (switching database)
    Forwarding - A port receiving and sending data, normal operation. STP still monitors incoming BPDUs that would indicate it should return to the blocking state to prevent a loop.
    Disabled - Not strictly part of STP, a network administrator can manually disable a port
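In code, the states reduce to a simple enumeration (a trivial sketch of my own):

/* The 802.1D port states, in the order a port normally moves through
   them after coming up; Disabled is an administrative state. */
enum StpPortState
{
    STP_BLOCKING,
    STP_LISTENING,
    STP_LEARNING,
    STP_FORWARDING,
    STP_DISABLED
};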



To prevent the delay when connecting hosts to a switch and during some topology changes, Rapid STP was developed and standardized by IEEE 802.1w, which allows a switch port to rapidly transition into the forwarding state during these situations.
BPDU fields

The bridge ID, or BID, is a field inside a BPDU packet. It is eight bytes in length. The first two bytes are the Bridge Priority, an unsigned integer in the range 0-65,535. The last six bytes are a MAC address supplied by the switch. In the event that MAC Address Reduction is used, the first two bytes are used differently: the first four bits are a configurable priority, and the last twelve bits carry either the VLAN ID or the MSTP instance number.
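The two layouts described above might be represented like this (my own structs and helper; the 4-bit priority is conventionally configured in steps of 4096):

#include <stdint.h>

/* Classic bridge ID: a 16-bit priority followed by the 48-bit MAC address. */
struct ClassicBridgeId
{
    uint16_t priority;           /* 0 - 65,535 */
    uint8_t  mac[6];
};

/* With MAC Address Reduction the same two bytes are split: the top 4 bits
   are the priority and the low 12 bits carry the VLAN ID or MSTP instance. */
struct ReducedBridgeId
{
    uint16_t priorityAndSysId;
    uint8_t  mac[6];
};

/* Pack a priority step (0-15) and a 12-bit VLAN/MSTI number into the
   first two bytes of a reduced bridge ID. */
uint16_t PackPrioritySysId(uint8_t priorityStep, uint16_t vlanOrMsti)
{
    return (uint16_t)(((priorityStep & 0x0F) << 12) | (vlanOrMsti & 0x0FFF));
}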
Evolutions and extensions

The first spanning tree protocol was invented in 1985 at the Digital Equipment Corporation by Radia Perlman. In 1990, the IEEE published the first standard for the protocol as 802.1D, based on the algorithm designed by Perlman. Subsequent versions were published in 1998 and 2004, incorporating various extensions.

Although the purpose of a standard is to promote interworking of equipment from different vendors, different implementations of a standard are not guaranteed to work, due for example to differences in default timer settings. The IEEE encourages vendors to provide a "Protocol Implementation Conformance Statement", declaring which capabilities and options have been implemented, to help users determine whether different implementations will interwork correctly.

Also, the original Perlman-inspired Spanning Tree Protocol, called DEC STP, is not a standard and differs from the IEEE version in message format as well as timer settings. Some bridges implement both the IEEE and the DEC versions of the Spanning Tree Protocol, but their interworking can create issues for the network administrator, as illustrated by the problem discussed in an on-line Cisco document.
Rapid Spanning Tree Protocol (RSTP)

In 2001, the IEEE with document 802.1w introduced an evolution of the Spanning Tree Protocol: Rapid Spanning Tree Protocol (RSTP), which provides for faster spanning tree convergence after a topology change. Standard IEEE 802.1D-2004 now incorporates RSTP and obsoletes STP. While STP can take 30 to 50 seconds to respond to a topology change, RSTP is typically able to respond to changes within 3*Hello times (default: 6 seconds) or within a few milliseconds of a physical link failure. The so-called Hello time is an important and configurable time interval that is used by RSTP for several purposes; its default value is 2 seconds.

RSTP bridge port roles:

    Root - A forwarding port that is the best port from a non-root bridge towards the root bridge
    Designated - A forwarding port for every LAN segment
    Alternate - An alternate path to the root bridge. This path is different than using the root port.
    Backup - A backup/redundant path to a segment where another bridge port already connects.
    Disabled - Not strictly part of STP, a network administrator can manually disable a port



RSTP is a refinement of STP and therefore shares most of its basic operation characteristics. However there are some notable differences as summarized below:

    Detection of root switch failure is done in 3 hello times, which is 6 seconds if default hello times have not been changed.
    Ports may be configured as edge ports if they are attached to a LAN that has no other bridges attached. These edge ports transition directly to the forwarding state. RSTP still continues to monitor the port for BPDUs in case a bridge is connected. RSTP can also be configured to automatically detect edge ports. As soon as the bridge detects a BPDU coming to an edge port, the port becomes a non-edge port.
    Unlike in STP, RSTP will respond to BPDUs sent from the direction of the root bridge. An RSTP bridge will "propose" its spanning tree information to its designated ports. If another RSTP bridge receives this information and determines this is the superior root information, it sets all its other ports to discarding. The bridge may send an "agreement" to the first bridge confirming its superior spanning tree information. The first bridge, upon receiving this agreement, knows it can rapidly transition that port to the forwarding state bypassing the traditional listening/learning state transition. This essentially creates a cascading effect away from the root bridge where each designated bridge proposes to its neighbors to determine if it can make a rapid transition. This is one of the major elements that allows RSTP to achieve faster convergence times than STP.
    As discussed in the port role details above, RSTP maintains backup details regarding the discarding status of ports. This avoids timeouts if the current forwarding ports were to fail or BPDUs were not received on the root port in a certain interval.


Per-VLAN Spanning Tree (PVST)

In Ethernet switched environments where multiple Virtual LANs exist, spanning tree can be deployed per Virtual LAN.
Cisco's name for this is per VLAN spanning tree (PVST and PVST+, which is the default protocol used by Cisco switches).
Both PVST and PVST+ are Cisco proprietary protocols and generally cannot be used on third-party switches, although Force10 Networks, Extreme Networks and Blade Network Technologies do support PVST+; Extreme Networks does so with two limitations (no support on ports where the VLAN is untagged/native, and none on the VLAN with ID 1).
PVST works only with ISL (Cisco's proprietary protocol for VLAN encapsulation) due to its embedded Spanning tree ID. Due to high penetration of the IEEE 802.1Q VLAN trunking standard and PVST's dependence on ISL, Cisco defined a different PVST+ standard for 802.1Q encapsulation. PVST+ can tunnel across an MSTP Region.
Multiple Spanning Tree Protocol (MSTP)

The Multiple Spanning Tree Protocol (MSTP), originally defined in IEEE 802.1s and later merged into IEEE 802.1Q-2005, defines an extension to RSTP to further develop the usefulness of virtual LANs (VLANs). This "Per-VLAN" Multiple Spanning Tree Protocol configures a separate Spanning Tree for each VLAN group and blocks all but one of the possible alternate paths within each Spanning Tree.

If there is only one Virtual LAN (VLAN) in the network, single (traditional) STP works appropriately. If the network contains more than one VLAN, the logical network configured by single STP would work, but it is possible to make better use of the alternate paths available by using an alternate spanning tree for different VLANs or groups of VLANs.

MSTP allows formation of MST regions that can run multiple MST instances (MSTI). Multiple regions and other STP bridges are interconnected using one single common spanning tree (CST).

MSTP is similar to Cisco Systems' Multiple Instances Spanning Tree Protocol (MISTP), and is an evolution of the Spanning Tree Protocol and the Rapid Spanning Tree Protocol. It was introduced in IEEE 802.1s as an amendment to 802.1Q, 1998 edition. Standard IEEE 802.1Q-2005 now includes MSTP.

Unlike some proprietary per-VLAN spanning tree implementations, MSTP includes all of its spanning tree information in a single BPDU format. Not only does this reduce the number of BPDUs required on a LAN to communicate spanning tree information for each VLAN, but it also ensures backward compatibility with RSTP (and in effect, classic STP too). MSTP does this by encoding additional region information after the standard RSTP BPDU as well as a number of MSTI messages (from 0 to 64 instances, although in practice many bridges support fewer). Each of these MSTI configuration messages conveys the spanning tree information for one instance. Each instance can be assigned a number of configured VLANs, and frames (packets) assigned to these VLANs operate in this spanning tree instance whenever they are inside the MST region. In order to avoid conveying their entire VLAN-to-spanning-tree mapping in each BPDU, bridges encode an MD5 digest of their VLAN-to-instance table in the MSTP BPDU. This digest is then used by other MSTP bridges, along with other administratively configured values, to determine if the neighboring bridge is in the same MST region as itself.
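The region check described here amounts to comparing a small configuration identifier. A sketch of its layout and of the comparison (field sizes as I understand them from IEEE 802.1Q; treat them as an assumption):

#include <stdint.h>
#include <string.h>

/* MST Configuration Identifier carried in an MSTP BPDU. Two bridges are
   in the same MST region only when all of these fields match. */
struct MstConfigId
{
    uint8_t  formatSelector;    /* always 0                                 */
    char     regionName[32];    /* zero-padded configuration name           */
    uint16_t revision;
    uint8_t  digest[16];        /* MD5 digest of the VLAN-to-instance table */
};

int SameMstRegion(const struct MstConfigId *a, const struct MstConfigId *b)
{
    return a->formatSelector == b->formatSelector
        && memcmp(a->regionName, b->regionName, sizeof(a->regionName)) == 0
        && a->revision == b->revision
        && memcmp(a->digest, b->digest, sizeof(a->digest)) == 0;
}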

MSTP is fully compatible with RSTP bridges, in that an MSTP BPDU can be interpreted by an RSTP bridge as an RSTP BPDU. This not only allows compatibility with RSTP bridges without configuration changes, but also causes any RSTP bridges outside of an MSTP region to see the region as a single RSTP bridge, regardless of the number of MSTP bridges inside the region itself. In order to further facilitate this view of an MST region as a single RSTP bridge, the MSTP protocol uses a variable known as remaining hops as a time to live counter instead of the message age timer used by RSTP. The message age time is only incremented once when spanning tree information enters an MST region, and therefore RSTP bridges will see a region as only one "hop" in the spanning tree. Ports at the edge of an MST region connected to either an RSTP or STP bridge or an endpoint are known as boundary ports. As in RSTP, these ports can be configured as edge ports to facilitate rapid changes to the forwarding state when connected to endpoints.
Rapid Per-VLAN Spanning Tree (R-PVST)

Rapid Per-VLAN Spanning Tree is a Cisco proprietary protocol that combines the functionality of RSTP and PVST: it runs a separate per-VLAN instance, creating a tree for each VLAN.