When we come across memory segments in a C program, these are the questions that come to mind.
- What happens when a c program is loaded into memory?
- Where are the different types of variables allocated?
- Why do we need two data sections, initialized and un-initialized?
- If we initialize a static or global variable with 0 where will it be stored?
- Even though the scopes of global and static variables are different, why are they stored in the same section, i.e., the data segment?
Let's look at some of these interesting under-the-hood details here. We know that a C program which is compiled to an executable and loaded into memory for execution has four main segments in memory. They are the data, code, stack, and heap segments.
Global and function static variables are allocated in the data segment. The compiler converts the executable statements in a C program, such as printf("hello world");, into machine code, which is loaded into the code segment. When the program executes, function calls are made. Executing each function requires a block of memory, a frame, to store information such as the return address, local variables, and so on. Since this allocation is done on the stack, these blocks are known as stack frames. When we do dynamic memory allocation, for example with the malloc function, memory is allocated in the heap area.
Static and Dynamic Segments
The data and code segments are of fixed size. When a program is compiled, the sizes required for these segments are fixed and known. Hence they are known as static segments. The sizes of the stack and heap areas are not known when the program is compiled; they grow and shrink as the program runs. It is also possible to change or configure the limits of these areas (i.e., increase or decrease them). So these are called dynamic segments.
Let's look at each of these segments in detail.
Data segment:- the data segment holds the values of those variables that need to be available throughout the lifetime of the program. So it is obvious that global variables should be allocated in the data segment. How about local variables declared as static? Yes, they are also allocated in the data area, because their values must be preserved across function calls. If they were allocated in the stack frame itself, they would be destroyed once the function returned. The only option is to allocate them in a global area. Hence, they are allocated in this segment. So the lifetime of a local static variable is the lifetime of the program.
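The difference in lifetime between a static local and an automatic local can be seen in a minimal sketch. The function names next_id and next_id_auto are hypothetical and chosen only for illustration.

```c
/* The counter lives in the data area, not in the stack frame,
   so its value survives from one call to the next. */
int next_id(void)
{
    static int counter = 0;  /* initialized once, not on every call */
    counter++;
    return counter;
}

/* The same logic with an automatic variable: the counter is
   re-created and re-initialized in a fresh stack frame on every call. */
int next_id_auto(void)
{
    int counter = 0;
    counter++;
    return counter;
}
```

Calling next_id three times yields 1, 2, 3, while next_id_auto returns 1 every time, because its counter does not outlive the call.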
There are two parts in this segment: the initialized data segment and the uninitialized data segment.
When variables are explicitly initialized to some value other than zero, they are allocated in the initialized segment. When variables are not initialized, they are allocated in the uninitialized data segment. (Many compilers, gcc among them, typically also place variables that are explicitly initialized to zero in this segment, since the result is the same.) This segment is usually referred to by the cryptic acronym BSS. It stands for "block started by symbol", a name that comes from an assembler directive used on old IBM systems.
The data area is split in two based on explicit initialization. Variables that have explicit initial values need those values stored in the executable file, one by one. Variables that are not initialized, however, do not need to be set to zero one by one, and no values need to be stored for them. Instead, the job of setting them to zero is left to the operating system, which clears the whole area in bulk. This bulk initialization can greatly reduce the time required to load the program.
When we want to run an executable program, the OS starts a program known as the loader. When the loader brings the file into memory, it takes the BSS segment and initializes the whole thing to zeros. That is why uninitialized global data and static data always get the default value of zero.
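A small sketch can confirm the zero default for uninitialized globals and statics. The names table, scale, and bss_is_zeroed are assumptions made for this example only.

```c
#include <stddef.h>

int table[1024];      /* uninitialized global: expected to land in BSS */
static double scale;  /* uninitialized file-scope static: also BSS */

/* Returns 1 if table and scale hold zero without ever having been
   assigned by the program, 0 otherwise. */
int bss_is_zeroed(void)
{
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (table[i] != 0)
            return 0;
    }
    return scale == 0.0;
}
```

The C language guarantees this zero default for objects with static storage duration; zeroing the BSS at load time is how that guarantee is typically met.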
The layout of the data segment is under the control of the underlying OS. However, some loaders give partial control to the user. This information may be useful in applications such as embedded systems.
The data area can be addressed and accessed using pointers from the code. Automatic variables have an overhead in initializing the variables each time they are required, and code is required to do that initialization. However, variables in the data area do not have such runtime overhead, because the initialization is done only once and that too at loading time.
Code segment:- the code segment is where the executable code resides for execution. This area is also known as the text segment and is of fixed size. It is meant to be reached through function pointers and not through ordinary data pointers. Another important point to note is that the system may treat this area as read-only memory, and any attempt to write to it can lead to undefined behavior.
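As a minimal sketch of reaching code through a function pointer, the example below passes the address of a function to another function, which then calls it. The names add, multiply, binary_op, and apply are hypothetical.

```c
/* Two ordinary functions; their machine code lives in the text segment. */
int add(int a, int b)      { return a + b; }
int multiply(int a, int b) { return a * b; }

/* A function pointer type that can hold the address of either one. */
typedef int (*binary_op)(int, int);

/* Calls whatever code the pointer refers to. */
int apply(binary_op op, int a, int b)
{
    return op(a, b);
}
```

For example, apply(add, 2, 3) evaluates to 5 and apply(multiply, 2, 3) evaluates to 6.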
Stack and heap segments:- the two major parts of memory used while the program executes are the stack and the heap. Stack frames are created on the stack for function calls, and blocks are allocated in the heap for dynamic memory allocation. The stack and heap are uninitialized areas. Therefore, whatever happens to be in the memory becomes the initial (garbage) value of the objects created in that space.
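Because heap memory obtained with malloc starts out with indeterminate contents, a program has to initialize it before reading it, or ask for zeroed memory with calloc. The sketch below shows both approaches; the function names are assumptions made for this example.

```c
#include <stdlib.h>

/* Allocates n ints with calloc and checks that they read back as zero.
   Returns 1 if all are zero, 0 if not, and -1 if allocation fails. */
int calloc_block_is_zero(size_t n)
{
    int *block = calloc(n, sizeof *block);
    if (block == NULL)
        return -1;

    int ok = 1;
    for (size_t i = 0; i < n; i++) {
        if (block[i] != 0)
            ok = 0;
    }
    free(block);
    return ok;
}

/* Allocates n ints with malloc, fills them explicitly before any read,
   and returns their sum (or -1 if allocation fails). */
long sum_filled(size_t n, int value)
{
    int *block = malloc(n * sizeof *block);
    if (block == NULL)
        return -1;

    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        block[i] = value;   /* write first: malloc does not clear memory */
        sum += block[i];
    }
    free(block);
    return sum;
}
```

Here sum_filled(4, 5) returns 20, and calloc_block_is_zero returns 1 whenever the allocation succeeds.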
Local variables and function arguments are allocated on the stack. For local variables that have an initialization value, the compiler generates code to initialize them explicitly to those values when the stack frame is created. For function parameters, the compiler generates code to copy the actual arguments into the space allocated for the parameters in the stack frame.
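The fact that arguments are copied into the callee's stack frame can be demonstrated with a short sketch. All function names here are hypothetical.

```c
/* Receives a copy of the argument; changing it affects only this frame. */
void clear_by_value(int n)
{
    n = 0;
}

/* Receives a copy of an address; writing through it reaches the caller. */
void clear_by_pointer(int *n)
{
    *n = 0;
}

/* x is untouched because clear_by_value worked on its own copy. */
int demo_by_value(void)
{
    int x = 42;
    clear_by_value(x);
    return x;
}

/* x is cleared because the callee wrote through the copied address. */
int demo_by_pointer(void)
{
    int x = 42;
    clear_by_pointer(&x);
    return x;
}
```

demo_by_value returns 42 and demo_by_pointer returns 0.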
Here, we will take a small program and see where different program elements are stored when that program executes. The comments explain where the variables get stored.
#include <stdio.h>
#include <stdlib.h>
static double bss2;
// uninitialized globals and file-scope statics like this are stored in the zero-initialized segment, also known as the uninitialized data segment (BSS)
char *init3 = "hello world";
// explicitly initialized globals like this are stored in the initialized data segment
// the code for the main function is stored in the code segment
int main(void)
{
int local1 = 10; // stored on the stack; the initialization code is generated by the compiler
int local2; // not initialized, hence it holds a garbage value; it is not initialized to zero
static int local3; // allocated in the BSS segment and initialized to zero
static int local4 = 100; // allocated in the initialized data segment
int (*local_foo)(const char *, ...) = printf; // printf is in a shared library (libc, the C runtime library)
// local_foo is a local variable (a function pointer) that points to the printf function
local_foo("hello world"); // this function call results in the creation of a stack frame in the stack area
// a local pointer that holds the result of malloc is allocated on the stack; however, it points to a dynamically allocated block in the heap
return 0;
// the stack frame for the main function is destroyed after main finishes executing
}
There are several tools to check where a variable is stored in memory, but the easiest one to use is nm.
Using the nm tool
First compile the program with gcc, passing it the program's file name, and then run nm on the result.
If no arguments are given to nm, it takes a.out as its input, and we get some cryptic output.
Where do the symbols that we did not type come from? They have been inserted behind the scenes by the compiler for various reasons. We can ignore them for now.
Now, what are those strange numbers followed by letters (b, B, t)? Each line shows the symbol value, followed by the symbol type (displayed as a letter) and the symbol name.
The symbol type requires more explanation. A lowercase letter means the symbol is local to the file, and an uppercase letter means it is global (externally visible from outside the file).
B un-initialized data section (BSS)
D initialized data section
T text/code section
U undefined (referenced in this file but defined elsewhere)
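To connect these letters to source code, here is a small sketch of a file annotated with the nm type each symbol would typically get when built with gcc on Linux without optimization. All names are hypothetical, and the annotations are expectations rather than guarantees: older gcc defaults may report an uninitialized global as C (common) instead of B, and an optimizing build may remove or inline the static symbols.

```c
int counter_bss;             /* typically B: uninitialized global */
static int hidden_bss;       /* typically b: uninitialized, file-local */
int counter_data = 7;        /* typically D: initialized global */
static int hidden_data = 9;  /* typically d: initialized, file-local */
int zero_init = 0;           /* typically B with gcc: zero initializers go to BSS */

/* typically t: function with internal linkage */
static int helper(void)
{
    return hidden_bss + hidden_data;
}

/* typically T: function with external linkage */
int visible(void)
{
    return helper() + counter_bss + counter_data + zero_init;
}
```

The zero_init line also addresses the earlier question about a global initialized with 0: with gcc it usually ends up next to the uninitialized variables rather than in the initialized data section.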
Example 1: nm ./a.out | grep bss
The variables bss1 and bss3 were allocated in the BSS segment as global symbols (B). Since we declared bss2 with the static storage class, it is listed as b (accessible only within the file).
Example 2: nm ./a.out | grep init
These are explicitly initialized and are allocated into initialized data section.
Example 3: nm ./a.out | grep local
Only local3 and local4 are allocated in global memory. Since local3 is uninitialized, it is allocated in the BSS, and since local4 is explicitly initialized, it is allocated in the initialized data segment. As both are local, they are indicated by lowercase letters. Since they are local to the function, they have been suffixed with numbers to avoid accidentally mixing them up with other local static variables of the same name. (Compilers differ in how they treat local static variables. This approach is the one gcc uses.)
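The reason for the numeric suffixes becomes clearer with a sketch in which two functions each declare a static local with the same name. The function names are hypothetical; the two variables named calls are separate objects, so the compiler has to give them distinct symbol names.

```c
/* Each function has its own static variable named calls. They share a
   name in the source, but occupy different locations in the data area. */
int count_reads(void)
{
    static int calls;   /* zero at program start */
    return ++calls;
}

int count_writes(void)
{
    static int calls;   /* a different object from the one above */
    return ++calls;
}
```

Calling count_reads twice and then count_writes once returns 1, 2, and 1 respectively, which shows that the two counters do not interfere with each other.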
08048354 T main
U printf@@GLIBC_2.0
The main function is allocated in the text/code segment. Obviously it must be accessible from outside the file (to start the execution), so the type of this symbol is T.
The malloc and printf functions used in the program are not defined in the program itself (header files only declare them, they don't define them). They are defined in the shared library GLIBC, version 2.0. That is what the suffix @@GLIBC_2.0 implies.