Storage of global and static variables in C

2015-10-25

A while back, I wrote some blog posts about (basic) memory management of userspace processes on Linux: 0, 1. They’re only available in German, sorry.

Let’s recap the common memory layout of a process on Linux:

+---------+··+------:····························:------+------+·······:-------+··+--------+
| Program |  | Heap :-->                      <--: mmap | vdso |    <--: Stack |  | Kernel |
+---------+··+------:····························:------+------+·······:-------+··+--------+
^            ^                                                                 ^  ^
|            |                                                                 |  |
0x08048000   0x08xxxxxx                                               0xBFxxxxxx  0xC0000000

Remarks: This figure illustrates the address space on 32-bit Linux. The actual addresses differ on 64-bit. The “program” area is where the executable’s code (the .text section) is mapped.

How is that memory used? Where are variables usually stored when you program in C? Here’s what has been covered in my previous articles:

Normal function-local variables end up in the function’s stack frame. That frame is “destroyed” when the function returns, and so are the variables.
Dynamic memory can be requested from libc via functions like malloc. The libc decides where to allocate that memory. It’s usually stored on the heap or as anonymous mmap space. This data can live as long as the process lives. (Of course, internally, the libc asks the kernel for memory.)

But there are two more “classes” of variables in C: global variables and function-local variables that are marked as “static”. How do they work? Where are they stored?

`static` vs. `static`

First of all, the static keyword has two meanings:

If it’s used with a global variable, then this variable is only “visible” from within the current file.
If you mark a variable within a function as static, then this variable will keep its value between invocations.

The first point is not really relevant to us. It alters the scope of the variable on a much higher level. At the end of the day, the variable is just memory and can be accessed from everywhere in the program – if you know its address.

Global variables – whether static or not – are interesting. They don’t live on the heap, nor mmap, nor the stack. And static variables in functions are interesting because they don’t get destroyed when the function returns. Hence, they can’t live on the stack.

So, where do these variables live?

A new area in memory

Given the following code:

#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

int global_i;
int global_i_initialized = 123;

int
main()
{
    static int function_i;
    static int function_i_initialized = 456;

    printf("%p\n", (void *)&global_i);
    printf("%p\n", (void *)&global_i_initialized);
    printf("%p\n", (void *)&function_i);
    printf("%p\n", (void *)&function_i_initialized);

    raise(SIGSTOP);  /* stop only this process; kill(0, ...) signals the whole process group */

    return 0;
}

If you run this program, it will print some addresses and then stop itself. You can then examine its address space:

$ ./bla 
0x6009a8
0x60099c
0x6009a4
0x600998

[1]+  Stopped                 ./bla

$ cat /proc/$(pgrep bla)/maps 
00400000-00401000 r-xp 00000000 00:21 325277             /tmp/bla
00600000-00601000 rw-p 00000000 00:21 325277             /tmp/bla  <---------
7f4ee52fa000-7f4ee5495000 r-xp 00000000 08:01 10227509   /usr/lib/libc-2.22.so
7f4ee5495000-7f4ee5694000 ---p 0019b000 08:01 10227509   /usr/lib/libc-2.22.so
7f4ee5694000-7f4ee5698000 r--p 0019a000 08:01 10227509   /usr/lib/libc-2.22.so
7f4ee5698000-7f4ee569a000 rw-p 0019e000 08:01 10227509   /usr/lib/libc-2.22.so
7f4ee569a000-7f4ee569e000 rw-p 00000000 00:00 0 
7f4ee569e000-7f4ee56c0000 r-xp 00000000 08:01 10227099   /usr/lib/ld-2.22.so
7f4ee5888000-7f4ee588b000 rw-p 00000000 00:00 0 
7f4ee58be000-7f4ee58bf000 rw-p 00000000 00:00 0 
7f4ee58bf000-7f4ee58c0000 r--p 00021000 08:01 10227099   /usr/lib/ld-2.22.so
7f4ee58c0000-7f4ee58c1000 rw-p 00022000 08:01 10227099   /usr/lib/ld-2.22.so
7f4ee58c1000-7f4ee58c2000 rw-p 00000000 00:00 0 
7ffe48b2f000-7ffe48b50000 rw-p 00000000 00:00 0          [stack]
7ffe48bd1000-7ffe48bd3000 r--p 00000000 00:00 0          [vvar]
7ffe48bd3000-7ffe48bd5000 r-xp 00000000 00:00 0          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0  [vsyscall]

$ fg
./bla

As you can see, all four variables live in a memory area just above .text that is marked as readable and writable (00600000-00601000). This answers our first question: these variables don’t live on the stack, on the heap, or in mmap space. They get their very own memory area, which stays mapped for the whole lifetime of the process.

So far, so good.

Initialized vs. uninitialized

It makes a difference whether you initialize your variables or not. That’s why I introduced four different variables in the example above. They may end up in the same memory area, but they live in different sections. That’s a subtle difference which becomes clearer if we have a look at the assembly code:

$ gcc -Wall -Wextra -S -o bla.S bla.c

I won’t copy the full code here. Let’s just highlight the important sections and let’s look at initialized variables first because they’re a bit easier to understand.

At the top of the code, you can find this:

    .data
    .align 4
    .type   global_i_initialized, @object
    .size   global_i_initialized, 4
global_i_initialized:
    .long   123

And at the bottom, there’s this:

    .data
    .align 4
    .type   function_i_initialized.2802, @object
    .size   function_i_initialized.2802, 4
function_i_initialized.2802:
    .long   456

.data describes the corresponding “section” in the final ELF file. Simply put: You can store data here and, at runtime, this section will be loaded into memory someplace near .text. Pretty easy, huh? It’s interesting to note that both variables end up in this section. This means the only difference between them is their scope at the C level. You can’t access function_i_initialized outside of main (or whatever function it was declared in) by name.

Okay, now for uninitialized variables. At the top of the code, you can spot this:

.comm   global_i,4,4

And at the bottom, there’s this:

.comm   function_i.2801,4,4

It’s important to note that both variables are merely “declared” here: they get a size and an alignment, but no value is stored for them in the binary. Not even “0”. What would be the point? C guarantees that variables with static storage duration start out as zero if the programmer doesn’t initialize them, so there is no data worth storing. All that’s needed is some space. And that is what really happens: These “declarations” effectively increase a counter. Nothing more. The resulting size is then stored in your binary. When you run your program, the loader takes care of providing enough zero-filled memory. All of that happens in the .bss section.

An interesting detail: RIP-relative addressing

Often, memory is referenced either directly via its address or relative to the current stack frame. Usage of .data and .bss, however, reveals another interesting addressing mode.

Take the following C code:

int
main()
{
    static int function_i;

    function_i = 5;

    return 0;
}

When looking at the resulting assembly code, you’ll find an instruction like this:

movl    $5, function_i.2285(%rip)

This is known as “RIP-relative addressing”. The register RIP holds the address of the next instruction to execute. Because .data and .bss sit at a fixed distance from .text once the program has been linked, it’s very easy to refer to variables in these sections by using an address in .text as a basis.

Why would you do that? RIP-relative addressing makes it easier to produce position-independent code.

A note to C beginners

Beginners sometimes write code like this:

#include <stdio.h>

char *
foo(void)
{
    char s[] = "hello world";
    return s;
}

int
main()
{
    char *t;

    t = foo();
    printf("%s\n", t);

    return 0;
}

Luckily, these days, the compiler issues a big warning. The problem is that you return the address of something that lives in the stack frame of the function foo(). Once foo() has returned, t still refers to the same memory, but we no longer know what data is stored at that location.

(The point is to understand that you only return a pointer and not a complete string.)

Now, the solution is NOT to declare s static.

Take the following code:

#include <stdio.h>

char *
foo(char c)
{
    static char s[] = "hello world";
    s[0] = c;
    return s;
}

int
main()
{
    char *t, *u;

    t = foo('x');
    u = foo('y');
    printf("%s\n", t);
    printf("%s\n", u);

    return 0;
}

What do you expect will happen here? Doesn’t this solve your problem? Didn’t you read somewhere that “static variables won’t vanish when the function returns”? Plus, the compiler issues no warning! We’re fine!

No, we’re not.

What foo() returns is an address in the .data section. It returns the same address on each invocation, so t and u point to the very same buffer and the program prints “yello world” twice. My point is: static s does NOT create a variable that just happens to live on when the function returns. You might think that foo() would create a new “immortal” variable on each call. No!

You only get one variable.

Summary

Both global variables and static variables in functions end up in a “special” memory area. That’s why global variables are, well, globally accessible and that’s why static variables in functions can survive even after the function has returned.
If these variables have been initialized, they are stored in the .data section in your binary, otherwise their accumulated size is stored as the size of the .bss section.

Please note that this has only been a quick overview. There’s still a lot to learn.
