2015-10-25
A while back, I wrote some blog posts about (basic) memory management of userspace processes on Linux: 0, 1. They’re only available in German, sorry.
Let’s recap the common memory layout of a process on Linux:
+---------+··+------:····························:------+------+·······:-------+··+--------+
| Program |  | Heap :-->                      <--: mmap | vdso |    <--: Stack |  | Kernel |
+---------+··+------:····························:------+------+·······:-------+··+--------+
^            ^                                                                 ^  ^
|            |                                                                 |  |
0x08048000   0x08xxxxxx                                               0xBFxxxxxx  0xC0000000
Remarks: This figure was made to illustrate the address space on 32 bit
Linux. The actual addresses differ on 64 bit. The “program” section is
also known as .text.
How is that memory used? Where are variables usually stored when you program in C? Here’s what has been covered in my previous articles:
Memory requested via malloc. The libc decides where to allocate it. It’s
usually stored on the heap or as anonymous mmap space. This data can
live as long as the process lives. (Of course, internally, the libc
asks the kernel for memory.)
But there are two more “classes” of variables in C: global variables and function-local variables that are marked as “static”. How do they work? Where are they stored?
static vs. static
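First of all, the static keyword has two meanings:
1. If you declare a global variable or a function as static, then it is
only visible within the current source file (translation unit).
2. If you declare a variable inside a function as static, then this
variable will keep its value between invocations.
The first point is not really relevant to us. It alters the scope of the variable on a much higher level. At the end of the day, the variable is just memory and can be accessed from everywhere in the program – if you know its address.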
Global variables – whether static or not – are interesting. They don’t
live on the heap, nor mmap, nor the stack. And static variables in
functions are interesting because they don’t get destroyed when the
function returns. Hence, they can’t live on the stack.
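To make the “keeps its value between invocations” part concrete, here is a minimal sketch of a function-local static variable. The name next_id is made up for illustration and does not appear anywhere else in this post.

```c
/*
 * Hypothetical helper (not part of the example program in this post):
 * returns 1 on the first call, 2 on the second, and so on.
 */
int
next_id(void)
{
    /* Set up once before the program starts, not on every call. */
    static int counter = 0;

    counter++;
    return counter;
}
```

Calling next_id() three times in a row yields 1, 2 and 3. An ordinary (automatic) local variable would start from scratch on each call.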
So, where do these variables live?
Given the following code:
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

int global_i;
int global_i_initialized = 123;

int
main()
{
    static int function_i;
    static int function_i_initialized = 456;

    printf("%p\n", (void *)&global_i);
    printf("%p\n", (void *)&global_i_initialized);
    printf("%p\n", (void *)&function_i);
    printf("%p\n", (void *)&function_i_initialized);

    kill(0, SIGSTOP);

    return 0;
}
If you run this program, it will print some addresses and then stop itself. You can then examine its address space:
$ ./bla
0x6009a8
0x60099c
0x6009a4
0x600998
[1]+ Stopped ./bla
$ cat /proc/$(pgrep bla)/maps
00400000-00401000 r-xp 00000000 00:21 325277 /tmp/bla
00600000-00601000 rw-p 00000000 00:21 325277 /tmp/bla <---------
7f4ee52fa000-7f4ee5495000 r-xp 00000000 08:01 10227509 /usr/lib/libc-2.22.so
7f4ee5495000-7f4ee5694000 ---p 0019b000 08:01 10227509 /usr/lib/libc-2.22.so
7f4ee5694000-7f4ee5698000 r--p 0019a000 08:01 10227509 /usr/lib/libc-2.22.so
7f4ee5698000-7f4ee569a000 rw-p 0019e000 08:01 10227509 /usr/lib/libc-2.22.so
7f4ee569a000-7f4ee569e000 rw-p 00000000 00:00 0
7f4ee569e000-7f4ee56c0000 r-xp 00000000 08:01 10227099 /usr/lib/ld-2.22.so
7f4ee5888000-7f4ee588b000 rw-p 00000000 00:00 0
7f4ee58be000-7f4ee58bf000 rw-p 00000000 00:00 0
7f4ee58bf000-7f4ee58c0000 r--p 00021000 08:01 10227099 /usr/lib/ld-2.22.so
7f4ee58c0000-7f4ee58c1000 rw-p 00022000 08:01 10227099 /usr/lib/ld-2.22.so
7f4ee58c1000-7f4ee58c2000 rw-p 00000000 00:00 0
7ffe48b2f000-7ffe48b50000 rw-p 00000000 00:00 0 [stack]
7ffe48bd1000-7ffe48bd3000 r--p 00000000 00:00 0 [vvar]
7ffe48bd3000-7ffe48bd5000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
$ fg
./bla
As you can see, all four variables live in a memory area just above
.text which has been marked as readable and writable
(00600000-00601000). This is the answer to our first question. Yes,
these variables don’t live on the stack or heap or mmap. They get their
very own memory area, which remains in memory all the time.
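Instead of stopping the process and reading /proc/<pid>/maps by hand, a program can also look at its own mappings through /proc/self/maps. The following is only a rough sketch under that assumption (Linux with /proc mounted); mapping_perms is a made-up helper name, not an existing API.

```c
#include <stdint.h>
#include <stdio.h>

int global_i_initialized = 123;

/*
 * Hypothetical helper: finds the mapping that contains addr and copies
 * its permission string (e.g. "rw-p") into perms.
 * Returns 0 on success, -1 if no mapping was found or /proc is missing.
 */
int
mapping_perms(const void *addr, char perms[5])
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[512], p[5];
    unsigned long start, end, a = (unsigned long)(uintptr_t)addr;
    int found = -1;

    if (f == NULL)
        return -1;

    while (fgets(line, sizeof(line), f) != NULL) {
        /* Each line starts with "start-end perms ...". */
        if (sscanf(line, "%lx-%lx %4s", &start, &end, p) != 3)
            continue;
        if (a >= start && a < end) {
            snprintf(perms, 5, "%s", p);
            found = 0;
            break;
        }
    }

    fclose(f);
    return found;
}
```

Called with &global_i_initialized, this would be expected to report a readable and writable mapping, matching the marked line in the maps output above.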
So far, so good.
It makes a difference whether you initialize your variables or not. That’s why I introduced four different variables in the example above. They may end up in the same memory area, but they live in different sections. That’s a subtle difference which becomes more clear if we have a look at the assembly code:
$ gcc -Wall -Wextra -S -o bla.S bla.c
I won’t copy the full code here. Let’s just highlight the important sections and let’s look at initialized variables first because they’re a bit easier to understand.
At the top of the code, you can find this:
.data
.align 4
.type global_i_initialized, @object
.size global_i_initialized, 4
global_i_initialized:
.long 123
And at the bottom, there’s this:
.data
.align 4
.type function_i_initialized.2802, @object
.size function_i_initialized.2802, 4
function_i_initialized.2802:
.long 456
.data describes the corresponding “section” in the final ELF file.
Simply put: You can store data here and, at runtime, this section will
be loaded into memory someplace near .text. Pretty easy, huh? It’s
interesting to note that both variables end up in this section. This
means the only difference between them is their scope on C level. You
can’t access function_i_initialized outside of main (or whatever
function it was declared in) by name.
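The restriction really is about the name only. As a small sketch (the functions counter_address and overwrite_from_outside are hypothetical), a function can hand out the address of its static variable, and any other code can then read and write that memory:

```c
/*
 * Hypothetical helper: exposes the address of a function-local static
 * variable. The variable does not live on the stack, so the address
 * stays valid after the function has returned.
 */
int *
counter_address(void)
{
    static int function_i_initialized = 456;

    return &function_i_initialized;
}

/* Code outside of counter_address() cannot use the name, only the pointer. */
int
overwrite_from_outside(void)
{
    int *p = counter_address();

    *p = 789;
    return *counter_address();  /* reads back the value written above */
}
```

Before the write, *counter_address() is 456; afterwards every caller sees 789, because there is exactly one copy of the variable.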
Okay, now for uninitialized variables. At the top of the code, you can spot this:
.comm global_i,4,4
And at the bottom, there’s this:
.comm function_i.2801,4,4
It’s important to note that both variables are merely “declared”: they
get a size and alignment, but no value is stored for them in the binary.
Not even “0”. Sure, what would be the point of storing data? The
programmer has not initialized these variables, so there is nothing to
store; all that is needed is some space. (C does guarantee that such
variables start out as zero, but that is achieved by handing out
zero-filled memory at load time, not by storing zeros in the file.) And
that is what really happens: These “declarations” actually increase a
counter. Nothing more. The resulting size is then stored in your binary.
When you run your program, the loader takes care of allocating enough
memory. All of that happens in the .bss section.
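A quick way to see the effect is a large array without an initializer. This is a sketch with made-up names (big_buffer, buffer_is_zeroed); it relies on the C rule that objects with static storage duration start out zeroed.

```c
#include <stddef.h>

/* 1 MiB without an initializer: only the size is recorded in the binary. */
static char big_buffer[1 << 20];

/* Hypothetical check: returns 1 if every byte of big_buffer is zero. */
int
buffer_is_zeroed(void)
{
    size_t i;

    for (i = 0; i < sizeof(big_buffer); i++) {
        if (big_buffer[i] != 0)
            return 0;
    }
    return 1;
}
```

On a typical Linux/GCC setup, compiling this should not make the binary a megabyte larger; tools such as size(1) report the array under bss instead.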
Often, memory is referenced either directly via its address or
relative to the current stack frame. Usage of .data and .bss,
however, reveals another interesting addressing mode.
Take the following C code:
int
main()
{
    static int function_i;

    function_i = 5;

    return 0;
}
When looking at the resulting assembly code, you’ll find an instruction like this:
movl $5, function_i.2285(%rip)
This is known as “RIP-relative addressing”. The register RIP holds the
address of the next instruction to execute. Because .data and .bss are
loaded at a fixed distance from the .text segment, it’s very easy to
refer to variables in these sections by using an address in .text as a
basis.
Why would you do that? RIP-relative addressing makes it easier to produce position-independent code.
Beginners sometimes write code like this:
#include <stdio.h>

char *
foo(void)
{
    char s[] = "hello world";

    return s;
}

int
main()
{
    char *t;

    t = foo();
    printf("%s\n", t);

    return 0;
}
Luckily, these days, the compiler issues a big warning. The problem is
that you return the address of something that lives in the stack frame
of the function foo(). Once foo() has returned, t still refers to
the same memory – but we don’t know anymore what data is now available
at that location.
(The point is to understand that you only return a pointer and not a complete string.)
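If the caller is supposed to end up with a complete string of its own, the string has to be copied into storage that the caller owns. One common approach, sketched here with a hypothetical function foo_into and a hypothetical parameter c for the first character, is to let the caller pass in a buffer:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical variant of foo(): the caller provides the storage.
 * Returns buf on success or NULL if the buffer is too small.
 */
char *
foo_into(char *buf, size_t len, char c)
{
    static const char greeting[] = "hello world";

    if (len < sizeof(greeting))
        return NULL;

    /* sizeof includes the terminating '\0', so it is copied as well. */
    memcpy(buf, greeting, sizeof(greeting));
    buf[0] = c;

    return buf;
}
```

Two calls with two different buffers then produce two independent strings, and each one lives exactly as long as the buffer it was written to.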
Now, the solution is NOT to declare s static.
Take the following code:
#include <stdio.h>

char *
foo(char c)
{
    static char s[] = "hello world";

    s[0] = c;

    return s;
}

int
main()
{
    char *t, *u;

    t = foo('x');
    u = foo('y');
    printf("%s\n", t);
    printf("%s\n", u);

    return 0;
}
What do you expect will happen here? Doesn’t this solve your problem? Didn’t you read somewhere that “static variables won’t vanish when the function returns”? Plus, the compiler issues no warning! We’re fine!
No, we’re not.
What foo() returns is an address in the .data section. It returns
the same address on each invocation. My point is: static s does
NOT create a variable that just happens to live on when the function
returns. You might think that foo() would create a new “immortal”
variable on each call. No!
You only get one variable.
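This can be checked directly. The sketch below repeats foo() from above and adds a made-up helper, calls_share_storage, that compares the two returned pointers:

```c
#include <string.h>

char *
foo(char c)
{
    static char s[] = "hello world";  /* one single array for all calls */

    s[0] = c;
    return s;
}

/*
 * Hypothetical check: returns 1 if two calls to foo() hand out the
 * same address and the first result was overwritten by the second.
 */
int
calls_share_storage(void)
{
    char *t = foo('x');
    char *u = foo('y');

    return t == u && strcmp(t, "yello world") == 0;
}
```

Since t and u point to the same array, both printf calls in the program above print “yello world”.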
If global or static variables are initialized, their values are stored
in the .data section in your binary, otherwise their accumulated size is
stored as the size of the .bss section.
Please note that this has been only a quick overview. There’s still a lot to learn.