2017-12-23
Still in the process of analyzing the out of memory situations mentioned in an earlier posting.
Let’s have a quick look at memory fragmentation.
Suppose you have linear memory and none of it is used. You slice it up into fixed size blocks. You can only allocate an entire block, not sub-parts of it. You then allocate three of the blocks. Finally, you free the one in the middle.
Looks like this:
1) [ ----- ][ ----- ][ ----- ][ ----- ] All empty
2) [ - A - ][ - B - ][ - C - ][ ----- ] A, B, C allocated
3) [ - A - ][ ----- ][ - C - ][ ----- ] B freed again
This diagram closely resembles the one in Wikipedia’s article on the topic.
Suppose this is all the memory you have. If you now want to allocate two contiguous blocks, you’re going to have a problem, because there are none left. Your memory is fragmented. So, even though a hypothetical program similar to free(1) would report “10 of 20 MB available”, an allocation request for 8 MB of contiguous memory would fail.
(Internal fragmentation happens when you don’t use all the memory that has been given to you. For example, block A above may only use one single byte. This kind of fragmentation does not concern us.)
If you were to move block C one block to the left, you could eliminate fragmentation. That’s what is called memory compaction.
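To make the diagram a bit more concrete, here’s a tiny toy allocator in userspace – four fixed-size blocks, a search for contiguous free blocks, and a hand-rolled “compaction” step. This is only an illustration of the model above, not of how the kernel actually manages its pages:

#include <stdio.h>

#define N_BLOCKS 4

/* 0 = free, otherwise the "name" of the owner */
static char blocks[N_BLOCKS];

/* Return the start index of `want` contiguous free blocks, or -1. */
static int
find_contiguous(int want)
{
    int i, run = 0;

    for (i = 0; i < N_BLOCKS; i++)
    {
        run = (blocks[i] == 0) ? run + 1 : 0;
        if (run == want)
            return i - want + 1;
    }
    return -1;
}

int
main()
{
    blocks[0] = 'A';
    blocks[1] = 'B';
    blocks[2] = 'C';

    blocks[1] = 0;  /* free B again */

    /* Half of the memory is free, but no two free blocks are adjacent. */
    printf("before compaction: %d\n", find_contiguous(2));  /* prints -1 */

    /* "Compaction": move C one block to the left. */
    blocks[1] = blocks[2];
    blocks[2] = 0;
    printf("after compaction:  %d\n", find_contiguous(2));  /* prints 2 */

    return 0;
}

The first search fails even though half of the blocks are free; after moving C, it succeeds.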
/proc/buddyinfo tells you a little bit about your memory:
# cat /proc/buddyinfo
Node 0, zone DMA 1 1 1 0 1 1 1 0 1 1 3
Node 0, zone DMA32 2 2 10 17 12 4 2 2 1 2 466
(This is in a VM.)
In the DMA32 zone, there are 2 areas of order 0 available, then 2 of order 1, followed by 10 of order 2, then 17 of order 3, and so on. Finally, 466 areas of order 10. “Order” is the exponent n in 2^n: an area of order n consists of 2^n contiguous pages. Each of those 466 areas of order 10 thus spans 2^10 · 4096 bytes = 4 MiB, and together they account for roughly the 2 GB I assigned to my VM.
So, virtually all of my memory is unused and there are lots of large contiguous areas.
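If the order arithmetic is hard to picture, here’s a throwaway snippet (assuming the usual 4096-byte page size) that prints the area size per column and redoes the math for those 466 order-10 areas:

#include <stdio.h>

int
main()
{
    int order;

    /* Column n of /proc/buddyinfo counts free areas of 2^n pages. */
    for (order = 0; order <= 10; order++)
        printf("order %2d = %4d pages = %5d KiB\n",
               order, 1 << order, (1 << order) * 4096 / 1024);

    /* The 466 areas of order 10 from the output above: */
    printf("466 areas of order 10 = %.2f GiB\n",
           466.0 * (1 << 10) * 4096 / (1024 * 1024 * 1024));

    return 0;
}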
Generally, you want large(r) numbers in the right columns of /proc/buddyinfo. When the numbers in the right columns drop or even approach zero, you’re hitting fragmentation. Remember: if the number in the right-most column has dropped to zero, then there is not even a single 4 MiB (!) area left. Still, this is not necessarily a problem, because oftentimes memory can be freed or maybe compacted.
For the purpose of this posting, we’re also interested in the total amount of memory described by each line of /proc/buddyinfo. The following trivial awk snippet called analyze does this job:
#!/usr/bin/awk -f
{
    # Fields 5 through 15 are the per-order counts; the first four
    # fields are "Node", "0,", "zone" and the zone name.
    sum = $5  * 2**0  * 4096 + \
          $6  * 2**1  * 4096 + \
          $7  * 2**2  * 4096 + \
          $8  * 2**3  * 4096 + \
          $9  * 2**4  * 4096 + \
          $10 * 2**5  * 4096 + \
          $11 * 2**6  * 4096 + \
          $12 * 2**7  * 4096 + \
          $13 * 2**8  * 4096 + \
          $14 * 2**9  * 4096 + \
          $15 * 2**10 * 4096

    printf "%7.2f MiB ", sum / 1024 / 1024
    print
}
Let’s cause some artificial fragmentation in our VM. Fragmentation of physical memory is mainly relevant for in-kernel allocations, so we’re going to use a little kernel module again – fragment.c. I strongly advise you to do this in a VM, as it will leak kernel memory (leaving unfreeable pages behind).
#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>

#define N_MEM 100000

static char *mems[N_MEM];

static int
fragment_init(void)
{
	size_t i;

	/* Allocate 100'000 blocks of 16 KiB each. */
	for (i = 0; i < N_MEM; i++)
		mems[i] = kzalloc(16384, GFP_KERNEL);

	return 0;
}

static void
fragment_exit(void)
{
	size_t i;

	/* Free only every other block -- the rest stays allocated forever. */
	for (i = 0; i < N_MEM; i += 2)
		kfree(mems[i]);
}

module_init(fragment_init);
module_exit(fragment_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("nobody-cares");
Again, if you’re on Arch Linux, install base-devel and linux-headers, and then you can build the module with the following Makefile (tabs for indentation):
obj-m := fragment.o

all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
The code allocates 100'000 memory blocks of size 16'384 bytes. It will fail horribly if you run out of memory. Moreover, it will only free every other block when the module gets unloaded. After you unloaded the module, you will want to reboot your VM.
This means, we’re going to see a sequence roughly as follows:
1) ________________ All memory unused
2) ############____ Lots of memory used
3) #_#_#_#_#_#_____ Half the memory freed again
I put the calls to kfree() in the exit function to make it easy to examine the memory stats for each step.
Here’s what we see. Before doing anything, lots of memory is unused:
# free -m ; echo ; ./analyze /proc/buddyinfo
total used free shared buff/cache available
Mem: 2000 41 1826 0 133 1823
Swap: 0 0 0
15.46 MiB Node 0, zone DMA 1 1 1 0 1 1 1 0 1 1 3
1809.91 MiB Node 0, zone DMA32 114 136 104 13 12 5 8 10 6 2 447
After loading the module, memory usage goes up. Notice how the number of order 10 areas drops significantly:
# insmod module/fragment.ko
# free -m ; echo ; ./analyze /proc/buddyinfo
total used free shared buff/cache available
Mem: 2000 1604 261 0 134 260
Swap: 0 0 0
15.46 MiB Node 0, zone DMA 1 1 1 0 1 1 1 0 1 1 3
246.31 MiB Node 0, zone DMA32 60 94 92 13 10 5 9 10 5 3 56
Now let’s unload it:
# rmmod fragment
# free -m ; echo ; ./analyze /proc/buddyinfo
total used free shared buff/cache available
Mem: 2000 822 1043 0 134 1042
Swap: 0 0 0
15.46 MiB Node 0, zone DMA 1 1 1 0 1 1 1 0 1 1 3
1028.15 MiB Node 0, zone DMA32 116 145 50098 13 10 4 9 10 5 3 56
Almost 800 MiB have been freed, which is what we expected. We also now see a huge number of order 2 areas – the ones of size 16'384 bytes we just freed. But the counts for the higher orders have not gone up: the freed blocks cannot be merged into larger areas, because their neighbors are still allocated. Thus, our memory is successfully fragmented.
You can ask the kernel to compact memory. Let’s do this:
# echo 1 >/proc/sys/vm/compact_memory
# free -m ; echo ; ./analyze /proc/buddyinfo
total used free shared buff/cache available
Mem: 2000 822 1042 0 135 1042
Swap: 0 0 0
15.46 MiB Node 0, zone DMA 1 1 1 0 1 1 1 0 1 1 3
1027.19 MiB Node 0, zone DMA32 48 8 50004 12 4 2 5 3 6 2 58
Oops.
Yes, areas have been moved around a bit. But just a tiny little bit. This could also be explained by regular background activity on the system.
Turns out, not all memory is movable. If it’s not movable, compaction won’t do anything. This LWN article from 2010 says that “most memory used by the kernel directly cannot be moved”.
Another LWN article sheds some more light on this topic. You actually have to go through some hoops to make memory movable. Maybe I’ll do this in a later post.
Does memory compaction even work by running that echo call? Or am I doing something very wrong here?
Let’s try to create fragmentation via userspace processes. The assumption is that memory allocated to processes is movable – why shouldn’t it be, it can even be swapped out to disk.
Fragmenting memory in this way is not so easy. The following appears to work:
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main()
{
    char *p;
    size_t i;

    /* Grow the heap one page at a time and touch each new page so
     * that it is actually backed by physical memory. */
    p = sbrk(0);
    for (i = 0; i < 200000; i++)
    {
        sbrk(4096);
        memset(p + i * 4096, 'a', 4096);
    }

    puts("Allocated. Press Enter to quit.");
    getchar();
    return 0;
}
A program that allocates 200'000 · 4096 bytes while trying to bypass libc’s own memory allocation optimizations. Yes, it allocates memory in small chunks, but that’s not the main point – it only buys us time. It’s more important to run this program twice in parallel:
# ./userspace_fragment & ./userspace_fragment &
(While trying to read from stdin, the program will get a SIGTTIN, which just pauses the process. This is fine.)
Hopefully the memory layout will then look something like this:
11212221121212121222212212
That is, some chunks are allocated to process number one, followed by some chunks of process number two, then process number one again, and so on. Terminating one of the processes should lead to fragmentation. Let’s see.
First, both processes are still running:
# ./analyze /proc/buddyinfo
15.46 MiB Node 0, zone DMA 1 1 1 0 1 1 1 0 1 1 3
198.29 MiB Node 0, zone DMA32 146 138 167 127 67 3 2 2 2 3 44
So far, so good. All memory allocated. Let’s terminate one of the processes:
# kill %1
[1]- Terminated userspace_fragment
# ./analyze /proc/buddyinfo
15.46 MiB Node 0, zone DMA 1 1 1 0 1 1 1 0 1 1 3
980.84 MiB Node 0, zone DMA32 6446 6410 6335 6271 6214 136 21 4 4 3 47
Good! About 800 MiB have been freed and we see quite some fragmentation. Let’s invoke the compaction:
# echo 1 >/proc/sys/vm/compact_memory
# ./analyze /proc/buddyinfo
15.46 MiB Node 0, zone DMA 1 1 1 0 1 1 1 0 1 1 3
980.19 MiB Node 0, zone DMA32 142 143 117 96 61 21 21 14 13 13 229
There you go. If memory is movable, compaction will do its job.
Should you install a cronjob to invoke compaction every minute? No. If the need arises, the kernel will do this automatically.
In other words, running that echo 1 >/proc/sys/vm/compact_memory manually is probably pretty much useless. Yes, it might compact memory, but there may be no need to do so. And it won’t help you with special unmovable kernel memory anyway. If one of your kernel modules runs amok and creates fragmentation, there’s probably nothing you can do about that. At least compaction won’t help.