

Linux: Memory fragmentation and compaction

2017-12-23

Still in the process of analyzing the out-of-memory situations mentioned in an earlier posting.

Let’s have a quick look at memory fragmentation.

External fragmentation

Suppose you have linear memory and none of it is used. You slice it up into fixed size blocks. You can only allocate an entire block, not sub-parts of it. You then allocate three of the blocks. Finally, you free the one in the middle.

Looks like this:

1)    [ ----- ][ ----- ][ ----- ][ ----- ]    All empty
2)    [ - A - ][ - B - ][ - C - ][ ----- ]    A, B, C allocated
3)    [ - A - ][ ----- ][ - C - ][ ----- ]    B freed again

This diagram closely resembles the one in Wikipedia’s article on the topic.

Suppose this is all the memory you have. If you now want to allocate two contiguous blocks, you’re going to have a problem, because there are none left. Your memory is fragmented. So, even though a hypothetical program similar to free(1) would report “10 of 20 MB available”, an allocation request for 8 MB of contiguous memory would fail.

(Internal fragmentation happens when you don’t use all the memory that has been given to you. For example, block A above may only use one single byte. This kind of fragmentation does not concern us.)

If you were to move block C one block to the left, you could eliminate fragmentation. That’s what is called memory compaction.
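To make this more concrete, here's a minimal, purely illustrative C sketch – nothing like the kernel's actual implementation: memory is modeled as an array of four fixed-size blocks, an allocation of n contiguous blocks simply scans for a free run, and “compaction” shifts all used blocks to the left.

#include <stdio.h>

#define NBLOCKS 4

/* 0 = free, anything else is the owner of that block. */
static char blocks[NBLOCKS];

/* Look for n contiguous free blocks; return the start index or -1. */
static int
find_free_run(int n)
{
    int i, run = 0;

    for (i = 0; i < NBLOCKS; i++)
    {
        run = (blocks[i] == 0) ? run + 1 : 0;
        if (run == n)
            return i - n + 1;
    }
    return -1;
}

/* "Compaction": shift all used blocks to the left, free space to the right. */
static void
compact(void)
{
    int i, next = 0;

    for (i = 0; i < NBLOCKS; i++)
        if (blocks[i] != 0)
            blocks[next++] = blocks[i];
    while (next < NBLOCKS)
        blocks[next++] = 0;
}

int
main(void)
{
    /* The situation from step 3 above: A and C allocated, B freed again. */
    blocks[0] = 'A';
    blocks[2] = 'C';

    printf("before compaction: run of 2 starts at %d\n", find_free_run(2));
    compact();
    printf("after compaction:  run of 2 starts at %d\n", find_free_run(2));
    return 0;
}

Before compaction, the request for two contiguous blocks fails (-1); after it, it succeeds. In reality, the hard part is that whoever owns block C must not notice the move – which is also why, as we'll see below, not all memory can be moved.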

Examining fragmentation in Linux

/proc/buddyinfo tells you a little bit about your memory:

# cat /proc/buddyinfo
Node 0, zone      DMA      1      1      1      0      1      1      1      0      1      1      3
Node 0, zone    DMA32      2      2     10     17     12      4      2      2      1      2    466

(This is in a VM.)

In the DMA32 zone, there are 2 areas of order 0 available, then 2 of order 1, followed by 10 of order 2, then 17 of order 3, and so on. Finally, 466 areas of order 10. “Order” is the exponent n in 2^n pages, so a single order-10 area spans 2^10 · 4096 bytes = 4 MiB, and 466 of them amount to roughly 1.9 GB – which is most of the 2 GB I assigned to my VM.

So, virtually all of my memory is unused and there are lots of large contiguous areas.

Generally, you want large(r) numbers in the right columns of /proc/buddyinfo. When the numbers in the right columns drop or even approach zero, you're hitting fragmentation. Remember, if the number in the right-most column has dropped to zero, then there isn't a single free 4 MiB (!) area left. Still, this is not necessarily a problem, because oftentimes memory can be freed or maybe compacted.

For the purpose of this posting, we’re also interested in the total amount of memory described by each line of /proc/buddyinfo. The following trivial awk snippet called analyze does this job:

#!/usr/bin/awk -f

# Fields $5 through $15 hold the free-area counts for orders 0 through 10;
# each order-n area covers 2^n pages of 4096 bytes.
{
    sum = $5 * 2**0 * 4096 + \
          $6 * 2**1 * 4096 + \
          $7 * 2**2 * 4096 + \
          $8 * 2**3 * 4096 + \
          $9 * 2**4 * 4096 + \
          $10 * 2**5 * 4096 + \
          $11 * 2**6 * 4096 + \
          $12 * 2**7 * 4096 + \
          $13 * 2**8 * 4096 + \
          $14 * 2**9 * 4096 + \
          $15 * 2**10 * 4096
    printf "%7.2f MiB    ", sum / 1024 / 1024
    print
}
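Save it as analyze and make it executable (chmod +x analyze); it can then be run as ./analyze /proc/buddyinfo, which is exactly what the examples below do.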

Creating fragmentation

Let’s cause some artificial fragmentation in our VM. It’s only relevant for in-kernel memory, so we’re going to use a little kernel module again – fragment.c. I strongly advise you to do this in a VM, as it will deliberately leak kernel memory (leaving allocations behind that can never be freed).

#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>

#define N_MEM 100000
static char *mems[N_MEM];

/* Allocate N_MEM blocks of 16'384 bytes (four pages) each. */
static int
fragment_init(void)
{
    size_t i;

    for (i = 0; i < N_MEM; i++)
        mems[i] = kzalloc(16384, GFP_KERNEL);

    return 0;
}

/* On unload, free only every other block, leaving holes behind. */
static void
fragment_exit(void)
{
    size_t i;

    for (i = 0; i < N_MEM; i += 2)
        kfree(mems[i]);
}

module_init(fragment_init);
module_exit(fragment_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("nobody-cares");

Again, if you’re on Arch Linux, install base-devel and linux-headers and then you can build the module with the following Makefile (tabs for indentation):

obj-m := fragment.o

all:
    make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
    make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
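Running make in the directory containing fragment.c and this Makefile should produce fragment.ko, which we will load with insmod below.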

The code allocates 100'000 memory blocks of 16'384 bytes each – about 1.6 GB in total. It will fail horribly if you run out of memory. Moreover, it only frees every other block when the module gets unloaded. After unloading the module, you will want to reboot your VM.

This means we’re going to see a sequence roughly as follows:

1)    ________________    All memory unused
2)    ############____    Lots of memory used
3)    #_#_#_#_#_#_____    Half the memory freed again

I put the calls to kfree() in the exit function to make it easy to examine memory stats for each step.

Here’s what we see. Before doing anything, lots of memory is unused:

# free -m ; echo ; ./analyze /proc/buddyinfo
              total        used        free      shared  buff/cache   available
Mem:           2000          41        1826           0         133        1823
Swap:             0           0           0

  15.46 MiB    Node 0, zone      DMA      1      1      1      0      1      1      1      0      1      1      3
1809.91 MiB    Node 0, zone    DMA32    114    136    104     13     12      5      8     10      6      2    447

After loading the module, memory usage goes up. Notice how the number of order 10 areas drops significantly:

# insmod module/fragment.ko
# free -m ; echo ; ./analyze /proc/buddyinfo
              total        used        free      shared  buff/cache   available
Mem:           2000        1604         261           0         134         260
Swap:             0           0           0

  15.46 MiB    Node 0, zone      DMA      1      1      1      0      1      1      1      0      1      1      3
 246.31 MiB    Node 0, zone    DMA32     60     94     92     13     10      5      9     10      5      3     56

Now let’s unload it:

# rmmod fragment
# free -m ; echo ; ./analyze /proc/buddyinfo
              total        used        free      shared  buff/cache   available
Mem:           2000         822        1043           0         134        1042
Swap:             0           0           0

  15.46 MiB    Node 0, zone      DMA      1      1      1      0      1      1      1      0      1      1      3
1028.15 MiB    Node 0, zone    DMA32    116    145  50098     13     10      4      9     10      5      3     56

Almost 800 MiB have been freed, which is what we expected: 50'000 blocks of 16'384 bytes each. We also now see a large number of order-2 areas – those are exactly the blocks we freed. But the counts for the higher orders have barely changed, so freeing did not produce any larger contiguous areas. Thus, our memory is successfully fragmented.

Trying to invoke memory compaction

You can ask the kernel to compact memory. Let’s do this:

# echo 1 >/proc/sys/vm/compact_memory
# free -m ; echo ; ./analyze /proc/buddyinfo
              total        used        free      shared  buff/cache   available
Mem:           2000         822        1042           0         135        1042
Swap:             0           0           0

  15.46 MiB    Node 0, zone      DMA      1      1      1      0      1      1      1      0      1      1      3
1027.19 MiB    Node 0, zone    DMA32     48      8  50004     12      4      2      5      3      6      2     58

Oops.

Yes, areas have been moved around a bit. But just a tiny little bit. This could also be explained by regular background activity on the system.

Turns out, not all memory is movable. If it’s not movable, compaction won’t do anything. This LWN article from 2010 says that “most memory used by the kernel directly cannot be moved”.

Another LWN article sheds some more light on this topic. You actually have to go through some hoops to make memory movable. Maybe I’ll do this in a later post.

Successfully compacting memory

Does memory compaction work at all when triggered by that echo call? Or am I doing something very wrong here?

Let’s try to create fragmentation via userspace processes. The assumption is that memory allocated to processes is movable – why shouldn’t it be? It can even be swapped out to disk.

Fragmenting memory in this way is not so easy. The following appears to work:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char *p;
    size_t i;

    /* Remember where the heap currently ends. */
    p = sbrk(0);

    for (i = 0; i < 200000; i++)
    {
        /* Grow the heap by one page and touch it, so that it is
           actually backed by physical memory. */
        sbrk(4096);
        memset(p + i * 4096, 'a', 4096);
    }

    puts("Allocated. Press Enter to quit.");
    getchar();
    return 0;
}
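(Assuming the source file is called userspace_fragment.c – that name is my choice – something like cc -o userspace_fragment userspace_fragment.c is enough to build it.)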

A program that allocates 200'000 · 4096 bytes (roughly 780 MiB), calling sbrk() directly to bypass libc’s own memory allocation optimizations. Yes, it allocates memory in small chunks, but that’s not the main point – it only buys us time. What’s more important is to run this program twice in parallel:

# ./userspace_fragment & ./userspace_fragment &

(As background jobs, the programs will receive a SIGTTIN when they try to read from the terminal, which just stops them. This is fine.)

Hopefully the memory layout will then look something like this:

11212221121212121222212212

That is, some chunks are allocated to process number one, followed by some chunks of process number two, then process number one again, and so on. Terminating one of the processes should lead to fragmentation. Let’s see.

First, both processes are still running:

# ./analyze /proc/buddyinfo
  15.46 MiB    Node 0, zone      DMA      1      1      1      0      1      1      1      0      1      1      3
 198.29 MiB    Node 0, zone    DMA32    146    138    167    127     67      3      2      2      2      3     44

So far, so good – nearly all memory is allocated. Let’s terminate one of the processes:

# kill %1
[1]-  Terminated              userspace_fragment
# ./analyze /proc/buddyinfo
  15.46 MiB    Node 0, zone      DMA      1      1      1      0      1      1      1      0      1      1      3
 980.84 MiB    Node 0, zone    DMA32   6446   6410   6335   6271   6214    136     21      4      4      3     47

Good! About 800 MiB have been freed, and we see considerable fragmentation: lots of free areas of low orders, but hardly any new ones of higher orders. Let’s invoke compaction:

# echo 1 >/proc/sys/vm/compact_memory
# ./analyze /proc/buddyinfo
  15.46 MiB    Node 0, zone      DMA      1      1      1      0      1      1      1      0      1      1      3
 980.19 MiB    Node 0, zone    DMA32    142    143    117     96     61     21     21     14     13     13    229

There you go. If memory is movable, compaction will do its job.

Conclusion

Should you install a cronjob to invoke compaction every minute? No. If the need arises, the kernel will do this automatically.

In other words, manually running that echo 1 >/proc/sys/vm/compact_memory is probably pointless. Yes, it might compact memory, but there may be no need to do so. And it won’t help with unmovable kernel memory anyway. If one of your kernel modules runs amok and creates fragmentation, there’s probably nothing you can do about that – compaction, at least, won’t help.

Comments?