Redis: fork – Cannot allocate memory, Linux, virtual memory and vm.overcommit_memory

By | 08/28/2019
 

Currently, I’m configuring Redis as a caching service for our application, and during that I faced the question: do I need to set vm.overcommit_memory to the value 1, i.e. disable it – or not?

The question has been on my mind for quite a while, see The story below, but only now have I found the time to get to the real root of it, put everything together and write the following post.

It was originally posted in Russian and this is a copy translated by myself. As there is really a lot of text, I hope I didn’t confuse anything during translation. If something looks wrong – please feel free to select the text with the mouse and press Shift+Enter to send me a notification.

So, the problem itself is that the Redis documentation and almost every HowTo/guide about Redis performance carelessly tells us to disable the Linux overcommit checks by setting vm.overcommit_memory to 1, especially as a solution for the “fork – Cannot allocate memory” error.

In this post, we will try to figure out what exactly overcommit_memory is, where and how it is used, and whether we really need to change it in our current case, i.e. when Redis will be used for caching only.

Why is overcommitting bad?

As this post was originally in Russian, where this part was translated, I’ll just quote a part of the English original here. Read the full story – What is Overcommit? And why is it bad?

Overcommit refers to the practice of giving out virtual memory with no guarantee that physical storage for it exists. To make an analogy, it’s like using a credit card and not keeping track of your purchases. A system performing overcommit just keeps giving out virtual memory until the debt collector comes calling — that is, until some program touches a previously-untouched page, and the kernel fails to find any physical memory to instantiate it — and then stuff starts crashing down.

What happens when “stuff starts crashing down”? It can vary, but the Linux approach was to design an elaborate heuristic “OOM killer” in the kernel that judges the behavior of each process and decides who’s “guilty” of making the machine run out of memory, then kills the guilty parties. In practice this works fairly well from a standpoint of avoiding killing critical system processes and killing the process that’s “hogging” memory, but the problem is that no process is really “guilty” of using more memory than was available, because everyone was (incorrectly) told that the memory was available.

Suppose you don’t want this kind of uncertainty/danger when it comes to memory allocation? The naive solution would be to immediately and statically allocate physical memory corresponding to all virtual memory. To extend the credit card analogy, this would be like using cash for all your purchases, or like using a debit card. You get the safety from overspending, but you also lose a lot of fluidity. Thankfully, there’s a better way to manage memory.

The approach taken in reality when you want to avoid committing too much memory is to account for all the memory that’s allocated. In our credit card analogy, this corresponds to using a credit card, but keeping track of all the purchases on it, and never purchasing more than you have funds to pay off. This turns out to be the Right Thing when it comes to managing virtual memory, and in fact it’s what Linux does when you set the vm.overcommit_memory sysctl parameter to the value 2. In this mode, all virtual memory that could potentially be modified (i.e. has read-write permissions) or lacks backing (i.e. an original copy on disk or other device that it could be restored from if it needs to be discarded) is accounted for as “commit charge”, the amount of memory the kernel has committed/promised to applications. When a new virtual memory allocation would cause the commit charge to exceed a configurable limit (by default, the size of swap plus half the size of physical ram), the allocation fails.
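To put some numbers on that last point (my own illustration, not part of the quoted text): on a host with 4 GB of RAM, 2 GB of swap and the default vm.overcommit_ratio = 50, the commit limit under overcommit_memory = 2 would be 2 GB + 4 GB * 50 / 100 = 4 GB, so an allocation that would push the total commit charge above those 4 GB fails immediately instead of blowing up later under the OOM Killer.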

Redis persistence

Redis uses two mechanisms to achieve data persistence: RDB snapshotting (a point-in-time snapshot), which copies the data from memory to disk, and AOF, which constantly appends every single write operation performed by the server to a log. See more in the documentation – Redis Persistence.

The overcommit_memory setting steps in when Redis dumps a data snapshot from memory to the disk, specifically during the BGSAVE and BGREWRITEAOF commands execution.

Below we will concentrate on the BGSAVE command, during which Redis creates a child process that copies the data to the disk.

Redis save, SAVE and BGSAVE

Redis itself may be a bit confusing here: in its configuration file the save option is responsible for the BGSAVE operation.

However, Redis also has the SAVE command but it works differently:

  • SAVE is a synchronous command – it is executed by the main Redis process and blocks it (and thus all connected clients) while the copy is being created
  • BGSAVE, in its turn, is an asynchronous mechanism – it works in parallel with the main server process and doesn’t affect its operations and connected clients, thus it is the preferable way to create a backup

But in the case when BGSAVE cannot be used, for example because of the “Can’t save in background: fork: Cannot allocate memory” error – one can use the SAVE command instead.

To check this, let’s use the strace tool.

Create a test config file redis-testing.conf:

save 1 1
port 7777

Run strace and redis-server using this config:

[simterm]

root@bttrm-dev-console:/home/admin# strace -o redis-trace.log redis-server redis-testing.conf

[/simterm]

strace will write its output to the redis-trace.log file which we will check to find system calls used by the redis-server during the SAVE and BGSAVE operations:

[simterm]

root@bttrm-dev-console:/home/admin# tail -f redis-trace.log | grep -v 'gettimeofday\|close\|open\|epoll_wait\|getpid\|read\|write'

[/simterm]

Here, with grep -v, we filtered out the “garbage” calls which we don’t need now.

We could use -e trace= to grab only necessary calls – but we don’t know yet what exactly we are looking for.

In the Redis configuration file, we set port 7777 and save 1 1, i.e. create a database copy on the disk every second if at least one key was changed.

Add a new key:

[simterm]

admin@bttrm-dev-console:~$ redis-cli -p 7777 set test test
OK

[/simterm]

And check the strace log:

[simterm]

root@bttrm-dev-console:/home/admin# tail -f redis-trace.log | grep -v 'gettimeofday\|close\|open\|epoll_wait\|getpid\|read\|write'
accept(5, {sa_family=AF_INET, sin_port=htons(60816), sin_addr=inet_addr("127.0.0.1")}, [128->16]) = 6
...
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2097, ...}) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7ff26beda190) = 1790
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2097, ...}) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1790, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 1790
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2097, ...}) = 0

[/simterm]

Here is the clone() call (why clone() instead of fork() – we will discuss a bit later, in the fork() vs fork() vs clone() section). This clone() creates a new child process, which in its turn will create the data copy.

Now – run the SAVE command:

[simterm]

admin@bttrm-dev-console:~$ redis-cli -p 7777 save
OK

[/simterm]

And check the log:

[simterm]

accept(5, {sa_family=AF_INET, sin_port=htons(32870), sin_addr=inet_addr("127.0.0.1")}, [128->16]) = 6
...
rename("temp-1652.rdb", "dump.rdb")     = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2097, ...}) = 0
epoll_ctl(3, EPOLL_CTL_DEL, 6, 0x7ffe6712430c) = 0

[/simterm]

No clone() this time – the dump was performed by the main Redis process and saved to the dump.rdb file – check the rename("temp-1652.rdb", "dump.rdb") line in the strace output (we will see shortly where this name came from – temp-1652.rdb).

Now call the BGSAVE:

[simterm]

admin@bttrm-dev-console:~$ redis-cli -p 7777 bgsave
Background saving started

[/simterm]

And check the log again:

[simterm]

accept(5, {sa_family=AF_INET, sin_port=htons(33030), sin_addr=inet_addr("127.0.0.1")}, [128->16]) = 6
...
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7ff26beda190) = 1879
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2097, ...}) = 0
epoll_ctl(3, EPOLL_CTL_DEL, 6, 0x7ffe6712430c) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1879, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 1879
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2097, ...}) = 0

[/simterm]

And again our clone() is here which spawned another child process with PID 1879:

[simterm]

...
clone([...]) = 1879
...

[/simterm]

Redis rdbSave() and rdbSaveBackground() functions 

The dump itself is created by a single Redis function – rdbSave():

...
/* Save the DB on disk. Return C_ERR on error, C_OK on success. */
int rdbSave(char *filename, rdbSaveInfo *rsi) {
    ...
    snprintf(tmpfile,256,"temp-%d.rdb", (int) getpid());
    fp = fopen(tmpfile,"w");
...

It is called when you execute the redis-cli -p 7777 SAVE command.

And here is our temp-1652.rdb file name from the strace output above:

...
snprintf(tmpfile,256,"temp-%d.rdb", (int) getpid());
...

Where 1652 is the Redis server’s main process PID.

For its part, during the BGSAVE command another function is called – rdbSaveBackground():

...
int rdbSaveBackground(char *filename, rdbSaveInfo *rsi) {
    ...
    start = ustime();
    if ((childpid = fork()) == 0) {
        int retval;

        /* Child */
        closeListeningSockets(0);
        redisSetProcTitle("redis-rdb-bgsave");
        retval = rdbSave(filename,rsi);
...

Which in its turn creates a new child process:

...
if ((childpid = fork()) == 0)
...

And this process in its turn will execute the rdbSave():

...
retval = rdbSave(filename,rsi);
...

fork() vs fork() vs clone()

Now, let’s go back to the question – why do we see the clone() syscall in the strace output instead of the fork() which is called by the rdbSaveBackground() function?

Well, that’s just because fork() != fork():

  1. there is the Linux-kernel fork() syscall
  2. and there is also a glibc‘s fork() function which is a wrapper around the clone() syscall

Check them using the apropos tool:

[simterm]

[setevoy@setevoy-arch-work ~/Temp/redis] [unstable*] $ apropos fork
fork (2)             - create a child process
fork (3am)           - basic process management
fork (3p)            - create a new process

[/simterm]

So, fork(2) is the system call, whereas fork(3p) describes the library function, which in glibc is implemented here – https://github.com/bminor/glibc/blob/master/sysdeps/nptl/fork.c#L48.

Now, Read the Following Manual 🙂 – open the man 2 fork:

[simterm]

[setevoy@setevoy-arch-work ~/Temp/redis] [unstable*] $ man 2 fork | grep -A 5 NOTES
NOTES
       Under Linux, fork() is implemented using copy-on-write pages, so the only penalty that it incurs is the time and memory required to duplicate the parent's page tables, and to create a unique task structure for the child.

   C library/kernel differences
       Since version 2.3.3, rather than invoking the kernel's fork() system call, the glibc fork() wrapper that is provided as part of the NPTL threading implementation invokes clone(2) with flags that provide the same effect as the
       traditional system call.  (A call to fork() is equivalent to a call to clone(2) specifying flags as just SIGCHLD.)  The glibc wrapper invokes any fork handlers that have been established using pthread_atfork(3).

[/simterm]

rather than invoking the kernel’s fork() system call, the glibc fork() wrapper […] invokes clone(2)

Consequently, when rdbSaveBackground() executes fork() – it uses not the fork(2) syscall but the glibc fork(), which in its turn is an alias for __libc_fork():

...
weak_alias (__libc_fork, __fork)
libc_hidden_def (__fork)
weak_alias (__libc_fork, fork)

And inside of the __libc_fork() the “magic” itself happens by calling the arch_fork() macro:

...
pid = arch_fork (&THREAD_SELF->tid);
...

Find it in the glibc source code just by grep-ing for it:

[simterm]

[setevoy@setevoy-arch-work ~/Temp/glibc] [master*] $ grep -r arch_fork . 
./ChangeLog:    (arch_fork): Issue INLINE_CLONE_SYSCALL if defined.
./ChangeLog:    * sysdeps/nptl/fork.c (ARCH_FORK): Replace by arch_fork.
./ChangeLog:    * sysdeps/unix/sysv/linux/arch-fork.h (arch_fork): New function.
./sysdeps/unix/sysv/linux/arch-fork.h:/* arch_fork definition for Linux fork implementation.
./sysdeps/unix/sysv/linux/arch-fork.h:arch_fork (void *ctid)
./sysdeps/nptl/fork.c:  pid = arch_fork (&THREAD_SELF->tid);

[/simterm]

The arch_fork() is defined in the sysdeps/unix/sysv/linux/arch-fork.h file, which in its turn will call clone():

...
ret = INLINE_SYSCALL_CALL (clone, flags, 0, NULL, 0, ctid);
...

Which is exactly what we see in the strace log.

To check that this is really so, and that we are really using the glibc fork() and not the system call – let’s write a small C program based on the official GNU documentation:

#include <unistd.h>
#include <sys/wait.h>
#include <stdio.h>

int main () {
  pid_t pid;
  pid = fork ();
  if (pid == 0) {
      printf("Child created\n");
      sleep(100);
  }
}

Here, in the pid = fork() line, we call fork() in the same way as it is done by the rdbSaveBackground() function.

Then let’s use ltrace to trace library calls (unlike strace, which is used to trace system calls):

[simterm]

$ ltrace -C -f ./test_fork_lib
[pid 5530] fork( <unfinished ...>
[pid 5531] <... fork resumed> )                                                                                                                    = 0
[pid 5530] <... fork resumed> )                                                                                                                    = 5531
[pid 5531] puts("Child created" <no return ...>
[pid 5530] +++ exited (status 0) +++
Child created
[pid 5531] <... puts resumed> )                                                                                                                    = 14
[pid 5531] sleep(100)                                                                                                                              = 0

[/simterm]

And by using the lsof tool – find all the files opened by our process:

[simterm]

[setevoy@setevoy-arch-work ~/Temp/glibc] [master*] $ lsof -p 5531
COMMAND    PID    USER   FD   TYPE DEVICE SIZE/OFF    NODE NAME
test_fork 5531 setevoy  cwd    DIR  254,3     4096 4854992 /home/setevoy/Temp/glibc
test_fork 5531 setevoy  rtd    DIR  254,2     4096       2 /
test_fork 5531 setevoy  txt    REG  254,3    16648 4855715 /home/setevoy/Temp/glibc/test_fork_lib
test_fork 5531 setevoy  mem    REG  254,2  2133648  396251 /usr/lib/libc-2.29.so
...

[/simterm]

Or, by using ldd – check which libraries are used to make our code work:

[simterm]

[setevoy@setevoy-arch-work ~/Temp/glibc] [master*] $ ldd test_fork_lib
        ...
        libc.so.6 => /usr/lib/libc.so.6 (0x00007f26ba77f000)

[/simterm]

libc-2.29.so is taken from the glibc package:

[simterm]

[setevoy@setevoy-arch-work ~/Temp/glibc] [master*] $ pacman -Ql glibc | grep libc-2.29.so
glibc /usr/lib/libc-2.29.so

[/simterm]

Another way to check functions in a library file is to use the objdump tool:

[simterm]

[setevoy@setevoy-arch-work ~/Temp/linux] [master*] $ objdump -T /usr/lib/libc.so.6 | grep fork
00000000000c93c0 g    DF .text  00000000000001fe  GLIBC_PRIVATE __libc_fork

[/simterm]

__libc_fork – here is our function in the .text section (see more in the posts Linux: C – the process address space and C: creating and using a shared library in Linux, both in Russian).
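For completeness – one can also bypass the glibc wrapper and invoke the kernel’s fork(2) directly via syscall(). Below is a minimal sketch of my own (not from the Redis code); on x86_64, where the SYS_fork number exists, tracing this binary with strace should show a fork() call rather than clone():

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <stdio.h>

int main(void) {
    /* call the kernel's fork(2) directly, bypassing the glibc wrapper */
    long pid = syscall(SYS_fork);

    if (pid == 0) {
        /* child: stay alive for a while so it can be inspected */
        printf("Child created\n");
        sleep(100);
    } else {
        printf("Parent: child PID = %ld\n", pid);
    }
    return 0;
}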

Redis – fork: Cannot allocate memory – the cause

Again, as the text below was originally translated for the Russian post – I’ll just copy-paste the original documentation here. Read the full documentation – Background saving fails with a fork() error under Linux even if I have a lot of free RAM.

Redis background saving schema relies on the copy-on-write semantic of fork in modern operating systems: Redis forks (creates a child process) that is an exact copy of the parent. The child process dumps the DB on disk and finally exits. In theory the child should use as much memory as the parent being a copy, but actually thanks to the copy-on-write semantic implemented by most modern operating systems the parent and child process will share the common memory pages. A page will be duplicated only when it changes in the child or in the parent. Since in theory all the pages may change while the child process is saving, Linux can’t tell in advance how much memory the child will take, so if the overcommit_memory setting is set to zero fork will fail unless there is as much free RAM as required to really duplicate all the parent memory pages, with the result that if you have a Redis dataset of 3 GB and just 2 GB of free memory it will fail.

Setting overcommit_memory to 1 tells Linux to relax and perform the fork in a more optimistic allocation fashion, and this is indeed what you want for Redis.

The overcommit_memory values

vm.overcommit_memory can contain one of the following three values (a small sketch for checking the current settings follows the list):

  • 0: the kernel may allocate more virtual memory than the server physically has, but relies on the “heuristic algorithm” (heuristic overcommit handling) to decide whether to approve or refuse each allocation request from a process
  • 1: the kernel will always overcommit, which can lead to more Out of memory errors, but may be good for services which actively use memory
  • 2: the kernel will not overcommit beyond the limit defined by the overcommit_ratio or overcommit_kbytes parameters
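To check what is currently set on a host, one can simply read the corresponding files under /proc/sys/vm. A small C sketch of my own (the paths are the standard sysctl locations; overcommit_kbytes is present on kernels 3.14 and newer):

#include <stdio.h>

/* print the single-line contents of a /proc file */
static void print_proc(const char *path) {
    char buf[64];
    FILE *f = fopen(path, "r");

    if (f == NULL) {
        perror(path);
        return;
    }
    if (fgets(buf, sizeof(buf), f) != NULL)
        printf("%s = %s", path, buf);
    fclose(f);
}

int main(void) {
    print_proc("/proc/sys/vm/overcommit_memory");
    print_proc("/proc/sys/vm/overcommit_ratio");
    print_proc("/proc/sys/vm/overcommit_kbytes");
    return 0;
}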

The famous “heuristic algorithm” (Heuristic Overcommit handling)

In most documentation/guides/howtos this algorithm is only mentioned, and it was not so easy to find a detailed description to understand how it actually works.

As usual – “Just read the source!”

The overcommit check is performed by the helper function __vm_enough_memory() from the memory management subsystem, which is defined in the kernel’s mm/util.c file.

This function accepts the number of pages requested by a process to be allocated, and then it will (a rough user-space model of these steps is shown after the list):

  1. if overcommit_memory == 1 (if (sysctl_overcommit_memory == OVERCOMMIT_ALWAYS)):
    1. return 0 and allow the overcommit
  2. if overcommit_memory == 0 (if (sysctl_overcommit_memory == OVERCOMMIT_GUESS); sysctl_overcommit_memory is set to OVERCOMMIT_GUESS by default, and OVERCOMMIT_GUESS is defined as 0 in the linux/mman.h file):
    1. count all currently free pages and save them in the free variable:
      free = global_zone_page_state(NR_FREE_PAGES)
    2. increase free by the number of file-backed (see File-backed and Swap, Memory-mapped file) memory pages, i.e. pages which can be freed by flushing them back to the disk:
      free += global_node_page_state(NR_FILE_PAGES)
    3. decrease free by the number of shared memory (see Shared Memory, Shared memory) pages:
      free -= global_node_page_state(NR_SHMEM)
    4. increase free by the number of free swap pages:
      free += get_nr_swap_pages()
    5. increase it by the SReclaimable (see man 5 proc, SReclaimable) number:
      free += global_node_page_state(NR_SLAB_RECLAIMABLE)
    6. increase it by the KReclaimable (see man 5 proc, KReclaimable) number:
      free += global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE)
    7. decrease it by the minimal reserved pages (see calculate_totalreserve_pages() and An enhancement of OVERCOMMIT_GUESS):
      free -= totalreserve_pages
    8. decrease it by the memory reserved for the root user (see init_admin_reserve()):
      free -= sysctl_admin_reserve_kbytes
    9. and the last step is to compare the memory available after all the calculations above with the amount requested by the process – if the free variable contains enough pages, the function returns 0:
      if (free > pages) return 0;

See also How Linux handles virtual memory overcommit, overcommit-accounting, and Checking Available Memory.
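To make the steps above easier to digest, here is a rough user-space model of the OVERCOMMIT_GUESS branch – my own simplified illustration, not the real kernel code from mm/util.c:

#include <stdio.h>

/* all values are in pages; the parameter names roughly mirror the kernel counters */
static long heuristic_enough_memory(long pages_requested,
                                    long free_pages, long file_pages, long shmem_pages,
                                    long swap_pages, long slab_reclaimable,
                                    long misc_reclaimable, long totalreserve_pages,
                                    long admin_reserve_pages)
{
    long free = free_pages;

    free += file_pages;          /* file-backed pages can be written back and dropped */
    free -= shmem_pages;         /* shared memory pages can not */
    free += swap_pages;          /* free swap space */
    free += slab_reclaimable;    /* SReclaimable */
    free += misc_reclaimable;    /* KReclaimable */
    free -= totalreserve_pages;  /* kernel's reserved pages */
    free -= admin_reserve_pages; /* reserve for the root user */

    return (free > pages_requested) ? 0 : -1; /* 0 - allow, -1 - refuse (ENOMEM) */
}

int main(void) {
    /* purely hypothetical numbers, just to show the check itself */
    long result = heuristic_enough_memory(1000, 500, 800, 100, 200, 50, 10, 300, 100);
    printf("allocation %s\n", result == 0 ? "allowed" : "refused");
    return 0;
}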

Checking vm.overcommit_memory

Well – that was all theory; now it’s time to take a real look at how all this works when vm.overcommit_memory changes, and how the memory is allocated in general.

Let’s use the following simple code:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>

int main() {
    
    printf("main() started\n");
    
    long int mem_size = 4096;
    
    void *mem_stack = malloc(mem_size);
    
    printf("Parent pid: %lu\n", getpid());

    sleep(1200);
}

In the void *mem_stack = malloc(mem_size) line we request an allocation of 4096 bytes – the size set in the mem_size variable.
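As a side note (my own addition, not in the original test): with vm.overcommit_memory set to 0 a huge request may simply be refused, so a slightly more defensive variant of that line would check the returned pointer:

...
    void *mem_stack = malloc(mem_size);
    if (mem_stack == NULL) {
        /* the kernel refused the allocation, e.g. because of the overcommit limits */
        perror("malloc");
        return 1;
    }
...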

Check the current overcommit_memory value:

[simterm]

root@bttrm-dev-console:/home/admin# cat /proc/sys/vm/overcommit_memory 
0

[/simterm]

Run our program:

[simterm]

root@bttrm-dev-console:/home/admin# ./test_vm 
main() started
Parent pid: 14353

[/simterm]

Check the memory used by the process now:

[simterm]

root@bttrm-dev-console:/home/admin# ps aux | grep test_vm | grep -v grep
root     14353  0.0  0.0   4160   676 pts/4    S+   17:29   0:00 ./test_vm

[/simterm]

4160 KB of VSZ (Virtual Size) – a small value, as expected for the tiny malloc(mem_size) request.

Now – change the mem_size variable value from 4096 bytes to 1099511627776, i.e. 1 terabyte:

...
long int mem_size = 1099511627776;
...

Build it, run – and:

[simterm]

root@bttrm-dev-console:/home/admin# ./test_vm

main() started
Segmentation fault

[/simterm]

Great!

Check with the strace:

[simterm]

root@bttrm-dev-console:/home/admin# strace -e trace=mmap ./test_vm 
mmap(NULL, 47657, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fa268f24000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa268f22000
mmap(NULL, 3795296, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fa26896e000
mmap(0x7fa268d03000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x195000) = 0x7fa268d03000
mmap(0x7fa268d09000, 14688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fa268d09000
main() started
mmap(NULL, 1099511631872, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 1099511762944, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fa26096e000
mmap(NULL, 1099511631872, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xfffffffff8} ---
+++ killed by SIGSEGV +++
Segmentation fault

[/simterm]

And here is our lovely -1 ENOMEM (Cannot allocate memory), which we can see in the Redis logs in its “Can’t save in background: fork: Cannot allocate memory” error message.

So, on the kernel side the mmap() path ends up calling the security_vm_enough_memory_mm() function:

...
    if (security_vm_enough_memory_mm(mm, grow))
        return -ENOMEM;
...

It is defined in the security.h header file, and this is where __vm_enough_memory() is called:

...
static inline int security_vm_enough_memory_mm(struct mm_struct *mm, long pages)
{
    return __vm_enough_memory(mm, pages, cap_vm_enough_memory(mm, pages));
}
...

Now – disable the overcommit limit checking:

[simterm]

root@bttrm-dev-console:/home/admin# echo 1 > /proc/sys/vm/overcommit_memory

[/simterm]

Run the program again:

[simterm]

root@bttrm-dev-console:/home/admin# ./test_vm 

main() started
Parent pid: 11337
Child pid: 11338
Child is running with PID 11338

[/simterm]

Check the VSZ used now:

[simterm]

admin@bttrm-dev-console:~/redis-logs$ ps aux | grep -v grep | grep 11337
root     11337  0.0  0.0 1073745988 656 pts/4  S+   16:34   0:00 ./test_vm

[/simterm]

VSZ == 1073745988 – just awesome: we have just allocated 1 TERABYTE of virtual memory on an AWS t2.medium EC2 instance with only 4 gigabytes of “real” memory!

And now – guess what will happen once the process starts actively using this allocated (so far only virtual) memory?

Add a memset() call which will write zeros into our mem_stack, filling the whole mem_size, i.e. 1 TB:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>

int main() {
    
    printf("main() started\n");
    
    long int mem_size = 1099511627776;
    
    void *mem_stack = malloc(mem_size);
    
    printf("Parent pid: %lu\n", getpid());
                
    memset(mem_stack, 0, mem_size);

    sleep(120);
}

Run it (do NOT do it on a Production environment!):

[simterm]

root@bttrm-dev-console:/home/admin# ./test_vm 
main() started
Parent pid: 15219
Killed

[/simterm]

And check the operating system’s log:

Aug 27 17:46:43 localhost kernel: [7974462.384723] Out of memory: Kill process 15219 (test_vm) score 818 or sacrifice child
Aug 27 17:46:43 localhost kernel: [7974462.393395] Killed process 15219 (test_vm) total-vm:1073745988kB, anon-rss:3411676kB, file-rss:16kB, shmem-rss:0kB
Aug 27 17:46:43 localhost kernel: [7974462.600138] oom_reaper: reaped process 15219 (test_vm), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

The OOM Killer came and killed everything. This time it arrived in time – next time it might not.

Pay attention that our process managed to consume a whole 3.4 GB of real memory – anon-rss:3411676kB – out of the one terabyte of virtual memory it was given – total-vm:1073745988kB.

Conclusions

In our current case, when Redis is used for caching only and has no RDB or AOF persistence enabled – there is no need to change overcommit_memory, and it’s best to leave it at its default value of 0.

In the case when you really want to set the boundaries yourself – it’s better to use overcommit_memory == 2 and limit the overcommit by setting the overcommit_ratio or overcommit_kbytes parameters.
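For example (just a sketch – the ratio value here is purely illustrative), switching to the strict mode could look like this:

[simterm]

root@bttrm-dev-console:/home/admin# echo 2 > /proc/sys/vm/overcommit_memory
root@bttrm-dev-console:/home/admin# echo 80 > /proc/sys/vm/overcommit_ratio

[/simterm]

With such values the commit limit becomes swap plus 80% of the physical RAM; to survive a reboot the same parameters would go to the sysctl configuration.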

The story

Actually, the whole story with the vm.overcommit_memory started for me about a year ago.

I wasn’t too familiar with Redis at that time, and I had just come to a new project where Redis was already in use.

One fine day our Production server (and at the time I joined this project the whole backend was running on a single AWS EC2 instance) got a bit tired and went down for some rest.

After a magic dropkick via AWS Console – the server went back online and I started looking for the root cause.

In its logs, I found records about the OOM Killer that came for Redis or RabbitMQ – I’m not sure now which exactly. But anyway, during the investigation I found that vm.overcommit_memory was set to 1, i.e. the overcommit checks were disabled entirely.

So anyway – this story firstly gave me a reason to build a more reliable and fault-tolerant architecture for our backend’s infrastructure, and secondly – taught me not to blindly trust any documentation.

Useful links