I am currently configuring Redis as a caching service for our application, and along the way I ran into a question: do I need to set vm.overcommit_memory to 1, i.e. tell the kernel to always overcommit and skip its checks, or not?
The question is an old one for me (see The story), but only now have I found the time to get to the root of it, put everything together and write this post.
It was originally posted in Russian, and this is my own translation. There is really a lot of text, so I hope I didn’t confuse anything while translating.
So, the problem itself is that the Redis documentation and almost every HowTo/guide about Redis performance carelessly tell us to switch off the Linux overcommit checks by setting vm.overcommit_memory to 1 (always overcommit), especially as a fix for the “Can’t save in background: fork: Cannot allocate memory” error.
In this post we will try to figure out what exactly overcommit_memory is, where and how it is used, and whether it really needs to be changed in my current case, i.e. when Redis is used for caching only.
Why is overcommitting bad?
This part of the original Russian post was itself a translation, so I’ll just copy-paste a part of the original English text here. Read the full story: What is Overcommit? And why is it bad?
Overcommit refers to the practice of giving out virtual memory with no guarantee that physical storage for it exists. To make an analogy, it’s like using a credit card and not keeping track of your purchases. A system performing overcommit just keeps giving out virtual memory until the debt collector comes calling — that is, until some program touches a previously-untouched page, and the kernel fails to find any physical memory to instantiate it — and then stuff starts crashing down.
What happens when “stuff starts crashing down”? It can vary, but the Linux approach was to design an elaborate heuristic “OOM killer” in the kernel that judges the behavior of each process and decides who’s “guilty” of making the machine run out of memory, then kills the guilty parties. In practice this works fairly well from a standpoint of avoiding killing critical system processes and killing the process that’s “hogging” memory, but the problem is that no process is really “guilty” of using more memory than was available, because everyone was (incorrectly) told that the memory was available.
Suppose you don’t want this kind of uncertainty/danger when it comes to memory allocation? The naive solution would be to immediately and statically allocate physical memory corresponding to all virtual memory. To extend the credit card analogy, this would be like using cash for all your purchases, or like using a debit card. You get the safety from overspending, but you also lose a lot of fluidity. Thankfully, there’s a better way to manage memory.
The approach taken in reality when you want to avoid committing too much memory is to account for all the memory that’s allocated. In our credit card analogy, this corresponds to using a credit card, but keeping track of all the purchases on it, and never purchasing more than you have funds to pay off. This turns out to be the Right Thing when it comes to managing virtual memory, and in fact it’s what Linux does when you set the vm.overcommit_memory sysctl parameter to the value 2. In this mode, all virtual memory that could potentially be modified (i.e. has read-write permissions) or lacks backing (i.e. an original copy on disk or other device that it could be restored from if it needs to be discarded) is accounted for as “commit charge”, the amount of memory the kernel has committed/promised to applications. When a new virtual memory allocation would cause the commit charge to exceed a configurable limit (by default, the size of swap plus half the size of physical RAM), the allocation fails.
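To make the "swap plus half of RAM" rule a bit more tangible, here is a minimal sketch of the arithmetic. The helper name commit_limit_kb() is made up for this example, and it deliberately ignores huge pages and the overcommit_kbytes alternative, so treat it as an approximation of the value the kernel reports as CommitLimit in /proc/meminfo, not as the kernel's own code.

```c
#include <stdio.h>

/*
 * Hypothetical helper: approximate commit limit (in KiB) used when
 * vm.overcommit_memory == 2.
 * Assumption: no huge pages are reserved and overcommit_kbytes is not set.
 */
unsigned long long commit_limit_kb(unsigned long long ram_kb,
                                   unsigned long long swap_kb,
                                   unsigned int overcommit_ratio)
{
    /* all of swap plus overcommit_ratio percent of physical RAM */
    return swap_kb + ram_kb * overcommit_ratio / 100;
}

/* Print the limit for an illustrative host: 4 GiB RAM, 2 GiB swap, ratio 50. */
void print_commit_limit_example(void)
{
    printf("CommitLimit is about %llu kB\n",
           commit_limit_kb(4194304ULL, 2097152ULL, 50));
}
```

With the default ratio of 50, a host with 4 GiB of RAM and no swap would accept only about 2 GiB of committed memory in this mode.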
Redis persistence
Redis uses two mechanisms for data persistence: RDB snapshotting (point-in-time snapshots), which copies the data from memory to disk, and AOF, which continuously writes a log of every write operation performed by the server. See more in the documentation: Redis Persistence.
overcommit_memory comes into play when Redis dumps data from memory to disk, specifically during the BGSAVE and BGREWRITEAOF commands.
Below we will concentrate on the BGSAVE command, during which Redis creates a child process that copies the data to disk.
Redis save, SAVE and BGSAVE
Redis itself may be a bit confusing here: in its configuration file, the save option is responsible for the BGSAVE operation.
However, Redis also has the SAVE command, which works differently:
SAVE is a synchronous command: the main process writes the dump itself and blocks all clients until the copy is finished
BGSAVE, in its turn, is asynchronous: it runs in a child process in parallel with the main server process, which keeps serving the connected clients, so it is the preferable way to create a backup
strace writes its output to the redis-trace.log file, which we will check to find the system calls used by redis-server during the SAVE and BGSAVE operations:
Here is the clone() call (why clone() instead of fork() we will discuss a bit later, in fork() vs fork() vs clone()). This clone() creates a new child process, which in its turn creates the data copy.
No clone() this time: the dump was performed by the main Redis process and saved to the dump.rdb file. Check the rename(“temp-1652.rdb”, “dump.rdb”) line in the strace output (we will see shortly where the temp-1652.rdb name comes from).
And here is our clone() again, which spawned another child process with PID 1879:
...
clone([...]) = 1879
...
Redis rdbSave() and rdbSaveBackground() functions
The dump itself is created by a single Redis function, rdbSave():
...
/* Save the DB on disk. Return C_ERR on error, C_OK on success. */
int rdbSave(char *filename, rdbSaveInfo *rsi) {
...
snprintf(tmpfile,256,"temp-%d.rdb", (int) getpid());
fp = fopen(tmpfile,"w");
...
It is called when you execute the redis-cli -p 7777 SAVE command.
And the snprintf() call above is where the temp-1652.rdb file name from the strace output comes from: 1652 is the PID of the Redis server’s main process.
For its part, the BGSAVE command calls another function, rdbSaveBackground():
...
int rdbSaveBackground(char *filename, rdbSaveInfo *rsi) {
...
start = ustime();
if ((childpid = fork()) == 0) {
int retval;
/* Child */
closeListeningSockets(0);
redisSetProcTitle("redis-rdb-bgsave");
retval = rdbSave(filename,rsi);
...
Which in its turn creates a new child process:
...
if ((childpid = fork()) == 0)
...
And this process in its turn will execute the rdbSave():
...
retval = rdbSave(filename,rsi);
...
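The two-function pattern above can be tried in isolation. What follows is only a sketch of the idea and not Redis code: save_to_file() and save_in_background() are invented stand-ins for rdbSave() and rdbSaveBackground(), the "dataset" is just a string, and the parent simply waits for the child, whereas a real server would keep serving clients.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for rdbSave(): write to temp-<pid>.rdb, then rename. */
static int save_to_file(const char *data, const char *filename)
{
    char tmpfile[256];
    snprintf(tmpfile, sizeof(tmpfile), "temp-%d.rdb", (int) getpid());

    FILE *fp = fopen(tmpfile, "w");
    if (fp == NULL)
        return -1;
    if (fputs(data, fp) == EOF) {
        fclose(fp);
        unlink(tmpfile);
        return -1;
    }
    if (fclose(fp) == EOF || rename(tmpfile, filename) == -1) {
        unlink(tmpfile);
        return -1;
    }
    return 0;
}

/* Hypothetical stand-in for rdbSaveBackground(): returns 0 on success. */
int save_in_background(const char *data, const char *filename)
{
    pid_t childpid = fork();
    if (childpid == -1)
        return -1;      /* this is where "fork: Cannot allocate memory" shows up */
    if (childpid == 0) {
        /* Child: does the actual dump and exits without touching parent state */
        _exit(save_to_file(data, filename) == 0 ? 0 : 1);
    }
    /* Parent: this sketch just waits for the child to finish */
    int status;
    if (waitpid(childpid, &status, 0) == -1)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Calling save_in_background("hello", "dump.rdb") from a main() and tracing it with strace should show the same clone() and rename() pair as in the Redis trace, with the temporary file named after the child's PID.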
fork() vs fork() vs clone()
Now, let’s go back to the question: why do we see the clone() syscall in the strace output instead of the fork() that is called by the rdbSaveBackground() function?
Well, that’s just because fork() != fork():
there is the Linux kernel fork() syscall
and there is also the glibc fork() function, which is a wrapper around the clone() syscall
Now, Read the Following Manual 🙂 – open the man 2 fork:
[setevoy@setevoy-arch-work ~/Temp/redis] [unstable*] $ man 2 fork | grep -A 5 NOTES
NOTES
Under Linux, fork() is implemented using copy-on-write pages, so the only penalty that it incurs is the time and memory required to duplicate the parent's page tables, and to create a unique task structure for the child.
C library/kernel differences
Since version 2.3.3, rather than invoking the kernel's fork() system call, the glibc fork() wrapper that is provided as part of the NPTL threading implementation invokes clone(2) with flags that provide the same effect as the
traditional system call. (A call to fork() is equivalent to a call to clone(2) specifying flags as just SIGCHLD.) The glibc wrapper invokes any fork handlers that have been established using pthread_atfork(3).
rather than invoking the kernel’s fork() system call, the glibc fork() wrapper […] invokes clone(2)
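The equivalence from the man page can be checked by hand. Below is a small sketch that skips the glibc wrapper and issues the raw clone() syscall with flags set to just SIGCHLD. The function name fork_via_clone() is made up for the example, and it assumes an architecture where passing zeros for all the remaining clone() arguments is fine (x86_64, for instance).

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a child with a raw clone() syscall whose only
 * flag is SIGCHLD, which behaves like the traditional fork().
 * Returns the child's exit code, or -1 on error.
 */
int fork_via_clone(int child_exit_code)
{
    /* Every argument after the flags is zero, so their order does not matter here */
    long pid = syscall(SYS_clone, (unsigned long) SIGCHLD, 0UL, 0UL, 0UL, 0UL);
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(child_exit_code);     /* child: leave immediately */

    /* parent: collect the child's exit status */
    int status;
    if (waitpid((pid_t) pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Unlike the glibc fork(), this raw call does not run any pthread_atfork(3) handlers, so it is only good for experiments like this one.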
Consequently, when rdbSaveBackground() calls fork(), it does not invoke the fork(2) system call directly but uses the fork() library function (see fork(3p)) from glibc, which in its turn is an alias for __libc_fork().
To check that this is so, and that we are really using the glibc fork() and not the system call, let’s write a small C program based on the official GNU documentation:
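#include <unistd.h>
#include <sys/wait.h>
#include <stdio.h>

int main(void) {
    pid_t pid;

    /* The same call that rdbSaveBackground() makes */
    pid = fork();
    if (pid == 0) {
        /* Child: stay alive long enough to be inspected */
        printf("Child created\n");
        sleep(100);
    }
    return 0;
}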
Here, in pid = fork(), we call fork() in the same way as the rdbSaveBackground() function does.
Then let’s use ltrace to trace library function calls (unlike strace, which traces system calls):
ltrace -C -f ./test_fork_lib
[pid 5530] fork( <unfinished ...>
[pid 5531] <... fork resumed> ) = 0
[pid 5530] <... fork resumed> ) = 5531
[pid 5531] puts("Child created" <no return ...>
[pid 5530] +++ exited (status 0) +++
Child created
[pid 5531] <... puts resumed> ) = 14
[pid 5531] sleep(100) = 0
And by using the lsof tool – find all the files opened by our process:
Redis background saving schema relies on the copy-on-write semantic of fork in modern operating systems: Redis forks (creates a child process) that is an exact copy of the parent. The child process dumps the DB on disk and finally exits. In theory the child should use as much memory as the parent being a copy, but actually thanks to the copy-on-write semantic implemented by most modern operating systems the parent and child process will share the common memory pages. A page will be duplicated only when it changes in the child or in the parent. Since in theory all the pages may change while the child process is saving, Linux can’t tell in advance how much memory the child will take, so if the overcommit_memory setting is set to zero fork will fail unless there is as much free RAM as required to really duplicate all the parent memory pages, with the result that if you have a Redis dataset of 3 GB and just 2 GB of free memory it will fail.
Setting overcommit_memory to 1 tells Linux to relax and perform the fork in a more optimistic allocation fashion, and this is indeed what you want for Redis.
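What the FAQ calls copy-on-write can be observed from user space, at least indirectly. The sketch below (the function name parent_keeps_own_copy() is invented) fills a buffer, forks, lets the child overwrite its view of the buffer and then checks that the parent still sees the original bytes. It only demonstrates the semantics; how many pages really get duplicated is the kernel's business and is not measured here.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo: returns 1 if the parent's buffer is unchanged after the
 * child has overwritten its own copy, 0 if it changed, -1 on error.
 */
int parent_keeps_own_copy(void)
{
    size_t size = 1 << 20;              /* 1 MiB is enough for the demo */
    char *buf = malloc(size);
    if (buf == NULL)
        return -1;
    memset(buf, 'A', size);

    pid_t pid = fork();
    if (pid == -1) {
        free(buf);
        return -1;
    }
    if (pid == 0) {
        /* Child: writing forces the kernel to give it private copies of these pages */
        memset(buf, 'B', size);
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) == -1) {
        free(buf);
        return -1;
    }
    /* Parent: its pages were never touched by the child's writes */
    int unchanged = (buf[0] == 'A' && buf[size - 1] == 'A');
    free(buf);
    return unchanged;
}
```

In the Redis case the roles are the other way round: the child mostly reads while the parent keeps modifying the dataset, but the page duplication works the same in both directions.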
The overcommit_memory values
vm.overcommit_memory can take one of the following three values:
0: the kernel may allocate more virtual memory than the server actually has, but relies on the “heuristic algorithm” (heuristic overcommit handling) to decide whether to approve or decline a memory allocation for a process
1: the kernel always overcommits and never refuses an allocation, which can lead to more out-of-memory situations but may suit some applications, e.g. ones that allocate large sparse arrays and touch only a small part of them
2: the kernel does not overcommit: the total committed memory is limited to the size of swap plus a part of the physical RAM, set by the overcommit_ratio or overcommit_kbytes parameters
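Before changing anything it is handy to see which mode a host is actually running in. Here is a minimal sketch that reads /proc/sys/vm/overcommit_memory. Both function names are made up for the example, while the strings returned are the constant names the kernel uses for the three modes.

```c
#include <stdio.h>

/* Hypothetical helper: returns 0, 1 or 2, or -1 if the value cannot be read. */
int read_overcommit_mode(void)
{
    FILE *fp = fopen("/proc/sys/vm/overcommit_memory", "r");
    if (fp == NULL)
        return -1;

    int mode = -1;
    if (fscanf(fp, "%d", &mode) != 1)
        mode = -1;
    fclose(fp);
    return mode;
}

/* Hypothetical helper: map the numeric mode to the kernel's constant name. */
const char *overcommit_mode_name(int mode)
{
    switch (mode) {
    case 0:  return "OVERCOMMIT_GUESS";    /* heuristic overcommit */
    case 1:  return "OVERCOMMIT_ALWAYS";   /* never refuse an allocation */
    case 2:  return "OVERCOMMIT_NEVER";    /* strict accounting against the limit */
    default: return "unknown";
    }
}
```

The same value is what sysctl vm.overcommit_memory prints.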
The famous “heuristic algorithm” (Heuristic Overcommit handling)
In most documentation, guides and HowTos this algorithm is only mentioned, and it was not easy to find a detailed description of how it works.
As usual: “Just read the source!”
The overcommit check is performed by the helper function __vm_enough_memory() from the memory management subsystem, which is defined in the kernel’s mm/util.c file.
This function accepts the number of pages requested by a process and then:
if overcommit_memory == 1 (if (sysctl_overcommit_memory == OVERCOMMIT_ALWAYS)):
return 0 and allow overcommit
if overcommit_memory == 0 (if (sysctl_overcommit_memory == OVERCOMMIT_GUESS); sysctl_overcommit_memory is set to OVERCOMMIT_GUESS by default, and OVERCOMMIT_GUESS is defined as 0 in the linux/mman.h file):
count all currently free pages and save them to the free variable: free = global_zone_page_state(NR_FREE_PAGES)
increase free by the number of file-backed memory pages (see File-backed and Swap, Memory-mapped file), i.e. page cache pages that can be reclaimed because their content is backed by a file on disk: free += global_node_page_state(NR_FILE_PAGES)
decrease free by the number of shared memory pages (see Shared Memory, Shared memory): free -= global_node_page_state(NR_SHMEM)
increase free by the number of free swap pages: free += get_nr_swap_pages()
increase it by SReclaimable (see SReclaimable in man 5 proc): free += global_node_page_state(NR_SLAB_RECLAIMABLE)
increase it by KReclaimable (see KReclaimable in man 5 proc): free += global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE)
decrease it by the memory reserved for the root user (see init_admin_reserve(); in the source this applies only to processes without root privileges, and the value is converted from kilobytes to pages): free -= sysctl_admin_reserve_kbytes
and the last step is to compare the available memory with the amount requested by the process: if the free variable, after all the calculations above, holds more pages than requested, the function returns 0: if (free > pages) return 0;
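The steps above can be replayed in user space with plain numbers. This is only a sketch of the arithmetic as described in this post, not the kernel function: struct mem_counters and heuristic_allows() are invented names, all values are in pages, and other details of the real __vm_enough_memory() are left out.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the counters used by the heuristic, all in pages. */
struct mem_counters {
    long free_pages;               /* NR_FREE_PAGES */
    long file_pages;               /* NR_FILE_PAGES */
    long shmem_pages;              /* NR_SHMEM */
    long swap_pages;               /* get_nr_swap_pages() */
    long slab_reclaimable;         /* NR_SLAB_RECLAIMABLE */
    long kernel_misc_reclaimable;  /* NR_KERNEL_MISC_RECLAIMABLE */
    long admin_reserve_pages;      /* admin reserve, already converted to pages */
};

/* Returns true if a request for `pages` pages would be approved. */
bool heuristic_allows(const struct mem_counters *c, long pages)
{
    long free = c->free_pages;
    free += c->file_pages;               /* page cache can be reclaimed */
    free -= c->shmem_pages;              /* shared memory cannot simply be dropped */
    free += c->swap_pages;
    free += c->slab_reclaimable;
    free += c->kernel_misc_reclaimable;
    free -= c->admin_reserve_pages;      /* keep some memory for root */

    return free > pages;                 /* same comparison as: if (free > pages) return 0; */
}
```

For example, with 1000 free pages, 500 file pages, 100 shmem pages, 200 swap pages, 50 + 50 reclaimable pages and a reserve of 100 pages, the estimate is 1600 pages, so a request for 1599 pages passes and a request for 1600 does not.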
Well, that’s all theory. Now it’s time to take a real look at how all this works when vm.overcommit_memory changes, and at how memory is allocated in general.
Let’s use the following simple code:
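#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>

int main(void) {
    printf("main() started\n");

    /* Number of bytes to request from the allocator */
    long int mem_size = 4096;
    void *mem_stack = malloc(mem_size);

    /* getpid() returns pid_t, so print it as an int */
    printf("Parent pid: %d\n", (int) getpid());

    /* Keep the process alive so it can be inspected */
    sleep(1200);
    return 0;
}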
In the void *mem_stack = malloc(mem_size) line we request an allocation of 4096 bytes, the size set in the mem_size variable.
And here is our lovely -1 ENOMEM (Cannot allocate memory), which shows up in the Redis logs as the “Can’t save in background: fork: Cannot allocate memory” error message.
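From inside a program the same refusal is visible as a NULL result from malloc() with errno set. Below is a minimal sketch of such a check. try_alloc() is a made-up helper, and whether a given size is refused depends on the host and on the current overcommit mode.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: try to allocate `bytes` bytes and free them right away.
 * Returns 0 if the allocation was granted, otherwise the errno value
 * (normally ENOMEM) reported by malloc().
 */
int try_alloc(size_t bytes)
{
    errno = 0;
    void *mem = malloc(bytes);
    if (mem == NULL)
        return errno != 0 ? errno : ENOMEM;

    /* The memory was only promised, nothing has been touched yet */
    free(mem);
    return 0;
}
```

A non-zero result can be turned into the familiar "Cannot allocate memory" text with strerror().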
VSZ == 1073745988 (in KiB): just awesome, we have just allocated 1 TERABYTE of virtual memory on an AWS t2.medium EC2 instance with only 4 gigabytes of “real” memory!
And now guess what will happen once the process starts actively using this allocated (so far only virtual) memory?
Add a memset() call (a libc function, not a syscall) which fills the whole mem_stack with zeros over its full mem_size, i.e. 1 TB:
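The exact listing is not reproduced here, so what follows is only a sketch of what such a change might look like. The allocation and the memset() are wrapped in a made-up function, allocate_and_touch(), which takes the size as a parameter so that it can be tried safely with small values first; the variable names follow the program above.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: allocate mem_size bytes and write to all of them.
 * Returns 0 on success, -1 if malloc() refused the request.
 * Note: build without optimisation; an optimising compiler may turn
 * malloc() + memset(0) into calloc(), which does not touch the pages.
 */
int allocate_and_touch(size_t mem_size)
{
    if (mem_size == 0)
        return 0;

    void *mem_stack = malloc(mem_size);
    if (mem_stack == NULL)
        return -1;

    /* Writing forces the kernel to back every page with real memory */
    memset(mem_stack, 0, mem_size);

    /* Read one byte back so the buffer is actually used */
    int ok = (((volatile unsigned char *) mem_stack)[mem_size - 1] == 0);

    free(mem_stack);
    return ok ? 0 : -1;
}
```

Calling it with 1024UL * 1024 * 1024 * 1024 bytes on a small host is the dangerous experiment.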
Run it (do NOT do it on a Production environment!):
root@bttrm-dev-console:/home/admin# ./test_vm
main() started
Parent pid: 15219
Killed
And check the operating system’s log:
Aug 27 17:46:43 localhost kernel: [7974462.384723] Out of memory: Kill process 15219 (test_vm) score 818 or sacrifice child
Aug 27 17:46:43 localhost kernel: [7974462.393395] Killed process 15219 (test_vm) total-vm:1073745988kB, anon-rss:3411676kB, file-rss:16kB, shmem-rss:0kB
Aug 27 17:46:43 localhost kernel: [7974462.600138] oom_reaper: reaped process 15219 (test_vm), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
The OOM Killer came and killed everybody. This time. Next time it may not arrive in time.
Note that our process managed to consume a whole 3.4 GB of real memory (anon-rss:3411676kB) out of the one terabyte of virtual memory it was given (total-vm:1073745988kB).
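The gap between "given" and "really used" memory can also be watched from inside a process. The sketch below reads the VmSize and VmRSS fields from /proc/self/status, which roughly correspond to the virtual and resident sizes discussed here. read_status_kb() is a made-up helper name for this example.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return the value (in kB) of a field such as "VmSize"
 * or "VmRSS" from /proc/self/status, or -1 if it cannot be found.
 */
long read_status_kb(const char *field)
{
    FILE *fp = fopen("/proc/self/status", "r");
    if (fp == NULL)
        return -1;

    char line[256];
    long value = -1;
    size_t len = strlen(field);

    while (fgets(line, sizeof(line), fp) != NULL) {
        /* Lines look like "VmRSS:      1234 kB" */
        if (strncmp(line, field, len) == 0 && line[len] == ':') {
            if (sscanf(line + len + 1, "%ld", &value) != 1)
                value = -1;
            break;
        }
    }
    fclose(fp);
    return value;
}
```

Printing both values before and after a large malloc(), and then again after a memset() over the buffer, should show VmSize jumping at the allocation while VmRSS grows only once the pages are written.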
Conclusions
In our current case, when Redis is used for caching only and has no RDB or AOF persistence enabled, there is no need to change overcommit_memory, and it is best to leave it at its default value, 0.
If you really want to set the boundaries yourself, it is best to use overcommit_memory == 2 and set the commit limit with the overcommit_ratio or overcommit_kbytes parameters.
The story
Actually, the whole story with vm.overcommit_memory started for me about a year ago.
I wasn’t very familiar with Redis at that time, and I had just joined a new project where Redis was already in use.
One fine day our Production server (when I joined the project, the whole backend was running on a single AWS EC2 instance) got a bit tired and went down for some rest.
After a magic dropkick via the AWS Console the server came back online, and I started looking for the root cause.
In its logs I found records about the OOM Killer that came for Redis or RabbitMQ, I am not sure now which exactly. But anyway, during the investigation I found that vm.overcommit_memory was set to 1, i.e. the overcommit checks were disabled entirely.
So anyway, this story first gave me a reason to create a more reliable and fault-tolerant architecture for our backend’s infrastructure and, second, taught me not to blindly trust any documentation.