PostgreSQL: AWS RDS Performance and monitoring

02/10/2025

We are migrating our Backend API from DynamoDB to AWS RDS PostgreSQL, and along the way RDS has crashed several times.

Actually, given that we took a db.t3.small with two vCPUs and two gigabytes of memory to save money, this was to be expected, but I still wanted to understand exactly why everything was crashing.

A few days later I started debugging the issue, and although we haven’t found the root cause yet, I ended up with a good draft of how to troubleshoot performance issues in RDS PostgreSQL.

This post is not a simple how-to, but rather a note to self: where and what to look for next time, and what changes can be made to the monitoring to catch the problem before it becomes critical.

The Issue

So, here’s how it all started.

The Backend API is running on AWS Elastic Kubernetes Service, and at some point, we started getting 503 error alerts in Slack:

Alerts about Swap usage on Production RDS have appeared:

Got errors from Sentry about problems connecting to the database server:

We started checking RDS monitoring and saw that at some point Freeable Memory had dropped to 50 megabytes:

When the server crashed, we restarted it, but the problem immediately reappeared.

So we decided to move to db.m5.large for now – you can see on the graph how the free memory has become 7.25 GB.

Now, let’s see what we can learn from this whole story.

Set “Application Name”

In Performance Insights, you can display statistics on individual Applications:

We have not done this yet, but I think it makes sense to set separate Application names for the monitoring exporters’ connections and for the Backend API services.

There are several ways to do this.

Either pass parameters to connection strings:

"postgresql://user:password@host:port/database?application_name=MyApp"

Or execute directly from the code when initializing connections:

from sqlalchemy import text  # text() is required for raw SQL in SQLAlchemy 2.x

with engine.connect() as conn:
    conn.execute(text("SET application_name TO 'MyApp'"))

The second option looks more attractive, because the connection string for the Backend API is passed as an environment variable from the AWS Secret Store, and building a separate URL for each API service just for the sake of a single application_name parameter looks a bit lame.

Therefore, it is better to set this parameter in each backend app when creating a connection.
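
Once it is set, the name can be verified (and sessions filtered by it) in pg_stat_activity:

-- sessions grouped by the application_name they reported
SELECT pid, application_name, state, query
FROM pg_stat_activity
WHERE application_name != ''
ORDER BY application_name;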

Useful PostgreSQL Extensions

During the investigation, I added a few PostgreSQL Extensions, which are very useful in such cases for monitoring and investigation.

Enabling pg_stat_statements

Unfortunately, it was also not enabled at the time of the problem, but in general, it is a must-have.

Documentation: pg_stat_statements – track statistics of SQL planning and execution, and SQL statistics for RDS PostgreSQL.

In RDS PostgreSQL versions 11 and higher, the library is enabled by default, so all you need to do is create an EXTENSION, see CREATE EXTENSION.

Check if there is an extension now:

dev_kraken_db=> SELECT * 
FROM pg_available_extensions 
WHERE 
    name = 'pg_stat_statements' and 
    installed_version is not null;
 name | default_version | installed_version | comment 
------+-----------------+-------------------+---------
(0 rows)

(0 rows) – empty, so the extension is not installed yet.

Let’s create it:

dev_kraken_db=> CREATE EXTENSION pg_stat_statements;
CREATE EXTENSION

Check it again:

dev_kraken_db=> SELECT *                            
FROM pg_available_extensions 
WHERE 
    name = 'pg_stat_statements' and 
    installed_version is not null;
        name        | default_version | installed_version |                                comment                                 
--------------------+-----------------+-------------------+------------------------------------------------------------------------
 pg_stat_statements | 1.10            | 1.10              | track planning and execution statistics of all SQL statements executed
(1 row)

Let’s try to get some information from pg_stat_statements with a query like this:

SELECT query 
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
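
A slightly more detailed variant (assuming pg_stat_statements 1.8+, where the timing columns are named total_exec_time/mean_exec_time – we have 1.10 here) also shows the number of calls and the average execution time:

-- top 10 queries by total execution time, with call counts and averages
SELECT
    substring(query, 1, 60) AS query_text,
    calls,
    round(total_exec_time::numeric, 2) AS total_exec_ms,
    round(mean_exec_time::numeric, 2) AS mean_exec_ms,
    rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;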

 

Later in this post, there will be more examples of what information we can get from pg_stat_statements.

Enabling pg_stat_activity

In addition to the pg_stat_statements, useful information about current activity can be obtained from pg_stat_activity, which is enabled by default.

Both are views, although in different database schemas:

dev_kraken_db=> \dv *.pg_stat_(statements|activity)
                 List of relations
   Schema   |        Name        | Type |  Owner   
------------+--------------------+------+----------
 pg_catalog | pg_stat_activity   | view | rdsadmin
 public     | pg_stat_statements | view | rdsadmin

Difference between pg_stat_activity and pg_stat_statements in PostgreSQL

Both help to analyze queries, but pg_stat_activity is about current activity, and pg_stat_statements is the “historical” one:

Parameter | pg_stat_activity | pg_stat_statements
What does it show? | Current active sessions and their status. | History of executed SQL queries with statistics.
Real-time data? | Yes, only active processes. | No, this is accumulated statistics for all queries.
What queries are visible? | Only those that are being executed right now. | Queries that were executed earlier (even if they have already finished).
Does it preserve history? | No, the data disappears after the query is completed. | Yes, PostgreSQL collects and aggregates statistics.
What can be found? | Which query is currently running, how long it has been running, what it is waiting for (CPU, I/O, Locks). | Average, minimum, maximum query execution time, number of calls.
Main use | Real-time performance analysis, finding problematic queries right now. | Finding “heavy” queries that create load in the long run.

Enabling pg_buffercache

Another useful extension is pg_buffercache, which can display information about the state of memory in PostgreSQL.

It is included in the same way as pg_stat_statements:

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

Next, we’ll also look at some interesting queries for checking the memory status in PostgreSQL.

Okay. Let’s get back to our problem.

CPU utilization, and DB Load

The first thing we noticed was the CPU load.

In Performance Insights, it looked like this:

And in the Monitoring tab of the instance itself:

DBLoad (Database Load)

See the DBLoad documentation in CloudWatch.

In PostgreSQL, each client session creates a separate process (a backend process).

DBLoad is an AWS RDS PostgreSQL metric that displays the number of active sessions that are executing or waiting for resources – CPU, disk I/O, Locks. Sessions in the idle state are not taken into account, but sessions in the active, idle in transaction, or waiting states are counted.

DBLoad is similar to Load Average on Linux, but only considers PostgreSQL sessions:

  • In Linux, Load Average shows the number of processes on a Linux server that are either using CPU, or waiting for CPU, or I/O
  • In RDS, DBLoad displays the average number of active sessions on the PostgreSQL server that are either running or waiting for resources

That is, ideally, each backend process that executes requests from a connected client should have access to its own vCPU core, so DBLoad should be ~= the number of vCPUs or less.

If DBLoad significantly exceeds the number of available cores, it means that the system is overloaded and processes (sessions) are waiting in the queue for CPU or other resources.

DBLoad includes two more indicators:

  • DBLoadCPU: sessions that are just waiting for free CPU
  • DBLoadNonCPU: sessions that are waiting for disk, database table locks, networking, etc.

You can check the sessions that will be considered active and included in DBLoad as follows:

SELECT pid, usename, state, wait_event, backend_type, query
FROM pg_stat_activity
WHERE state != 'idle';

Normal value for DBLoad

DBLoad should be approximately equal to or lower than the number of available vCPUs.

DBLoad vs CPU Utilization

Why do we see “100%” on the first screenshot and just the number “17.5” on the second?

  • CPU Utilization: the percentage of CPU utilization of the total available power
  • DBLoad: number of active sessions

Considering that there were 2 available vCPUs on the server at the time, and 17 active sessions, we have 100% CPU time utilization.

We should also pay attention to DBLoadRelativeToNumVCPUs, which is DBLoad divided by the number of available vCPUs, i.e. the average load on each CPU core.

DBLoadCPU (Database Load on CPU Wait)

DBLoadCPU displays the number of active sessions waiting for CPUs, i.e. processes that cannot run because all available CPUs are busy.

Ideally, it should be close to zero, meaning that there should be no processes waiting for the CPU.

If DBLoadCPU is close to DBLoad, then RDS does not have time to process all requests – there is not enough CPU time, and they are queued.

We can check this with the same query from pg_stat_activity as above: active sessions with no wait_event (wait_event IS NULL) are the ones running on the CPU or waiting for it – Performance Insights groups them under the “CPU” wait event.
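
For example, a quick way to list the sessions that Performance Insights counts as “CPU”:

-- active sessions that are running on (or waiting for) the CPU
SELECT pid, usename, state, query
FROM pg_stat_activity
WHERE state = 'active'
  AND wait_event IS NULL;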

Normal value for DBLoadCPU

DBLoadCPU should be as low as possible (close to zero).

If DBLoadCPU is almost equal to DBLoad, then:

  • the main load is on the CPU
  • sessions are not blocked due to Table Locks or slow disk (I/O), but simply wait for the CPU

DBLoadNonCPU (Database Load on Non-CPU)

DBLoadNonCPU displays the number of active sessions waiting for non-CPU resources.

These can be:

  • Locks: waiting for access to a table or row
  • I/O Wait: Slow reads or writes due to disk constraints
  • Network Wait: Delays due to network operations (such as replication or data transfer)
  • Other Wait Events: other waits, such as background maintenance processes

We can check such sessions similarly to the previous queries against pg_stat_activity, but add the wait_event_type and wait_event columns:

SELECT pid, usename, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

Here, wait_event_type indicates the type of resource the process is waiting for (CPU, IO, Lock, WAL, Client), and wait_event details what specific operation the process is waiting for.

For example, if wait_event_type is “IO”, then wait_event can contain the values “DataFileRead” (waiting for a read from disk) or “DataFileWrite” (waiting for a write to disk).

Or, if wait_event_type is “Client”, then wait_event can be “ClientRead”, “ClientWrite”, “ClientSocket”.

See RDS for PostgreSQL wait events.
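
To see at a glance which wait types dominate at the moment, the same view can be aggregated by wait event:

-- count non-idle sessions per wait event
SELECT wait_event_type, wait_event, count(*) AS sessions
FROM pg_stat_activity
WHERE state != 'idle'
  AND wait_event IS NOT NULL
GROUP BY wait_event_type, wait_event
ORDER BY sessions DESC;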

Let’s get back to our graphs:

  • CPU Utilization 100%: the system is overloaded
  • DBLoad 17.5: with two vCPUs – there are many active sessions, the processor does not have time to process all requests
  • DBLoadCPU 13.9: many sessions are waiting for available CPU
  • DBLoadNonCPU 3.59: some requests were waiting for disk, locks, or network operations

Operating System CPU utilization

In addition to the DBLoad metrics, which relate specifically to RDS and PostgreSQL, we also have data on the operating system itself – disk, memory, and CPU.

In CPU utilization, we have a graph of CPU usage as a percentage, built from several metrics, each of which reflects a separate CPU mode:

  • os.cpuUtilization.steal.avg (Steal Time): waiting for physical CPU if AWS has allocated it to another instance on that physical server, or if CPU Credits are used up and AWS is limiting your instance
  • os.cpuUtilization.guest.avg (Guest Time): CPU that was “eaten” by the guest operating system – if the server has a Virtual Machine or Docker, but that’s not about RDS
  • os.cpuUtilization.irq.avg (Interrupt Requests, IRQ Time): processing hardware interrupts (network or disk); may be associated with high IOPS on EBS
  • os.cpuUtilization.wait.avg (I/O Wait Time): I/O Wait Time, disk operations, for example, reading files
  • os.cpuUtilization.user.avg (User Time): time for user processes, in this case it can be PostgreSQL query processing
  • os.cpuUtilization.system.avg (System Time): operating system kernel operation (processing of processes from user space, disk operations, memory operations)
  • os.cpuUtilization.nice.avg (Nice Time): time for processes with nice priority – low-priority background tasks

We have similar data in the Linux top output:

Tasks: 611 total, 7 running, 603 sleep, 1 d-sleep, 0 stopped, 0 zombie
%Cpu(s):  5.9 us,  3.6 sy,  0.0 ni, 89.8 id,  0.2 wa,  0.0 hi,  0.5 si,  0.0 st 
...

Here us stands for user, sy for system, ni for nice, id for idle, and so on.

On our graph in RDS Performance Insights, the largest share is wait – and at the same time there are spikes in EBS IO operations:

That is, a query arrived and started actively reading from the disk:

And while the CPU was waiting for the disk operations to complete, the rest of the requests were executed more slowly.

At the same time, we have a low Freeable memory – because the data from the disk was written to memory.

And although the value of 460 IOPS itself does not look too high, it appears that it was at this moment that we “ate” the available memory.

What can the high os.cpuUtilization.wait.avg indicate in this case?

  • Slow EBS: still not our case, because we have EBS with up to 3000 IOPS; although waiting for disk-to-memory reads could cause an increase in I/O Waits on the CPU
  • Locks: possible, but we have the db.Transactions.blocked_transactions.avg metric – and everything was fine there, that is, PostgreSQL was not waiting for locks to be released – we will look at the Transactions shortly
  • low RAM: reading new data from disk pushed pages that were already in memory out to Swap, and then they had to be read back from Swap while yet more pages were being swapped out to make room for new data (a swap storm)

The use of Swap at this time also increased:

We’ll move on to Swap and ReadIOPS soon, but first, let’s take a look at transactions.

Transactions

Another interesting thing is the activity of transactions:


We can see that as soon as the CPU problems started, our xact_commit and xact_rollback (graph on the left) dropped to zero, while the number of active_transactions increased to ~20, and blocked_transactions stayed at 0.

I can no longer take a screenshot, but there was also an “idle in transaction” spike – that is, transactions were started (BEGIN) but never completed (no COMMIT or ROLLBACK).
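
Next time, such sessions can be caught live with a query like this – xact_start gives the age of the open transaction:

-- sessions that opened a transaction and are now idle inside it
SELECT pid, usename, now() - xact_start AS tx_age, state, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY tx_age DESC;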

But how can this be? Even with high CPU Waits, at least some of the transactions should have completed.

  • high Read IOPS: the system could not get data from the disk?
    • no – ReadIOPS increased, but not so critically
    • however, because FreeableMemory was too low, data from shared_buffers could be dumped to the Swap, which caused even more delays for the processes that were waiting for this data
  • too high Write IOPS: the system could not perform WAL operations (Write Ahead Log, we will look at it in detail later), which are required to complete a transaction
    • but we saw that Write IOPS was normal
  • a lot of table locks, and processes were waiting for resources to be freed up?
    • also no, because we saw that blocked_transactions was at zero
  • autovacuum or ANALYZE running, which could have blocked transactions?
    • but again, db.Transactions.blocked_transactions.avg was at zero
  • Swap storm: this looks closest to the truth:
    • reads from disk pushed active data to Swap (FreeableMemory dropped)
    • Swap Usage grew to almost 3 gigabytes
    • PostgreSQL could not get pages from shared_buffers, because they were in Swap (more on the PostgreSQL memory later)
    • because of this, transactions “hung” waiting to read from disk instead of working in RAM

What can we check in such cases?

I/O Waits or Blocks

Check for the wait_event:

SELECT pid, usename, query, wait_event_type, wait_event, state 
FROM pg_stat_activity 
WHERE state = 'active';

If we see “IO” or “Lock” values in wait_event_type, the reason may be here.

WAL – Write Ahead Logs

Let’s speak a bit about WAL in PostgreSQL.

  • with each DML (Data Manipulation Language) operation, such as INSERT, UPDATE or DELETE, the data is first changed in memory (shared_buffers – we will talk about them further), where the “operation context” is created
  • at the same time, this operation is written to the WAL-buffer (wal_buffers – memory buffer)
  • when wal_buffers is full, or when the transaction is completed, the PostgreSQL process wal_writer uses the fsync() system call to write data from the buffer to a wal-file (the pg_wal/ directory) – this is a log of all changes that occurred before COMMIT
  • the client that started the query execution receives the COMMIT message – the operation was successfully completed
    • if synchronous_commit = on, PostgreSQL waits for the fsync() to complete before sending COMMIT
    • if synchronous_commit = off, PostgreSQL does not wait for fsync() and COMMIT is faster, but with the risk of data loss
    • if the transaction cannot be executed, the client will receive the “could not commit transaction” error
  • data from the shared_buffers is written to the files of the database itself (the base/ directory) – this is done by the checkpointer process, which writes modified data in memory (dirty pages) to the disk
    • this happens during a CHECKPOINT, not immediately after COMMIT, but periodically
    • after executing CHECKPOINT – PostgreSQL archives or deletes WAL files
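
As a quick sanity check, you can see where the WAL currently is and whether COMMIT waits for fsync() – a small sketch (pg_current_wal_lsn() works on the primary):

-- current WAL write position and the WAL file it maps to
SELECT pg_current_wal_lsn(), pg_walfile_name(pg_current_wal_lsn());

-- whether COMMIT waits for the WAL fsync()
SHOW synchronous_commit;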

A very cool article on WAL, memory and checkpoints is PostgreSQL: What is a checkpoint?

So, if the EBS disk was overloaded with Write IOPS, then WAL could stop writing, and this could lead to a transaction stall.

But in our case, we can see that db.Transactions.xact_rollback.avg was at zero, and it does not depend on WAL and Write operations on the disk.

PostgreSQL Exporter has several useful metrics that display WAL activity:

  • pg_stat_archiver_archived_count: the total number of successfully archived WAL files (which tells us that WAL is working correctly)
  • pg_stat_archiver_failed_count: number of failed attempts to archive WAL files
  • pg_stat_bgwriter_checkpoint_time: time spent on executing CHECKPOINTS

You can also make the following query:

SELECT * FROM pg_stat_wal;

If wal_buffers_full is high and growing, transactions might be waiting for fsync() to complete. Or it may mean that the value of wal_buffers is too low and needs to be increased to reduce the frequency of forced WAL writes to disk.

PostgreSQL Exporter doesn’t seem to have such a metric out of the box, but we can add our own with custom.yaml:

pg_stat_wal:
  query: "SELECT wal_buffers_full FROM pg_stat_wal;"
  metrics:
    - wal_buffers_full:
        usage: "COUNTER"
        description: "Number of times the WAL buffers were completely full, causing WAL data to be written to disk."

Read IOPS, and Swap

Okay.

Let’s go back to the issue with the ReadIOPS and Swap.

What could have happened here:

  • some queries started actively reading data from the disk
  • the data was written to the shared_buffers, there was not enough space in the memory, and the data that was already in the shared_buffers was moved to Swap
  • queries in PostgreSQL continue to be executed, but now instead of just taking data from memory, PostgreSQL has to go to the Swap, and therefore we have high ReadIOPS and CPU I/O Waits – that is, the CPU waited for the data to be read from the disk

But then there should be a spike in db.IO.blks_read.avg, since we are reading from disk, shouldn’t there?

But no, because db.IO.blks_read are requests from PostgreSQL itself to read data.

When it operates with a Swap, it still thinks that it is working with RAM.

But the ReadIOPS metric comes from the operating system/EBS itself, and it shows all read operations, not just from PostgreSQL processes.

Interestingly, at the time of the problem, db.Cache.blks_hit dropped to zero. What does this mean? Normally it would mean that backend processes (sessions) did not find the data in shared_buffers.

But knowing that all our transactions stopped, and db.IO.blks_read also dropped to zero, it’s more likely that PostgreSQL just stopped accessing the cache at all because everyone was waiting for free CPU.

Okay, Google…

What about the Swap?

SWAP, and Enhanced monitoring

What do we have on the graphs in the Monitoring?

This is where Enhanced monitoring will be useful:

Select Swap (the Manage graphs button), and we see the following picture:

  • Free Memory is going down
  • Free Swap is going down
  • but Swaps in and Swaps out remain unchanged

That is, it looks like the memory is running out – the system is dumping data from RAM to Swap – but at the same time, the Swaps in/out operation itself “did not happen”.

It looks interesting, and it would be worth digging further, but AWS Console is constantly crashing in these graphs:

See OS metrics in Enhanced Monitoring.

However, although we can’t get metrics from Enhanced monitoring directly, they are written to CloudWatch Logs! And from the logs, we can generate any metrics we want with VictoriaLogs or Loki:

The logs show that the Swap In/Out did take place. In the future, it would be useful to create some metrics from these logs.

To find these logs, go to CloudWatch Logs, search the Log Groups for RDSOSMetrics, and then select a log stream by the RDS instance ID:

Operating system process list

A list of processes can also be very useful:

If the free memory starts to drop, go here, look at the Resident memory, find the PID of the process that is consuming memory, and see what kind of query it is:

prod_kraken_db=> SELECT user, pid, query FROM pg_stat_activity WHERE pid = '26421';
       user       |  pid  |  query   
------------------+-------+----------
 prod_kraken_user | 26421 | ROLLBACK

We also have all these processes in the logs I mentioned above – but it’s better to check them at the moment when the problem occurs to find out which one was executed.

Because yes, we can enable the slow query log – but in those logs, we won’t see the PID, and we won’t be able to find out how much memory the query used.

Memory in PostgreSQL

Let’s dig a little deeper into what’s in PostgreSQL memory.

See RDS for PostgreSQL memory and Tuning memory parameters.

Memory in PostgreSQL is divided into two main types: “shared memory” for all processes, and “local memory” – the memory of each backend process (session).

In the shared memory, we have:

  • shared_buffers: the main memory where PostgreSQL keeps a cache of data it reads from disk when processing queries – caching table pages and indexes
    • analog of the JVM Heap Memory (Java Heap)
    • shared_buffers by default is 25% of the total RAM, but can be changed
  • wal_buffers: already seen above – used to temporarily store WAL records to buffer the transaction before writing to the WAL file

From the shared_buffers, changed data (dirty pages) is written to disk by two processes:

  • Background Writer (bgwriter): runs in the background, gradually writing data to disk
  • Checkpointer (checkpoint): forcibly writes all pages during the CHECKPOINT

Process memory has:

  • work_mem: allocated to queries that perform ORDER BY, HASH JOIN, and aggregations
    • each query gets its own copy of work_mem, therefore, with a large number of simultaneous queries, memory can quickly run out
    • if the work_mem process is not enough, PostgreSQL starts writing temporary files to disk (temp_blks_written), which slows down query execution
  • maintenance_work_mem: actually, maintenance operations – vacuuming operations, creating indexes, adding foreign keys
  • temp_buffers: allocated for temporary tables (CREATE TEMP TABLE).

We can get all the data from pg_settings as follows:

SELECT 
    name, 
    setting, 
    unit,
    CASE 
        WHEN unit = '8kB' THEN setting::bigint * 8 
        WHEN unit = 'kB' THEN setting::bigint 
        ELSE NULL 
    END AS total_kb,
    pg_size_pretty(
        CASE 
            WHEN unit = '8kB' THEN setting::bigint * 8 * 1024
            WHEN unit = 'kB' THEN setting::bigint * 1024
            ELSE NULL 
        END
    ) AS total_pretty
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'temp_buffers', 'wal_buffers');

We have 238967 shared_buffers, each of them is 8 KB, for a total of ~1.9 GB.

But this is already on db.m5.large.

Checking shared_buffers

The Cache hit ratio will show how much data was fetched from memory and how much from the disk itself – although we have this in the db.IO.blks_read.avg and db.Cache.blks_hit.avg metrics (or the pg_stat_database_blks_hit and pg_stat_database_blks_read metrics in PostgreSQL Exporter):

SELECT 
    blks_read, blks_hit, 
    ROUND(blks_hit::numeric / NULLIF(blks_hit + blks_read, 0), 4) AS cache_hit_ratio
FROM pg_stat_database
WHERE datname = current_database();

If the cache_hit_ratio < 0.9, it means that the PostgreSQL cache is not efficient, and too much data is read from disk instead of the cache.

To see how many of the allocated shared_buffers are currently used and how many are free – here we need the pg_buffercache extension.

The query:

SELECT
    COUNT(*) AS total_buffers,
    SUM(CASE WHEN isdirty THEN 1 ELSE 0 END) AS dirty_buffers,
    SUM(CASE WHEN relfilenode IS NULL THEN 1 ELSE 0 END) AS free_buffers,
    SUM(CASE WHEN relfilenode IS NOT NULL THEN 1 ELSE 0 END) AS used_buffers,
    ROUND(100.0 * SUM(CASE WHEN relfilenode IS NOT NULL THEN 1 ELSE 0 END) / COUNT(*), 2) AS used_percent,
    ROUND(100.0 * SUM(CASE WHEN relfilenode IS NULL THEN 1 ELSE 0 END) / COUNT(*), 2) AS free_percent
FROM pg_buffercache;

We have 238967 buffers in total, of which only 12280, or 5%, are used.

Another option is to see how many pages are currently in shared_buffers:

prod_kraken_db=> SELECT 
    count(*) AS cached_pages,
    pg_size_pretty(count(*) * 8192) AS cached_size
FROM pg_buffercache;
 cached_pages | cached_size 
--------------+-------------
       117495 | 918 MB

Even though total allocated size for the shared_buffers is:

prod_kraken_db=> SHOW shared_buffers;
 shared_buffers 
----------------
 939960kB

918 megabytes.

But then why did we see in the previous query that “5% used”?

Because in the pg_buffercache results, the used_buffers and used_percent fields count only active (used) pages, i.e. those that either have a file reference (relfilenode) or have been recently used.

Using EXPLAIN ANALYZE

See EXPLAIN.

EXPLAIN (ANALYZE, BUFFERS) will show us how much data is in the buffers now, how much data will be read from the disk:

Here:

  • shared hit: how many pages were read from the cache (shared_buffers)
  • shared read: how many pages were read from disk (loaded into shared_buffers)
  • shared dirtied: how many pages were modified
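
For example, a run against our challenge_progress table (the numbers here are just an illustration and will differ):

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM challenge_progress;

-- in the plan output, look for a line like:
--   Buffers: shared hit=1250 read=380
-- "hit" pages came from shared_buffers, "read" pages were fetched from disk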

The EXPLAIN topic is quite deep and gives us a lot of useful information, so I wrote about it separately – PostgreSQL: using EXPLAIN and setting up “auto_explain” in AWS RDS.

See the contents of shared_buffers

To get the number of buffers used per table, we can use a query like this:

SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
INNER JOIN pg_class c
    ON b.relfilenode = pg_relation_filenode(c.oid)
    AND b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY 2 DESC
LIMIT 10;

Here, 1636 buffers are used for the challenge_progress table, which gives us:

1636*8/1024
12

12 megabytes.

Or you can get it from the pg_relation_size() – see System Administration Functions and pg_relation_size():

prod_kraken_db=> SELECT pg_size_pretty(pg_relation_size('challenge_progress'));
 pg_size_pretty 
----------------
 13 MB

Checking work_mem

If the FreeableMemory drops, then perhaps too much work_mem is being used.

Check how much is allocated to each process:

dev_kraken_db=> SHOW work_mem;
 work_mem 
----------
 4MB

You can check whether processes have enough work_mem by looking at temp_blks_written, because when work_mem runs out, the process starts spilling data to temporary files on disk.
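
For example, the statements that write the most temporary data can be found in pg_stat_statements:

-- top 10 statements by temporary blocks written to disk
SELECT substring(query, 1, 60) AS query_text,
       calls,
       temp_blks_written
FROM pg_stat_statements
WHERE temp_blks_written > 0
ORDER BY temp_blks_written DESC
LIMIT 10;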

And actually that’s probably all.

It is clear that db.t3.small with two gigabytes was not enough for us.

We have a suspicion about which query caused this “chain reaction” – we saw an “ugly” SELECT in the slow query logs, and the developers seem to have already optimized it.

We’ll try downsizing the instance to db.t3.medium with 4 gigabytes of memory and see if the problem occurs again.

Monitoring summary

Instead of conclusions, here are some ideas of what should be monitored and what could be improved.

Our alerts

I will outline some alerts that we already have now.

CloudWatch metrics

We are collecting CloudWatch metrics into VictoriaMetrics with the YACE exporter.

CPUUtilization

The alert:

- alert: HighCPUUtilization
  expr: avg(aws_rds_cpuutilization_average{dimension_DBInstanceIdentifier!="", dimension_DBInstanceIdentifier!~"kraken-ops-rds-.*"}[5m]) by (dimension_DBInstanceIdentifier) > 80
  for: 5m
  labels:
    severity: warning
    component: devops
    environment: ops
  annotations:
    summary: "High CPU utilization on RDS instance"
    description: "CPU utilization is above 80% for more than 5 minutes on RDS instance {{ "{{" }} $labels.instance }}."

DBLoadRelativeToNumVCPUs

The alert:

- alert: HighCPULoadPerVCPUWarningAll
  expr: avg(aws_rds_dbload_relative_to_num_vcpus_average{dimension_DBInstanceIdentifier!="", dimension_DBInstanceIdentifier!~"kraken-ops-rds-.*"}[5m]) by (dimension_DBInstanceIdentifier) > 0.8
  for: 5m
  labels:
    severity: warning
    component: devops
    environment: ops
  annotations:
    summary: "High per-core CPU utilization on RDS instance"
    description: |
      CPU utilization is above 80% for more than 5 minutes on RDS instance {{ "{{" }} $labels.instance }}
      *DB instance*: `{{ "{{" }} $labels.dimension_DBInstanceIdentifier }}`
      *Per-vCPU load*: `{{ "{{" }} $value | humanize }}`

FreeStorageSpace

Not very relevant if you have storage autoscaling enabled, but it can be useful.

The alert:

- record: aws:rds:free_storage:gigabytes
  expr: sum(aws_rds_free_storage_space_average{dimension_DBInstanceIdentifier!=""}) by (dimension_DBInstanceIdentifier) / 1073741824

# ALL
- alert: LowFreeStorageSpaceCriticalAll
  expr: aws:rds:free_storage:gigabytes < 5
  for: 5m
  labels:
    severity: warning
    component: devops
    environment: ops        
  annotations:
    summary: "Low Disk Space on an RDS instance"
    description: |-
      Free storage below 5 GB
      *DB instance*: `{{ "{{" }} $labels.dimension_DBInstanceIdentifier }}`
      *Free storage*: `{{ "{{" }} $value | humanize }}`

FreeableMemory

The alert:

- alert: LowFreeableMemoryDev
  expr: avg(aws_rds_freeable_memory_average{dimension_DBInstanceIdentifier="kraken-ops-rds-dev"}[5m]*0.000001) by (dimension_DBInstanceIdentifier) < 20
  for: 5m
  labels:
    severity: warning
    component: backend
    environment: dev       
  annotations:
    summary: "High memory usage on RDS instance"
    description: |-
      Freeable memory is less than 20 MB
      *DB instance*: `{{ "{{" }} $labels.dimension_DBInstanceIdentifier }}`
      *Free memory*: `{{ "{{" }} $value | humanize }}`

ReadLatency, ReadIOPS, and WriteLatency with WriteIOPS

These metrics are similar to each other, and all of them are useful to monitor.

The alert:

- alert: HighDiskReadLatencyKrakenStaging
  expr: sum(aws_rds_read_latency_average{dimension_DBInstanceIdentifier="kraken-ops-rds-dev"}) by (dimension_DBInstanceIdentifier) > 0.1
  for: 1s
  labels:
    severity: warning
    component: backend
    environment: dev
  annotations:
    summary: "High Disk Read Latency on RDS instance"
    description: |-
      Reads from a storage are too slow
      *DB instance*: `{{ "{{" }} $labels.dimension_DBInstanceIdentifier }}`
      *Read Latency*: `{{ "{{" }} $value | humanize }}`

SwapUsage

Also a must-have metric.

The alert:

- record: aws:rds:swap_used:gigabytes
  expr: sum(aws_rds_swap_usage_average{dimension_DBInstanceIdentifier!=""}) by (dimension_DBInstanceIdentifier) / 1073741824

# ALL
- alert: SwapUsedAllWarning
  expr: sum(aws:rds:swap_used:gigabytes{dimension_DBInstanceIdentifier!="", dimension_DBInstanceIdentifier!~"kraken-ops-rds-.*"}) by (dimension_DBInstanceIdentifier) > 0.8
  for: 1s
  labels:
    severity: warning
    component: devops
    environment: ops
  annotations:
    summary: "Swap space use is too high on an RDS instance"
    description: |-
      The RDS instance is using more than *0.8 GB* of swap space
      *DB instance*: `{{ "{{" }} $labels.dimension_DBInstanceIdentifier }}`
      *Swap used GB*: `{{ "{{" }} $value | humanize }}`

DatabaseConnections

There is a CloudWatch metric, but it just returns the number of connections, and the limit can be different for different types of instances.

So I’ll show you an example of an alert here, but then there will be another one from PostgreSQL Exporter metrics:

# db.t3.micro - 112 max_connections (Backend Dev)
# db.t3.small - 225 max_connections (Backend Prod)
# db.t3.medium - 450 max_connections
# db.t3.large - 901 max_connections

# ALL
- alert: HighConnectionCountWarning
  expr: avg(aws_rds_database_connections_average{dimension_DBInstanceIdentifier!="", dimension_DBInstanceIdentifier!~".*kraken.*"}[5m]) by (dimension_DBInstanceIdentifier) > 50
  for: 1m
  labels:
    severity: warning
    component: devops
    environment: ops
  annotations:
    summary: "High number of connections on RDS instance"
    description: |-
      An RDS instance Connections Pool is almost full. New connections may be rejected.
      *DB instance*: `{{ "{{" }} $labels.dimension_DBInstanceIdentifier }}`
      *Instance type*: `db.t3.micro`
      *Max connections*: `112`
      *Current connections*: `{{ "{{" }} $value | humanize }}`

Loki Recording Rules metrics

We generate a couple of metrics from the logs of our Backend API, for example:

- record: aws:rds:backend:connection_failed:sum:rate:5m
  expr: |
    sum(
        rate(
            {app=~"backend-.*"} 
            != "token_email" 
            |= "sqlalchemy.exc.OperationalError" 
            | regexp `.*OperationalError\) (?P<message>connection to server at "(?P<db_server>[^"]+)".*$)`
            [5m]
        )
    ) by (message, db_server)

And then we have alerts from them:

- alert: BackendRDSConnectionFailed
  expr: sum(aws:rds:backend:connection_failed:sum:rate:5m{db_server="dev.db.kraken.ops.example.co"}) by (db_server, message) > 0
  for: 1s
  labels:
    severity: critical
    component: backend
    environment: dev
  annotations:
    summary: "Connection to RDS server failed"
    description: |-
      Backend Pods can't connect to an RDS instance
      *Database server:*: {{ "{{" }} $labels.db_server }}
      *Error message*: {{ "{{" }} $labels.message }}

PostgreSQL Exporter metrics

pg_stat_database_numbackends

Again about Connections limit: in the Exporter, we have the pg_settings_max_connections metric, which indicates the maximum number of connections depending on the type of instance, and pg_stat_database_numbackends – the number of active sessions (connections).

Accordingly, we can calculate the percentage of max connections used.
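
The same percentage can also be checked directly in PostgreSQL:

-- current connections vs the max_connections limit
SELECT count(*) AS connections,
       current_setting('max_connections')::int AS max_connections,
       round(100.0 * count(*) / current_setting('max_connections')::int, 1) AS used_percent
FROM pg_stat_activity;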

The only problem is that these metrics have different labels, and I didn’t bother with label_replace, so I just added three records, one for each environment:

# 'pg_stat_database_numbackends' and 'pg_settings_max_connections' have no common labels
# don't want to waste time with 'label_replace' or similar
# thus just create different 'records' for Prod and Staging
- record: aws:rds:kraken_dev:max_connections_used:percent
  expr: |
    (
      sum(pg_stat_database_numbackends{datname=~"dev_kraken_db", job="atlas-victoriametrics-postgres-exporter-kraken-dev"})
      /
      sum(pg_settings_max_connections{container=~".*kraken-dev"})
    ) * 100

- alert: ExporterHighConnectionPercentBackendDevWarning
  expr: aws:rds:kraken_dev:max_connections_used:percent > 40
  for: 1s
  labels:
    severity: warning
    component: backend
    environment: dev
  annotations:
    summary: "High number of connections on the Backend RDS instance"
    description: |-
      RDS instance Connections Pool is almost full. New connections may be rejected.
      *DB instance*: `kraken-ops-rds-dev`
      *Connections pool use*: `{{ "{{" }} $value | humanize }}%`
    grafana_rds_overview_url: 'https://{{ .Values.monitoring.root_url }}/d/ceao6muzwos1sa/kraken-rds?orgId=1&from=now-1h&to=now&timezone=browser&var-query0=&var-db_name=kraken-ops-rds-dev'

pg_stat_activity_max_tx_duration

Alerts when some transactions are taking too long.

I can’t say that this is a very useful metric, because we don’t get the PID or the amount of memory used, but it’s enough for now.

Then we can think about custom metrics.
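
In the meantime, when this alert fires, the PID and the duration of the longest transactions can be pulled ad hoc from pg_stat_activity:

-- longest-running open transactions with their PIDs
SELECT pid, usename, state,
       now() - xact_start AS tx_duration,
       query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY tx_duration DESC
LIMIT 5;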

For now, the alert is as follows:

- alert: ExporterTransactionExecutionTimeBackendDevWarning
  expr: sum(rate(pg_stat_activity_max_tx_duration{datname="dev_kraken_db"}[5m])) by (state, datname) > 0.1
  for: 1m
  labels:
    severity: warning
    component: backend
    environment: dev
  annotations:
    summary: "RDS transactions running too long"
    description: |-
      Too long duration in seconds active transaction has been running
      *Database name*: `{{ "{{" }} $labels.datname }}`
      *State*: `{{ "{{" }} $labels.state }}`
      *Duration*: `{{ "{{" }} printf "%.2f" $value }}` seconds
    grafana_rds_overview_url: 'https://{{ .Values.monitoring.root_url }}/d/ceao6muzwos1sa/kraken-rds?orgId=1&from=now-1h&to=now&timezone=browser&var-query0=&var-db_name={{ "{{" }} $labels.datname }}'

Worth adding alerts

Well, there is a lot of stuff here.

PostgreSQL Exporter custom metrics

The main thing I mentioned above is that in PostgreSQL Exporter we can create custom metrics with the results of PostgreSQL queries using config.queries.

See Create Prometheus integrated Postgres custom metrics.

Although this feature seems to be deprecated.

But even if it is deprecated, you can still go the extra mile and write your own exporter. See Prometheus: Creating a Custom Prometheus Exporter in Python and Prometheus: GitHub Exporter – writing a custom exporter for the GitHub API.

CloudWatch Logs, and Enhanced Monitoring

As I mentioned before, it would be very useful to have our own metrics for Swap, or to have the PID of processes and their resident memory.

How can I filter Enhanced Monitoring CloudWatch logs to generate automated custom metrics for Amazon RDS?

Transactions monitoring

I have it on my Grafana dashboard now:

But maybe I will add separate alerts – on the number of “active” transactions, or on the “idle in transaction” count.

Basic metrics from PostgreSQL Exporter:

  • pg_stat_database_xact_commit and pg_stat_database_xact_rollback: as we saw in our case, if the value drops, then we have problems – requests are not completed
  • pg_stat_activity: by the state label, we have two main ones:
    • active: the total number of active requests
    • idle in transaction: we also saw in our case that many requests were hanging waiting for completion
  • pg_locks: the number of locks (see pg_locks and Concurrency Control)
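
On the PostgreSQL side, the same counters can be sampled directly from pg_stat_activity, for example:

-- number of sessions per state for the current database
SELECT state, count(*) AS sessions
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY state
ORDER BY sessions DESC;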

WAL monitoring

I also mentioned a few metrics that are available in PostgreSQL Exporter, maybe I’ll add either alerts or graphs to Grafana for them:

  • pg_stat_archiver_archived_count: total number of successfully archived WAL files (which will tell us that WAL is working correctly)
  • pg_stat_archiver_failed_count: number of failed attempts to archive WAL files
  • pg_stat_bgwriter_checkpoint_time: time spent on executing CHECKPOINT

In the PostgreSQL server itself, we can check it with the pg_stat_wal view:

SELECT * FROM pg_stat_wal;

The main ones here are:

  • wal_records: number of WAL records written (INSERT, UPDATE, DELETE operations)
  • wal_bytes: total amount of data (in bytes) written to the WAL
  • wal_buffers_full: the number of times the WAL buffers were completely full, forcing backend processes to write directly to the WAL file
  • wal_write: the number of times PostgreSQL wrote the WAL to a file
  • wal_write_time: total time in milliseconds spent on WAL writes
  • wal_sync_time: total time (in milliseconds) spent on fsync() (guaranteed write to disk)

shared_buffers monitoring

We still need to think about what metrics we could generate and what graphs or alerts we could come up with.

From what comes to mind:

  • monitor shared hit and read: how much data was found in the cache and how much had to be read from disk
  • buffers_backend: how many buffers were written directly by backend processes
    • in a normal situation, all dirty pages should be written by bgwriter or the checkpointer
    • if all shared_buffers are busy, and bgwriter, wal_writer or checkpointer does not have time to transfer data from them to disk, then the clients’ backend processes are forced to transfer data themselves, which slows down the execution of their requests

Check with:

SELECT buffers_backend, buffers_checkpoint, buffers_alloc FROM pg_stat_bgwriter;

Here:

  • buffers_backend: how many buffers were written directly by backend processes
  • buffers_checkpoint: how many buffers are written during CHECKPOINT
    • if we have a high value here:
      • then checkpoints occur rarely and write many pages at once
      • or bgwriter does not have time to write, and CHECKPOINT writes everything at once
  • buffers_alloc: how many new buffers are allocated to shared_buffers
    • if we have a high value, then the cache is constantly being pushed out, and PostgreSQL is forced to load pages from disk

Checkpointer monitoring

It also makes sense to keep an eye on the Checkpointer activity:

SELECT checkpoint_write_time, checkpoint_sync_time FROM pg_stat_bgwriter;

Here:

  • checkpoint_write_time: time spent on writing changed pages (dirty pages) from shared_buffers to the file system; if the value is large, then:
    • shared_buffers is too large – when checkpointing, you have to write too many pages at once
    • many operations (UPDATE, DELETE), which leads to a large number of “dirty pages”
    • or the checkpoint_timeout is too big, so many changes are written at once during the checkpoint
  • checkpoint_sync_time: time spent on forced writing (performing fsync()) of changed pages to the physical disk; if the value is large, there are possibly disk problems – slow data writes

work_mem monitoring

It also makes sense to look here.

If work_mem is not enough, processes start writing temporary files (temp_blks_written grows), which, firstly, slows down query execution, and secondly, creates additional load on the disk.

Check with:

SELECT temp_files, temp_bytes FROM pg_stat_database WHERE datname = current_database();

Useful links