We are migrating our Backend API from DynamoDB to AWS RDS PostgreSQL, and RDS has crashed several times along the way.
Actually, given that we took a db.t3.small with two vCPUs and two gigabytes of memory to save money, this was quite expected, but I still wanted to understand why exactly everything was crashing.
A few days later, I started debugging the issue, and although we haven't found the root cause yet, I ended up with a good draft of how to troubleshoot performance issues in RDS PostgreSQL.
This post is not a simple how-to, but rather a note to myself – where and what to look for next time, and what to change in monitoring so that the problem is visible before it becomes critical.
The Issue
So, here’s how it all started.
The Backend API is running on AWS Elastic Kubernetes Service, and at some point, we started getting 503 error alerts in Slack:
Alerts about Swap usage on Production RDS have appeared:
Got errors from Sentry about problems connecting to the database server:
We started checking RDS monitoring and saw that at some point Freeable Memory had dropped to 50 megabytes:
When the server crashed, we restarted it, but the problem immediately reappeared.
So we decided to move to db.m5.large for now – you can see on the graph how the free memory became 7.25 GB.
Now, let’s see what we can learn from this whole story.
Set “Application Name”
In Performance Insights, you can display statistics on individual Applications:
We had not done this, but I think it makes sense to set separate Application names for the monitoring exporters and for the Backend API services.
There are several ways to do this.
Either pass parameters to connection strings:
"postgresql://user:password@host:port/database?application_name=MyApp"
Or execute directly from the code when initializing connections:
with engine.connect() as conn:
    # in SQLAlchemy 2.x the statement string must be wrapped in sqlalchemy.text()
    conn.execute("SET application_name TO 'MyApp'")
The second option looks more attractive, because the connection string for the Backend API is passed as an environment variable from the AWS Secrets Store, and creating a separate URL for each API service just for the sake of a single application_name parameter looks a bit clumsy.
Therefore, it is better to set this parameter in each backend app when it creates its connections.
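To verify that the names are applied, you can check which application_name values are currently connected – a quick ad-hoc query over pg_stat_activity (my own sketch, the view and columns are standard):

-- which applications are connected right now, and with how many sessions
SELECT application_name, usename, count(*) AS connections
FROM pg_stat_activity
GROUP BY application_name, usename
ORDER BY connections DESC;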
Useful PostgreSQL Extensions
During the investigation, I added a few PostgreSQL Extensions, which are very useful in such cases for monitoring and investigation.
Enabling pg_stat_statements
Unfortunately, it was also not enabled at the time of the problem, but in general, it is a must-have.
Documentation – pg_stat_statements – track statistics of SQL planning and execution and SQL statistics for RDS PostgreSQL.
In RDS PostgreSQL versions 11 and higher, the library is enabled by default, so all you need to do is create the EXTENSION – see CREATE EXTENSION.
Check if there is an extension now:
dev_kraken_db=> SELECT * FROM pg_available_extensions WHERE name = 'pg_stat_statements' and installed_version is not null;
 name | default_version | installed_version | comment
------+-----------------+-------------------+---------
(0 rows)
(0 rows) – ok, empty.
Let’s create it:
dev_kraken_db=> CREATE EXTENSION pg_stat_statements;
CREATE EXTENSION
Check it again:
dev_kraken_db=> SELECT * FROM pg_available_extensions WHERE name = 'pg_stat_statements' and installed_version is not null;
        name        | default_version | installed_version |                                 comment
--------------------+-----------------+-------------------+-------------------------------------------------------------------------
 pg_stat_statements | 1.10            | 1.10              | track planning and execution statistics of all SQL statements executed
(1 row)
Try to get some information from pg_stat_statements with a query like this:
SELECT query FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;
Later in this post, there will be more examples of what information we can get from pg_stat_statements.
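As a slightly richer example (my own sketch; the column names below are from pg_stat_statements 1.8+ / PostgreSQL 13+), you can also pull call counts and average execution times:

-- top queries by mean execution time, with call counts and returned rows
SELECT calls,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       round(total_exec_time::numeric, 2) AS total_ms,
       rows,
       query
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;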
Enabling pg_stat_activity
In addition to pg_stat_statements, useful information about current activity can be obtained from pg_stat_activity, which is enabled by default.
Both are views, although in different database schemas:
dev_kraken_db=> \dv *.pg_stat_(statements|activity)
                List of relations
   Schema   |        Name        | Type |  Owner
------------+--------------------+------+----------
 pg_catalog | pg_stat_activity   | view | rdsadmin
 public     | pg_stat_statements | view | rdsadmin
Difference between pg_stat_activity and pg_stat_statements in PostgreSQL
Both help to analyze queries, but pg_stat_activity is about current activity, while pg_stat_statements is the "historical" one:
| Parameter | pg_stat_activity | pg_stat_statements |
|---|---|---|
| What does it show? | Current active sessions and their status. | History of executed SQL queries with statistics. |
| Real-time data? | Yes, only active processes. | No, it is accumulated statistics over all queries. |
| What queries are visible? | Only those that are being executed right now. | Queries that were executed earlier (even if they have already finished). |
| Does it preserve history? | No, the data disappears after the query is completed. | Yes, PostgreSQL collects and aggregates statistics. |
| What can be found? | Which query is currently running, how long it has been running, what it is waiting for (CPU, I/O, Locks). | Average, minimum, and maximum query execution time, number of calls. |
| Main use | Real-time performance analysis, finding problematic queries right now. | Finding "heavy" queries that create load in the long run. |
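To illustrate the "real-time" side, a query like this (my own sketch over pg_stat_activity) shows the current non-idle sessions and how long each one has been running:

-- current non-idle sessions, longest-running first
SELECT pid,
       usename,
       state,
       now() - query_start AS runtime,
       wait_event_type,
       wait_event,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC NULLS LAST;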
Enabling pg_buffercache
Another useful extension is pg_buffercache, which can display information about the state of memory in PostgreSQL.
It is enabled in the same way as pg_stat_statements:
CREATE EXTENSION IF NOT EXISTS pg_buffercache;
Next, we’ll also look at some interesting queries for checking the memory status in PostgreSQL.
Okay. Let’s get back to our problem.
CPU utilization, and DB Load
The first thing we noticed was the CPU load.
In Performance Insights, it looked like this:
And in the Monitoring tab of the instance itself:
DBLoad (Database Load)
The documentation on DBLoad in CloudWatch is here>>>.
In PostgreSQL, each client session creates a separate process (a backend process).
DBLoad is an AWS RDS PostgreSQL metric that shows the number of active sessions that are executing or waiting for resources – CPU, disk I/O, locks. Sessions in the idle state are not taken into account, but sessions in the active, idle in transaction, or waiting states are counted.
DBLoad is similar to Load Average on Linux, but only considers PostgreSQL sessions:
- In Linux, Load Average shows the number of processes on a Linux server that are either using CPU, or waiting for CPU, or I/O
- In RDS, DBLoad displays the average number of active sessions on the PostgreSQL server that are either running or waiting for resources
That is, ideally, each backend process that executes requests from a connected client should have access to its own vCPU core, so DBLoad should be ~= the number of vCPUs or less.
If DBLoad significantly exceeds the number of available cores, it means that the system is overloaded and processes (sessions) are waiting in the queue for CPU or other resources.
DBLoad includes two more indicators:
- DBLoadCPU: sessions that are just waiting for free CPU
- DBLoadNonCPU: sessions that are waiting for disk, database table locks, networking, etc.
You can check the sessions that will be considered active and included in DBLoad as follows:
SELECT pid, usename, state, wait_event, backend_type, query
FROM pg_stat_activity
WHERE state != 'idle';
Normal value for DBLoad
DBLoad should be approximately equal to or lower than the number of available vCPUs.
DBLoad vs CPU Utilization
Why do we see "100%" on the first screenshot and just the number "17.5" on the second?
- CPU Utilization: the percentage of CPU utilization of the total available power
- DBLoad: number of active sessions
Considering that there were 2 available vCPUs on the server at the time, and 17 active sessions, we have 100% CPU time utilization.
We should also pay attention to DBLoadRelativeToNumVCPUs, which is DBLoad divided by the number of available vCPUs, i.e. the average load on each CPU core.
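For a quick ad-hoc check of "how many active sessions are there right now" directly on the server (roughly what DBLoad samples), something like this can be used – a sketch over pg_stat_activity:

-- rough "DBLoad right now": non-idle client backend sessions
SELECT count(*) AS active_sessions
FROM pg_stat_activity
WHERE state <> 'idle'
  AND backend_type = 'client backend';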
DBLoadCPU (Database Load on CPU Wait)
DBLoadCPU displays the number of active sessions waiting for CPUs, i.e. processes that cannot run because all available CPUs are busy.
Ideally, it should be close to zero, meaning that there should be no processes waiting for the CPU.
If DBLoadCPU is close to DBLoad, then RDS does not have time to process all requests – there is not enough CPU time, and they are queued.
We can check it with the same query from pg_stat_activity as above: for an active session, wait_event is NULL when it is running on (or waiting for) the CPU – in Performance Insights this shows up as the "CPU" wait event.
Normal value for DBLoadCPU
DBLoadCPU should be as low as possible (close to zero).
If DBLoadCPU is almost equal to DBLoad, then:
- the main load is on the CPU
- sessions are not blocked due to Table Locks or slow disk (I/O), but simply wait for the CPU
DBLoadNonCPU (Database Load on Non-CPU)
DBLoadNonCPU displays information about waiting for non-CPU resources.
These can be:
- Locks: waiting for access to a table or row
- I/O Wait: Slow reads or writes due to disk constraints
- Network Wait: Delays due to network operations (such as replication or data transfer)
- Other Wait Events: other waits, such as background maintenance processes
We can check such sessions similarly to the previous queries against pg_stat_activity, but with the wait_event_type and wait_event columns added:
SELECT pid, usename, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;
Here, wait_event_type indicates the type of resource the process is waiting for (IO, Lock, LWLock, Client, etc.), and wait_event details which specific operation the process is waiting for.
For example, if wait_event_type is "IO", then wait_event can contain the values "DataFileRead" (waiting for a read from disk) or "DataFileWrite" (waiting for a write to disk).
Or, if wait_event_type is "Client", then wait_event can be "ClientRead", "ClientWrite", "ClientSocket".
See RDS for PostgreSQL wait events.
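A handy way to see what is dominating at a given moment is to aggregate the active sessions by wait event (my own sketch, standard pg_stat_activity columns):

-- what the active sessions are waiting for right now
SELECT wait_event_type,
       wait_event,
       count(*) AS sessions
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event
ORDER BY sessions DESC;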
Let’s get back to our graphs:
- CPU Utilization 100%: the system is overloaded
- DBLoad 17.5: with two vCPUs – there are many active sessions, the processor does not have time to process all requests
- DBLoadCPU 13.9: many sessions are waiting for available CPU
- DBLoadNonCPU 3.59: some requests were waiting for disk, locks, or network operations
Operating System CPU utilization
In addition to the DBLoad metrics, which relate specifically to RDS and PostgreSQL, we also have metrics from the operating system itself – disk, memory, and CPU.
The CPU utilization graph shows usage as a percentage of the total, and consists of several metrics, each of which reflects a separate mode:
- os.cpuUtilization.steal.avg (Steal Time): waiting for the physical CPU when AWS has allocated it to another instance on the same physical server, or when CPU Credits are used up and AWS is throttling your instance
- os.cpuUtilization.guest.avg (Guest Time): CPU "eaten" by a guest operating system – relevant if the server runs a Virtual Machine or Docker, so not really about RDS
- os.cpuUtilization.irq.avg (Interrupt Requests, IRQ Time): processing hardware interrupts (network requests or disk), may be associated with high IOPS on EBS
- os.cpuUtilization.wait.avg (I/O Wait Time): waiting for disk operations, for example, reading files
- os.cpuUtilization.user.avg (User Time): time for user-space processes, in this case PostgreSQL query processing
- os.cpuUtilization.system.avg (System Time): operating system kernel work (handling processes from user space, disk operations, memory operations)
- os.cpuUtilization.nice.avg (Nice Time): time for processes with a nice priority – low-priority background tasks
We have similar data in the Linux top output:
Tasks: 611 total, 7 running, 603 sleep, 1 d-sleep, 0 stopped, 0 zombie
%Cpu(s): 5.9 us, 3.6 sy, 0.0 ni, 89.8 id, 0.2 wa, 0.0 hi, 0.5 si, 0.0 st
...
Here us stands for user, sy for system, ni for nice, id for idle, and so on.
On our graph in RDS Performance Insights, the largest part is wait – and at the same time there are spikes in EBS IO operations:
That is, a query arrived and started actively reading from the disk:
And while the CPU was waiting for the disk operations to complete, the rest of the requests were executed more slowly.
At the same time, we have a low Freeable memory – because the data from the disk was written to memory.
And although the value of 460 IOPS itself does not look too high, it appears that it was at this moment that we “ate” the available memory.
What can a high os.cpuUtilization.wait.avg indicate in this case?
- Slow EBS: still not our case, because we have EBS with up to 3000 IOPS; although waiting for disk-to-memory reads could cause an increase in I/O Waits on the CPU
- Locks: a possible option, but we have the db.Transactions.blocked_transactions.avg metric – and everything was fine there, that is, PostgreSQL was not waiting for locks to be released (we will also look at Transactions below)
- Low RAM: reading new data from disk into memory pushed the existing data out to Swap, and then it had to be read back from there, while other data was dumped from memory to Swap to make room for the data loaded from Swap (a swap storm)
The use of Swap at this time also increased:
We’ll move on to Swap and ReadIOPS soon, but first, let’s take a look at transactions.
Transactions
Another interesting thing is the activity of transactions:
We can see that as soon as the CPU problems started, our xact_commit and xact_rollback (graph on the left) dropped to zero, while the number of active_transactions increased to ~20, but blocked_transactions stayed at 0.
I can't make a screenshot anymore, but there was also an "idle in transaction" spike – that is, transactions were started (BEGIN) but not completed (no COMMIT or ROLLBACK was performed).
But how can this be? Even with high CPU Waits, at least some of the transactions should have completed.
- high Read IOPS: the system could not get data from the disk?
  - no – ReadIOPS increased, but not that critically
  - however, because FreeableMemory was too low, data from shared_buffers could be pushed out to Swap, which caused even more delays for the processes waiting for this data
- too high Write IOPS: the system could not perform WAL operations (Write Ahead Log, we will look at it in detail later), which are required to complete a transaction?
  - but we saw that Write IOPS was normal
- a lot of table locks, and processes were waiting for resources to free up?
  - also no, because we saw that blocked_transactions was at zero
- an autovacuum or ANALYZE running, which could have blocked transactions?
  - but again, db.Transactions.blocked_transactions.avg was at zero
- a Swap storm: this looks closest to the truth:
  - reads from disk pushed active data to Swap (FreeableMemory dropped)
  - Swap Usage grew to almost 3 gigabytes
  - PostgreSQL could not get pages from shared_buffers, because they were in Swap (more on PostgreSQL memory later)
  - because of this, transactions "hung" waiting for reads from disk instead of working in RAM
What can we check in such cases?
I/O Waits or Blocks
Check the wait_event:
SELECT pid, usename, query, wait_event_type, wait_event, state
FROM pg_stat_activity
WHERE state = 'active';
If we have "IO" or "Lock" in the wait_event_type, the reason may be here.
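For the Lock case specifically, it can also help to see who is blocking whom – a sketch using the standard pg_blocking_pids() function:

-- sessions that are blocked by locks, and the PIDs holding those locks
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       wait_event,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;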
WAL – Write Ahead Logs
Let’s speak a bit about WAL in PostgreSQL.
- with each DML (Data Manipulation Language) operation, such as INSERT, UPDATE or DELETE, the data is first changed in memory (shared_buffers – we will talk about them further), where the "operation context" is created
- at the same time, this operation is written to the WAL buffer (wal_buffers – a memory buffer)
- when wal_buffers is full, or when the transaction is completed, the PostgreSQL process wal_writer uses the fsync() system call to write data from the buffer to a WAL file (the pg_wal/ directory) – this is a log of all changes that occurred before COMMIT
- the client that started the query execution receives the COMMIT message – the operation was successfully completed
  - if synchronous_commit = on, PostgreSQL waits for the fsync() to complete before sending COMMIT
  - if synchronous_commit = off, PostgreSQL does not wait for fsync() and COMMIT is faster, but with the risk of data loss
  - if the transaction cannot be executed, the client will receive a "could not commit transaction" error
- data from the shared_buffers is written to the files of the database itself (the base/ directory) – this is done by the checkpointer process, which writes modified data in memory (dirty pages) to the disk
  - this happens via the CHECKPOINT process, not immediately after COMMIT, but periodically
  - after executing a CHECKPOINT, PostgreSQL archives or deletes WAL files
A very cool article on WAL, memory and checkpoints is PostgreSQL: What is a checkpoint?
So, if the EBS disk was overloaded with Write IOPS, then WAL could stop writing, and this could lead to a transaction stall.
But in our case, we can see that db.Transactions.xact_rollback.avg was at zero, and a ROLLBACK does not depend on WAL and write operations to the disk.
PostgreSQL Exporter has several useful metrics that display WAL activity:
- pg_stat_archiver_archived_count: the total number of successfully archived WAL files (which tells us that WAL archiving is working correctly)
- pg_stat_archiver_failed_count: the number of failed attempts to archive WAL files
- pg_stat_bgwriter_checkpoint_time: time spent on executing CHECKPOINTs
You can also make the following query:
SELECT * FROM pg_stat_wal;
If wal_buffers_full is high and growing, transactions might be waiting for fsync() to complete, or the value of wal_buffers is too low and needs to be increased to reduce the frequency of forced WAL writes to disk.
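To see the current buffer size, a quick check (with the default setting of -1, PostgreSQL auto-sizes wal_buffers to about 1/32 of shared_buffers, capped at 16 MB – worth verifying what your parameter group actually sets):

SHOW wal_buffers;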
The PostgreSQL Exporter doesn't seem to have such a metric out of the box, but we can create our own with a custom.yaml:
pg_stat_wal:
  query: "SELECT wal_buffers_full FROM pg_stat_wal;"
  metrics:
    - wal_buffers_full:
        usage: "COUNTER"
        description: "Number of times the WAL buffers were completely full, causing WAL data to be written to disk."
Read IOPS, and Swap
Okay.
Let’s go back to the issue with the ReadIOPS and Swap.
What could have happened here:
- some queries started actively reading data from the disk
- the data was written to shared_buffers; there was not enough space in memory, and the data that was already in shared_buffers was pushed out to Swap
- queries in PostgreSQL continued to be executed, but now, instead of simply taking data from memory, PostgreSQL had to go to Swap, and therefore we got high ReadIOPS and CPU I/O Waits – that is, the CPU waited for data to be read from the disk
But then there should be a spike in db.IO.blks_read.avg, since we are reading from disk, shouldn't there?
Well, no, because db.IO.blks_read counts read requests from PostgreSQL itself.
When the operating system serves those pages from Swap, PostgreSQL still thinks it is working with RAM.
The ReadIOPS metric, however, comes from the operating system/EBS itself, and it shows all read operations, not just those from PostgreSQL processes.
Interestingly, at the time of the problem, db.Cache.blks_hit dropped to zero. What does this mean? Normally, it would mean that the backend processes (sessions) did not find data in shared_buffers.
But knowing that all our transactions stopped, and db.IO.blks_read also dropped to zero, it is more likely that PostgreSQL simply stopped accessing the cache at all, because everyone was waiting for a free CPU.
Okay, Google…
What about the Swap?
SWAP, and Enhanced monitoring
What do we have on the graphs in the Monitoring?
This is where Enhanced monitoring will be useful:
Select Swap (via the Manage graphs button), and we see the following picture:
- Free Memory is going down
- Free Swap is going down
- but Swaps in and Swaps out remain unchanged
That is, it looks like the memory is running out – the system is dumping data from RAM to Swap – but at the same time, the Swap in/out operations themselves "did not happen".
It looks interesting, and it would be worth digging further, but AWS Console is constantly crashing in these graphs:
See OS metrics in Enhanced Monitoring.
However, although we can't get metrics from Enhanced monitoring directly, they are written to CloudWatch Logs! And from those logs, we can generate any metrics with VictoriaLogs or Loki:
In the logs, we can see that the Swap In/Out did take place. In the future, it would be useful to create some metrics from these logs.
To find these logs, go to CloudWatch Logs, search the Log Groups for RDSOSMetrics, and then select a log stream by the RDS ID:
Operating system process list
A list of processes can also be very useful:
If the free memory starts to drop, go here, look at the Resident memory, find the PID of the process that is consuming memory, and see what kind of request it is:
prod_kraken_db=> SELECT user, pid, query FROM pg_stat_activity WHERE pid = '26421';
       user       |  pid  |  query
------------------+-------+----------
 prod_kraken_user | 26421 | ROLLBACK
We also have all these processes in the logs I mentioned above – but it's better to check them at the moment the problem occurs, to find out exactly which query was being executed.
Because yes, we can enable slow query logs – but in those logs, we won't see the PID, and we won't be able to find out how much memory the query used.
Memory in PostgreSQL
Let’s dig a little deeper into what’s in PostgreSQL memory.
See RDS for PostgreSQL memory and Tuning memory parameters.
Memory in PostgreSQL is divided into two main types: “shared memory” for all processes, and “local memory” – the memory of each backend process (session).
In the shared memory, we have:
- shared_buffers: the main memory area where PostgreSQL keeps a cache of the data it reads from disk when processing queries – caching table pages and indexes
  - an analog of the JVM Heap Memory (Java Heap)
  - shared_buffers is 25% of the total RAM by default, but can be changed
- wal_buffers: already seen above – used to temporarily store WAL records, buffering the transaction before it is written to the WAL file
From the shared_buffers, changed data (dirty pages) is written to disk by two processes:
- Background Writer (bgwriter): runs in the background, gradually writing data to disk
- Checkpointer (checkpointer): forcibly writes all dirty pages during a CHECKPOINT
Process memory has:
- work_mem: allocated to queries that perform ORDER BY, HASH JOIN, and aggregations
  - each query gets its own copy of work_mem, therefore, with a large number of simultaneous queries, memory can quickly run out
  - if a process does not have enough work_mem, PostgreSQL starts writing temporary files to disk (temp_blks_written), which slows down query execution
- maintenance_work_mem: for maintenance operations – vacuuming, creating indexes, adding foreign keys
- temp_buffers: allocated for temporary tables (CREATE TEMP TABLE)
We can get all the data from pg_settings as follows:
SELECT
  name,
  setting,
  unit,
  CASE
    WHEN unit = '8kB' THEN setting::bigint * 8
    WHEN unit = 'kB' THEN setting::bigint
    ELSE NULL
  END AS total_kb,
  pg_size_pretty(
    CASE
      WHEN unit = '8kB' THEN setting::bigint * 8 * 1024
      WHEN unit = 'kB' THEN setting::bigint * 1024
      ELSE NULL
    END
  ) AS total_pretty
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'temp_buffers', 'wal_buffers');
We have 238967 shared_buffers, each of them 8 KB, for a total of ~1.9 GB.
But this is already on db.m5.large.
Checking shared_buffers
The Cache hit ratio will show how much data was fetched from memory and how much from the disk itself – although we also have this in the db.IO.blks_read.avg and db.Cache.blks_hit.avg metrics (or the pg_stat_database_blks_hit and pg_stat_database_blks_read metrics in PostgreSQL Exporter):
SELECT
  blks_read,
  blks_hit,
  ROUND(blks_hit::numeric / NULLIF(blks_hit + blks_read, 0), 4) AS cache_hit_ratio
FROM pg_stat_database
WHERE datname = current_database();
If the cache_hit_ratio is below 0.9, the PostgreSQL cache is not efficient, and too much data is read from disk instead of from the cache.
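It can also be useful to break this down per table, to see which relations are read from disk most often – a sketch over the standard pg_statio_user_tables view:

-- per-table cache hit ratio: which tables are read from disk the most
SELECT relname,
       heap_blks_read,
       heap_blks_hit,
       round(heap_blks_hit::numeric / NULLIF(heap_blks_hit + heap_blks_read, 0), 4) AS hit_ratio
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10;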
To see how many of the allocated shared_buffers are currently used and how many are free, we need the pg_buffercache extension.
The query:
SELECT
  COUNT(*) AS total_buffers,
  SUM(CASE WHEN isdirty THEN 1 ELSE 0 END) AS dirty_buffers,
  SUM(CASE WHEN relfilenode IS NULL THEN 1 ELSE 0 END) AS free_buffers,
  SUM(CASE WHEN relfilenode IS NOT NULL THEN 1 ELSE 0 END) AS used_buffers,
  ROUND(100.0 * SUM(CASE WHEN relfilenode IS NOT NULL THEN 1 ELSE 0 END) / COUNT(*), 2) AS used_percent,
  ROUND(100.0 * SUM(CASE WHEN relfilenode IS NULL THEN 1 ELSE 0 END) / COUNT(*), 2) AS free_percent
FROM pg_buffercache;
We have 238967 buffers in total, of which only 12280, or 5%, are used.
Another option is to see how many pages are currently in shared_buffers:
prod_kraken_db=> SELECT count(*) AS cached_pages, pg_size_pretty(count(*) * 8192) AS cached_size FROM pg_buffercache;
 cached_pages | cached_size
--------------+-------------
       117495 | 918 MB
Even though the total allocated size of the shared_buffers is:
prod_kraken_db=> SHOW shared_buffers;
 shared_buffers
----------------
 939960kB
918 megabytes.
But then why did we see in the previous query that “5% used”?
Because in the pg_buffercache results, the used_buffers and used_percent fields count only active pages (used), i.e. those that either have a file reference (relfilenode) or have been used recently.
Using EXPLAIN ANALYZE
See EXPLAIN.
EXPLAIN (ANALYZE, BUFFERS) will show us how much data came from the buffers and how much was read from the disk:
Here:
- shared hit: how many pages were read from the cache (shared_buffers)
- shared read: how many pages were read from disk (and loaded into shared_buffers)
- shared dirtied: how many pages were modified
The Explain topic is quite interesting and gives us a lot of interesting information, so I wrote about it separately – PostgreSQL: using EXPLAIN and setting up “auto_explain” in AWS RDS.
See the contents of shared_buffers
To get the number of buffers for all tables, we can use such a query:
SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
INNER JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
  AND b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY 2 DESC
LIMIT 10;
Here, 1636 buffers are used for the challenge_progress table, which gives us 1636 * 8 / 1024 ≈ 12, that is, about 12 megabytes.
Or you can get it from pg_relation_size() – see System Administration Functions and pg_relation_size():
prod_kraken_db=> SELECT pg_size_pretty(pg_relation_size('challenge_progress'));
 pg_size_pretty
----------------
 13 MB
Checking work_mem
If the FreeableMemory drops, then perhaps too much work_mem is being used.
Check how much is allocated to each process:
dev_kraken_db=> SHOW work_mem;
 work_mem
----------
 4MB
You can check whether the processes have enough work_mem by the value of temp_blks_written, because when the memory in work_mem runs out, a process begins to write data to temporary files:
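Alongside that graph, pg_stat_statements (if it is enabled) can show which particular statements spill to temporary files – a sketch; temp_blks_written is a standard column there:

-- queries that spill to temporary files the most (a sign work_mem is too small for them)
SELECT calls,
       temp_blks_written,
       left(query, 80) AS query
FROM pg_stat_statements
ORDER BY temp_blks_written DESC
LIMIT 10;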
And actually that’s probably all.
It is clear that db.t3.small with two gigabytes of memory was not enough for us.
We have a suspicion about which query caused this "chain reaction" – we saw an "ugly" SELECT in the slow query logs, and the developers seem to have optimized it.
We'll try to reduce the instance type to db.t3.medium with 4 gigabytes of memory, and we will see if the problem occurs again.
Monitoring summary
Instead of conclusions, here are some ideas of what should be monitored and what could be improved.
Our alerts
I will outline some alerts that we already have now.
CloudWatch metrics
We are collecting CloudWatch metrics into VictoriaMetrics with the YACE exporter.
CPUUtilization
The alert:
- alert: HighCPUUtilization
  expr: avg(aws_rds_cpuutilization_average{dimension_DBInstanceIdentifier!="", dimension_DBInstanceIdentifier!~"kraken-ops-rds-.*"}[5m]) by (dimension_DBInstanceIdentifier) > 80
  for: 5m
  labels:
    severity: warning
    component: devops
    environment: ops
  annotations:
    summary: "High CPU utilization on RDS instance"
    description: "CPU utilization is above 80% for more than 5 minutes on RDS instance {{ "{{" }} $labels.instance }}."
DBLoadRelativeToNumVCPUs
The alert:
- alert: HighCPULoadPerVCPUWarningAll
  expr: avg(aws_rds_dbload_relative_to_num_vcpus_average{dimension_DBInstanceIdentifier!="", dimension_DBInstanceIdentifier!~"kraken-ops-rds-.*"}[5m]) by (dimension_DBInstanceIdentifier) > 0.8
  for: 5m
  labels:
    severity: warning
    component: devops
    environment: ops
  annotations:
    summary: "High per-core CPU utilization on RDS instance"
    description: |
      CPU utilization is above 80% for more than 5 minutes on RDS instance {{ "{{" }} $labels.instance }}
      *DB instance*: `{{ "{{" }} $labels.dimension_DBInstanceIdentifier }}`
      *Per-vCPU load*: `{{ "{{" }} $value | humanize }}`
FreeStorageSpace
Not very relevant if you have dynamic storage, but it can be useful.
The alert:
- record: aws:rds:free_storage:gigabytes
  expr: sum(aws_rds_free_storage_space_average{dimension_DBInstanceIdentifier!=""}) by (dimension_DBInstanceIdentifier) / 1073741824

# ALL
- alert: LowFreeStorageSpaceCriticalAll
  expr: aws:rds:free_storage:gigabytes < 5
  for: 5m
  labels:
    severity: warning
    component: devops
    environment: ops
  annotations:
    summary: "Low Disk Space on an RDS instance"
    description: |-
      Free storage below 5 GB
      *DB instance*: `{{ "{{" }} $labels.dimension_DBInstanceIdentifier }}`
      *Free storage*: `{{ "{{" }} $value | humanize }}`
FreeableMemory
The alert:
- alert: LowFreeableMemoryDev
  expr: avg(aws_rds_freeable_memory_average{dimension_DBInstanceIdentifier="kraken-ops-rds-dev"}[5m]*0.000001) by (dimension_DBInstanceIdentifier) < 20
  for: 5m
  labels:
    severity: warning
    component: backend
    environment: dev
  annotations:
    summary: "High memory usage on RDS instance"
    description: |-
      Freeable memory is less than 100mb
      *DB instance*: `{{ "{{" }} $labels.dimension_DBInstanceIdentifier }}`
      *Free memory*: `{{ "{{" }} $value | humanize }}`
ReadLatency, ReadIOPS, and WriteLatency with WriteIOPS
These metrics are similar to each other, and all of them are useful to monitor.
The alert:
- alert: HighDiskReadLatencyKrakenStaging
  expr: sum(aws_rds_read_latency_average{dimension_DBInstanceIdentifier="kraken-ops-rds-dev"}) by (dimension_DBInstanceIdentifier) > 0.1
  for: 1s
  labels:
    severity: warning
    component: backend
    environment: dev
  annotations:
    summary: "High Disk Read Latency on RDS instance"
    description: |-
      Reads from a storage are too slow
      *DB instance*: `{{ "{{" }} $labels.dimension_DBInstanceIdentifier }}`
      *Read Latency*: `{{ "{{" }} $value | humanize }}`
SwapUsage
Also a must-have metric.
The alert:
- record: aws:rds:swap_used:gigabytes
  expr: sum(aws_rds_swap_usage_average{dimension_DBInstanceIdentifier!=""}) by (dimension_DBInstanceIdentifier) / 1073741824

# ALL
- alert: SwapUsedAllWarning
  expr: sum(aws:rds:swap_used:gigabytes{dimension_DBInstanceIdentifier!="", dimension_DBInstanceIdentifier!~"kraken-ops-rds-.*"}) by (dimension_DBInstanceIdentifier) > 0.8
  for: 1s
  labels:
    severity: warning
    component: devops
    environment: ops
  annotations:
    summary: "Swap space use is too high on an RDS instance"
    description: |-
      The RDS instance is using more than *0.8 GB* of swap space
      *DB instance*: `{{ "{{" }} $labels.dimension_DBInstanceIdentifier }}`
      *Swap used GB*: `{{ "{{" }} $value | humanize }}`
DatabaseConnections
There is a CloudWatch metric, but it just returns the number of connections, and the limit can be different for different types of instances.
So I’ll show you an example of an alert here, but then there will be another one from PostgreSQL Exporter metrics:
# db.t3.micro - 112 max_connections (Backend Dev)
# db.t3.small - 225 max_connections (Backend Prod)
# db.t3.medium - 450 max_connections
# db.t3.large - 901 max_connections

# ALL
- alert: HighConnectionCountWarning
  expr: avg(aws_rds_database_connections_average{dimension_DBInstanceIdentifier!="", dimension_DBInstanceIdentifier!~".*kraken.*"}[5m]) by (dimension_DBInstanceIdentifier) > 50
  for: 1m
  labels:
    severity: warning
    component: devops
    environment: ops
  annotations:
    summary: "High number of connections on RDS instance"
    description: |-
      An RDS instance Connections Pool is almost full. New connections may be rejected.
      *DB instance*: `{{ "{{" }} $labels.dimension_DBInstanceIdentifier }}`
      *Instance type*: `db.t3.micro`
      *Max connections*: `112`
      *Current connections*: `{{ "{{" }} $value | humanize }}`
Loki Recording Rules metrics
We generate a couple of metrics from the logs of our Backend API, for example:
- record: aws:rds:backend:connection_failed:sum:rate:5m
  expr: |
    sum(
      rate(
        {app=~"backend-.*"} != "token_email" |= "sqlalchemy.exc.OperationalError"
          | regexp `.*OperationalError\) (?P<message>connection to server at "(?P<db_server>[^"]+)".*$)`
        [5m]
      )
    ) by (message, db_server)
And then we have alerts from them:
- alert: BackendRDSConnectionFailed
  expr: sum(aws:rds:backend:connection_failed:sum:rate:5m{db_server="dev.db.kraken.ops.example.co"}) by (db_server, message) > 0
  for: 1s
  labels:
    severity: critical
    component: backend
    environment: dev
  annotations:
    summary: "Connection to RDS server failed"
    description: |-
      Backend Pods can't connect to an RDS instance
      *Database server:*: {{ "{{" }} $labels.db_server }}
      *Error message*: {{ "{{" }} $labels.message }}
PostgreSQL Exporter metrics
pg_stat_database_numbackends
Again about the connections limit: in the Exporter, we have the pg_settings_max_connections metric, which indicates the maximum number of connections depending on the instance type, and pg_stat_database_numbackends – the number of active sessions (connections).
Accordingly, we can calculate the percentage of max connections used.
The only problem is that these metrics have different labels, and I skipped doing any label_replace, so I just added three records, one for each environment:
# 'pg_stat_database_numbackends' and 'pg_settings_max_connections' have no common labels
# don't want to waste time with 'label_replace' or similar
# thus just create different 'records' for Prod and Staging
- record: aws:rds:kraken_dev:max_connections_used:percent
  expr: |
    (
      sum(pg_stat_database_numbackends{datname=~"dev_kraken_db", job="atlas-victoriametrics-postgres-exporter-kraken-dev"})
      /
      sum(pg_settings_max_connections{container=~".*kraken-dev"})
    ) * 100

- alert: ExporterHighConnectionPercentBackendDevWarning
  expr: aws:rds:kraken_dev:max_connections_used:percent > 40
  for: 1s
  labels:
    severity: warning
    component: backend
    environment: dev
  annotations:
    summary: "High number of connections on the Backend RDS instance"
    description: |-
      RDS instance Connections Pool is almost full. New connections may be rejected.
      *DB instance*: `kraken-ops-rds-dev`
      *Connections pool use*: `{{ "{{" }} $value | humanize }}%`
    grafana_rds_overview_url: 'https://{{ .Values.monitoring.root_url }}/d/ceao6muzwos1sa/kraken-rds?orgId=1&from=now-1h&to=now&timezone=browser&var-query0=&var-db_name=kraken-ops-rds-dev'
pg_stat_activity_max_tx_duration
Alerts when some transactions are taking too long.
I can't say that this is a very useful metric, because we don't have the PID or the amount of memory used, but it's enough for now.
Then we can think about custom metrics.
For now, the alert is as follows:
- alert: ExporterTransactionExecutionTimeBackendDevWarning
  expr: sum(rate(pg_stat_activity_max_tx_duration{datname="dev_kraken_db"}[5m])) by (state, datname) > 0.1
  for: 1m
  labels:
    severity: warning
    component: backend
    environment: dev
  annotations:
    summary: "RDS transactions running too long"
    description: |-
      Too long duration in seconds active transaction has been running
      *Database name*: `{{ "{{" }} $labels.datname }}`
      *State*: `{{ "{{" }} $labels.state }}`
      *Duration*: `{{ "{{" }} printf "%.2f" $value }}` seconds
    grafana_rds_overview_url: 'https://{{ .Values.monitoring.root_url }}/d/ceao6muzwos1sa/kraken-rds?orgId=1&from=now-1h&to=now&timezone=browser&var-query0=&var-db_name={{ "{{" }} $labels.datname }}'
Worth adding alerts
Well, there is a lot of stuff here.
PostgreSQL Exporter custom metrics
The main thing, as I mentioned above: in PostgreSQL Exporter we can create custom metrics from the results of PostgreSQL queries using config.queries.
See Create Prometheus integrated Postgres custom metrics.
Although this feature seems to be deprecated.
But even if it is deprecated, you can still go the extra mile and write your own exporter. See Prometheus: Creating a Custom Prometheus Exporter in Python and Prometheus: GitHub Exporter – writing a custom exporter for the GitHub API.
CloudWatch Logs, and Enhanced Monitoring
As I mentioned before, it would be very useful to have our own metrics for Swap, or to have the PID of processes and their resident memory.
Transactions monitoring
I have it on my Grafana dashboard now:
But maybe I will add separate alerts – on the number of "active" transactions, or on the "idle in transaction" count.
Basic metrics from PostgreSQL Exporter:
- pg_stat_database_xact_commit and pg_stat_database_xact_rollback: as we saw in our case, if the value drops, then we have problems – queries are not completing
- pg_stat_activity: by the state label, we have two main ones:
  - active: the total number of active queries
  - idle in transaction: we also saw in our case that many queries were hanging, waiting for completion
- pg_locks: the number of locks (see pg_locks and Concurrency Control)
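For a quick ad-hoc check of these states directly on the server, a simple breakdown over pg_stat_activity can be used (my own sketch):

-- sessions grouped by state: "active", "idle", "idle in transaction", ...
SELECT state, count(*) AS sessions
FROM pg_stat_activity
GROUP BY state
ORDER BY sessions DESC;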
WAL monitoring
I also mentioned a few metrics that are available in PostgreSQL Exporter; maybe I'll add either alerts or Grafana graphs for them:
- pg_stat_archiver_archived_count: total number of successfully archived WAL files (which will tell us that WAL archiving is working correctly)
- pg_stat_archiver_failed_count: number of failed attempts to archive WAL files
- pg_stat_bgwriter_checkpoint_time: time spent on executing CHECKPOINT
On the PostgreSQL server itself, we can check this with the pg_stat_wal view:
SELECT * FROM pg_stat_wal;
- wal_records: number of WAL records written (INSERT, UPDATE, DELETE operations)
- wal_bytes: total amount of data (in bytes) written to the WAL
- wal_buffers_full: the number of times the WAL buffers were completely full, forcing backend processes to write WAL data directly to disk
- wal_write: the number of times PostgreSQL wrote the WAL to a file
- wal_write_time: total time in milliseconds spent on WAL writes
- wal_sync_time: total time (in milliseconds) spent on fsync() (guaranteed write to disk)
shared_buffers monitoring
We still need to think about what metrics we could generate and what graphs or alerts we could come up with.
From what comes to mind:
- monitor shared hit and read: how much data was found in the cache and how much had to be read from disk
- buffers_backend: how many buffers were written directly by backend processes
  - in a normal situation, all data from dirty pages should be written by bgwriter or checkpointer
  - if all shared_buffers are busy, and bgwriter, wal_writer, or checkpointer does not have time to flush data from them to disk, then the clients' backend processes are forced to write the data themselves, which slows down the execution of their queries
Check with:
SELECT buffers_backend, buffers_checkpoint, buffers_alloc FROM pg_stat_bgwriter;
Here:
- buffers_backend: how many buffers were written directly by backend processes
- buffers_checkpoint: how many buffers were written during CHECKPOINT
  - if we have a high value here:
    - then checkpoints occur rarely, and each one writes many pages at once
    - or bgwriter does not have time to write, and CHECKPOINT writes everything at once
- buffers_alloc: how many new buffers were allocated in shared_buffers
  - if we have a high value, then the cache is constantly being evicted, and PostgreSQL is forced to load pages from disk
Checkpointer monitoring
It also makes sense to keep an eye on the Checkpointer activity:
SELECT checkpoint_write_time, checkpoint_sync_time FROM pg_stat_bgwriter;
Here:
- checkpoint_write_time: time spent writing changed pages (dirty pages) from shared_buffers to the file system; if the value is large, then:
  - shared_buffers is too large – during a checkpoint, too many pages have to be written at once
  - there are many operations (UPDATE, DELETE), which leads to a large number of "dirty pages"
  - or checkpoint_timeout is too big, so many changes are written at once during the checkpoint
- checkpoint_sync_time: time spent on forced writing (performing fsync()) of changed pages to the physical disk; if the value is large, there may be disk problems – slow data writing
work_mem monitoring
It also makes sense to look here.
If work_mem is not enough, processes start writing temporary files (temp_blks_written), which, firstly, slows down query execution and, secondly, creates an additional load on the disk.
Check with:
SELECT temp_files, temp_bytes FROM pg_stat_database WHERE datname = current_database();
Useful links
- Common DBA tasks for Amazon RDS for PostgreSQL
- Monitor Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL performance using PGSnapper
- OS metrics in Enhanced Monitoring
- Tuning memory parameters
- Tuning memory parameters for Aurora PostgreSQL
- How can I log execution plans of queries for Amazon RDS PostgreSQL or Aurora PostgreSQL to tune query performance?
- RDS for PostgreSQL wait events
- SQL statistics for RDS PostgreSQL
- Viewing OS metrics in the RDS console
- Monitoring Transaction Logs in PostgreSQL
- Top metrics in PostgreSQL monitoring with Prometheus
- RDS for PostgreSQL memory
- pg_buffercache — inspect PostgreSQL buffer cache state
- Determining the optimal value for shared_buffers using the pg_buffercache extension in PostgreSQL
- How Linux uses RAM and swap
- pgPedia – An encyclopedia of things PostgreSQL-related