Both the master and the replica were instances with an 8-core CPU and 32 GB of RAM, and utilization was very low despite millions of read and write operations. With well-behaved queries and stable performance, we decided to replace Replica #1 with a smaller server with a 4-core CPU and 16 GB of RAM. Soon the new Replica #2 was up and serving read queries.
With the new replica in place and serving client requests, we removed the old replica server the next day at 6 PM. Two hours later, every query to both the master and the replica slowed to a crawl. CPU utilization climbed higher than before, and we only noticed it the next morning.
Since then we had analyzed everything from the log files to the Postgres parameters. I started changing parameters here and there, but none of it helped. And of course, I blamed the next person in line, which in my case was the developers. There had been moments when developers wrote unoptimized queries, deployed them to production, and everything went crazy.
We struggled with this case for two days without any clue about what was happening or how to fix it.
This database cluster has the most write activity of all our clusters. Since we want everything to run as smoothly and as fast as possible, we set up cron jobs to run VACUUM ANALYZE on specific tables at night, around 3 AM. Two days into the incident, I was looking at the log file and saw something unusual: the VACUUM ANALYZE job had failed, with this error:
```
Subject: Cron <[email protected]> psql -d db1 -c "VACUUM ANALYZE traffic_daily;"
Date: Sat, 28 Apr 2018 03:09:45 +0800 (WIB)

WARNING:  oldest xmin is far in the past
HINT:  Close open transactions soon to avoid wraparound problems.
WARNING:  oldest xmin is far in the past
HINT:  Close open transactions soon to avoid wraparound problems.
VACUUM
```
So I ran the VACUUM ANALYZE manually and received the same error. Checking the tuple statistics, the numbers showed it had been failing for two days: dead tuples now amounted to almost 30% of the live tuples on the table. And that is when I understood I was screwed.
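The dead-to-live tuple ratio is visible in the statistics views. A minimal sketch of such a check, using the table name from the cron job above:

```sql
-- Dead vs. live tuples, plus when the table was last (auto)vacuumed
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup, 0), 1) AS dead_pct,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'traffic_daily';
```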
The error above occurs when there are old transactions that have never been released (neither committed nor rolled back). But when I checked pg_prepared_xacts, there was nothing there.
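These are the usual places to look for whatever is holding the xmin horizon back. A sketch of the kind of queries involved (not necessarily the exact ones I ran): pg_prepared_xacts covers unfinished two-phase commits, and pg_stat_activity covers ordinary long-lived or idle-in-transaction sessions.

```sql
-- Two-phase-commit transactions that were prepared but never finished
SELECT gid, prepared, owner, database
FROM pg_prepared_xacts;

-- Ordinary backends still pinning an old xmin
SELECT pid, state, xact_start, backend_xmin, query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY xact_start;
```

In my case both came up empty, which is what made the problem so confusing.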
I was stuck, again.
As I stated before, replication slots are used in almost all of our database clusters. Then I remembered that we had removed a replica server two days earlier, so I checked pg_replication_slots, and there it was: an inactive slot previously used by the old replica server. The slot was indeed holding an xmin from about two days ago, and this prevented VACUUM from cleaning anything up, because the server had to retain every row version the slot might still need.
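A query along these lines is what exposed the culprit; an inactive slot with a non-null xmin (or a restart_lsn far behind) is a red flag:

```sql
-- Inactive slots still pinning an xmin or retaining WAL
SELECT slot_name, slot_type, active, xmin, restart_lsn
FROM pg_replication_slots
WHERE NOT active;
```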
I deleted the slot, and once that was done, autovacuum ran as expected. By the time it completed, CPU utilization was back down where it should be and application queries ran as fast as expected. Everyone was happy, and I could go back to sleeping in peace.
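Dropping the stale slot is a single function call. The slot name below is a placeholder; use the slot_name reported by pg_replication_slots, and note that the slot must not be active when you drop it:

```sql
-- 'old_replica_slot' is hypothetical; substitute your actual slot_name
SELECT pg_drop_replication_slot('old_replica_slot');
```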
Lesson learned: NEVER EVER keep an inactive replication slot. Not only will it eat up your storage by retaining WAL, it will also hold back the xmin horizon, preventing VACUUM from cleaning up dead tuples.