PostgreSQL Crisis Averted: From 600GB Performance Slump to Resilient System
February 28, 2026, 9:38 pm
A colossal PostgreSQL database, struggling with performance, underwent H3 partitioning. This dramatically improved `VACUUM` times and query speed. An initial shadow table migration aimed for zero downtime. However, a critical flaw emerged: `UNLOGGED` tables failed to replicate due to a PostgreSQL bug, forcing an urgent cross-cluster migration. This intense process highlighted the necessity of meticulous replication checks, custom monitoring tools, and understanding subtle database behaviors. From crisis, a resilient, high-performance system arose.
A sprawling PostgreSQL database reached a breaking point. Its 600-gigabyte OpenStreetMap dataset became unwieldy. `VACUUM` operations, crucial for maintenance, stretched over six hours. Query performance degraded. Disk space neared its limits. Action was imperative. Database partitioning offered a solution.
The core issue stemmed from massive OSM data. The `nodes`, `ways`, and `way_nodes` tables contained billions of rows. Updating a single road meant touching hundreds of millions of records. The data kept growing at an accelerating rate. Such scale crippled standard database operations. `VACUUM` downtime became unacceptable.
H3 geospatial indexing was chosen for partitioning. This Uber-developed hexagonal system divides the globe into cells. Resolution 0, the coarsest level, has 122 base cells, which gave 122 global partitions. This balanced manageability with performance gains. Many partitions remained empty, a known design trade-off. This choice ensured future scalability without overwhelming PostgreSQL with thousands of tables. Empty partitions consumed minimal metadata.
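As a rough illustration of a 122-way layout, the sketch below generates list-partition DDL. It assumes, hypothetically, that each row stores its H3 resolution-0 base cell number (0 through 121) in an integer column called `h3_base_cell`. The table name, column names, and partition naming scheme are illustrative and are not taken from the migration itself.

```python
H3_BASE_CELL_COUNT = 122  # number of H3 cells at resolution 0


def partition_ddl(parent: str, key: str = "h3_base_cell") -> list[str]:
    """Return DDL for a list-partitioned parent plus one child per base cell."""
    statements = [
        # Hypothetical minimal column set; a real OSM table carries more columns.
        f"CREATE TABLE {parent} (id bigint NOT NULL, {key} smallint NOT NULL) "
        f"PARTITION BY LIST ({key});"
    ]
    for cell in range(H3_BASE_CELL_COUNT):
        statements.append(
            f"CREATE TABLE {parent}_p{cell:03d} PARTITION OF {parent} "
            f"FOR VALUES IN ({cell});"
        )
    return statements


if __name__ == "__main__":
    for statement in partition_ddl("nodes")[:3]:
        print(statement)
```

Partitions for cells that contain no data simply stay empty, which matches the trade-off described above.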
Initial results were transformative. `VACUUM` duration plummeted from over six hours to just 18 minutes. Disk usage dropped from 70% to 48%. Query speeds accelerated dramatically. `SELECT` operations on the H3 key improved from 2.3 seconds to 340 milliseconds. `JOIN` operations between nodes and ways saw an 8.5-second query drop to 1.2 seconds. Automatic `VACUUM` could finally run.
The migration strategy involved "shadow tables." This technique creates new partitioned tables alongside old ones. Data is copied in batches. An atomic `RENAME` operation then swaps the tables. This promised minimal downtime, estimated at under 100 milliseconds. Batch sizes were crucial. Testing showed that 100K rows per batch offered the best balance of speed and memory use. Larger batches, like 1M rows, exceeded `work_mem` and spilled to disk. Smaller batches added excessive `BEGIN`/`COMMIT` overhead.
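The batching and swap steps can be sketched as follows. This is a minimal outline under stated assumptions: the tables have an integer `id` column to batch on, and the names `nodes` and `nodes_new` are placeholders. It only builds SQL text and does not handle rows written to the old table while the copy is running.

```python
from typing import Iterator

BATCH_SIZE = 100_000  # the batch size reported as the best trade-off


def id_batches(min_id: int, max_id: int, size: int = BATCH_SIZE) -> Iterator[tuple[int, int]]:
    """Yield inclusive (start, end) id ranges that cover [min_id, max_id]."""
    start = min_id
    while start <= max_id:
        end = min(start + size - 1, max_id)
        yield start, end
        start = end + 1


def copy_statement(src: str, dst: str, start: int, end: int) -> str:
    """SQL for one batch; each batch is meant to run in its own transaction."""
    return f"INSERT INTO {dst} SELECT * FROM {src} WHERE id BETWEEN {start} AND {end};"


def swap_statements(live: str, shadow: str) -> list[str]:
    """Rename both tables inside one transaction so readers see an atomic swap."""
    return [
        "BEGIN;",
        f"ALTER TABLE {live} RENAME TO {live}_old;",
        f"ALTER TABLE {shadow} RENAME TO {live};",
        "COMMIT;",
    ]


if __name__ == "__main__":
    for start, end in id_batches(1, 250_000):
        print(copy_statement("nodes", "nodes_new", start, end))
    print("\n".join(swap_statements("nodes", "nodes_new")))
```

Keeping the old table under an `_old` name rather than dropping it leaves a rollback path until the new tables have been verified.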
Pre-migration planning uncovered critical details. Three unused indexes were identified and removed. These indexes, unused for over 30 days, freed up approximately 200 gigabytes of disk space. This step was vital for accommodating the shadow tables and temporary space needed for index creation. `pg_stat_user_indexes` provided usage insights.
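A check along these lines can be sketched with a query on `pg_stat_user_indexes`. The query text uses only standard columns of that view. The helper function, its name, and the row format (dictionaries keyed by column name) are assumptions for illustration. Note that `idx_scan` counts scans since the statistics were last reset, so the observation window has to be confirmed separately.

```python
# Indexes with zero recorded scans, largest first.
UNUSED_INDEX_SQL = """
SELECT schemaname, relname, indexrelname, idx_scan,
       pg_relation_size(indexrelid) AS size_bytes
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY size_bytes DESC;
"""


def drop_statements(rows: list[dict], min_bytes: int = 0) -> list[str]:
    """Build DROP INDEX statements for never-scanned indexes above a size floor."""
    return [
        f'DROP INDEX CONCURRENTLY "{row["schemaname"]}"."{row["indexrelname"]}";'
        for row in rows
        if row["idx_scan"] == 0 and row["size_bytes"] >= min_bytes
    ]


if __name__ == "__main__":
    sample = [
        {"schemaname": "public", "indexrelname": "idx_old", "idx_scan": 0, "size_bytes": 5_000},
        {"schemaname": "public", "indexrelname": "idx_hot", "idx_scan": 42, "size_bytes": 9_000},
    ]
    print(drop_statements(sample))
```

Indexes that back primary key or unique constraints can show zero scans and still be required, so the generated statements are candidates for review rather than something to run blindly.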
Standard monitoring tools proved inadequate. Grafana dashboards showed metrics like disk space and query latency. But they lacked operational control. Seeing a deadlock in Grafana meant manual intervention. SSH access, `psql` commands, PID identification, and termination were required. This process was slow and error-prone, especially during off-hours.
A custom tool, `partition-skeleton`, became essential. This web dashboard offered real-time monitoring and active migration control. It allowed pausing, resuming, or rolling back migrations directly from a user interface. It provided storage analysis, active transaction monitoring, and kill switches for problematic PIDs. Automated alerts, integrated with Telegram, paused the migration if free disk space fell below 10%. This dramatically reduced response times. An early morning deadlock was resolved in seconds, not minutes.
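The disk-space guard is the simplest of these controls to illustrate. The sketch below is not the `partition-skeleton` code; it only shows one way such a check could work, using the 10% threshold from the text. The function names are hypothetical, and the Telegram notification is left out.

```python
import shutil

MIN_FREE_FRACTION = 0.10  # pause threshold described in the article


def free_fraction(path: str) -> float:
    """Fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def should_pause(free: float, threshold: float = MIN_FREE_FRACTION) -> bool:
    """True when the free fraction has dropped below the threshold."""
    return free < threshold


if __name__ == "__main__":
    current = free_fraction(".")
    state = "pause" if should_pause(current) else "continue"
    print(f"free={current:.1%} -> {state}")
```

A migration loop would call a check like this between batches, so a pause takes effect at a batch boundary rather than in the middle of a transaction.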
The "successful" shadow table migration introduced an unforeseen crisis. One week later, a routine check revealed empty database replicas. The partitioned tables were not replicating. `pg_class.relpersistence` confirmed the issue: tables were `UNLOGGED`.
During migration, tables were created as `UNLOGGED` for faster `INSERT` operations. This optimized the initial data copy. The plan included an `ALTER TABLE ... SET LOGGED` command post-migration. PostgreSQL reported success, returning the usual `ALTER TABLE` command tag. However, this command, in PostgreSQL versions prior to 17, did not propagate to partitions. The parent table appeared `LOGGED`, but its partitions remained `UNLOGGED`. No warnings or errors were issued. Data was not written to the Write-Ahead Log (WAL). Replicas remained empty.
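A per-partition persistence check would have caught this. The sketch below pairs a catalog query on `pg_inherits` and `pg_class` with a small helper that flags unlogged children. In `pg_class.relpersistence`, `p` means permanent (logged), `u` unlogged, and `t` temporary. The helper names and the `(name, persistence)` row format are assumptions for illustration, and the query covers only direct children of the parent.

```python
# %s is a driver-side placeholder for the parent table name.
PARTITION_PERSISTENCE_SQL = """
SELECT c.relname, c.relpersistence
FROM pg_inherits AS i
JOIN pg_class AS c ON c.oid = i.inhrelid
WHERE i.inhparent = %s::regclass
ORDER BY c.relname;
"""


def unlogged_partitions(rows: list[tuple[str, str]]) -> list[str]:
    """Return names of partitions whose relpersistence is 'u' (unlogged)."""
    return [name for name, persistence in rows if persistence == "u"]


def set_logged_statements(partitions: list[str]) -> list[str]:
    """One explicit ALTER per partition instead of relying on the parent."""
    return [f"ALTER TABLE {name} SET LOGGED;" for name in partitions]


if __name__ == "__main__":
    sample = [("nodes_p000", "u"), ("nodes_p001", "p"), ("nodes_p002", "u")]
    print(set_logged_statements(unlogged_partitions(sample)))
```

`SET LOGGED` rewrites each table and writes its contents to the WAL, so running it across large partitions needs spare disk space and time.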
The `UNLOGGED` bug necessitated another migration. Fixing the tables "in place" was not feasible. The dataset totaled 850GB (600GB data, 250GB indexes). An in-place copy would require double this storage, which was unavailable. Throttled copies would stretch to over 100 hours, unacceptable for a production system.
A cross-cluster migration became the only viable path. A new PostgreSQL 17 cluster was provisioned. Data would transfer from the existing PostgreSQL 14 production cluster. This approach isolated the copy operations from the live system. It allowed parallel processing across partitions.
Generic tools failed. `pg_dump` and `pg_restore` timed out on large partitions due to HAProxy limitations. `pgcopydb` lacked support for extension types such as PostGIS `geometry` and `hstore`. It also struggled with partitioned tables and required identical schemas.
A custom Go utility, `pg-cross-cluster-migrator`, was developed. This tool featured a dual-pool design for source and target clusters. It performed partition-aware copies. Robust data verification included row counts, schema differences, min/max IDs, and MD5 hashing. It supported automatic DDL mapping between differing schemas.
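The verification idea can be outlined in a few lines. The actual tool was written in Go and its internals are not shown here, so the Python sketch below is only an assumed shape: it summarizes a stream of rows as a row count, minimum and maximum ID, and an MD5 digest, and the two clusters are compared by comparing summaries. It assumes the first column of each row is an integer ID and that both sides read rows in the same order.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Fingerprint:
    row_count: int
    min_id: Optional[int]
    max_id: Optional[int]
    md5: str


def fingerprint(rows: Iterable[tuple]) -> Fingerprint:
    """Summarize rows; row[0] is assumed to be the integer id."""
    digest = hashlib.md5()
    count = 0
    min_id: Optional[int] = None
    max_id: Optional[int] = None
    for row in rows:
        row_id = row[0]
        min_id = row_id if min_id is None else min(min_id, row_id)
        max_id = row_id if max_id is None else max(max_id, row_id)
        # repr() gives a stable text form for simple Python values.
        digest.update(repr(row).encode("utf-8"))
        digest.update(b"\n")
        count += 1
    return Fingerprint(count, min_id, max_id, digest.hexdigest())


if __name__ == "__main__":
    source = [(1, "a"), (2, "b")]
    target = [(1, "a"), (2, "b")]
    print(fingerprint(source) == fingerprint(target))
```

Because the digest depends on row order, both queries need the same `ORDER BY`. The count and ID range are cheap to compare first, and the hash catches differences they miss.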
The cross-cluster migration completed in approximately 20 hours. Four parallel workers processed batches of 10,000 rows. The switch-over used atomic HAProxy redirection, achieving zero downtime. All data was verified. Replication functioned correctly. All tables were now properly `LOGGED`.
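The worker layout described above, four workers taking whole partitions and copying them in 10,000-row batches, can be sketched with a thread pool. This is an illustrative Python outline rather than the Go tool's code. The `copy_batch` callback is a hypothetical stand-in for the code that actually moves rows between clusters, and the offsets are used only to keep the example small.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

WORKERS = 4          # parallel workers reported in the article
BATCH_ROWS = 10_000  # rows per batch reported in the article

CopyBatch = Callable[[str, int, int], None]  # (partition, offset, row_count)


def copy_partition(name: str, row_count: int, copy_batch: CopyBatch) -> int:
    """Copy one partition in fixed-size batches; return the rows handled."""
    copied = 0
    while copied < row_count:
        size = min(BATCH_ROWS, row_count - copied)
        copy_batch(name, copied, size)
        copied += size
    return copied


def migrate(partitions: dict[str, int], copy_batch: CopyBatch) -> dict[str, int]:
    """Run partition copies on a pool of workers and collect per-partition totals."""
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        futures = {
            name: pool.submit(copy_partition, name, rows, copy_batch)
            for name, rows in partitions.items()
        }
        # result() re-raises any exception raised inside a worker.
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    log: list[tuple[str, int, int]] = []
    totals = migrate({"nodes_p000": 25_000, "nodes_p001": 0}, lambda n, o, c: log.append((n, o, c)))
    print(totals, len(log))
```

Assigning whole partitions to workers keeps each worker on its own table, and a real copy would more likely page by key range than by offset.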
Several critical lessons emerged. Never trust implicit conversions; always verify `ALTER TABLE ... SET LOGGED` results. Always check database replicas immediately after a major migration. `UNLOGGED` tables combined with partitioning pose a significant risk in older PostgreSQL versions. Cross-cluster migrations are often superior to in-place methods for large datasets. Custom tools are essential for unique, complex database challenges, especially with specialized data types like PostGIS.
Partitioning success hinged on meticulous planning and robust operational control. Disk space planning should budget roughly twice the dataset size. `enable_partitionwise_join` is off by default and must be explicitly enabled for performance benefits. Automated alerts are non-negotiable for critical, long-running processes. The custom `partition-skeleton` tool filled a crucial gap in operational monitoring.
This journey transformed a struggling database into a high-performance, resilient system. It underscored the importance of deep database understanding and proactive engineering in the face of complex infrastructure challenges.

