DevOps · HAProxy · PostgreSQL · High Availability · Infrastructure

High-Availability Infrastructure: HAProxy, Patroni, and 99.9% Uptime

Umut Korkmaz · 2025-02-01 · 10 min read

When your hosting platform serves 500 to 1,000 paying customers and processes roughly half a million dollars in annual revenue, downtime is not an inconvenience -- it is a direct hit to the bottom line and to trust. During my time building and operating the Makdos hosting platform, I designed and implemented a high-availability infrastructure stack that achieved 99.9 percent uptime. Here is a detailed look at the architecture, the tools, and the hard-won lessons behind it.

Why High Availability Was Non-Negotiable

Makdos was not a side project. It was a commercial hosting platform with real customers running real businesses on our infrastructure. A few minutes of database downtime could mean lost transactions, broken customer dashboards, and support tickets that took days to resolve. We needed an infrastructure that could survive node failures, network partitions, and maintenance windows without any customer-visible impact.

The target was 99.9 percent uptime, which translates to roughly 8.7 hours of allowed downtime per year. That sounds generous until you factor in planned maintenance, unexpected hardware failures, and the occasional misconfigured deployment. Every component in the stack had to be designed with redundancy in mind.

The Load Balancing Layer: HAProxy

HAProxy was our front door. Every HTTP and HTTPS request hit HAProxy before reaching any application server. We ran HAProxy in an active-passive configuration with keepalived managing a virtual IP address (VIP). If the primary HAProxy node went down, keepalived would float the VIP to the standby node within seconds.

The HAProxy configuration was tuned for our specific traffic patterns. We handled around 200 concurrent users during peak hours, which is not massive by global standards but required careful connection management. I configured separate backends for the web application, the API, and the admin panel, each with its own health check endpoints and load balancing algorithms.

For the web and API backends, I used round-robin with server weights. Servers that had been recently deployed got lower weights during a warm-up period, which prevented cold caches from causing latency spikes. Health checks ran every two seconds, and a server was marked as down after three consecutive failures.
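To make that concrete, here is a minimal haproxy.cfg backend along those lines. The server names, addresses, and the /health endpoint are illustrative, not the actual Makdos values.

```
backend api_servers
    balance roundrobin
    option httpchk GET /health
    # check every 2s, mark a server down after 3 failed checks, back up after 2 passes
    default-server inter 2s fall 3 rise 2
    server api1 10.0.0.11:8080 check weight 100
    server api2 10.0.0.12:8080 check weight 100
    server api3 10.0.0.13:8080 check weight 25   # freshly deployed, still warming up
```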

SSL termination happened at the HAProxy layer using Let's Encrypt certificates managed by Certbot with automated renewal. This kept the application servers simple -- they only needed to handle plain HTTP traffic.
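A corresponding frontend sketch, with TLS terminated at HAProxy and requests routed by path; the certificate directory and backend names are again placeholders rather than the real configuration.

```
frontend public_in
    bind *:80
    bind *:443 ssl crt /etc/haproxy/certs/   # Certbot-managed PEM bundles
    http-request redirect scheme https unless { ssl_fc }
    acl is_api   path_beg /api
    acl is_admin path_beg /admin
    use_backend api_servers   if is_api
    use_backend admin_servers if is_admin
    default_backend web_servers
```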

Database High Availability: Patroni and PostgreSQL

The database layer was where things got interesting. PostgreSQL was our primary database, and losing it would mean losing everything. I implemented Patroni to manage PostgreSQL high availability with automatic failover.

Patroni is a template for building a highly available PostgreSQL cluster. It uses a distributed consensus store (we chose etcd) to manage leader election and cluster state. Our setup consisted of three PostgreSQL nodes: one primary and two synchronous standbys.

The etcd cluster ran on three separate nodes, forming its own fault-tolerant consensus group. Patroni watched the health of the PostgreSQL primary and, if it detected a failure, would automatically promote one of the standbys to primary. The promotion process typically completed in under 10 seconds, and because we used synchronous replication, a failover could not lose any committed transactions.
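The shape of a per-node patroni.yml for a setup like this is sketched below. The cluster name, addresses, and credentials are placeholders, and synchronous_node_count assumes a Patroni version recent enough to support it.

```yaml
scope: makdos-pg                 # cluster name (illustrative)
name: pg-node1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.11:8008

etcd3:
  hosts: 10.0.2.21:2379,10.0.2.22:2379,10.0.2.23:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    synchronous_mode: true       # commits wait for a standby, so committed transactions survive failover
    synchronous_node_count: 2    # keep both standbys synchronous

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.11:5432
  authentication:
    replication:
      username: replicator
      password: change-me
```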

Setting up Patroni was not straightforward. The documentation is decent, but the edge cases are numerous. I spent considerable time testing failure scenarios: killing the primary node, simulating network partitions between PostgreSQL and etcd, and testing what happened when etcd itself lost quorum. Each scenario revealed configuration tweaks that improved the cluster's resilience.
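One of the simplest drills, sketched below under assumed hostnames and config paths, was to stop Patroni on the primary and watch the standbys take over.

```bash
# simulate loss of the primary and watch the cluster converge
ssh pg-node1 'sudo systemctl stop patroni'
watch -n1 patronictl -c /etc/patroni/patroni.yml list
# after a standby has been promoted, rejoin the old primary as a replica
ssh pg-node1 'sudo systemctl start patroni'
```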

Connection Pooling: PgBouncer

Between the application servers and the PostgreSQL cluster sat PgBouncer, our connection pooler. PostgreSQL creates a new process for each connection, and with 200 concurrent users plus background workers, we would quickly exhaust the server's resources without pooling.

PgBouncer ran in transaction pooling mode, which meant connections were returned to the pool after each transaction rather than being held for the entire session. This dramatically reduced the number of actual PostgreSQL connections -- we went from needing 300+ connections to operating comfortably with 50.
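The relevant pgbouncer.ini settings look roughly like this; hostnames, ports, and pool sizes are illustrative rather than the production numbers.

```ini
[databases]
; host should point at the current PostgreSQL primary (illustrative address)
makdos = host=10.0.1.11 port=5432 dbname=makdos

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
; release the server connection back to the pool after each transaction
pool_mode = transaction
; client-side connections PgBouncer will accept
max_client_conn = 400
; actual PostgreSQL connections per database/user pair
default_pool_size = 50
```

The trade-off with transaction pooling is that session-scoped features such as session-level SET, advisory locks, and (on older PgBouncer versions) prepared statements do not survive it, so the application has to avoid relying on them.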

One subtlety with PgBouncer and Patroni is handling failover. When Patroni promotes a new primary, PgBouncer needs to start sending traffic to the new node. I wrote a callback script that Patroni executed on role changes, which updated PgBouncer's configuration and triggered a reload. This kept the failover seamless from the application's perspective.
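A sketch of what such a callback could look like is below. Patroni invokes callbacks with the action, the new role, and the cluster name; the PgBouncer hostnames, the sed rewrite, and the reload mechanism here are assumptions for illustration, not the original script.

```bash
#!/usr/bin/env bash
# /usr/local/bin/on_role_change.sh (illustrative path)
# Patroni calls this as: on_role_change.sh <action> <role> <cluster-name>
set -euo pipefail

ACTION="$1"
ROLE="$2"

if [[ "$ACTION" == "on_role_change" && ( "$ROLE" == "master" || "$ROLE" == "primary" ) ]]; then
    NEW_PRIMARY="$(hostname -f)"
    for PB in pgbouncer1 pgbouncer2; do    # hypothetical pooler nodes
        # repoint the [databases] entry at the new primary, then reload PgBouncer
        ssh "$PB" "sudo sed -i 's|^makdos = .*|makdos = host=${NEW_PRIMARY} port=5432 dbname=makdos|' /etc/pgbouncer/pgbouncer.ini"
        ssh "$PB" "sudo systemctl reload pgbouncer"
    done
fi
```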

Distributed Consensus: etcd

etcd was the backbone of our cluster coordination. Beyond serving Patroni's leader election, we used etcd to store configuration data that needed to be consistent across all nodes: feature flags, rate limiting rules, and service discovery information.

The etcd cluster was sized and tuned carefully. We allocated dedicated SSDs for etcd's write-ahead log, which is critical for performance. The heartbeat interval was set to 500ms with an election timeout of 2500ms, which gave us a good balance between failure detection speed and false positive avoidance.
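Expressed as etcd flags (both timing values are in milliseconds), the tuning looked roughly like the sketch below; node names, addresses, and the data directory are placeholders.

```bash
# the data directory lived on a dedicated SSD
etcd --name etcd1 \
     --data-dir /var/lib/etcd \
     --heartbeat-interval 500 \
     --election-timeout 2500 \
     --initial-cluster 'etcd1=http://10.0.2.21:2380,etcd2=http://10.0.2.22:2380,etcd3=http://10.0.2.23:2380'
```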

Monitoring etcd was essential. I set up Prometheus metrics collection and Grafana dashboards that tracked leader elections, proposal failures, and disk fsync latencies. A spike in any of these metrics was an early warning sign of infrastructure issues.

Keepalived and Virtual IPs

Keepalived ran on the HAProxy nodes and on a pair of PgBouncer nodes, managing virtual IP addresses for each service. The VRRP protocol provided sub-second failover for the network layer, which was critical for maintaining the user experience during node failures.

I configured keepalived with custom health check scripts rather than relying on simple port checks. The HAProxy health check script verified that HAProxy was not only running but also had at least one healthy backend server. The PgBouncer health check verified that the pooler could actually reach the PostgreSQL primary.
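For reference, a keepalived.conf along these lines for the HAProxy pair; the VIP, interface, priorities, and the check script path are illustrative.

```
vrrp_script chk_haproxy {
    script "/usr/local/bin/check_haproxy.sh"   # exits non-zero unless HAProxy has a healthy backend
    interval 2
    fall 2
    rise 2
}

vrrp_instance VI_HAPROXY {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 150          # the standby node advertises a lower priority
    advert_int 1
    virtual_ipaddress {
        192.0.2.10/24
    }
    track_script {
        chk_haproxy
    }
}
```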

Putting It All Together

The full request flow looked like this: client traffic hit the keepalived VIP, which pointed to the active HAProxy node. HAProxy terminated SSL, inspected the request, and routed it to the appropriate backend application server. The application server processed the request and, if it needed data, connected to PgBouncer through another keepalived VIP. PgBouncer routed the query to the PostgreSQL primary (or a standby for read-only queries), and the result flowed back through the same path.

Every component in this chain had redundancy. Every failover was automated. And every transition was monitored.

Monitoring and Alerting

Infrastructure without monitoring is just hardware waiting to fail silently. I built a comprehensive monitoring stack using Prometheus for metrics collection, Grafana for visualization, and Alertmanager for notifications.

Key metrics I tracked included HAProxy backend response times, PgBouncer pool utilization, PostgreSQL replication lag, etcd cluster health, and system-level metrics like CPU, memory, and disk I/O. Alerts were configured with severity levels -- a warning for replication lag above 100ms, a critical alert for lag above one second.
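As a sketch, the replication-lag thresholds above map to Alertmanager-routed rules like these; the exact metric name depends on the exporter, so pg_replication_lag_seconds here is an assumption.

```yaml
groups:
  - name: postgresql-replication
    rules:
      - alert: ReplicationLagWarning
        expr: pg_replication_lag_seconds > 0.1   # 100ms; metric name assumed
        for: 2m
        labels:
          severity: warning
      - alert: ReplicationLagCritical
        expr: pg_replication_lag_seconds > 1
        for: 1m
        labels:
          severity: critical
```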

Lessons from the Trenches

The biggest lesson was that high availability is not a feature you add -- it is an architecture you design from the start. Retrofitting HA onto a single-server setup is exponentially harder than building it in from day one.

Testing failover scenarios regularly is non-negotiable. We ran monthly "chaos" exercises where we deliberately killed nodes and measured recovery times. Every exercise revealed something new -- a missing callback, a configuration that had drifted, or a monitoring gap.

Documentation was another critical investment. Every runbook, every recovery procedure, and every architecture decision was documented. When an incident happened at 3 AM, I needed to follow a checklist, not debug from scratch.

Finally, simplicity wins. Every additional component in the HA stack is another potential failure point. I resisted the temptation to add more tools and instead focused on making the existing components rock-solid. The stack I described here -- HAProxy, Patroni, PgBouncer, etcd, keepalived -- is well-understood, battle-tested, and maintainable. That matters more than being cutting-edge.