High-Availability Infrastructure: HAProxy, Patroni, and Redundant Services
High-availability infrastructure is less about buying more servers and more about removing single points of failure deliberately. When I built the hosting platform infrastructure, I had to think through failure domains one layer at a time: edge routing, database failover, connection pooling, service discovery, and operational visibility.
The stack centered on HAProxy for traffic routing, Patroni for PostgreSQL failover, PgBouncer for connection pooling, and keepalived for virtual IP management. Each tool solved a specific problem. Together, they created a system that could survive node loss without turning every incident into a full outage.
The Architecture in Layers
The design separated concerns across clear layers:
- HAProxy handled inbound HTTP and TCP routing.
- keepalived exposed a floating virtual IP for failover at the edge.
- Patroni managed PostgreSQL primary election.
- PgBouncer stabilized database connection pressure.
- etcd stored cluster state for Patroni.
That separation matters because it prevents a single tool from becoming responsible for too much behavior.
HAProxy at the Edge
HAProxy was the first place where resilience became visible. The job was simple: route traffic only to healthy backends and fail fast when a service dropped out.
```haproxy
global
    log /dev/log local0
    maxconn 5000

defaults
    mode http
    timeout connect 5s
    timeout client 30s
    timeout server 30s

frontend https_front
    bind *:443 ssl crt /etc/ssl/private/platform.pem
    default_backend app_servers

backend app_servers
    option httpchk GET /health
    http-check expect status 200
    balance roundrobin
    server app1 10.0.0.11:3000 check
    server app2 10.0.0.12:3000 check
```
The health check endpoint had to be cheap and honest. If the application could not reach its dependencies, the health endpoint should say so. A misleading health check is worse than no health check at all.
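A "cheap and honest" endpoint boils down to aggregating dependency probes into a single status code. Here is a minimal sketch of that aggregation logic (the `health_status` helper and its check names are illustrative, not part of the platform's actual code):

```python
import json

def health_status(checks):
    """Aggregate named dependency probes into an HTTP status and JSON body.

    `checks` maps a dependency name to a zero-argument callable that
    returns True when the dependency is reachable. Any exception from a
    probe counts as a failure rather than crashing the health endpoint.
    """
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False
    healthy = all(results.values())
    status = 200 if healthy else 503
    body = json.dumps({"status": "ok" if healthy else "degraded",
                       "checks": results})
    return status, body

# A node that can reach its database answers 200; one that cannot
# answers 503, so HAProxy's httpchk pulls it from rotation.
status, body = health_status({"database": lambda: True})
```

The key property is that a 503 here is a routing signal, not an error page: HAProxy's `http-check expect status 200` treats anything else as "stop sending traffic."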
PostgreSQL Failover With Patroni
Database failover is where many “high-availability” designs reveal their weak points. It is not enough to have a replica. You need leader election, replication health monitoring, and a clean path for clients to discover the current primary.
Patroni handled that orchestration cleanly.
```yaml
scope: hosting-platform
namespace: /service/
name: db-node-1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.0.21:8008

etcd:
  hosts: 10.0.0.31:2379,10.0.0.32:2379,10.0.0.33:2379

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.0.21:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    replication:
      username: replicator
      password: strong-password
```
The important design choice was not just enabling Patroni. It was ensuring every client path used a stable service endpoint rather than hard-coding the original primary.
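One concrete way to avoid hard-coding the primary is libpq's multi-host connection strings: list every Patroni node and let the driver probe for the writable one via `target_session_attrs=read-write` (available in libpq 10+). A sketch of building such a DSN (the node addresses and helper name are assumptions for illustration):

```python
# Assumed Patroni node addresses; in practice these might come from
# service discovery rather than a static list.
DB_NODES = ["10.0.0.21", "10.0.0.22", "10.0.0.23"]

def primary_dsn(nodes, dbname="platform", user="app"):
    """Build a libpq DSN that lists all cluster nodes.

    target_session_attrs=read-write tells the driver to keep trying
    hosts until it finds one that accepts writes, i.e. the current
    primary, so clients follow failover without config changes.
    """
    hosts = ",".join(nodes)
    ports = ",".join(["5432"] * len(nodes))
    return (f"host={hosts} port={ports} dbname={dbname} user={user} "
            f"target_session_attrs=read-write")
```

Whether clients use this, a DNS name that tracks the Patroni leader, or HAProxy routing on Patroni's REST API, the principle is the same: the primary's identity lives in one discoverable place, not in application config.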
PgBouncer Protected the Database Layer
Application servers can overwhelm PostgreSQL quickly if every request opens a fresh connection. PgBouncer acted as the buffer between bursty app traffic and the database.
```ini
[databases]
platform = host=postgres-primary port=5432 dbname=platform

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 100
```
Transaction pooling gave me the best balance between throughput and compatibility. Session-level behavior is easier to reason about, but transaction pooling keeps resource usage under much tighter control in multi-tenant systems.
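The sizing arithmetic behind those two settings is worth making explicit. A rough sketch (the numbers mirror the config above; the fan-in framing is mine, not a measured result):

```python
MAX_CLIENT_CONN = 1000    # connections PgBouncer will accept from apps
DEFAULT_POOL_SIZE = 100   # server connections per database/user pair

def fan_in(clients, pool_size):
    """Client connections multiplexed onto each PostgreSQL backend."""
    return clients / pool_size

# With transaction pooling, up to 1000 bursty app connections share at
# most 100 real PostgreSQL backends: a 10:1 fan-in. PostgreSQL never
# sees the burst, only the pool.
ratio = fan_in(MAX_CLIENT_CONN, DEFAULT_POOL_SIZE)
```

That 10:1 ratio is exactly what session pooling cannot give you: in session mode, each client holds a server connection for its whole session, so the effective ratio collapses toward 1:1.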
Edge Failover With keepalived
The virtual IP layer removed another single point of failure. If one load balancer died, the floating IP moved to the standby node.
```conf
vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 101
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass supersecret
    }
    virtual_ipaddress {
        10.0.0.10/24
    }
}
```
This is one of those pieces that feels invisible when it works correctly. That is exactly what you want.
Health Checks and Failure Semantics
High availability depends on deciding what “healthy” means. If you only check whether a process is running, you miss partial failures. If you check too much, you can create false positives and flap healthy nodes out of rotation.
I treated health checks in three tiers:
- process-level: is the service responding at all?
- dependency-level: can it reach its database, cache, or queue?
- role-level: is it allowed to serve this kind of traffic?
The role-level check mattered most for the database cluster because only the active primary should accept write traffic.
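The three tiers compose naturally as an ordered short-circuit: run the cheap process check first, then dependencies, then role, so the first failing tier names which layer broke. A minimal sketch of that ordering (the `evaluate` helper and tier names are illustrative):

```python
def evaluate(tiers):
    """Run (name, check) pairs in order; report the first failure.

    Ordering matters: there is no point probing the database if the
    process itself is not responding, and role checks only make sense
    on a node whose dependencies are reachable.
    """
    for name, check in tiers:
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            return f"unhealthy:{name}"
    return "healthy"

# A healthy replica fails only the role-level write check, so it can
# stay in a read pool while being excluded from write traffic.
replica_write_tiers = [
    ("process", lambda: True),
    ("dependency", lambda: True),
    ("role", lambda: False),  # not the Patroni primary
]
```

Wired into HAProxy, a write-traffic backend would call this with the role tier included, while a read-traffic backend would stop after the dependency tier.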
Operational Lessons
The biggest lesson was that high-availability architecture is operational architecture. The configuration snippets are only the visible part. The real work is deciding failover thresholds, rollback behavior, and what the system should do under partial degradation.
I also learned that every HA design needs simple runbooks. During an incident, people do not need a diagram. They need a short set of commands and expectations.
```bash
patronictl list                              # who is primary, who is replicating
curl -sf http://127.0.0.1:8008/health        # is this Patroni node healthy
haproxy -c -f /etc/haproxy/haproxy.cfg       # validate config before reload
systemctl status pgbouncer                   # is the connection pooler up
```
If the on-call engineer can quickly answer “who is primary, who is serving, and where is traffic going,” recovery gets much faster.