DevOps · HAProxy · PostgreSQL · High Availability · Infrastructure

High-Availability Infrastructure: HAProxy, Patroni, and Redundant Services

Umut Korkmaz · 2025-02-01 · 10 min read

High-availability infrastructure is less about buying more servers and more about removing single points of failure deliberately. When I built the hosting platform infrastructure, I had to think through failure domains one layer at a time: edge routing, database failover, connection pooling, service discovery, and operational visibility.

The stack centered on HAProxy for traffic routing, Patroni for PostgreSQL failover, PgBouncer for connection pooling, and keepalived for virtual IP management. Each tool solved a specific problem. Together, they created a system that could survive node loss without turning every incident into a full outage.

The Architecture in Layers

The design separated concerns across clear layers:

  1. HAProxy handled inbound HTTP and TCP routing.
  2. keepalived exposed a floating virtual IP for failover at the edge.
  3. Patroni managed PostgreSQL primary election.
  4. PgBouncer stabilized database connection pressure.
  5. etcd stored cluster state for Patroni.

That separation matters because it prevents a single tool from becoming responsible for too much behavior.

HAProxy at the Edge

HAProxy was the first place where resilience became visible. The job was simple: route traffic only to healthy backends and fail fast when a service dropped out.

haproxy
global
  log /dev/log local0
  maxconn 5000

defaults
  mode http
  timeout connect 5s
  timeout client 30s
  timeout server 30s

frontend https_front
  bind *:443 ssl crt /etc/ssl/private/platform.pem
  default_backend app_servers

backend app_servers
  option httpchk GET /health
  http-check expect status 200
  balance roundrobin
  server app1 10.0.0.11:3000 check
  server app2 10.0.0.12:3000 check

The health check endpoint had to be cheap and honest. If the application could not reach its dependencies, the health endpoint should say so. A misleading health check is worse than no health check at all.
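The shape of such a check can be sketched in Python. This is not the platform's actual application code; the probe names and wiring are hypothetical stand-ins for real dependency pings:

```python
# Sketch of a dependency-aware health check. The probe functions are
# hypothetical placeholders; real ones would ping PostgreSQL, a cache, etc.
def make_health_check(dependency_probes):
    """Build a health function that returns (status_code, body).

    Reports 200 only when every dependency probe succeeds, so HAProxy's
    `http-check expect status 200` pulls the node as soon as something
    it needs is gone.
    """
    def health():
        failing = []
        for name, probe in dependency_probes.items():
            try:
                ok = probe()
            except Exception:
                ok = False  # a crashing probe counts as a failed dependency
            if not ok:
                failing.append(name)
        if failing:
            return 503, {"status": "unhealthy", "failing": failing}
        return 200, {"status": "ok"}
    return health

# Example wiring with stubbed probes:
health = make_health_check({"postgres": lambda: True, "cache": lambda: True})
print(health())  # -> (200, {'status': 'ok'})
```

The key property is that the endpoint stays cheap: each probe should be a fast local check, not a full query, or the health check itself becomes a load source.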

PostgreSQL Failover With Patroni

Database failover is where many “high-availability” designs reveal their weak points. It is not enough to have a replica. You need leader election, replication health monitoring, and a clean path for clients to discover the current primary.

Patroni handled that orchestration cleanly.

yaml
scope: hosting-platform
namespace: /service/
name: db-node-1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.0.21:8008

etcd:
  hosts: 10.0.0.31:2379,10.0.0.32:2379,10.0.0.33:2379

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.0.21:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    replication:
      username: replicator
      password: strong-password

The important design choice was not just enabling Patroni. It was ensuring every client path used a stable service endpoint rather than hard-coding the original primary.
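One common way to provide that stable endpoint (a sketch of the standard pattern, not necessarily this platform's exact config, and the second node address is assumed) is an HAProxy TCP listener that health-checks Patroni's REST API. Patroni answers `GET /primary` with 200 only on the current leader, so HAProxy routes writes to whichever node holds that role:

```haproxy
# Write traffic follows the Patroni leader, not a hard-coded host.
listen postgres_primary
  bind *:5000
  mode tcp
  option httpchk GET /primary
  http-check expect status 200
  default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
  server db1 10.0.0.21:5432 check port 8008
  server db2 10.0.0.22:5432 check port 8008
```

The `on-marked-down shutdown-sessions` directive matters during failover: it closes connections to a demoted primary instead of letting stale write sessions linger.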

PgBouncer Protected the Database Layer

Application servers can overwhelm PostgreSQL quickly if every request opens a fresh connection. PgBouncer acted as the buffer between bursty app traffic and the database.

ini
[databases]
platform = host=postgres-primary port=5432 dbname=platform

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 100

Transaction pooling gave me the best balance between throughput and compatibility. Session-level behavior is easier to reason about, but transaction pooling keeps resource usage under much tighter control in multi-tenant systems.
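The trade-off is worth spelling out: under transaction pooling, a server connection is shared across clients between transactions, so session state such as `SET` parameters, prepared statements, and advisory locks does not persist. Drivers that send extra startup parameters also need an allowance; this fragment (an illustrative addition, not from the original config) shows the common fix for JDBC clients:

```ini
[pgbouncer]
; Let clients that send extra startup options (e.g. JDBC's
; extra_float_digits) connect under transaction pooling.
ignore_startup_parameters = extra_float_digits
```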

Edge Failover With keepalived

The virtual IP layer removed another single point of failure. If one load balancer died, the floating IP moved to the standby node.

conf
vrrp_instance VI_1 {
  state BACKUP
  interface eth0
  virtual_router_id 51
  priority 101
  advert_int 1
  authentication {
    auth_type PASS
    auth_pass supersecret
  }
  virtual_ipaddress {
    10.0.0.10/24
  }
}

This is one of those pieces that feels invisible when it works correctly. That is exactly what you want.
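One refinement worth adding (a sketch, not necessarily the exact config used here): keepalived should also track whether HAProxy itself is alive, so the VIP moves when the load balancer process dies even though the host is still up. A `vrrp_script` block referenced from the instance handles that:

```conf
vrrp_script chk_haproxy {
  script "/usr/bin/pgrep haproxy"   # exits non-zero when haproxy is not running
  interval 2
  fall 2
  rise 2
}

vrrp_instance VI_1 {
  # ...existing instance settings...
  track_script {
    chk_haproxy
  }
}
```

Without the tracked script, keepalived only fails over on host or VRRP-link failure, which leaves a dead HAProxy process holding the VIP.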

Health Checks and Failure Semantics

High availability depends on deciding what “healthy” means. If you only check whether a process is running, you miss partial failures. If you check too much, you can create false positives and flap healthy nodes out of rotation.

I treated health checks in three tiers:

  1. process-level: is the service responding at all?
  2. dependency-level: can it reach its database, cache, or queue?
  3. role-level: is it allowed to serve this kind of traffic?

The role-level check mattered most for the database cluster because only the active primary should accept write traffic.
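The three tiers can be composed so a failure short-circuits at the cheapest level first. This is a Python sketch with hypothetical probe callables, not the platform's actual check code:

```python
# Evaluate health tiers in order, stopping at the first failure so the
# cheapest check runs most often and the expensive ones only when needed.
def run_tiered_checks(tiers):
    """tiers: list of (tier_name, probe) pairs, ordered cheap -> expensive.

    Returns (healthy, failed_tier). A database node would pass the 'role'
    probe only while Patroni reports it as the primary, so replicas never
    appear healthy for write traffic.
    """
    for name, probe in tiers:
        try:
            if not probe():
                return False, name
        except Exception:
            return False, name
    return True, None

# Stubbed example: process and dependencies are fine, but this node is a
# replica, so it must be excluded from the write path.
checks = [
    ("process", lambda: True),
    ("dependency", lambda: True),
    ("role", lambda: False),
]
print(run_tiered_checks(checks))  # -> (False, 'role')
```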

Operational Lessons

The biggest lesson was that high-availability architecture is operational architecture. The configuration snippets are only the visible part. The real work is deciding failover thresholds, rollback behavior, and what the system should do under partial degradation.

I also learned that every HA design needs simple runbooks. During an incident, people do not need a diagram. They need a short set of commands and expectations.

bash
patronictl list
curl -sf http://127.0.0.1:8008/health
haproxy -c -f /etc/haproxy/haproxy.cfg
systemctl status pgbouncer

If the on-call engineer can quickly answer “who is primary, who is serving, and where is traffic going,” recovery gets much faster.