High Availability

Phoenix High Availability (HA) lets a JDBC client transparently target a pair of HBase clusters that mirror the same Phoenix schema, so an operator-driven or fault-driven failover never requires the application to restart, reconnect, or rewrite URLs.

Phoenix 5.3.1 adds graceful failover — an intermediate ACTIVE_TO_STANDBY role that lets writes drain server-side before the peer is promoted — plus support for HBase's MASTER and RPC connection registries (PHOENIX-7493, PHOENIX-7495, PHOENIX-7586).

Concepts

An HA group is a named tuple of two HBase clusters and an HA policy, shared by every client that participates. The current role of each cluster lives in a JSON record in ZooKeeper, replicated to both clusters' ZK ensembles and watched by the client; role changes are picked up automatically.

Cluster roles

Role	Clients can connect?	Meaning
`ACTIVE`	yes	Cluster is serving live reads and writes.
`STANDBY`	yes	Cluster is reachable but not the current primary. `FAILOVER` clients refuse to bind; `PARALLEL` clients still bind.
`ACTIVE_TO_STANDBY`	yes	Transitional state during graceful failover. `FAILOVER` connections are closed; `PARALLEL` clients still bind. Writes may be rejected (see Graceful failover below).
`OFFLINE`	no	Cluster is intentionally taken out of rotation.
`UNKNOWN`	no	Role has not been initialized or the record could not be read.
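
The role table can be read as a small decision function. The sketch below is one hedged way to encode it in Java. `ClusterRoleSketch`, `HaPolicySketch`, and `acceptsBinding` are hypothetical names chosen for illustration and are not Phoenix's own classes. It assumes a FAILOVER client binds only to an ACTIVE cluster, which is how the rows above read.

```java
/** Hypothetical policy enum for this sketch; mirrors the two HA policies. */
enum HaPolicySketch { FAILOVER, PARALLEL }

/** Hypothetical model of the role table; not Phoenix's own class. */
enum ClusterRoleSketch {
    ACTIVE(true),
    STANDBY(true),
    ACTIVE_TO_STANDBY(true),
    OFFLINE(false),
    UNKNOWN(false);

    private final boolean connectable;

    ClusterRoleSketch(boolean connectable) {
        this.connectable = connectable;
    }

    /** The "Clients can connect?" column. */
    boolean clientsCanConnect() {
        return connectable;
    }

    /** Whether a client using the given policy binds to a cluster in this role. */
    boolean acceptsBinding(HaPolicySketch policy) {
        if (!connectable) {
            return false;
        }
        // Assumption: FAILOVER binds only to ACTIVE; PARALLEL binds to any connectable role.
        return policy == HaPolicySketch.PARALLEL || this == ACTIVE;
    }
}
```

For example, `ClusterRoleSketch.STANDBY.acceptsBinding(HaPolicySketch.FAILOVER)` is false, while the same call with `PARALLEL` is true.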

HA policies

The policy is part of the HA-group record. Clients do not pick it — operators do, when provisioning the group.

FAILOVER — exactly one cluster (the ACTIVE) serves the connection at any moment. The client transparently re-binds on role change.
PARALLEL — every statement is issued to both clusters in parallel, with the faster result returned. Useful when both clusters carry identical data and you want to mask single-cluster tail latency.

Failover sub-policies (FAILOVER only)

This setting controls how a FAILOVER connection reacts when its bound cluster transitions away from ACTIVE:

`phoenix.ha.failover.policy`	Behavior
`explicit` (default)	Subsequent operations throw `FailoverSQLException`; the application calls `failover()` to rebind to the new `ACTIVE`.
`active`	Connection transparently rebinds to the new `ACTIVE` on the next statement, up to `phoenix.ha.failover.count` attempts (default `3`).
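
Under the explicit sub-policy the retry loop belongs to the application. A minimal sketch of such a loop follows, written against stand-in types so that it compiles without Phoenix on the classpath. `BoundClusterNotActiveException` stands in for FailoverSQLException, the `rebind` callback stands in for the call to failover(), and `maxRebinds` is an application-chosen limit rather than a Phoenix setting. All three names are assumptions made for illustration.

```java
import java.util.function.Supplier;

/** Stand-in for FailoverSQLException so the sketch has no Phoenix dependency. */
class BoundClusterNotActiveException extends RuntimeException {
    BoundClusterNotActiveException(String message) {
        super(message);
    }
}

final class ExplicitFailoverSketch {
    /**
     * Runs the statement. When it reports that the bound cluster is no longer
     * ACTIVE, invokes rebind and retries, at most maxRebinds times.
     */
    static <T> T runWithRebind(Supplier<T> statement, Runnable rebind, int maxRebinds) {
        int rebinds = 0;
        while (true) {
            try {
                return statement.get();
            } catch (BoundClusterNotActiveException e) {
                if (rebinds >= maxRebinds) {
                    throw e; // Give up and surface the failure to the caller.
                }
                rebinds++;
                rebind.run(); // In a real client this is where failover() would be called.
            }
        }
    }
}
```

Bounding the number of rebinds keeps a client from looping indefinitely while no cluster is ACTIVE, which is the same idea the `active` sub-policy expresses through `phoenix.ha.failover.count`.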

JDBC URL

The HA URL is a bracketed pair of per-cluster endpoints separated by |, optionally followed by a principal:

jdbc:phoenix+zk:[zk1\:2181::/hbase|zk2\:2181::/hbase]:my_principal

The presence of | inside the URL is what triggers the HA code path. The two URLs inside the brackets are always ZooKeeper quorums — Phoenix uses them to read the HA-group record.

The per-cluster connection Phoenix opens underneath may use any of the supported HBase registries (ZK, MASTER, or RPC), based on what the operator configured in the HA-group record. RPC requires HBase 2.5+.
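
To avoid hand-escaping mistakes, the URL can be assembled programmatically. The helper below is a hedged sketch that reproduces the documented example shape and nothing more. `HaUrlSketch`, `endpoint`, and `haUrl` are hypothetical names, and the empty field between the two colons is copied from the example above rather than derived from a URL grammar.

```java
final class HaUrlSketch {
    /** Builds one bracketed endpoint, escaping ':' inside the quorum as in the documented URL. */
    static String endpoint(String zkQuorum, String znode) {
        return zkQuorum.replace(":", "\\:") + "::" + znode;
    }

    /** Joins two endpoints with '|' and appends the principal when one is given. */
    static String haUrl(String quorum1, String quorum2, String znode, String principal) {
        String url = "jdbc:phoenix+zk:["
                + endpoint(quorum1, znode) + "|" + endpoint(quorum2, znode) + "]";
        if (principal == null || principal.isEmpty()) {
            return url;
        }
        return url + ":" + principal;
    }
}
```

Calling `HaUrlSketch.haUrl("zk1:2181", "zk2:2181", "/hbase", "my_principal")` yields the URL shown above.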

Connecting

Set the HA group name as a JDBC property and open a connection like any other:

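import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

// The HA group name must match the operator-provisioned record.
Properties props = new Properties();
props.setProperty("phoenix.ha.group.name", "myGroup");
try (Connection conn = DriverManager.getConnection(
        "jdbc:phoenix+zk:[zk1\\:2181::/hbase|zk2\\:2181::/hbase]", props)) {
    // Normal JDBC usage.
}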

The returned Connection honors the policy declared in the HA-group record.

Graceful failover

Graceful failover is a two-step demotion of the ACTIVE cluster:

1. ACTIVE → ACTIVE_TO_STANDBY. The operator flips the source cluster into ACTIVE_TO_STANDBY. FAILOVER clients' wrapped connections to the demoting cluster are closed (subsequent statements raise FailoverSQLException, or rebind on the next statement under the active sub-policy). PARALLEL clients continue to operate against both clusters. On the server side, with phoenix.cluster.role.based.mutation.block.enabled=true, new mutations on the demoting cluster are rejected with MutationBlockedIOException so replication to the peer can drain. No cluster is ACTIVE during this step, so new FAILOVER connections cannot be opened until step 2.
2. ACTIVE_TO_STANDBY → STANDBY (peer promoted to ACTIVE). Once replication has caught up, the operator demotes the source the rest of the way and promotes the peer. New FAILOVER connections (and active sub-policy retries pending from step 1) now bind to the new ACTIVE; explicit clients call failover() themselves.

Rolling back is supported: an ACTIVE → ACTIVE_TO_STANDBY → ACTIVE sequence restores the source to ACTIVE without further role transitions. PARALLEL clients remain operational throughout; FAILOVER clients reopen connections to the restored ACTIVE once it returns.
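
The demotion path of the source cluster, including rollback, can be summarized as a small transition table. The sketch below is illustrative only: `DemotionRole`, `GracefulDemotionSketch`, and its methods are hypothetical names, and it encodes just the graceful path described above. The direct ACTIVE → STANDBY move used in an unplanned failover is deliberately left out.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Roles the source cluster passes through during a graceful demotion. */
enum DemotionRole { ACTIVE, ACTIVE_TO_STANDBY, STANDBY }

final class GracefulDemotionSketch {
    private static final Map<DemotionRole, Set<DemotionRole>> NEXT =
            new EnumMap<>(DemotionRole.class);

    static {
        // Step 1 of the graceful path.
        NEXT.put(DemotionRole.ACTIVE, EnumSet.of(DemotionRole.ACTIVE_TO_STANDBY));
        // Step 2, or a rollback to ACTIVE.
        NEXT.put(DemotionRole.ACTIVE_TO_STANDBY,
                EnumSet.of(DemotionRole.STANDBY, DemotionRole.ACTIVE));
        // The demotion ends at STANDBY.
        NEXT.put(DemotionRole.STANDBY, EnumSet.noneOf(DemotionRole.class));
    }

    /** True when moving the source cluster from one role to the other is a graceful step. */
    static boolean isGracefulStep(DemotionRole from, DemotionRole to) {
        return NEXT.get(from).contains(to);
    }

    /** Mutations are rejected only in ACTIVE_TO_STANDBY, and only when blocking is enabled. */
    static boolean mutationsRejected(DemotionRole role, boolean mutationBlockEnabled) {
        return mutationBlockEnabled && role == DemotionRole.ACTIVE_TO_STANDBY;
    }
}
```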

Configuration

The HA-related keys are listed below. The client-side keys can be set in hbase-site.xml on the client, as JDBC connection properties, or both; the write-blocking key is set on the server side.

Required

Key	Notes
`phoenix.ha.group.name`	Name of the HA group — must match the operator-provisioned record.

ZooKeeper tuning (client side)

Key	Default
`phoenix.ha.zk.connection.timeout.ms`	`4000`
`phoenix.ha.zk.session.timeout.ms`	`4000`
`phoenix.ha.zk.retry.base.sleep.ms`	`1000`
`phoenix.ha.zk.retry.max`	`5`
`phoenix.ha.zk.retry.max.sleep.ms`	`10000`

Fallback to a single cluster

Key	Default	Notes
`phoenix.ha.fallback.enabled`	`true`	If the HA record cannot be read from either ZK, fall back to a single-cluster connection.
`phoenix.ha.fallback.cluster`	(empty)	JDBC URL of the fallback cluster.

Failover behavior

Key	Default	Notes
`phoenix.ha.transition.timeout.ms`	`300000` (5 min)	Time the client gets to close connections during a role transition.
`phoenix.ha.failover.policy`	`explicit`	Set to `active` to auto-rebind.
`phoenix.ha.failover.count`	`3`	Maximum auto-rebind attempts for the `active` sub-policy.
`phoenix.ha.failover.timeout.ms`	`10000`	Wait timeout for a single failover operation.
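
As an illustration of passing these keys as JDBC connection properties, the sketch below builds a Properties object that opts into the `active` sub-policy and restates the documented defaults explicitly. `HaClientPropsSketch` and `activeFailoverProps` are hypothetical names. The keys and values are the ones listed in this section.

```java
import java.util.Properties;

final class HaClientPropsSketch {
    /** Client properties for an HA group that should auto-rebind on role change. */
    static Properties activeFailoverProps(String haGroupName) {
        Properties props = new Properties();
        // Required: must match the operator-provisioned HA-group record.
        props.setProperty("phoenix.ha.group.name", haGroupName);
        // Opt in to transparent rebinding instead of the default "explicit".
        props.setProperty("phoenix.ha.failover.policy", "active");
        // Documented defaults, restated so the effective values are visible in one place.
        props.setProperty("phoenix.ha.failover.count", "3");
        props.setProperty("phoenix.ha.failover.timeout.ms", "10000");
        return props;
    }
}
```

The resulting object can be passed as the second argument to `DriverManager.getConnection` in the same way as in the Connecting section.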

Server-side write blocking

Key	Default	Notes
`phoenix.cluster.role.based.mutation.block.enabled`	`false`	Set to `true` in the source cluster's `hbase-site.xml` to reject writes while the cluster is in `ACTIVE_TO_STANDBY`. Readers are unaffected.

Operator workflow

A canonical graceful failover from cluster A → cluster B:

1. Confirm phoenix.cluster.role.based.mutation.block.enabled=true is set on cluster A and that replication A → B is healthy.
2. Set A → ACTIVE_TO_STANDBY, B remains STANDBY. Writes to A start being rejected. FAILOVER connections to A are closed; PARALLEL clients continue to operate against both clusters.
3. Wait for A → B replication lag to reach zero.
4. Promote: A → STANDBY, B → ACTIVE. active clients transparently rebind; explicit clients receive FailoverSQLException and call failover() themselves.

For an unplanned failover, skip steps 2 and 3 and go directly to step 4. The ACTIVE_TO_STANDBY role is a graceful-failover convenience, not a correctness requirement.
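
The graceful workflow can also be expressed as a short orchestration routine. The sketch below is a hedged illustration of the ordering only. `RoleAdmin` is a hypothetical hook for writing both clusters' roles, `replicationLag` is a hypothetical probe supplied by the caller, and rolling back when the lag does not drain within `maxPolls` is an assumed design choice, not documented Phoenix behavior. Step 1, the precondition check, is left to the operator.

```java
import java.util.function.LongSupplier;

final class FailoverRunbookSketch {
    /** Hypothetical hook that writes the roles of cluster A and cluster B. */
    interface RoleAdmin {
        void setRoles(String roleOfA, String roleOfB);
    }

    static void gracefulFailover(RoleAdmin admin, LongSupplier replicationLag, int maxPolls) {
        // Step 2: demote A so its writes can drain; B stays STANDBY.
        admin.setRoles("ACTIVE_TO_STANDBY", "STANDBY");

        // Step 3: poll until A -> B replication lag reaches zero.
        // A real runbook would sleep between polls; omitted to keep the sketch short.
        int polls = 0;
        while (replicationLag.getAsLong() > 0) {
            polls++;
            if (polls >= maxPolls) {
                // Assumed choice: roll back rather than promote a lagging peer.
                admin.setRoles("ACTIVE", "STANDBY");
                throw new IllegalStateException(
                        "replication did not drain; rolled back A to ACTIVE");
            }
        }

        // Step 4: finish the demotion of A and promote B.
        admin.setRoles("STANDBY", "ACTIVE");
    }
}
```

Keeping the role writes behind an interface lets the ordering be exercised against an in-memory fake before it is pointed at real clusters.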