# Clustering
Grey's clustering feature enables distributed health probing across multiple nodes, providing scalability, redundancy, and the ability to probe from different network locations while maintaining centralized monitoring through the web UI. Due to the way we replicate probe information, it is possible to run probes on headless Grey nodes and consume their results via the web UI on a single central instance (which may be configured with some, or none, of the same probes).
## Overview
Grey clustering uses a gossip protocol for peer discovery and coordination, with all communication encrypted using AES-256-GCM. When clustering is enabled, probe results are automatically synchronized across the cluster, ensuring that all nodes eventually converge on a common view of your platform health.
Internally, we keep track of aggregate probe health using [CRDTs], which enables us to provide highly available health information while delivering eventual consistency.

[CRDTs]: https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type
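Grey's internal CRDT types are not exposed, but the convergence property is easy to illustrate. The sketch below (illustrative only, not Grey's actual data structure) uses a last-writer-wins register per probe: because the merge is commutative, associative, and idempotent, replicas agree on the final state no matter what order gossip messages arrive in.

```python
import time

class LWWProbeState:
    """Last-writer-wins register: a simple state-based CRDT.

    This is a sketch of the convergence property, not Grey's
    internal representation of probe health.
    """

    def __init__(self, healthy, timestamp=None):
        self.healthy = healthy
        self.timestamp = timestamp if timestamp is not None else time.time()

    def merge(self, other):
        # Merge is commutative, associative, and idempotent, so any
        # two replicas that have seen the same updates agree.
        return self if self.timestamp >= other.timestamp else other

# Two nodes observe the same probe at different times...
a = LWWProbeState(healthy=True, timestamp=100)
b = LWWProbeState(healthy=False, timestamp=105)

# ...and converge on the newer observation in either merge order.
assert a.merge(b) is b.merge(a)
```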
## Quick Start
### 1. Generate a Shared Cluster Secret
All cluster members must share the same encryption key. You can generate the key using one of the following methods, or simply run Grey with clustering enabled and an invalid key to have one generated for you.
```sh
# Using OpenSSL
openssl rand -base64 32

# Using Python
python3 -c "import secrets, base64; print(base64.b64encode(secrets.token_bytes(32)).decode())"

# Using Node.js
node -e "console.log(require('crypto').randomBytes(32).toString('base64'))"

# Example output: /pL7XKDj1UrAGjNMv3t9jmb9leDOZT+64KkYE8k7UH8=
```
### 2. Configure Grey
Cluster configuration is the same for all Grey instances; however, by convention we enable the `ui` only on primary nodes, while workers have it disabled. For reliable operation, we recommend providing at least two initial peers; these will be used to join the cluster, after which further peers will be discovered automatically.
```yaml
state: ./state.redb

ui:
  enabled: true
  listen: 0.0.0.0:3000
  title: "Grey Cluster - Primary"

cluster:
  enabled: true
  listen: 0.0.0.0:8888
  peers:
    - 10.0.0.2:8888
    - 10.0.0.3:8888
  secrets:
    - /pL7XKDj1UrAGjNMv3t9jmb9leDOZT+64KkYE8k7UH8=
```
## Configuration Reference
### Basic Options
#### `enabled`
Enable or disable clustering for this node.
```yaml
cluster:
  enabled: true
```
#### `listen`
The local address and port to bind for cluster communication over UDP.
```yaml
cluster:
  listen: 0.0.0.0:8888 # Default
```
#### `peers`
Initial peer addresses for cluster discovery. These should be `address:port` pairs that are reachable from the current node (over a private network, a VPN, or the public internet).
```yaml
cluster:
  peers:
    - 10.0.0.2:8888
    - 10.0.0.3:8888
```
#### `secrets`
Base64-encoded 32-byte encryption keys used by the cluster to encrypt gossip messages.
```yaml
cluster:
  secrets:
    - /pL7XKDj1UrAGjNMv3t9jmb9leDOZT+64KkYE8k7UH8=
```
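Each entry must decode to exactly 32 bytes (an AES-256 key). If you want to sanity-check a key before deploying it, something like the following works (the `validate_secret` helper is illustrative, not part of Grey):

```python
import base64

def validate_secret(secret: str) -> bytes:
    """Decode a configured cluster secret and check it is a valid
    AES-256 key (exactly 32 bytes once base64-decoded)."""
    key = base64.b64decode(secret, validate=True)
    if len(key) != 32:
        raise ValueError(f"expected 32 bytes, got {len(key)}")
    return key

validate_secret("/pL7XKDj1UrAGjNMv3t9jmb9leDOZT+64KkYE8k7UH8=")  # OK
```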
### Key Rotation
Multiple keys may be provided, in which case the second key in the list will be used for encryption, while all keys will be accepted for decryption. This allows for key rotation in your cluster without downtime.
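Grey's gossip wire format is internal, but the key-selection rule can be sketched with the `cryptography` package's AES-256-GCM primitive: encrypt with the designated key, then try each configured key on receipt. This illustrates the rule above; it is not Grey's actual implementation.

```python
import os
import base64
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.exceptions import InvalidTag

def encrypt(secrets, plaintext):
    # Per the rule above: with multiple keys, the second entry encrypts.
    key = base64.b64decode(secrets[1] if len(secrets) > 1 else secrets[0])
    nonce = os.urandom(12)  # 96-bit nonce, the standard AES-GCM size
    return nonce + AESGCM(key).encrypt(nonce, plaintext, None)

def decrypt(secrets, message):
    # Any configured key is accepted for decryption.
    nonce, ciphertext = message[:12], message[12:]
    for secret in secrets:
        try:
            return AESGCM(base64.b64decode(secret)).decrypt(nonce, ciphertext, None)
        except InvalidTag:
            continue
    raise ValueError("no configured key could decrypt this message")
```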
> **Warning**
>
> NIST SP 800-38D recommends rotating encryption keys prior to 2^32 uses of a given key. Depending on the size of your cluster and your configured `gossip_interval` and `gossip_factor`, this may require you to rotate keys relatively frequently. We recommend you implement a key rotation policy using an automated secret management tool such as HashiCorp Vault or AWS KMS to facilitate this.
You can calculate the approximate amount of time before you need to rotate keys using the following code:
```python
cluster_size = 100   # Number of nodes in your cluster
gossip_interval = 5  # gossip_interval in seconds
gossip_factor = 10   # gossip_factor

messages_per_second = 3 * (gossip_factor / gossip_interval) * cluster_size
seconds_until_rotation = (2**32) / messages_per_second
days_until_rotation = seconds_until_rotation / 86400

print(f"Approximate days until key rotation is required: {days_until_rotation:.2f}")
```
In its default configuration, a 5-node cluster will need to rotate keys approximately every 135 years (so you've got plenty of time), but in a cluster with 100 nodes, a 5-second gossip interval, and a gossip factor of 10, you'll need to rotate keys approximately every 80 days.
To rotate keys without downtime, you must provide three keys in total:
```yaml
cluster:
  secrets:
    - ${key3} # New key, used for decryption only
    - ${key2} # Current key, used for both encryption and decryption
    - ${key1} # Old key, used for decryption only
```
When you are ready to rotate to a new encryption key, add the new key to the top of the list and remove the oldest key from the bottom of the list.
```yaml
cluster:
  secrets:
    - ${key4} # New key, used for decryption only
    - ${key3} # Current key, used for both encryption and decryption
    - ${key2} # Old key, used for decryption only
```
This approach ensures that every cluster node maintains a copy of the previous and next key at all times, allowing it to communicate with peers running both older and newer configuration versions without interruption.
> **Tip**
>
> If you need to complete a full key rotation (e.g. if one of your nodes has been compromised), you must follow the above process three times, ensuring that the cluster stabilizes between each update (see the sketch below).
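Scripted, a single rotation step is just a list operation. A full three-step rotation then looks like this (`rotate` is a hypothetical helper, and a real rotation must wait for the cluster to stabilize between steps):

```python
def rotate(secrets, new_key):
    """Add new_key as decrypt-only and retire the oldest key.

    After this, secrets[1] (the previous "new" key) becomes the
    encryption key, so roll the change out cluster-wide and let
    the cluster stabilize before rotating again.
    """
    return [new_key] + secrets[:-1]

# A full rotation after a compromise replaces all three slots:
secrets = ["key3", "key2", "key1"]
for fresh in ["key4", "key5", "key6"]:
    secrets = rotate(secrets, fresh)
print(secrets)  # ['key6', 'key5', 'key4']
```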
## Advanced Tuning
### `gossip_interval`
How frequently nodes exchange gossip messages. Lower values reduce the time taken for the cluster to reach consensus on the state of a given probe, but increase network usage.
```yaml
cluster:
  gossip_interval: 30s # Default
```
### `gossip_factor`
Number of random peers to gossip with per interval. Higher values improve how quickly the cluster reaches consensus, but also increase network usage.
```yaml
cluster:
  gossip_factor: 2 # Default
```
For clusters with N nodes, the optimal `gossip_factor` is typically log₂(N) + 1 (see the sketch after this list):

- 2-4 nodes: `gossip_factor = 2`
- 5-8 nodes: `gossip_factor = 3`
- 9-16 nodes: `gossip_factor = 4`
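As a sketch, the ranges above can be reproduced with a logarithmic fanout floored at 2, and the same logarithm gives a rough feel for convergence time (both heuristics are our reading of the table, not a Grey API):

```python
import math

def recommended_gossip_factor(nodes: int) -> int:
    # Reproduces the table above: logarithmic fanout with a floor of 2.
    return max(2, math.ceil(math.log2(nodes)))

def rounds_to_converge(nodes: int, factor: int) -> int:
    # Rough push-gossip estimate: the informed set grows by roughly
    # (1 + factor)x per round, so full coverage takes about
    # log base (1 + factor) of N rounds.
    return math.ceil(math.log(nodes) / math.log(1 + factor))

for n in (4, 8, 16, 100):
    f = recommended_gossip_factor(n)
    print(f"{n} nodes: gossip_factor={f}, ~{rounds_to_converge(n, f)} rounds to converge")
```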
### `gc_interval`
How frequently to run the garbage collector to remove stale peers and expired probes.
```yaml
cluster:
  gc_interval: 300s # Default (5 minutes)
```
### `gc_probe_expiry`
How long to retain information about a probe before it is considered stale and removed in the next garbage collection cycle.
Every time you start Grey, it begins reporting probe state under a new node identifier, so it is normal and expected that application restarts will leave behind probe state from old instances, which will stop updating and eventually need to be removed. Once removed, the aggregated probe metrics will be adjusted to account for the loss of this data.
```yaml
cluster:
  gc_probe_expiry: 7d # Default (7 days)
```
### `gc_peer_expiry`
How long to attempt to contact a known peer after it was last seen before considering it to be offline and removing it from the local member list.
We recommend keeping this value relatively low to avoid sending unnecessary broadcasts to inactive peers; however, setting it too short can negatively impact cluster stability under load.
```yaml
cluster:
  gc_peer_expiry: 30m # Default (30 minutes)
```
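Putting the garbage-collection options together, each `gc_interval` tick conceptually performs a pass like the following (a simplified sketch with illustrative names; the real store also recomputes aggregate metrics when probe state is dropped):

```python
import time

# Defaults from above, expressed in seconds.
GC_PEER_EXPIRY = 30 * 60             # 30m
GC_PROBE_EXPIRY = 7 * 24 * 60 * 60   # 7d

def gc_pass(peers, probe_states, now=None):
    """One garbage-collection cycle: drop peers and probe entries
    whose last update is older than the configured expiry.

    `peers` and `probe_states` map an identifier to a last-seen
    UNIX timestamp.
    """
    now = now if now is not None else time.time()
    live_peers = {p: t for p, t in peers.items() if now - t < GC_PEER_EXPIRY}
    live_probes = {p: t for p, t in probe_states.items() if now - t < GC_PROBE_EXPIRY}
    return live_peers, live_probes
```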