Automatic, incremental, filesystem-level replication is one of the core building blocks of the cluster. This is the process whereby a read-only copy of a hosted, read-write filesystem is maintained on each of several other nodes (the “slaves”) in the cluster. These copies are kept up-to-date by a change-driven system which monitors writes to the read-write version (the “master”) and distributes them to be applied to the slave copies. Every node in the cluster will typically act as a master for some filesystems and as a slave for others.
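The change-driven flow can be sketched in miniature. This is a hypothetical illustration of the master/slave write path, not the cluster's real implementation; the class and method names are invented, and the in-memory dicts stand in for filesystems:

```python
class Slave:
    """Holds a read-only copy; only accepts changes distributed by the master."""

    def __init__(self):
        self.data = {}

    def apply(self, path, contents):
        self.data[path] = contents


class Master:
    """Holds the read-write copy and forwards each write to every slave."""

    def __init__(self, slaves):
        self.data = {}
        self.slaves = slaves

    def write(self, path, contents):
        # Apply the write locally first...
        self.data[path] = contents
        # ...then distribute the same change to each read-only copy.
        for slave in self.slaves:
            slave.apply(path, contents)
```

In the real system the "changes" are filesystem-level deltas rather than whole file contents, but the shape is the same: every mutation on the master is observed and propagated so the slaves converge on the master's state.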
The cluster can be configured (see Launchpad) to specify how many read-only copies of a filesystem should be maintained at all times. Any time a new filesystem is added to the cluster, slaves are selected to hold read-only copies of this filesystem. Other events may also cause the cluster to pick new slaves for a filesystem – for example, a configuration change to raise the minimum number of copies, or the failure of a node which was acting as a slave for that particular filesystem.
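The decision of when to recruit new slaves reduces to a simple deficit calculation, triggered by any of the events above. A minimal sketch, with an invented function name (the configuration value corresponds to the minimum-copies setting described above):

```python
def slaves_needed(min_copies, live_slaves):
    """Return how many additional slaves must be recruited to restore
    the configured number of read-only copies.

    min_copies  -- the configured minimum number of read-only copies
    live_slaves -- the nodes currently holding a healthy copy
    """
    return max(0, min_copies - len(live_slaves))
```

A node failure shrinks `live_slaves`, and a configuration change raises `min_copies`; either way, a positive result means the master must nominate that many new slaves.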
When a filesystem requires an additional slave, the master for that filesystem nominates another node (one which is not already a slave for the filesystem in question) and begins replicating the filesystem to it. That node then becomes a slave for the filesystem. The master will try to nominate the node with the most free space available out of all the nodes in the cluster. If no node has enough space to receive a copy of the filesystem, the cluster is unable to satisfy the replication requirements (the “redundancy invariant”). At this point, more storage must be provisioned for the cluster, either by extending the existing ZFS pools or by adding more servers.
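The nomination rule can be expressed compactly. This is an illustrative sketch, not the cluster's actual selection code; the function name and data shapes are assumptions:

```python
def choose_new_slave(free_space, current_slaves, fs_size):
    """Pick a new slave for a filesystem.

    free_space     -- mapping of node name to bytes of free storage
    current_slaves -- set of nodes already holding a copy
    fs_size        -- bytes needed to hold a copy of the filesystem

    Returns the name of the node with the most free space that is not
    already a slave and can hold the copy, or None if no node qualifies
    (i.e. the redundancy invariant cannot be satisfied).
    """
    candidates = [
        (free, name)
        for name, free in free_space.items()
        if name not in current_slaves and free >= fs_size
    ]
    if not candidates:
        return None  # more storage must be provisioned for the cluster
    # Prefer the candidate with the most free space.
    return max(candidates)[1]
```

A `None` result is the point at which operators must extend the existing pools or add servers, since no placement can restore the required number of copies.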
A node will attempt to keep some portion of its storage free, since completely filling storage has a strong negative impact on overall filesystem performance. See the maximum_hpool_allocation configuration item.
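The headroom rule might be checked like this. A minimal sketch, assuming maximum_hpool_allocation is expressed as a fraction of total pool capacity; the function name and the 0.9 default are illustrative, not the cluster's real values:

```python
def can_allocate(pool_size, used, request, maximum_hpool_allocation=0.9):
    """Return True if the requested allocation fits without pushing the
    pool past the configured ceiling (a fraction of total capacity).

    pool_size -- total capacity of the ZFS pool, in bytes
    used      -- bytes currently allocated
    request   -- bytes the new filesystem copy would consume
    """
    return used + request <= pool_size * maximum_hpool_allocation
```

Rejecting allocations above the ceiling keeps a reserve of free space on every node, avoiding the performance cliff that comes with a nearly full pool.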