isle/tasks/bugs/garage-layout-management.md

type: task

# Problem

There are some high-level but serious problems with how garage cluster layout management is currently being done.

In general the strategy for layout management is that each host only modifies the parts of the cluster layout related to itself, and never touches the applied roles of other hosts. This works well in all cases except one: a host removing one or more of its own allocations.

There are two separate issues which must be dealt with, each partially related to the other.

## Draining of garage data

When a garage node is removed from the cluster it first goes into the "draining" state, so that other nodes in the cluster can ensure that the replication factor for each piece of data is met prior to the node being decommissioned.

While the node is in draining state it cannot be used for S3 API calls, as the bucket credentials are no longer present on it.

## Configuration change on restart

For hosts whose configuration is managed by daemon.yml, it is not necessarily known upon restart that a garage node ever existed at all. The host can't inspect the cluster layout, because it won't have a garage instance running, and even if it could, it wouldn't be able to bring up a garage node to properly drain the old allocations.

# Invalid Solutions

One solution which is tempting but ultimately NOT viable is to make all hosts run at least one garage instance, and for hosts with no storage allocations to make that instance a "gateway" instance. This won't work though, because it would require all hosts to open up the RPC port in their firewall, and firewall management requires extra user involvement.

Another previously attempted solution was an "orphan remover" process on each host, where the host would compare the garage cluster layout to the expected layout based on the bootstrap data in the common bucket, and remove any hosts from the layout which shouldn't be there and which don't have a garage instance of their own to remove themselves with. This had a bunch of unresolvable race conditions, and it didn't account for draining besides.

# Possible Solution

The solution seems to be that the host must maintain two views of its garage allocations: the last known allocation state, and the desired allocation state.

The last known state needs to contain what state each allocation was in (healthy or draining), along with its directories and capacity. It should get updated any time the host performs an action which changes it (modifying the cluster layout to add a new instance or move an existing one to draining, or actually removing an instance which is done draining).

The desired state is essentially the network configuration as it is now. This will be used along with the last known state to take actions.
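
For illustration, here is a rough sketch of what the two views might look like in Go. All type and field names here are hypothetical, not existing isle types; isle's actual representation of allocations may differ.

```go
// garageAllocationStatus is the lifecycle state of an allocation as last
// observed/acted upon by this host.
type garageAllocationStatus string

const (
	garageAllocationHealthy  garageAllocationStatus = "healthy"
	garageAllocationDraining garageAllocationStatus = "draining"
)

// knownGarageAllocation is one entry in the "last known state" view. It
// records enough information to find and manage the allocation again after
// a restart, even if it no longer appears in the desired configuration.
type knownGarageAllocation struct {
	Status   garageAllocationStatus `json:"status"`
	DataPath string                 `json:"data_path"`
	MetaPath string                 `json:"meta_path"`
	Capacity int                    `json:"capacity"` // e.g. in GiB
	RPCPort  int                    `json:"rpc_port"`
}

// garageAllocationViews holds the two views side by side. Desired is derived
// from the current network configuration; Known is persisted on the host and
// updated whenever the host changes the cluster layout.
type garageAllocationViews struct {
	Known   []knownGarageAllocation
	Desired []knownGarageAllocation // Status is always "healthy" here
}
```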

There are a few details to note with this solution:

- There will need to be a worker which periodically checks the last known state for any nodes which were draining, and removes any which have finished draining (see the drainWorker sketch after this list).

- When the host starts up it should always use the last known state, and only once started up should it go on to apply the desired configuration.

- When choosing an admin endpoint to use, the last known state should be consulted, even though this might result in unexpected behavior from the user's perspective (since the user only knows about the desired state). This applies to RPC endpoints as well.

- The last known and desired states need to be checked for conflicts, and an error emitted in the event that there is one (either returned from SetConfig or Load). This includes a new allocation using the same directory as an old one (based on RPC port), or two allocations using the same RPC port (see the checkAllocationConflicts sketch after this list).

- The nebula firewall must base its opened ports on the last known state rather than the desired state.
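
A sketch of the drain-checking worker described in the first bullet, reusing the hypothetical types above. nodeIsDrained and removeFromClusterLayout are placeholders for whatever garage admin interactions would actually be used; persistence and locking of the views are omitted.

```go
import (
	"context"
	"time"
)

// Placeholders for the real garage admin interactions; their signatures here
// are assumptions, not existing isle functions.
func nodeIsDrained(ctx context.Context, alloc knownGarageAllocation) bool { return false }

func removeFromClusterLayout(ctx context.Context, alloc knownGarageAllocation) error { return nil }

// drainWorker periodically scans the last known state for draining
// allocations and removes any which have finished draining, both from the
// cluster layout and from the last known state itself.
func drainWorker(ctx context.Context, views *garageAllocationViews) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}

		// Filter the last known state in place, dropping allocations which
		// were successfully removed from the cluster layout.
		remaining := views.Known[:0]
		for _, alloc := range views.Known {
			if alloc.Status == garageAllocationDraining && nodeIsDrained(ctx, alloc) {
				if err := removeFromClusterLayout(ctx, alloc); err == nil {
					continue // dropped from the last known state
				}
			}
			remaining = append(remaining, alloc)
		}
		views.Known = remaining
	}
}
```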
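
A sketch of the conflict check described in the fourth bullet, again reusing the hypothetical types above; where exactly it gets called (SetConfig vs Load) and the concrete error values are left open.

```go
import "fmt"

// checkAllocationConflicts returns an error if the desired allocations
// conflict with the last known ones or with each other: an allocation
// (identified by its RPC port) reusing a directory which belongs to a
// different allocation, or two desired allocations sharing an RPC port.
func checkAllocationConflicts(known, desired []knownGarageAllocation) error {
	// Map each directory already in use to the RPC port of the allocation
	// which owns it, according to the last known state.
	dirOwners := map[string]int{}
	for _, alloc := range known {
		dirOwners[alloc.DataPath] = alloc.RPCPort
		dirOwners[alloc.MetaPath] = alloc.RPCPort
	}

	seenPorts := map[int]bool{}
	for _, alloc := range desired {
		if seenPorts[alloc.RPCPort] {
			return fmt.Errorf("two allocations are using the same RPC port %d", alloc.RPCPort)
		}
		seenPorts[alloc.RPCPort] = true

		for _, dir := range []string{alloc.DataPath, alloc.MetaPath} {
			if owner, ok := dirOwners[dir]; ok && owner != alloc.RPCPort {
				return fmt.Errorf(
					"allocation on RPC port %d is using directory %q, which belongs to the allocation on RPC port %d",
					alloc.RPCPort, dir, owner,
				)
			}
		}
	}

	return nil
}
```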