---
type: task
---

## Problem

There are high-level but serious problems with how garage layout management
is currently done.

In general the strategy around layout management is that each host only
modifies the parts of the cluster layout related to itself, and never touches
the applied roles of other hosts. This works well in every case except one: a
host removing one or more of its own allocations.

There are two separate issues which must be dealt with, each partially
related to the other.

### Draining of garage data

When a garage node is removed from the cluster it first goes into the
"draining" state, so that the other nodes in the cluster can ensure that the
replication factor for each piece of data is met before the node is
decommissioned.

While a node is in the draining state it cannot be used for S3 API calls,
because the bucket credentials are no longer present on it.
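
For reference, this is roughly how draining plays out with garage's own CLI
(a sketch; `<node-id>` and the layout version are placeholders):

```
# Mark the node for removal from the cluster layout.
garage layout remove <node-id>

# Apply the new layout. The node then counts as draining, and stays that
# way until the other nodes have re-replicated its data.
garage layout apply --version <new-version>
```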

### Configuration change on restart

For hosts whose configuration is managed by `daemon.yml`, it is not
necessarily known upon restart that a garage node ever existed. The host
can't inspect the cluster layout because it won't have a garage instance
running, and even if it could, it wouldn't be able to bring up a garage node
to properly drain the old allocations.

## Invalid Solutions

One solution which is tempting but ultimately NOT viable is to make all hosts
run at least one garage instance, with hosts that have no storage allocations
running a "gateway" instance. This won't work, though, because it would
require all hosts to open up the RPC port on their firewall, and firewall
management requires extra user involvement.

Another previous solution was to run an "orphan remover" process on each
host, where the host would compare the garage cluster layout to the expected
layout based on the bootstrap data in the common bucket, and remove from the
layout any hosts which shouldn't be there and don't have a garage instance to
remove themselves with. This had a bunch of unresolvable race conditions, and
it didn't account for draining besides.

## Possible Solution

The solution seems to be that the host must maintain two views of its garage
allocations: the last known allocation state, and the desired allocation
state.

The last known state needs to contain the state each allocation was in
(healthy or draining), along with its directories and capacity. It should be
updated any time the host performs an action which changes it: modifying the
cluster layout to add a new instance or move an existing one to draining, or
actually removing an instance which is done draining.

The desired state is essentially the network configuration as it is now. It
will be used along with the last known state to decide which actions to take.
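
As a rough sketch, the two views might look something like the following in
Go (all of these names are hypothetical, not existing code):

```go
// AllocationKnownState records a single allocation as the host last knew
// it, including whether it has been moved to draining.
type AllocationKnownState struct {
	RPCPort  int    // also identifies the allocation within the layout
	DataDir  string
	MetaDir  string
	Capacity int
	Draining bool // true once the layout change draining it was applied
}

// KnownState is persisted locally, and updated every time the host
// performs an action which changes the cluster layout.
type KnownState struct {
	Allocations []AllocationKnownState
}

// DesiredAllocation mirrors an allocation as given in the network
// configuration; it has no notion of draining.
type DesiredAllocation struct {
	RPCPort  int
	DataDir  string
	MetaDir  string
	Capacity int
}
```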

There are a few details to note with this solution:

- There will need to be a worker which periodically checks the last known
  state for any nodes which are draining, and removes any which have finished
  draining (see the first sketch after this list).
- When the host starts up it should _always_ use the last known state, and
  only once started up should it go on to apply the desired configuration.
- When choosing an admin endpoint to use, the last known state should be
  consulted, even though this might result in unexpected behavior from the
  user's perspective (since the user only knows about the desired state).
  This applies to RPC endpoints as well.
- The last/desired states need to be checked for conflicts, with an error
  emitted in the event that there is one (either returned from SetConfig or
  Load); see the second sketch after this list. Conflicts include a new
  allocation using the same directory as an old one (identified by RPC port),
  or two allocations using the same RPC port.
- The nebula firewall must base its opened ports on the last known state
  rather than on the desired state.
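
A minimal sketch of the draining worker from the first bullet, assuming the
hypothetical types above plus hypothetical helpers (`loadKnownState`,
`allocIsDrained`, `removeFromLayout`, `saveKnownState`):

```go
import (
	"context"
	"log"
	"time"
)

// drainWorker periodically checks the last known state for draining
// allocations, and fully removes any which have finished draining.
func drainWorker(ctx context.Context) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}

		known, err := loadKnownState()
		if err != nil {
			log.Printf("loading last known state: %v", err)
			continue
		}

		var remaining []AllocationKnownState
		for _, alloc := range known.Allocations {
			if alloc.Draining && allocIsDrained(ctx, alloc) {
				// The node holds no more data; remove it from the cluster
				// layout and drop it from the last known state.
				if err := removeFromLayout(ctx, alloc); err != nil {
					log.Printf("removing allocation on RPC port %d: %v",
						alloc.RPCPort, err)
					remaining = append(remaining, alloc) // retry next tick
				}
				continue
			}
			remaining = append(remaining, alloc)
		}

		known.Allocations = remaining
		if err := saveKnownState(known); err != nil {
			log.Printf("saving last known state: %v", err)
		}
	}
}
```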
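
And a sketch of the conflict check from the fourth bullet, again using the
hypothetical types above; it would be called from both SetConfig and Load:

```go
import "fmt"

// checkConflicts returns an error if the desired allocations conflict with
// each other or with the last known state.
func checkConflicts(known KnownState, desired []DesiredAllocation) error {
	// Index the last known allocations by data directory, so that a
	// desired allocation reusing a directory can be detected.
	knownByDir := map[string]AllocationKnownState{}
	for _, alloc := range known.Allocations {
		knownByDir[alloc.DataDir] = alloc
	}

	seenPorts := map[int]bool{}
	for _, d := range desired {
		// Two allocations may not use the same RPC port.
		if seenPorts[d.RPCPort] {
			return fmt.Errorf("two allocations use RPC port %d", d.RPCPort)
		}
		seenPorts[d.RPCPort] = true

		// A new allocation (different RPC port) may not reuse the data
		// directory of an old one.
		if old, ok := knownByDir[d.DataDir]; ok && old.RPCPort != d.RPCPort {
			return fmt.Errorf(
				"allocation on RPC port %d uses directory %q, which belongs to the allocation on RPC port %d",
				d.RPCPort, d.DataDir, old.RPCPort,
			)
		}
	}
	return nil
}
```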