---
type: task
---

## Problem

There are high-level but extremely problematic issues with how garage layout
management is being done.

In general, the strategy around layout management is that each host only
modifies the parts of the cluster layout related to itself, and never touches
the applied roles of other hosts. This works well in every case except one: a
host removing one or more of its own allocations.

There are two separate issues which must be dealt with, each partially related
to the other.

## Draining of garage data

When a garage node is removed from the cluster it first goes into the
"draining" state, so that other nodes in the cluster can ensure that the
replication factor for each piece of data is met prior to the node being
decommissioned.

While the node is in the draining state it cannot be used for S3 API calls, as
the bucket credentials are no longer present on it.

## Configuration change on restart

For hosts whose configuration is managed by `daemon.yml`, it is not necessarily
known upon restart that a garage node used to exist at all. The host can't
investigate the cluster layout because it won't have a garage instance running,
and even if it could, it wouldn't be able to bring up a garage node to properly
drain the old allocations.

## Invalid Solutions

One solution which is tempting but ultimately NOT viable is to make all hosts
run at least one garage instance, and if they have no storage allocations to
make that instance a "gateway" instance. This won't work, though, because it
would require all hosts to open up the RPC port on their firewall, and firewall
management requires extra user involvement.

Another previously attempted solution was to use an "orphan remover" process on
each host, where the host would compare the garage cluster layout to the
expected layout based on the bootstrap data in the common bucket, and remove
any hosts from the layout which shouldn't be there and don't have a garage
instance to remove themselves with. This had a bunch of unresolvable race
conditions, and it didn't account for draining besides.

## Possible Solution

The solution seems to be that the host must maintain two views of its garage
allocations: the last known allocation state, and the desired allocation state.

The last known state needs to contain the state each allocation was in (healthy
or draining), along with its directories and capacity. It should get updated
any time the host performs an action which changes it (modifying the cluster
layout to add a new instance or move an existing one to draining, or actually
removing an instance which is done draining).
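
As a rough illustration, a persisted "last known" entry might look something
like the following sketch. This assumes a Go implementation; the
`knownAllocation` type and its field names are hypothetical, but the contents
(state, directories, capacity, and RPC port) follow from the description above:

```go
// knownAllocation is a sketch of the last known state of a single garage
// allocation on this host (all names here are hypothetical).
type knownAllocation struct {
    RPCPort  int    // RPC port the garage instance listens on
    MetaPath string // directory holding the instance's metadata
    DataPath string // directory holding the instance's data blocks
    Capacity uint64 // capacity assigned to the allocation

    // State is either "healthy" or "draining", and gets updated whenever the
    // host performs an action which changes the cluster layout for this
    // allocation.
    State string
}
```

Presumably the host persists these entries locally so that they survive
restarts, since the desired state alone can't tell it that an old allocation
ever existed.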

The desired state is essentially the network configuration as it is now. This
will be used along with the last known state to decide which actions to take.
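
Taking actions then boils down to comparing the two views. A minimal sketch,
reusing the hypothetical `knownAllocation` type from above and keying
allocations by their RPC port: allocations which are desired but unknown get
added to the cluster layout, and allocations which are known but no longer
desired get moved to draining rather than removed outright.

```go
// action describes one step the host should take to move the cluster layout
// from the last known state toward the desired state (names are hypothetical).
type action struct {
    Kind    string // "add" or "drain"
    RPCPort int
}

// reconcile compares the desired allocations against the last known ones and
// returns the actions the host should take.
func reconcile(lastKnown, desired map[int]knownAllocation) []action {
    var actions []action
    for port := range desired {
        if _, ok := lastKnown[port]; !ok {
            actions = append(actions, action{Kind: "add", RPCPort: port})
        }
    }
    for port, known := range lastKnown {
        if _, ok := desired[port]; !ok && known.State == "healthy" {
            actions = append(actions, action{Kind: "drain", RPCPort: port})
        }
    }
    return actions
}
```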

There are a few details to note with this solution:

- There will need to be a worker which periodically checks the last known state
for any nodes which were draining, and removes them once they are done draining
(see the worker sketch after this list).

- When the host starts up it should _always_ use the last known state, and only
once started up should it go on to apply the desired configuration.

- When choosing an admin endpoint the last known state should be used, even
though it might result in unexpected behavior from the user's perspective
(since the user only knows about the desired state). This applies to RPC
endpoints as well.

- The last/desired states need to be checked for conflicts, and an error
emitted in the event that there is one (either returned from SetConfig or
Load). This includes a new allocation using the same directory as an old one
(based on RPC port), or two allocations using the same RPC port (see the
validation sketch after this list).

- The nebula firewall must base its opened ports on the last known state rather
than the desired state.
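
For the first bullet, the worker could be a simple background loop along these
lines. This is only a sketch assuming a Go implementation;
`loadLastKnownState`, `clusterNodeIsGone`, and `removeAllocation` are
hypothetical stand-ins for however the real code loads the persisted state,
checks that a draining node has left the cluster layout, and removes the
allocation (updating the last known state as it does so):

```go
import (
    "context"
    "time"
)

// removeDrainedWorker periodically scans the last known state for allocations
// which were put into the draining state, and removes any which have finished
// draining.
func removeDrainedWorker(ctx context.Context, interval time.Duration) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            for port, alloc := range loadLastKnownState() {
                if alloc.State == "draining" && clusterNodeIsGone(port) {
                    removeAllocation(port)
                }
            }
        }
    }
}
```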
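
The conflict check from the fourth bullet might look roughly like the sketch
below, again using the hypothetical types from above; the exact error messages,
and whether they surface from SetConfig or Load, are up to the implementation:

```go
import "fmt"

// checkConflicts returns an error if the desired allocations conflict with the
// last known ones: a desired allocation reusing a directory still owned by a
// different allocation (allocations are identified by their RPC port), or two
// desired allocations sharing an RPC port.
func checkConflicts(lastKnown map[int]knownAllocation, desired []knownAllocation) error {
    seenPorts := map[int]bool{}
    for _, d := range desired {
        if seenPorts[d.RPCPort] {
            return fmt.Errorf("two allocations are using RPC port %d", d.RPCPort)
        }
        seenPorts[d.RPCPort] = true

        for port, known := range lastKnown {
            if port == d.RPCPort {
                continue // same allocation as before, not a conflict
            }
            if known.MetaPath == d.MetaPath || known.DataPath == d.DataPath {
                return fmt.Errorf(
                    "allocation on RPC port %d uses a directory still owned by the old allocation on RPC port %d",
                    d.RPCPort, port,
                )
            }
        }
    }
    return nil
}
```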