Brian Picciano 2 years ago
      static/src/_posts/2022-01-23-the-cryptic-filesystem.md

---
title: >-
    The Cryptic Filesystem
description: >-
    Hey, I'm brainstorming here!
series: nebula
tags: tech
---
Presently the cryptic-net project has two components: a VPN layer (implemented
using [nebula][nebula]), and a DNS component which makes communicating across
that VPN a bit nicer. All of this is wrapped up in a nice bow using an AppImage
and a simple process manager. The foundation is laid for adding the next major
component: a filesystem layer.

I've done a lot of research and talking about this layer, and you can see past
posts in this series talking about it. Unfortunately, I haven't really made much
progress on a solution. It really feels like there's nothing out there already
implemented, and we're going to have to do it from scratch.

To briefly recap the general requirements of the cryptic network filesystem
(cryptic-fs), it must have:

* Sharding of the fs dataset, so each node doesn't need to persist the full
  dataset.
* Replication factor (RF), so each piece of content must be persisted by at
  least N nodes of the cluster.
* Nodes are expected to be semi-permanent. They are expected to be in it for the
  long haul, but they also may flit in and out of existence frequently.
* Each cryptic-fs process should be able to track multiple independent
  filesystems, with each node in the cluster not necessarily tracking the same
  set of filesystems as the others.

This post is going to be a very high-level design document for what, in my head,
is the ideal implementation of cryptic-fs. _If_ cryptic-fs is ever actually
implemented it will very likely differ from this document in major ways, but one
must start somewhere.

[nebula]: https://github.com/slackhq/nebula

## Merkle DAG

It wouldn't be a modern network filesystem project if there wasn't a [Merkle
DAG][mdag]. The minutiae of how a Merkle DAG works aren't super important here;
the important bits are:

* Each file is represented by a content identifier (CID), which is essentially a
  cryptographic hash of the file's contents (the same contents always produce
  the same CID).
* Each directory is also represented by a CID which is generated by hashing the
  CIDs of the directory's files and their metadata.
* Since the root of the filesystem is itself a directory, the entire filesystem
  can be represented by a single CID. By tracking the changing root CID all
  hosts participating in the network filesystem can cheaply identify the latest
  state of the entire filesystem.
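
To make the three points above concrete, here is a minimal sketch in Python. It
assumes plain SHA-256 hex digests stand in for real CIDs and that a directory is
hashed via a sorted JSON encoding of its entries; both are simplifications for
illustration, and `file_cid`, `dir_cid` and `build_root` are hypothetical names.

```python
import hashlib
import json


def file_cid(contents: bytes) -> str:
    """CID of a file: here just the hex SHA-256 digest of its contents."""
    return hashlib.sha256(contents).hexdigest()


def dir_cid(entries: dict) -> str:
    """CID of a directory, derived from its children's names, CIDs and metadata.

    `entries` maps child name -> {"cid": str, "meta": dict}. Keys are sorted
    during serialization so the result doesn't depend on insertion order.
    """
    encoded = json.dumps(entries, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


def build_root(readme_contents: bytes) -> str:
    """Root CID of a tiny tree containing /notes.txt and /docs/readme.md."""
    docs = dir_cid({
        "readme.md": {"cid": file_cid(readme_contents), "meta": {"mode": 0o644}},
    })
    return dir_cid({
        "docs": {"cid": docs, "meta": {"mode": 0o755}},
        "notes.txt": {"cid": file_cid(b"todo"), "meta": {"mode": 0o644}},
    })


root_before = build_root(b"hello")
root_after = build_root(b"hello, world")  # edit one file deep in the tree
# The edit changes the file's CID, then its directory's CID, then the root CID.
print(root_before != root_after)
```

Comparing two root CIDs is therefore enough to tell whether two hosts are
looking at the same filesystem state.
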
A storage system for a Merkle DAG is implemented as a key-value store which maps
CID to directory node or file contents. When nodes in the cluster communicate
about data in the filesystem they will do so using these CIDs; one node might
ask the other "can you give me CID `AAA`", and the other would respond with the
contents of `AAA` without really caring about whether or not that CID points to
a file or directory node or whatever. It's quite a simple system.
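
A rough sketch of that key-value store, again assuming SHA-256 hex digests as
CIDs. `BlockStore` and `fetch` are hypothetical names, and the in-memory dict
stands in for whatever persistent storage would really be used.

```python
import hashlib
from typing import Dict, Optional


class BlockStore:
    """In-memory stand-in for the CID -> bytes key-value store."""

    def __init__(self) -> None:
        self._blocks: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        """Store a block and return the CID it can be requested by."""
        cid = hashlib.sha256(data).hexdigest()
        self._blocks[cid] = data
        return cid

    def get(self, cid: str) -> Optional[bytes]:
        """Return the block for `cid`, or None if it isn't persisted here."""
        return self._blocks.get(cid)


def fetch(remote: BlockStore, cid: str) -> bytes:
    """Ask another node's store for `cid` and verify what comes back."""
    data = remote.get(cid)
    if data is None:
        raise KeyError(cid)
    # The requester can check the answer itself: the data must hash to the CID.
    if hashlib.sha256(data).hexdigest() != cid:
        raise ValueError("peer returned data that does not match the CID")
    return data


node_b = BlockStore()
aaa = node_b.put(b"file or directory node, the store doesn't care")
print(fetch(node_b, aaa) == b"file or directory node, the store doesn't care")
```

Because the key is derived from the value, the requesting node doesn't have to
trust the node it downloaded from.
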
As far as actual implementation of the storage component, it's very likely we
could re-use some part of the IPFS code-base rather than implementing this from
scratch.

[mdag]: https://docs.ipfs.io/concepts/merkle-dag/

## Consensus

The cluster of nodes needs to (roughly) agree on some things in order to
function:

* What the current root CID of the filesystem is.
* Which nodes have which CIDs persisted.

These are all things which can change rapidly, and which _every_ node in the
cluster will need to stay up-to-date on. On the other hand, given efficient use
of the boolean tagged CIDs mentioned in the previous section, this is a dataset
which could easily fit in memory even for large filesystems.

I've done a bunch of research here and I'm having trouble finding anything
existing which fits the bill. Most databases expect the set of nodes to be
pretty constant, so that eliminates most of them. Here are a couple of other
ideas I spitballed:
* Taking advantage of the already written [go-ds-crdt][crdt] package which the
  [IPFS Cluster][ipfscluster] project uses. My biggest concern with this
  project, however, is that the entire history of the CRDT must be stored on
  each node, which in our use-case could be a very long history.
* Just saying fuck it and using a giant Redis replica-set, where each node in
  the cluster is a replica and one node is chosen to be the primary. [Redis
  Sentinel][sentinel] could be used to decide the current primary. The issue is
  that I don't think Sentinel is designed to handle hundreds or thousands of
  nodes, which places a ceiling on cluster capacity. I'm also not confident that
  the primary node could handle hundreds/thousands of replicas syncing from it
  nicely; that's not something Redis likes to do.
* Using a blockchain engine like [Tendermint][tendermint] to implement a custom,
  private blockchain for the cluster. This could work performance-wise, but I
  think it would suffer from the same issue as the CRDT: every node would have
  to store the entire history.

It seems to me like some kind of WAN-optimized gossip protocol would be the
solution here. Each node already knows which CIDs it itself has persisted, so
what's left is for all nodes to agree on the latest root CID, and to coordinate
who is going to store what long-term.

[crdt]: https://github.com/ipfs/go-ds-crdt
[ipfscluster]: https://cluster.ipfs.io/
[sentinel]: https://redis.io/topics/sentinel
[tendermint]: https://tendermint.com/
### Gossip

The [gossipsub][gossipsub] library which is built into libp2p seems like a good
starting place. It's optimized for WANs and, crucially, is already implemented.

Gossipsub makes use of different topics, onto which peers in the cluster can
publish messages which other peers who are subscribed to those topics will
receive. It makes sense to have a topic-per-filesystem (remember, from the
original requirements, that there can be multiple filesystems being tracked), so
that each node in the cluster can choose for itself which filesystems it cares
to track.

The messages which can get published will depend on the different situations in
which nodes will want to communicate, so it's worth enumerating those.

**Situation #1: Node A wants to obtain a CID**: Node A will send out a
`WHO_HAS:<CID>` message (not the actual syntax) to the topic. Node B (and
possibly others), which has the CID persisted, will respond with `I_HAVE:<CID>`.
The response will be sent directly from B to A, not broadcast over the topic,
since only A cares. The timing of B's response to A could be subject to a delay
based on B's current load, such that another less loaded node might get its
response in first.

From here node A would initiate a download of the CID from B via a direct
connection. If node A has enough space then it will persist the contents of the
CID for the future.

This situation could arise because the user has opened a file in the filesystem
for reading, or has attempted to enumerate the contents of a directory, and the
local storage doesn't already contain that CID.
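
The load-based delay in situation #1 could look something like the following
sketch. The linear mapping, the `max_delay` default and the idea of expressing
load as a number between 0 and 1 are all assumptions made for illustration.

```python
from typing import Dict


def response_delay(load: float, max_delay: float = 2.0) -> float:
    """Seconds a node waits before answering WHO_HAS.

    `load` is the node's current load, clamped to the range [0.0, 1.0]. An
    idle node answers immediately, a saturated one waits `max_delay` seconds.
    """
    load = min(max(load, 0.0), 1.0)
    return load * max_delay


def first_responder(loads: Dict[str, float]) -> str:
    """Node whose I_HAVE would arrive first, ignoring network latency."""
    return min(loads, key=lambda node: response_delay(loads[node]))


# Nodes B and C both have the CID persisted, but C is much less busy.
print(first_responder({"B": 0.9, "C": 0.1}))
```

Node A would then simply download from whichever node it hears from first,
which nudges traffic toward the less loaded nodes without any coordination.
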
**Situation #2: Node A wants to delete a CID which it has persisted**: Similar
to #1, Node A needs to first ensure that other nodes have the CID persisted, in
order to maintain the RF across the filesystem. So node A first sends out a
`WHO_HAS:<CID>` message. If at least RF other nodes respond with `I_HAVE:<CID>`
then node A can delete the CID from its storage without concern. Otherwise it
should not delete the CID.

**Situation #2a: Node A wants to delete a CID which it has persisted, and which
is not part of the current filesystem**: If the filesystem is in a state where
the CID in question is no longer present in the system, then node A doesn't need
to care about the RF and therefore doesn't need to send any messages.
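
Situations #2 and #2a boil down to one small decision, sketched below.
`may_delete` and its parameters are hypothetical; in particular it assumes the
node can compute the set of CIDs reachable from the current root.

```python
from typing import Set


def may_delete(cid: str, live_cids: Set[str], responders: Set[str], rf: int) -> bool:
    """Decide whether this node may drop `cid` from its local storage.

    live_cids:  CIDs reachable from the current filesystem root.
    responders: other nodes which answered I_HAVE for `cid`.
    rf:         the replication factor of the filesystem.
    """
    if cid not in live_cids:
        # Situation #2a: the CID is no longer part of the filesystem, so the
        # RF doesn't apply and no WHO_HAS needs to be sent at all.
        return True
    # Situation #2: only delete if the RF still holds without our copy.
    return len(responders) >= rf


live = {"AAA", "BBB"}
print(may_delete("AAA", live, {"node-b", "node-c"}, rf=3))
print(may_delete("AAA", live, {"node-b", "node-c", "node-d"}, rf=3))
print(may_delete("OLD", live, set(), rf=3))
```
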
**Situation #3: Node A wants to update the filesystem root CID**: This is as
simple as sending out a `ROOT:<CID>` message on the topic. Other nodes will
receive this and note the new root.

**Situation #4: Node A wants to know the current filesystem root CID**: Node A
sends out a `ROOT?` message. Other nodes will respond to node A directly telling
it the current root CID.
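
For illustration only, the four message kinds used across these situations could
be encoded and decoded like this. The `KIND:<CID>` text format is taken straight
from the placeholders above, which are explicitly not the actual syntax, and
`Message`, `encode` and `decode` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

KINDS = {"WHO_HAS", "I_HAVE", "ROOT", "ROOT?"}


@dataclass(frozen=True)
class Message:
    kind: str
    cid: Optional[str] = None  # ROOT? is the only kind without a CID


def encode(msg: Message) -> str:
    """Turn a Message into its placeholder wire format."""
    if msg.kind not in KINDS:
        raise ValueError(f"unknown message kind: {msg.kind!r}")
    if msg.kind == "ROOT?":
        return "ROOT?"
    if not msg.cid:
        raise ValueError(f"{msg.kind} requires a CID")
    return f"{msg.kind}:{msg.cid}"


def decode(raw: str) -> Message:
    """Parse the placeholder wire format back into a Message."""
    if raw == "ROOT?":
        return Message("ROOT?")
    kind, sep, cid = raw.partition(":")
    if not sep or not cid or kind not in KINDS or kind == "ROOT?":
        raise ValueError(f"malformed message: {raw!r}")
    return Message(kind, cid)


print(encode(Message("WHO_HAS", "AAA")))
print(decode("ROOT:BBB"))
```
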
These describe the circumstances around the messages used across the gossip
protocol in a very shallow way. In order to properly flesh out the behavior of
the consistency mechanism we need to dive in a bit more.

### Optimizations, Replication, and GC

A key optimization worth hitting straight away is to declare that each node will
always immediately persist all directory CIDs whenever a `ROOT:<CID>` message is
received. This will _generally_ only involve a couple of round-trips with the
host which issued the `ROOT:<CID>` message, with opportunity for
parallelization.
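
A sketch of that directory sync, under a few assumptions: a directory node is
modeled as a mapping from child name to `(cid, is_directory)`, `fetch` is some
callable which downloads a directory node from the host that sent the root, and
everything happens sequentially rather than in parallel. All names here are
hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A directory node maps child name -> (child CID, whether it is a directory).
DirNode = Dict[str, Tuple[str, bool]]


def sync_directories(
    root_cid: str,
    local: Dict[str, DirNode],
    fetch: Callable[[str], DirNode],
) -> List[str]:
    """Persist every directory node reachable from `root_cid`.

    Directory nodes already present in `local` are not downloaded again.
    Returns the CIDs which had to be fetched, in the order they were fetched.
    """
    fetched: List[str] = []
    pending = [root_cid]
    seen = set()
    while pending:
        cid = pending.pop()
        if cid in seen:
            continue
        seen.add(cid)
        if cid not in local:
            local[cid] = fetch(cid)
            fetched.append(cid)
        # Only descend into sub-directories; file contents are fetched lazily.
        for child_cid, is_dir in local[cid].values():
            if is_dir:
                pending.append(child_cid)
    return fetched


remote = {
    "ROOT2": {"docs": ("DOCS", True), "notes.txt": ("F1", False)},
    "DOCS": {"readme.md": ("F2", False)},
}
local = {"DOCS": remote["DOCS"]}  # unchanged sub-directory from the old root
print(sync_directories("ROOT2", local, remote.__getitem__))
```

Since unchanged sub-directories keep their CIDs, a new root usually only
requires downloading the handful of directory nodes along the changed path.
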
This could be a problem if the directory structure becomes _huge_, at which
point it might be worth placing some kind of limit on what percent of storage is
allowed for directory nodes. But really... just have fewer directories, people!

The next thing to dive in on is replication. We've already covered in
situation #1 what happens if a user specifically requests a file. But that's not
enough to ensure the RF of the entire filesystem, as some files might not be
requested by any users except the user who originally added the file.

We can note that each node knows when a file has been added to the filesystem,
thanks to each node knowing the full directory tree. So upon seeing that a new
file has been added, a node can issue a `WHO_HAS:<CID>` message for it, and if
fewer than RF nodes respond then it can persist the CID. This is all assuming
that the node has enough space for the new file.

One wrinkle in that plan is that we don't want all nodes to send the
`WHO_HAS:<CID>` at the same time for the same CID, otherwise they'll all end up
downloading the CID and over-replicating it. A solution here is for each node to
delay its `WHO_HAS:<CID>` based on how much space it has left for storage, so
nodes with more free space are more eager to pull in new files.
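
One possible shape for that delay is sketched below. The linear mapping from
free space to delay, the `max_delay` default and the small random jitter (so
that two nodes with identical free space don't fire at the same instant) are
all assumptions on my part, not part of the design above.

```python
import random


def who_has_delay(
    free_bytes: int,
    capacity_bytes: int,
    max_delay: float = 30.0,
    jitter: float = 0.1,
) -> float:
    """Seconds to wait before sending WHO_HAS for a newly added file.

    A completely empty node waits (almost) no time; a full node waits roughly
    `max_delay` seconds. `jitter` is a fraction of `max_delay` added at random.
    """
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    free_fraction = min(max(free_bytes / capacity_bytes, 0.0), 1.0)
    base = (1.0 - free_fraction) * max_delay
    return base + random.uniform(0.0, jitter * max_delay)


def should_persist(responders: int, rf: int, size: int, free_bytes: int) -> bool:
    """After the delay: persist only if under-replicated and there is room."""
    return responders < rf and size <= free_bytes


roomy = who_has_delay(free_bytes=800, capacity_bytes=1000, jitter=0.0)
cramped = who_has_delay(free_bytes=200, capacity_bytes=1000, jitter=0.0)
print(roomy < cramped)
```

By the time the cramped node gets around to asking, the roomy nodes will
hopefully have already brought the file up to its RF.
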
Additionally, we want to have nodes periodically check the replication status of
each CID in the filesystem. This is because nodes might pop in and out of
existence randomly, and the cluster needs to account for that. The way this can
work is that each node periodically picks a CID at random and checks the
replication status of it. If the period between checks is derived from the
number of online nodes in the cluster and the number of CIDs to be checked, then
every CID should get checked within a reasonable amount of time with minimal
overhead.
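
As a back-of-the-envelope sketch of how that period might be derived: if N
nodes each check one random CID every T seconds, the cluster performs N/T checks
per second. To perform `coverage` times as many checks as there are CIDs within
some target window, T works out to N * window / (coverage * CIDs). The window,
the `coverage` factor and the function names are assumptions for illustration.

```python
def check_interval(
    online_nodes: int,
    total_cids: int,
    window: float,
    coverage: float = 3.0,
) -> float:
    """Seconds each node should wait between random replication checks."""
    if online_nodes < 1 or total_cids < 1:
        raise ValueError("need at least one node and one CID")
    return online_nodes * window / (coverage * total_cids)


def miss_probability(total_cids: int, coverage: float) -> float:
    """Chance that one particular CID is never picked during a window.

    Picks are uniform and independent, so each of the `coverage * total_cids`
    checks misses a given CID with probability (1 - 1/total_cids).
    """
    return (1.0 - 1.0 / total_cids) ** (coverage * total_cids)


# 50 online nodes, 100,000 CIDs, aiming to cover them roughly once a day.
print(check_interval(50, 100_000, window=86_400.0))
print(miss_probability(100_000, coverage=3.0))
```

Because the picks are random there's no hard guarantee, but the chance of a CID
being skipped in a given window shrinks roughly like e^(-coverage), and a CID
missed in one window is just as likely as any other to be picked in the next.
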
This dovetails nicely with garbage collection. Given that nodes can flit in and
out of existence, a node might come back from having been down for a time, and
all CIDs it had persisted would then be over-replicated. So the same process
which is checking for under-replicated files will also be checking for
over-replicated files.
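
The combined under/over-replication check might reduce to something like this
sketch, where `holders` is the number of nodes known to have the CID (this node
included, if it has it). The function name and the three action strings are
hypothetical.

```python
def replication_action(holders: int, rf: int, have_local: bool, have_space: bool) -> str:
    """What this node should do about one CID after checking its replication.

    holders counts every node with the CID persisted, including this node when
    `have_local` is true.
    """
    if holders < rf and not have_local and have_space:
        return "persist"  # under-replicated, and we are able to help
    if holders > rf and have_local:
        return "delete"   # over-replicated even without our copy
    return "keep"


print(replication_action(holders=2, rf=3, have_local=False, have_space=True))
print(replication_action(holders=4, rf=3, have_local=True, have_space=True))
print(replication_action(holders=3, rf=3, have_local=True, have_space=True))
```

The "delete" branch is the same rule as situation #2: with more than RF
holders, at least RF other nodes still have the CID once this node drops it.
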
### Limitations

This consistency mechanism has a lot of nice properties: it's eventually
consistent, it nicely handles nodes coming in and out of existence without any
coordination between the nodes, and it _should_ be pretty fast for most cases.
However, it has its downsides.

There's definitely room for inconsistency between each node's view of the
filesystem, especially when it comes to the `ROOT:<CID>` messages. If two nodes
issue `ROOT:<CID>` messages at the same time then it's extremely likely nodes
will have a split view of the filesystem, and there's not a great way to
resolve this until another change is made on another node. This is probably the
weakest point of the whole design.
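
The split view is easy to reproduce in a toy model. This sketch assumes each
node simply adopts whichever `ROOT:<CID>` it received most recently, which is
one reading of "note the new root" and not something the design above
specifies; the `Node` class is hypothetical.

```python
from typing import List, Optional


class Node:
    """Toy node which treats the last ROOT message it saw as the current root."""

    def __init__(self) -> None:
        self.root: Optional[str] = None

    def on_root(self, cid: str) -> None:
        self.root = cid


def deliver(node: Node, root_messages: List[str]) -> Optional[str]:
    """Feed ROOT messages to a node in arrival order and return its final root."""
    for cid in root_messages:
        node.on_root(cid)
    return node.root


# Nodes A and B publish new roots at the same moment. Gossip doesn't promise
# an ordering, so nodes X and Y happen to receive the two messages in
# opposite orders.
x = deliver(Node(), ["ROOT_FROM_A", "ROOT_FROM_B"])
y = deliver(Node(), ["ROOT_FROM_B", "ROOT_FROM_A"])
print(x, y, x == y)
```

Both nodes saw exactly the same messages and still disagree, and nothing in the
protocol as described will make them notice until the next `ROOT:<CID>` arrives.
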
[gossipsub]: https://github.com/libp2p/specs/tree/master/pubsub/gossipsub

## FUSE

The final piece is the FUSE connector for the filesystem, which is how users
actually interact with each filesystem being tracked by their node. This is
actually the easiest component: if we use an idea borrowed from
[Tahoe-LAFS][tahoe], cryptic-fs can expose an SFTP endpoint and that's it.

The idea is that hooking up an existing SFTP implementation to the rest of
cryptic-fs should be pretty straightforward, and every OS should already have
some kind of mount-SFTP-as-FUSE mechanism, either built into it or as an
existing application. Exposing an SFTP endpoint also allows a user to access
cryptic-fs remotely if they want to.

[tahoe]: https://tahoe-lafs.org/trac/tahoe-lafs

## Ok

So all that said, clearly the hard part is the consistency mechanism. It's not
even fully developed in this document, but it's almost there. The next step,
beyond polishing up the consistency mechanism, is going to be roughly figuring
out all the interfaces and types involved in the implementation, planning out
how those will all interact with each other, and then finally an actual
implementation!