---
title: >-
  The Cryptic Filesystem
description: >-
  Hey, I'm brainstorming here!
series: nebula
tags: tech
---

Presently the cryptic-net project has two components: a VPN layer (implemented
using [nebula][nebula]) and a DNS component which makes communicating across
that VPN a bit nicer. All of this is wrapped up in a nice bow using an AppImage
and a simple process manager. The foundation is laid for adding the next major
component: a filesystem layer.

I've done a lot of research and talking about this layer, and you can see past
posts in this series talking about it. Unfortunately, I haven't really made much
progress on a solution. It really feels like there's nothing out there already
implemented, and we're going to have to do it from scratch.

To briefly recap the general requirements of the cryptic network filesystem
(cryptic-fs), it must have:

* Sharding of the fs dataset, so each node doesn't need to persist the full
  dataset.

* Replication factor (RF), so each piece of content must be persisted by at
  least N nodes of the cluster.

* Nodes are expected to be semi-permanent: in it for the long haul, but they
  may also flit in and out of existence frequently.

* Each cryptic-fs process should be able to track multiple independent
  filesystems, with each node in the cluster not necessarily tracking the same
  set of filesystems as the others.

This post is going to be a very high-level design document for what, in my head,
is the ideal implementation of cryptic-fs. _If_ cryptic-fs is ever actually
implemented it will very likely differ from this document in major ways, but one
must start somewhere.

[nebula]: https://github.com/slackhq/nebula

## Merkle DAG

It wouldn't be a modern network filesystem project if there wasn't a [Merkle
DAG][mdag]. The minutiae of how a Merkle DAG works aren't super important here;
the important bits are:

* Each file is represented by a content identifier (CID), which is essentially a
  consistent hash of the file's contents.

* Each directory is also represented by a CID which is generated by hashing the
  CIDs of the directory's files and their metadata (see the sketch after this
  list).

* Since the root of the filesystem is itself a directory, the entire filesystem
  can be represented by a single CID. By tracking the changing root CID all
  hosts participating in the network filesystem can cheaply identify the latest
  state of the entire filesystem.

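To make that hashing concrete, here's a minimal sketch of how CIDs could be
derived. It assumes SHA-256, hex encoding, and a made-up directory encoding; a
real implementation would presumably use the multihash/CID formats from the
IPFS ecosystem instead.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// CID is a hex-encoded SHA-256 digest. A real implementation would likely use
// the multihash/CID types from the IPFS ecosystem instead.
type CID string

// fileCID hashes a file's raw contents.
func fileCID(contents []byte) CID {
	sum := sha256.Sum256(contents)
	return CID(hex.EncodeToString(sum[:]))
}

// dirEntry is the metadata of a single child of a directory.
type dirEntry struct {
	Name  string
	Mode  uint32
	Child CID
}

// dirCID hashes a directory node by hashing its (sorted) entries' names,
// metadata, and child CIDs.
func dirCID(entries []dirEntry) CID {
	sort.Slice(entries, func(i, j int) bool { return entries[i].Name < entries[j].Name })
	h := sha256.New()
	for _, e := range entries {
		fmt.Fprintf(h, "%s\x00%d\x00%s\x00", e.Name, e.Mode, e.Child)
	}
	return CID(hex.EncodeToString(h.Sum(nil)))
}

func main() {
	a := fileCID([]byte("hello"))
	b := fileCID([]byte("world"))
	root := dirCID([]dirEntry{
		{Name: "a.txt", Mode: 0644, Child: a},
		{Name: "b.txt", Mode: 0644, Child: b},
	})
	// The whole (tiny) filesystem is identified by this one CID.
	fmt.Println("root CID:", root)
}
```
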
A storage system for a Merkle DAG is implemented as a key-value store which maps
each CID to a directory node or to file contents. When nodes in the cluster
communicate about data in the filesystem they will do so using these CIDs; one
node might ask another "can you give me CID `AAA`", and the other would respond
with the contents of `AAA` without really caring about whether that CID points
to a file or a directory node or whatever. It's quite a simple system.

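Whatever ends up backing that key-value store, the interface the rest of
cryptic-fs needs from it is tiny. A rough sketch (the names here are mine, not
from any existing library):

```go
package fs

import "context"

// CID identifies a file's contents or a directory node (see previous sketch).
type CID string

// Blockstore is the key-value store backing the Merkle DAG: it maps CIDs to
// raw bytes without caring whether those bytes are a file or a directory node.
type Blockstore interface {
	// Get returns the contents stored under the given CID.
	Get(ctx context.Context, cid CID) ([]byte, error)

	// Put stores contents and returns the CID they were stored under.
	Put(ctx context.Context, contents []byte) (CID, error)

	// Has reports whether the CID is persisted locally.
	Has(ctx context.Context, cid CID) (bool, error)

	// Delete removes the CID from local storage (subject to the RF checks
	// described later).
	Delete(ctx context.Context, cid CID) error
}
```
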
As far as the actual implementation of the storage component goes, it's very
likely we could re-use some part of the IPFS code-base rather than implementing
this from scratch.

[mdag]: https://docs.ipfs.io/concepts/merkle-dag/

## Consensus

The cluster of nodes needs to (roughly) agree on some things in order to
function:

* What the current root CID of the filesystem is.

* Which nodes have which CIDs persisted.

These are all things which can change rapidly, and which _every_ node in the
cluster will need to stay up-to-date on. On the other hand, given efficient use
of the boolean tagged CIDs mentioned in the previous section, this is a dataset
which could easily fit in memory even for large filesystems.

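For a sense of scale, the whole shared state could be something like the
following in memory, reusing the `CID` type from the Blockstore sketch (the
`NodeID` type and the exact shape are my assumptions):

```go
package fs

import "sync"

// NodeID identifies a node in the cluster (e.g. its nebula/VPN address).
type NodeID string

// ClusterState is the (rough) shared state every node keeps in memory for a
// single tracked filesystem.
type ClusterState struct {
	mu sync.Mutex

	// Root is the latest known root CID of the filesystem.
	Root CID

	// Holders maps each known CID to the set of nodes which have announced
	// that they have it persisted.
	Holders map[CID]map[NodeID]bool
}

// SetRoot records a newly announced root CID.
func (s *ClusterState) SetRoot(root CID) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.Root = root
}

// MarkHolder records that a node has announced it holds a CID.
func (s *ClusterState) MarkHolder(cid CID, node NodeID) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.Holders == nil {
		s.Holders = map[CID]map[NodeID]bool{}
	}
	if s.Holders[cid] == nil {
		s.Holders[cid] = map[NodeID]bool{}
	}
	s.Holders[cid][node] = true
}
```
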
I've done a bunch of research here and I'm having trouble finding anything
existing which fits the bill. Most databases expect the set of nodes to be
pretty constant, so that eliminates most of them. Here are a couple of other
ideas I spitballed:

* Taking advantage of the already written [go-ds-crdt][crdt] package which the
  [IPFS Cluster][ipfscluster] project uses. My biggest concern with this
  project, however, is that the entire history of the CRDT must be stored on
  each node, which in our use-case could be a very long history.

* Just saying fuck it and using a giant redis replica-set, where each node in
  the cluster is a replica and one node is chosen to be the primary. [Redis
  sentinel][sentinel] could be used to decide the current primary. The issue is
  that I don't think sentinel is designed to handle hundreds or thousands of
  nodes, which places a ceiling on cluster capacity. I'm also not confident that
  the primary node could handle hundreds/thousands of replicas syncing from it
  nicely; that's not something Redis likes to do.

* Using a blockchain engine like [Tendermint][tendermint] to implement a custom,
  private blockchain for the cluster. This could work performance-wise, but I
  think it would suffer from the same full-history issue as the CRDT approach.

It seems to me like some kind of WAN-optimized gossip protocol would be the
solution here. Each node already knows which CIDs it itself has persisted, so
what's left is for all nodes to agree on the latest root CID, and to coordinate
who is going to store what long-term.

[crdt]: https://github.com/ipfs/go-ds-crdt
[ipfscluster]: https://cluster.ipfs.io/
[sentinel]: https://redis.io/topics/sentinel
[tendermint]: https://tendermint.com/

### Gossip

The [gossipsub][gossipsub] library which is built into libp2p seems like a good
starting place. It's optimized for WANs and, crucially, is already implemented.

Gossipsub makes use of different topics, onto which peers in the cluster can
publish messages which other peers who are subscribed to those topics will
receive. It makes sense to have a topic-per-filesystem (remember, from the
original requirements, that there can be multiple filesystems being tracked), so
that each node in the cluster can choose for itself which filesystems it cares
to track.

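A rough sketch of what that topic-per-filesystem wiring could look like using
go-libp2p and go-libp2p-pubsub; the topic naming scheme is made up, and the
error handling and message contents are stripped down to the basics:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/libp2p/go-libp2p"
	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

func main() {
	ctx := context.Background()

	// Create a libp2p host (recent go-libp2p; older versions also take a ctx).
	h, err := libp2p.New()
	if err != nil {
		log.Fatal(err)
	}
	defer h.Close()

	ps, err := pubsub.NewGossipSub(ctx, h)
	if err != nil {
		log.Fatal(err)
	}

	// One topic per tracked filesystem; the naming scheme is made up here.
	topic, err := ps.Join("cryptic-fs/my-filesystem")
	if err != nil {
		log.Fatal(err)
	}

	sub, err := topic.Subscribe()
	if err != nil {
		log.Fatal(err)
	}

	// Announce something to everyone tracking this filesystem...
	if err := topic.Publish(ctx, []byte("ROOT?")); err != nil {
		log.Fatal(err)
	}

	// ...and process whatever other nodes publish to it.
	for {
		msg, err := sub.Next(ctx)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("got %q from %s\n", msg.Data, msg.ReceivedFrom)
	}
}
```

The actual cryptic-fs process would presumably run one such subscribe loop per
tracked filesystem.
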
The messages which can get published will be dependent on the different
situations in which nodes will want to communicate, so it's worth enumerating
those.

**Situation #1: Node A wants to obtain a CID**: Node A will send out a
`WHO_HAS:<CID>` message (not the actual syntax) to the topic. Node B (and
possibly others), which has the CID persisted, will respond with `I_HAVE:<CID>`.
The response will be sent directly from B to A, not broadcast over the topic,
since only A cares. The timing of B's response to A could be subject to a delay
based on B's current load, such that another less loaded node might get its
response in first.

From here node A would initiate a download of the CID from B via a direct
connection. If node A has enough space then it will persist the contents of the
CID for the future.

This situation could arise because the user has opened a file in the filesystem
for reading, or has attempted to enumerate the contents of a directory, and the
local storage doesn't already contain that CID.

**Situation #2: Node A wants to delete a CID which it has persisted**: Similar
to #1, Node A needs to first ensure that other nodes have the CID persisted, in
order to maintain the RF across the filesystem. So node A first sends out a
`WHO_HAS:<CID>` message. If >=RF nodes respond with `I_HAVE:<CID>` then node A
can delete the CID from its storage without concern. Otherwise it should not
delete the CID.

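That check might look roughly like the following, reusing the `CID`/`NodeID`
types from the earlier sketches. The `whoHas` helper is hypothetical (it would
publish the `WHO_HAS` message and stream back the direct `I_HAVE` replies), and
the timeout is arbitrary:

```go
package fs

import (
	"context"
	"time"
)

// canSafelyDelete broadcasts WHO_HAS:<cid> and waits (up to a timeout) for
// direct I_HAVE replies. It only reports true once at least rf other nodes
// have claimed to hold the CID.
func canSafelyDelete(
	ctx context.Context,
	whoHas func(context.Context, CID) <-chan NodeID, // hypothetical helper
	cid CID,
	rf int,
) bool {
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	// Broadcast WHO_HAS:<cid> once; responders stream in on this channel.
	replies := whoHas(ctx, cid)

	holders := map[NodeID]bool{}
	for {
		select {
		case node, ok := <-replies:
			if !ok {
				return false // helper gave up before rf holders replied
			}
			holders[node] = true
			if len(holders) >= rf {
				return true // enough other nodes hold the CID; safe to delete locally
			}
		case <-ctx.Done():
			return false // timed out; play it safe and keep the CID
		}
	}
}
```
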
**Situation #2a: Node A wants to delete a CID which it has persisted, and which
is not part of the current filesystem**: If the filesystem is in a state where
the CID in question is no longer present in the system, then node A doesn't need
to care about the RF and therefore doesn't need to send any messages.

**Situation #3: Node A wants to update the filesystem root CID**: This is as
simple as sending out a `ROOT:<CID>` message on the topic. Other nodes will
receive this and note the new root.

**Situation #4: Node A wants to know the current filesystem root CID**: Node A
sends out a `ROOT?` message. Other nodes will respond to node A directly telling
it the current root CID.

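Pulling those situations together, the messages themselves could be as simple
as a small tagged struct serialized onto the topic (or sent directly, for the
replies). The exact wire encoding is an open question; JSON here is just a
placeholder, and the types reuse `CID`/`NodeID` from the earlier sketches:

```go
package fs

import "encoding/json"

// MsgType enumerates the gossip messages described above.
type MsgType string

const (
	MsgWhoHas  MsgType = "WHO_HAS" // broadcast: who has this CID persisted?
	MsgIHave   MsgType = "I_HAVE"  // direct reply: I have this CID persisted
	MsgRoot    MsgType = "ROOT"    // broadcast: the filesystem root is now this CID
	MsgRootAsk MsgType = "ROOT?"   // broadcast: what is the current root CID?
)

// Msg is the envelope published to a filesystem's gossip topic, or sent
// directly between two nodes for replies.
type Msg struct {
	Type MsgType `json:"type"`
	CID  CID     `json:"cid,omitempty"` // empty for ROOT? messages
	From NodeID  `json:"from"`
}

// EncodeMsg serializes a message for publishing; anything compact and
// versionable (protobuf, cbor, ...) would do just as well as JSON.
func EncodeMsg(m Msg) ([]byte, error) {
	return json.Marshal(m)
}

// DecodeMsg parses a message received from the topic or a direct connection.
func DecodeMsg(b []byte) (Msg, error) {
	var m Msg
	err := json.Unmarshal(b, &m)
	return m, err
}
```
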
These describe the circumstances around the messages used across the gossip
protocol in a very shallow way. In order to properly flesh out the behavior of
the consistency mechanism we need to dive in a bit more.

### Optimizations, Replication, and GC

A key optimization worth hitting straight away is to declare that each node will
always immediately persist all directory CIDs whenever a `ROOT:<CID>` message is
received. This will _generally_ only involve a couple of round-trips with the
host which issued the `ROOT:<CID>` message, with opportunity for
parallelization.

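A sketch of that directory prefetch, assuming the `Blockstore`/`CID` pieces
from earlier plus two hypothetical helpers: `fetch`, which retrieves a CID from
the cluster (WHO_HAS plus direct download) and stores it locally, and
`decodeDir`, which parses a stored directory node:

```go
package fs

import "context"

// DirNode is a decoded directory node: each entry points at either another
// directory node or a file's contents.
type DirNode struct {
	Entries []struct {
		Name  string
		IsDir bool
		Child CID
	}
}

// prefetchDirs walks the directory DAG from root, persisting every directory
// node locally while leaving file contents to be fetched on demand.
func prefetchDirs(
	ctx context.Context,
	root CID,
	fetch func(context.Context, CID) ([]byte, error), // hypothetical helper
	decodeDir func([]byte) (DirNode, error), // hypothetical helper
) error {
	raw, err := fetch(ctx, root)
	if err != nil {
		return err
	}
	dir, err := decodeDir(raw)
	if err != nil {
		return err
	}
	for _, e := range dir.Entries {
		if !e.IsDir {
			continue // file contents are only fetched on demand
		}
		// Child directories could be fetched in parallel; kept sequential
		// here for brevity.
		if err := prefetchDirs(ctx, e.Child, fetch, decodeDir); err != nil {
			return err
		}
	}
	return nil
}
```
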
This could be a problem if the directory structure becomes _huge_, at which
point it might be worth placing some kind of limit on what percent of storage is
allowed for directory nodes. But really... just have fewer directories, people!

The next thing to dive in on is replication. We've already covered in situation
#1 what happens if a user specifically requests a file. But that's not enough to
ensure the RF of the entire filesystem, as some files might not be requested by
anyone except the user who originally added them.

We can note that each node knows when a file has been added to the filesystem,
thanks to each node knowing the full directory tree. So upon seeing that a new
file has been added, a node can issue a `WHO_HAS:<CID>` message for it, and if
fewer than RF nodes respond then it can persist the CID itself. This all assumes
that the node has enough space for the new file.

One wrinkle in that plan is that we don't want all nodes to send the
`WHO_HAS:<CID>` at the same time for the same CID, otherwise they'll all end up
downloading the CID and over-replicating it. A solution here is for each node to
delay its `WHO_HAS:<CID>` based on how much space it has left for storage, so
nodes with more free space are more eager to pull in new files.

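One way that delay could be computed is directly from how full the node's
storage is, plus a little jitter to break ties (the constants here are
arbitrary):

```go
package fs

import (
	"math/rand"
	"time"
)

// whoHasDelay returns how long a node should wait before sending WHO_HAS for a
// newly added CID. Fuller nodes wait longer, so emptier nodes tend to win the
// race and pick up the new file; the jitter breaks ties between nodes with
// similar usage. The 30s scale is arbitrary.
func whoHasDelay(usedBytes, capacityBytes uint64) time.Duration {
	usedFrac := float64(usedBytes) / float64(capacityBytes)
	base := time.Duration(usedFrac * float64(30*time.Second))
	jitter := time.Duration(rand.Int63n(int64(2 * time.Second)))
	return base + jitter
}
```

When the delay expires, the node would only proceed with its own
`WHO_HAS:<CID>` (and the subsequent download) if it hasn't already seen >=RF
nodes claim the CID in the meantime.
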
Additionally, we want to have nodes periodically check the replication status of
each CID in the filesystem. This is because nodes might pop in and out of
existence randomly, and the cluster needs to account for that. The way this can
work is that each node periodically picks a CID at random and checks its
replication status. If the period between checks is based on the number of
online nodes in the cluster and the number of CIDs which need to be checked,
then all CIDs will be checked within a reasonable amount of time with minimal
overhead.

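The per-node check period could then be derived from those two numbers and a
target sweep time, something like:

```go
package fs

import "time"

// checkInterval returns how often a single node should pick a random CID and
// verify its replication status, such that the cluster as a whole sweeps every
// CID roughly once per sweepTarget (e.g. 24h). More nodes or fewer CIDs means
// each node can check less often. Since CIDs are picked at random the sweep is
// probabilistic rather than exact, but over time every CID gets looked at.
func checkInterval(numCIDs, numOnlineNodes int, sweepTarget time.Duration) time.Duration {
	if numCIDs == 0 || numOnlineNodes == 0 {
		return sweepTarget
	}
	checksPerNode := (numCIDs + numOnlineNodes - 1) / numOnlineNodes // ceil
	return sweepTarget / time.Duration(checksPerNode)
}
```
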
This dovetails nicely with garbage collection. Given that nodes can flit in and
out of existence, a node might come back from having been down for a time, and
all CIDs it had persisted would then be over-replicated. So the same process
which is checking for under-replicated files will also be checking for
over-replicated files.

### Limitations

This consistency mechanism has a lot of nice properties: it's eventually
consistent, it nicely handles nodes coming in and out of existence without any
coordination between the nodes, and it _should_ be pretty fast for most cases.
However, it has its downsides.

There's definitely room for inconsistency between each node's view of the
filesystem, especially when it comes to the `ROOT:<CID>` messages. If two nodes
issue `ROOT:<CID>` messages at the same time then it's extremely likely nodes
will have a split view of the filesystem, and there's not a great way to resolve
this until another change is made on another node. This is probably the weakest
point of the whole design.

[gossipsub]: https://github.com/libp2p/specs/tree/master/pubsub/gossipsub

## FUSE

The final piece is the FUSE connector for the filesystem, which is how users
actually interact with each filesystem being tracked by their node. This is
actually the easiest component: borrowing an idea from [Tahoe-LAFS][tahoe],
cryptic-fs can expose an SFTP endpoint, and that's it.

The idea is that hooking up an existing SFTP implementation to the rest of
cryptic-fs should be pretty straightforward, and every OS should already have
some kind of mount-SFTP-as-FUSE mechanism, either built in or available as an
existing application. Exposing an SFTP endpoint also allows a user to access the
cryptic-fs remotely if they want to.

[tahoe]: https://tahoe-lafs.org/trac/tahoe-lafs

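If we went with github.com/pkg/sftp's request-server API (and assuming I'm
reading its handler interfaces right), the cryptic-fs side of that hookup might
be little more than four methods, each resolving paths against the current root
CID via the pieces sketched earlier:

```go
package fs

import (
	"errors"
	"io"

	"github.com/pkg/sftp"
)

// crypticFS adapts a tracked filesystem to the handler interfaces used by
// github.com/pkg/sftp's request server (Fileread/Filewrite/Filecmd/Filelist).
type crypticFS struct {
	// Blockstore, cluster state, gossip topic, etc. would live here.
}

func (f *crypticFS) Fileread(r *sftp.Request) (io.ReaderAt, error) {
	// TODO: walk the directory DAG from the current root to r.Filepath, fetch
	// the file's CID if needed (situation #1), and return a reader over it.
	return nil, errors.New("not implemented")
}

func (f *crypticFS) Filewrite(r *sftp.Request) (io.WriterAt, error) {
	// TODO: buffer writes, recompute CIDs up the tree, and announce the new
	// root (situation #3) once the file is closed.
	return nil, errors.New("not implemented")
}

func (f *crypticFS) Filecmd(r *sftp.Request) error {
	// TODO: renames, deletes, mkdirs, etc., all of which also boil down to
	// producing and announcing a new root CID.
	return errors.New("not implemented")
}

func (f *crypticFS) Filelist(r *sftp.Request) (sftp.ListerAt, error) {
	// TODO: decode the directory node at r.Filepath and list its entries.
	return nil, errors.New("not implemented")
}

// A crypticFS instance would then be plugged into pkg/sftp's request server
// (via its Handlers struct) for each incoming SSH "sftp" channel; the SSH
// listener itself could come from golang.org/x/crypto/ssh.
```
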
## Ok

So all that said, clearly the hard part is the consistency mechanism. It's not
even fully developed in this document, but it's almost there. The next step,
beyond polishing up the consistency mechanism, is going to be roughly figuring
out all the interfaces and types involved in the implementation, planning out
how those will all interact with each other, and then finally an actual
implementation!