Yggdrasil: Branching Protocols

What if every storage system spoke the same branching language? Yggdrasil is a protocol stack that brings Git-like semantics (snapshots, branches, merges, history) to heterogeneous storage backends.

In Norse mythology, Yggdrasil is the World Tree connecting nine realms. This library connects storage systems under one unified API.

The problem

Modern data systems are fragmented. Your vector index, your database, your filesystem, your container images - each has its own versioning model (or none at all). When you need reproducible pipelines across these systems, you’re left stitching together incompatible abstractions.

Consider an ML training pipeline that spans several of these systems.

Each system has different semantics for “create a snapshot” or “roll back to yesterday.” Coordinating them requires custom glue code that’s brittle and hard to reason about.

The solution: shared protocols

Yggdrasil defines a layered protocol stack that any storage system can implement. All operations use value semantics: a mutating operation returns a new system value and never modifies the existing one in place.

| Protocol | Operations |
| --- | --- |
| Snapshotable | snapshot-id, parent-ids, as-of, snapshot-meta |
| Branchable | branches, branch!, checkout, delete-branch! |
| Graphable | history, ancestors, common-ancestor, commit-graph |
| Mergeable | merge!, conflicts, diff |
| Overlayable | overlay, advance!, merge-down!, discard! |
| Watchable | watch!, unwatch! - receives typed events on commit, branch, checkout |
| GarbageCollectable | gc-roots, gc-sweep! - coordinated cross-system mark-and-sweep |
| Addressable (optional) | working-path - filesystem path for current branch (Git, ZFS, Btrfs, OverlayFS) |
| Committable (optional) | commit! - explicit commit, separated from snapshot reads |

When multiple systems implement these protocols, you can compose them. Fork a database and a vector index together. Merge changes across both atomically. Query historical state consistently.
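To make value semantics and componentwise composition concrete, here is a minimal sketch. It does not use Yggdrasil's actual protocols: ToyBranchable, ToySystem, toy-branch, toy-checkout, and fork-all are hypothetical names invented for illustration, and the "systems" are plain in-memory records.

```clojure
;; Hypothetical two-operation protocol, for illustration only.
;; It is NOT Yggdrasil's Branchable protocol.
(defprotocol ToyBranchable
  (toy-branch [sys branch-name] "Return a new system value with an extra branch.")
  (toy-checkout [sys branch-name] "Return a new system value on the given branch."))

;; A toy system: the current branch plus a map of branch -> snapshot id.
(defrecord ToySystem [current branches]
  ToyBranchable
  (toy-branch [sys branch-name]
    ;; The new branch starts at the snapshot of the current branch.
    (assoc-in sys [:branches branch-name] (get branches current)))
  (toy-checkout [sys branch-name]
    (if (contains? branches branch-name)
      (assoc sys :current branch-name)
      (throw (ex-info "Unknown branch" {:branch branch-name})))))

;; Composition: apply the same operations to every sub-system.
(defn fork-all [systems branch-name]
  (mapv #(-> % (toy-branch branch-name) (toy-checkout branch-name)) systems))

(def db    (->ToySystem :main {:main "db-snap-1"}))
(def index (->ToySystem :main {:main "idx-snap-1"}))

;; Both forked values are on :experiment; db and index are unchanged.
(def forked (fork-all [db index] :experiment))
```

Because every operation returns a new value, the original db and index stay on :main, so abandoning the fork amounts to dropping the reference to forked.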

Twelve adapters

| Adapter | System | Branching model |
| --- | --- | --- |
| Git | Version control | Native branches/commits |
| ZFS | Filesystem | Snapshots + clones |
| Btrfs | Filesystem | Subvolumes + snapshots |
| OverlayFS | Filesystem | Layered directories |
| Podman | Containers | Image layers |
| IPFS | P2P storage | Content-addressed commits + IPNS branches |
| Iceberg | Table format | Snapshots + native branches |
| Datahike | Database | Native COW |
| LakeFS | Data lake | Git-like branches |
| Dolt | SQL database | Git-like branches |
| Scriptum | Full-text search | Lucene segment sharing |
| Proximum | Vector search | Merkle-verified snapshots |

CompositeSystem: branching multiple systems as one

The most significant recent addition is CompositeSystem - a fiber product (pullback) over shared branch space. Given systems A and B, the composite is the pair (A, B) where both are always on the same branch. All protocol operations apply componentwise.

(require '[yggdrasil.composite :as composite]
         '[yggdrasil.protocols :as p])

;; Compose a database and a search index
(def sys (composite/composite [datahike-sys scriptum-sys]
           :name "my-app"
           :branch :main
           :store-path "/var/lib/yggdrasil/composite"))  ; optional persistence

;; All protocol operations work on both systems simultaneously
(def branched (-> sys
                  (p/branch! :experiment)
                  (p/checkout :experiment)))

;; Commit both atomically - gets a deterministic composite snapshot-id
(def committed (p/commit! branched "experimental run"))

;; Merge back
(def merged (-> committed
                (p/checkout :main)
                (p/merge! :experiment)))

snapshot-id on a composite returns a deterministic UUID derived from the combined state of all sub-systems - the same combination always yields the same ID. History, conflicts, and GC roots are all computed across the full set.
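One way to get an ID with this property is a name-based UUID over a canonical encoding of the sub-system snapshot IDs. The sketch below shows that idea only; how Yggdrasil actually derives its composite IDs is not stated here, and composite-id and the sample snapshot names are assumptions made for illustration.

```clojure
(import '(java.util UUID)
        '(java.nio.charset StandardCharsets))

(defn composite-id
  "Derive a name-based (version 3) UUID from a map of system name -> snapshot id.
  Copying the entries into a sorted map first makes the result independent
  of the order in which the sub-systems were listed."
  [sub-snapshots]
  (let [canonical (pr-str (into (sorted-map) sub-snapshots))]
    (UUID/nameUUIDFromBytes
     (.getBytes ^String canonical StandardCharsets/UTF_8))))

;; Same combination, different listing order: same ID.
(def a (composite-id {"datahike" "snap-17" "scriptum" "snap-4"}))
(def b (composite-id {"scriptum" "snap-4" "datahike" "snap-17"}))

;; One sub-system moved on: different ID.
(def c (composite-id {"datahike" "snap-18" "scriptum" "snap-4"}))
```

The canonical ordering step matters: without it, two processes that list the same sub-systems in a different order could compute different IDs for the same combined state.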

Passing :store-path persists the composite history in a persistent sorted set (PSS) B-tree backed by konserve, so history survives process restarts.

Workspace: HLC-coordinated multi-system operations

The workspace layer adds Hybrid Logical Clock (HLC) coordination across independently managed systems. This enables temporal queries that span system boundaries.

(require '[yggdrasil.workspace :as ws])

(def w (ws/create-workspace {:store-path "/var/lib/yggdrasil/my-app"}))

;; Manage systems - auto-installs commit hooks
;; Datahike uses d/listen (immediate); others fall back to polling
(ws/manage! w datahike-sys)
(ws/manage! w git-sys)

;; Query world state at any wall-clock time
(let [world (ws/as-of-time w (.getTime some-past-date))]
  (doseq [[[system-id branch] entry] world]
    (println system-id branch "was at snapshot" (:snapshot-id entry))))

Each commit in the registry carries an HLC timestamp. as-of-time scans the index and returns the snapshot each system was at for any given moment - across all managed systems consistently.
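The mechanics can be sketched in a few lines. This is a simplified model and not the workspace implementation: it represents an HLC timestamp as a [wall-millis counter] vector, keeps the registry as an in-memory sequence sorted by timestamp, and uses hypothetical names (hlc-tick, hlc-receive, as-of, log) throughout.

```clojure
;; An HLC timestamp is modelled as [wall-millis counter]. Vectors of equal
;; length compare element by element, so `compare` orders them directly.

(defn hlc-tick
  "Timestamp for a local event, given the previous timestamp and the
  current physical clock reading in milliseconds."
  [[l c] now]
  (if (> now l) [now 0] [l (inc c)]))

(defn hlc-receive
  "Timestamp after observing a remote timestamp [rl rc]."
  [[l c] [rl rc] now]
  (let [new-l (max l rl now)]
    (cond
      (= new-l l rl) [new-l (inc (max c rc))]
      (= new-l l)    [new-l (inc c)]
      (= new-l rl)   [new-l (inc rc)]
      :else          [new-l 0])))

(defn as-of
  "Latest entry per [system-id branch] whose :hlc is at or before `hlc`.
  `log` must be sorted by :hlc in ascending order."
  [log hlc]
  (reduce (fn [world {:keys [system-id branch] :as entry}]
            (assoc world [system-id branch] entry))
          {}
          (take-while #(not (pos? (compare (:hlc %) hlc))) log)))

;; Toy registry with made-up snapshot ids.
(def log
  [{:system-id "datahike" :branch :main :snapshot-id "d1" :hlc [1000 0]}
   {:system-id "git"      :branch :main :snapshot-id "g1" :hlc [1000 1]}
   {:system-id "datahike" :branch :main :snapshot-id "d2" :hlc [2000 0]}
   {:system-id "git"      :branch :main :snapshot-id "g2" :hlc [3000 0]}])

;; Everything committed up to wall-clock time 2500, whatever the counter.
(def world (as-of log [2500 Long/MAX_VALUE]))
```

The counter is what lets two commits with the same wall-clock millisecond still be ordered, and the receive rule keeps a timestamp from ever moving backwards when a machine's physical clock lags behind a peer's.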

Typed diffs

diff returns system-specific records:

;; GitDiff: {:snapshot-a, :snapshot-b, :stat, :patch, :files [{:status :added/:modified/:deleted, :path}]}
;; DatahikeDiff: {:from, :to, :added [datoms], :removed [datoms], :summary {:added-datoms n, ...}}
;; DiffError: {:from, :to, :error}

Callers can pattern-match on record type for system-specific handling.
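As an illustration of dispatching on record type, the sketch below defines local stand-in records with the field names listed above and a multimethod keyed on class. The record definitions and summarize-diff are assumptions for this example; in real code you would refer to the records the library exports instead of redefining them.

```clojure
;; Local stand-ins that mirror the field names shown above.
(defrecord GitDiff [snapshot-a snapshot-b stat patch files])
(defrecord DatahikeDiff [from to added removed summary])
(defrecord DiffError [from to error])

;; Hypothetical helper: one-line summary, dispatched on the record's class.
(defmulti summarize-diff class)

(defmethod summarize-diff GitDiff [d]
  (str (count (:files d)) " file(s) changed"))

(defmethod summarize-diff DatahikeDiff [d]
  (str "+" (count (:added d)) " / -" (count (:removed d)) " datoms"))

(defmethod summarize-diff DiffError [d]
  (str "diff failed: " (:error d)))

;; Fallback for diff records from adapters this code does not know about.
(defmethod summarize-diff :default [d]
  (str "unhandled diff type: " (pr-str (type d))))

(def git-summary
  (summarize-diff (map->GitDiff {:files [{:status :added :path "a.txt"}]})))
```

A multimethod keeps the handling open for extension: supporting another adapter's diff record means adding one defmethod, with no change to existing callers.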

Compliance testing

Every adapter passes the same compliance test suite, so the behavior the suite covers is consistent across all systems.

(compliance/run-compliance-tests
  {:create-system (fn [] (my-adapter/init! config))
   :mutate        (fn [sys] ...)
   :commit        (fn [sys msg] ...)
   :close!        (fn [sys] ...)})
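To show how a single check can run unchanged against any adapter described by such a fixture map, here is a small self-contained sketch. It is not the real compliance suite: check-commit-advances-snapshot and toy-fixture are hypothetical, the adapter is an in-memory map, and the :snapshot-id key is an assumption added so the example does not depend on the library's protocols.

```clojure
(defn check-commit-advances-snapshot
  "Generic check for any adapter described by a fixture map. Returns true
  when committing after a mutation yields a new snapshot id while the
  original system value still reports the old one."
  [{:keys [create-system mutate commit close! snapshot-id]}]
  (let [sys    (create-system)
        before (snapshot-id sys)
        after  (-> sys mutate (commit "compliance: mutate then commit"))]
    (try
      (and (not= before (snapshot-id after))
           ;; value semantics: the original value is untouched
           (= before (snapshot-id sys)))
      (finally (close! after)))))

;; A toy in-memory adapter: pending changes plus a vector of commits.
(def toy-fixture
  {:create-system (fn [] {:pending [] :commits []})
   :mutate        (fn [sys] (update sys :pending conj :change))
   :commit        (fn [sys msg]
                    (-> sys
                        (update :commits conj {:msg msg :data (:pending sys)})
                        (assoc :pending [])))
   ;; Hypothetical key: derives an id from the commit history.
   :snapshot-id   (fn [sys] (hash (:commits sys)))
   :close!        (fn [_sys] nil)})

(def toy-passes? (check-commit-advances-snapshot toy-fixture))
```

The check only talks to the adapter through the functions in the map, which is what lets one suite cover backends as different as a filesystem and a database.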

Why this matters

The practical value comes from being able to treat heterogeneous systems as one versioned unit. An ML pipeline can version its datasets, model weights, and embeddings together under one composite snapshot, making any training run fully reproducible. An agent system can fork its complete environment - database, vector store, working directory - per agent, merge successful experiments back, and discard failures without cleanup. A test suite can fork production state across all systems in milliseconds.

The as-of-time query is particularly useful for audits. It answers “what was the exact state of every system when this decision was made?” across heterogeneous backends, with HLC timestamps providing a causally consistent order.

Getting started

See the GitHub repository for installation and adapter-specific setup. Licensed under Apache 2.0.

Part of the Datahike ecosystem

Yggdrasil is the protocol layer that connects the systems in the Datahike ecosystem. It provides the shared vocabulary that lets these systems branch together.