Yggdrasil: Branching Protocols
What if every storage system spoke the same branching language? Yggdrasil is a protocol stack that brings Git-like semantics (snapshots, branches, merges, history) to heterogeneous storage backends.
In Norse mythology, Yggdrasil is the World Tree connecting nine realms. This library connects storage systems under one unified API.
The problem
Modern data systems are fragmented. Your vector index, your database, your filesystem, your container images - each has its own versioning model (or none at all). When you need reproducible pipelines across these systems, you’re left stitching together incompatible abstractions.
Consider an ML training pipeline:
- Datasets versioned in LakeFS
- Model weights on a filesystem
- Embeddings in a vector store
- Metadata in a database
Each system has different semantics for “create a snapshot” or “roll back to yesterday.” Coordinating them requires custom glue code that’s brittle and hard to reason about.
The solution: shared protocols
Yggdrasil defines a layered protocol stack that any storage system can implement. All operations use value semantics: a mutating operation returns a new system value and never modifies the existing one in place.
| Protocol | Operations | Notes |
|---|---|---|
| Snapshotable | snapshot-id, parent-ids, as-of, snapshot-meta | |
| Branchable | branches, branch!, checkout, delete-branch! | |
| Graphable | history, ancestors, common-ancestor, commit-graph | |
| Mergeable | merge!, conflicts, diff | |
| Overlayable | overlay, advance!, merge-down!, discard! | |
| Watchable | watch!, unwatch! | Receives typed events on commit, branch, checkout |
| GarbageCollectable | gc-roots, gc-sweep! | Coordinated cross-system mark-and-sweep |
| Addressable (optional) | working-path | Filesystem path for current branch (Git, ZFS, Btrfs, OverlayFS) |
| Committable (optional) | commit! | Explicit commit, separated from snapshot reads |
When multiple systems implement these protocols, you can compose them. Fork a database and a vector index together. Merge changes across both atomically. Query historical state consistently.
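To make the value-semantics point concrete, here is a minimal sketch of an in-memory system. The protocols (ToySnapshotable, ToyBranchable) and the MemorySystem record are invented for illustration and are not the definitions in yggdrasil.protocols; they only show how a "mutating" operation can return a new value while the original stays untouched.

```clojure
;; Toy protocols for illustration only; NOT the real yggdrasil.protocols.
(defprotocol ToySnapshotable
  (snapshot-id [sys] "Snapshot id at the head of the current branch."))

(defprotocol ToyBranchable
  (branches [sys] "Set of known branch names.")
  (branch! [sys branch-name] "Return a new system with `branch-name` forked from the current head.")
  (checkout [sys branch-name] "Return a new system whose current branch is `branch-name`."))

;; Hypothetical in-memory system: `heads` maps branch name -> snapshot id.
(defrecord MemorySystem [current heads]
  ToySnapshotable
  (snapshot-id [_] (get heads current))
  ToyBranchable
  (branches [_] (set (keys heads)))
  (branch! [sys branch-name]
    ;; assoc-in returns a new record; `sys` itself is not changed
    (assoc-in sys [:heads branch-name] (get heads current)))
  (checkout [sys branch-name]
    (if (contains? heads branch-name)
      (assoc sys :current branch-name)
      (throw (ex-info "Unknown branch" {:branch branch-name})))))

(def main-sys (->MemorySystem :main {:main "snap-0"}))

(def forked (-> main-sys
                (branch! :experiment)
                (checkout :experiment)))
```

After this runs, main-sys still sees only :main, while forked sees both branches and sits on :experiment. That is the property that makes it safe to hold on to an old system value while experimenting on a new one.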
Twelve adapters
| Adapter | System | Branching model |
|---|---|---|
| Git | Version control | Native branches/commits |
| ZFS | Filesystem | Snapshots + clones |
| Btrfs | Filesystem | Subvolumes + snapshots |
| OverlayFS | Filesystem | Layered directories |
| Podman | Containers | Image layers |
| IPFS | P2P storage | Content-addressed commits + IPNS branches |
| Iceberg | Table format | Snapshots + native branches |
| Datahike | Database | Native COW |
| LakeFS | Data lake | Git-like branches |
| Dolt | SQL database | Git-like branches |
| Scriptum | Full-text search | Lucene segment sharing |
| Proximum | Vector search | Merkle-verified snapshots |
CompositeSystem: branching multiple systems as one
The most significant recent addition is CompositeSystem - a fiber product (pullback) over shared branch space. Given systems A and B, the composite is the pair (A, B) where both are always on the same branch. All protocol operations apply componentwise.
(require '[yggdrasil.composite :as composite]
         '[yggdrasil.protocols :as p])
;; Compose a database and a search index
(def sys (composite/composite [datahike-sys scriptum-sys]
                              :name "my-app"
                              :branch :main
                              :store-path "/var/lib/yggdrasil/composite")) ; optional persistence
;; All protocol operations work on both systems simultaneously
(def branched (-> sys
                  (p/branch! :experiment)
                  (p/checkout :experiment)))
;; Commit both atomically - gets a deterministic composite snapshot-id
(def committed (p/commit! branched "experimental run"))
;; Merge back
(def merged (-> committed
                (p/checkout :main)
                (p/merge! :experiment)))
snapshot-id on a composite returns a deterministic UUID derived from the combined state of all sub-systems - the same combination always yields the same ID. History, conflicts, and GC roots are all computed across the full set.
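The post does not say how the composite ID is derived, so the following is only one plausible sketch of the idea: canonicalize the sub-system snapshot IDs and hash them into a name-based UUID. The function name composite-snapshot-id and the input shape (a map of system name to snapshot ID) are assumptions made for this example.

```clojure
(import 'java.util.UUID)

(defn composite-snapshot-id
  "Derive a name-based UUID from a map of sub-system name -> snapshot id.
  Hypothetical helper, not the library's implementation. Sorting the entries
  first makes the result independent of map ordering."
  [snapshot-ids]
  (let [canonical (pr-str (into (sorted-map) snapshot-ids))]
    (UUID/nameUUIDFromBytes (.getBytes ^String canonical "UTF-8"))))

;; Same combination, different insertion order
(def id-a (composite-snapshot-id {"datahike" "snap-1" "scriptum" "snap-7"}))
(def id-b (composite-snapshot-id {"scriptum" "snap-7" "datahike" "snap-1"}))

;; One sub-system has moved on
(def id-c (composite-snapshot-id {"datahike" "snap-2" "scriptum" "snap-7"}))
```

Here id-a and id-b are equal because the combined state is the same, while id-c differs because one component changed. Any stable hash over a canonical encoding would give the same property.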
Passing :store-path persists the composite history via a PSS B-tree backed by konserve, so history survives process restarts.
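The "componentwise" behavior described for CompositeSystem can be illustrated with plain maps. This is a toy model, not the yggdrasil.composite implementation: each sub-system is a map with hypothetical :current and :heads keys, and composite-apply, toy-branch, and toy-checkout are names invented for this sketch.

```clojure
(defn toy-branch
  "Return a new toy system with `branch-name` forked from the current head."
  [sys branch-name]
  (assoc-in sys [:heads branch-name] (get-in sys [:heads (:current sys)])))

(defn toy-checkout
  "Return a new toy system switched to `branch-name`."
  [sys branch-name]
  {:pre [(contains? (:heads sys) branch-name)]}
  (assoc sys :current branch-name))

(defn composite-apply
  "Apply `op` with `args` to every sub-system, returning a new composite value."
  [composite op & args]
  (update composite :systems
          (fn [systems] (mapv #(apply op % args) systems))))

(defn same-branch?
  "True when every sub-system is on the same branch."
  [composite]
  (apply = (map :current (:systems composite))))

(def toy-composite
  {:systems [{:current :main :heads {:main "db-0"}}
             {:current :main :heads {:main "idx-0"}}]})

(def toy-branched
  (-> toy-composite
      (composite-apply toy-branch :experiment)
      (composite-apply toy-checkout :experiment)))
```

Because every operation goes through composite-apply, the sub-systems cannot drift onto different branches, which is the invariant the pullback description expresses.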
Workspace: HLC-coordinated multi-system operations
The workspace layer adds Hybrid Logical Clock (HLC) coordination across independently managed systems. This enables temporal queries that span system boundaries.
(require '[yggdrasil.workspace :as ws])
(def w (ws/create-workspace {:store-path "/var/lib/yggdrasil/my-app"}))
;; Manage systems - auto-installs commit hooks
;; Datahike uses d/listen (immediate); others fall back to polling
(ws/manage! w datahike-sys)
(ws/manage! w git-sys)
;; Query world state at any wall-clock time
(let [world (ws/as-of-time w (.getTime some-past-date))]
  (doseq [[[system-id branch] entry] world]
    (println system-id branch "was at snapshot" (:snapshot-id entry))))
Each commit in the registry carries an HLC timestamp. as-of-time scans that index and returns, for a given moment, the snapshot each managed system was at, so the answer is consistent across all of them.
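A rough sketch of the two ideas in play, under stated assumptions: an HLC timestamp is modeled as a [physical-ms logical] pair, hlc-tick applies the standard HLC rule for a local event, and as-of picks the latest entry per [system-id branch] at or before a wall-clock time. The registry shape and both function names are hypothetical; the real registry format is not shown in this post.

```clojure
(defn hlc-tick
  "Advance an HLC [physical logical] for a local event at wall-clock `now-ms`.
  If the wall clock has not moved past the HLC, bump the logical counter."
  [[physical logical] now-ms]
  (if (> now-ms physical)
    [now-ms 0]
    [physical (inc logical)]))

;; Hypothetical registry entries, one per commit.
(def registry
  [{:system-id "datahike" :branch :main :hlc [1000 0] :snapshot-id "d1"}
   {:system-id "git"      :branch :main :hlc [1000 1] :snapshot-id "g1"}
   {:system-id "datahike" :branch :main :hlc [2000 0] :snapshot-id "d2"}
   {:system-id "git"      :branch :main :hlc [3000 0] :snapshot-id "g2"}])

(defn as-of
  "Map of [system-id branch] -> latest entry whose HLC physical time is <= `t-ms`."
  [entries t-ms]
  (->> entries
       (filter (fn [{[physical _] :hlc}] (<= physical t-ms)))
       (group-by (juxt :system-id :branch))
       (reduce-kv (fn [world k es]
                    ;; vectors compare element-wise, so [physical logical] sorts in HLC order
                    (assoc world k (last (sort-by :hlc es))))
                  {})))

(def world-at-2500 (as-of registry 2500))
```

In this toy data, a query at 2500 ms sees Datahike at its second commit and Git still at its first, since Git's second commit happens later. The logical counter is what orders the two commits that share the physical time 1000.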
Typed diffs
diff returns system-specific records:
;; GitDiff: {:snapshot-a, :snapshot-b, :stat, :patch, :files [{:status :added/:modified/:deleted, :path}]}
;; DatahikeDiff: {:from, :to, :added [datoms], :removed [datoms], :summary {:added-datoms n, ...}}
;; DiffError: {:from, :to, :error}
Callers can pattern-match on record type for system-specific handling.
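One way to dispatch on record type is a multimethod keyed on class. The records below are stand-ins defined locally with the field names listed above; they are not the library's own record definitions, and describe-diff is a hypothetical helper.

```clojure
;; Local stand-ins mirroring the documented fields; not the library's records.
(defrecord GitDiff [snapshot-a snapshot-b stat patch files])
(defrecord DatahikeDiff [from to added removed summary])
(defrecord DiffError [from to error])

(defmulti describe-diff
  "Produce a one-line, system-specific summary of a diff record."
  class)

(defmethod describe-diff GitDiff [d]
  (str (count (:files d)) " file(s) changed"))

(defmethod describe-diff DatahikeDiff [d]
  (str "+" (count (:added d)) " / -" (count (:removed d)) " datoms"))

(defmethod describe-diff DiffError [d]
  (str "diff failed: " (:error d)))

(def git-summary
  (describe-diff (->GitDiff "a" "b" nil nil [{:status :added :path "README.md"}])))

(def error-summary
  (describe-diff (->DiffError "a" "b" "backend unavailable")))
```

A protocol extended to each record type would work equally well; the multimethod keeps the handling code outside the record definitions, which is convenient when the records come from a library.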
Compliance testing
Every adapter passes the same compliance test suite, so the behavior the suite covers is consistent across all systems.
(compliance/run-compliance-tests
 {:create-system (fn [] (my-adapter/init! config))
  :mutate        (fn [sys] ...)
  :commit        (fn [sys msg] ...)
  :close!        (fn [sys] ...)})
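To give a feel for what a backend-agnostic check might look like, here is a hedged sketch of a single invariant written against a fixture map. It is not taken from the actual suite: commit-advances-snapshot? is a made-up check, the :snapshot-id fixture key is an assumption added for this example, and the toy fixture is a plain in-memory map.

```clojure
(defn commit-advances-snapshot?
  "True when committing after a mutation yields a different snapshot id
  and leaves the original system value's snapshot id unchanged."
  [{:keys [create-system mutate commit snapshot-id close!]}]
  (let [sys    (create-system)
        before (snapshot-id sys)
        after  (-> sys mutate (commit "compliance check"))]
    (try
      (and (not= before (snapshot-id after))
           (= before (snapshot-id sys)))
      (finally (close! after)))))

;; Toy in-memory adapter: snapshots are just a growing vector of ids.
(def toy-fixture
  {:create-system (fn [] {:data {} :snapshots ["s0"] :messages []})
   :mutate        (fn [sys] (assoc-in sys [:data :k] :v))
   :commit        (fn [sys msg]
                    (-> sys
                        (update :messages conj msg)
                        (update :snapshots conj (str "s" (count (:snapshots sys))))))
   :snapshot-id   (fn [sys] (peek (:snapshots sys)))
   :close!        (fn [_] nil)})

(def toy-passes? (commit-advances-snapshot? toy-fixture))
```

Because the check only talks to the system through the fixture functions, the same check can run against any adapter that supplies them.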
Why this matters
The practical value comes from being able to treat heterogeneous systems as one versioned unit. An ML pipeline can version its datasets, model weights, and embeddings together under one composite snapshot, making any training run fully reproducible. An agent system can fork its complete environment - database, vector store, working directory - per agent, merge successful experiments back, and discard failures without cleanup. A test suite can fork production state across all systems in milliseconds.
The as-of-time query is particularly useful for audits: it answers “what was the exact state of every system when this decision was made?” across heterogeneous backends, with causal ordering.
Getting started
See the GitHub repository for installation and adapter-specific setup. Licensed under Apache 2.0.
Part of the Datahike ecosystem
Yggdrasil is the protocol layer that connects:
- Datahike - Immutable Datalog database
- Proximum - Version-controlled vector search
- Scriptum - Branching for Apache Lucene
- Stratum - Columnar SQL with CoW snapshots
Yggdrasil provides the shared vocabulary that lets these systems branch together.