Branches as Values, Merges as Queries
May 2026
Snapshotting via copy-on-write is a well-trodden idea. ZFS and btrfs do it at the filesystem block layer; Neon and Aurora do it at the database page layer; Datomic and Datahike do it at the data-model layer. What differs is where the immutability lives, and that determines what you can do with the snapshots once you have them.
In Datahike, the database value itself is immutable. A datom never mutates; a query is always against a specific commit; a branch is a database value you can hand to a function. That last property changes the calculus in three ways.
First, branching is the same primitive as every other transaction. There’s no special bulk-load path, no restore mode, no control-plane operation — just a couple of small writes to storage.
Second, branches are database values you can pass to a query. The same query interface that reads the head of :db reads any historical commit on any branch. No special “as-of” mode, no separate replica.
Third, merging becomes a query. ZFS can clone a snapshot but can’t merge two of them — a filesystem doesn’t understand its own contents well enough to resolve a conflict. Datahike does: branches are database values, Datalog queries take multiple databases as inputs, so “what’s in :feature and not in :db” is a query you write. Filtering, transformation, and conflict resolution are all the same language you query the database with.
The rest of this post walks through the datahike.versioning API in order, with a brief note at the end on how the same surface shows up in the other bindings.
The storage model
A Datahike database is a persistent sorted set of datoms — five-tuples of [entity attribute value transaction op]. The storage layer is persistent-sorted-set, a B-tree-based immutable data structure designed for on-disk storage of sorted runs of datoms.
What matters for branching is the persistence property: every node is immutable. A transaction that adds, retracts, or modifies datoms walks from root to leaf, creates new nodes along the changed path, and leaves the unchanged subtrees pointing at exactly the same nodes as the prior snapshot. Both the old and new trees are valid; both are queryable; the new tree’s root is the only thing the system needs to know about to read it.
This is the same idea behind Clojure’s persistent vectors and Git’s object store. Datomic introduced it to databases in 2012; Datahike is the open-source descendant. Sharing is at the level of tree nodes: with a branching factor of 512, the tree stays shallow even for very large databases, and a transaction rewrites only the leaf and the few internal nodes on its path. Every other subtree is shared by pointer with the previous snapshot.
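The sharing property is easy to see with Clojure's own persistent maps. The sketch below is a toy stand-in, not Datahike's index: the nested map plays the role of the tree, and the names snapshot-1 and snapshot-2 are made up for illustration.

```clojure
;; Toy illustration with plain Clojure maps, not Datahike's actual index.
(def snapshot-1
  {:left  {:a 1 :b 2}
   :right {:c 3 :d 4}})

;; "Transact" a change under :right. Only the nodes on the changed
;; path are rebuilt; everything else is reused.
(def snapshot-2 (assoc-in snapshot-1 [:right :d] 5))

;; Both snapshots stay valid and readable.
(get-in snapshot-1 [:right :d]) ;; => 4
(get-in snapshot-2 [:right :d]) ;; => 5

;; The unchanged subtree is the very same object in both snapshots.
(identical? (:left snapshot-1) (:left snapshot-2))   ;; => true
;; The changed subtree is a new object.
(identical? (:right snapshot-1) (:right snapshot-2)) ;; => false
```

The same reasoning scales to a wide, shallow B-tree: one transaction replaces a root-to-leaf path and shares the rest.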
Each node is content-addressable — its key in konserve (the storage abstraction) is derived from its contents. konserve maps the same protocol over filesystems, S3, JDBC databases, IndexedDB in browsers, and others. A node written once is never rewritten. The only thing that ever changes is a small map at a well-known key listing the root pointers for the indices in the current snapshot. That map is a commit. A branch is a named pointer at a commit, registered in a :branches set under a known key.
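A minimal sketch of that layout, using a plain map as the store. Everything here is hypothetical: content-key, put-node, and the :eavt-root key are illustrative names, and hashing the printed form with a name-based UUID is an assumption made for the example, not the scheme Datahike or konserve actually use.

```clojure
(import '(java.util UUID))

;; Hypothetical helper: derive a storage key from a value's printed form.
;; Datahike and konserve use their own hashing; this only shows the idea.
(defn content-key [value]
  (UUID/nameUUIDFromBytes (.getBytes (pr-str value) "UTF-8")))

(defn put-node
  "Store `node` under its content-derived key. Writing the same contents
  twice lands on the same key, so nothing is rewritten in place."
  [store node]
  (assoc store (content-key node) node))

(def leaf [[1 :widget/sku "A"] [2 :widget/sku "B"]])

(def store-1 (put-node {} leaf))
(def store-2 (put-node store-1 leaf)) ; same contents again: no new entry

;; A commit is a small map of root pointers; a branch names a commit.
(def commit   {:eavt-root (content-key leaf)})
(def store-3  (put-node store-2 commit))
(def branches {:db (content-key commit)})

;; Resolve :db -> commit -> root node; this returns `leaf`.
(get store-3 (:eavt-root (get store-3 (:db branches))))
```

Reading a branch is two lookups: name to commit, commit to root. Everything below the root is immutable and shared.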
Creating a branch
(require '[datahike.api :as d])
(d/branch! conn :db :feature)
The system reads the commit-id currently at :db, verifies it points at a real commit, writes a new key mapping :feature → <commit-id>, and updates the :branches set to include :feature. Two key writes in the simple case — plus a CoW-branch operation for any attached secondary index (Lucene full-text, vector indices) that implements the branching protocol.
Wall-clock time depends almost entirely on the storage backend:
- In-memory — sub-millisecond.
- Local filesystem — a few milliseconds, dominated by fsync.
- S3 — 10–100 ms, dominated by the network round-trip; the payloads are tiny.
No tree nodes are copied. :feature and :db reach through the same physical objects in storage. A million-datom branch costs nothing extra at fork time, and a hundred branches are still a hundred small writes — not a hundred database copies.
If the source doesn’t exist, branch! raises :from-branch-does-not-point-to-existing-branch-or-commit. If the target name is already taken, it raises :branch-already-exists. Both are explicit; you don’t get silent overwrites.
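To make the "two small writes" concrete, here is a toy model of the fork over a plain map. It is a sketch under stated assumptions: toy-branch and the store layout are hypothetical, the toy only accepts a branch name as the source, and the ex-data shape is invented for the example. Only the two error keywords come from the description above.

```clojure
;; Toy key-value layout; all names are hypothetical.
(def store
  {:branches  #{:db}
   :db        "commit-1"                      ; branch name -> commit-id
   "commit-1" {:roots {:eavt "node-a"}}})     ; commit-id  -> commit map

(defn toy-branch
  "Return a new store in which `to` points at the commit `from` points at."
  [store from to]
  (let [commit-id (get store from)]
    (cond
      ;; The source must resolve to a commit that is really in storage.
      (not (contains? store commit-id))
      (throw (ex-info "source does not resolve to a commit"
                      {:type :from-branch-does-not-point-to-existing-branch-or-commit
                       :from from}))

      ;; No silent overwrite of an existing branch.
      (contains? (:branches store) to)
      (throw (ex-info "branch already exists"
                      {:type :branch-already-exists :branch to}))

      :else
      (-> store
          (assoc to commit-id)             ; write 1: the new branch pointer
          (update :branches conj to)))))   ; write 2: register the name

(def forked (toy-branch store :db :feature))
(:feature forked)  ;; => "commit-1"
(:branches forked) ;; => #{:db :feature}
```

Note what is absent: the commit map and the nodes it points at are never touched, which is why the cost of a fork does not grow with database size.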
Reading from a branch
Branches are first-class. You read them by name (branch-as-db), by commit-id (commit-as-db), or by holding a connection that was opened with a :branch in its config.
(def feature-db (d/branch-as-db conn :feature))
(def main-db (d/branch-as-db conn :db))
(d/q '[:find ?e :where [?e :widget/sku]] feature-db)
;; Or pin to a specific historical commit by UUID
(def older-db (d/commit-as-db conn #uuid "b4f2e1c0-2feb-5b61-be14-5590b9e01e48"))
branch-as-db returns a database value — immutable, ready to query, safe to hold across calls. commit-as-db does the same for any historical commit, whether or not a branch still names it. Both work without an open connection on the target branch.
To write to a branch, connect with :branch in the config and transact normally:
(def feature-conn (d/connect (assoc cfg :branch :feature)))
(d/transact feature-conn [{:widget/sku "Z" :widget/weight 99}])
The write goes to :feature’s head; :db is undisturbed. Each branch has its own writer; transactions on different branches don’t serialize against each other.
The commit graph
Every transaction produces a commit whose :meta :datahike/parents set records its parents. branch! produces single-parent commits (the previous head of the branch). merge! produces commits with multiple parents. Walking back from any commit gives you the lineage.
(require '[superv.async :refer [<?? S]]
         '[datahike.versioning :refer [branch-history]])
(d/commit-id @conn)
;; => #uuid "b4f2e1c0-…"
(d/parent-commit-ids @conn)
;; => #{#uuid "…"}            ; single parent on a normal commit
;; => #{#uuid "…" #uuid "…"}  ; two (or more) parents on a merge commit
(<?? S (branch-history conn))
;; => sequence of stored DB values, in order from the current head back
;;    through every ancestor reachable via :datahike/parents
branch-history is the workhorse for inspection: it walks the parent graph from the connection’s current branch backward and returns each commit as a DB value, with duplicates pruned. Useful for time-travel reports, audit trails, and assembling queries against arbitrary historical states.
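The walk itself is a plain graph traversal. The sketch below shows the shape of it over a hand-written parent map; toy-history and the commit ids are hypothetical, and the breadth-first order is an assumption for the example rather than a statement about the order branch-history guarantees.

```clojure
(defn toy-history
  "Walk `parents` (commit-id -> set of parent ids) backward from `head`,
  returning each reachable commit-id exactly once, head first."
  [parents head]
  (loop [queue   (conj clojure.lang.PersistentQueue/EMPTY head)
         seen    #{}
         ordered []]
    (if-let [c (peek queue)]
      (if (seen c)
        ;; Already visited via another path: prune the duplicate.
        (recur (pop queue) seen ordered)
        (recur (into (pop queue) (sort (get parents c)))
               (conj seen c)
               (conj ordered c)))
      ordered)))

(def parents
  {:c4 #{:c2 :c3}   ; merge commit with two parents
   :c3 #{:c1}
   :c2 #{:c1}
   :c1 #{}})        ; root commit

(toy-history parents :c4)
;; => [:c4 :c2 :c3 :c1]
```

:c1 is reachable through both :c2 and :c3 but appears once, which is the pruning the text describes.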
Merging: merge-db plus Datalog
This is where the “branches as values” property earns its keep.
merge-db takes a connection, a set of parent branches (such as #{:feature}), and tx-data. It records a new commit on the current branch whose :datahike/parents includes both the previous head and the head of each named parent. The tx-data is regular transaction data; Datahike applies it the same way it applies any transaction. The operation is routed through the writer so it serializes cleanly against concurrent transactions on the same branch. (merge-db is synchronous; there’s also d/merge-db! for the async path, intended for go blocks and listener callbacks.)
What merge-db does not do: figure out the tx-data for you.
That’s a feature, not a gap. Because branches are database values and Datalog queries take multiple databases as inputs, the diff between branches is a query:
(d/q '[:find ?e ?a ?v
       :in $feature $main
       :where
       [$feature ?e ?a ?v _]
       [(not= :db/txInstant ?a)]
       (not [$main ?e ?a ?v _])]
     feature-db main-db)
:in $feature $main binds two databases; :where clauses pick which one each pattern matches against. The result is the set of datoms present in :feature but absent in :db — directly transformable to tx-data.
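Semantically, the query is a set difference over [e a v] tuples. A small sketch in plain Clojure, with made-up tuples standing in for what each branch holds (feature-eavs, main-eavs, and diff->tx-data are hypothetical names):

```clojure
(require '[clojure.set :as set])

;; Hypothetical [e a v] tuples as they might exist on each branch.
(def feature-eavs
  #{[1 :widget/sku "A"] [1 :widget/weight 12] [2 :widget/sku "Z"]})
(def main-eavs
  #{[1 :widget/sku "A"] [1 :widget/weight 10]})

;; "In :feature and not in :db" is a set difference.
(def only-in-feature (set/difference feature-eavs main-eavs))

(defn diff->tx-data
  "Turn [e a v] tuples into :db/add assertions."
  [tuples]
  (into [] (map (fn [[e a v]] [:db/add e a v])) tuples))

(set (diff->tx-data only-in-feature))
;; => #{[:db/add 1 :widget/weight 12] [:db/add 2 :widget/sku "Z"]}
```

This version only produces additions. Whether tuples that exist on :db but not on :feature should become retractions is a policy decision, and it would be one more difference in the opposite direction.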
Real merges are more selective. A few patterns that fall out naturally:
Filter by attribute — merge only the schema changes, leave the data behind:
(d/q '[:find ?e ?a ?v
       :in $feature $main
       :where
       [$feature ?e ?a ?v _]
       [(contains? #{:db/ident :db/valueType :db/cardinality} ?a)]
       (not [$main ?e ?a ?v _])]
     feature-db main-db)
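Last-write-wins on conflicting attributes: collect every candidate value from both branches together with the latest transaction that asserted it. Because ?e, ?a, and ?v are all grouping variables, the query returns one row per (e, a, v); choosing the single winner per (e, a) is then a small reduction over those rows: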
(d/q '[:find ?e ?a (max ?t) ?v
       :in $feature $main
       :where
       (or-join [?e ?a ?v ?t]
         [$feature ?e ?a ?v ?t]
         [$main ?e ?a ?v ?t])]
     feature-db main-db)
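Since ?v sits in the :find clause next to the aggregate, the result has one row per (e, a, v), so picking the winner per (e, a) needs one more step. A sketch of that step in plain Clojure; last-write-wins and the sample rows are hypothetical, and it assumes ?t values are comparable across branches (if they are not in your setup, joining to :db/txInstant and comparing timestamps would be the variant to try) and that the attributes involved are cardinality-one.

```clojure
(defn last-write-wins
  "Given rows of [e a tx v], keep for each [e a] the value whose tx is
  largest, and return the winners as :db/add assertions."
  [rows]
  (->> rows
       (group-by (fn [[e a _ _]] [e a]))
       vals
       (map (fn [candidates] (apply max-key #(nth % 2) candidates)))
       (map (fn [[e a _ v]] [:db/add e a v]))
       set))

;; Hypothetical query result: two competing weights for entity 1.
(def rows
  #{[1 :widget/weight 100 10]
    [1 :widget/weight 105 12]
    [2 :widget/sku    101 "Z"]})

(last-write-wins rows)
;; => #{[:db/add 1 :widget/weight 12] [:db/add 2 :widget/sku "Z"]}
```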
Application-defined resolution — Datalog predicate clauses can call arbitrary functions, so routing each conflict through a domain resolver fits the same shape:
(d/q '[:find ?e ?a ?v-resolved
       :in $feature $main ?resolve
       :where
       [$feature ?e ?a ?v-f _]
       [$main ?e ?a ?v-m _]
       [(not= ?v-f ?v-m)]
       [(?resolve ?e ?a ?v-f ?v-m) ?v-resolved]]
     feature-db main-db your-resolver-fn)
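What your-resolver-fn looks like is entirely up to the domain. One possible shape, as a sketch: a per-attribute table of rules with a default. The attribute names and the rules themselves are invented for illustration.

```clojure
;; Hypothetical per-attribute rules; each takes the :feature value and
;; the :db value and returns the value to keep.
(def resolvers
  {:widget/weight (fn [v-feature v-main] (max v-feature v-main))
   :widget/name   (fn [v-feature v-main]
                    (if (>= (count v-feature) (count v-main))
                      v-feature
                      v-main))})

(defn resolve-conflict
  "Matches the argument order used in the query: entity, attribute,
  :feature value, :db value. Falls back to preferring :feature."
  [_e a v-feature v-main]
  (let [rule (get resolvers a (fn [v-f _] v-f))]
    (rule v-feature v-main)))

(resolve-conflict 1 :widget/weight 10 12)  ;; => 12
(resolve-conflict 1 :widget/sku "Z" "A")   ;; => "Z"
```

Keeping the rules in data makes the resolution policy something you can review and test without touching the query.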
Once you have the tx-data — however you computed it — d/merge-db applies it and records the commit with both parents:
(d/merge-db conn #{:feature}
            (mapv (fn [[e a v]] [:db/add e a v]) diff-tuples))
branch-history then shows the merge commit; d/parent-commit-ids returns the full parent set.
The takeaway: Datahike doesn’t ship a built-in 3-way merge algorithm because it doesn’t need to. The merge algorithm is whatever Datalog query expresses your domain’s resolution rule. Three-way merge of textual files is hard because text has no semantics; merging datoms is a query because the data already carries its own structure.
This generalizes further than it looks. Martin Kleppmann has shown that CRDTs themselves can be expressed as pure Datalog queries over the operation log. Datahike’s merge model lets you adopt that approach incrementally: start with last-write-wins, add domain-specific resolvers where it matters, formalize as CRDT-shaped queries if you want full convergence guarantees.
Reset: force-branch!
force-branch! is the equivalent of git reset --hard. Pass a database value, a target branch, and the set of parent branches or commit-ids to attribute the new head to:
;; Rewind :feature to a known-good historical commit, treating it
;; as a fresh start from :db.
(d/force-branch! (d/commit-as-db conn #uuid "b4f2e1c0-…")
                 :feature
                 #{:db})
The branch head is overwritten unconditionally; the previous head becomes unreachable from the branch name. Existing connections to :feature are now stale and must be released and reconnected.
Useful for rolling back a bad branch after experimentation, pinning a branch to a known commit for audit, or rewriting a branch’s lineage when you need to. Use with care — the prior data isn’t deleted (GC controls that) but you’ve removed the named entry point, so if no other branch or commit-id references it, it goes away on the next sweep.
Cleanup: delete-branch! and gc-storage
delete-branch! removes :feature from the :branches set. The branch’s data stays in konserve, reachable by commit-id, until garbage collection sweeps it. That’s intentional, so you can recover a deleted branch if you change your mind. Live connections to :feature will fail after this; remote readers should release.
You can’t delete :db. It’s the default main branch and removing it would orphan the database; if you want the database gone, delete the database. Other branches are fair game.
Storage reclamation is a separate, explicit step:
(require '[superv.async :refer [<?? S]])
;; Default: only reclaim space from deleted branches.
(<?? S (d/gc-storage conn))
;; With a cutoff date: keep snapshots newer than the date plus all
;; branch heads; delete intermediate snapshots older than the date.
(let [thirty-days-ago (java.util.Date. (- (System/currentTimeMillis)
                                          (* 30 24 60 60 1000)))]
  (<?? S (d/gc-storage conn thirty-days-ago)))
Two things worth knowing about how gc-storage interacts with branch history:
Branch heads are always kept, regardless of cutoff. Every live branch’s current head survives every GC run; GC only removes the intermediate snapshots between commits — the dots between branch heads on the graph, not the latest dot on any branch.
Intermediate commits become unreachable below the cutoff. A 7-day cutoff means branch-history walks only return commits within that window plus the current heads, and d/commit-as-db lookups for older UUIDs fail because the snapshot is gone. The cutoff should also comfortably exceed your longest-running reader’s lifetime — Datahike’s distributed readers walk storage directly without coordinating with a writer, so a snapshot vanishing mid-query is a real failure mode. You’re trading old audit history (and reader safety) for disk space; pick the window to match your readers, compliance posture, and storage budget.
Without a date, d/gc-storage is always safe — it only reclaims storage from deleted branches. Datahike also ships an experimental online-GC mode that runs incrementally during transactions on single-branch databases; offline d/gc-storage is what you reach for in multi-branch setups.
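The retention rule for a dated run can be written down as a predicate. This is a toy model of the rule as described above, not of gc-storage's implementation: survives-gc?, the commit maps, and the dates are hypothetical, it ignores reachability from deleted branches, and treating a commit exactly at the cutoff as kept is an assumption.

```clojure
(defn survives-gc?
  "Toy rule: keep a commit if it is a branch head, or if it is not
  older than `cutoff`."
  [heads cutoff {:keys [id at]}]
  (or (contains? heads id)
      (>= (inst-ms at) (inst-ms cutoff))))

(def commits
  [{:id :c1       :at #inst "2026-01-01"}   ; old intermediate commit
   {:id :c2       :at #inst "2026-03-01"}
   {:id :c3       :at #inst "2026-05-01"}   ; head of one branch
   {:id :old-head :at #inst "2025-12-01"}]) ; head of a quiet branch

(def heads  #{:c3 :old-head})
(def cutoff #inst "2026-02-01")

(mapv :id (filter #(survives-gc? heads cutoff %) commits))
;; => [:c2 :c3 :old-head]
```

:old-head predates the cutoff and still survives because a branch names it; :c1 is the kind of intermediate snapshot a dated run reclaims.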
For how gc-storage composes with purge (GDPR-driven datom deletion) and the broader governance story, see Data Governance in Versioned Systems.
What this unlocks
A handful of workflows that branching makes affordable:
- AI agent sandboxes. Spin up fifty branches, each agent gets its own database to mutate. Merge what works, drop the rest.
- Schema migration tests in CI. Branch from prod, apply the migration, run the regression suite, throw the branch away. The next CI run starts from the same prod commit.
- Editorial workflows. Editors stage changes on a branch, reviewers query the staging branch, approve, merge.
- Multi-tenant snapshots. Each tenant gets a branch of a shared base. Tenant-specific overrides live on their branch; base updates merge cleanly.
- Time-travel debugging. When a bug shows up, branch from the current head, apply experimental fixes on the branch, and walk historical commits via commit-as-db to find when the offending state appeared.
None of these require special infrastructure. The same primitives that read the database also read every branch.
Across the other bindings
The versioning API is part of the Clojure API spec, and the Java, JavaScript / TypeScript, Python (pydatahike), C (libdatahike), and CLI (dthk) bindings are all auto-generated from it. Java surfaces it as Datahike.branchAsync / branchAsDb / mergeDb; JavaScript as d.branchBang / branchAsDb / mergeDb; equivalent forms in the others. The dthk CLI also supports the more general Datalog-driven merge workflow via dthk query with multi-source input and dthk transact — see the CLI doc for an example.
In SQL via pg-datahike, the read side is wired through session variables and a datahike.* function namespace: SET datahike.branch = 'feature', SET datahike.commit_id = '<uuid>', plus datahike.branches(), datahike.create_branch(), datahike.delete_branch(). Write-side merge-db over SQL is on the roadmap. See Datahike Speaks Postgres for the full pgwire surface.
For how the same branching model extends beyond Datahike — to Stratum (SQL / columnar), vector and full-text indices, and other systems via a shared protocol — see Yggdrasil: Branching Protocols.
Known limitations
- Multi-branch purge is expensive. purge removes datoms from the current branch; if you need them gone from every branch that referenced them (for GDPR or similar), the operation walks each branch independently. See Data Governance in Versioned Systems.
- No built-in 3-way merge. Datahike doesn’t ship one because the right resolution rule is domain-specific. The Datalog patterns above cover the common shapes.
- pg-datahike write-side merge-db is not yet exposed over SQL. Reads against any branch work; writes always land on the connection’s default branch in 0.1.
- Branch-diff is O(differing datoms). The query walks both trees. For a 100M-datom database with a small diff, this is fast; for a diff that spans most of the tree, plan accordingly.
Try it
The branching API is in datahike.api (with branch-history still in datahike.versioning). For SQL access, see pg-datahike and the wire-protocol writeup. Repo: github.com/replikativ/datahike.
Feedback to contact@datahike.io or open an issue.