The Git Model for Databases

Every commit is a snapshot. Branches are cheap. Merging is a first-class operation. Developers internalize this model for code—but it applies equally well to data.

Databases as values

In a traditional database, you interact through a connection. The data may change between queries; the database is a service, not a thing you hold.

Datahike inverts this. Dereference a connection (@conn) and you get a database value: a snapshot frozen at a particular transaction. That value won’t change. Pass it to a function, store it, compare it to another snapshot. Two threads reading the same snapshot always agree—no locks, no coordination. And because a snapshot is just a value, you can hand it to any number of workers across threads, processes, or machines. Read scaling is built in: spin up more readers, not more database connections.
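A minimal sketch, assuming Datahike’s in-memory backend and an illustrative :user/name attribute:

```clojure
(require '[datahike.api :as d])

;; In-memory store; the :id and the :user/name attribute are illustrative.
(def cfg {:store {:backend :mem :id "values-demo"}
          :schema-flexibility :read})

(d/create-database cfg)
(def conn (d/connect cfg))

(d/transact conn [{:user/name "Ada"}])

;; Dereferencing yields an immutable snapshot of the current state.
(def before @conn)

(d/transact conn [{:user/name "Grace"}])

;; `before` is unaffected by the later write.
(d/q '[:find ?n :where [_ :user/name ?n]] before) ;=> #{["Ada"]}
(d/q '[:find ?n :where [_ :user/name ?n]] @conn)  ;=> #{["Ada"] ["Grace"]}
```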

Structural sharing

If every write produces a new snapshot, won’t you run out of memory? No, because snapshots share structure. When you transact new data, Datahike creates new tree nodes only for the changed portions; everything else is reused. A million-fact database with one updated fact shares 99.99% of its structure with the previous version.

This is the same trick that powers Clojure’s persistent vectors and git’s object store. Per-write overhead is logarithmic in the size of the database, not linear.
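The effect is easy to see in Clojure itself, since the same path-copying scheme backs its vectors:

```clojure
;; A million-element persistent vector.
(def v1 (vec (range 1000000)))

;; "Updating" one slot copies only the few tree nodes on the path to it
;; (the trees branch 32 ways, so depth is ~4 here); the rest of the
;; structure is shared with v1.
(def v2 (assoc v1 0 :changed))

(nth v1 0) ;=> 0, the original is untouched
(nth v2 0) ;=> :changed
```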

Branching

Fork a database, make changes in isolation, merge back when ready. Unlike git (which merges text files), database merges operate on datoms with application-defined conflict resolution.

This enables workflows that are awkward otherwise: feature branches for data migrations, parallel experiments with different schemas, per-tenant forks sharing a common ancestor. Coding assistants already use git worktrees to isolate their edits; the same model applies to data.
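Datahike exposes this through an experimental versioning namespace. The sketch below is an outline under assumptions: the names follow datahike.experimental.versioning, but verify the exact signatures against the version you run.

```clojure
(require '[datahike.api :as d]
         '[datahike.experimental.versioning :as v])

;; Assumed API: fork the default :db branch into an isolated branch.
(v/branch! conn :db :migration)

;; Work against the fork; the main branch is untouched.
(def mig-conn (d/connect (assoc cfg :branch :migration)))
(d/transact mig-conn [{:user/name "Hopper"}])

;; Merge back into the main branch, supplying tx-data that resolves
;; any conflicts (conflict resolution is the application's job).
(v/merge! conn #{:migration} [])
```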

History that persists

Most databases offer snapshot isolation through MVCC, but those snapshots are ephemeral: old versions are garbage-collected once no running transaction can see them. You can’t query “what was the value last Tuesday?”

Datahike keeps history by default. Every past state is addressable. Query as-of a specific instant, diff two snapshots, audit when a fact changed. Useful for debugging, compliance, and any system that needs to explain itself.
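A sketch, assuming an illustrative :user/email attribute; history is on by default (:keep-history? true), and the #inst stands in for “last Tuesday”:

```clojure
(require '[datahike.api :as d])

;; The database as it was at a past instant.
(def last-tuesday (d/as-of @conn #inst "2024-05-14T00:00:00Z"))
(d/q '[:find ?e ?email :where [?e :user/email ?email]] last-tuesday)

;; The history view also carries retracted facts; the fifth pattern
;; slot binds whether each datom was asserted or retracted.
(d/q '[:find ?email ?tx ?added
       :where [?e :user/email ?email ?tx ?added]]
     (d/history @conn))
```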

The tradeoff

Immutability isn’t free. Write amplification is real: inserting a single datom rewrites a leaf-to-root path in every index that stores it. Storage grows with history, though compaction can prune what you don’t need.

In practice, this cost is amortized: you don’t create a snapshot for every fact added during a bulk load. The underlying data structures support transient modes, mutable within a batch and immutable at the boundary. Snapshots are created only when a batch commits, and only those become visible to external readers. The system can adaptively coarsen batches to balance write throughput against snapshot granularity.
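Concretely, one transact call commits one snapshot, however many facts it carries. A sketch, reusing conn from the earlier example, with an arbitrary batch size:

```clojure
;; Illustrative bulk load: 100,000 entities in batches of 10,000,
;; producing ten snapshots rather than a hundred thousand.
(def users (for [i (range 100000)]
             {:user/name (str "user-" i)}))

(doseq [batch (partition-all 10000 users)]
  (d/transact conn (vec batch)))
```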

For systems that value auditability, reproducibility, and coordination-free reads, this model beats the connection-oriented one we inherited from the 1970s.