The Git Model for Databases
Every commit is a snapshot. Branches are cheap. Merging is a first-class operation. Developers internalize this model for code - but it applies equally well to data.
Databases as values
In a traditional database, you interact through a connection. The data may change between queries; the database is a service, not a thing you hold.
Datahike inverts this. Dereference a connection (@conn) and you get a database value: a snapshot frozen at a particular transaction. That value won’t change. Pass it to a function, store it, compare it to another snapshot. Two threads reading the same snapshot always agree - no locks, no coordination. And because a snapshot is just a value, you can hand it to any number of workers across threads, processes, or machines. Read scaling is built in: spin up more readers, not more database connections.
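A minimal sketch of this with Datahike's API, assuming an in-memory store and schema-on-read so it runs standalone; the attribute :user/name is illustrative:

```clojure
(require '[datahike.api :as d])

;; In-memory store, schema-on-read, so the example needs no prior schema.
(def cfg {:store {:backend :mem :id "git-model-demo"}
          :schema-flexibility :read})

(d/create-database cfg)
(def conn (d/connect cfg))

(def db-before @conn)                             ; snapshot: an immutable value

(d/transact conn {:tx-data [{:user/name "Ada"}]})

(def db-after @conn)                              ; a new, distinct snapshot

;; The old value is untouched; both snapshots can be queried independently,
;; by any number of readers, with no coordination.
(d/q '[:find ?n :where [?e :user/name ?n]] db-before)  ;; => #{}
(d/q '[:find ?n :where [?e :user/name ?n]] db-after)   ;; => #{["Ada"]}
```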
Structural sharing
If every write produces a new snapshot, won’t you run out of memory? No - because snapshots share structure. When you transact new data, Datahike creates new tree nodes only for changed portions; everything else is reused. A million-row database with one updated row shares 99.99% of its structure with the previous version.
This is the same trick that powers Clojure’s persistent vectors and git’s object store. Per-update overhead is logarithmic in the size of the database, not linear.
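The effect is easiest to see in plain Clojure, whose persistent vectors use the same kind of tree:

```clojure
;; A million-element persistent vector.
(def v1 (vec (range 1000000)))

;; "Updating" one element allocates only the handful of tree nodes on the
;; path to index 42 (roughly log32 of a million, about 4 nodes); the other
;; ~31,000 leaf nodes are shared with v1, not copied.
(def v2 (assoc v1 42 :changed))

(nth v1 42) ;; => 42        - the old value is untouched
(nth v2 42) ;; => :changed  - the new value sees the update
```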
Branching
Fork a database, make changes in isolation, merge back when ready. Unlike git (which merges text files), database merges operate on datoms with application-defined conflict resolution.
This enables workflows that are awkward otherwise: feature branches for data migrations, parallel experiments with different schemas, per-tenant forks sharing a common ancestor. Coding assistants already rely on this model when they use git worktrees to isolate their edits - the same model applies to data.
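As a value-level illustration (this is not Datahike's durable branch-and-merge API), d/db-with applies transaction data to a snapshot speculatively and returns a forked value, leaving the original untouched; :user/role is an illustrative attribute:

```clojure
(def main-db @conn)

;; "Branch": try out a change against a copy of the value, without
;; transacting anything through the connection.
(def experiment-db
  (d/db-with main-db [{:user/name "Grace" :user/role :admin}]))

;; The experiment sees the change...
(d/q '[:find ?n :where [?e :user/role :admin] [?e :user/name ?n]] experiment-db)
;; => #{["Grace"]}

;; ...while the main line of history does not.
(d/q '[:find ?n :where [?e :user/role :admin] [?e :user/name ?n]] main-db)
;; => #{}
```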
History that persists
Most databases offer snapshot isolation through MVCC, but those snapshots are ephemeral - garbage collected after the transaction commits. You can’t query “what was the value last Tuesday?”
Datahike keeps history by default. Every past state is addressable. Query as-of a specific instant, diff two snapshots, audit when a fact changed. Useful for debugging, compliance, and any system that needs to explain itself.
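Continuing the earlier sketch (history keeping is on by default via :keep-history? true), as-of and history expose past states as ordinary database values; the :account/* attributes and the instant are illustrative:

```clojure
(def last-tuesday #inst "2024-05-07")   ;; any instant in the past

;; "What was the balance last Tuesday?" - query a snapshot as of that instant.
(d/q '[:find ?balance .
       :where [?a :account/id 42]
              [?a :account/balance ?balance]]
     (d/as-of @conn last-tuesday))

;; "When did this fact change?" - the history view keeps retracted datoms,
;; so every past value and its transaction remain queryable.
(d/q '[:find ?balance ?tx
       :where [?a :account/id 42]
              [?a :account/balance ?balance ?tx]]
     (d/history @conn))
```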
The tradeoff
Immutability isn’t free. Write amplification is real: inserting a row touches multiple tree nodes. Storage grows with history, though compaction can prune what you don’t need.
In practice, this cost is amortized. You don’t create a snapshot for every fact added during a bulk load. The underlying data structures support transient modes - mutable during a batch, immutable at the boundary. Snapshots are created only when a batch commits, and only those become visible to external readers. The system can adaptively coarse-grain batches to balance write throughput against snapshot granularity.
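The shape of that batching is the transient/persistent pattern Clojure already exposes on its collections - roughly:

```clojure
;; Mutate freely inside the batch, freeze into an immutable value at the
;; boundary; no intermediate snapshots are ever created or exposed.
(defn load-batch [facts]
  (persistent!
    (reduce conj! (transient []) facts)))

;; Only the final, frozen value escapes to readers.
(def batch (load-batch (range 100000)))
```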
For systems that value auditability, reproducibility, and coordination-free reads, this model beats the connection-oriented one we inherited from the 1970s.