Why We Built Datahike

February 2026

I’ve been working toward this for over a decade. It started with Votorola - collaborative liquid democracy software - where I first needed to distribute a memory model across systems. That led me to Clojure, which led me to a question that I’ve been chasing ever since: how do you build data infrastructure that doesn’t lose history?

Most databases are designed for transactional business logic: process an order, update an account, move on. But many of the systems we’re building today are different. They run for weeks or months, accumulate knowledge, and need to reason about their own past. A database that overwrites state on every write doesn’t support that well.

This is the story of why we built Datahike, and why I think immutable, versioned data is the right foundation for systems that need to last.

The problem with mutable state

In 2013, I started replikativ to explore distributed, cross-platform replication systems. The core challenge was always synchronization: how do you keep data consistent across nodes without losing the ability to reason about history? But the deeper I got, the more I realized the problem wasn’t distribution - it was mutability.

When data changes in place, you lose the ability to ask “what did the system know last Tuesday?” You can’t fork an experiment, try something, and merge it back. You can’t audit what happened, because the evidence has been overwritten.

In functional programming, we solved this decades ago. Data structures are immutable: you don't change a value, you produce a new one. Programs become easier to reason about and test. I kept wondering why databases didn't work the same way.

Finding the pieces

The answer, it turned out, was that they could - but the pieces weren’t assembled yet. Datomic had shown the way: immutable, versioned data with time travel. But Datomic was closed source and designed for centralized deployment. I wanted something open, distributed by design, and built for systems that live everywhere - from edge devices to cloud clusters.

We needed the right combination of query engine, index structure, and persistence.

1. A mature query engine

Nikita Prokopov’s DataScript provided this. It was an in-memory Datalog database with five years of development, a robust query engine, and a clean, well-designed codebase. The only problem: it was purely in-memory. No durability.

2. A functional, persistent index

We initially experimented with David Greenberg's Hitchhiker Tree, which taught us a lot about immutable indexing. It combines B+ tree query performance with append-only write semantics, which is great for logs and write-heavy workloads. But database indices are read-dominated, and the Hitchhiker Tree trades some read speed for write throughput - the wrong trade-off for our use case.

So we extended persistent-sorted-set, a functionally persistent sorted set optimized for database indices. It gives us excellent read performance while maintaining immutable semantics and efficient structural sharing. When you “update” the index, you don’t mutate nodes in place - you create new nodes that share structure with the old ones. The old version still exists, unchanged.
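The update semantics can be sketched with Clojure's built-in sorted set, which follows the same persistent contract (persistent-sorted-set adds an index-oriented implementation on top of it):

```clojure
;; A persistent sorted set: "updating" never mutates in place.
(def v1 (sorted-set 1 2 3))

;; conj returns a new value that shares structure with v1.
(def v2 (conj v1 4))

;; The old version still exists, unchanged.
v1 ;; => #{1 2 3}
v2 ;; => #{1 2 3 4}
```

Both versions remain first-class values: cheap to keep, safe to share across threads, and comparable to each other.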

3. The glue to put them together

This is where Datahike came in. We forked DataScript, adjusted persistent-sorted-set, added storage backends (file, SQL, LMDB, S3, GCS and more via Konserve), and kept going. Konrad Kühne and our former team at Lambdaforge UG contributed substantially in the early years - adding history indices, time travel support, and helping Datahike achieve temporal query parity with Datomic. Together we built out schema flexibility and the protocols that make Datahike extensible.

The realization: databases should be values

Here’s the thing that took me years to fully appreciate: in Datahike, a database is a value, not a service.

In a traditional database, you connect to a server. The data changes between queries. You’re always interacting with “the database” as a mutable thing.

In Datahike, you dereference a connection and get a database value: a snapshot frozen at a particular transaction. That value won’t change. You can pass it to a function. Store it. Compare it to another snapshot. Two threads reading the same database value always see the same thing - no locks, no coordination needed.

This matters because it makes the database composable. You can hold a snapshot in a variable, hand it to a worker, serialize it, or compare two versions structurally. Read scaling becomes trivial: spin up more readers, not more database connections.
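A minimal sketch of this with Datahike's public API, using an in-memory store for illustration (the `:id` value and attribute names are arbitrary):

```clojure
(require '[datahike.api :as d])

;; In-memory store, schema-on-read, purely for illustration.
(def cfg {:store {:backend :mem :id "snapshot-example"}
          :schema-flexibility :read})

(d/create-database cfg)
(def conn (d/connect cfg))

(d/transact conn [{:user/name "Ada"}])

;; Dereferencing the connection yields an immutable snapshot - a value.
(def db @conn)

;; Later writes don't touch the snapshot we already hold.
(d/transact conn [{:user/name "Grace"}])

;; The old snapshot still sees exactly one user; the connection sees two.
(d/q '[:find (count ?e) . :where [?e :user/name]] db)
(d/q '[:find (count ?e) . :where [?e :user/name]] @conn)
```

Because `db` is just a value, it can be passed to functions, cached, or handed to worker threads without any locking.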

But the real power is what this enables.

Git semantics for data

Once you have immutable snapshots, you can do things that are awkward or impossible with traditional databases:

Branching: Fork a database, make changes in isolation, merge back when ready. Unlike git (which merges text files), database merges operate on datoms with application-defined conflict resolution. This enables feature branches for data migrations, parallel experiments with different schemas, per-tenant forks sharing a common ancestor.

Time travel: Query any past state. Not “last 7 days” - any specific instant. Diff two snapshots to see exactly what changed. Audit when a fact was added or retracted.

Reproducibility: Capture a snapshot, store it, query it later. Same snapshot always yields same results. This is essential for ML experiments, compliance systems, or anything that needs to explain its decisions.
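The time-travel reads above can be sketched with Datahike's `as-of` and `history` views (again an in-memory store with illustrative names; `:keep-history? true` enables the temporal indices):

```clojure
(require '[datahike.api :as d])

(def cfg {:store {:backend :mem :id "time-travel-example"}
          :schema-flexibility :read
          :keep-history? true})

(d/create-database cfg)
(def conn (d/connect cfg))

(d/transact conn [{:user/name "Ada"}])
(def t (java.util.Date.))          ;; remember "last Tuesday"
(Thread/sleep 10)
(d/transact conn [{:user/name "Grace"}])

;; Query the database as it was at time t: only Ada existed.
(d/q '[:find ?n :where [?e :user/name ?n]]
     (d/as-of @conn t))

;; The history view additionally exposes retracted facts,
;; so you can audit when a fact was added or removed.
(d/q '[:find ?n :where [?e :user/name ?n]]
     (d/history @conn))
```

The same query against the same `as-of` snapshot always returns the same result, which is what makes reproducible pipelines and audits straightforward.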

Why this matters for AI

During my PhD, I developed inference systems that accumulate evidence over time. Probabilistic programs build up distributions, revise beliefs, maintain uncertainty. They need to fork hypotheses, evaluate alternatives, and keep track of the path that led to each conclusion. The database backing such a system needs to support that natively - not as a bolt-on.

The same applies to any long-running system that accumulates knowledge: agent pipelines, compliance systems, scientific workflows. They all benefit from being able to fork state safely, roll back when something goes wrong, and answer “what did this system know when it made that decision?”

Datahike provides this: knowledge survives restarts, you can fork and merge, every past state is queryable, and the same query on the same snapshot always returns the same result.

What we’ve built

From those early experiments, Datahike has grown into more than just a database: a Datalog query engine, functionally persistent indices, and pluggable storage backends via Konserve, each usable in its own right.

Each piece applies the same underlying idea: data should be immutable and versioned by default. And we're not done. Datalog is our starting point; we're working toward a broader programming model where persistent, versioned state is the default across distributed environments.

Where we’re going

I’m bootstrapping a company on top of Datahike. We’re looking for collaborators who want to push distributed immutable systems forward, and for early customers who need versioned data infrastructure in production.

This work has always been collaborative. Konrad Kühne and our early team helped shape Datahike’s foundation. The broader open source community continues to push it forward through issues, PRs, and production deployments.

If you’re building something where audit, reproducibility, or long-term memory matter, I’d like to hear about it.

Founder and maintainer