<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Datahike Notes</title><description>Technical notes on versioned data infrastructure</description><link>https://datahike.io/</link><item><title>Anomaly Detection Belongs in Your Database</title><link>https://datahike.io/notes/anomaly-detection-in-your-database/</link><guid isPermaLink="true">https://datahike.io/notes/anomaly-detection-in-your-database/</guid><description>Why we built SIMD-accelerated isolation forests directly into Stratum&apos;s SQL engine — and why exporting to Python is the wrong default.</description><pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;div class=&quot;container page prose&quot;&gt;
&lt;h1 id=&quot;anomaly-detection-belongs-in-your-database&quot;&gt;Anomaly Detection Belongs in Your Database&lt;/h1&gt;
&lt;p&gt;Every analytical database can aggregate, filter, and join. None of them can tell you “something is wrong with this data” as a first-class operation.&lt;/p&gt;
&lt;p&gt;The standard workflow today: query your warehouse, serialize millions of rows into a DataFrame, import scikit-learn, fit an &lt;code&gt;IsolationForest&lt;/code&gt;, write results back. You now maintain two systems, two runtimes, and a serialization boundary that adds seconds of latency per round-trip. For a fraud detection pipeline running against live transactions, those seconds matter. For a data engineer who just wants to flag outliers in a &lt;code&gt;SELECT&lt;/code&gt; statement, the entire Python detour is unnecessary friction.&lt;/p&gt;
&lt;p&gt;We built anomaly detection directly into &lt;a href=&quot;/stratum&quot;&gt;Stratum&lt;/a&gt; — not as a UDF shim that calls Python under the hood, but as a native SIMD-accelerated implementation that runs inside the query engine. Train a model, score your data, all from SQL — no Python, no Clojure, no external runtime.&lt;/p&gt;
&lt;img src=&quot;/images/anomaly-detection-explainer.svg&quot; alt=&quot;Infographic: comparing the Python export pipeline (seconds of latency, 2x memory) with Stratum&amp;#x27;s in-database approach (6 microseconds per transaction), and showing how isolation forests detect anomalies by isolating outliers in fewer tree splits&quot; style=&quot;width: 100%; margin: 2rem 0;&quot;&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ANOMALY_SCORE(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;fraud_model&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;7&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;No data leaves the database. No serialization. The query planner pushes down predicates and prunes chunks before the model ever sees a row. Scoring a single transaction takes 6 microseconds. A batch of 1,000 incoming transactions: 1.6 milliseconds. That’s fast enough to sit in the hot path of a payment gateway — not as a batch job that runs after the fact, but as a synchronous check before the transaction clears.&lt;/p&gt;
&lt;h2 id=&quot;why-isolation-forests&quot;&gt;Why isolation forests&lt;/h2&gt;
&lt;p&gt;Most “anomaly detection in SQL” tutorials teach you to compute z-scores: &lt;code&gt;(value - AVG(value)) / STDDEV(value) &gt; 3&lt;/code&gt;. This works for Gaussian-distributed single columns. It fails everywhere else.&lt;/p&gt;
&lt;p&gt;Real anomalies are multivariate. A transaction amount of $500 is normal. A frequency of 20 per hour is normal. Both together, at 3am, to a merchant in a country where the cardholder has never transacted — that’s the signal. Z-scores can’t see it. Neither can IQR-based methods or simple threshold rules. You need a model that captures the joint structure of your data.&lt;/p&gt;
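To make that concrete, here is a toy sketch in plain Java (invented numbers, purely illustrative) of a row that sails through a per-column z-score check while being jointly anomalous:

```java
// Toy illustration with invented numbers: per-column z-scores miss a
// jointly anomalous row. Normal rows lie near the diagonal (high amount
// goes with high freq); the last row breaks that pattern.
public class ZScoreBlindSpot {
    static double zscore(double[] xs, double v) {
        double mean = 0, sq = 0;
        for (double x : xs) mean += x;
        mean /= xs.length;
        for (double x : xs) sq += (x - mean) * (x - mean);
        return (v - mean) / Math.sqrt(sq / xs.length);
    }

    public static void main(String[] args) {
        double[] amount = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 10};
        double[] freq   = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 100};
        // The last row (amount=10, freq=100) is the odd one out, yet both
        // of its z-scores stay far below the usual |z| > 3 threshold.
        System.out.printf("z(amount)=%.2f z(freq)=%.2f%n",
                zscore(amount, amount[10]), zscore(freq, freq[10]));
    }
}
```

Both z-scores come out around 1.35 in magnitude — nowhere near the threshold — even though the (amount, freq) pair is unlike every other row.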
&lt;p&gt;&lt;a href=&quot;https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf&quot;&gt;Isolation forests&lt;/a&gt; (Liu, Ting &amp;#x26; Zhou, 2008) take a fundamentally different approach. Instead of modeling what “normal” looks like — a density estimate, a distribution fit, a cluster boundary — they directly measure how easy it is to &lt;em&gt;isolate&lt;/em&gt; a point from everything else. Build a tree of random splits across random features. Anomalous points, being few and different, get isolated in fewer splits. Normal points, packed into dense regions, require many splits to separate.&lt;/p&gt;
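The paper's scoring rule is compact: s(x, n) = 2^(−E[h(x)] / c(n)), where h(x) is the number of splits needed to isolate x and c(n) normalizes by the average path length of an unsuccessful binary-search-tree lookup over n points. Here is that formula as a short Java sketch (an illustration of the paper's math, not Stratum's code):

```java
// s(x, n) = 2^(-E[h(x)] / c(n)): short isolation paths mean high scores.
public class IsolationScore {
    static final double EULER_MASCHERONI = 0.5772156649;

    // Average path length of an unsuccessful BST search over n points:
    // c(n) = 2*H(n-1) - 2*(n-1)/n, with H(i) approximated by ln(i) + gamma.
    static double c(int n) {
        return 2 * (Math.log(n - 1) + EULER_MASCHERONI) - 2.0 * (n - 1) / n;
    }

    static double score(double avgPathLength, int n) {
        return Math.pow(2, -avgPathLength / c(n));
    }

    public static void main(String[] args) {
        int n = 256; // the per-tree sample size used throughout this post
        // Isolated after only 2 splits: clearly anomalous, score near 1.
        System.out.printf("h=2    -> %.2f%n", score(2, n));
        // Average-length path: score is exactly 0.5, the "nothing special" point.
        System.out.printf("h=c(n) -> %.2f%n", score(c(n), n));
    }
}
```

The normalization is what makes scores comparable across sample sizes: 0.5 always means "as hard to isolate as a typical point."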
&lt;p&gt;The properties that make this algorithm uniquely suited to a columnar database:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No assumptions.&lt;/strong&gt; Z-scores assume Gaussian distributions. DBSCAN assumes density clusters. Isolation forests are non-parametric — they work on any distribution shape, any number of dimensions, without tuning.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Subsampling.&lt;/strong&gt; Each tree is trained on only 256 randomly sampled rows, regardless of total dataset size. Training 100 trees on 10M rows takes 6ms — it reads 25,600 rows total. This is the key insight from the original paper: anomalies are &lt;em&gt;so&lt;/em&gt; different that a tiny sample is enough to characterize them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Linear scoring.&lt;/strong&gt; Scoring each row means traversing 100 trees of depth ≤8. That’s 800 comparisons per row — branch-free, cache-friendly, and trivially parallelizable. Stratum’s implementation packs each tree node into a single &lt;code&gt;long&lt;/code&gt; (split feature in upper 32 bits, split value as float in lower 32), traverses with branchless &lt;code&gt;node = 2*node + 1 + cmp&lt;/code&gt;, and processes rows in morsel-driven parallel batches sized to fit L1 cache.&lt;/p&gt;
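A simplified sketch of that node layout and traversal — the breadth-first array order and fixed-depth loop are our assumptions for illustration, and Stratum's actual code differs in detail:

```java
// Each node in one 64-bit long: split feature index in the upper 32 bits,
// split value (as raw float bits) in the lower 32.
public class PackedTraversal {
    static long pack(int feature, float split) {
        return ((long) feature << 32) | (Float.floatToIntBits(split) & 0xFFFFFFFFL);
    }

    static int feature(long node) { return (int) (node >>> 32); }
    static float split(long node) { return Float.intBitsToFloat((int) node); }

    // Descend a complete tree stored breadth-first in a flat array.
    // A JIT can lower the ?: below to a conditional move, so the loop
    // proceeds without data-dependent branches.
    static int leaf(long[] tree, double[] row, int depth) {
        int node = 0;
        for (int d = 0; d < depth; d++) {
            long n = tree[node];
            int cmp = row[feature(n)] >= split(n) ? 1 : 0;
            node = 2 * node + 1 + cmp; // heap-style child addressing
        }
        return node;
    }

    public static void main(String[] args) {
        // Depth-2 tree: internal nodes at indices 0..2, leaves are 3..6.
        long[] tree = { pack(0, 50f), pack(1, 10f), pack(1, 20f) };
        System.out.println(leaf(tree, new double[]{30, 5}, 2));  // 3
        System.out.println(leaf(tree, new double[]{60, 25}, 2)); // 6
    }
}
```

No pointers, no allocations per node — the whole tree is one contiguous `long[]` that fits comfortably in cache.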
&lt;p&gt;&lt;strong&gt;Multivariate by construction.&lt;/strong&gt; Every tree split randomly selects a feature. The ensemble naturally captures cross-feature interactions without the user specifying which features correlate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Unsupervised.&lt;/strong&gt; No labels needed. You don’t need a curated training set of “known fraud” — the algorithm finds whatever doesn’t fit the bulk distribution. This matters because in practice, labeled anomaly data is expensive, incomplete, and often biased toward known attack patterns.&lt;/p&gt;
&lt;h2 id=&quot;what-the-landscape-looks-like&quot;&gt;What the landscape looks like&lt;/h2&gt;
&lt;p&gt;We surveyed what major analytical databases offer for built-in anomaly detection:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DuckDB&lt;/strong&gt; has no native capability. The closest is &lt;a href=&quot;https://github.com/DataZooDE/anofox-tabular&quot;&gt;anofox-tabular&lt;/a&gt;, a third-party community extension (BSL-licensed) that adds isolation forests to DuckDB. We read through the implementation — it’s feature-rich (Extended IF, SCiForest, categorical columns, density scoring), but architecturally very different from what we built. anofox-tabular retrains the forest on every query — there’s no model persistence, so you can’t train once and score cheaply at query time. Its C++ implementation is scalar (no SIMD), single-threaded (no parallelism in build or score), and uses recursive traversal with &lt;code&gt;std::vector&lt;/code&gt; allocations at every tree node. It also copies all data from DuckDB’s columnar format into its own data structures before running. The README describes “vectorized C++17,” which likely refers to DuckDB’s general execution model rather than the isolation forest code itself. For small datasets (the test suite uses 5-51 rows), none of this matters. For scoring a million rows inline with a query, or scoring 1,000 transactions in the hot path of a payment system, the architectural choices compound. We haven’t benchmarked head-to-head, but the design differences — flat packed arrays vs. nested vectors, morsel-driven parallelism vs. single-threaded, persistent models vs. retrain-per-query — point to a substantial gap at scale.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ClickHouse&lt;/strong&gt; has &lt;code&gt;seriesOutliersDetectTukey&lt;/code&gt; — a univariate IQR method for time-series. Useful for simple threshold alerts, but it’s one column at a time, one statistical method, no learning. Cloudflare &lt;a href=&quot;https://blog.cloudflare.com/lessons-learned-from-scaling-up-cloudflare-anomaly-detection-platform/&quot;&gt;built their anomaly detection platform&lt;/a&gt; on ClickHouse but implemented the actual detection logic (HBOS) in external microservices — ClickHouse stores and aggregates the data, it doesn’t run the models.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;TimescaleDB&lt;/strong&gt; has an &lt;a href=&quot;https://github.com/timescale/timescaledb-toolkit/issues/45&quot;&gt;open issue&lt;/a&gt; proposing ARIMA and DBSCAN anomaly detection. It remains unimplemented.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PostgreSQL MADlib&lt;/strong&gt; offers in-database ML, but it’s a heavy extension that hasn’t seen active development recently.&lt;/p&gt;
&lt;p&gt;The pattern is consistent: analytical databases treat anomaly detection as somebody else’s problem. The “solution” is always to export data to a separate ML runtime.&lt;/p&gt;
&lt;h2 id=&quot;the-cost-of-exporting&quot;&gt;The cost of exporting&lt;/h2&gt;
&lt;p&gt;This isn’t just about convenience. The export-to-Python pattern has structural costs that compound in production:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Latency.&lt;/strong&gt; Serializing 1M rows from a database into Python’s heap takes seconds. Add model inference, write-back, and you’re looking at minutes for a pipeline that should be a query. For fraud detection or infrastructure monitoring, that latency window is when damage happens.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Memory duplication.&lt;/strong&gt; The data exists in the database AND in Python’s process. For large datasets, this means either paying for 2x RAM or batching with additional orchestration complexity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Operational surface area.&lt;/strong&gt; You now maintain a database AND a Python environment with scikit-learn, NumPy, and their transitive dependencies. Version pinning, compatibility testing, deployment coordination. Every additional system boundary is a place where things break.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Security perimeter.&lt;/strong&gt; Moving data out of the database means it leaves whatever access controls, encryption, and audit logging the database provides. For regulated industries, this is a compliance headache.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Lost optimization.&lt;/strong&gt; When anomaly scoring is a SQL function, the query engine can apply zone-map pruning, skip entire chunks where min/max statistics prove no rows will match downstream filters, and fuse the scoring into the execution pipeline. An external Python process sees a flat array with no metadata.&lt;/p&gt;
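To illustrate what that pushdown buys, here is a hypothetical sketch — the `Zone` record and `survivors` function are ours for illustration, not Stratum's API — of how per-chunk min/max statistics let an engine skip chunks before the model ever sees them:

```java
import java.util.List;

// Hypothetical illustration of zone-map pruning: each chunk carries min/max
// statistics, and chunks that provably cannot satisfy a filter are skipped.
public class ZoneMapPruning {
    record Zone(double minAmount, double maxAmount, int rowStart, int rowEnd) {}

    // Keep only zones that might contain rows with amount > threshold.
    static List<Zone> survivors(List<Zone> zones, double threshold) {
        return zones.stream().filter(z -> z.maxAmount() > threshold).toList();
    }

    public static void main(String[] args) {
        List<Zone> zones = List.of(
                new Zone(1, 80, 0, 9_999),        // max 80: provably no match
                new Zone(5, 450, 10_000, 19_999), // might match: scan and score
                new Zone(2, 95, 20_000, 29_999)); // provably no match
        // Only one of three chunks ever reaches the scoring function.
        System.out.println(survivors(zones, 100).size()); // 1
    }
}
```

An external process handed a flat array has no such metadata and must score every row.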
&lt;h2 id=&quot;how-it-works-in-stratum&quot;&gt;How it works in Stratum&lt;/h2&gt;
&lt;h3 id=&quot;sql-interface&quot;&gt;SQL interface&lt;/h3&gt;
&lt;p&gt;Stratum speaks the PostgreSQL wire protocol. Connect with psql, DBeaver, JDBC, or any PostgreSQL client — then train and query models entirely from SQL:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Train a model directly from SQL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; MODEL fraud_model&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  TYPE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ISOLATION_FOREST&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  OPTIONS (n_trees &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 200&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, sample_size &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 256&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, contamination &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;05&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AS&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; amount, freq &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;AS SELECT&lt;/code&gt; query defines the training data — any valid SELECT works, including WHERE filters and JOINs. Column names become the model’s feature names. Once created, the model remembers its features — you don’t need to repeat them:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Short form: model knows its features from training&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, ANOMALY_SCORE(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;fraud_model&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; score&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- All four functions support both forms&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, ANOMALY_PREDICT(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;fraud_model&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; is_anomaly &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, ANOMALY_PROBA(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;fraud_model&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; prob &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, ANOMALY_CONFIDENCE(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;fraud_model&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; conf &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Need to score on different columns, computed expressions, or join results? Use the long form with explicit arguments (mapped positionally to the model’s features):&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Explicit columns&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, ANOMALY_SCORE(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;fraud_model&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, amount, freq) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; score&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Score on expressions&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, ANOMALY_SCORE(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;fraud_model&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, amount &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 100&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;LOG&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(freq)) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; score&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Score across JOINs&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; t.&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, ANOMALY_SCORE(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;fraud_model&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;amount&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;r&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;rate&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; score&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions t &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JOIN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; rates r &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;currency&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; r&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;code&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Model management is also SQL-native:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW MODELS;                    &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- list all registered models&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;DESCRIBE MODEL fraud_model;     &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- features, hyperparameters, threshold&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DROP&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; MODEL fraud_model;         &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- remove a model&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DROP&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; MODEL &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;IF&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; EXISTS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; old_model; &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- remove only if it exists&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The anomaly functions look and compose like any other SQL expression — filter on them, aggregate them, join them.&lt;/p&gt;
&lt;h3 id=&quot;clojure-api&quot;&gt;Clojure API&lt;/h3&gt;
&lt;p&gt;For programmatic workflows — custom training pipelines, model rotation, or embedding Stratum as a library — there’s a direct Clojure API:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;require(&apos;[stratum.api :as st])

;; Your data — plain Java arrays
def amounts: double-array([10 15 12 11 14 200 13 11 300 12])
def freqs: double-array([5 6 4 5 7 1 5 4 1 6])

;; Train: 100 trees, 256 samples each, expect ~5% anomalies
def model: st/train-iforest({:from {:amount amounts, :freq freqs}, :contamination 0.05})

;; Score: double[] in [0, 1] — higher = more anomalous
st/iforest-score(model {:amount amounts, :freq freqs})

;; Binary prediction: long[] with 1 = anomaly, 0 = normal
st/iforest-predict(model {:amount amounts, :freq freqs})

;; Confidence: how much do the trees agree? [0, 1]
st/iforest-predict-confidence(model {:amount amounts, :freq freqs})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(require &apos;[stratum.api :as st])

;; Your data — plain Java arrays
(def amounts (double-array [10 15 12 11 14 200 13 11 300 12]))
(def freqs   (double-array [ 5  6  4  5  7   1  5  4   1  6]))

;; Train: 100 trees, 256 samples each, expect ~5% anomalies
(def model (st/train-iforest {:from {:amount amounts :freq freqs}
                              :contamination 0.05}))

;; Score: double[] in [0, 1] — higher = more anomalous
(st/iforest-score model {:amount amounts :freq freqs})

;; Binary prediction: long[] with 1 = anomaly, 0 = normal
(st/iforest-predict model {:amount amounts :freq freqs})

;; Confidence: how much do the trees agree? [0, 1]
(st/iforest-predict-confidence model {:amount amounts :freq freqs})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Scores integrate directly with the query engine — they’re just another column:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;def scores: st/iforest-score(model data)
st/q({:from assoc(data :score scores)
      :where [[:&gt; :score 0.7]]
      :group [:region]
      :agg [[:avg :score] [:count]]
      :having [[:&gt; :avg 0.5]]
      :order [[:avg :desc]]})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(def scores (st/iforest-score model data))
(st/q {:from   (assoc data :score scores)
       :where  [[:&gt; :score 0.7]]
       :group  [:region]
       :agg    [[:avg :score] [:count]]
       :having [[:&gt; :avg 0.5]]
       :order  [[:avg :desc]]})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h2 id=&quot;online-adaptation&quot;&gt;Online adaptation&lt;/h2&gt;
&lt;p&gt;Data distributions shift. Fraud patterns evolve. A model trained last month may not catch today’s anomalies. Retraining from scratch is wasteful when only the recent distribution has changed.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;iforest-rotate&lt;/code&gt; replaces the oldest &lt;em&gt;k&lt;/em&gt; trees with new ones trained on fresh data. The original model is unchanged — copy-on-write semantics mean you can keep the old model for comparison:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;;; Replace 10% of trees with new ones trained on this week&apos;s data
def updated-model: st/iforest-rotate(model this-week-data)

;; Score with recency bias: newer trees weighted higher
st/iforest-score-weighted(updated-model data 0.98)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;;; Replace 10% of trees with new ones trained on this week&apos;s data
(def updated-model (st/iforest-rotate model this-week-data))

;; Score with recency bias: newer trees weighted higher
(st/iforest-score-weighted updated-model data 0.98)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This is a lightweight operation — training 10 new trees on 256 samples each costs microseconds. The resulting model maintains sensitivity to historical patterns (90 original trees) while adapting to recent distribution changes (10 new trees). In our temporal evaluation with synthetic concept drift (outlier region shifting at the midpoint), the rotating model maintains AUC above 0.95 across all segments where a static model degrades to 0.75.&lt;/p&gt;
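The exact weighting scheme isn’t spelled out above, so here is one plausible reading of the decay parameter — a sketch under our own assumptions, not Stratum’s implementation — where tree weights fall off geometrically with age, letting the fresh trees pull the ensemble toward the new distribution:

```java
// Assumed weighting (illustration only): tree i of k gets weight
// decay^(k-1-i), so the newest tree carries weight 1.0 and older trees less.
public class RecencyWeightedScore {
    static double weightedScore(double[] perTreeScores, double decay) {
        double num = 0, den = 0;
        int k = perTreeScores.length;
        for (int i = 0; i < k; i++) {
            double w = Math.pow(decay, k - 1 - i);
            num += w * perTreeScores[i];
            den += w;
        }
        return num / den;
    }

    public static void main(String[] args) {
        // 90 old trees say "normal" (0.3); 10 freshly rotated trees say
        // "anomalous" (0.9) because the distribution has drifted.
        double[] scores = new double[100];
        java.util.Arrays.fill(scores, 0, 90, 0.3);
        java.util.Arrays.fill(scores, 90, 100, 0.9);
        System.out.printf("unweighted: %.3f%n", weightedScore(scores, 1.0));
        // With decay 0.98, the fresh trees' verdict counts for more.
        System.out.printf("decay 0.98: %.3f%n", weightedScore(scores, 0.98));
    }
}
```

A decay of 1.0 recovers the plain ensemble mean; anything below 1.0 biases toward recent trees without discarding the old ones.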
&lt;h2 id=&quot;performance&quot;&gt;Performance&lt;/h2&gt;
&lt;p&gt;Measured on an Intel Core Ultra 7 258V (8 cores, Lunar Lake), JDK 25, 100 trees with sample size 256:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Batch scoring (online processing)&lt;/strong&gt;&lt;/p&gt;
&lt;table style=&quot;width: 100%; border-collapse: collapse; margin: 1rem 0;&quot;&gt;
  &lt;thead&gt;
    &lt;tr style=&quot;text-align: left; border-bottom: 1px solid var(--border);&quot;&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Batch size&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Latency&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 0;&quot;&gt;Use case&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;1 row&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;6 μs&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Single transaction check&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;10 rows&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;19 μs&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Micro-batch&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;100 rows&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;163 μs&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;API batch&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;1,000 rows&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;1.6 ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Payment gateway batch&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;10,000 rows&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;16 ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Bulk ingest check&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;At 6 microseconds per row, anomaly scoring adds negligible overhead to any transaction processing pipeline. A payment gateway checking 1,000 transactions per batch stays under 2 ms — well within the latency budget that even real-time payment systems allow for fraud checks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Full-table scoring (analytics)&lt;/strong&gt;&lt;/p&gt;
&lt;table style=&quot;width: 100%; border-collapse: collapse; margin: 1rem 0;&quot;&gt;
  &lt;thead&gt;
    &lt;tr style=&quot;text-align: left; border-bottom: 1px solid var(--border);&quot;&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Operation&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;1M rows&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 0;&quot;&gt;10M rows&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Train (100 trees × 256 samples)&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;~1 ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;6 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Score (parallel, 8 cores)&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;448 ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;4.6 s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Score (single-threaded)&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;~1.7 s&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;17 s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Model memory&lt;/td&gt;
      &lt;td colspan=&quot;2&quot; style=&quot;padding: 0.5rem 0;&quot;&gt;~0.4 MB (100 trees × 511 nodes × 8 bytes)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Training is near-instant because it only reads 25,600 rows total (256 per tree), regardless of dataset size. Scoring scales linearly and parallelizes across cores with morsel-driven execution — each morsel sized to fit L1 cache for branchless tree traversal.&lt;/p&gt;
&lt;p&gt;The isolation forest validates against standard ODDS benchmark datasets (Shuttle, Http, ForestCover, Mammography, CreditCard) with AUC-ROC scores matching or exceeding scikit-learn’s implementation at equivalent hyperparameters. The benchmark suite includes a head-to-head comparison with &lt;a href=&quot;https://pyod.readthedocs.io/&quot;&gt;PyOD&lt;/a&gt; that you can run yourself: &lt;code&gt;clj -M:iforest pyod&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&quot;under-the-hood&quot;&gt;Under the hood&lt;/h2&gt;
&lt;p&gt;The tree structure is packed for cache efficiency. Each node is a single &lt;code&gt;long&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Internal nodes: split feature index (upper 32 bits) + split value as float (lower 32 bits)&lt;/li&gt;
&lt;li&gt;Leaf nodes: path length adjustment stored as &lt;code&gt;Double.doubleToRawLongBits(c(leafSize))&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Trees are contiguous in memory: &lt;code&gt;forest[tree × maxNodes + nodeIdx]&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Scoring traverses each tree with a branchless comparison — &lt;code&gt;node = 2*node + 1 + (val &gt;= splitVal ? 1 : 0)&lt;/code&gt; — no branch misprediction, no pointer chasing. The anomaly score is &lt;code&gt;2^(-E(h(x)) / c(ψ))&lt;/code&gt; where &lt;code&gt;E(h(x))&lt;/code&gt; is the mean path length across all trees and &lt;code&gt;c(ψ)&lt;/code&gt; is the expected path length of an unsuccessful BST search, normalizing scores to [0, 1].&lt;/p&gt;
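&lt;p&gt;The mechanics are easier to see in a plain-Python sketch. This is a toy model of the same layout and math - tuples instead of packed &lt;code&gt;long&lt;/code&gt;s, no SIMD - not Stratum’s actual implementation:&lt;/p&gt;

```python
import math

EULER_GAMMA = 0.5772156649

def c(n):
    """Expected path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def path_length(tree, x, max_depth=8):
    """Walk one tree stored as an implicit array (children of i at 2i+1, 2i+2).
    Nodes are (feature, split, leaf_adjust) tuples; leaves have feature=None."""
    node, depth = 0, 0
    while depth < max_depth:
        feature, split, leaf_adjust = tree[node]
        if feature is None:          # leaf: add the stored c(leaf_size) adjustment
            return depth + leaf_adjust
        # index step mirrors the engine's branchless update:
        # node = 2*node + 1 + (val >= splitVal ? 1 : 0)
        node = 2 * node + 1 + (1 if x[feature] >= split else 0)
        depth += 1
    return float(depth)

def anomaly_score(forest, x, sample_size=256):
    """score = 2^(-E(h(x)) / c(psi)); short mean paths give scores near 1."""
    mean_path = sum(path_length(t, x) for t in forest) / len(forest)
    return 2.0 ** (-mean_path / c(sample_size))
```

&lt;p&gt;A point isolated after a single split against ψ = 256 scores above 0.9 - the “few splits to isolate, therefore anomalous” intuition made literal.&lt;/p&gt;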
&lt;p&gt;Parallelism uses the same morsel-driven architecture as the rest of the query engine: the ForkJoinPool processes rows in 64K-row morsels, each morsel’s feature data fitting in L1 cache. No lock contention — each thread accumulates independently into its own score region.&lt;/p&gt;
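&lt;p&gt;The pattern, minus the JVM specifics, looks like this (a sketch using Python threads in place of the ForkJoinPool; only the morsel size is taken from the text):&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

MORSEL = 65536  # 64K rows per morsel, as in the engine

def score_all(rows, score_one, workers=8):
    """Each task scores one disjoint morsel and writes into its own
    region of the output list - no shared mutable state, no locks."""
    out = [0.0] * len(rows)

    def work(start):
        for i in range(start, min(start + MORSEL, len(rows))):
            out[i] = score_one(rows[i])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # one task per morsel; list() forces all tasks to complete
        list(pool.map(work, range(0, len(rows), MORSEL)))
    return out
```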
&lt;p&gt;The confidence metric (&lt;code&gt;predict-confidence&lt;/code&gt;) uses the coefficient of variation of per-tree path lengths. When trees agree on a point’s isolation depth, confidence is high. When they disagree — the point sits near a decision boundary — confidence is low. This gives you a principled way to triage uncertain predictions rather than trusting every score blindly.&lt;/p&gt;
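&lt;p&gt;A sketch of the idea - the final squashing of the coefficient of variation into a confidence value is our own illustrative choice here; the engine only commits to “low variation means high confidence”:&lt;/p&gt;

```python
import statistics

def predict_confidence(per_tree_paths):
    """Agreement between trees via the coefficient of variation (CV) of
    per-tree path lengths: CV = stddev / mean. Trees that agree on a
    point's isolation depth give CV near 0 and confidence near 1."""
    mean = statistics.fmean(per_tree_paths)
    if mean == 0:
        return 0.0
    cv = statistics.pstdev(per_tree_paths) / mean
    # Illustrative mapping of CV in [0, inf) down to (0, 1]; the real
    # predict-confidence may normalize differently.
    return 1.0 / (1.0 + cv)
```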
&lt;h2 id=&quot;what-this-enables&quot;&gt;What this enables&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Online payment fraud detection.&lt;/strong&gt; At 6 μs per transaction, anomaly scoring can sit directly in the payment authorization path — not as a post-hoc batch job, but as a synchronous check before the charge clears. Train on your historical transaction data, register the model, and every &lt;code&gt;SELECT&lt;/code&gt; against the transactions table can include &lt;code&gt;ANOMALY_SCORE&lt;/code&gt; inline. For batch settlement processing, 1,000 transactions score in 1.6 ms. The model stays in-process — no network hop to an external ML service, no serialization overhead, no additional point of failure in the payment critical path.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data quality monitoring.&lt;/strong&gt; Run &lt;code&gt;ANOMALY_SCORE&lt;/code&gt; over your staging table before promoting to production. Flag rows that don’t fit the historical distribution. Catch data pipeline bugs before they propagate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;IoT sensor monitoring.&lt;/strong&gt; Train on a baseline period of normal sensor readings. Score incoming data. When vibration, temperature, and power consumption are each individually normal but their &lt;em&gt;combination&lt;/em&gt; is anomalous, the isolation forest catches it — z-scores don’t.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Versioned anomaly detection.&lt;/strong&gt; Because Stratum datasets are immutable values with &lt;a href=&quot;/notes/the-git-model-for-databases&quot;&gt;copy-on-write branching&lt;/a&gt;, you can score against historical snapshots. “What would this model have flagged last quarter?” is a query, not a data engineering project.&lt;/p&gt;
&lt;h2 id=&quot;try-it-yourself&quot;&gt;Try it yourself&lt;/h2&gt;
&lt;p&gt;Start the demo server — it loads 100K taxi ride rows and a pre-trained anomaly model:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;java&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --add-modules&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; jdk.incubator.vector&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;     --enable-native-access=ALL-UNNAMED&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;     -jar&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; stratum-standalone.jar&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --demo&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Connect with any PostgreSQL client and run real anomaly queries immediately:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;psql&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -h&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; localhost&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -p&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 5432&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -U&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; stratum&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Find the most anomalous taxi rides&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; fare_amount, tip_amount, pickup_hour,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       ANOMALY_SCORE(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;taxi_anomaly&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, fare_amount, tip_amount,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;                     total_amount, passenger_count, pickup_hour) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; score&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; taxi&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ANOMALY_SCORE(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;taxi_anomaly&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, fare_amount, tip_amount,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;                    total_amount, passenger_count, pickup_hour) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;7&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; score &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 20&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Binary prediction: which rides are anomalous?&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; fare_amount, tip_amount,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       ANOMALY_PREDICT(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;taxi_anomaly&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, fare_amount, tip_amount,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;                       total_amount, passenger_count, pickup_hour) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; is_anomaly&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; taxi&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ANOMALY_PREDICT(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;taxi_anomaly&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, fare_amount, tip_amount,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;                      total_amount, passenger_count, pickup_hour) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- How confident is the model about each prediction?&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; fare_amount,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       ANOMALY_SCORE(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;taxi_anomaly&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, fare_amount, tip_amount,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;                     total_amount, passenger_count, pickup_hour) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; score,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       ANOMALY_CONFIDENCE(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;taxi_anomaly&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, fare_amount, tip_amount,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;                          total_amount, passenger_count, pickup_hour) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; confidence&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; taxi&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; score &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The demo dataset includes synthetic anomalies — high fares with zero tips late at night — that the model detects out of the box. But the model also finds natural outliers in the data: unusual combinations of fare, tip, passenger count, and hour that don’t match the bulk distribution.&lt;/p&gt;
&lt;h2 id=&quot;getting-started-with-your-own-data&quot;&gt;Getting started with your own data&lt;/h2&gt;
&lt;p&gt;Start the server (requires JDK 21+):&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;java&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --add-modules&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; jdk.incubator.vector&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;     --enable-native-access=ALL-UNNAMED&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;     -jar&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; stratum-standalone.jar&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then connect with any PostgreSQL client and do everything from SQL:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Load your data&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; transactions&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (amount &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DOUBLE PRECISION&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, freq &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;BIGINT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;hour&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; BIGINT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;INSERT INTO&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;VALUES&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;5&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;14&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;), (&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;15&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;6&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;9&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;), ...;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Or query directly from files&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; read_csv(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;/path/to/transactions.csv&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Train a model&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; MODEL fraud_model&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  TYPE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ISOLATION_FOREST&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  OPTIONS (n_trees &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 200&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, contamination &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;05&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AS&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; amount, freq, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;hour&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Score your data&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, ANOMALY_SCORE(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;fraud_model&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, amount, freq, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;hour&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; score&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; score &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For programmatic workflows, Stratum also has a Clojure API for model training, online rotation, and integration with the query engine. Add to &lt;code&gt;deps.edn&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;{:deps {org.replikativ/stratum {:mvn/version &quot;RELEASE&quot;}}}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;{:deps {org.replikativ/stratum {:mvn/version &quot;RELEASE&quot;}}}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Source and full documentation: &lt;a href=&quot;https://github.com/replikativ/stratum&quot;&gt;github.com/replikativ/stratum&lt;/a&gt;. The &lt;a href=&quot;https://github.com/replikativ/stratum/blob/main/doc/anomaly-detection.md&quot;&gt;anomaly detection guide&lt;/a&gt; has the complete API reference.&lt;/p&gt;
&lt;p&gt;Feedback welcome on &lt;a href=&quot;https://clojurians.slack.com/archives/CB7GJAN0L&quot;&gt;Clojurians #datahike&lt;/a&gt; or &lt;a href=&quot;mailto:contact@datahike.io&quot;&gt;email&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;</content:encoded></item><item><title>Memory That Collaborates</title><link>https://datahike.io/notes/collaborate-without-infrastructure/</link><guid isPermaLink="true">https://datahike.io/notes/collaborate-without-infrastructure/</guid><description>How Datahike&apos;s distributed index space lets independent processes share and join databases through storage alone.</description><pubDate>Wed, 25 Mar 2026 23:00:00 GMT</pubDate><content:encoded>&lt;div class=&quot;container page prose&quot;&gt;
&lt;h1 id=&quot;memory-that-collaborates&quot;&gt;Memory That Collaborates&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;March 2026&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;When two teams need to combine data, the usual answer is infrastructure: an ETL pipeline, an API, a message bus. Each adds latency, maintenance burden, and a new failure mode. The data moves because the systems can’t share it in place.&lt;/p&gt;
&lt;p&gt;There’s a simpler model. If your database is an immutable value in storage, then anyone who can read the storage can query it. No server to run, no API to negotiate, no data to copy. And if your query language supports multiple inputs, you can join databases from different teams in a single expression.&lt;/p&gt;
&lt;p&gt;This is how &lt;a href=&quot;https://datahike.io&quot;&gt;Datahike&lt;/a&gt; works. It isn’t a feature we bolted on - it falls out naturally from two properties fundamental to the architecture.&lt;/p&gt;
&lt;h2 id=&quot;databases-are-values&quot;&gt;Databases are values&lt;/h2&gt;
&lt;p&gt;In a traditional database, you query through a connection to a running server. The data may change between queries. The database is a service, not something you hold.&lt;/p&gt;
&lt;p&gt;Datahike inverts this. Dereference a connection (&lt;code&gt;@conn&lt;/code&gt;) and you get an immutable database value - a snapshot frozen at a specific transaction. It won’t change. Pass it to a function, hold it in a variable, hand it to another thread. Two concurrent readers holding the same snapshot always agree, without locks or coordination.&lt;/p&gt;
&lt;p&gt;This is an idea Rich Hickey introduced with &lt;a href=&quot;https://www.infoq.com/presentations/Datomic-Database-Value/&quot;&gt;Datomic&lt;/a&gt; in 2012: separate &lt;em&gt;process&lt;/em&gt; (writes, managed by a single writer) from &lt;em&gt;perception&lt;/em&gt; (reads, which are just values). The insight was that a correct implementation of perception does not require coordination.&lt;/p&gt;
&lt;p&gt;Datomic’s indices live in storage, but its transactor holds an in-memory overlay of recent index segments that haven’t been flushed yet. Readers typically need to coordinate with the transactor to get a complete, current view. The storage alone isn’t enough.&lt;/p&gt;
&lt;p&gt;Datahike removes that dependency. The writer flushes to storage on every transaction, so storage is always authoritative. Any process that can read the store sees the full, current database - no overlay, no transactor connection needed. To understand why this works, you need to see how the data is structured.&lt;/p&gt;
&lt;h2 id=&quot;trees-in-storage&quot;&gt;Trees in storage&lt;/h2&gt;
&lt;p&gt;Datahike keeps its indices in a &lt;a href=&quot;https://github.com/replikativ/persistent-sorted-set&quot;&gt;persistent sorted set&lt;/a&gt; - a B-tree variant where nodes are immutable. Every node is stored as a key-value pair in &lt;a href=&quot;https://github.com/replikativ/konserve&quot;&gt;konserve&lt;/a&gt;, which abstracts over storage backends: S3, filesystem, JDBC, IndexedDB.&lt;/p&gt;
&lt;p&gt;When a transaction adds data, Datahike doesn’t modify existing nodes. It creates new nodes for the changed path from leaf to root, while the unchanged subtrees are shared with the previous version. This is &lt;em&gt;structural sharing&lt;/em&gt; - the same technique behind Clojure’s persistent vectors and Git’s object store.&lt;/p&gt;
&lt;p&gt;A concrete example: a database with a million datoms might have a B-tree with thousands of nodes. A transaction that adds ten datoms rewrites perhaps a dozen nodes along the affected paths. The new tree root points to these new nodes and to the thousands of unchanged nodes from before. Both the old and new snapshots are valid, complete trees. They just share most of their structure.&lt;/p&gt;
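&lt;p&gt;Path copying is easy to sketch with a toy immutable binary tree (deliberately simplified - the persistent sorted set uses wide B-tree nodes - but the sharing mechanics are identical):&lt;/p&gt;

```python
class Node:
    __slots__ = ("key", "left", "right")
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def insert(node, key):
    """Return a new root. Only the root-to-leaf path is copied; every
    untouched subtree is shared, pointer-identical, with the old version."""
    if node is None:
        return Node(key)
    if key < node.key:
        return Node(node.key, insert(node.left, key), node.right)
    if key > node.key:
        return Node(node.key, node.left, insert(node.right, key))
    return node  # key already present: share the whole subtree

# Snapshot v1, then derive v2 by adding one key. v1 stays valid.
v1 = None
for k in [50, 30, 70, 20, 40, 60, 80]:
    v1 = insert(v1, k)
v2 = insert(v1, 65)
```

&lt;p&gt;After the insert, &lt;code&gt;v2&lt;/code&gt; shares the entire left subtree of &lt;code&gt;v1&lt;/code&gt; by identity; only the three nodes on the path to the new key were copied. Both snapshots remain complete, queryable trees.&lt;/p&gt;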
&lt;p&gt;The crucial property: every node is written once and never modified. Its key can be content-addressed. This means nodes can be cached aggressively, replicated independently, and read by any process that has access to the storage - without coordinating with the process that wrote them. (For more on how structural sharing, branching, and the tradeoffs work, see &lt;a href=&quot;/notes/the-git-model-for-databases&quot;&gt;The Git Model for Databases&lt;/a&gt;.)&lt;/p&gt;
&lt;h2 id=&quot;the-distributed-index-space&quot;&gt;The distributed index space&lt;/h2&gt;
&lt;p&gt;This is where it comes together.&lt;/p&gt;
&lt;p&gt;When you call &lt;code&gt;@conn&lt;/code&gt;, Datahike fetches one key from the konserve store: the &lt;strong&gt;branch head&lt;/strong&gt; (e.g. &lt;code&gt;:db&lt;/code&gt;). This returns a small map containing root pointers for each index, schema metadata, and the current transaction ID. Nothing else is loaded - the database value you receive is a lazy handle into the tree.&lt;/p&gt;
&lt;p&gt;When a query traverses the index, each node is fetched on demand from storage and cached in a local LRU. Subsequent queries hitting the same nodes pay no I/O.&lt;/p&gt;
&lt;p&gt;That’s the entire read path. No server process mediating access, no connection protocol, no port to expose. The indices live in storage, and any process that can read the storage can load the branch head, traverse the tree, and run queries. We call this the &lt;a href=&quot;https://github.com/replikativ/datahike/blob/main/doc/distributed.md&quot;&gt;distributed index space&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Two processes reading the same database fetch the same immutable nodes independently. They don’t know about each other. A writer publishes new snapshots by writing new tree nodes, then atomically updating the branch head. Readers that dereference afterward see the new snapshot. Readers holding an earlier snapshot continue undisturbed - their nodes are immutable and won’t be garbage collected while reachable.&lt;/p&gt;
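&lt;p&gt;The whole protocol can be modeled over any key-value store (function and key names here are illustrative, not Datahike’s actual API):&lt;/p&gt;

```python
# Toy model of the read/publish protocol over a shared key-value store.
# 'store' stands in for S3/filesystem/JDBC behind konserve.
store = {}
cache = {}  # per-process node cache (a real reader uses a bounded LRU)

def put_node(key, node):
    store[key] = node  # immutable: written once, never modified

def get_node(key):
    if key not in cache:       # miss: one fetch from storage
        cache[key] = store[key]
    return cache[key]          # hit: no I/O

def publish(branch, root_key, tx_id):
    # Writer flushes all new nodes first, THEN swaps the head atomically.
    store[branch] = {"root": root_key, "tx": tx_id}

def deref(branch):
    # Readers fetch one small map; tree nodes load lazily via get_node.
    return store[branch]
```

&lt;p&gt;A reader that dereferenced before a publish keeps its old head map and its old, still-reachable nodes; a reader that dereferences after sees the new root. No coordination between them is ever required.&lt;/p&gt;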
&lt;h2 id=&quot;joining-across-databases&quot;&gt;Joining across databases&lt;/h2&gt;
&lt;p&gt;Because databases are values and Datalog natively supports multiple input sources, the next step is natural: join databases from different teams, different storage backends, or different points in time - in a single query.&lt;/p&gt;
&lt;p&gt;Team A maintains a product catalog on S3. Team B maintains inventory on a separate bucket. A third team joins them without either team doing anything:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;def catalog: d/connect({:store {:backend :s3, :bucket &quot;team-a&quot;}})
def inventory: d/connect({:store {:backend :s3, :bucket &quot;team-b&quot;}})

d/q(&apos;[:find ?name ?price ?stock
      :in $cat $inv
      :where [$cat ?p :product/sku ?sku]
             [$cat ?p :product/name ?name]
             [$cat ?p :product/price ?price]
             [$inv ?i :stock/sku ?sku]
             [$inv ?i :stock/count ?stock]
             [?stock &gt; 0]]
  @catalog @inventory)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(def catalog   (d/connect {:store {:backend :s3 :bucket &quot;team-a&quot;}}))
(def inventory (d/connect {:store {:backend :s3 :bucket &quot;team-b&quot;}}))

(d/q &apos;[:find ?name ?price ?stock
       :in $cat $inv
       :where [$cat ?p :product/sku   ?sku]
              [$cat ?p :product/name  ?name]
              [$cat ?p :product/price ?price]
              [$inv ?i :stock/sku     ?sku]
              [$inv ?i :stock/count   ?stock]
              [(&gt; ?stock 0)]]
  @catalog @inventory)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Each &lt;code&gt;@&lt;/code&gt; dereference fetches a branch head from its respective S3 bucket and returns an immutable database value. The query engine joins them locally. There is no server coordinating between the two, no data copied.&lt;/p&gt;
&lt;p&gt;And because both are values, you can mix snapshots from different points in time:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;;; Last quarter&apos;s catalog crossed with current inventory
def old-catalog: d/as-of(@catalog #inst &quot;2025-11-01&quot;)

d/q(&apos;[:find ?name ?stock
      :in $cat $inv
      :where [$cat ?p :product/sku ?sku]
             [$cat ?p :product/name ?name]
             [$inv ?i :stock/sku ?sku]
             [$inv ?i :stock/count ?stock]]
  old-catalog @inventory)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;;; Last quarter&apos;s catalog crossed with current inventory
(def old-catalog (d/as-of @catalog #inst &quot;2025-11-01&quot;))

(d/q &apos;[:find ?name ?stock
       :in $cat $inv
       :where [$cat ?p :product/sku  ?sku]
              [$cat ?p :product/name ?name]
              [$inv ?i :stock/sku    ?sku]
              [$inv ?i :stock/count  ?stock]]
  old-catalog @inventory)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The old snapshot and the current one are both just values. The query engine doesn’t care when they’re from. This is useful for audits, regulatory reproducibility, and debugging: “what would this report have shown against last quarter’s data?”&lt;/p&gt;
&lt;h2 id=&quot;from-storage-to-browsers&quot;&gt;From storage to browsers&lt;/h2&gt;
&lt;p&gt;So far, “storage” has meant S3 or a filesystem. But konserve also has an IndexedDB backend, which means the same model works in a browser. Using &lt;a href=&quot;https://github.com/replikativ/kabel&quot;&gt;Kabel&lt;/a&gt; WebSocket sync and &lt;a href=&quot;https://github.com/replikativ/konserve-sync&quot;&gt;konserve-sync&lt;/a&gt;, a browser client replicates a database locally into IndexedDB. Queries run against the local replica with zero network round-trips. Updates sync differentially - only changed tree nodes are transmitted; the same structural sharing that makes snapshots cheap on the server makes sync cheap over the wire.&lt;/p&gt;
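&lt;p&gt;The differential step reduces to a set difference over node addresses. A toy Python model (invented names; not konserve-sync’s actual protocol):&lt;/p&gt;

```python
# Toy model: content addressing lets client and server negotiate a sync
# by exchanging addresses, then shipping only the nodes the client lacks.
# After one commit, the server has a new root (n5) and one new leaf (n4);
# the unchanged subtrees (n2, n3) are shared with the previous version.
server_nodes = {"n5": "root-v2", "n2": "left", "n3": "right", "n4": "new-leaf"}
client_nodes = {"n1": "root-v1", "n2": "left", "n3": "right"}  # earlier replica

missing = set(server_nodes) - set(client_nodes)  # negotiated by address only
payload = {addr: server_nodes[addr] for addr in missing}
client_nodes.update(payload)  # the old root n1 lingers until garbage collection

assert missing == {"n4", "n5"}  # only the changed path crosses the wire
assert all(addr in client_nodes for addr in server_nodes)
```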
&lt;h2 id=&quot;try-it&quot;&gt;Try it&lt;/h2&gt;
&lt;p&gt;A complete cross-database join, runnable in a Clojure REPL:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;require(&apos;[datahike.api :as d])

;; Two independent databases
def catalog-cfg: {:store {:backend :memory, :id java.util.UUID/randomUUID()}, :schema-flexibility :read}
def inventory-cfg: {:store {:backend :memory, :id java.util.UUID/randomUUID()}, :schema-flexibility :read}

d/create-database(catalog-cfg)
d/create-database(inventory-cfg)

def catalog: d/connect(catalog-cfg)
def inventory: d/connect(inventory-cfg)

;; Team A: products
d/transact(catalog
  [{:product/sku &quot;W001&quot;, :product/name &quot;Widget&quot;, :product/price 9.99}
   {:product/sku &quot;G002&quot;, :product/name &quot;Gadget&quot;, :product/price 24.5}
   {:product/sku &quot;T003&quot;, :product/name &quot;Thingamajig&quot;, :product/price 3.75}])

;; Team B: stock levels
d/transact(inventory
  [{:stock/sku &quot;W001&quot;, :stock/count 140}
   {:stock/sku &quot;G002&quot;, :stock/count 0}
   {:stock/sku &quot;T003&quot;, :stock/count 58}])

;; Join: in-stock products with price
d/q(&apos;[:find ?name ?price ?stock
      :in $cat $inv
      :where [$cat ?p :product/sku ?sku]
             [$cat ?p :product/name ?name]
             [$cat ?p :product/price ?price]
             [$inv ?i :stock/sku ?sku]
             [$inv ?i :stock/count ?stock]
             [?stock &gt; 0]]
  @catalog @inventory)
;; =&gt; #{[&quot;Widget&quot; 9.99 140] [&quot;Thingamajig&quot; 3.75 58]}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(require &apos;[datahike.api :as d])

;; Two independent databases
(def catalog-cfg  {:store {:backend :memory
                           :id (java.util.UUID/randomUUID)}
                   :schema-flexibility :read})
(def inventory-cfg {:store {:backend :memory
                            :id (java.util.UUID/randomUUID)}
                    :schema-flexibility :read})

(d/create-database catalog-cfg)
(d/create-database inventory-cfg)

(def catalog  (d/connect catalog-cfg))
(def inventory (d/connect inventory-cfg))

;; Team A: products
(d/transact catalog
  [{:product/sku &quot;W001&quot; :product/name &quot;Widget&quot;      :product/price 9.99}
   {:product/sku &quot;G002&quot; :product/name &quot;Gadget&quot;      :product/price 24.50}
   {:product/sku &quot;T003&quot; :product/name &quot;Thingamajig&quot; :product/price 3.75}])

;; Team B: stock levels
(d/transact inventory
  [{:stock/sku &quot;W001&quot; :stock/count 140}
   {:stock/sku &quot;G002&quot; :stock/count 0}
   {:stock/sku &quot;T003&quot; :stock/count 58}])

;; Join: in-stock products with price
(d/q &apos;[:find ?name ?price ?stock
       :in $cat $inv
       :where [$cat ?p :product/sku   ?sku]
              [$cat ?p :product/name  ?name]
              [$cat ?p :product/price ?price]
              [$inv ?i :stock/sku     ?sku]
              [$inv ?i :stock/count   ?stock]
              [(&gt; ?stock 0)]]
  @catalog @inventory)
;; =&gt; #{[&quot;Widget&quot; 9.99 140] [&quot;Thingamajig&quot; 3.75 58]}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Replace &lt;code&gt;:memory&lt;/code&gt; with &lt;code&gt;:s3&lt;/code&gt;, &lt;code&gt;:file&lt;/code&gt;, or &lt;code&gt;:jdbc&lt;/code&gt; and the same code works across storage backends. The databases don’t need to share a backend - join an S3 database against a local file store in the same query.&lt;/p&gt;
&lt;/div&gt;</content:encoded></item><item><title>Datahike Speaks Postgres</title><link>https://datahike.io/notes/datahike-speaks-postgres/</link><guid isPermaLink="true">https://datahike.io/notes/datahike-speaks-postgres/</guid><description>pg-datahike beta — pgwire access to Datahike. ORMs, migrations, and psql work, with branches, time-travel, and immutable snapshots underneath.</description><pubDate>Thu, 07 May 2026 00:00:00 GMT</pubDate><content:encoded>&lt;div class=&quot;container page prose&quot;&gt;
&lt;h1 id=&quot;datahike-speaks-postgres&quot;&gt;Datahike Speaks Postgres&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;May 2026&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Open psql. Connect. Run a query. Switch branches. Run it again — same connection, same wire protocol, different version of the database.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$ psql postgresql:&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;//&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;localhost:&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;5432&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;/&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; widget;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; count&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-------&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  4218&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;branch&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;pricing-experiment&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; widget;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; count&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-------&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  4221&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; RESET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;branch&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;RESET&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That’s not a feature toggle on a Postgres replica. It’s the same database — addressed through standard pgwire — viewed through two different commits. The implementation is &lt;a href=&quot;https://github.com/replikativ/pg-datahike&quot;&gt;pg-datahike&lt;/a&gt;, a beta we’re shipping today.&lt;/p&gt;
&lt;h2 id=&quot;what-it-is&quot;&gt;What it is&lt;/h2&gt;
&lt;p&gt;pg-datahike embeds a PostgreSQL-compatible adapter inside a Datahike process: wire protocol, SQL translator, virtual &lt;code&gt;pg_*&lt;/code&gt; and &lt;code&gt;information_schema&lt;/code&gt; catalogs, constraint enforcement, schema hints. Clients that speak Postgres talk to Datahike without a Postgres install — pgjdbc, Hibernate, SQLAlchemy, Odoo 19, and Metabase bootstrap unmodified against it. The migration path is round-trippable: &lt;code&gt;pg_dump&lt;/code&gt; output replays into pg-datahike via &lt;code&gt;psql&lt;/code&gt;, and the standalone jar dumps Datahike databases back out as portable PG SQL. Detailed test results at the end of this post.&lt;/p&gt;
&lt;h2 id=&quot;a-60-second-tour&quot;&gt;A 60-second tour&lt;/h2&gt;
&lt;p&gt;The operator runs one jar. Everything else is &lt;code&gt;psql&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;$&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; java&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -jar&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; pg-datahike-VERSION-standalone.jar&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;pg-datahike&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; VERSION&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; ready&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; on&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; 127.0.0.1:5432&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;  backend:&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  file&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (~/.local/share/pg-datahike)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  history&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;:&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  off&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;  CREATE&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; DATABASE:&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  enabled&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;  databases:&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; [&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;datahike&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;Connect&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; with:&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; psql&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -h&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 127.0.0.1&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -p&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 5432&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -U&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; datahike&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;Press&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; Ctrl+C&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; to&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; stop.&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;JDK 17+ is the only prerequisite; the jar is on &lt;a href=&quot;https://github.com/replikativ/pg-datahike/releases&quot;&gt;GitHub releases&lt;/a&gt;. &lt;code&gt;--memory&lt;/code&gt; for an ephemeral run; &lt;code&gt;--help&lt;/code&gt; covers the rest.&lt;/p&gt;
&lt;p&gt;The rest is &lt;code&gt;psql&lt;/code&gt; — provision a fresh database, populate it, pin a session to a historical commit, drop it.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$ psql postgresql:&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;//&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;localhost:&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;5432&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;/&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;datahike&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;datahike&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DATABASE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; inventory;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DATABASE&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;datahike&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; \c inventory&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;You are &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; connected &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;to&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; database&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;inventory&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; widget (sku &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;TEXT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; PRIMARY KEY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;weight&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; INT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; INSERT INTO&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; widget &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;VALUES&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;A&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;), (&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;B&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;20&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;INSERT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 2&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;commit_id&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;                commit_id&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;---------------------------------------&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; b4f2e1c0&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;2feb&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;5b61&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;be14&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;5590b9e01e48      ← &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;copy&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; this&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; INSERT INTO&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; widget &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;VALUES&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;C&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;30&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;INSERT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; widget;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; count&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-------&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;     3&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;commit_id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;b4f2e1c0-2feb-5b61-be14-5590b9e01e48&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; widget;     &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- the database before the third insert&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; count&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-------&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;     2&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; RESET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;commit_id&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;RESET&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; \c datahike&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;datahike&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DROP&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DATABASE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; inventory;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DROP&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DATABASE&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;SET datahike.commit_id&lt;/code&gt; pins the session to a historical commit; everything else is plain Postgres. Sixty seconds, one jar, no Postgres install, no Clojure.&lt;/p&gt;
&lt;h2 id=&quot;architecture-in-one-minute&quot;&gt;Architecture in one minute&lt;/h2&gt;
&lt;p&gt;What happens when you &lt;code&gt;SET datahike.branch = &apos;feature&apos;&lt;/code&gt;?&lt;/p&gt;
&lt;p&gt;Datahike stores its database as a tree of immutable nodes in &lt;a href=&quot;https://github.com/replikativ/konserve&quot;&gt;konserve&lt;/a&gt;, a key-value abstraction over filesystems, S3, JDBC, IndexedDB, and others. Every transaction writes new nodes for changed paths and shares unchanged subtrees with the previous version — the trick behind Clojure’s persistent vectors and Git’s object store. A commit is a small map listing the root pointers for each index; a branch is a named pointer at a commit.&lt;/p&gt;
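&lt;p&gt;A minimal path-copying sketch of that structural sharing (a Python illustration of the idea, not Datahike’s persistent index implementation):&lt;/p&gt;

```python
def assoc_in(node, path, value):
    """Return a new tree with value at path; untouched subtrees are shared."""
    if not path:
        return value
    new = dict(node)  # copy only the node on the changed path
    new[path[0]] = assoc_in(node.get(path[0], {}), path[1:], value)
    return new

# Two index roots; this "transaction" rewrites one path in one index
# (a real transaction would update every index; one suffices to show sharing).
v1 = {"eavt": {"e1": {"price": 9.99}}, "aevt": {"price": {"e1": 9.99}}}
v2 = assoc_in(v1, ["eavt", "e1", "price"], 12.50)

assert v1["eavt"]["e1"]["price"] == 9.99   # the old commit is intact
assert v2["eavt"]["e1"]["price"] == 12.50  # the new commit sees the change
assert v2["aevt"] is v1["aevt"]            # the unchanged index root is shared
```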
&lt;p&gt;So on &lt;code&gt;SET datahike.branch = &apos;feature&apos;&lt;/code&gt;, the handler updates a session variable, and the next query loads that branch’s commit pointer from konserve, walks the tree, returns rows. No coordination with a transactor; storage is the source of truth. &lt;code&gt;SET datahike.commit_id = &apos;&amp;#x3C;uuid&gt;&apos;&lt;/code&gt; works the same way one level deeper — the session points at a specific commit instead of a branch head.&lt;/p&gt;
&lt;p&gt;Two consequences worth flagging:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Branching is one konserve write.&lt;/strong&gt; Creating a branch from any commit is constant time, regardless of database size, because structural sharing means the new branch points at existing nodes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reads don’t go through a transactor.&lt;/strong&gt; Every node is content-addressable; any process that can read the storage can run queries against it. In principle, read fanout is bounded by storage bandwidth, not replica capacity — we’ll publish numbers in a follow-up. See &lt;a href=&quot;/notes/collaborate-without-infrastructure&quot;&gt;Memory That Collaborates&lt;/a&gt; for more.&lt;/li&gt;
&lt;/ul&gt;
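&lt;p&gt;Both points fall out of commits being small pointer maps. A hypothetical sketch of the session resolution (all names invented for illustration; not pg-datahike’s internals):&lt;/p&gt;

```python
# Commits are small maps of index-root pointers; branches are named pointers.
commits = {
    "c1": {"eavt": "n1", "aevt": "n2"},
    "c2": {"eavt": "n3", "aevt": "n2"},  # shares an unchanged root with c1
}
branches = {"main": "c2"}

session = {"branch": "main", "commit_id": None}  # what SET datahike.* mutates

def resolve_roots(session):
    commit = session["commit_id"] or branches[session["branch"]]
    return commits[commit]

assert resolve_roots(session) == commits["c2"]  # follows the branch head

branches["feature"] = "c1"       # creating a branch: one pointer write
session["branch"] = "feature"
assert resolve_roots(session) == commits["c1"]

session["commit_id"] = "c2"      # pinning a commit overrides the branch
assert resolve_roots(session) == commits["c2"]
```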
&lt;h2 id=&quot;integration-patterns&quot;&gt;Integration patterns&lt;/h2&gt;
&lt;h3 id=&quot;1-multi-database-server&quot;&gt;1. Multi-database server&lt;/h3&gt;
&lt;p&gt;A single &lt;code&gt;start-server&lt;/code&gt; call serves many Datahike connections. Clients route on the JDBC URL’s database name:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;pg/start-server({&quot;prod&quot; prod-conn,
                 &quot;staging&quot; staging-conn,
                 &quot;reports&quot; reports-conn}
  {:port 5432})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(pg/start-server {&quot;prod&quot;    prod-conn
                  &quot;staging&quot; staging-conn
                  &quot;reports&quot; reports-conn}
                 {:port 5432})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Same shape on the standalone jar with repeatable &lt;code&gt;--db&lt;/code&gt; flags: &lt;code&gt;java -jar pg-datahike.jar --db prod --db staging --db reports&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;jdbc:postgresql://localhost:5432/prod      → prod-conn&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;jdbc:postgresql://localhost:5432/staging   → staging-conn&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;jdbc:postgresql://localhost:5432/nonsuch   → 3D000 invalid_catalog_name&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;SELECT current_database()&lt;/code&gt; returns the connected name; &lt;code&gt;pg_database&lt;/code&gt; enumerates the registry. Useful for multi-tenant deployments, or when ops wants one pgwire endpoint serving many independent stores.&lt;/p&gt;
&lt;h3 id=&quot;2-schema-hints&quot;&gt;2. Schema hints&lt;/h3&gt;
&lt;p&gt;Existing Datahike schemas don’t always look the way you’d want them to over SQL. &lt;code&gt;:datahike.pg/*&lt;/code&gt; meta-attributes customize the SQL view without touching the underlying schema:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;pg/set-hint!(conn :person/full_name {:column &quot;name&quot;})
pg/set-hint!(conn :person/ssn {:hidden true})
pg/set-hint!(conn :person/company {:references :company/id})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(pg/set-hint! conn :person/full_name {:column &quot;name&quot;})           ; rename the column
(pg/set-hint! conn :person/ssn       {:hidden true})             ; exclude from SQL
(pg/set-hint! conn :person/company   {:references :company/id})  ; FK target&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;After &lt;code&gt;set-hint!&lt;/code&gt;, &lt;code&gt;SELECT name FROM person&lt;/code&gt; works, &lt;code&gt;ssn&lt;/code&gt; is invisible to &lt;code&gt;SELECT *&lt;/code&gt; and &lt;code&gt;information_schema.columns&lt;/code&gt;, and &lt;code&gt;JOIN company c ON p.company = c.id&lt;/code&gt; resolves on Datahike’s native ref semantics.&lt;/p&gt;
&lt;h3 id=&quot;3-time-travel-via-set&quot;&gt;3. Time-travel via SET&lt;/h3&gt;
&lt;p&gt;Datahike’s temporal primitives are exposed as session variables. The client doesn’t need to know what &lt;code&gt;as-of&lt;/code&gt; means — it just sets a variable:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;as_of&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;   =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;2024-01-15T00:00:00Z&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;  &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- d/as-of&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;since&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;   =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;2024-01-01T00:00:00Z&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;  &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- d/since&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;history&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;true&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;                  &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- d/history&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;RESET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;as_of&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Every subsequent query in the session sees the chosen view. A reporting tool that doesn’t know about Datahike can produce point-in-time reports by setting one variable.&lt;/p&gt;
&lt;h3 id=&quot;4-git-like-branching&quot;&gt;4. Git-like branching&lt;/h3&gt;
&lt;p&gt;Branching is cheap in Datahike: every transaction produces a new immutable commit, so a branch is just a named pointer at a commit UUID. Creation is O(1) — one konserve write, no data copy, no WAL replay. pgwire exposes the read side and the admin operations through standard PG mechanisms:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Introspect&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;branches&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;current_branch&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;commit_id&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Admin (konserve-level writes — they don&apos;t go through the tx writer)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;create_branch&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;preview&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;db&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);     &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- &apos;db&apos; is Datahike&apos;s default branch name&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;create_branch&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;from-cid&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;69ea6ee1-…&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;delete_branch&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;preview&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Session view: three cuts on the same immutable log.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- They compose — a feature branch&apos;s state as of yesterday is two SETs.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;branch&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;feature&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;commit_id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;69ea6ee1-2feb-5b61-be14-5590b9e01e48&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;as_of&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;     =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;2024-01-15T00:00:00Z&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or pin a branch at connect time via the JDBC URL:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;jdbc:postgresql://localhost:5432/prod:feature   → prod-conn, pinned to :feature&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;jdbc:postgresql://localhost:5432/prod           → prod-conn, default branch&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
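&lt;p&gt;The routing convention is simple enough to sketch client-side. A hedged illustration — &lt;code&gt;parse_database_segment&lt;/code&gt; is a hypothetical helper showing the convention, not pg-datahike&amp;#x27;s actual parser:&lt;/p&gt;

```python
# Hedged sketch of the /db:branch routing convention shown above.
# parse_database_segment is an illustrative helper, not pg-datahike's parser.

def parse_database_segment(jdbc_url):
    """Return (database, branch) from a pg-datahike JDBC URL."""
    segment = jdbc_url.rsplit("/", 1)[-1]  # database segment after the last '/'
    if ":" in segment:                     # a ':' pins the session to a branch
        db, branch = segment.split(":", 1)
        return db, branch
    return segment, None                   # no pin -> server's default branch

print(parse_database_segment("jdbc:postgresql://localhost:5432/prod:feature"))
print(parse_database_segment("jdbc:postgresql://localhost:5432/prod"))
```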
&lt;p&gt;&lt;code&gt;SET datahike.commit_id = &apos;&amp;#x3C;uuid&gt;&apos;&lt;/code&gt; is Datahike-unique: no other PG-compatible database lets a session pin to an exact commit identifier.&lt;/p&gt;
&lt;p&gt;We’ll cover the structural-sharing model that makes branching this cheap in a follow-up post — including how it works across all the Datahike bindings, not just pgwire.&lt;/p&gt;
&lt;h3 id=&quot;5-sql-driven-database-provisioning&quot;&gt;5. SQL-driven database provisioning&lt;/h3&gt;
&lt;p&gt;Set a &lt;code&gt;:database-template&lt;/code&gt; on the server and pgwire clients self-provision and tear down databases over plain SQL. The template is a partial Datahike config; each &lt;code&gt;CREATE DATABASE&lt;/code&gt; produces a fresh store with a generated UUID:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;pg/start-server({&quot;datahike&quot; boot-conn}
  {:port 5432 :database-template {:store {:backend :memory} :schema-flexibility :write :keep-history? true}})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(pg/start-server {&quot;datahike&quot; boot-conn}
                 {:port 5432
                  :database-template {:store {:backend :memory}
                                      :schema-flexibility :write
                                      :keep-history? true}})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;code&gt;WITH&lt;/code&gt; clauses override the template per-database, and the SQL surface accepts both standard PG forms:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DATABASE&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; myapp&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;                              &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- inherits the template&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DATABASE&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; histdb&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; WITH&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; KEEP_HISTORY &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; true;    &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- override per database&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DATABASE&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; memdb&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  WITH&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (BACKEND &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;memory&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,    &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Yugabyte-style paren form&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;                             INDEX&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;   =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;persistent-set&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DROP&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DATABASE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; myapp;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DROP&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DATABASE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; IF&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; EXISTS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; old_one;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Accepted &lt;code&gt;WITH&lt;/code&gt; keys map case-insensitively to Datahike config:&lt;/p&gt;
&lt;table style=&quot;width: 100%; border-collapse: collapse; margin: 1.5rem 0;&quot;&gt;
  &lt;thead&gt;
    &lt;tr style=&quot;text-align: left; border-bottom: 1px solid var(--border);&quot;&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;&lt;code&gt;WITH&lt;/code&gt; option&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0; white-space: nowrap;&quot;&gt;Datahike config&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 0;&quot;&gt;Notes&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;&lt;code&gt;BACKEND&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; white-space: nowrap;&quot;&gt;&lt;code&gt;[:store :backend]&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;&apos;memory&apos;&lt;/code&gt;, &lt;code&gt;&apos;file&apos;&lt;/code&gt; built-in; &lt;code&gt;&apos;jdbc&apos;&lt;/code&gt;, &lt;code&gt;&apos;s3&apos;&lt;/code&gt;, &lt;code&gt;&apos;redis&apos;&lt;/code&gt;, &lt;code&gt;&apos;lmdb&apos;&lt;/code&gt;, &lt;code&gt;&apos;rocksdb&apos;&lt;/code&gt;, &lt;code&gt;&apos;dynamodb&apos;&lt;/code&gt; via external konserve libraries&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;&lt;code&gt;STORE_ID&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; white-space: nowrap;&quot;&gt;&lt;code&gt;[:store :id]&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Defaults to a fresh UUID per &lt;code&gt;CREATE&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;&lt;code&gt;PATH&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; white-space: nowrap;&quot;&gt;&lt;code&gt;[:store :path]&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;File backend; &lt;code&gt;{{name}}&lt;/code&gt; interpolation supported&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;&lt;code&gt;HOST&lt;/code&gt; / &lt;code&gt;PORT&lt;/code&gt; / &lt;code&gt;USER&lt;/code&gt; / &lt;code&gt;PASSWORD&lt;/code&gt; / &lt;code&gt;DBNAME&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; white-space: nowrap;&quot;&gt;&lt;code&gt;[:store :*]&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;jdbc&lt;/code&gt; / &lt;code&gt;redis&lt;/code&gt; backends&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;&lt;code&gt;SCHEMA_FLEXIBILITY&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; white-space: nowrap;&quot;&gt;&lt;code&gt;:schema-flexibility&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;&apos;read&apos;&lt;/code&gt; or &lt;code&gt;&apos;write&apos;&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;&lt;code&gt;KEEP_HISTORY&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; white-space: nowrap;&quot;&gt;&lt;code&gt;:keep-history?&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;&lt;code&gt;INDEX&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; white-space: nowrap;&quot;&gt;&lt;code&gt;:index&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;&apos;persistent-set&apos;&lt;/code&gt; → &lt;code&gt;:datahike.index/persistent-set&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;&lt;code&gt;OWNER&lt;/code&gt; / &lt;code&gt;TEMPLATE&lt;/code&gt; / &lt;code&gt;ENCODING&lt;/code&gt; / &lt;code&gt;LOCALE&lt;/code&gt; / &lt;code&gt;TABLESPACE&lt;/code&gt; / …&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; white-space: nowrap;&quot;&gt;—&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Postgres-only; silently accepted with a NOTICE so &lt;code&gt;pg_dump&lt;/code&gt; round-trips work&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
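&lt;p&gt;The mapping in the table can be sketched as a small fold — a hedged illustration only (&lt;code&gt;with_to_config&lt;/code&gt; and the dict shapes are assumptions, not pg-datahike&amp;#x27;s resolver), including the case-insensitive keys and the &lt;code&gt;{{name}}&lt;/code&gt; interpolation the &lt;code&gt;PATH&lt;/code&gt; row mentions:&lt;/p&gt;

```python
# Illustrative sketch of the WITH-option -> Datahike config mapping table above.
# with_to_config is a hypothetical helper; key coverage follows the table.

WITH_KEY_PATHS = {
    "backend":            ("store", "backend"),
    "store_id":           ("store", "id"),
    "path":               ("store", "path"),
    "schema_flexibility": ("schema-flexibility",),
    "keep_history":       ("keep-history?",),
    "index":              ("index",),
}

def with_to_config(db_name, options):
    """Fold CREATE DATABASE ... WITH options into a nested config dict."""
    config = {}
    for key, value in options.items():
        path = WITH_KEY_PATHS.get(key.lower())  # keys match case-insensitively
        if path is None:
            continue  # PG-only options (OWNER, ENCODING, ...) accepted as no-ops
        if key.lower() == "path":
            value = value.replace("{{name}}", db_name)  # per-database interpolation
        node = config
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return config

print(with_to_config("histdb", {"KEEP_HISTORY": True, "PATH": "/data/{{name}}"}))
```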
&lt;p&gt;The standalone jar enables this by default (use &lt;code&gt;--no-create-database&lt;/code&gt; to disable). Embedded servers opt in via &lt;code&gt;:database-template&lt;/code&gt; (or explicit &lt;code&gt;:on-create-database&lt;/code&gt; / &lt;code&gt;:on-delete-database&lt;/code&gt; hooks). Without a template or hooks, &lt;code&gt;CREATE&lt;/code&gt; / &lt;code&gt;DROP DATABASE&lt;/code&gt; return SQLSTATE &lt;code&gt;0A000 feature_not_supported&lt;/code&gt;; mismatched preconditions return the standard PG SQLSTATEs.&lt;/p&gt;

&lt;h2 id=&quot;migrating-from-postgresql&quot;&gt;Migrating from PostgreSQL&lt;/h2&gt;
&lt;p&gt;Wire compatibility extends to &lt;code&gt;pg_dump&lt;/code&gt; SQL on both sides. Three workflows.&lt;/p&gt;
&lt;h3 id=&quot;real-postgresql--pg-datahike&quot;&gt;Real PostgreSQL → pg-datahike&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;pg_dump&lt;/code&gt; output replays straight into pg-datahike via &lt;code&gt;psql&lt;/code&gt; or any JDBC client. Schema-side coverage: &lt;code&gt;CREATE TABLE&lt;/code&gt; with FK constraints, &lt;code&gt;CREATE SEQUENCE&lt;/code&gt;, &lt;code&gt;DEFAULT nextval(…)&lt;/code&gt;, &lt;code&gt;CREATE TYPE … AS ENUM&lt;/code&gt;, &lt;code&gt;CREATE DOMAIN&lt;/code&gt;, partitioned tables. Data-side: &lt;code&gt;INSERT&lt;/code&gt; (single + multi-&lt;code&gt;VALUES&lt;/code&gt;) and &lt;code&gt;COPY … FROM stdin&lt;/code&gt; (text and CSV).&lt;/p&gt;
&lt;p&gt;Run with the &lt;code&gt;:pg-dump&lt;/code&gt; compat preset to silently accept constructs &lt;code&gt;pg-datahike&lt;/code&gt; doesn’t model — triggers, functions, materialized views, &lt;code&gt;ALTER OWNER&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;java&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -jar&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; pg-datahike.jar&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --compat&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; pg-dump&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;psql&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -h&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; localhost&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -p&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 5432&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -U&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -d&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -f&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; my_pg_dump.sql&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Validated end-to-end against &lt;a href=&quot;https://github.com/lerocha/chinook-database&quot;&gt;Chinook&lt;/a&gt; (15.6k rows, 11 tables, FKs, NUMERIC, TIMESTAMP) — full byte-identical bidirectional roundtrip — and &lt;a href=&quot;https://github.com/devrimgunduz/pagila&quot;&gt;Pagila&lt;/a&gt; (50k rows, 22 tables, ENUM, DOMAIN, partitioning, triggers, functions) — schema parses end-to-end, data loads.&lt;/p&gt;
&lt;h3 id=&quot;pg-datahike--portable-pg-sql&quot;&gt;pg-datahike → portable PG SQL&lt;/h3&gt;
&lt;p&gt;The standalone jar’s &lt;code&gt;dump&lt;/code&gt; subcommand walks a Datahike database and emits &lt;code&gt;pg_dump&lt;/code&gt;-shaped SQL. The output replays into either pg-datahike or real PostgreSQL via &lt;code&gt;psql&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;java&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -jar&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; pg-datahike.jar&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; dump&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --data-dir&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; DIR&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --db&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; NAME&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --out&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; out.sql&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;java&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -jar&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; pg-datahike.jar&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; dump&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --config&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; datahike-config.edn&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --copy&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Flags cover INSERT-vs-COPY output, schema-only / data-only, and table exclusion. &lt;code&gt;--config&lt;/code&gt; accepts a full Datahike config EDN, so any konserve backend works; store-id is auto-discovered.&lt;/p&gt;
&lt;h3 id=&quot;what-the-resulting-datahike-schema-looks-like&quot;&gt;What the resulting Datahike schema looks like&lt;/h3&gt;
&lt;p&gt;A native Datahike database — created with &lt;code&gt;d/transact&lt;/code&gt;, never touched by SQL — also dumps as clean PG SQL. The inverse mapping is well-defined:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;:db.unique/identity&lt;/code&gt; → &lt;code&gt;PRIMARY KEY NOT NULL&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;:db.unique/value&lt;/code&gt; → &lt;code&gt;UNIQUE&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;:db.cardinality/many T&lt;/code&gt; → &lt;code&gt;T[]&lt;/code&gt; with PG array literals&lt;/li&gt;
&lt;li&gt;&lt;code&gt;:db.type/ref&lt;/code&gt; → &lt;code&gt;bigint&lt;/code&gt; (the entity id; opt in to FK constraints with &lt;code&gt;set-hint! :references&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
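&lt;p&gt;The bullet mapping above can be sketched as a tiny translation function — a hedged, deliberately simplified illustration (&lt;code&gt;column_ddl&lt;/code&gt; is a hypothetical helper; the jar&amp;#x27;s actual &lt;code&gt;dump&lt;/code&gt; code handles many more cases and real type mapping):&lt;/p&gt;

```python
# Sketch of the Datahike-attribute -> SQL column mapping listed above.
# column_ddl is illustrative; types other than the default are out of scope here.

def column_ddl(name, schema, sql_type="text"):
    """Render one attribute's schema map as a SQL column clause."""
    if schema.get(":db/valueType") == ":db.type/ref":
        sql_type = "bigint"  # refs dump as the entity id
    if schema.get(":db/cardinality") == ":db.cardinality/many":
        sql_type += "[]"     # cardinality-many becomes a PG array
    clause = f"{name} {sql_type}"
    unique = schema.get(":db/unique")
    if unique == ":db.unique/identity":
        clause += " PRIMARY KEY NOT NULL"
    elif unique == ":db.unique/value":
        clause += " UNIQUE"
    return clause

print(column_ddl("email", {":db/unique": ":db.unique/identity"}))
print(column_ddl("company", {":db/valueType": ":db.type/ref"}))
```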
&lt;p&gt;So whether you start from a real PostgreSQL dump or from native Datahike, both sides translate cleanly through the same shape. The resulting schema is correct and queryable as both SQL relations and Datalog datoms. It isn’t always what you’d hand-design for entity-shaped Datalog queries — many apps stay with the relational shape, others evolve incrementally as they reach for Datalog’s strengths (pull patterns, rules, multi-source joins).&lt;/p&gt;
&lt;h2 id=&quot;what-it-isnt&quot;&gt;What it isn’t&lt;/h2&gt;
&lt;p&gt;This is a 0.1 beta and we want to be specific about the gaps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;PL/pgSQL, stored functions, triggers, rules, and materialized views are accepted under the &lt;code&gt;:pg-dump&lt;/code&gt; compat preset (loaded but not executed); strict mode rejects them&lt;/li&gt;
&lt;li&gt;No &lt;code&gt;LISTEN&lt;/code&gt; / &lt;code&gt;NOTIFY&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;No &lt;code&gt;COPY … TO STDOUT&lt;/code&gt; (&lt;code&gt;COPY … FROM stdin&lt;/code&gt; is supported in text and CSV formats)&lt;/li&gt;
&lt;li&gt;FK &lt;code&gt;ON DELETE&lt;/code&gt; enforced for &lt;code&gt;NO ACTION&lt;/code&gt; / &lt;code&gt;RESTRICT&lt;/code&gt; / &lt;code&gt;CASCADE&lt;/code&gt;; &lt;code&gt;SET NULL&lt;/code&gt; / &lt;code&gt;SET DEFAULT&lt;/code&gt; and any &lt;code&gt;ON UPDATE&lt;/code&gt; action are rejected at DDL&lt;/li&gt;
&lt;li&gt;Single &lt;code&gt;public&lt;/code&gt; schema — &lt;code&gt;CREATE SCHEMA&lt;/code&gt; is silently accepted but a no-op&lt;/li&gt;
&lt;li&gt;Cursor materialization is eager (entire result set held in memory)&lt;/li&gt;
&lt;li&gt;No deferrable constraints&lt;/li&gt;
&lt;li&gt;Generated columns parse but aren’t enforced&lt;/li&gt;
&lt;li&gt;Writes always land on the connection’s default branch in 0.1, even when &lt;code&gt;SET datahike.branch&lt;/code&gt; is active. Reads respect the pinned branch; writes don’t yet. Use &lt;code&gt;datahike.versioning/branch!&lt;/code&gt; and &lt;code&gt;merge!&lt;/code&gt; from Clojure for branch-targeted writes, or open a second connection on &lt;code&gt;/&amp;#x3C;db&gt;:&amp;#x3C;branch&gt;&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Constraint enforcement is one-directional. SQL constraints declared via DDL (&lt;code&gt;NOT NULL&lt;/code&gt;, &lt;code&gt;CHECK&lt;/code&gt;, &lt;code&gt;UNIQUE&lt;/code&gt;, FK &lt;code&gt;RESTRICT&lt;/code&gt;) are enforced by the pgwire handler; direct &lt;code&gt;(d/transact)&lt;/code&gt; writes from Clojure bypass them because Datahike’s schema doesn’t yet carry the constraint vocabulary. A future release will lift enforcement into the tx layer so both paths are gated.&lt;/li&gt;
&lt;li&gt;Bulk-insert throughput is ~5,000 rows/sec on JDBC batch (Pagila replays in ~12s, Chinook in ~3s) — Datahike maintains EAVT/AEVT/AVET live, so a 10-column row costs ~10× a single index write. Tuned bulk paths in vanilla PG (&lt;code&gt;COPY&lt;/code&gt;, &lt;code&gt;pg_restore -j&lt;/code&gt;) are an order of magnitude faster, partly via deferred index construction; an analogous bulk-load fast path is a future item. Large migrations are overnight-cutover territory today.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The conformance posture is: pass for the workloads we’ve measured against, fail fast and loud everywhere else. We’d rather reject a stored procedure than execute it incorrectly.&lt;/p&gt;
&lt;h2 id=&quot;where-this-fits&quot;&gt;Where this fits&lt;/h2&gt;
&lt;p&gt;If you’ve used Neon or Xata, the goal will look familiar — branchable Postgres. The mechanism is different. Their branches are control-plane operations: call the API, get a new compute instance over copy-on-write storage. pg-datahike’s branches are session-level — &lt;code&gt;SET datahike.branch = &apos;feature&apos;&lt;/code&gt; inside an open psql connection switches what you’re reading. No provisioning, no compute. An agent or a query planner can switch branches mid-session.&lt;/p&gt;
&lt;p&gt;Commit pinning — &lt;code&gt;SET datahike.commit_id = &apos;&amp;#x3C;uuid&gt;&apos;&lt;/code&gt; — is the part where we don’t know of a peer. Neon’s time-travel is bounded by a 6h–1d restore window; pg-datahike pins to any historical commit, indefinitely. We have not seen another PG-compatible database expose this directly through the wire protocol.&lt;/p&gt;
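&lt;p&gt;In session terms it looks like this (a sketch: the uuid placeholder stands for any commit id from your history, and &lt;code&gt;RESET&lt;/code&gt; assumes standard session-variable semantics):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- pin every subsequent read in this session to one historical commit
SET datahike.commit_id = &apos;&amp;#x3C;uuid&gt;&apos;;
SELECT * FROM person;       -- evaluated against that commit&apos;s snapshot
RESET datahike.commit_id;   -- back to the branch head&lt;/code&gt;&lt;/pre&gt;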
&lt;p&gt;Dolt is the closest in spirit — git-like semantics, commit pinning, time-travel — but Dolt is MySQL with a custom storage engine. pg-datahike rides on the standard Postgres wire protocol; every PG client works without modification.&lt;/p&gt;
&lt;p&gt;The honest tradeoff: we are a compatibility layer over Datahike’s storage, not a fork of Postgres. Some features tied to the Postgres codebase — PL/pgSQL, the extension ecosystem, procedural languages — aren’t on our roadmap today. If you need those, use Postgres. If your bottleneck is versioning, branching, or reproducibility, this gets you there without leaving the wire protocol your tools already speak.&lt;/p&gt;
&lt;p&gt;Datahike has always been a Datalog database with a Clojure API and a growing set of language bindings; pg-datahike isn’t a separate database, just another front end on the same store. There’s a sibling: &lt;a href=&quot;/notes/stratum-analytics-engine&quot;&gt;Stratum&lt;/a&gt;, a SIMD-accelerated columnar engine that speaks the same wire protocol over an analytical column store with the same fork-as-pointer semantics. Both fit into a shared branching model — see &lt;a href=&quot;/notes/yggdrasil-unified-cow-protocols&quot;&gt;Yggdrasil: Branching Protocols&lt;/a&gt; for how a Datahike database, a Stratum dataset, and a vector index can fork together at a single snapshot.&lt;/p&gt;
&lt;p&gt;The rest of this post is for callers who do speak Clojure — the same data accessible as relations and as datoms, in-process queries that skip the wire, embedded mode without TCP, and configuration knobs that aren’t exposed over SQL.&lt;/p&gt;
&lt;h2 id=&quot;bidirectional-view&quot;&gt;Bidirectional view&lt;/h2&gt;
&lt;p&gt;The pgwire layer is a view onto Datahike’s datom store, not a separate representation. Tables you create over SQL show up as normal Datahike schemas, queryable from Clojure with &lt;code&gt;(d/q …)&lt;/code&gt;. Existing Datahike schemas show up as SQL tables with no setup.&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;;; Plain Datahike schema, transacted from Clojure
d/transact(conn
  [{:db/ident :person/id :db/valueType :db.type/long
    :db/cardinality :db.cardinality/one :db/unique :db.unique/identity}
   {:db/ident :person/name :db/valueType :db.type/string
    :db/cardinality :db.cardinality/one}])

d/transact(conn [{:person/id 1, :person/name &quot;Alice&quot;}])&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;;; Plain Datahike schema, transacted from Clojure
(d/transact conn
  [{:db/ident :person/id   :db/valueType :db.type/long
    :db/cardinality :db.cardinality/one :db/unique :db.unique/identity}
   {:db/ident :person/name :db/valueType :db.type/string
    :db/cardinality :db.cardinality/one}])

(d/transact conn [{:person/id 1 :person/name &quot;Alice&quot;}])&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Same database, over psql:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; person;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;--   id |  name&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;--  ----+-------&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;--    1 | Alice&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The reverse holds too — &lt;code&gt;CREATE TABLE&lt;/code&gt; over pgwire transacts a normal Datahike schema, and the next &lt;code&gt;(d/q …)&lt;/code&gt; from Clojure sees the rows you just inserted. There is no shadow representation, no separate metadata. One datom store, two query languages.&lt;/p&gt;
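&lt;p&gt;Sketched in the reverse direction, with a hypothetical &lt;code&gt;city&lt;/code&gt; table and the same table-to-namespace mapping as the &lt;code&gt;person&lt;/code&gt; example above:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- over psql: plain DDL and DML
CREATE TABLE city (id INT PRIMARY KEY, name TEXT);
INSERT INTO city VALUES (1, &apos;Berlin&apos;);&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;;; same database, from Clojure: the table is schema, the rows are datoms
(d/q &apos;[:find ?name
       :where [?e :city/name ?name]]
     @conn)
;; returns a set of name tuples, e.g. #{[&quot;Berlin&quot;]}&lt;/code&gt;&lt;/pre&gt;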
&lt;h2 id=&quot;using-the-library-directly&quot;&gt;Using the library directly&lt;/h2&gt;
&lt;p&gt;Two ways to skip the standalone jar — start a server from your own JVM application, or bypass the wire layer entirely.&lt;/p&gt;
&lt;h3 id=&quot;start-a-server-in-process&quot;&gt;Start a server in-process&lt;/h3&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;;; deps.edn
{:deps {org.replikativ/datahike {:mvn/version &quot;LATEST&quot;}
        org.replikativ/pg-datahike {:mvn/version &quot;LATEST&quot;}}}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;;; deps.edn
{:deps {org.replikativ/datahike    {:mvn/version &quot;LATEST&quot;}
        org.replikativ/pg-datahike {:mvn/version &quot;LATEST&quot;}}}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;require(&apos;[datahike.api :as d] &apos;[datahike.pg :as pg])

let [boot {:store {:backend :memory, :id random-uuid()}, :schema-flexibility :write}]:
  d/create-database(boot)
  pg/start-server({&quot;datahike&quot; d/connect(boot)} {:port 5432, :database-template {:store {:backend :memory}, :schema-flexibility :write, :keep-history? true}})
end
;; =&gt; :running on :5432&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(require &apos;[datahike.api :as d]
         &apos;[datahike.pg  :as pg])

(let [boot {:store {:backend :memory :id (random-uuid)}
            :schema-flexibility :write}]
  (d/create-database boot)
  (pg/start-server {&quot;datahike&quot; (d/connect boot)}
                   {:port 5432
                    :database-template {:store {:backend :memory}
                                        :schema-flexibility :write
                                        :keep-history? true}}))
;; =&gt; :running on :5432&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Same pgwire surface, in-process. The integration patterns earlier in this post are the embedded-library API; the standalone jar wraps the same calls behind CLI flags.&lt;/p&gt;
&lt;h3 id=&quot;bypass-the-wire-entirely&quot;&gt;Bypass the wire entirely&lt;/h3&gt;
&lt;p&gt;Tests and in-process applications don’t need the wire layer at all:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;def h: pg/make-query-handler(conn)
h.execute(&quot;CREATE TABLE person (id INT PRIMARY KEY, name TEXT)&quot;)
h.execute(&quot;INSERT INTO person VALUES (1, &apos;Alice&apos;)&quot;)
h.execute(&quot;SELECT * FROM person&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(def h (pg/make-query-handler conn))
(.execute h &quot;CREATE TABLE person (id INT PRIMARY KEY, name TEXT)&quot;)
(.execute h &quot;INSERT INTO person VALUES (1, &apos;Alice&apos;)&quot;)
(.execute h &quot;SELECT * FROM person&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Same SQL surface, no socket. Useful for property-based testing of SQL workloads, or for embedding the SQL interface inside a Clojure or ClojureScript application without exposing a port.&lt;/p&gt;
&lt;h2 id=&quot;permissive-vs-strict-compat&quot;&gt;Permissive vs. strict compat&lt;/h2&gt;
&lt;p&gt;By default the handler rejects unsupported DDL — &lt;code&gt;GRANT&lt;/code&gt;, &lt;code&gt;REVOKE&lt;/code&gt;, &lt;code&gt;CREATE POLICY&lt;/code&gt;, &lt;code&gt;ROW LEVEL SECURITY&lt;/code&gt;, &lt;code&gt;CREATE EXTENSION&lt;/code&gt;, &lt;code&gt;COPY&lt;/code&gt; — with SQLSTATE &lt;code&gt;0A000 feature_not_supported&lt;/code&gt;. Most ORMs emit some of these unconditionally. Two ways to relax:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;;; silently accept every auth/RLS/extension no-op (Hibernate, Odoo)
pg/make-query-handler(conn {:compat :permissive})

;; accept specific kinds only
pg/make-query-handler(conn {:silently-accept #{:grant :policy}})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;;; silently accept every auth/RLS/extension no-op (Hibernate, Odoo)
(pg/make-query-handler conn {:compat :permissive})

;; accept specific kinds only
(pg/make-query-handler conn {:silently-accept #{:grant :policy}})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The named presets in &lt;code&gt;datahike.pg.server/compat-presets&lt;/code&gt; cover the common ORM patterns.&lt;/p&gt;
&lt;h2 id=&quot;sql-or-datalog&quot;&gt;SQL or Datalog?&lt;/h2&gt;
&lt;p&gt;Both interfaces see the same datoms, the same indexes, the same history. The choice is about how the query reaches the engine.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reach for SQL&lt;/strong&gt; when callers don’t share a runtime with the database — services over the wire, analysts in Metabase, tools that only speak the wire protocol — or when you want existing tooling: ORMs, migration runners, BI dashboards.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reach for Datalog&lt;/strong&gt; when the query runs in the same process as the database. Datahike’s Datalog API is a Clojure function: pass values in, get values out, no parsing, no serialization, no socket. Even pg-datahike’s embedded mode (the &lt;code&gt;make-query-handler&lt;/code&gt; path shown above) still goes through the SQL parser and the translator; Datalog skips both. You can invoke arbitrary Clojure functions inside predicates, return live data structures without copying, and &lt;a href=&quot;/notes/collaborate-without-infrastructure&quot;&gt;join across multiple databases&lt;/a&gt; on different storage backends in a single query.&lt;/p&gt;
&lt;p&gt;The two paths compose. DDL via Flyway over SQL, then reads in Datalog from your Clojure backend. Or: Datahike schema in Clojure, ORM-driven CRUD over SQL. Both stay coherent because they’re views of the same datom store.&lt;/p&gt;
&lt;h2 id=&quot;compatibility-evidence&quot;&gt;Compatibility evidence&lt;/h2&gt;
&lt;p&gt;We test pg-datahike against the same suites the Postgres ecosystem uses on itself. If a suite passes here, the apps that depend on it generally work here.&lt;/p&gt;
&lt;table style=&quot;width: 100%; border-collapse: collapse; margin: 1.5rem 0;&quot;&gt;
  &lt;thead&gt;
    &lt;tr style=&quot;text-align: left; border-bottom: 1px solid var(--border);&quot;&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Layer&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Test suite&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Result&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 0;&quot;&gt;What this proves&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;JDBC driver&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;pgjdbc 42.7.5 — &lt;code&gt;ResultSetTest&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; white-space: nowrap;&quot;&gt;80 / 80&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Cursors, type decoding, and metadata behave the way every JVM Postgres client expects.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Java ORM&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Hibernate 6 — &lt;code&gt;DatahikeHibernateTest&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; white-space: nowrap;&quot;&gt;13 / 13&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;JPA stacks — Spring, Quarkus, Jakarta — talk to pg-datahike the same way they talk to Postgres.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Python ORM&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;SQLAlchemy 2.0 dialect&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;16 / 16 across 7 phases&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;The Python data ecosystem — Django, Flask, FastAPI, Airflow, dbt — connects via the standard dialect path.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;SQL semantics&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;sqllogictest&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;779 assertions, 61 files&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Cases derived from PostgreSQL&apos;s regression suite, expressed in the sqllogictest format SQLite, CockroachDB, and DuckDB use for their own correctness work.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Real application&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Odoo 19 — &lt;code&gt;--init=base --test-tags=:TestORM&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;11 / 11 cases, ~38k queries, zero translator errors&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;A 200-table ERP with one of the most demanding open-source ORM layers boots and passes its own test suite.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;BI tool&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Metabase native SQL&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;20-probe MBQL sweep&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Schema introspection, prepared statements, and result handling work for the paths real BI tools depend on.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Migration roundtrip&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Chinook + Pagila &lt;code&gt;pg_dump&lt;/code&gt; fixtures&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Chinook: byte-equal roundtrip. Pagila: schema parses, data loads.&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;A real Postgres database can be exported, replayed in pg-datahike, and dumped back — schema and data preserved through the round-trip.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Internal&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Unit suite&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;544 tests, 1603 assertions&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Standard regression coverage.&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Per-commit suites run on CircleCI. Odoo, Metabase, and &lt;code&gt;psql&lt;/code&gt; / &lt;code&gt;libpq&lt;/code&gt; (&lt;code&gt;\d&lt;/code&gt;, &lt;code&gt;\dt&lt;/code&gt;, &lt;code&gt;\df&lt;/code&gt; family) are run on a manual harness before each release. A dedicated compatibility page with linked test artifacts and a published gaps registry is in flight.&lt;/p&gt;
&lt;h2 id=&quot;try-it&quot;&gt;Try it&lt;/h2&gt;
&lt;p&gt;Download the jar from &lt;a href=&quot;https://github.com/replikativ/pg-datahike/releases&quot;&gt;GitHub releases&lt;/a&gt;, &lt;code&gt;java -jar pg-datahike-VERSION-standalone.jar&lt;/code&gt;, point &lt;code&gt;psql&lt;/code&gt; at it. To embed in a JVM app, the coordinate is &lt;code&gt;org.replikativ/pg-datahike&lt;/code&gt; on Clojars. Repo, docs, and issues at &lt;a href=&quot;https://github.com/replikativ/pg-datahike&quot;&gt;github.com/replikativ/pg-datahike&lt;/a&gt;; feedback to &lt;a href=&quot;mailto:contact@datahike.io&quot;&gt;contact@datahike.io&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A follow-up post will cover the structural-sharing model that makes branching O(1), what &lt;code&gt;merge!&lt;/code&gt; does, and the same workflow across every Datahike binding (Clojure, Java, JavaScript, Python, the C library, the CLI, and SQL). Subscribe to the &lt;a href=&quot;/rss.xml&quot;&gt;RSS feed&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;</content:encoded></item><item><title>Stratum: SQL that branches</title><link>https://datahike.io/notes/stratum-analytics-engine/</link><guid isPermaLink="true">https://datahike.io/notes/stratum-analytics-engine/</guid><description>How we built a SIMD-accelerated columnar SQL engine on the JVM with copy-on-write branching - faster than DuckDB on 35 of 46 queries via the Java Vector API.</description><pubDate>Thu, 26 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;div class=&quot;container page prose&quot;&gt;
&lt;h1 id=&quot;stratum-sql-that-branches&quot;&gt;Stratum: SQL that branches&lt;/h1&gt;
&lt;p&gt;A few years ago I hit a wall I suspect many data engineers know. I had a million-row analytical dataset and I wanted to run an experiment: modify a few pricing assumptions, re-run a set of aggregation queries, compare the results against the original. Simple enough - except in a mutable database, “compare against the original” means either keeping a copy of the data or hoping nothing changed. Neither scales.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/replikativ/datahike&quot;&gt;Datahike&lt;/a&gt; solves this for entity-level data. Its storage is EAVT-indexed - like &lt;a href=&quot;https://datomic.com&quot;&gt;Datomic&lt;/a&gt;, tuned for entity traversal and point lookups. That’s the right structure for a system-of-record, but not for scanning 10M rows to compute a GROUP BY with SIMD. Stratum explores the columnar alternative: the same CoW branching semantics, but over column-oriented storage optimized for analytical scans. SQL is the natural interface for this access pattern - something Datahike doesn’t yet have. The longer-term plan is integration: Stratum’s columnar engine and SQL support as a query path within Datahike’s Datalog planner.&lt;/p&gt;
&lt;p&gt;The core insight is that &lt;strong&gt;a columnar dataset is just a value&lt;/strong&gt;. Make it immutable with structural sharing and you get git-like semantics for free: fork a dataset in O(1), modify branches independently, time-travel to any snapshot, persist named commits to storage. Then add SIMD execution via the Java Vector API, and it turns out you can beat DuckDB on most single-threaded analytical queries from pure JVM code - no native compilation, no JNI.&lt;/p&gt;
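&lt;p&gt;The same property is visible in plain Clojure, where a &quot;copy&quot; of a persistent structure is a new root pointer over shared data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;;; plain-Clojure analogy for fork-as-pointer
(def base   (vec (range 1000000)))    ; one million elements
(def branch (assoc base 0 :changed))  ; copies only the path to index 0

(nth base 0)    ;; =&gt; 0         - the original is untouched
(nth branch 0)  ;; =&gt; :changed  - the fork diverges at one element&lt;/code&gt;&lt;/pre&gt;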
&lt;h2 id=&quot;the-sql-interface&quot;&gt;The SQL interface&lt;/h2&gt;
&lt;p&gt;Stratum speaks the PostgreSQL wire protocol. The quickest entry point is the standalone server:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;java&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --add-modules&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; jdk.incubator.vector&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;     --enable-native-access=ALL-UNNAMED&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;     -jar&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; stratum-standalone.jar&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;     --index&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; orders:/data/orders.csv&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Any PostgreSQL client connects immediately - psql, DBeaver, JDBC, psycopg2:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;psql&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -h&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; localhost&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -p&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 5432&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -U&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; stratum&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Standard analytical SQL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; region,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;       SUM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(amount &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; discount) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; revenue,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;       COUNT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)               &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ship_date &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;BETWEEN&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;2024-01-01&apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AND&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;2024-12-31&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GROUP BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; region&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; revenue &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Query CSV and Parquet files inline - auto-indexed on first access&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; payment_type,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;       AVG&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(tip_amount),&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;       PERCENTILE_CONT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;95&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WITHIN&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; GROUP&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tip_amount)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; read_csv(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;/data/taxi.csv&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GROUP BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; payment_type;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Full query and DML support: SELECT, INSERT, UPDATE, DELETE, UPSERT (INSERT ON CONFLICT). CTEs, correlated subqueries, window functions (ROW_NUMBER, RANK, LAG, LEAD, running aggregates), joins (INNER/LEFT/RIGHT/FULL with multi-column keys), set operations (UNION/INTERSECT/EXCEPT). Aggregates: SUM, COUNT, AVG, MIN, MAX, STDDEV, VARIANCE, CORR, MEDIAN, PERCENTILE_CONT, APPROX_QUANTILE, COUNT(DISTINCT). CASE WHEN, COALESCE, date functions, LIKE/ILIKE, FILTER clause. &lt;a href=&quot;https://github.com/replikativ/stratum/blob/main/doc/sql-interface.md&quot;&gt;Full SQL reference →&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&quot;how-the-engine-works&quot;&gt;How the engine works&lt;/h2&gt;
&lt;p&gt;Every column is split into fixed-size chunks. Each chunk carries pre-computed statistics: minimum, maximum, sum, count. This unlocks two significant optimizations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Zone-map pruning.&lt;/strong&gt; DuckDB also stores min and max per segment and uses them for predicate pushdown - skipping segments that can’t contain rows matching a WHERE clause. Both engines do this. What DuckDB doesn’t pre-compute is per-segment SUM or COUNT, so unfiltered aggregates like &lt;code&gt;COUNT(*)&lt;/code&gt;, &lt;code&gt;SUM(price)&lt;/code&gt;, or &lt;code&gt;AVG(price)&lt;/code&gt; require a full data scan in DuckDB. In Stratum, these are answered by traversing the pre-computed metadata at tree nodes - no row data touched. &lt;code&gt;SELECT AVG(price) FROM orders&lt;/code&gt; on 10M rows: Stratum 0.1ms, DuckDB 7.1ms.&lt;/p&gt;
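&lt;p&gt;A toy sketch of the metadata-only path - not Stratum’s actual tree layout, just the shape of the idea:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;;; each chunk carries min/max/sum/count; an unfiltered AVG never touches rows
(def chunk-stats
  [{:min 0.5 :max 99.9 :sum 52340.0 :count 10000}
   {:min 1.2 :max 87.3 :sum 48990.0 :count 10000}])

(defn avg-from-stats [stats]
  (/ (reduce + (map :sum stats))      ; total sum from metadata
     (reduce + (map :count stats))))  ; total count from metadata&lt;/code&gt;&lt;/pre&gt;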
&lt;p&gt;&lt;strong&gt;Fused SIMD execution.&lt;/strong&gt; Most columnar engines evaluate predicates in one pass, then apply the result mask during a separate aggregation pass. Stratum fuses these into a single loop: predicates and accumulation run simultaneously via Java Vector API &lt;code&gt;VectorMask&lt;/code&gt; chains, processing four doubles or longs per SIMD cycle. No intermediate arrays, no second pass, no extra allocation.&lt;/p&gt;
&lt;p&gt;The Vector API (JDK 21+) provides &lt;code&gt;DoubleVector&lt;/code&gt; and &lt;code&gt;LongVector&lt;/code&gt; operations backed by AVX-512 on x86 and SVE on ARM. The bet was that the JVM incubator API had matured enough to compete with native code on analytical workloads without the deployment complexity of a native library. The benchmarks suggest that bet paid off.&lt;/p&gt;
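&lt;p&gt;A minimal sketch of the fused pattern - illustrative use of the Vector API, not Stratum’s actual kernel. The predicate produces a &lt;code&gt;VectorMask&lt;/code&gt; that gates accumulation in the same loop iteration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;import jdk.incubator.vector.*;

// fused filter + sum-product in one pass: no selection vector, no second scan
static double fusedSum(double[] price, double[] qty, double lo, double hi) {
  VectorSpecies&amp;#x3C;Double&gt; S = DoubleVector.SPECIES_256;  // 4 doubles per vector
  DoubleVector acc = DoubleVector.zero(S);
  int i = 0;
  for (; i &amp;#x3C; S.loopBound(price.length); i += S.length()) {
    DoubleVector p = DoubleVector.fromArray(S, price, i);
    DoubleVector q = DoubleVector.fromArray(S, qty, i);
    VectorMask&amp;#x3C;Double&gt; m = p.compare(VectorOperators.GE, lo)
                             .and(p.compare(VectorOperators.LT, hi));
    acc = acc.add(p.mul(q), m);                 // accumulate only masked lanes
  }
  double sum = acc.reduceLanes(VectorOperators.ADD);
  for (; i &amp;#x3C; price.length; i++)              // scalar tail
    if (price[i] &gt;= lo &amp;&amp; price[i] &amp;#x3C; hi) sum += price[i] * qty[i];
  return sum;
}&lt;/code&gt;&lt;/pre&gt;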
&lt;h2 id=&quot;performance&quot;&gt;Performance&lt;/h2&gt;
&lt;p&gt;Single-threaded comparison vs DuckDB v1.4.4 (JDBC in-process) on 10M rows, Intel Core Ultra 7 258V, JVM 25. Median of 10 iterations, 5 warmup:&lt;/p&gt;
&lt;table style=&quot;width: 100%; border-collapse: collapse; margin: 1rem 0;&quot;&gt;
  &lt;thead&gt;
    &lt;tr style=&quot;text-align: left; border-bottom: 1px solid var(--border);&quot;&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Query&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Stratum&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;DuckDB&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 0;&quot;&gt;Ratio&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;TPC-H Q6 (filter + sum-product)&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;13ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;28ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0; font-weight: 600;&quot;&gt;2.2x faster&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Filtered COUNT (NEQ pred)&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;3ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;12ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0; font-weight: 600;&quot;&gt;4.0x faster&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;TPC-H Q1 (7 aggs, 4 groups)&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;75ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;93ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0; font-weight: 600;&quot;&gt;1.2x faster&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;H2O Q3 (100K string groups)&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;71ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;362ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0; font-weight: 600;&quot;&gt;5.1x faster&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;H2O Q10 (10M groups, 6 cols)&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;832ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;7056ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0; font-weight: 600;&quot;&gt;8.5x faster&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;LIKE &apos;%search%&apos;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;47ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;240ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0; font-weight: 600;&quot;&gt;5.1x faster&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;AVG(LENGTH(URL))&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;38ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;170ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0; font-weight: 600;&quot;&gt;4.5x faster&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;H2O Q6 (STDDEV group-by)&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;30ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;81ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0; font-weight: 600;&quot;&gt;2.7x faster&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;H2O Q9 (CORR)&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;61ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;134ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0; font-weight: 600;&quot;&gt;2.2x faster&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;MEDIAN(price)&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;68ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;158ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0; font-weight: 600;&quot;&gt;2.3x faster&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;ROW_NUMBER window&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;316ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;426ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0; font-weight: 600;&quot;&gt;1.3x faster&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Stratum wins 35 of 46 queries at 10M rows (single-threaded, median of 10 runs). DuckDB wins on sparse-selectivity filters, window-based top-N, high-cardinality hash group-by at scale (1M+ unique groups where hash tables become DRAM-bound), and global COUNT(DISTINCT). Full methodology and raw results: &lt;a href=&quot;https://github.com/replikativ/stratum/blob/main/doc/benchmarks.md&quot;&gt;benchmark docs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;DuckDB is an excellent system. The point is that pure JVM code can compete with a mature native engine on the workloads that matter most, while adding semantics DuckDB doesn’t have.&lt;/p&gt;
&lt;h2 id=&quot;branching-where-it-diverges&quot;&gt;Branching: where it diverges&lt;/h2&gt;
&lt;p&gt;This is the part that doesn’t exist anywhere else.&lt;/p&gt;
&lt;p&gt;Each column is backed by a chunked B-tree (&lt;code&gt;PersistentColumnIndex&lt;/code&gt;) that implements Clojure’s &lt;code&gt;IPersistentCollection&lt;/code&gt; and &lt;code&gt;IEditableCollection&lt;/code&gt; protocols. When you call &lt;code&gt;(st/fork ds)&lt;/code&gt;, you get a new dataset that shares all unchanged chunks with the original. No data is copied - just a new root pointer into a shared tree. Mutations through the transient protocol copy only the chunks they touch. A billion-row dataset costs essentially nothing to fork.&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;require(&apos;[stratum.api :as st]
  &apos;[konserve.file-store :as fs]
  &apos;[clojure.core.async :refer [&amp;#x3C;!!]])

;; Open storage, load the orders dataset (10M rows)
def store: &amp;#x3C;!!(fs/new-fs-store(&quot;/data/stratum&quot;))
def orders: &amp;#x3C;!!(st/load(store &quot;orders&quot;))

;; Fork in O(1) - structural sharing, zero data copied
def experiment: st/fork(orders)

;; Persist the fork as a named branch
&amp;#x3C;!!(st/sync!(experiment store &quot;experiment&quot;))

;; Query both branches via SQL - pass column data as table map
st/q(&quot;SELECT SUM(price * qty) FROM t&quot; {&quot;t&quot; st/columns(orders)})
;; =&gt; {:SUM(price * qty) 4821903.40}   ← main branch

st/q(&quot;SELECT SUM(price * qty) FROM t&quot; {&quot;t&quot; st/columns(experiment)})
;; =&gt; {:SUM(price * qty) 4401238.66}   ← experiment branch

;; Time-travel: load any historical branch by name
def baseline: &amp;#x3C;!!(st/load(store &quot;orders-baseline&quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(require &apos;[stratum.api :as st]
         &apos;[konserve.file-store :as fs]
         &apos;[clojure.core.async :refer [&amp;#x3C;!!]])

;; Open storage, load the orders dataset (10M rows)
(def store  (&amp;#x3C;!! (fs/new-fs-store &quot;/data/stratum&quot;)))
(def orders (&amp;#x3C;!! (st/load store &quot;orders&quot;)))

;; Fork in O(1) - structural sharing, zero data copied
(def experiment (st/fork orders))

;; Persist the fork as a named branch
(&amp;#x3C;!! (st/sync! experiment store &quot;experiment&quot;))

;; Query both branches via SQL - pass column data as table map
(st/q &quot;SELECT SUM(price * qty) FROM t&quot; {&quot;t&quot; (st/columns orders)})
;; =&gt; {:SUM(price * qty) 4821903.40}   ← main branch

(st/q &quot;SELECT SUM(price * qty) FROM t&quot; {&quot;t&quot; (st/columns experiment)})
;; =&gt; {:SUM(price * qty) 4401238.66}   ← experiment branch

;; Time-travel: load any historical branch by name
(def baseline (&amp;#x3C;!! (st/load store &quot;orders-baseline&quot;)))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;From the server side, &lt;code&gt;register-live-table!&lt;/code&gt; lets you expose named branches as separate SQL tables - query them with plain SQL over the PostgreSQL connection without touching the Clojure API.&lt;/p&gt;
&lt;p&gt;The practical uses this unlocks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reproducible experiments&lt;/strong&gt;: fork a dataset, run your pipeline on the fork, compare results against the original without managing separate data copies or locking the source&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Audit trails&lt;/strong&gt;: every query result is tied to a specific database state - you can always recover the exact snapshot that produced a given answer&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What-if analysis&lt;/strong&gt;: branch before a bulk UPDATE, run your scenario, inspect the diff, discard - the original is untouched&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Zero-ETL&lt;/strong&gt;: &lt;a href=&quot;/datahike&quot;&gt;Datahike&lt;/a&gt; is the system-of-record; Stratum queries the same versioned snapshots directly, no extraction pipeline needed&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;for-clojure-developers&quot;&gt;For Clojure developers&lt;/h2&gt;
&lt;p&gt;If you’re coming from the Clojure ecosystem, Stratum datasets behave like ordinary Clojure values. They implement &lt;code&gt;IPersistentCollection&lt;/code&gt;, &lt;code&gt;ILookup&lt;/code&gt;, &lt;code&gt;IEditableCollection&lt;/code&gt; - tablecloth and tech.ml.dataset work with them directly as column maps. You can query with SQL strings or a Clojure DSL that composes programmatically:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;require(&apos;[stratum.api :as st])

;; DSL - composable, no string manipulation
st/q({:from {:price prices, :qty quantities, :region regions}
      :where [[:&gt; :price 100]]
      :group [:region]
      :agg [[:sum [:* :price :qty]] [:count]]})

;; SQL string - same engine underneath
st/q(&quot;SELECT region, SUM(price * qty), COUNT(*)
      FROM orders WHERE price &gt; 100 GROUP BY region&quot;
     {&quot;orders&quot; {:price prices, :qty quantities, :region regions}})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(require &apos;[stratum.api :as st])

;; DSL - composable, no string manipulation
(st/q {:from   {:price prices :qty quantities :region regions}
       :where  [[:&gt; :price 100]]
       :group  [:region]
       :agg    [[:sum [:* :price :qty]]
                [:count]]})

;; SQL string - same engine underneath
(st/q &quot;SELECT region, SUM(price * qty), COUNT(*)
       FROM orders WHERE price &gt; 100 GROUP BY region&quot;
      {&quot;orders&quot; {:price prices :qty quantities :region regions}})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The DSL is useful when building queries programmatically - no string interpolation, no injection risk, results are plain Clojure maps.&lt;/p&gt;
&lt;h2 id=&quot;the-origin&quot;&gt;The origin&lt;/h2&gt;
&lt;p&gt;This work started with &lt;a href=&quot;http://reluk.ca/project/Votorola/home-2013.html&quot;&gt;Votorola&lt;/a&gt;, a collaborative democracy project that needed distributed state. The limitations of imperative systems led to Clojure, then to &lt;a href=&quot;https://github.com/replikativ/replikativ&quot;&gt;replikativ&lt;/a&gt; for distributed replication, then to Datahike for immutable entity-level storage. Each step sharpened the same conviction: mutability is the core problem. When data changes in place, you lose history, auditability, and the ability to reason about what a system knew at any point in time.&lt;/p&gt;
&lt;p&gt;My PhD work on &lt;a href=&quot;https://scholar.google.com/citations?user=6foQfZwAAAAJ&quot;&gt;simulator-based inference&lt;/a&gt; at UBC’s PLAI lab reinforced this. Probabilistic systems need to fork hypotheses, accumulate evidence, and explain their reasoning - tracking not just the current state but the path that led to it. Stratum is the analytics piece of the infrastructure we’re building for that.&lt;/p&gt;
&lt;h2 id=&quot;the-ecosystem&quot;&gt;The ecosystem&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/replikativ/datahike&quot;&gt;Datahike&lt;/a&gt;&lt;/strong&gt; - immutable Datalog database: system-of-record for structured entity data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;/stratum&quot;&gt;Stratum&lt;/a&gt;&lt;/strong&gt; - SIMD-accelerated columnar SQL: analytics and scans over those same snapshots&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;/proximum&quot;&gt;Proximum&lt;/a&gt;&lt;/strong&gt; - version-controlled vector search (HNSW)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/replikativ/scriptum&quot;&gt;Scriptum&lt;/a&gt;&lt;/strong&gt; - git-like branching for full-text search (Lucene)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/replikativ/yggdrasil&quot;&gt;Yggdrasil&lt;/a&gt;&lt;/strong&gt; - unified branching across all of the above&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Via Yggdrasil you can fork a Datahike database, a Stratum dataset, and a Proximum index together - consistent snapshots across SQL, Datalog, and vector search at the same point in time.&lt;/p&gt;
&lt;h2 id=&quot;getting-started&quot;&gt;Getting started&lt;/h2&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;;; deps.edn - check https://clojars.org/org.replikativ/stratum for latest
{:deps {org.replikativ/stratum {:mvn/version &quot;0.1.7&quot;}}}

:jvm-opts [&quot;--add-modules=jdk.incubator.vector&quot;
           &quot;--enable-native-access=ALL-UNNAMED&quot;]&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;;; deps.edn - check https://clojars.org/org.replikativ/stratum for latest
{:deps {org.replikativ/stratum {:mvn/version &quot;0.1.7&quot;}}}

:jvm-opts [&quot;--add-modules=jdk.incubator.vector&quot;
           &quot;--enable-native-access=ALL-UNNAMED&quot;]&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Standalone server with built-in demo tables (lineitem, taxi - 100K rows each)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;java&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --add-modules&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; jdk.incubator.vector&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;     --enable-native-access=ALL-UNNAMED&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;     -jar&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; stratum-standalone.jar&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --demo&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Requirements:&lt;/strong&gt; JDK 21+&lt;/p&gt;
&lt;p&gt;Source: &lt;a href=&quot;https://github.com/replikativ/stratum&quot;&gt;github.com/replikativ/stratum&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Documentation: &lt;a href=&quot;https://github.com/replikativ/stratum/blob/main/doc/query-engine.md&quot;&gt;Query DSL&lt;/a&gt; · &lt;a href=&quot;https://github.com/replikativ/stratum/blob/main/doc/sql-interface.md&quot;&gt;SQL Interface&lt;/a&gt; · &lt;a href=&quot;https://github.com/replikativ/stratum/blob/main/doc/dataset.md&quot;&gt;Dataset API&lt;/a&gt; · &lt;a href=&quot;https://github.com/replikativ/stratum/blob/main/doc/anomaly-detection.md&quot;&gt;Anomaly Detection&lt;/a&gt; · &lt;a href=&quot;https://github.com/replikativ/stratum/blob/main/doc/benchmarks.md&quot;&gt;Benchmarks&lt;/a&gt; · &lt;a href=&quot;https://github.com/replikativ/stratum/blob/main/doc/architecture.md&quot;&gt;Architecture&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Feedback welcome on the &lt;a href=&quot;https://clojurians.slack.com/archives/CB7GJAN0L&quot;&gt;Clojurians #datahike channel&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;</content:encoded></item><item><title>The Git Model for Databases</title><link>https://datahike.io/notes/the-git-model-for-databases/</link><guid isPermaLink="true">https://datahike.io/notes/the-git-model-for-databases/</guid><description>Copy-on-write, structural sharing, and branching - applied to your data.</description><pubDate>Tue, 06 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;div class=&quot;container page prose&quot;&gt;
&lt;h1 id=&quot;the-git-model-for-databases&quot;&gt;The Git Model for Databases&lt;/h1&gt;
&lt;p&gt;Every commit is a snapshot. Branches are cheap. Merging is a first-class operation. Developers internalize this model for code - but it applies equally well to data.&lt;/p&gt;
&lt;h2 id=&quot;databases-as-values&quot;&gt;Databases as values&lt;/h2&gt;
&lt;p&gt;In a traditional database, you interact through a connection. The data may change between queries; the database is a service, not a thing you hold.&lt;/p&gt;
&lt;p&gt;Datahike inverts this. Dereference a connection (&lt;code&gt;@conn&lt;/code&gt;) and you get a &lt;strong&gt;database value&lt;/strong&gt;: a snapshot frozen at a particular transaction. That value won’t change. Pass it to a function, store it, compare it to another snapshot. Two threads reading the same snapshot always agree - no locks, no coordination. And because a snapshot is just a value, you can hand it to any number of workers across threads, processes, or machines. Read scaling is built in: spin up more readers, not more database connections.&lt;/p&gt;
&lt;h2 id=&quot;structural-sharing&quot;&gt;Structural sharing&lt;/h2&gt;
&lt;p&gt;If every write produces a new snapshot, won’t you run out of memory? No - because snapshots share structure. When you transact new data, Datahike creates new tree nodes only for changed portions; everything else is reused. A million-row database with one updated row shares 99.99% of its structure with the previous version.&lt;/p&gt;
&lt;p&gt;This is the same trick that powers Clojure’s persistent vectors and git’s object store. Overhead is logarithmic, not linear.&lt;/p&gt;
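Path copying can be sketched with a toy persistent tree (illustrative Java, not Datahike's actual index code): updating one leaf allocates only the nodes on the root-to-leaf path, and every other subtree is shared by reference between versions.

```java
// Toy persistent binary tree. `update` returns a NEW root; the old root
// still describes the old snapshot, and untouched subtrees are shared.
record Node(Node left, Node right, long value) {
    static Node leaf(long v) { return new Node(null, null, v); }

    // Copy the O(log n) path to the addressed leaf; share everything else.
    static Node update(Node n, int depth, int index, long v) {
        if (depth == 0) return leaf(v);
        int bit = (index >>> (depth - 1)) & 1;
        return bit == 0
            ? new Node(update(n.left(), depth - 1, index, v), n.right(), 0)
            : new Node(n.left(), update(n.right(), depth - 1, index, v), 0);
    }
}
```

Both snapshots remain fully usable after the update, which is exactly what makes "every commit is a snapshot" affordable.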
&lt;h2 id=&quot;branching&quot;&gt;Branching&lt;/h2&gt;
&lt;p&gt;Fork a database, make changes in isolation, merge back when ready. Unlike git (which merges text files), database merges operate on datoms with application-defined conflict resolution.&lt;/p&gt;
&lt;p&gt;This enables workflows that are awkward otherwise: feature branches for data migrations, parallel experiments with different schemas, per-tenant forks sharing a common ancestor. It’s also how coding assistants use git worktrees to isolate their edits - the same model applies to data.&lt;/p&gt;
&lt;h2 id=&quot;history-that-persists&quot;&gt;History that persists&lt;/h2&gt;
&lt;p&gt;Most databases offer snapshot isolation through MVCC, but those snapshots are ephemeral - garbage collected after the transaction commits. You can’t query “what was the value last Tuesday?”&lt;/p&gt;
&lt;p&gt;Datahike keeps history by default. Every past state is addressable. Query as-of a specific instant, diff two snapshots, audit when a fact changed. Useful for debugging, compliance, and any system that needs to explain itself.&lt;/p&gt;
&lt;h2 id=&quot;the-tradeoff&quot;&gt;The tradeoff&lt;/h2&gt;
&lt;p&gt;Immutability isn’t free. Write amplification is real: inserting a row touches multiple tree nodes. Storage grows with history, though compaction can prune what you don’t need.&lt;/p&gt;
&lt;p&gt;In practice, this cost is amortized. You don’t create a snapshot for every fact added during a bulk load. The underlying data structures support &lt;em&gt;transient&lt;/em&gt; modes - mutable during a batch, immutable at the boundary. Snapshots are created only when a batch commits, and only those become visible to external readers. The system can adaptively coarse-grain batches to balance write throughput against snapshot granularity.&lt;/p&gt;
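The transient pattern in miniature, as a hypothetical Java sketch (not the actual API): writes mutate a private buffer cheaply, and only the commit boundary publishes an immutable value to readers.

```java
import java.util.ArrayList;
import java.util.List;

// Transient-style batching (illustrative). Mutations are in-place and
// invisible to readers; commit() freezes the batch into an immutable
// snapshot, which is the only thing external readers ever observe.
final class Batch {
    private final ArrayList<Long> buf = new ArrayList<>();
    Batch add(long v) { buf.add(v); return this; }      // cheap in-place write
    List<Long> commit() { return List.copyOf(buf); }    // immutable snapshot
}
```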
&lt;p&gt;For systems that value auditability, reproducibility, and coordination-free reads, this model beats the connection-oriented one we inherited from the 1970s.&lt;/p&gt;
&lt;/div&gt;</content:encoded></item><item><title>Versioned Analytics for Regulated Industries</title><link>https://datahike.io/notes/versioned-analytics-regulated-industries/</link><guid isPermaLink="true">https://datahike.io/notes/versioned-analytics-regulated-industries/</guid><description>How immutable snapshots, copy-on-write branching, and cross-system consistency solve audit compliance, reproducibility, and scenario analysis in regulated environments.</description><pubDate>Mon, 06 Apr 2026 23:00:00 GMT</pubDate><content:encoded>&lt;div class=&quot;container page prose&quot;&gt;
&lt;h1 id=&quot;versioned-analytics-for-regulated-industries&quot;&gt;Versioned Analytics for Regulated Industries&lt;/h1&gt;
&lt;p&gt;Financial regulation — &lt;a href=&quot;https://www.bis.org/bcbs/basel3.htm&quot;&gt;Basel III&lt;/a&gt;, &lt;a href=&quot;https://www.esma.europa.eu/publications-and-data/interactive-single-rulebook/mifid-ii&quot;&gt;MiFID II&lt;/a&gt;, &lt;a href=&quot;https://www.eiopa.europa.eu/browse/regulation-and-policy/solvency-ii_en&quot;&gt;Solvency II&lt;/a&gt;, &lt;a href=&quot;https://www.congress.gov/bill/107th-congress/house-bill/3763&quot;&gt;SOX&lt;/a&gt; — requires that risk calculations, credit decisions, and compliance reports be reproducible. Not just the code, but the exact data state that produced them. When an auditor asks “show me the data behind this risk number from six months ago,” the answer can’t be “we’ll try to reconstruct it.”&lt;/p&gt;
&lt;p&gt;Version control solved this problem for source code decades ago. But analytical data infrastructure never caught up. Data warehouses don’t version tables. Temporal tables track row-level changes but don’t compose across tables or systems. Manual snapshots are expensive, fragile, and don’t support branching for scenario analysis.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/replikativ/stratum&quot;&gt;Stratum&lt;/a&gt; brings the &lt;a href=&quot;/notes/the-git-model-for-databases&quot;&gt;git model&lt;/a&gt; to analytical data: every write creates an immutable, content-addressed snapshot. Old states remain accessible by commit UUID. Branches are O(1). And via &lt;a href=&quot;/notes/yggdrasil-unified-cow-protocols&quot;&gt;Yggdrasil&lt;/a&gt;, you can tie entity databases, analytical datasets, and search indices into a single consistent, auditable snapshot.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The problem&lt;/h2&gt;
&lt;p&gt;A typical analytical pipeline at a regulated institution:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Transactional data flows into a warehouse (nightly ETL or streaming)&lt;/li&gt;
&lt;li&gt;Analysts run GROUP BY / SUM / STDDEV queries for risk models and reports&lt;/li&gt;
&lt;li&gt;Results feed regulatory submissions — capital adequacy, liquidity coverage, market risk&lt;/li&gt;
&lt;li&gt;Months later, an auditor asks: “What data produced risk report X on date Y?”&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Step 4 is where things break. The warehouse has been mutated since then. Maybe there’s a backup, maybe not. Reconstructing the exact state requires replaying ETL from source systems — if those logs still exist.&lt;/p&gt;
&lt;p&gt;Even if you can reconstruct the data, you can’t &lt;em&gt;prove&lt;/em&gt; it’s the same data. There’s no cryptographic link between the report and the state that produced it. The best you can offer is procedural trust: “our backup process is reliable, and we believe this is what the data looked like.” That’s a weak foundation for regulatory compliance.&lt;/p&gt;
&lt;h2 id=&quot;immutable-snapshots-as-audit-anchors&quot;&gt;Immutable snapshots as audit anchors&lt;/h2&gt;
&lt;p&gt;With Stratum, every table is a copy-on-write value. Writes create new snapshots; old snapshots remain addressable by commit UUID or branch name. The underlying storage is a content-addressed Merkle tree — each snapshot’s identity is derived from a hash of its data, providing a cryptographic chain of custody from report to source.&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;require(&apos;[stratum.api :as st])

;; Load the current production state
def trades: st/load(store &quot;trades&quot; {:branch &quot;production&quot;})

;; Run today&apos;s risk calculation
def risk-report: st/q({:from trades, :group [:desk :currency], :agg [[:sum :notional] [:stddev :pnl] [:count]]})

;; The commit UUID is your audit anchor — store it alongside the report
;; Six months later, reproduce exactly:
def historical-trades: st/load(store &quot;trades&quot; {:as-of #uuid &quot;a1b2c3d4-...&quot;})

def historical-report: st/q({:from historical-trades, :group [:desk :currency], :agg [[:sum :notional] [:stddev :pnl] [:count]]})
;; Identical results, guaranteed by content addressing&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(require &apos;[stratum.api :as st])

;; Load the current production state
(def trades (st/load store &quot;trades&quot; {:branch &quot;production&quot;}))

;; Run today&apos;s risk calculation
(def risk-report
  (st/q {:from trades
         :group [:desk :currency]
         :agg [[:sum :notional] [:stddev :pnl] [:count]]}))

;; The commit UUID is your audit anchor — store it alongside the report
;; Six months later, reproduce exactly:
(def historical-trades
  (st/load store &quot;trades&quot; {:as-of #uuid &quot;a1b2c3d4-...&quot;}))

(def historical-report
  (st/q {:from historical-trades
         :group [:desk :currency]
         :agg [[:sum :notional] [:stddev :pnl] [:count]]}))
;; Identical results, guaranteed by content addressing&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Or via SQL — connect any PostgreSQL client:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Today&apos;s report&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; desk, currency, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;SUM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(notional), STDDEV(pnl), &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;COUNT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; trades &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GROUP BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; desk, currency;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Historical report: same query, different snapshot&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- resolved server-side via branch/commit configuration&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once committed, data cannot be modified — every state is a value, addressable by its content hash. Historical snapshots load lazily from storage on demand, so keeping years of history doesn’t mean paying for it in memory. And because snapshots are immutable values, multiple analysts can query the same or different points in time concurrently without coordination or locks.&lt;/p&gt;
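Content addressing in miniature (a sketch of the general Merkle-tree technique, not Stratum's actual hashing scheme): a node's identity is a hash over its payload and its children's identities, so equal content always yields the same id and any change to a leaf propagates up to the root hash.

```java
import java.security.MessageDigest;
import java.util.HexFormat;

// Merkle-style content id: hash(payload, child ids). Tampering with any
// node changes every id on the path to the root, which is what gives a
// report-to-data chain of custody its cryptographic teeth.
final class ContentId {
    static String of(byte[] payload, String... childIds) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(payload);
            for (String id : childIds) md.update(id.getBytes());
            return HexFormat.of().formatHex(md.digest());
        } catch (Exception e) { throw new RuntimeException(e); }
    }
}
```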
&lt;h2 id=&quot;scenario-analysis-with-branching&quot;&gt;Scenario analysis with branching&lt;/h2&gt;
&lt;p&gt;Beyond audit compliance, regulated institutions need scenario analysis. Basel III &lt;a href=&quot;https://www.bis.org/bcbs/publ/d450.htm&quot;&gt;stress testing&lt;/a&gt; requires banks to evaluate capital adequacy under hypothetical adverse conditions — equity drawdowns, interest rate shocks, credit spread widening. Traditional approaches involve copying production data into staging environments, running scenarios, comparing results, and cleaning up. That process is slow, expensive, and error-prone.&lt;/p&gt;
&lt;p&gt;With copy-on-write branching, forking a dataset is O(1) regardless of size. A 100-million-row table branches in microseconds because the fork is just a new root pointer into the shared tree. Only chunks that are actually modified get copied.&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;;; Fork production data for stress testing — O(1) regardless of table size
def stress-scenario: st/fork(trades)

;; Apply adverse conditions — only modified chunks are copied
;; e.g. via SQL: UPDATE trades SET price = price * 0.7
;;               WHERE asset_class = &apos;equity&apos;

;; Compare risk metrics: production vs stressed
def baseline-risk: st/q({:from trades, :group [:desk], :agg [[:stddev :pnl] [:sum :notional]]})

def stressed-risk: st/q({:from stress-scenario, :group [:desk], :agg [[:stddev :pnl] [:sum :notional]]})

;; Run as many scenarios as needed — each is an independent branch
;; Baseline, adverse, severely adverse, custom scenarios
;; all sharing unmodified data via structural sharing&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;;; Fork production data for stress testing — O(1) regardless of table size
(def stress-scenario (st/fork trades))

;; Apply adverse conditions — only modified chunks are copied
;; e.g. via SQL: UPDATE trades SET price = price * 0.7
;;               WHERE asset_class = &apos;equity&apos;

;; Compare risk metrics: production vs stressed
(def baseline-risk
  (st/q {:from trades
         :group [:desk]
         :agg [[:stddev :pnl] [:sum :notional]]}))

(def stressed-risk
  (st/q {:from stress-scenario
         :group [:desk]
         :agg [[:stddev :pnl] [:sum :notional]]}))

;; Run as many scenarios as needed — each is an independent branch
;; Baseline, adverse, severely adverse, custom scenarios
;; all sharing unmodified data via structural sharing&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Each branch is fully isolated: modifications to the stress scenario can’t touch production data. You can maintain dozens of concurrent scenarios without multiplying storage costs — they share all unmodified data. When you stop referencing a branch, mark-and-sweep GC reclaims the storage. No staging environments, no cleanup scripts.&lt;/p&gt;
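&lt;p&gt;The mechanics can be sketched with a toy chunked table (not Stratum’s actual chunk format): a branch is just a fresh list of references to immutable chunks, so forking never touches the data, and an update clones only the one chunk it hits:&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of copy-on-write forking (not Stratum's actual chunk
// format). A version is a list of references to immutable chunks, so forking
// copies the reference list, never the data. Updating one value copies only
// the chunk that holds it; every other chunk stays shared between versions.
class CowTable {
    static final int CHUNK = 4;
    final List<int[]> chunks;

    CowTable(List<int[]> chunks) { this.chunks = chunks; }

    static CowTable of(int... values) {
        List<int[]> cs = new ArrayList<>();
        for (int i = 0; i < values.length; i += CHUNK) {
            int[] c = new int[Math.min(CHUNK, values.length - i)];
            System.arraycopy(values, i, c, 0, c.length);
            cs.add(c);
        }
        return new CowTable(cs);
    }

    // O(1) in data size: only the list of chunk references is copied.
    CowTable fork() { return new CowTable(new ArrayList<>(chunks)); }

    // Copy-on-write update: clone the single affected chunk.
    CowTable set(int index, int value) {
        CowTable next = fork();
        int[] copy = next.chunks.get(index / CHUNK).clone();
        copy[index % CHUNK] = value;
        next.chunks.set(index / CHUNK, copy);
        return next;
    }

    int get(int index) { return chunks.get(index / CHUNK)[index % CHUNK]; }

    // How many chunks two versions still share by reference.
    static long shared(CowTable a, CowTable b) {
        return a.chunks.stream()
                .filter(c -> b.chunks.stream().anyMatch(d -> d == c))
                .count();
    }
}
```

&lt;p&gt;Running dozens of scenarios against a table of N chunks therefore costs storage proportional to the chunks each scenario actually modifies, not to N.&lt;/p&gt;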
&lt;p&gt;This also applies to model validation. When a risk model is updated, you can run the new model against historical snapshots and compare its outputs to the original model’s results — same data, different code, verifiable divergence.&lt;/p&gt;
&lt;h2 id=&quot;cross-system-consistency&quot;&gt;Cross-system consistency&lt;/h2&gt;
&lt;p&gt;A real regulatory pipeline isn’t just one analytical table. Entity data (customers, counterparties, legal entities) lives in a transactional database. Analytical views (positions, P&amp;#x26;L, exposures) live in a columnar engine. Compliance documents and communications live in a search index. For an audit to be meaningful, all of these need to be at the same point in time.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/notes/yggdrasil-unified-cow-protocols&quot;&gt;Yggdrasil&lt;/a&gt; provides a shared branching protocol across these heterogeneous systems. You can compose a &lt;a href=&quot;https://github.com/replikativ/datahike&quot;&gt;Datahike&lt;/a&gt; entity database, a Stratum analytical dataset, and a &lt;a href=&quot;https://github.com/replikativ/scriptum&quot;&gt;Scriptum&lt;/a&gt; search index into a single composite system — branching, snapshotting, and time-traveling all of them together.&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;require(&apos;[yggdrasil.core :as ygg])

;; Compose entity database + analytics + search into one system
def system: ygg/composite-system({:entities datahike-conn, :analytics stratum-store, :search scriptum-index})

;; Branch the entire system for an investigation
ygg/branch!(system &quot;investigation-2026-Q1&quot;)

;; Every component is now at the same logical point in time
;; Query across all three with a single consistent snapshot&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(require &apos;[yggdrasil.core :as ygg])

;; Compose entity database + analytics + search into one system
(def system
  (ygg/composite-system
    {:entities datahike-conn    ;; customer records, counterparties
     :analytics stratum-store   ;; trade data, positions, P&amp;#x26;L
     :search scriptum-index}))  ;; compliance documents, communications

;; Branch the entire system for an investigation
(ygg/branch! system &quot;investigation-2026-Q1&quot;)

;; Every component is now at the same logical point in time
;; Query across all three with a single consistent snapshot&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;When an auditor needs the full picture — the trade data, the customer entity that placed the trade, and the compliance documents reviewed at the time — they get a single consistent view across all systems, tied to one branch identifier. No manual coordination, no hoping the timestamps line up.&lt;/p&gt;
&lt;h2 id=&quot;compliance-lifecycle&quot;&gt;Compliance lifecycle&lt;/h2&gt;
&lt;p&gt;Immutable systems raise an obvious question: what about &lt;a href=&quot;https://gdpr-info.eu/&quot;&gt;GDPR&lt;/a&gt; right-to-erasure, or data retention policies that require deletion?&lt;/p&gt;
&lt;p&gt;Immutability doesn’t mean data can never be removed — it means deletion is explicit and verifiable rather than implicit and unauditable. The Datahike ecosystem supports purge operations that remove specific data from all indices and all historical snapshots. Mark-and-sweep garbage collection, coordinated across systems via Yggdrasil, reclaims storage from unreachable snapshots.&lt;/p&gt;
&lt;p&gt;This is actually a stronger compliance story than mutable databases offer. In a mutable system, you &lt;code&gt;DELETE&lt;/code&gt; a row and trust that the storage layer eventually overwrites it — but you can’t prove it’s gone from backups, replicas, or caches. With explicit purge on content-addressed storage, you can verify that the data no longer exists in any reachable snapshot.&lt;/p&gt;
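&lt;p&gt;A hypothetical sketch of what that verification looks like on content-addressed storage. Every reachable snapshot root lists the addresses it references, so a sweep over the roots can prove that a purged address is gone everywhere (the names and shapes here are illustrative, not Datahike’s API):&lt;/p&gt;

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of erasure verification on a content-addressed store
// (not Datahike's actual API). Each snapshot root lists the addresses it
// references; after a purge, a sweep over all reachable roots can prove the
// address no longer exists in any reachable snapshot.
class PurgeCheck {
    // keys: addresses of snapshot roots; values: chunk addresses they reference
    static boolean isErased(Map<String, Set<String>> reachableRoots, String purgedAddress) {
        return reachableRoots.values().stream()
                .noneMatch(refs -> refs.contains(purgedAddress));
    }
}
```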
&lt;h2 id=&quot;production-ready-performance&quot;&gt;Production-ready performance&lt;/h2&gt;
&lt;p&gt;Versioning and immutability don’t come at the cost of query speed. Stratum uses SIMD-accelerated execution via the &lt;a href=&quot;https://openjdk.org/jeps/469&quot;&gt;Java Vector API&lt;/a&gt;, fused filter-aggregate pipelines, and zone-map pruning to skip entire data chunks. It runs standard OLAP benchmarks competitively with engines like &lt;a href=&quot;https://duckdb.org/&quot;&gt;DuckDB&lt;/a&gt; — while also providing branching, time travel, and content addressing that pure analytical engines don’t.&lt;/p&gt;
&lt;p&gt;Full SQL is supported via the PostgreSQL wire protocol: aggregates, window functions, joins, CTEs, subqueries. Connect with psql, JDBC, DBeaver, or any PostgreSQL-compatible client. See the &lt;a href=&quot;/notes/stratum-analytics-engine&quot;&gt;Stratum technical deep-dive&lt;/a&gt; for architecture details and &lt;a href=&quot;https://github.com/replikativ/stratum/blob/main/doc/benchmarks.md&quot;&gt;benchmark methodology&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;getting-started&quot;&gt;Getting started&lt;/h2&gt;
&lt;p&gt;Stratum runs as an in-process Clojure library or a standalone SQL server. Requires JDK 21+.&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;{:deps {org.replikativ/stratum {:mvn/version &quot;RELEASE&quot;}}}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;{:deps {org.replikativ/stratum {:mvn/version &quot;RELEASE&quot;}}}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;If you’re building analytical infrastructure in a regulated environment — or exploring how versioned data can simplify your compliance story — &lt;a href=&quot;mailto:contact@datahike.io&quot;&gt;get in touch&lt;/a&gt;. We work with teams in finance, insurance, and healthcare to design data architectures where auditability is built in, not bolted on.&lt;/p&gt;
&lt;/div&gt;</content:encoded></item><item><title>Why Search Needs Versioning</title><link>https://datahike.io/notes/why-search-needs-versioning/</link><guid isPermaLink="true">https://datahike.io/notes/why-search-needs-versioning/</guid><description>Immutable search indexes for reproducible retrieval and systems that can explain themselves.</description><pubDate>Mon, 05 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;div class=&quot;container page prose&quot;&gt;
&lt;h1 id=&quot;why-search-needs-versioning&quot;&gt;Why Search Needs Versioning&lt;/h1&gt;
&lt;p&gt;Search indexes are almost always mutable. You insert documents or embeddings, update them, delete them - the index reflects current state. This is fine when you’ll never need to query or audit past states, but breaks when retrieval feeds into reasoning.&lt;/p&gt;
&lt;p&gt;Once search results enter an LLM’s context window or guide an agent’s action, the index is effectively memory. If that memory overwrites itself on every update, you can’t reproduce or audit past retrieval results. This applies to vector search and full-text search alike.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The problem&lt;/h2&gt;
&lt;p&gt;A retrieval-augmented system in production: embeddings are indexed, queries retrieve context, responses are generated. A week later, someone asks why the system returned a particular result. In a mutable index, there’s no answer. The index changed. The embedding model may have been updated. The retrieval state that produced that response no longer exists.&lt;/p&gt;
&lt;p&gt;This isn’t theoretical. Any system where retrieval influences outcomes - recommendations, classifications, agent decisions - has this problem. The less human oversight there is, the more it matters.&lt;/p&gt;
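&lt;p&gt;With a versioned index, the remedy is to pin provenance at retrieval time: record the index commit alongside each response, so the exact retrieval state can be reopened later. A minimal illustrative sketch (the record shape here is hypothetical, not a Proximum API):&lt;/p&gt;

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: log each retrieval together with the index commit it
// ran against, so any past response can be traced back to the exact index
// state that produced it. (Record and field names are hypothetical.)
class RetrievalLog {
    record Retrieval(String query, UUID indexCommit, List<String> resultIds) {}

    private final Map<String, Retrieval> byResponseId = new ConcurrentHashMap<>();

    void record(String responseId, Retrieval r) { byResponseId.put(responseId, r); }

    // Which index commit produced this response? With a versioned index,
    // this commit id is enough to reopen the exact state and replay the query.
    UUID commitFor(String responseId) { return byResponseId.get(responseId).indexCommit(); }
}
```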
&lt;h2 id=&quot;proximum-git-semantics-for-search&quot;&gt;Proximum: git semantics for search&lt;/h2&gt;
&lt;p&gt;Proximum applies the same copy-on-write model that powers Datahike and Clojure’s persistent data structures to HNSW (Hierarchical Navigable Small World) vector indexes. Every insert returns a new index version. Previous versions remain valid and queryable:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;java&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Create and populate&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;var&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idx &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ProximumVectorStore.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;builder&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;()&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    .&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;dimensions&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1536&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    .&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;storagePath&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;/var/data/vectors&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    .&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;build&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;idx.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;addBatch&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(embeddings, ids);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;idx.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;sync&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;().&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;join&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;();  &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// persist and wait for completion&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;UUID v1 &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idx.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;getCommitId&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;idx.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;addBatch&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(moreEmbeddings, moreIds);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;idx.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;sync&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;().&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;join&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;UUID v2 &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idx.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;getCommitId&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Both versions remain searchable&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;var&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; storeConfig &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; Map.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;of&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;backend&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;:file&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;path&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;/var/data/vectors&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;var&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; oldIndex &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ProximumVectorStore.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;connectCommit&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(storeConfig, v1);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;oldIndex.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;search&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(query, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);  &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// original state&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;idx.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;search&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(query, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);       &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// current state&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;branch()&lt;/code&gt; operation is O(1) - it shares structure with the original. Two branches diverge independently without copying data. This makes A/B testing embeddings, bisecting regressions, and maintaining reproducible baselines cheap.&lt;/p&gt;
&lt;h2 id=&quot;how-it-works&quot;&gt;How it works&lt;/h2&gt;
&lt;p&gt;The core data structure is a &lt;code&gt;PersistentEdgeIndex&lt;/code&gt;: chunked copy-on-write arrays that hold HNSW graph edges. Layer 0 (the dense bottom layer) uses fixed-size chunks; upper layers use sparse per-node arrays. When you modify the graph, only affected chunks are copied. Unchanged structure is shared.&lt;/p&gt;
&lt;p&gt;Vectors themselves live in a memory-mapped store backed by Konserve, so the same index can be persisted to disk, S3, or any pluggable backend. The combination gives you SIMD-accelerated search with full version history and portable storage.&lt;/p&gt;
&lt;h2 id=&quot;scriptum-git-semantics-for-full-text-search&quot;&gt;Scriptum: git semantics for full-text search&lt;/h2&gt;
&lt;p&gt;The same versioning principles apply to traditional full-text search. Scriptum brings copy-on-write branching to Apache Lucene by sharing immutable segment files across branches:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;java&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Create an index&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;BranchIndexWriter main &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; BranchIndexWriter.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;create&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Path.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;of&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;/var/data/search&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;), &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;main&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;Document doc &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; new&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; Document&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;doc.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;add&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;new&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; TextField&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;content&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;searchable text&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, Field.Store.YES));&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;main.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;addDocument&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(doc);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;main.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;commit&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;Initial index&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Fork a branch (3-5ms regardless of index size)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;BranchIndexWriter experiment &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; main.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;fork&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;experiment&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Branches evolve independently&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;experiment.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;addDocument&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(anotherDoc);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;experiment.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;commit&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;Experimental changes&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Time travel - query past state&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;DirectoryReader historical &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; main.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;openReaderAt&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Merge back when ready&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;main.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;mergeFrom&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(experiment);&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Forking is near-instant because Scriptum copies only new data - segment files are shared read-only. New writes create branch-specific segments. The &lt;code&gt;BranchedDirectory&lt;/code&gt; overlay pattern routes reads to the base index while capturing writes in the branch overlay.&lt;/p&gt;
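&lt;p&gt;The overlay idea itself fits in a few lines (this is an illustration of the pattern, not Scriptum’s actual &lt;code&gt;BranchedDirectory&lt;/code&gt;): reads fall through to the shared read-only base unless the branch has written its own file, and writes land only in the branch overlay:&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the overlay pattern (not Scriptum's actual
// BranchedDirectory). Reads fall through to the shared read-only base unless
// the branch has written its own file; writes land only in the overlay, so
// the base is never modified by any branch.
class OverlayDirectory {
    final Map<String, byte[]> base;                       // shared, read-only segments
    final Map<String, byte[]> overlay = new HashMap<>();  // branch-local writes

    OverlayDirectory(Map<String, byte[]> base) { this.base = base; }

    void write(String name, byte[] data) { overlay.put(name, data); }

    byte[] read(String name) {
        byte[] local = overlay.get(name);
        return local != null ? local : base.get(name); // fall through to base
    }
}
```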
&lt;p&gt;This gives you the same capabilities for keyword search, faceted navigation, and document retrieval that Proximum provides for vector search: reproducible queries, safe experimentation, and full audit history.&lt;/p&gt;
&lt;h2 id=&quot;what-this-enables&quot;&gt;What this enables&lt;/h2&gt;
&lt;p&gt;With versioned indexes you can run the same query against the same index state and get the same results, which makes evaluation of embedding models and ranking algorithms reproducible. You can fork an index to test a new chunking strategy or analyzer configuration without risking production state. You can query the index as it existed at any past instant to answer “what could the system have retrieved when it made that decision?” And because a snapshot is a value, you can hand it to any number of reader threads or processes without coordination or locking.&lt;/p&gt;
&lt;h2 id=&quot;the-cost&quot;&gt;The cost&lt;/h2&gt;
&lt;p&gt;Immutable indexes have write amplification: inserting a vector touches multiple graph edges, each potentially triggering chunk copies. Storage grows with history.&lt;/p&gt;
&lt;p&gt;In practice, this cost is amortized. You don’t create a snapshot for every vector added during a bulk load. The &lt;code&gt;PersistentEdgeIndex&lt;/code&gt; supports &lt;em&gt;transient&lt;/em&gt; mode - mutable during batch insert, immutable at the boundary. Snapshots are created only when a batch commits, and only those become visible to readers. The system can adaptively coarse-grain batches to balance throughput against snapshot granularity.&lt;/p&gt;
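&lt;p&gt;The transient pattern can be sketched as a mutable batch with an immutable commit boundary (a deliberate simplification; the real &lt;code&gt;PersistentEdgeIndex&lt;/code&gt; is considerably more involved):&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of transient-mode batching (the real PersistentEdgeIndex
// is far more sophisticated): mutate freely inside a batch, then freeze an
// immutable snapshot at the commit boundary. Only frozen snapshots become
// visible to readers, so bulk loads pay no per-insert copy cost.
class TransientBatch {
    private final List<float[]> pending = new ArrayList<>();

    // Cheap in-place mutation during bulk load; no copying here.
    void add(float[] vector) { pending.add(vector); }

    // The persistence boundary: everything added so far becomes one
    // immutable, reader-visible snapshot.
    List<float[]> commit() { return List.copyOf(pending); }
}
```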
&lt;p&gt;If your search needs to be reproducible and auditable, versioned indexes are a good fit. &lt;a href=&quot;/proximum&quot;&gt;Proximum&lt;/a&gt; handles vector search, &lt;a href=&quot;/scriptum&quot;&gt;Scriptum&lt;/a&gt; handles full-text. Both use the same copy-on-write approach.&lt;/p&gt;
&lt;/div&gt;</content:encoded></item><item><title>Why We Built Datahike</title><link>https://datahike.io/notes/why-we-built-datahike/</link><guid isPermaLink="true">https://datahike.io/notes/why-we-built-datahike/</guid><description>A personal story about functional values, long-lived systems, and the memory layer AI needs.</description><pubDate>Sat, 14 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;div class=&quot;container page prose&quot;&gt;
&lt;h1 id=&quot;why-we-built-datahike&quot;&gt;Why We Built Datahike&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;February 2026&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I’ve been working toward this for over a decade. It started with &lt;a href=&quot;http://reluk.ca/project/Votorola/home-2013.html&quot;&gt;Votorola&lt;/a&gt; - collaborative liquid democracy software - where I first needed to distribute a memory model across systems. That led me to Clojure, which led me to a question that I’ve been chasing ever since: how do you build data infrastructure that doesn’t lose history?&lt;/p&gt;
&lt;p&gt;Most databases are designed for transactional business logic: process an order, update an account, move on. But many of the systems we’re building today are different. They run for weeks or months, accumulate knowledge, and need to reason about their own past. A database that overwrites state on every write doesn’t support that well.&lt;/p&gt;
&lt;p&gt;This is the story of why we built Datahike, and why I think immutable, versioned data is the right foundation for systems that need to last.&lt;/p&gt;
&lt;h2 id=&quot;the-problem-with-mutable-state&quot;&gt;The problem with mutable state&lt;/h2&gt;
&lt;p&gt;In 2013, I started &lt;a href=&quot;https://github.com/replikativ&quot;&gt;replikativ&lt;/a&gt; to explore distributed, cross-platform replication systems. The core challenge was always synchronization: how do you keep data consistent across nodes without losing the ability to reason about history? But the deeper I got, the more I realized the problem wasn’t distribution - it was mutability.&lt;/p&gt;
&lt;p&gt;When data changes in place, you lose the ability to ask “what did the system know last Tuesday?” You can’t fork an experiment, try something, and merge it back. You can’t audit what happened, because the evidence has been overwritten.&lt;/p&gt;
&lt;p&gt;In functional programming we solved this decades ago. Data structures are immutable - values don’t change, you get new values. Programs become easier to reason about and test. I kept wondering why databases didn’t work the same way.&lt;/p&gt;
&lt;h2 id=&quot;finding-the-pieces&quot;&gt;Finding the pieces&lt;/h2&gt;
&lt;p&gt;The answer, it turned out, was that they could - but the pieces weren’t assembled yet. &lt;a href=&quot;https://datomic.com&quot;&gt;Datomic&lt;/a&gt; had shown the way: immutable, versioned data with time travel. But Datomic was closed source and designed for centralized deployment. I wanted something open, distributed by design, and built for systems that live everywhere - from edge devices to cloud clusters.&lt;/p&gt;
&lt;p&gt;We needed the right combination of query engine, index structure, and persistence.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. A mature query engine&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Nikita Prokopov’s &lt;a href=&quot;https://github.com/tonsky/datascript&quot;&gt;DataScript&lt;/a&gt; provided this. It was an in-memory Datalog database with five years of development, a robust query engine, and a clean, well-designed codebase. The only problem: it was purely in-memory. No durability.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. A functional, persistent index&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We initially experimented with David Greenberg’s &lt;a href=&quot;https://github.com/datacrypt-project/hitchhiker-tree&quot;&gt;Hitchhiker Tree&lt;/a&gt;, which taught us a lot about immutable indexing. It combines B+ tree query performance with append-only write semantics - great for logs and write-heavy workloads. But database indices are read-dominated: the Hitchhiker Tree trades some read speed for write throughput, which wasn’t the right trade-off for our use case.&lt;/p&gt;
&lt;p&gt;So we extended &lt;a href=&quot;https://github.com/replikativ/persistent-sorted-set&quot;&gt;persistent-sorted-set&lt;/a&gt;, a functionally persistent sorted set optimized for database indices. It gives us excellent read performance while maintaining immutable semantics and efficient structural sharing. When you “update” the index, you don’t mutate nodes in place - you create new nodes that share structure with the old ones. The old version still exists, unchanged.&lt;/p&gt;
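&lt;p&gt;Plain Clojure’s persistent collections show the same semantics in miniature - this isn’t persistent-sorted-set itself, just a sketch of the structural-sharing behavior our index nodes rely on:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;;; &quot;Updating&quot; a persistent sorted set returns a new value that
;; shares structure with the old one; the old value is untouched.
(def v1 (sorted-set 1 2 3))
(def v2 (conj v1 4))

v1 ;; =&gt; #{1 2 3} - unchanged
v2 ;; =&gt; #{1 2 3 4} - shares structure with v1&lt;/code&gt;&lt;/pre&gt;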
&lt;p&gt;&lt;strong&gt;3. The glue to put them together&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is where Datahike came in. We forked DataScript, adjusted persistent-sorted-set, added storage backends (file, SQL, LMDB, S3, GCS and more via &lt;a href=&quot;https://github.com/replikativ/konserve&quot;&gt;Konserve&lt;/a&gt;), and kept going. &lt;a href=&quot;https://github.com/kordano&quot;&gt;Konrad Kühne&lt;/a&gt; and our former team at Lambdaforge UG contributed substantially in the early years - adding history indices, time travel support, and helping Datahike achieve temporal query parity with Datomic. Together we built out schema flexibility and the protocols that make Datahike extensible.&lt;/p&gt;
&lt;h2 id=&quot;the-realization-databases-should-be-values&quot;&gt;The realization: databases should be values&lt;/h2&gt;
&lt;p&gt;Here’s the thing that took me years to fully appreciate: in Datahike, a database is a value, not a service.&lt;/p&gt;
&lt;p&gt;In a traditional database, you connect to a server. The data changes between queries. You’re always interacting with “the database” as a mutable thing.&lt;/p&gt;
&lt;p&gt;In Datahike, you dereference a connection and get a database value: a snapshot frozen at a particular transaction. That value won’t change. You can pass it to a function. Store it. Compare it to another snapshot. Two threads reading the same database value always see the same thing - no locks, no coordination needed.&lt;/p&gt;
&lt;p&gt;This matters because it makes the database composable. You can hold a snapshot in a variable, hand it to a worker, serialize it, or compare two versions structurally. Read scaling becomes trivial: spin up more readers, not more database connections.&lt;/p&gt;
&lt;p&gt;But the real power is what this enables.&lt;/p&gt;
&lt;h2 id=&quot;git-semantics-for-data&quot;&gt;Git semantics for data&lt;/h2&gt;
&lt;p&gt;Once you have immutable snapshots, you can do things that are awkward or impossible with traditional databases:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Branching&lt;/strong&gt;: Fork a database, make changes in isolation, merge back when ready. Unlike git (which merges text files), database merges operate on datoms with application-defined conflict resolution. This enables feature branches for data migrations, parallel experiments with different schemas, and per-tenant forks sharing a common ancestor.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Time travel&lt;/strong&gt;: Query any past state. Not “last 7 days” - any specific instant. Diff two snapshots to see exactly what changed. Audit when a fact was added or retracted.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reproducibility&lt;/strong&gt;: Capture a snapshot, store it, query it later. Same snapshot always yields same results. This is essential for ML experiments, compliance systems, or anything that needs to explain its decisions.&lt;/p&gt;
&lt;h2 id=&quot;why-this-matters-for-ai&quot;&gt;Why this matters for AI&lt;/h2&gt;
&lt;p&gt;During my PhD, I developed inference systems that accumulate evidence over time. Probabilistic programs build up distributions, revise beliefs, maintain uncertainty. They need to fork hypotheses, evaluate alternatives, and keep track of the path that led to each conclusion. The database backing such a system needs to support that natively - not as a bolt-on.&lt;/p&gt;
&lt;p&gt;The same applies to any long-running system that accumulates knowledge: agent pipelines, compliance systems, scientific workflows. They all benefit from being able to fork state safely, roll back when something goes wrong, and answer “what did this system know when it made that decision?”&lt;/p&gt;
&lt;p&gt;Datahike provides this: knowledge survives restarts, you can fork and merge, every past state is queryable, and the same query on the same snapshot always returns the same result.&lt;/p&gt;
&lt;h2 id=&quot;what-weve-built&quot;&gt;What we’ve built&lt;/h2&gt;
&lt;p&gt;From those early experiments, Datahike has grown into more than just a database:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Core database&lt;/strong&gt;: Immutable Datalog with pluggable storage&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;/proximum&quot;&gt;Proximum&lt;/a&gt;&lt;/strong&gt;: Version-controlled vector indexing for semantic search&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;/scriptum&quot;&gt;Scriptum&lt;/a&gt;&lt;/strong&gt;: Git-like branching for full-text search (Apache Lucene extension)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/replikativ/yggdrasil&quot;&gt;Yggdrasil&lt;/a&gt;&lt;/strong&gt;: Protocols unifying branching semantics across storage systems&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each piece applies the same underlying idea: data should be immutable and versioned by default. We’re not done. Datalog is our starting point, and we’re working toward a broader programming model where persistent, versioned state is the default across distributed environments.&lt;/p&gt;
&lt;h2 id=&quot;where-were-going&quot;&gt;Where we’re going&lt;/h2&gt;
&lt;p&gt;I’m bootstrapping a company on top of Datahike. We’re looking for collaborators who want to push distributed immutable systems forward, and for early customers who need versioned data infrastructure in production.&lt;/p&gt;
&lt;p&gt;This work has always been collaborative. Konrad Kühne and our early team helped shape Datahike’s foundation. The broader open source community continues to push it forward through issues, PRs, and production deployments.&lt;/p&gt;
&lt;p&gt;If you’re building something where audit, reproducibility, or long-term memory matter, I’d like to hear about it.&lt;/p&gt;
&lt;p&gt;Christian Weilbach&lt;br&gt;
&lt;em&gt;Founder and maintainer&lt;/em&gt;&lt;/p&gt;
&lt;/div&gt;</content:encoded></item><item><title>Yggdrasil - Branching Protocols</title><link>https://datahike.io/notes/yggdrasil-unified-cow-protocols/</link><guid isPermaLink="true">https://datahike.io/notes/yggdrasil-unified-cow-protocols/</guid><description>A protocol stack that brings Git-like branching to any storage system.</description><pubDate>Sat, 24 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;div class=&quot;container page prose&quot;&gt;
&lt;h1 id=&quot;yggdrasil-branching-protocols&quot;&gt;Yggdrasil: Branching Protocols&lt;/h1&gt;
&lt;p&gt;What if every storage system spoke the same branching language? Yggdrasil is a protocol stack that brings Git-like semantics (snapshots, branches, merges, history) to heterogeneous storage backends.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;In Norse mythology, Yggdrasil is the World Tree connecting nine realms. This library connects storage systems under one unified API.&lt;/em&gt;&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The problem&lt;/h2&gt;
&lt;p&gt;Modern data systems are fragmented. Your vector index, your database, your filesystem, your container images - each has its own versioning model (or none at all). When you need reproducible pipelines across these systems, you’re left stitching together incompatible abstractions.&lt;/p&gt;
&lt;p&gt;Consider an ML training pipeline:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Datasets versioned in LakeFS&lt;/li&gt;
&lt;li&gt;Model weights on a filesystem&lt;/li&gt;
&lt;li&gt;Embeddings in a vector store&lt;/li&gt;
&lt;li&gt;Metadata in a database&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each system has different semantics for “create a snapshot” or “roll back to yesterday.” Coordinating them requires custom glue code that’s brittle and hard to reason about.&lt;/p&gt;
&lt;h2 id=&quot;the-solution-shared-protocols&quot;&gt;The solution: shared protocols&lt;/h2&gt;
&lt;p&gt;Yggdrasil defines a layered protocol stack that any storage system can implement. All operations use value semantics - mutating operations return new system values rather than modifying anything in place.&lt;/p&gt;
&lt;table style=&quot;width: 100%; border-collapse: collapse; margin: 1.5rem 0;&quot;&gt;
  &lt;thead&gt;
    &lt;tr style=&quot;text-align: left; border-bottom: 1px solid var(--border);&quot;&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Protocol&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 0;&quot;&gt;Operations&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Snapshotable&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;snapshot-id&lt;/code&gt;, &lt;code&gt;parent-ids&lt;/code&gt;, &lt;code&gt;as-of&lt;/code&gt;, &lt;code&gt;snapshot-meta&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Branchable&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;branches&lt;/code&gt;, &lt;code&gt;branch!&lt;/code&gt;, &lt;code&gt;checkout&lt;/code&gt;, &lt;code&gt;delete-branch!&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Graphable&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;history&lt;/code&gt;, &lt;code&gt;ancestors&lt;/code&gt;, &lt;code&gt;common-ancestor&lt;/code&gt;, &lt;code&gt;commit-graph&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Mergeable&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;merge!&lt;/code&gt;, &lt;code&gt;conflicts&lt;/code&gt;, &lt;code&gt;diff&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Overlayable&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;overlay&lt;/code&gt;, &lt;code&gt;advance!&lt;/code&gt;, &lt;code&gt;merge-down!&lt;/code&gt;, &lt;code&gt;discard!&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Watchable&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;watch!&lt;/code&gt;, &lt;code&gt;unwatch!&lt;/code&gt; - receives typed events on commit, branch, checkout&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;GarbageCollectable&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;gc-roots&lt;/code&gt;, &lt;code&gt;gc-sweep!&lt;/code&gt; - coordinated cross-system mark-and-sweep&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Addressable &lt;em&gt;(optional)&lt;/em&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;working-path&lt;/code&gt; - filesystem path for current branch (Git, ZFS, Btrfs, OverlayFS)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Committable &lt;em&gt;(optional)&lt;/em&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;commit!&lt;/code&gt; - explicit commit, separated from snapshot reads&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;When multiple systems implement these protocols, you can compose them. Fork a database and a vector index together. Merge changes across both atomically. Query historical state consistently.&lt;/p&gt;
&lt;h2 id=&quot;twelve-adapters&quot;&gt;Twelve adapters&lt;/h2&gt;
&lt;table style=&quot;width: 100%; border-collapse: collapse; margin: 1.5rem 0;&quot;&gt;
  &lt;thead&gt;
    &lt;tr style=&quot;text-align: left; border-bottom: 1px solid var(--border);&quot;&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Adapter&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;System&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 0;&quot;&gt;Branching model&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Git&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Version control&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Native branches/commits&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;ZFS&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Filesystem&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Snapshots + clones&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Btrfs&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Filesystem&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Subvolumes + snapshots&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;OverlayFS&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Filesystem&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Layered directories&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Podman&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Containers&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Image layers&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;IPFS&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;P2P storage&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Content-addressed commits + IPNS branches&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Iceberg&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Table format&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Snapshots + native branches&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Datahike&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Database&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Native COW&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;LakeFS&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Data lake&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Git-like branches&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Dolt&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;SQL database&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Git-like branches&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Scriptum&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Full-text search&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Lucene segment sharing&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Proximum&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Vector search&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Merkle-verified snapshots&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&quot;compositesystem-branching-multiple-systems-as-one&quot;&gt;CompositeSystem: branching multiple systems as one&lt;/h2&gt;
&lt;p&gt;The most significant recent addition is &lt;code&gt;CompositeSystem&lt;/code&gt; - a fiber product (pullback) over a shared branch space. Given systems A and B, the composite is the pair (A, B), with both always on the same branch. All protocol operations apply componentwise.&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;require(&apos;[yggdrasil.composite :as composite] &apos;[yggdrasil.protocols :as p])

;; Compose a database and a search index
def sys: composite/composite([datahike-sys scriptum-sys] :name &quot;my-app&quot; :branch :main :store-path &quot;/var/lib/yggdrasil/composite&quot;)

;; All protocol operations work on both systems simultaneously
def branched: sys .&gt; p/branch!(:experiment) .&gt; p/checkout(:experiment)

;; Commit both atomically - gets a deterministic composite snapshot-id
def committed: p/commit!(branched &quot;experimental run&quot;)

;; Merge back
def merged: committed .&gt; p/checkout(:main) .&gt; p/merge!(:experiment)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(require &apos;[yggdrasil.composite :as composite]
         &apos;[yggdrasil.protocols :as p])

;; Compose a database and a search index
(def sys (composite/composite [datahike-sys scriptum-sys]
           :name &quot;my-app&quot;
           :branch :main
           :store-path &quot;/var/lib/yggdrasil/composite&quot;))  ; optional persistence

;; All protocol operations work on both systems simultaneously
(def branched (-&gt; sys
                  (p/branch! :experiment)
                  (p/checkout :experiment)))

;; Commit both atomically - gets a deterministic composite snapshot-id
(def committed (p/commit! branched &quot;experimental run&quot;))

;; Merge back
(def merged (-&gt; committed
                (p/checkout :main)
                (p/merge! :experiment)))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;code&gt;snapshot-id&lt;/code&gt; on a composite returns a deterministic UUID derived from the combined state of all sub-systems - the same combination always yields the same ID. History, conflicts, and GC roots are all computed across the full set.&lt;/p&gt;
&lt;p&gt;Passing &lt;code&gt;:store-path&lt;/code&gt; persists the composite history via a persistent-sorted-set B-tree backed by Konserve, so history survives process restarts.&lt;/p&gt;
&lt;h2 id=&quot;workspace-hlc-coordinated-multi-system-operations&quot;&gt;Workspace: HLC-coordinated multi-system operations&lt;/h2&gt;
&lt;p&gt;The workspace layer adds Hybrid Logical Clock (HLC) coordination across independently managed systems. This enables temporal queries that span system boundaries.&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;require(&apos;[yggdrasil.workspace :as ws])

def w: ws/create-workspace({:store-path &quot;/var/lib/yggdrasil/my-app&quot;})

;; Manage systems - auto-installs commit hooks
;; Datahike uses d/listen (immediate); others fall back to polling
ws/manage!(w datahike-sys)
ws/manage!(w git-sys)

;; Query world state at any wall-clock time
let [world ws/as-of-time(w some-past-date.getTime())]:
  doseq [[[system-id branch] entry] world]:
    println(system-id branch &quot;was at snapshot&quot; :snapshot-id(entry))
  end
end&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(require &apos;[yggdrasil.workspace :as ws])

(def w (ws/create-workspace {:store-path &quot;/var/lib/yggdrasil/my-app&quot;}))

;; Manage systems - auto-installs commit hooks
;; Datahike uses d/listen (immediate); others fall back to polling
(ws/manage! w datahike-sys)
(ws/manage! w git-sys)

;; Query world state at any wall-clock time
(let [world (ws/as-of-time w (.getTime some-past-date))]
  (doseq [[[system-id branch] entry] world]
    (println system-id branch &quot;was at snapshot&quot; (:snapshot-id entry))))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Each commit in the registry carries an HLC timestamp. &lt;code&gt;as-of-time&lt;/code&gt; scans the index and returns the snapshot each system was at for any given moment - across all managed systems consistently.&lt;/p&gt;
&lt;h2 id=&quot;typed-diffs&quot;&gt;Typed diffs&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;diff&lt;/code&gt; returns system-specific records:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;;; GitDiff: {:snapshot-a, :snapshot-b, :stat, :patch, :files [{:status :added/:modified/:deleted, :path}]}
;; DatahikeDiff: {:from, :to, :added [datoms], :removed [datoms], :summary {:added-datoms n, ...}}
;; DiffError: {:from, :to, :error}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;;; GitDiff: {:snapshot-a, :snapshot-b, :stat, :patch, :files [{:status :added/:modified/:deleted, :path}]}
;; DatahikeDiff: {:from, :to, :added [datoms], :removed [datoms], :summary {:added-datoms n, ...}}
;; DiffError: {:from, :to, :error}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Callers can pattern-match on record type for system-specific handling.&lt;/p&gt;
&lt;h2 id=&quot;compliance-testing&quot;&gt;Compliance testing&lt;/h2&gt;
&lt;p&gt;Every adapter passes the same compliance test suite, which checks the same behavioral contract across all systems.&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;compliance/run-compliance-tests({:create-system fn []:
  my-adapter/init!(config)
end
                                 :mutate fn [sys]:
  ...
end
                                 :commit fn [sys msg]:
  ...
end
                                 :close! fn [sys]:
  ...
end})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(compliance/run-compliance-tests
  {:create-system (fn [] (my-adapter/init! config))
   :mutate        (fn [sys] ...)
   :commit        (fn [sys msg] ...)
   :close!        (fn [sys] ...)})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h2 id=&quot;why-this-matters&quot;&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;The practical value comes from being able to treat heterogeneous systems as one versioned unit. An ML pipeline can version its datasets, model weights, and embeddings together under one composite snapshot, making any training run fully reproducible. An agent system can fork its complete environment - database, vector store, working directory - per agent, merge successful experiments back, and discard failures without cleanup. A test suite can fork production state across all systems in milliseconds.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;as-of-time&lt;/code&gt; query is particularly useful for audit: “what was the exact state of every system when this decision was made?” answered across heterogeneous backends with causal ordering.&lt;/p&gt;
&lt;h2 id=&quot;getting-started&quot;&gt;Getting started&lt;/h2&gt;
&lt;p&gt;See the &lt;a href=&quot;https://github.com/replikativ/yggdrasil&quot;&gt;GitHub repository&lt;/a&gt; for installation and adapter-specific setup. Licensed under Apache 2.0.&lt;/p&gt;
&lt;h2 id=&quot;part-of-the-datahike-ecosystem&quot;&gt;Part of the Datahike ecosystem&lt;/h2&gt;
&lt;p&gt;Yggdrasil is the protocol layer that connects:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/replikativ/datahike&quot;&gt;Datahike&lt;/a&gt; - Immutable Datalog database&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/replikativ/proximum&quot;&gt;Proximum&lt;/a&gt; - Version-controlled vector search&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/replikativ/scriptum&quot;&gt;Scriptum&lt;/a&gt; - Branching for Apache Lucene&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/stratum&quot;&gt;Stratum&lt;/a&gt; - Columnar SQL with CoW snapshots&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Yggdrasil provides the shared vocabulary that lets these systems branch together.&lt;/p&gt;
&lt;/div&gt;</content:encoded></item></channel></rss>