<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Datahike Notes</title><description>Technical notes on versioned data infrastructure</description><link>https://datahike.io/</link><item><title>Anomaly Detection Belongs in Your Database</title><link>https://datahike.io/notes/anomaly-detection-in-your-database/</link><guid isPermaLink="true">https://datahike.io/notes/anomaly-detection-in-your-database/</guid><description>Why we built SIMD-accelerated isolation forests directly into Stratum&apos;s SQL engine — and why exporting to Python is the wrong default.</description><pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;div class=&quot;container page prose&quot;&gt;
&lt;h1 id=&quot;anomaly-detection-belongs-in-your-database&quot;&gt;Anomaly Detection Belongs in Your Database&lt;/h1&gt;
&lt;p&gt;Every analytical database can aggregate, filter, and join. None of them can tell you “something is wrong with this data” as a first-class operation.&lt;/p&gt;
&lt;p&gt;The standard workflow today: query your warehouse, serialize millions of rows into a DataFrame, import scikit-learn, fit an &lt;code&gt;IsolationForest&lt;/code&gt;, write results back. You now maintain two systems, two runtimes, and a serialization boundary that adds seconds of latency per round-trip. For a fraud detection pipeline running against live transactions, those seconds matter. For a data engineer who just wants to flag outliers in a &lt;code&gt;SELECT&lt;/code&gt; statement, the entire Python detour is unnecessary friction.&lt;/p&gt;
&lt;p&gt;We built anomaly detection directly into &lt;a href=&quot;/stratum&quot;&gt;Stratum&lt;/a&gt; — not as a UDF shim that calls Python under the hood, but as a native SIMD-accelerated implementation that runs inside the query engine. Train a model, score your data, all from SQL — no Python, no Clojure, no external runtime.&lt;/p&gt;
&lt;img src=&quot;/images/anomaly-detection-explainer.svg&quot; alt=&quot;Infographic: comparing the Python export pipeline (seconds of latency, 2x memory) with Stratum&amp;#x27;s in-database approach (6 microseconds per transaction), and showing how isolation forests detect anomalies by isolating outliers in fewer tree splits&quot; style=&quot;width: 100%; margin: 2rem 0;&quot;&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ANOMALY_SCORE(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;fraud_model&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;7&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;No data leaves the database. No serialization. The query planner pushes down predicates and prunes chunks before the model ever sees a row. Scoring a single transaction takes 6 microseconds. A batch of 1,000 incoming transactions: 1.6 milliseconds. That’s fast enough to sit in the hot path of a payment gateway — not as a batch job that runs after the fact, but as a synchronous check before the transaction clears.&lt;/p&gt;
&lt;h2 id=&quot;why-isolation-forests&quot;&gt;Why isolation forests&lt;/h2&gt;
&lt;p&gt;Most “anomaly detection in SQL” tutorials teach you to compute z-scores: &lt;code&gt;(value - AVG(value)) / STDDEV(value) &gt; 3&lt;/code&gt;. This works for Gaussian-distributed single columns. It fails everywhere else.&lt;/p&gt;
&lt;p&gt;Real anomalies are multivariate. A transaction amount of $500 is normal. A frequency of 20 per hour is normal. Both together, at 3am, to a merchant in a country where the cardholder has never transacted — that’s the signal. Z-scores can’t see it. Neither can IQR-based methods or simple threshold rules. You need a model that captures the joint structure of your data.&lt;/p&gt;
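To make that concrete, here is a toy sketch in plain Java (invented numbers, purely illustrative) of a row that sails through a per-column z-score check while being jointly anomalous:

```java
// Toy illustration with invented numbers: per-column z-scores miss a
// jointly anomalous row. Normal rows lie near the diagonal (high amount
// goes with high freq); the last row breaks that pattern.
public class ZScoreBlindSpot {
    static double zscore(double[] xs, double v) {
        double mean = 0, sq = 0;
        for (double x : xs) mean += x;
        mean /= xs.length;
        for (double x : xs) sq += (x - mean) * (x - mean);
        return (v - mean) / Math.sqrt(sq / xs.length);
    }

    public static void main(String[] args) {
        double[] amount = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 10};
        double[] freq   = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 100};
        // The last row (amount=10, freq=100) is the odd one out, yet both
        // of its z-scores stay far below the usual |z| > 3 threshold.
        System.out.printf("z(amount)=%.2f z(freq)=%.2f%n",
                zscore(amount, amount[10]), zscore(freq, freq[10]));
    }
}
```

Both z-scores come out around 1.35 in magnitude — nowhere near the threshold — even though the (amount, freq) pair is unlike every other row.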
&lt;p&gt;&lt;a href=&quot;https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf&quot;&gt;Isolation forests&lt;/a&gt; (Liu, Ting &amp;#x26; Zhou, 2008) take a fundamentally different approach. Instead of modeling what “normal” looks like — a density estimate, a distribution fit, a cluster boundary — they directly measure how easy it is to &lt;em&gt;isolate&lt;/em&gt; a point from everything else. Build a tree of random splits across random features. Anomalous points, being few and different, get isolated in fewer splits. Normal points, packed into dense regions, require many splits to separate.&lt;/p&gt;
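The paper's scoring rule is compact: s(x, n) = 2^(−E[h(x)] / c(n)), where h(x) is the number of splits needed to isolate x and c(n) normalizes by the average path length of an unsuccessful binary-search-tree lookup over n points. Here is that formula as a short Java sketch (an illustration of the paper's math, not Stratum's code):

```java
// s(x, n) = 2^(-E[h(x)] / c(n)): short isolation paths mean high scores.
public class IsolationScore {
    static final double EULER_MASCHERONI = 0.5772156649;

    // Average path length of an unsuccessful BST search over n points:
    // c(n) = 2*H(n-1) - 2*(n-1)/n, with H(i) approximated by ln(i) + gamma.
    static double c(int n) {
        return 2 * (Math.log(n - 1) + EULER_MASCHERONI) - 2.0 * (n - 1) / n;
    }

    static double score(double avgPathLength, int n) {
        return Math.pow(2, -avgPathLength / c(n));
    }

    public static void main(String[] args) {
        int n = 256; // the per-tree sample size used throughout this post
        // Isolated after only 2 splits: clearly anomalous, score near 1.
        System.out.printf("h=2    -> %.2f%n", score(2, n));
        // Average-length path: score is exactly 0.5, the "nothing special" point.
        System.out.printf("h=c(n) -> %.2f%n", score(c(n), n));
    }
}
```

The normalization is what makes scores comparable across sample sizes: 0.5 always means "as hard to isolate as a typical point."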
&lt;p&gt;The properties that make this algorithm uniquely suited to a columnar database:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No assumptions.&lt;/strong&gt; Z-scores assume Gaussian distributions. DBSCAN assumes density clusters. Isolation forests are non-parametric — they work on any distribution shape, any number of dimensions, without tuning.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Subsampling.&lt;/strong&gt; Each tree is trained on only 256 randomly sampled rows, regardless of total dataset size. Training 100 trees on 10M rows takes 6ms — it reads 25,600 rows total. This is the key insight from the original paper: anomalies are &lt;em&gt;so&lt;/em&gt; different that a tiny sample is enough to characterize them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Linear scoring.&lt;/strong&gt; Scoring each row means traversing 100 trees of depth ≤8. That’s 800 comparisons per row — branch-free, cache-friendly, and trivially parallelizable. Stratum’s implementation packs each tree node into a single &lt;code&gt;long&lt;/code&gt; (split feature in upper 32 bits, split value as float in lower 32), traverses with branchless &lt;code&gt;node = 2*node + 1 + cmp&lt;/code&gt;, and processes rows in morsel-driven parallel batches sized to fit L1 cache.&lt;/p&gt;
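A simplified sketch of that node layout and traversal — the breadth-first array order and fixed-depth loop are our assumptions for illustration, and Stratum's actual code differs in detail:

```java
// Each node in one 64-bit long: split feature index in the upper 32 bits,
// split value (as raw float bits) in the lower 32.
public class PackedTraversal {
    static long pack(int feature, float split) {
        return ((long) feature << 32) | (Float.floatToIntBits(split) & 0xFFFFFFFFL);
    }

    static int feature(long node) { return (int) (node >>> 32); }
    static float split(long node) { return Float.intBitsToFloat((int) node); }

    // Descend a complete tree stored breadth-first in a flat array.
    // A JIT can lower the ?: below to a conditional move, so the loop
    // proceeds without data-dependent branches.
    static int leaf(long[] tree, double[] row, int depth) {
        int node = 0;
        for (int d = 0; d < depth; d++) {
            long n = tree[node];
            int cmp = row[feature(n)] >= split(n) ? 1 : 0;
            node = 2 * node + 1 + cmp; // heap-style child addressing
        }
        return node;
    }

    public static void main(String[] args) {
        // Depth-2 tree: internal nodes at indices 0..2, leaves are 3..6.
        long[] tree = { pack(0, 50f), pack(1, 10f), pack(1, 20f) };
        System.out.println(leaf(tree, new double[]{30, 5}, 2));  // 3
        System.out.println(leaf(tree, new double[]{60, 25}, 2)); // 6
    }
}
```

No pointers, no allocations per node — the whole tree is one contiguous `long[]` that fits comfortably in cache.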
&lt;p&gt;&lt;strong&gt;Multivariate by construction.&lt;/strong&gt; Every tree split randomly selects a feature. The ensemble naturally captures cross-feature interactions without the user specifying which features correlate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Unsupervised.&lt;/strong&gt; No labels needed. You don’t need a curated training set of “known fraud” — the algorithm finds whatever doesn’t fit the bulk distribution. This matters because in practice, labeled anomaly data is expensive, incomplete, and often biased toward known attack patterns.&lt;/p&gt;
&lt;h2 id=&quot;what-the-landscape-looks-like&quot;&gt;What the landscape looks like&lt;/h2&gt;
&lt;p&gt;We surveyed what major analytical databases offer for built-in anomaly detection:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DuckDB&lt;/strong&gt; has no native capability. The closest is &lt;a href=&quot;https://github.com/DataZooDE/anofox-tabular&quot;&gt;anofox-tabular&lt;/a&gt;, a third-party community extension (BSL-licensed) that adds isolation forests to DuckDB. We read through the implementation — it’s feature-rich (Extended IF, SCiForest, categorical columns, density scoring), but architecturally very different from what we built. anofox-tabular retrains the forest on every query — there’s no model persistence, so you can’t train once and score cheaply at query time. Its C++ implementation is scalar (no SIMD), single-threaded (no parallelism in build or score), and uses recursive traversal with &lt;code&gt;std::vector&lt;/code&gt; allocations at every tree node. It also copies all data from DuckDB’s columnar format into its own data structures before running. The README describes “vectorized C++17,” which likely refers to DuckDB’s general execution model rather than the isolation forest code itself. For small datasets (the test suite uses 5-51 rows), none of this matters. For scoring a million rows inline with a query, or scoring 1,000 transactions in the hot path of a payment system, the architectural choices compound. We haven’t benchmarked head-to-head, but the design differences — flat packed arrays vs. nested vectors, morsel-driven parallelism vs. single-threaded, persistent models vs. retrain-per-query — point to a substantial gap at scale.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ClickHouse&lt;/strong&gt; has &lt;code&gt;seriesOutliersDetectTukey&lt;/code&gt; — a univariate IQR method for time-series. Useful for simple threshold alerts, but it’s one column at a time, one statistical method, no learning. Cloudflare &lt;a href=&quot;https://blog.cloudflare.com/lessons-learned-from-scaling-up-cloudflare-anomaly-detection-platform/&quot;&gt;built their anomaly detection platform&lt;/a&gt; on ClickHouse but implemented the actual detection logic (HBOS) in external microservices — ClickHouse stores and aggregates the data, it doesn’t run the models.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;TimescaleDB&lt;/strong&gt; has an &lt;a href=&quot;https://github.com/timescale/timescaledb-toolkit/issues/45&quot;&gt;open issue&lt;/a&gt; proposing ARIMA and DBSCAN anomaly detection. It remains unimplemented.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PostgreSQL MADlib&lt;/strong&gt; offers in-database ML, but it’s a heavy extension that hasn’t seen active development recently.&lt;/p&gt;
&lt;p&gt;The pattern is consistent: analytical databases treat anomaly detection as somebody else’s problem. The “solution” is always to export data to a separate ML runtime.&lt;/p&gt;
&lt;h2 id=&quot;the-cost-of-exporting&quot;&gt;The cost of exporting&lt;/h2&gt;
&lt;p&gt;This isn’t just about convenience. The export-to-Python pattern has structural costs that compound in production:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Latency.&lt;/strong&gt; Serializing 1M rows from a database into Python’s heap takes seconds. Add model inference, write-back, and you’re looking at minutes for a pipeline that should be a query. For fraud detection or infrastructure monitoring, that latency window is when damage happens.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Memory duplication.&lt;/strong&gt; The data exists in the database AND in Python’s process. For large datasets, this means either paying for 2x RAM or batching with additional orchestration complexity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Operational surface area.&lt;/strong&gt; You now maintain a database AND a Python environment with scikit-learn, NumPy, and their transitive dependencies. Version pinning, compatibility testing, deployment coordination. Every additional system boundary is a place where things break.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Security perimeter.&lt;/strong&gt; Moving data out of the database means it leaves whatever access controls, encryption, and audit logging the database provides. For regulated industries, this is a compliance headache.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Lost optimization.&lt;/strong&gt; When anomaly scoring is a SQL function, the query engine can apply zone-map pruning, skip entire chunks where min/max statistics prove no rows will match downstream filters, and fuse the scoring into the execution pipeline. An external Python process sees a flat array with no metadata.&lt;/p&gt;
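To illustrate what that pushdown buys, here is a hypothetical sketch — the `Zone` record and `survivors` function are ours for illustration, not Stratum's API — of how per-chunk min/max statistics let an engine skip chunks before the model ever sees them:

```java
import java.util.List;

// Hypothetical illustration of zone-map pruning: each chunk carries min/max
// statistics, and chunks that provably cannot satisfy a filter are skipped.
public class ZoneMapPruning {
    record Zone(double minAmount, double maxAmount, int rowStart, int rowEnd) {}

    // Keep only zones that might contain rows with amount > threshold.
    static List<Zone> survivors(List<Zone> zones, double threshold) {
        return zones.stream().filter(z -> z.maxAmount() > threshold).toList();
    }

    public static void main(String[] args) {
        List<Zone> zones = List.of(
                new Zone(1, 80, 0, 9_999),        // max 80: provably no match
                new Zone(5, 450, 10_000, 19_999), // might match: scan and score
                new Zone(2, 95, 20_000, 29_999)); // provably no match
        // Only one of three chunks ever reaches the scoring function.
        System.out.println(survivors(zones, 100).size()); // 1
    }
}
```

An external process handed a flat array has no such metadata and must score every row.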
&lt;h2 id=&quot;how-it-works-in-stratum&quot;&gt;How it works in Stratum&lt;/h2&gt;
&lt;h3 id=&quot;sql-interface&quot;&gt;SQL interface&lt;/h3&gt;
&lt;p&gt;Stratum speaks the PostgreSQL wire protocol. Connect with psql, DBeaver, JDBC, or any PostgreSQL client — then train and query models entirely from SQL:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Train a model directly from SQL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; MODEL fraud_model&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  TYPE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ISOLATION_FOREST&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  OPTIONS (n_trees &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 200&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, sample_size &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 256&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, contamination &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;05&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AS&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; amount, freq &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;AS SELECT&lt;/code&gt; query defines the training data — any valid SELECT works, including WHERE filters and JOINs. Column names become the model’s feature names. Once created, the model remembers its features — you don’t need to repeat them:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Short form: model knows its features from training&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, ANOMALY_SCORE(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;fraud_model&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; score&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- All four functions support both forms&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, ANOMALY_PREDICT(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;fraud_model&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; is_anomaly &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, ANOMALY_PROBA(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;fraud_model&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; prob &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, ANOMALY_CONFIDENCE(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;fraud_model&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; conf &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Need to score on different columns, computed expressions, or join results? Use the long form with explicit arguments (mapped positionally to the model’s features):&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Explicit columns&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, ANOMALY_SCORE(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;fraud_model&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, amount, freq) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; score&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Score on expressions&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, ANOMALY_SCORE(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;fraud_model&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, amount &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 100&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;LOG&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(freq)) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; score&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Score across JOINs&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; t.&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, ANOMALY_SCORE(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;fraud_model&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;amount&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;r&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;rate&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; score&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions t &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JOIN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; rates r &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;currency&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; r&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;code&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Model management is also SQL-native:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW MODELS;                    &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- list all registered models&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;DESCRIBE MODEL fraud_model;     &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- features, hyperparameters, threshold&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DROP&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; MODEL fraud_model;         &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- remove a model&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DROP&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; MODEL &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;IF&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; EXISTS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; old_model; &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- remove only if it exists&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The anomaly functions look and compose like any other SQL expression — filter on them, aggregate them, join them.&lt;/p&gt;
&lt;h3 id=&quot;clojure-api&quot;&gt;Clojure API&lt;/h3&gt;
&lt;p&gt;For programmatic workflows — custom training pipelines, model rotation, or embedding Stratum as a library — there’s a direct Clojure API:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;require(&apos;[stratum.api :as st])

;; Your data — plain Java arrays
def amounts: double-array([10 15 12 11 14 200 13 11 300 12])
def freqs: double-array([5 6 4 5 7 1 5 4 1 6])

;; Train: 100 trees, 256 samples each, expect ~5% anomalies
def model: st/train-iforest({:from {:amount amounts, :freq freqs}, :contamination 0.05})

;; Score: double[] in [0, 1] — higher = more anomalous
st/iforest-score(model {:amount amounts, :freq freqs})

;; Binary prediction: long[] with 1 = anomaly, 0 = normal
st/iforest-predict(model {:amount amounts, :freq freqs})

;; Confidence: how much do the trees agree? [0, 1]
st/iforest-predict-confidence(model {:amount amounts, :freq freqs})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(require &apos;[stratum.api :as st])

;; Your data — plain Java arrays
(def amounts (double-array [10 15 12 11 14 200 13 11 300 12]))
(def freqs   (double-array [ 5  6  4  5  7   1  5  4   1  6]))

;; Train: 100 trees, 256 samples each, expect ~5% anomalies
(def model (st/train-iforest {:from {:amount amounts :freq freqs}
                              :contamination 0.05}))

;; Score: double[] in [0, 1] — higher = more anomalous
(st/iforest-score model {:amount amounts :freq freqs})

;; Binary prediction: long[] with 1 = anomaly, 0 = normal
(st/iforest-predict model {:amount amounts :freq freqs})

;; Confidence: how much do the trees agree? [0, 1]
(st/iforest-predict-confidence model {:amount amounts :freq freqs})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Scores integrate directly with the query engine — they’re just another column:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;def scores: st/iforest-score(model data)
st/q({:from assoc(data :score scores)
      :where [[:&gt; :score 0.7]]
      :group [:region]
      :agg [[:avg :score] [:count]]
      :having [[:&gt; :avg 0.5]]
      :order [[:avg :desc]]})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(def scores (st/iforest-score model data))
(st/q {:from   (assoc data :score scores)
       :where  [[:&gt; :score 0.7]]
       :group  [:region]
       :agg    [[:avg :score] [:count]]
       :having [[:&gt; :avg 0.5]]
       :order  [[:avg :desc]]})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h2 id=&quot;online-adaptation&quot;&gt;Online adaptation&lt;/h2&gt;
&lt;p&gt;Data distributions shift. Fraud patterns evolve. A model trained last month may not catch today’s anomalies. Retraining from scratch is wasteful when only the recent distribution has changed.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;iforest-rotate&lt;/code&gt; replaces the oldest &lt;em&gt;k&lt;/em&gt; trees with new ones trained on fresh data. The original model is unchanged — copy-on-write semantics mean you can keep the old model for comparison:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;;; Replace 10% of trees with new ones trained on this week&apos;s data
def updated-model: st/iforest-rotate(model this-week-data)

;; Score with recency bias: newer trees weighted higher
st/iforest-score-weighted(updated-model data 0.98)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;;; Replace 10% of trees with new ones trained on this week&apos;s data
(def updated-model (st/iforest-rotate model this-week-data))

;; Score with recency bias: newer trees weighted higher
(st/iforest-score-weighted updated-model data 0.98)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This is a lightweight operation — training 10 new trees on 256 samples each costs microseconds. The resulting model maintains sensitivity to historical patterns (90 original trees) while adapting to recent distribution changes (10 new trees). In our temporal evaluation with synthetic concept drift (outlier region shifting at the midpoint), the rotating model maintains AUC above 0.95 across all segments where a static model degrades to 0.75.&lt;/p&gt;
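The exact weighting scheme isn’t spelled out above, so here is one plausible reading of the decay parameter — a sketch under our own assumptions, not Stratum’s implementation — where tree weights fall off geometrically with age, letting the fresh trees pull the ensemble toward the new distribution:

```java
// Assumed weighting (illustration only): tree i of k gets weight
// decay^(k-1-i), so the newest tree carries weight 1.0 and older trees less.
public class RecencyWeightedScore {
    static double weightedScore(double[] perTreeScores, double decay) {
        double num = 0, den = 0;
        int k = perTreeScores.length;
        for (int i = 0; i < k; i++) {
            double w = Math.pow(decay, k - 1 - i);
            num += w * perTreeScores[i];
            den += w;
        }
        return num / den;
    }

    public static void main(String[] args) {
        // 90 old trees say "normal" (0.3); 10 freshly rotated trees say
        // "anomalous" (0.9) because the distribution has drifted.
        double[] scores = new double[100];
        java.util.Arrays.fill(scores, 0, 90, 0.3);
        java.util.Arrays.fill(scores, 90, 100, 0.9);
        System.out.printf("unweighted: %.3f%n", weightedScore(scores, 1.0));
        // With decay 0.98, the fresh trees' verdict counts for more.
        System.out.printf("decay 0.98: %.3f%n", weightedScore(scores, 0.98));
    }
}
```

A decay of 1.0 recovers the plain ensemble mean; anything below 1.0 biases toward recent trees without discarding the old ones.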
&lt;h2 id=&quot;performance&quot;&gt;Performance&lt;/h2&gt;
&lt;p&gt;Measured on an Intel Core Ultra 7 258V (8 cores, Lunar Lake), JDK 25, 100 trees with sample size 256:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Batch scoring (online processing)&lt;/strong&gt;&lt;/p&gt;
&lt;table style=&quot;width: 100%; border-collapse: collapse; margin: 1rem 0;&quot;&gt;
  &lt;thead&gt;
    &lt;tr style=&quot;text-align: left; border-bottom: 1px solid var(--border);&quot;&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Batch size&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Latency&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 0;&quot;&gt;Use case&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;1 row&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;6 μs&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Single transaction check&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;10 rows&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;19 μs&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Micro-batch&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;100 rows&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;163 μs&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;API batch&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;1,000 rows&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;1.6 ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Payment gateway batch&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;10,000 rows&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;16 ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Bulk ingest check&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;At 6 microseconds per row, anomaly scoring adds negligible overhead to any transaction processing pipeline. A payment gateway checking 1,000 transactions per batch stays under 2 ms — well within the latency budget that even real-time payment systems allow for fraud checks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Full-table scoring (analytics)&lt;/strong&gt;&lt;/p&gt;
&lt;table style=&quot;width: 100%; border-collapse: collapse; margin: 1rem 0;&quot;&gt;
  &lt;thead&gt;
    &lt;tr style=&quot;text-align: left; border-bottom: 1px solid var(--border);&quot;&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Operation&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;1M rows&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 0;&quot;&gt;10M rows&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Train (100 trees × 256 samples)&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;~1 ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;6 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Score (parallel, 8 cores)&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;448 ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;4.6 s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Score (single-threaded)&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;~1.7 s&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;17 s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Model memory&lt;/td&gt;
      &lt;td colspan=&quot;2&quot; style=&quot;padding: 0.5rem 0;&quot;&gt;~0.4 MB (100 trees × 511 nodes × 8 bytes)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Training is near-instant because it only reads 25,600 rows total (256 per tree), regardless of dataset size. Scoring scales linearly and parallelizes across cores with morsel-driven execution — each morsel sized to fit L1 cache for branchless tree traversal.&lt;/p&gt;
&lt;p&gt;The isolation forest validates against standard ODDS benchmark datasets (Shuttle, Http, ForestCover, Mammography, CreditCard) with AUC-ROC scores matching or exceeding scikit-learn’s implementation at equivalent hyperparameters. The benchmark suite includes a head-to-head comparison with &lt;a href=&quot;https://pyod.readthedocs.io/&quot;&gt;PyOD&lt;/a&gt; that you can run yourself: &lt;code&gt;clj -M:iforest pyod&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&quot;under-the-hood&quot;&gt;Under the hood&lt;/h2&gt;
&lt;p&gt;The tree structure is packed for cache efficiency. Each node is a single &lt;code&gt;long&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Internal nodes: split feature index (upper 32 bits) + split value as float (lower 32 bits)&lt;/li&gt;
&lt;li&gt;Leaf nodes: path length adjustment stored as &lt;code&gt;Double.doubleToRawLongBits(c(leafSize))&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Trees are contiguous in memory: &lt;code&gt;forest[tree × maxNodes + nodeIdx]&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Scoring traverses each tree with a branchless comparison — &lt;code&gt;node = 2*node + 1 + (val &gt;= splitVal ? 1 : 0)&lt;/code&gt; — no branch misprediction, no pointer chasing. The anomaly score is &lt;code&gt;2^(-E(h(x)) / c(ψ))&lt;/code&gt; where &lt;code&gt;E(h(x))&lt;/code&gt; is the mean path length across all trees and &lt;code&gt;c(ψ)&lt;/code&gt; is the expected path length of an unsuccessful BST search, normalizing scores to [0, 1].&lt;/p&gt;
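&lt;p&gt;The mechanics are easier to see in a plain-Python sketch. This is a toy model of the same layout and math - tuples instead of packed &lt;code&gt;long&lt;/code&gt;s, no SIMD - not Stratum’s actual implementation:&lt;/p&gt;

```python
import math

EULER_GAMMA = 0.5772156649

def c(n):
    """Expected path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def path_length(tree, x, max_depth=8):
    """Walk one tree stored as an implicit array (children of i at 2i+1, 2i+2).
    Nodes are (feature, split, leaf_adjust) tuples; leaves have feature=None."""
    node, depth = 0, 0
    while depth < max_depth:
        feature, split, leaf_adjust = tree[node]
        if feature is None:          # leaf: add the stored c(leaf_size) adjustment
            return depth + leaf_adjust
        # index step mirrors the engine's branchless update:
        # node = 2*node + 1 + (val >= splitVal ? 1 : 0)
        node = 2 * node + 1 + (1 if x[feature] >= split else 0)
        depth += 1
    return float(depth)

def anomaly_score(forest, x, sample_size=256):
    """score = 2^(-E(h(x)) / c(psi)); short mean paths give scores near 1."""
    mean_path = sum(path_length(t, x) for t in forest) / len(forest)
    return 2.0 ** (-mean_path / c(sample_size))
```

&lt;p&gt;A point isolated after a single split against ψ = 256 scores above 0.9 - the “few splits to isolate, therefore anomalous” intuition made literal.&lt;/p&gt;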
&lt;p&gt;Parallelism uses the same morsel-driven architecture as the rest of the query engine: the ForkJoinPool processes rows in 64K-row morsels, each morsel’s feature data fitting in L1 cache. No lock contention — each thread accumulates independently into its own score region.&lt;/p&gt;
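&lt;p&gt;The pattern, minus the JVM specifics, looks like this (a sketch using Python threads in place of the ForkJoinPool; only the morsel size is taken from the text):&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

MORSEL = 65536  # 64K rows per morsel, as in the engine

def score_all(rows, score_one, workers=8):
    """Each task scores one disjoint morsel and writes into its own
    region of the output list - no shared mutable state, no locks."""
    out = [0.0] * len(rows)

    def work(start):
        for i in range(start, min(start + MORSEL, len(rows))):
            out[i] = score_one(rows[i])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # one task per morsel; list() forces all tasks to complete
        list(pool.map(work, range(0, len(rows), MORSEL)))
    return out
```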
&lt;p&gt;The confidence metric (&lt;code&gt;predict-confidence&lt;/code&gt;) uses the coefficient of variation of per-tree path lengths. When trees agree on a point’s isolation depth, confidence is high. When they disagree — the point sits near a decision boundary — confidence is low. This gives you a principled way to triage uncertain predictions rather than trusting every score blindly.&lt;/p&gt;
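&lt;p&gt;A sketch of the idea - the final squashing of the coefficient of variation into a confidence value is our own illustrative choice here; the engine only commits to “low variation means high confidence”:&lt;/p&gt;

```python
import statistics

def predict_confidence(per_tree_paths):
    """Agreement between trees via the coefficient of variation (CV) of
    per-tree path lengths: CV = stddev / mean. Trees that agree on a
    point's isolation depth give CV near 0 and confidence near 1."""
    mean = statistics.fmean(per_tree_paths)
    if mean == 0:
        return 0.0
    cv = statistics.pstdev(per_tree_paths) / mean
    # Illustrative mapping of CV in [0, inf) down to (0, 1]; the real
    # predict-confidence may normalize differently.
    return 1.0 / (1.0 + cv)
```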
&lt;h2 id=&quot;what-this-enables&quot;&gt;What this enables&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Online payment fraud detection.&lt;/strong&gt; At 6 μs per transaction, anomaly scoring can sit directly in the payment authorization path — not as a post-hoc batch job, but as a synchronous check before the charge clears. Train on your historical transaction data, register the model, and every &lt;code&gt;SELECT&lt;/code&gt; against the transactions table can include &lt;code&gt;ANOMALY_SCORE&lt;/code&gt; inline. For batch settlement processing, 1,000 transactions score in 1.6 ms. The model stays in-process — no network hop to an external ML service, no serialization overhead, no additional point of failure in the payment critical path.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data quality monitoring.&lt;/strong&gt; Run &lt;code&gt;ANOMALY_SCORE&lt;/code&gt; over your staging table before promoting to production. Flag rows that don’t fit the historical distribution. Catch data pipeline bugs before they propagate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;IoT sensor monitoring.&lt;/strong&gt; Train on a baseline period of normal sensor readings. Score incoming data. When vibration, temperature, and power consumption are each individually normal but their &lt;em&gt;combination&lt;/em&gt; is anomalous, the isolation forest catches it — z-scores don’t.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Versioned anomaly detection.&lt;/strong&gt; Because Stratum datasets are immutable values with &lt;a href=&quot;/notes/the-git-model-for-databases&quot;&gt;copy-on-write branching&lt;/a&gt;, you can score against historical snapshots. “What would this model have flagged last quarter?” is a query, not a data engineering project.&lt;/p&gt;
&lt;h2 id=&quot;try-it-yourself&quot;&gt;Try it yourself&lt;/h2&gt;
&lt;p&gt;Start the demo server — it loads 100K taxi ride rows and a pre-trained anomaly model:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;java&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --add-modules&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; jdk.incubator.vector&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;     --enable-native-access=ALL-UNNAMED&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;     -jar&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; stratum-standalone.jar&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --demo&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Connect with any PostgreSQL client and run real anomaly queries immediately:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;psql&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -h&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; localhost&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -p&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 5432&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -U&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; stratum&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Find the most anomalous taxi rides&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; fare_amount, tip_amount, pickup_hour,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       ANOMALY_SCORE(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;taxi_anomaly&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, fare_amount, tip_amount,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;                     total_amount, passenger_count, pickup_hour) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; score&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; taxi&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ANOMALY_SCORE(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;taxi_anomaly&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, fare_amount, tip_amount,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;                    total_amount, passenger_count, pickup_hour) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;7&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; score &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 20&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Binary prediction: which rides are anomalous?&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; fare_amount, tip_amount,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       ANOMALY_PREDICT(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;taxi_anomaly&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, fare_amount, tip_amount,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;                       total_amount, passenger_count, pickup_hour) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; is_anomaly&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; taxi&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ANOMALY_PREDICT(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;taxi_anomaly&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, fare_amount, tip_amount,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;                      total_amount, passenger_count, pickup_hour) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- How confident is the model about each prediction?&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; fare_amount,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       ANOMALY_SCORE(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;taxi_anomaly&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, fare_amount, tip_amount,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;                     total_amount, passenger_count, pickup_hour) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; score,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       ANOMALY_CONFIDENCE(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;taxi_anomaly&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, fare_amount, tip_amount,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;                          total_amount, passenger_count, pickup_hour) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; confidence&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; taxi&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; score &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The demo dataset includes synthetic anomalies — high fares with zero tips late at night — that the model detects out of the box. But the model also finds natural outliers in the data: unusual combinations of fare, tip, passenger count, and hour that don’t match the bulk distribution.&lt;/p&gt;
&lt;h2 id=&quot;getting-started-with-your-own-data&quot;&gt;Getting started with your own data&lt;/h2&gt;
&lt;p&gt;Start the server (requires JDK 21+):&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;java&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --add-modules&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; jdk.incubator.vector&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;     --enable-native-access=ALL-UNNAMED&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;     -jar&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; stratum-standalone.jar&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then connect with any PostgreSQL client and do everything from SQL:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Load your data&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; transactions&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (amount &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DOUBLE PRECISION&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, freq &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;BIGINT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;hour&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; BIGINT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;INSERT INTO&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;VALUES&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;5&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;14&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;), (&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;15&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;6&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;9&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;), ...;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Or query directly from files&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; read_csv(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;/path/to/transactions.csv&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Train a model&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; MODEL fraud_model&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  TYPE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ISOLATION_FOREST&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  OPTIONS (n_trees &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 200&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, contamination &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;05&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AS&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; amount, freq, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;hour&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Score your data&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, ANOMALY_SCORE(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;fraud_model&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, amount, freq, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;hour&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; score&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; score &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For programmatic workflows, Stratum also has a Clojure API for model training, online rotation, and integration with the query engine. Add to &lt;code&gt;deps.edn&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;{:deps {org.replikativ/stratum {:mvn/version &quot;RELEASE&quot;}}}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;{:deps {org.replikativ/stratum {:mvn/version &quot;RELEASE&quot;}}}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Source and full documentation: &lt;a href=&quot;https://github.com/replikativ/stratum&quot;&gt;github.com/replikativ/stratum&lt;/a&gt;. The &lt;a href=&quot;https://github.com/replikativ/stratum/blob/main/doc/anomaly-detection.md&quot;&gt;anomaly detection guide&lt;/a&gt; has the complete API reference.&lt;/p&gt;
&lt;p&gt;Feedback welcome on &lt;a href=&quot;https://clojurians.slack.com/archives/CB7GJAN0L&quot;&gt;Clojurians #datahike&lt;/a&gt; or &lt;a href=&quot;mailto:contact@datahike.io&quot;&gt;email&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;</content:encoded></item><item><title>Memory That Collaborates</title><link>https://datahike.io/notes/collaborate-without-infrastructure/</link><guid isPermaLink="true">https://datahike.io/notes/collaborate-without-infrastructure/</guid><description>How Datahike&apos;s distributed index space lets independent processes share and join databases through storage alone.</description><pubDate>Wed, 25 Mar 2026 23:00:00 GMT</pubDate><content:encoded>&lt;div class=&quot;container page prose&quot;&gt;
&lt;h1 id=&quot;memory-that-collaborates&quot;&gt;Memory That Collaborates&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;March 2026&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;When two teams need to combine data, the usual answer is infrastructure: an ETL pipeline, an API, a message bus. Each adds latency, maintenance burden, and a new failure mode. The data moves because the systems can’t share it in place.&lt;/p&gt;
&lt;p&gt;There’s a simpler model. If your database is an immutable value in storage, then anyone who can read the storage can query it. No server to run, no API to negotiate, no data to copy. And if your query language supports multiple inputs, you can join databases from different teams in a single expression.&lt;/p&gt;
&lt;p&gt;This is how &lt;a href=&quot;https://datahike.io&quot;&gt;Datahike&lt;/a&gt; works. It isn’t a feature we bolted on - it falls out naturally from two properties fundamental to the architecture.&lt;/p&gt;
&lt;h2 id=&quot;databases-are-values&quot;&gt;Databases are values&lt;/h2&gt;
&lt;p&gt;In a traditional database, you query through a connection to a running server. The data may change between queries. The database is a service, not something you hold.&lt;/p&gt;
&lt;p&gt;Datahike inverts this. Dereference a connection (&lt;code&gt;@conn&lt;/code&gt;) and you get an immutable database value - a snapshot frozen at a specific transaction. It won’t change. Pass it to a function, hold it in a variable, hand it to another thread. Two concurrent readers holding the same snapshot always agree, without locks or coordination.&lt;/p&gt;
&lt;p&gt;This is an idea Rich Hickey introduced with &lt;a href=&quot;https://www.infoq.com/presentations/Datomic-Database-Value/&quot;&gt;Datomic&lt;/a&gt; in 2012: separate &lt;em&gt;process&lt;/em&gt; (writes, managed by a single writer) from &lt;em&gt;perception&lt;/em&gt; (reads, which are just values). The insight was that a correct implementation of perception does not require coordination.&lt;/p&gt;
&lt;p&gt;Datomic’s indices live in storage, but its transactor holds an in-memory overlay of recent index segments that haven’t been flushed yet. Readers typically need to coordinate with the transactor to get a complete, current view. The storage alone isn’t enough.&lt;/p&gt;
&lt;p&gt;Datahike removes that dependency. The writer flushes to storage on every transaction, so storage is always authoritative. Any process that can read the store sees the full, current database - no overlay, no transactor connection needed. To understand why this works, you need to see how the data is structured.&lt;/p&gt;
&lt;h2 id=&quot;trees-in-storage&quot;&gt;Trees in storage&lt;/h2&gt;
&lt;p&gt;Datahike keeps its indices in a &lt;a href=&quot;https://github.com/replikativ/persistent-sorted-set&quot;&gt;persistent sorted set&lt;/a&gt; - a B-tree variant where nodes are immutable. Every node is stored as a key-value pair in &lt;a href=&quot;https://github.com/replikativ/konserve&quot;&gt;konserve&lt;/a&gt;, which abstracts over storage backends: S3, filesystem, JDBC, IndexedDB.&lt;/p&gt;
&lt;p&gt;When a transaction adds data, Datahike doesn’t modify existing nodes. It creates new nodes for the changed path from leaf to root, while the unchanged subtrees are shared with the previous version. This is &lt;em&gt;structural sharing&lt;/em&gt; - the same technique behind Clojure’s persistent vectors and Git’s object store.&lt;/p&gt;
&lt;p&gt;A concrete example: a database with a million datoms might have a B-tree with thousands of nodes. A transaction that adds ten datoms rewrites perhaps a dozen nodes along the affected paths. The new tree root points to these new nodes and to the thousands of unchanged nodes from before. Both the old and new snapshots are valid, complete trees. They just share most of their structure.&lt;/p&gt;
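&lt;p&gt;Path copying is easy to sketch with a toy immutable binary tree (deliberately simplified - the persistent sorted set uses wide B-tree nodes - but the sharing mechanics are identical):&lt;/p&gt;

```python
class Node:
    __slots__ = ("key", "left", "right")
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def insert(node, key):
    """Return a new root. Only the root-to-leaf path is copied; every
    untouched subtree is shared, pointer-identical, with the old version."""
    if node is None:
        return Node(key)
    if key < node.key:
        return Node(node.key, insert(node.left, key), node.right)
    if key > node.key:
        return Node(node.key, node.left, insert(node.right, key))
    return node  # key already present: share the whole subtree

# Snapshot v1, then derive v2 by adding one key. v1 stays valid.
v1 = None
for k in [50, 30, 70, 20, 40, 60, 80]:
    v1 = insert(v1, k)
v2 = insert(v1, 65)
```

&lt;p&gt;After the insert, &lt;code&gt;v2&lt;/code&gt; shares the entire left subtree of &lt;code&gt;v1&lt;/code&gt; by identity; only the three nodes on the path to the new key were copied. Both snapshots remain complete, queryable trees.&lt;/p&gt;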
&lt;p&gt;The crucial property: every node is written once and never modified. Its key can be content-addressed. This means nodes can be cached aggressively, replicated independently, and read by any process that has access to the storage - without coordinating with the process that wrote them. (For more on how structural sharing, branching, and the tradeoffs work, see &lt;a href=&quot;/notes/the-git-model-for-databases&quot;&gt;The Git Model for Databases&lt;/a&gt;.)&lt;/p&gt;
&lt;h2 id=&quot;the-distributed-index-space&quot;&gt;The distributed index space&lt;/h2&gt;
&lt;p&gt;This is where it comes together.&lt;/p&gt;
&lt;p&gt;When you call &lt;code&gt;@conn&lt;/code&gt;, Datahike fetches one key from the konserve store: the &lt;strong&gt;branch head&lt;/strong&gt; (e.g. &lt;code&gt;:db&lt;/code&gt;). This returns a small map containing root pointers for each index, schema metadata, and the current transaction ID. Nothing else is loaded - the database value you receive is a lazy handle into the tree.&lt;/p&gt;
&lt;p&gt;When a query traverses the index, each node is fetched on demand from storage and cached in a local LRU. Subsequent queries hitting the same nodes pay no I/O.&lt;/p&gt;
&lt;p&gt;That’s the entire read path. No server process mediating access, no connection protocol, no port to expose. The indices live in storage, and any process that can read the storage can load the branch head, traverse the tree, and run queries. We call this the &lt;a href=&quot;https://github.com/replikativ/datahike/blob/main/doc/distributed.md&quot;&gt;distributed index space&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Two processes reading the same database fetch the same immutable nodes independently. They don’t know about each other. A writer publishes new snapshots by writing new tree nodes, then atomically updating the branch head. Readers that dereference afterward see the new snapshot. Readers holding an earlier snapshot continue undisturbed - their nodes are immutable and won’t be garbage collected while reachable.&lt;/p&gt;
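&lt;p&gt;The whole protocol can be modeled over any key-value store (function and key names here are illustrative, not Datahike’s actual API):&lt;/p&gt;

```python
# Toy model of the read/publish protocol over a shared key-value store.
# 'store' stands in for S3/filesystem/JDBC behind konserve.
store = {}
cache = {}  # per-process node cache (a real reader uses a bounded LRU)

def put_node(key, node):
    store[key] = node  # immutable: written once, never modified

def get_node(key):
    if key not in cache:       # miss: one fetch from storage
        cache[key] = store[key]
    return cache[key]          # hit: no I/O

def publish(branch, root_key, tx_id):
    # Writer flushes all new nodes first, THEN swaps the head atomically.
    store[branch] = {"root": root_key, "tx": tx_id}

def deref(branch):
    # Readers fetch one small map; tree nodes load lazily via get_node.
    return store[branch]
```

&lt;p&gt;A reader that dereferenced before a publish keeps its old head map and its old, still-reachable nodes; a reader that dereferences after sees the new root. No coordination between them is ever required.&lt;/p&gt;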
&lt;h2 id=&quot;joining-across-databases&quot;&gt;Joining across databases&lt;/h2&gt;
&lt;p&gt;Because databases are values and Datalog natively supports multiple input sources, the next step is natural: join databases from different teams, different storage backends, or different points in time - in a single query.&lt;/p&gt;
&lt;p&gt;Team A maintains a product catalog on S3. Team B maintains inventory on a separate bucket. A third team joins them without either team doing anything:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;def catalog: d/connect({:store {:backend :s3, :bucket &quot;team-a&quot;}})
def inventory: d/connect({:store {:backend :s3, :bucket &quot;team-b&quot;}})

d/q(&apos;[:find ?name ?price ?stock
      :in $cat $inv
      :where [$cat ?p :product/sku ?sku]
             [$cat ?p :product/name ?name]
             [$cat ?p :product/price ?price]
             [$inv ?i :stock/sku ?sku]
             [$inv ?i :stock/count ?stock]
             [?stock &gt; 0]]
  @catalog @inventory)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(def catalog   (d/connect {:store {:backend :s3 :bucket &quot;team-a&quot;}}))
(def inventory (d/connect {:store {:backend :s3 :bucket &quot;team-b&quot;}}))

(d/q &apos;[:find ?name ?price ?stock
       :in $cat $inv
       :where [$cat ?p :product/sku   ?sku]
              [$cat ?p :product/name  ?name]
              [$cat ?p :product/price ?price]
              [$inv ?i :stock/sku     ?sku]
              [$inv ?i :stock/count   ?stock]
              [(&gt; ?stock 0)]]
  @catalog @inventory)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Each &lt;code&gt;@&lt;/code&gt; dereference fetches a branch head from its respective S3 bucket and returns an immutable database value. The query engine joins them locally. There is no server coordinating between the two, no data copied.&lt;/p&gt;
&lt;p&gt;And because both are values, you can mix snapshots from different points in time:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;;; Last quarter&apos;s catalog crossed with current inventory
def old-catalog: d/as-of(@catalog #inst &quot;2025-11-01&quot;)

d/q(&apos;[:find ?name ?stock
      :in $cat $inv
      :where [$cat ?p :product/sku ?sku]
             [$cat ?p :product/name ?name]
             [$inv ?i :stock/sku ?sku]
             [$inv ?i :stock/count ?stock]]
  old-catalog @inventory)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;;; Last quarter&apos;s catalog crossed with current inventory
(def old-catalog (d/as-of @catalog #inst &quot;2025-11-01&quot;))

(d/q &apos;[:find ?name ?stock
       :in $cat $inv
       :where [$cat ?p :product/sku  ?sku]
              [$cat ?p :product/name ?name]
              [$inv ?i :stock/sku    ?sku]
              [$inv ?i :stock/count  ?stock]]
  old-catalog @inventory)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The old snapshot and the current one are both just values. The query engine doesn’t care when they’re from. This is useful for audits, regulatory reproducibility, and debugging: “what would this report have shown against last quarter’s data?”&lt;/p&gt;
&lt;h2 id=&quot;from-storage-to-browsers&quot;&gt;From storage to browsers&lt;/h2&gt;
&lt;p&gt;So far, “storage” has meant S3 or a filesystem. But konserve also has an IndexedDB backend, which means the same model works in a browser. Using &lt;a href=&quot;https://github.com/replikativ/kabel&quot;&gt;Kabel&lt;/a&gt; WebSocket sync and &lt;a href=&quot;https://github.com/replikativ/konserve-sync&quot;&gt;konserve-sync&lt;/a&gt;, a browser client replicates a database locally into IndexedDB. Queries run against the local replica with zero network round-trips. Updates sync differentially - only changed tree nodes are transmitted; the same structural sharing that makes snapshots cheap on the server makes sync cheap over the wire.&lt;/p&gt;
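&lt;p&gt;The differential step reduces to a set difference over node addresses. A toy Python model (invented names; not konserve-sync’s actual protocol):&lt;/p&gt;

```python
# Toy model: content addressing lets client and server negotiate a sync
# by exchanging addresses, then shipping only the nodes the client lacks.
# After one commit, the server has a new root (n5) and one new leaf (n4);
# the unchanged subtrees (n2, n3) are shared with the previous version.
server_nodes = {"n5": "root-v2", "n2": "left", "n3": "right", "n4": "new-leaf"}
client_nodes = {"n1": "root-v1", "n2": "left", "n3": "right"}  # earlier replica

missing = set(server_nodes) - set(client_nodes)  # negotiated by address only
payload = {addr: server_nodes[addr] for addr in missing}
client_nodes.update(payload)  # the old root n1 lingers until garbage collection

assert missing == {"n4", "n5"}  # only the changed path crosses the wire
assert all(addr in client_nodes for addr in server_nodes)
```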
&lt;h2 id=&quot;try-it&quot;&gt;Try it&lt;/h2&gt;
&lt;p&gt;A complete cross-database join, runnable in a Clojure REPL:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;require(&apos;[datahike.api :as d])

;; Two independent databases
def catalog-cfg: {:store {:backend :memory, :id java.util.UUID/randomUUID()}, :schema-flexibility :read}
def inventory-cfg: {:store {:backend :memory, :id java.util.UUID/randomUUID()}, :schema-flexibility :read}

d/create-database(catalog-cfg)
d/create-database(inventory-cfg)

def catalog: d/connect(catalog-cfg)
def inventory: d/connect(inventory-cfg)

;; Team A: products
d/transact(catalog
  [{:product/sku &quot;W001&quot;, :product/name &quot;Widget&quot;, :product/price 9.99}
   {:product/sku &quot;G002&quot;, :product/name &quot;Gadget&quot;, :product/price 24.5}
   {:product/sku &quot;T003&quot;, :product/name &quot;Thingamajig&quot;, :product/price 3.75}])

;; Team B: stock levels
d/transact(inventory
  [{:stock/sku &quot;W001&quot;, :stock/count 140}
   {:stock/sku &quot;G002&quot;, :stock/count 0}
   {:stock/sku &quot;T003&quot;, :stock/count 58}])

;; Join: in-stock products with price
d/q(&apos;[:find ?name ?price ?stock
      :in $cat $inv
      :where [$cat ?p :product/sku ?sku]
             [$cat ?p :product/name ?name]
             [$cat ?p :product/price ?price]
             [$inv ?i :stock/sku ?sku]
             [$inv ?i :stock/count ?stock]
             [?stock &gt; 0]]
  @catalog @inventory)
;; =&gt; #{[&quot;Widget&quot; 9.99 140] [&quot;Thingamajig&quot; 3.75 58]}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(require &apos;[datahike.api :as d])

;; Two independent databases
(def catalog-cfg  {:store {:backend :memory
                           :id (java.util.UUID/randomUUID)}
                   :schema-flexibility :read})
(def inventory-cfg {:store {:backend :memory
                            :id (java.util.UUID/randomUUID)}
                    :schema-flexibility :read})

(d/create-database catalog-cfg)
(d/create-database inventory-cfg)

(def catalog  (d/connect catalog-cfg))
(def inventory (d/connect inventory-cfg))

;; Team A: products
(d/transact catalog
  [{:product/sku &quot;W001&quot; :product/name &quot;Widget&quot;      :product/price 9.99}
   {:product/sku &quot;G002&quot; :product/name &quot;Gadget&quot;      :product/price 24.50}
   {:product/sku &quot;T003&quot; :product/name &quot;Thingamajig&quot; :product/price 3.75}])

;; Team B: stock levels
(d/transact inventory
  [{:stock/sku &quot;W001&quot; :stock/count 140}
   {:stock/sku &quot;G002&quot; :stock/count 0}
   {:stock/sku &quot;T003&quot; :stock/count 58}])

;; Join: in-stock products with price
(d/q &apos;[:find ?name ?price ?stock
       :in $cat $inv
       :where [$cat ?p :product/sku   ?sku]
              [$cat ?p :product/name  ?name]
              [$cat ?p :product/price ?price]
              [$inv ?i :stock/sku     ?sku]
              [$inv ?i :stock/count   ?stock]
              [(&gt; ?stock 0)]]
  @catalog @inventory)
;; =&gt; #{[&quot;Widget&quot; 9.99 140] [&quot;Thingamajig&quot; 3.75 58]}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Replace &lt;code&gt;:memory&lt;/code&gt; with &lt;code&gt;:s3&lt;/code&gt;, &lt;code&gt;:file&lt;/code&gt;, or &lt;code&gt;:jdbc&lt;/code&gt; and the same code works across storage backends. The databases don’t need to share a backend - join an S3 database against a local file store in the same query.&lt;/p&gt;
&lt;/div&gt;</content:encoded></item><item><title>Datahike Speaks Postgres</title><link>https://datahike.io/notes/datahike-speaks-postgres/</link><guid isPermaLink="true">https://datahike.io/notes/datahike-speaks-postgres/</guid><description>pg-datahike beta — pgwire access to Datahike. ORMs, migrations, and psql work, with branches, time-travel, and immutable snapshots underneath.</description><pubDate>Thu, 07 May 2026 00:00:00 GMT</pubDate><content:encoded>&lt;div class=&quot;container page prose&quot;&gt;
&lt;h1 id=&quot;datahike-speaks-postgres&quot;&gt;Datahike Speaks Postgres&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;May 2026&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Open psql. Connect. Run a query. Switch branches. Run it again — same connection, same wire protocol, different version of the database.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$ psql postgresql:&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;//&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;localhost:&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;5432&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;/&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; widget;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; count&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-------&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  4218&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;branch&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;pricing-experiment&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; widget;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; count&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-------&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  4221&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; RESET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;branch&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;RESET&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That’s not a feature toggle on a Postgres replica. It’s the same database — addressed through standard pgwire — viewed through two different commits. The implementation is &lt;a href=&quot;https://github.com/replikativ/pg-datahike&quot;&gt;pg-datahike&lt;/a&gt;, a beta we’re shipping today.&lt;/p&gt;
&lt;h2 id=&quot;what-it-is&quot;&gt;What it is&lt;/h2&gt;
&lt;p&gt;pg-datahike embeds a PostgreSQL-compatible adapter inside a Datahike process: wire protocol, SQL translator, virtual &lt;code&gt;pg_*&lt;/code&gt; and &lt;code&gt;information_schema&lt;/code&gt; catalogs, constraint enforcement, schema hints. Clients that speak Postgres talk to Datahike without a Postgres install — pgjdbc, Hibernate, SQLAlchemy, Odoo 19, and Metabase bootstrap unmodified against it. The migration path is round-trippable: &lt;code&gt;pg_dump&lt;/code&gt; output replays into pg-datahike via &lt;code&gt;psql&lt;/code&gt;, and the standalone jar dumps Datahike databases back out as portable PG SQL. Detailed test results at the end of this post.&lt;/p&gt;
&lt;h2 id=&quot;a-60-second-tour&quot;&gt;A 60-second tour&lt;/h2&gt;
&lt;p&gt;The operator runs one jar. Everything else is &lt;code&gt;psql&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;$&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; java&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -jar&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; pg-datahike-VERSION-standalone.jar&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;pg-datahike&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; VERSION&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; ready&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; on&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; 127.0.0.1:5432&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;  backend:&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  file&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (~/.local/share/pg-datahike)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  history&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;:&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  off&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;  CREATE&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; DATABASE:&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  enabled&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;  databases:&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; [&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;datahike&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;Connect&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; with:&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; psql&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -h&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 127.0.0.1&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -p&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 5432&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -U&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; datahike&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;Press&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; Ctrl+C&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; to&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; stop.&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;JDK 17+ is the only prerequisite; the jar is on &lt;a href=&quot;https://github.com/replikativ/pg-datahike/releases&quot;&gt;GitHub releases&lt;/a&gt;. &lt;code&gt;--memory&lt;/code&gt; for an ephemeral run; &lt;code&gt;--help&lt;/code&gt; covers the rest.&lt;/p&gt;
&lt;p&gt;The rest is &lt;code&gt;psql&lt;/code&gt; — provision a fresh database, populate it, pin a session to a historical commit, drop it.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$ psql postgresql:&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;//&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;localhost:&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;5432&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;/&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;datahike&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;datahike&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DATABASE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; inventory;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DATABASE&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;datahike&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; \c inventory&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;You are &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; connected &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;to&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; database&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;inventory&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; widget (sku &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;TEXT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; PRIMARY KEY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;weight&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; INT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; INSERT INTO&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; widget &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;VALUES&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;A&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;), (&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;B&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;20&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;INSERT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 2&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;commit_id&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;                commit_id&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;---------------------------------------&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; b4f2e1c0&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;2feb&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;5b61&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;be14&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;5590b9e01e48      ← &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;copy&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; this&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; INSERT INTO&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; widget &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;VALUES&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;C&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;30&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;INSERT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; widget;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; count&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-------&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;     3&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;commit_id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;b4f2e1c0-2feb-5b61-be14-5590b9e01e48&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; widget;     &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- the database before the third insert&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; count&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-------&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;     2&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; RESET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;commit_id&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;RESET&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;inventory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; \c datahike&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;datahike&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DROP&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DATABASE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; inventory;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DROP&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DATABASE&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;SET datahike.commit_id&lt;/code&gt; pins the session to a historical commit; everything else is plain Postgres. Sixty seconds, one jar, no Postgres install, no Clojure.&lt;/p&gt;
&lt;h2 id=&quot;architecture-in-one-minute&quot;&gt;Architecture in one minute&lt;/h2&gt;
&lt;p&gt;What happens when you &lt;code&gt;SET datahike.branch = &apos;feature&apos;&lt;/code&gt;?&lt;/p&gt;
&lt;p&gt;Datahike stores its database as a tree of immutable nodes in &lt;a href=&quot;https://github.com/replikativ/konserve&quot;&gt;konserve&lt;/a&gt;, a key-value abstraction over filesystems, S3, JDBC, IndexedDB, and others. Every transaction writes new nodes for changed paths and shares unchanged subtrees with the previous version — the trick behind Clojure’s persistent vectors and Git’s object store. A commit is a small map listing the root pointers for each index; a branch is a named pointer at a commit.&lt;/p&gt;
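&lt;p&gt;A minimal path-copying sketch of that structural sharing (a Python illustration of the idea, not Datahike’s persistent index implementation):&lt;/p&gt;

```python
def assoc_in(node, path, value):
    """Return a new tree with value at path; untouched subtrees are shared."""
    if not path:
        return value
    new = dict(node)  # copy only the node on the changed path
    new[path[0]] = assoc_in(node.get(path[0], {}), path[1:], value)
    return new

# Two index roots; this "transaction" rewrites one path in one index
# (a real transaction would update every index; one suffices to show sharing).
v1 = {"eavt": {"e1": {"price": 9.99}}, "aevt": {"price": {"e1": 9.99}}}
v2 = assoc_in(v1, ["eavt", "e1", "price"], 12.50)

assert v1["eavt"]["e1"]["price"] == 9.99   # the old commit is intact
assert v2["eavt"]["e1"]["price"] == 12.50  # the new commit sees the change
assert v2["aevt"] is v1["aevt"]            # the unchanged index root is shared
```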
&lt;p&gt;So on &lt;code&gt;SET datahike.branch = &apos;feature&apos;&lt;/code&gt;, the handler updates a session variable, and the next query loads that branch’s commit pointer from konserve, walks the tree, returns rows. No coordination with a transactor; storage is the source of truth. &lt;code&gt;SET datahike.commit_id = &apos;&amp;#x3C;uuid&gt;&apos;&lt;/code&gt; works the same way one level deeper — the session points at a specific commit instead of a branch head.&lt;/p&gt;
&lt;p&gt;Two consequences worth flagging:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Branching is one konserve write.&lt;/strong&gt; Creating a branch from any commit is constant time, regardless of database size, because structural sharing means the new branch points at existing nodes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reads don’t go through a transactor.&lt;/strong&gt; Every node is content-addressable; any process that can read the storage can run queries against it. In principle, read fanout is bounded by storage bandwidth, not replica capacity — we’ll publish numbers in a follow-up. See &lt;a href=&quot;/notes/collaborate-without-infrastructure&quot;&gt;Memory That Collaborates&lt;/a&gt; for more.&lt;/li&gt;
&lt;/ul&gt;
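&lt;p&gt;Both points fall out of commits being small pointer maps. A hypothetical sketch of the session resolution (all names invented for illustration; not pg-datahike’s internals):&lt;/p&gt;

```python
# Commits are small maps of index-root pointers; branches are named pointers.
commits = {
    "c1": {"eavt": "n1", "aevt": "n2"},
    "c2": {"eavt": "n3", "aevt": "n2"},  # shares an unchanged root with c1
}
branches = {"main": "c2"}

session = {"branch": "main", "commit_id": None}  # what SET datahike.* mutates

def resolve_roots(session):
    commit = session["commit_id"] or branches[session["branch"]]
    return commits[commit]

assert resolve_roots(session) == commits["c2"]  # follows the branch head

branches["feature"] = "c1"       # creating a branch: one pointer write
session["branch"] = "feature"
assert resolve_roots(session) == commits["c1"]

session["commit_id"] = "c2"      # pinning a commit overrides the branch
assert resolve_roots(session) == commits["c2"]
```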
&lt;h2 id=&quot;integration-patterns&quot;&gt;Integration patterns&lt;/h2&gt;
&lt;h3 id=&quot;1-multi-database-server&quot;&gt;1. Multi-database server&lt;/h3&gt;
&lt;p&gt;A single &lt;code&gt;start-server&lt;/code&gt; call serves many Datahike connections. Clients route on the JDBC URL’s database name:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;pg/start-server({&quot;prod&quot; prod-conn,
                 &quot;staging&quot; staging-conn,
                 &quot;reports&quot; reports-conn}
  {:port 5432})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(pg/start-server {&quot;prod&quot;    prod-conn
                  &quot;staging&quot; staging-conn
                  &quot;reports&quot; reports-conn}
                 {:port 5432})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Same shape on the standalone jar with repeatable &lt;code&gt;--db&lt;/code&gt; flags: &lt;code&gt;java -jar pg-datahike.jar --db prod --db staging --db reports&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;jdbc:postgresql://localhost:5432/prod      → prod-conn&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;jdbc:postgresql://localhost:5432/staging   → staging-conn&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;jdbc:postgresql://localhost:5432/nonsuch   → 3D000 invalid_catalog_name&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;SELECT current_database()&lt;/code&gt; returns the connected name; &lt;code&gt;pg_database&lt;/code&gt; enumerates the registry. Useful for multi-tenant deployments, or when ops wants one pgwire endpoint serving many independent stores.&lt;/p&gt;
&lt;h3 id=&quot;2-schema-hints&quot;&gt;2. Schema hints&lt;/h3&gt;
&lt;p&gt;Existing Datahike schemas don’t always look the way you’d want them to over SQL. &lt;code&gt;:datahike.pg/*&lt;/code&gt; meta-attributes customize the SQL view without touching the underlying schema:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;pg/set-hint!(conn :person/full_name {:column &quot;name&quot;})
pg/set-hint!(conn :person/ssn {:hidden true})
pg/set-hint!(conn :person/company {:references :company/id})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(pg/set-hint! conn :person/full_name {:column &quot;name&quot;})           ; rename the column
(pg/set-hint! conn :person/ssn       {:hidden true})             ; exclude from SQL
(pg/set-hint! conn :person/company   {:references :company/id})  ; FK target&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;After &lt;code&gt;set-hint!&lt;/code&gt;, &lt;code&gt;SELECT name FROM person&lt;/code&gt; works, &lt;code&gt;ssn&lt;/code&gt; is invisible to &lt;code&gt;SELECT *&lt;/code&gt; and &lt;code&gt;information_schema.columns&lt;/code&gt;, and &lt;code&gt;JOIN company c ON p.company = c.id&lt;/code&gt; resolves on Datahike’s native ref semantics.&lt;/p&gt;
&lt;h3 id=&quot;3-time-travel-via-set&quot;&gt;3. Time-travel via SET&lt;/h3&gt;
&lt;p&gt;Datahike’s temporal primitives are exposed as session variables. The client doesn’t need to know what &lt;code&gt;as-of&lt;/code&gt; means — it just sets a variable:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;as_of&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;   =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;2024-01-15T00:00:00Z&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;  &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- d/as-of&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;since&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;   =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;2024-01-01T00:00:00Z&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;  &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- d/since&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;history&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;true&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;                  &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- d/history&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;RESET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;as_of&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Every subsequent query in the session sees the chosen view. A reporting tool that doesn’t know about Datahike can produce point-in-time reports by setting one variable.&lt;/p&gt;
&lt;h3 id=&quot;4-git-like-branching&quot;&gt;4. Git-like branching&lt;/h3&gt;
&lt;p&gt;Branching is cheap in Datahike: every transaction produces a new immutable commit, so a branch is just a named pointer at a commit UUID. Creation is O(1) — one konserve write, no data copy, no WAL replay. pgwire exposes the read side and the admin operations through standard PG mechanisms:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Introspect&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;branches&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;current_branch&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;commit_id&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Admin (konserve-level writes — they don&apos;t go through the tx writer)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;create_branch&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;preview&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;db&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);     &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- &apos;db&apos; is Datahike&apos;s default branch name&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;create_branch&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;from-cid&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;69ea6ee1-…&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;delete_branch&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;preview&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Session view: three cuts on the same immutable log.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- They compose — a feature branch&apos;s state as of yesterday is two SETs.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;branch&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;feature&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;commit_id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;69ea6ee1-2feb-5b61-be14-5590b9e01e48&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;as_of&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;     =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;2024-01-15T00:00:00Z&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or pin a branch at connect time via the JDBC URL:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;jdbc:postgresql://localhost:5432/prod:feature   → prod-conn, pinned to :feature&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;jdbc:postgresql://localhost:5432/prod           → prod-conn, default branch&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
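&lt;p&gt;The routing convention is simple enough to sketch client-side. A hedged illustration — &lt;code&gt;parse_database_segment&lt;/code&gt; is a hypothetical helper showing the convention, not pg-datahike&amp;#x27;s actual parser:&lt;/p&gt;

```python
# Hedged sketch of the /db:branch routing convention shown above.
# parse_database_segment is an illustrative helper, not pg-datahike's parser.

def parse_database_segment(jdbc_url):
    """Return (database, branch) from a pg-datahike JDBC URL."""
    segment = jdbc_url.rsplit("/", 1)[-1]  # database segment after the last '/'
    if ":" in segment:                     # a ':' pins the session to a branch
        db, branch = segment.split(":", 1)
        return db, branch
    return segment, None                   # no pin -> server's default branch

print(parse_database_segment("jdbc:postgresql://localhost:5432/prod:feature"))
print(parse_database_segment("jdbc:postgresql://localhost:5432/prod"))
```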
&lt;p&gt;&lt;code&gt;SET datahike.commit_id = &apos;&amp;#x3C;uuid&gt;&apos;&lt;/code&gt; is Datahike-unique: no other PG-compatible database lets a session pin to an exact commit identifier.&lt;/p&gt;
&lt;p&gt;We’ll cover the structural-sharing model that makes branching this cheap in a follow-up post — including how it works across all the Datahike bindings, not just pgwire.&lt;/p&gt;
&lt;h3 id=&quot;5-sql-driven-database-provisioning&quot;&gt;5. SQL-driven database provisioning&lt;/h3&gt;
&lt;p&gt;Set a &lt;code&gt;:database-template&lt;/code&gt; on the server and pgwire clients self-provision and tear down databases over plain SQL. The template is a partial Datahike config; each &lt;code&gt;CREATE DATABASE&lt;/code&gt; produces a fresh store with a generated UUID:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;pg/start-server({&quot;datahike&quot; boot-conn}
  {:port 5432 :database-template {:store {:backend :memory} :schema-flexibility :write :keep-history? true}})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(pg/start-server {&quot;datahike&quot; boot-conn}
                 {:port 5432
                  :database-template {:store {:backend :memory}
                                      :schema-flexibility :write
                                      :keep-history? true}})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;code&gt;WITH&lt;/code&gt; clauses override the template per-database, and the SQL surface accepts both standard PG forms:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DATABASE&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; myapp&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;                              &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- inherits the template&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DATABASE&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; histdb&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; WITH&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; KEEP_HISTORY &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; true;    &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- override per database&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DATABASE&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; memdb&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  WITH&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (BACKEND &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;memory&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,    &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Yugabyte-style paren form&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;                             INDEX&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;   =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;persistent-set&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DROP&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DATABASE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; myapp;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DROP&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DATABASE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; IF&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; EXISTS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; old_one;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Accepted &lt;code&gt;WITH&lt;/code&gt; keys map case-insensitively to Datahike config:&lt;/p&gt;
&lt;table style=&quot;width: 100%; border-collapse: collapse; margin: 1.5rem 0;&quot;&gt;
  &lt;thead&gt;
    &lt;tr style=&quot;text-align: left; border-bottom: 1px solid var(--border);&quot;&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;&lt;code&gt;WITH&lt;/code&gt; option&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0; white-space: nowrap;&quot;&gt;Datahike config&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 0;&quot;&gt;Notes&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;&lt;code&gt;BACKEND&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; white-space: nowrap;&quot;&gt;&lt;code&gt;[:store :backend]&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;&apos;memory&apos;&lt;/code&gt;, &lt;code&gt;&apos;file&apos;&lt;/code&gt; built-in; &lt;code&gt;&apos;jdbc&apos;&lt;/code&gt;, &lt;code&gt;&apos;s3&apos;&lt;/code&gt;, &lt;code&gt;&apos;redis&apos;&lt;/code&gt;, &lt;code&gt;&apos;lmdb&apos;&lt;/code&gt;, &lt;code&gt;&apos;rocksdb&apos;&lt;/code&gt;, &lt;code&gt;&apos;dynamodb&apos;&lt;/code&gt; via external konserve libraries&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;&lt;code&gt;STORE_ID&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; white-space: nowrap;&quot;&gt;&lt;code&gt;[:store :id]&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Defaults to a fresh UUID per &lt;code&gt;CREATE&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;&lt;code&gt;PATH&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; white-space: nowrap;&quot;&gt;&lt;code&gt;[:store :path]&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;File backend; &lt;code&gt;{{name}}&lt;/code&gt; interpolation supported&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;&lt;code&gt;HOST&lt;/code&gt; / &lt;code&gt;PORT&lt;/code&gt; / &lt;code&gt;USER&lt;/code&gt; / &lt;code&gt;PASSWORD&lt;/code&gt; / &lt;code&gt;DBNAME&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; white-space: nowrap;&quot;&gt;&lt;code&gt;[:store :*]&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;jdbc&lt;/code&gt; / &lt;code&gt;redis&lt;/code&gt; backends&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;&lt;code&gt;SCHEMA_FLEXIBILITY&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; white-space: nowrap;&quot;&gt;&lt;code&gt;:schema-flexibility&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;&apos;read&apos;&lt;/code&gt; or &lt;code&gt;&apos;write&apos;&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;&lt;code&gt;KEEP_HISTORY&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; white-space: nowrap;&quot;&gt;&lt;code&gt;:keep-history?&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;&lt;code&gt;INDEX&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; white-space: nowrap;&quot;&gt;&lt;code&gt;:index&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;&apos;persistent-set&apos;&lt;/code&gt; → &lt;code&gt;:datahike.index/persistent-set&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;&lt;code&gt;OWNER&lt;/code&gt; / &lt;code&gt;TEMPLATE&lt;/code&gt; / &lt;code&gt;ENCODING&lt;/code&gt; / &lt;code&gt;LOCALE&lt;/code&gt; / &lt;code&gt;TABLESPACE&lt;/code&gt; / …&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; white-space: nowrap;&quot;&gt;—&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Postgres-only; silently accepted with a NOTICE so &lt;code&gt;pg_dump&lt;/code&gt; round-trips work&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
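&lt;p&gt;The mapping in the table can be sketched as a small fold — a hedged illustration only (&lt;code&gt;with_to_config&lt;/code&gt; and the dict shapes are assumptions, not pg-datahike&amp;#x27;s resolver), including the case-insensitive keys and the &lt;code&gt;{{name}}&lt;/code&gt; interpolation the &lt;code&gt;PATH&lt;/code&gt; row mentions:&lt;/p&gt;

```python
# Illustrative sketch of the WITH-option -> Datahike config mapping table above.
# with_to_config is a hypothetical helper; key coverage follows the table.

WITH_KEY_PATHS = {
    "backend":            ("store", "backend"),
    "store_id":           ("store", "id"),
    "path":               ("store", "path"),
    "schema_flexibility": ("schema-flexibility",),
    "keep_history":       ("keep-history?",),
    "index":              ("index",),
}

def with_to_config(db_name, options):
    """Fold CREATE DATABASE ... WITH options into a nested config dict."""
    config = {}
    for key, value in options.items():
        path = WITH_KEY_PATHS.get(key.lower())  # keys match case-insensitively
        if path is None:
            continue  # PG-only options (OWNER, ENCODING, ...) accepted as no-ops
        if key.lower() == "path":
            value = value.replace("{{name}}", db_name)  # per-database interpolation
        node = config
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return config

print(with_to_config("histdb", {"KEEP_HISTORY": True, "PATH": "/data/{{name}}"}))
```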
&lt;p&gt;The standalone jar enables this by default (use &lt;code&gt;--no-create-database&lt;/code&gt; to disable). Embedded servers opt in via &lt;code&gt;:database-template&lt;/code&gt; (or explicit &lt;code&gt;:on-create-database&lt;/code&gt; / &lt;code&gt;:on-delete-database&lt;/code&gt; hooks). Without a template or hooks, &lt;code&gt;CREATE&lt;/code&gt; / &lt;code&gt;DROP DATABASE&lt;/code&gt; return SQLSTATE &lt;code&gt;0A000 feature_not_supported&lt;/code&gt;; mismatched preconditions return the standard PG SQLSTATEs.&lt;/p&gt;

&lt;h2 id=&quot;migrating-from-postgresql&quot;&gt;Migrating from PostgreSQL&lt;/h2&gt;
&lt;p&gt;Wire compatibility extends to &lt;code&gt;pg_dump&lt;/code&gt; SQL on both sides. Three workflows.&lt;/p&gt;
&lt;h3 id=&quot;real-postgresql--pg-datahike&quot;&gt;Real PostgreSQL → pg-datahike&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;pg_dump&lt;/code&gt; output replays straight into pg-datahike via &lt;code&gt;psql&lt;/code&gt; or any JDBC client. Schema-side coverage: &lt;code&gt;CREATE TABLE&lt;/code&gt; with FK constraints, &lt;code&gt;CREATE SEQUENCE&lt;/code&gt;, &lt;code&gt;DEFAULT nextval(…)&lt;/code&gt;, &lt;code&gt;CREATE TYPE … AS ENUM&lt;/code&gt;, &lt;code&gt;CREATE DOMAIN&lt;/code&gt;, partitioned tables. Data-side: &lt;code&gt;INSERT&lt;/code&gt; (single + multi-&lt;code&gt;VALUES&lt;/code&gt;) and &lt;code&gt;COPY … FROM stdin&lt;/code&gt; (text and CSV).&lt;/p&gt;
&lt;p&gt;Run with the &lt;code&gt;:pg-dump&lt;/code&gt; compat preset to silently accept constructs &lt;code&gt;pg-datahike&lt;/code&gt; doesn’t model — triggers, functions, materialized views, &lt;code&gt;ALTER OWNER&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;java&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -jar&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; pg-datahike.jar&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --compat&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; pg-dump&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;psql&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -h&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; localhost&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -p&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 5432&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -U&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -d&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; datahike&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -f&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; my_pg_dump.sql&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Validated end-to-end against &lt;a href=&quot;https://github.com/lerocha/chinook-database&quot;&gt;Chinook&lt;/a&gt; (15.6k rows, 11 tables, FKs, NUMERIC, TIMESTAMP) — full byte-identical bidirectional roundtrip — and &lt;a href=&quot;https://github.com/devrimgunduz/pagila&quot;&gt;Pagila&lt;/a&gt; (50k rows, 22 tables, ENUM, DOMAIN, partitioning, triggers, functions) — schema parses end-to-end, data loads.&lt;/p&gt;
&lt;h3 id=&quot;pg-datahike--portable-pg-sql&quot;&gt;pg-datahike → portable PG SQL&lt;/h3&gt;
&lt;p&gt;The standalone jar’s &lt;code&gt;dump&lt;/code&gt; subcommand walks a Datahike database and emits &lt;code&gt;pg_dump&lt;/code&gt;-shaped SQL. The output replays into either pg-datahike or real PostgreSQL via &lt;code&gt;psql&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;java&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -jar&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; pg-datahike.jar&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; dump&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --data-dir&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; DIR&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --db&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; NAME&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --out&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; out.sql&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;java&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -jar&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; pg-datahike.jar&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; dump&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --config&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; datahike-config.edn&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --copy&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Flags cover INSERT-vs-COPY output, schema-only / data-only, and table exclusion. &lt;code&gt;--config&lt;/code&gt; accepts a full Datahike config EDN, so any konserve backend works; store-id is auto-discovered.&lt;/p&gt;
&lt;h3 id=&quot;what-the-resulting-datahike-schema-looks-like&quot;&gt;What the resulting Datahike schema looks like&lt;/h3&gt;
&lt;p&gt;A native Datahike database — created with &lt;code&gt;d/transact&lt;/code&gt;, never touched by SQL — also dumps as clean PG SQL. The inverse mapping is well-defined:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;:db.unique/identity&lt;/code&gt; → &lt;code&gt;PRIMARY KEY NOT NULL&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;:db.unique/value&lt;/code&gt; → &lt;code&gt;UNIQUE&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;:db.cardinality/many T&lt;/code&gt; → &lt;code&gt;T[]&lt;/code&gt; with PG array literals&lt;/li&gt;
&lt;li&gt;&lt;code&gt;:db.type/ref&lt;/code&gt; → &lt;code&gt;bigint&lt;/code&gt; (the entity id; opt in to FK constraints with &lt;code&gt;set-hint! :references&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
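&lt;p&gt;The bullet mapping above can be sketched as a tiny translation function — a hedged, deliberately simplified illustration (&lt;code&gt;column_ddl&lt;/code&gt; is a hypothetical helper; the jar&amp;#x27;s actual &lt;code&gt;dump&lt;/code&gt; code handles many more cases and real type mapping):&lt;/p&gt;

```python
# Sketch of the Datahike-attribute -> SQL column mapping listed above.
# column_ddl is illustrative; types other than the default are out of scope here.

def column_ddl(name, schema, sql_type="text"):
    """Render one attribute's schema map as a SQL column clause."""
    if schema.get(":db/valueType") == ":db.type/ref":
        sql_type = "bigint"  # refs dump as the entity id
    if schema.get(":db/cardinality") == ":db.cardinality/many":
        sql_type += "[]"     # cardinality-many becomes a PG array
    clause = f"{name} {sql_type}"
    unique = schema.get(":db/unique")
    if unique == ":db.unique/identity":
        clause += " PRIMARY KEY NOT NULL"
    elif unique == ":db.unique/value":
        clause += " UNIQUE"
    return clause

print(column_ddl("email", {":db/unique": ":db.unique/identity"}))
print(column_ddl("company", {":db/valueType": ":db.type/ref"}))
```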
&lt;p&gt;So whether you start from a real PostgreSQL dump or from native Datahike, both sides translate cleanly through the same shape. The resulting schema is correct and queryable as both SQL relations and Datalog datoms. It isn’t always what you’d hand-design for entity-shaped Datalog queries — many apps stay with the relational shape, others evolve incrementally as they reach for Datalog’s strengths (pull patterns, rules, multi-source joins).&lt;/p&gt;
&lt;h2 id=&quot;what-it-isnt&quot;&gt;What it isn’t&lt;/h2&gt;
&lt;p&gt;This is a 0.1 beta and we want to be specific about the gaps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;PL/pgSQL, stored functions, triggers, rules, and materialized views are accepted under the &lt;code&gt;:pg-dump&lt;/code&gt; compat preset (loaded but not executed); strict mode rejects them&lt;/li&gt;
&lt;li&gt;No &lt;code&gt;LISTEN&lt;/code&gt; / &lt;code&gt;NOTIFY&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;No &lt;code&gt;COPY … TO STDOUT&lt;/code&gt; (&lt;code&gt;COPY … FROM stdin&lt;/code&gt; is supported in text and CSV formats)&lt;/li&gt;
&lt;li&gt;FK &lt;code&gt;ON DELETE&lt;/code&gt; enforced for &lt;code&gt;NO ACTION&lt;/code&gt; / &lt;code&gt;RESTRICT&lt;/code&gt; / &lt;code&gt;CASCADE&lt;/code&gt;; &lt;code&gt;SET NULL&lt;/code&gt; / &lt;code&gt;SET DEFAULT&lt;/code&gt; and any &lt;code&gt;ON UPDATE&lt;/code&gt; action are rejected at DDL&lt;/li&gt;
&lt;li&gt;Single &lt;code&gt;public&lt;/code&gt; schema — &lt;code&gt;CREATE SCHEMA&lt;/code&gt; is silently accepted but a no-op&lt;/li&gt;
&lt;li&gt;Cursor materialization is eager (entire result set held in memory)&lt;/li&gt;
&lt;li&gt;No deferrable constraints&lt;/li&gt;
&lt;li&gt;Generated columns parse but aren’t enforced&lt;/li&gt;
&lt;li&gt;Writes always land on the connection’s default branch in 0.1, even when &lt;code&gt;SET datahike.branch&lt;/code&gt; is active. Reads respect the pinned branch; writes don’t yet. Use &lt;code&gt;datahike.versioning/branch!&lt;/code&gt; and &lt;code&gt;merge!&lt;/code&gt; from Clojure for branch-targeted writes, or open a second connection on &lt;code&gt;/&amp;#x3C;db&gt;:&amp;#x3C;branch&gt;&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Constraint enforcement is one-directional. SQL constraints declared via DDL (&lt;code&gt;NOT NULL&lt;/code&gt;, &lt;code&gt;CHECK&lt;/code&gt;, &lt;code&gt;UNIQUE&lt;/code&gt;, FK &lt;code&gt;RESTRICT&lt;/code&gt;) are enforced by the pgwire handler; direct &lt;code&gt;(d/transact)&lt;/code&gt; writes from Clojure bypass them because Datahike’s schema doesn’t yet carry the constraint vocabulary. A future release will lift enforcement into the tx layer so both paths are gated.&lt;/li&gt;
&lt;li&gt;Bulk-insert throughput is ~5,000 rows/sec on JDBC batch (Pagila replays in ~12s, Chinook in ~3s) — Datahike maintains EAVT/AEVT/AVET live, so a 10-column row costs ~10× a single index write. Tuned bulk paths in vanilla PG (&lt;code&gt;COPY&lt;/code&gt;, &lt;code&gt;pg_restore -j&lt;/code&gt;) are an order of magnitude faster, partly via deferred index construction; an analogous bulk-load fast path is a future item. Large migrations are overnight-cutover territory today.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The conformance posture is: pass for the workloads we’ve measured against, fail fast and loud everywhere else. We’d rather reject a stored procedure than execute it incorrectly.&lt;/p&gt;
&lt;h2 id=&quot;where-this-fits&quot;&gt;Where this fits&lt;/h2&gt;
&lt;p&gt;If you’ve used Neon or Xata, the goal will look familiar — branchable Postgres. The mechanism is different. Their branches are control-plane operations: call the API, get a new compute instance over copy-on-write storage. pg-datahike’s branches are session-level — &lt;code&gt;SET datahike.branch = &apos;feature&apos;&lt;/code&gt; inside an open psql connection switches what you’re reading. No provisioning, no compute. An agent or a query planner can switch branches mid-session.&lt;/p&gt;
&lt;p&gt;Commit pinning — &lt;code&gt;SET datahike.commit_id = &apos;&amp;#x3C;uuid&gt;&apos;&lt;/code&gt; — is the part where we don’t know of a peer. Neon’s time-travel is bounded by a 6h–1d restore window; pg-datahike pins to any historical commit, indefinitely. We have not seen another PG-compatible database expose this directly through the wire protocol.&lt;/p&gt;
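&lt;p&gt;In session terms it looks like this (a sketch: the uuid placeholder stands for any commit id from your history, and &lt;code&gt;RESET&lt;/code&gt; assumes standard session-variable semantics):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- pin every subsequent read in this session to one historical commit
SET datahike.commit_id = &apos;&amp;#x3C;uuid&gt;&apos;;
SELECT * FROM person;       -- evaluated against that commit&apos;s snapshot
RESET datahike.commit_id;   -- back to the branch head&lt;/code&gt;&lt;/pre&gt;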
&lt;p&gt;Dolt is the closest in spirit — git-like semantics, commit pinning, time-travel — but Dolt is MySQL with a custom storage engine. pg-datahike rides on the standard Postgres wire protocol; every PG client works without modification.&lt;/p&gt;
&lt;p&gt;The honest tradeoff: we are a compatibility layer over Datahike’s storage, not a fork of Postgres. Some features tied to the Postgres codebase — PL/pgSQL, the extension ecosystem, procedural languages — aren’t on our roadmap today. If you need those, use Postgres. If your bottleneck is versioning, branching, or reproducibility, this gets you there without leaving the wire protocol your tools already speak.&lt;/p&gt;
&lt;p&gt;Datahike has always been a Datalog database with a Clojure API and a growing set of language bindings; pg-datahike isn’t a separate database, just another front end on the same store. There’s a sibling: &lt;a href=&quot;/notes/stratum-analytics-engine&quot;&gt;Stratum&lt;/a&gt;, a SIMD-accelerated columnar engine that speaks the same wire protocol over an analytical column store with the same fork-as-pointer semantics. Both fit into a shared branching model — see &lt;a href=&quot;/notes/yggdrasil-unified-cow-protocols&quot;&gt;Yggdrasil: Branching Protocols&lt;/a&gt; for how a Datahike database, a Stratum dataset, and a vector index can fork together at a single snapshot.&lt;/p&gt;
&lt;p&gt;The rest of this post is for callers who do speak Clojure — the same data accessible as relations and as datoms, in-process queries that skip the wire, embedded mode without TCP, and configuration knobs that aren’t exposed over SQL.&lt;/p&gt;
&lt;h2 id=&quot;bidirectional-view&quot;&gt;Bidirectional view&lt;/h2&gt;
&lt;p&gt;The pgwire layer is a view onto Datahike’s datom store, not a separate representation. Tables you create over SQL show up as normal Datahike schemas, queryable from Clojure with &lt;code&gt;(d/q …)&lt;/code&gt;. Existing Datahike schemas show up as SQL tables with no setup.&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;;; Plain Datahike schema, transacted from Clojure
d/transact(conn
  [{:db/ident :person/id :db/valueType :db.type/long
    :db/cardinality :db.cardinality/one :db/unique :db.unique/identity}
   {:db/ident :person/name :db/valueType :db.type/string
    :db/cardinality :db.cardinality/one}])

d/transact(conn [{:person/id 1, :person/name &quot;Alice&quot;}])&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;;; Plain Datahike schema, transacted from Clojure
(d/transact conn
  [{:db/ident :person/id   :db/valueType :db.type/long
    :db/cardinality :db.cardinality/one :db/unique :db.unique/identity}
   {:db/ident :person/name :db/valueType :db.type/string
    :db/cardinality :db.cardinality/one}])

(d/transact conn [{:person/id 1 :person/name &quot;Alice&quot;}])&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Same database, over psql:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; person;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;--   id |  name&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;--  ----+-------&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;--    1 | Alice&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The reverse holds too — &lt;code&gt;CREATE TABLE&lt;/code&gt; over pgwire transacts a normal Datahike schema, and the next &lt;code&gt;(d/q …)&lt;/code&gt; from Clojure sees the rows you just inserted. There is no shadow representation, no separate metadata. One datom store, two query languages.&lt;/p&gt;
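&lt;p&gt;Sketched in the reverse direction, with a hypothetical &lt;code&gt;city&lt;/code&gt; table and the same table-to-namespace mapping as the &lt;code&gt;person&lt;/code&gt; example above:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- over psql: plain DDL and DML
CREATE TABLE city (id INT PRIMARY KEY, name TEXT);
INSERT INTO city VALUES (1, &apos;Berlin&apos;);&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;;; same database, from Clojure: the table is schema, the rows are datoms
(d/q &apos;[:find ?name
       :where [?e :city/name ?name]]
     @conn)
;; returns a set of name tuples, e.g. #{[&quot;Berlin&quot;]}&lt;/code&gt;&lt;/pre&gt;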
&lt;h2 id=&quot;using-the-library-directly&quot;&gt;Using the library directly&lt;/h2&gt;
&lt;p&gt;Two ways to skip the standalone jar — start a server from your own JVM application, or bypass the wire layer entirely.&lt;/p&gt;
&lt;h3 id=&quot;start-a-server-in-process&quot;&gt;Start a server in-process&lt;/h3&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;;; deps.edn
{:deps {org.replikativ/datahike {:mvn/version &quot;LATEST&quot;}
        org.replikativ/pg-datahike {:mvn/version &quot;LATEST&quot;}}}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;;; deps.edn
{:deps {org.replikativ/datahike    {:mvn/version &quot;LATEST&quot;}
        org.replikativ/pg-datahike {:mvn/version &quot;LATEST&quot;}}}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;require(&apos;[datahike.api :as d] &apos;[datahike.pg :as pg])

let [boot {:store {:backend :memory, :id random-uuid()}, :schema-flexibility :write}]:
  d/create-database(boot)
  pg/start-server({&quot;datahike&quot; d/connect(boot)} {:port 5432, :database-template {:store {:backend :memory}, :schema-flexibility :write, :keep-history? true}})
end
;; =&gt; :running on :5432&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(require &apos;[datahike.api :as d]
         &apos;[datahike.pg  :as pg])

(let [boot {:store {:backend :memory :id (random-uuid)}
            :schema-flexibility :write}]
  (d/create-database boot)
  (pg/start-server {&quot;datahike&quot; (d/connect boot)}
                   {:port 5432
                    :database-template {:store {:backend :memory}
                                        :schema-flexibility :write
                                        :keep-history? true}}))
;; =&gt; :running on :5432&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Same pgwire surface, in-process. The integration patterns earlier in this post are the embedded-library API; the standalone jar wraps the same calls behind CLI flags.&lt;/p&gt;
&lt;h3 id=&quot;bypass-the-wire-entirely&quot;&gt;Bypass the wire entirely&lt;/h3&gt;
&lt;p&gt;Tests and in-process applications don’t need the wire layer at all:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;def h: pg/make-query-handler(conn)
h.execute(&quot;CREATE TABLE person (id INT PRIMARY KEY, name TEXT)&quot;)
h.execute(&quot;INSERT INTO person VALUES (1, &apos;Alice&apos;)&quot;)
h.execute(&quot;SELECT * FROM person&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(def h (pg/make-query-handler conn))
(.execute h &quot;CREATE TABLE person (id INT PRIMARY KEY, name TEXT)&quot;)
(.execute h &quot;INSERT INTO person VALUES (1, &apos;Alice&apos;)&quot;)
(.execute h &quot;SELECT * FROM person&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Same SQL surface, no socket. Useful for property-based testing of SQL workloads, or for embedding the SQL interface inside a Clojure or ClojureScript application without exposing a port.&lt;/p&gt;
&lt;h2 id=&quot;permissive-vs-strict-compat&quot;&gt;Permissive vs. strict compat&lt;/h2&gt;
&lt;p&gt;By default the handler rejects unsupported DDL — &lt;code&gt;GRANT&lt;/code&gt;, &lt;code&gt;REVOKE&lt;/code&gt;, &lt;code&gt;CREATE POLICY&lt;/code&gt;, &lt;code&gt;ROW LEVEL SECURITY&lt;/code&gt;, &lt;code&gt;CREATE EXTENSION&lt;/code&gt;, &lt;code&gt;COPY&lt;/code&gt; — with SQLSTATE &lt;code&gt;0A000 feature_not_supported&lt;/code&gt;. Most ORMs emit some of these unconditionally. Two ways to relax:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;;; silently accept every auth/RLS/extension no-op (Hibernate, Odoo)
pg/make-query-handler(conn {:compat :permissive})

;; accept specific kinds only
pg/make-query-handler(conn {:silently-accept #{:grant :policy}})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;;; silently accept every auth/RLS/extension no-op (Hibernate, Odoo)
(pg/make-query-handler conn {:compat :permissive})

;; accept specific kinds only
(pg/make-query-handler conn {:silently-accept #{:grant :policy}})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The named presets in &lt;code&gt;datahike.pg.server/compat-presets&lt;/code&gt; cover the common ORM patterns.&lt;/p&gt;
&lt;h2 id=&quot;sql-or-datalog&quot;&gt;SQL or Datalog?&lt;/h2&gt;
&lt;p&gt;Both interfaces see the same datoms, the same indexes, the same history. The choice is about how the query reaches the engine.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reach for SQL&lt;/strong&gt; when callers don’t share a runtime with the database — services over the wire, analysts in Metabase, tools that only speak the wire protocol — or when you want existing tooling: ORMs, migration runners, BI dashboards.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reach for Datalog&lt;/strong&gt; when the query runs in the same process as the database. Datahike’s Datalog API is a Clojure function: pass values in, get values out, no parsing, no serialization, no socket. Even pg-datahike’s embedded mode (the &lt;code&gt;make-query-handler&lt;/code&gt; path shown above) still goes through the SQL parser and the translator; Datalog skips both. You can invoke arbitrary Clojure functions inside predicates, return live data structures without copying, and &lt;a href=&quot;/notes/collaborate-without-infrastructure&quot;&gt;join across multiple databases&lt;/a&gt; on different storage backends in a single query.&lt;/p&gt;
&lt;p&gt;The two paths compose. DDL via Flyway over SQL, then reads in Datalog from your Clojure backend. Or: Datahike schema in Clojure, ORM-driven CRUD over SQL. Both stay coherent because they’re views of the same datom store.&lt;/p&gt;
&lt;h2 id=&quot;compatibility-evidence&quot;&gt;Compatibility evidence&lt;/h2&gt;
&lt;p&gt;We test pg-datahike against the same suites the Postgres ecosystem uses on itself. If a suite passes here, the apps that depend on it generally work here.&lt;/p&gt;
&lt;table style=&quot;width: 100%; border-collapse: collapse; margin: 1.5rem 0;&quot;&gt;
  &lt;thead&gt;
    &lt;tr style=&quot;text-align: left; border-bottom: 1px solid var(--border);&quot;&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Layer&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Test suite&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Result&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 0;&quot;&gt;What this proves&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;JDBC driver&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;pgjdbc 42.7.5 — &lt;code&gt;ResultSetTest&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; white-space: nowrap;&quot;&gt;80 / 80&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Cursors, type decoding, and metadata behave the way every JVM Postgres client expects.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Java ORM&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Hibernate 6 — &lt;code&gt;DatahikeHibernateTest&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; white-space: nowrap;&quot;&gt;13 / 13&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;JPA stacks — Spring, Quarkus, Jakarta — talk to pg-datahike the same way they talk to Postgres.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Python ORM&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;SQLAlchemy 2.0 dialect&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;16 / 16 across 7 phases&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;The Python data ecosystem — Django, Flask, FastAPI, Airflow, dbt — connects via the standard dialect path.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;SQL semantics&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;sqllogictest&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;779 assertions, 61 files&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Cases derived from PostgreSQL&apos;s regression suite, expressed in the sqllogictest format SQLite, CockroachDB, and DuckDB use for their own correctness work.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Real application&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Odoo 19 — &lt;code&gt;--init=base --test-tags=:TestORM&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;11 / 11 cases, ~38k queries, zero translator errors&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;A 200-table ERP with one of the most demanding open-source ORM layers boots and passes its own test suite.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;BI tool&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Metabase native SQL&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;20-probe MBQL sweep&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Schema introspection, prepared statements, and result handling work for the paths real BI tools depend on.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Migration roundtrip&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Chinook + Pagila &lt;code&gt;pg_dump&lt;/code&gt; fixtures&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Chinook: byte-equal roundtrip. Pagila: schema parses, data loads.&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;A real Postgres database can be exported, replayed in pg-datahike, and dumped back — schema and data preserved through the round-trip.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Internal&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Unit suite&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;544 tests, 1603 assertions&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Standard regression coverage.&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Per-commit suites run on CircleCI. Odoo, Metabase, and &lt;code&gt;psql&lt;/code&gt; / &lt;code&gt;libpq&lt;/code&gt; (&lt;code&gt;\d&lt;/code&gt;, &lt;code&gt;\dt&lt;/code&gt;, &lt;code&gt;\df&lt;/code&gt; family) are run on a manual harness before each release. A dedicated compatibility page with linked test artifacts and a published gaps registry is in flight.&lt;/p&gt;
&lt;h2 id=&quot;try-it&quot;&gt;Try it&lt;/h2&gt;
&lt;p&gt;Download the jar from &lt;a href=&quot;https://github.com/replikativ/pg-datahike/releases&quot;&gt;GitHub releases&lt;/a&gt;, &lt;code&gt;java -jar pg-datahike-VERSION-standalone.jar&lt;/code&gt;, point &lt;code&gt;psql&lt;/code&gt; at it. To embed in a JVM app, the coordinate is &lt;code&gt;org.replikativ/pg-datahike&lt;/code&gt; on Clojars. Repo, docs, and issues at &lt;a href=&quot;https://github.com/replikativ/pg-datahike&quot;&gt;github.com/replikativ/pg-datahike&lt;/a&gt;; feedback to &lt;a href=&quot;mailto:contact@datahike.io&quot;&gt;contact@datahike.io&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A follow-up post will cover the structural-sharing model that makes branching O(1), what &lt;code&gt;merge!&lt;/code&gt; does, and the same workflow across every Datahike binding (Clojure, Java, JavaScript, Python, the C library, the CLI, and SQL). Subscribe to the &lt;a href=&quot;/rss.xml&quot;&gt;RSS feed&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;</content:encoded></item><item><title>Stratum: SQL that branches</title><link>https://datahike.io/notes/stratum-analytics-engine/</link><guid isPermaLink="true">https://datahike.io/notes/stratum-analytics-engine/</guid><description>How we built a SIMD-accelerated columnar SQL engine on the JVM with copy-on-write branching - faster than DuckDB on 35 of 46 queries via the Java Vector API.</description><pubDate>Thu, 26 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;div class=&quot;container page prose&quot;&gt;
&lt;h1 id=&quot;stratum-sql-that-branches&quot;&gt;Stratum: SQL that branches&lt;/h1&gt;
&lt;p&gt;A few years ago I hit a wall I suspect many data engineers know. I had a million-row analytical dataset and I wanted to run an experiment: modify a few pricing assumptions, re-run a set of aggregation queries, compare the results against the original. Simple enough - except in a mutable database, “compare against the original” means either keeping a copy of the data or hoping nothing changed. Neither scales.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/replikativ/datahike&quot;&gt;Datahike&lt;/a&gt; solves this for entity-level data. Its storage is EAVT-indexed - like &lt;a href=&quot;https://datomic.com&quot;&gt;Datomic&lt;/a&gt;, tuned for entity traversal and point lookups. That’s the right structure for a system-of-record, but not for scanning 10M rows to compute a GROUP BY with SIMD. Stratum explores the columnar alternative: the same CoW branching semantics, but over column-oriented storage optimized for analytical scans. SQL is the natural interface for this access pattern - something Datahike doesn’t yet have. The longer-term plan is integration: Stratum’s columnar engine and SQL support as a query path within Datahike’s Datalog planner.&lt;/p&gt;
&lt;p&gt;The core insight is that &lt;strong&gt;a columnar dataset is just a value&lt;/strong&gt;. Make it immutable with structural sharing and you get git-like semantics for free: fork a dataset in O(1), modify branches independently, time-travel to any snapshot, persist named commits to storage. Then add SIMD execution via the Java Vector API, and it turns out you can beat DuckDB on most single-threaded analytical queries from pure JVM code - no native compilation, no JNI.&lt;/p&gt;
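&lt;p&gt;The same property is visible in plain Clojure, where a &quot;copy&quot; of a persistent structure is a new root pointer over shared data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;;; plain-Clojure analogy for fork-as-pointer
(def base   (vec (range 1000000)))    ; one million elements
(def branch (assoc base 0 :changed))  ; copies only the path to index 0

(nth base 0)    ;; =&gt; 0         - the original is untouched
(nth branch 0)  ;; =&gt; :changed  - the fork diverges at one element&lt;/code&gt;&lt;/pre&gt;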
&lt;h2 id=&quot;the-sql-interface&quot;&gt;The SQL interface&lt;/h2&gt;
&lt;p&gt;Stratum speaks the PostgreSQL wire protocol. The quickest entry point is the standalone server:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;java&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --add-modules&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; jdk.incubator.vector&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;     --enable-native-access=ALL-UNNAMED&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;     -jar&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; stratum-standalone.jar&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;     --index&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; orders:/data/orders.csv&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Any PostgreSQL client connects immediately - psql, DBeaver, JDBC, psycopg2:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;psql&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -h&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; localhost&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -p&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 5432&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -U&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; stratum&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Standard analytical SQL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; region,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;       SUM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(amount &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; discount) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; revenue,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;       COUNT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)               &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ship_date &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;BETWEEN&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;2024-01-01&apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AND&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;2024-12-31&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GROUP BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; region&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; revenue &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Query CSV and Parquet files inline - auto-indexed on first access&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; payment_type,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;       AVG&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(tip_amount),&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;       PERCENTILE_CONT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;95&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WITHIN&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; GROUP&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tip_amount)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; read_csv(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;/data/taxi.csv&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GROUP BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; payment_type;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Full query and DML support: SELECT, INSERT, UPDATE, DELETE, UPSERT (INSERT ON CONFLICT). CTEs, correlated subqueries, window functions (ROW_NUMBER, RANK, LAG, LEAD, running aggregates), joins (INNER/LEFT/RIGHT/FULL with multi-column keys), set operations (UNION/INTERSECT/EXCEPT). Aggregates: SUM, COUNT, AVG, MIN, MAX, STDDEV, VARIANCE, CORR, MEDIAN, PERCENTILE_CONT, APPROX_QUANTILE, COUNT(DISTINCT). CASE WHEN, COALESCE, date functions, LIKE/ILIKE, FILTER clause. &lt;a href=&quot;https://github.com/replikativ/stratum/blob/main/doc/sql-interface.md&quot;&gt;Full SQL reference →&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&quot;how-the-engine-works&quot;&gt;How the engine works&lt;/h2&gt;
&lt;p&gt;Every column is split into fixed-size chunks. Each chunk carries pre-computed statistics: minimum, maximum, sum, count. This unlocks two significant optimizations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Zone-map pruning.&lt;/strong&gt; DuckDB also stores min and max per segment and uses them for predicate pushdown - skipping segments that can’t contain rows matching a WHERE clause. Both engines do this. What DuckDB doesn’t pre-compute is per-segment SUM or COUNT, so unfiltered aggregates like &lt;code&gt;COUNT(*)&lt;/code&gt;, &lt;code&gt;SUM(price)&lt;/code&gt;, or &lt;code&gt;AVG(price)&lt;/code&gt; require a full data scan in DuckDB. In Stratum, these are answered by traversing the pre-computed metadata at tree nodes - no row data touched. &lt;code&gt;SELECT AVG(price) FROM orders&lt;/code&gt; on 10M rows: Stratum 0.1ms, DuckDB 7.1ms.&lt;/p&gt;
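&lt;p&gt;A toy sketch of the metadata-only path - not Stratum’s actual tree layout, just the shape of the idea:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;;; each chunk carries min/max/sum/count; an unfiltered AVG never touches rows
(def chunk-stats
  [{:min 0.5 :max 99.9 :sum 52340.0 :count 10000}
   {:min 1.2 :max 87.3 :sum 48990.0 :count 10000}])

(defn avg-from-stats [stats]
  (/ (reduce + (map :sum stats))      ; total sum from metadata
     (reduce + (map :count stats))))  ; total count from metadata&lt;/code&gt;&lt;/pre&gt;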
&lt;p&gt;&lt;strong&gt;Fused SIMD execution.&lt;/strong&gt; Most columnar engines evaluate predicates in one pass, then apply the result mask during a separate aggregation pass. Stratum fuses these into a single loop: predicates and accumulation run simultaneously via Java Vector API &lt;code&gt;VectorMask&lt;/code&gt; chains, processing four doubles or longs per SIMD cycle. No intermediate arrays, no second pass, no extra allocation.&lt;/p&gt;
&lt;p&gt;The Vector API (JDK 21+) provides &lt;code&gt;DoubleVector&lt;/code&gt; and &lt;code&gt;LongVector&lt;/code&gt; operations backed by AVX-512 on x86 and SVE on ARM. The bet was that the JVM incubator API had matured enough to compete with native code on analytical workloads without the deployment complexity of a native library. The benchmarks suggest that bet paid off.&lt;/p&gt;
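&lt;p&gt;A minimal sketch of the fused pattern - illustrative use of the Vector API, not Stratum’s actual kernel. The predicate produces a &lt;code&gt;VectorMask&lt;/code&gt; that gates accumulation in the same loop iteration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;import jdk.incubator.vector.*;

// fused filter + sum-product in one pass: no selection vector, no second scan
static double fusedSum(double[] price, double[] qty, double lo, double hi) {
  VectorSpecies&amp;#x3C;Double&gt; S = DoubleVector.SPECIES_256;  // 4 doubles per vector
  DoubleVector acc = DoubleVector.zero(S);
  int i = 0;
  for (; i &amp;#x3C; S.loopBound(price.length); i += S.length()) {
    DoubleVector p = DoubleVector.fromArray(S, price, i);
    DoubleVector q = DoubleVector.fromArray(S, qty, i);
    VectorMask&amp;#x3C;Double&gt; m = p.compare(VectorOperators.GE, lo)
                             .and(p.compare(VectorOperators.LT, hi));
    acc = acc.add(p.mul(q), m);                 // accumulate only masked lanes
  }
  double sum = acc.reduceLanes(VectorOperators.ADD);
  for (; i &amp;#x3C; price.length; i++)              // scalar tail
    if (price[i] &gt;= lo &amp;&amp; price[i] &amp;#x3C; hi) sum += price[i] * qty[i];
  return sum;
}&lt;/code&gt;&lt;/pre&gt;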
&lt;h2 id=&quot;performance&quot;&gt;Performance&lt;/h2&gt;
&lt;p&gt;Single-threaded comparison vs DuckDB v1.4.4 (JDBC in-process) on 10M rows, Intel Core Ultra 7 258V, JVM 25. Median of 10 iterations, 5 warmup:&lt;/p&gt;
&lt;table style=&quot;width: 100%; border-collapse: collapse; margin: 1rem 0;&quot;&gt;
  &lt;thead&gt;
    &lt;tr style=&quot;text-align: left; border-bottom: 1px solid var(--border);&quot;&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Query&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Stratum&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;DuckDB&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 0;&quot;&gt;Ratio&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;TPC-H Q6 (filter + sum-product)&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;13ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;28ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0; font-weight: 600;&quot;&gt;2.2x faster&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Filtered COUNT (NEQ pred)&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;3ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;12ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0; font-weight: 600;&quot;&gt;4.0x faster&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;TPC-H Q1 (7 aggs, 4 groups)&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;75ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;93ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0; font-weight: 600;&quot;&gt;1.2x faster&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;H2O Q3 (100K string groups)&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;71ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;362ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0; font-weight: 600;&quot;&gt;5.1x faster&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;H2O Q10 (10M groups, 6 cols)&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;832ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;7056ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0; font-weight: 600;&quot;&gt;8.5x faster&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;LIKE &apos;%search%&apos;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;47ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;240ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0; font-weight: 600;&quot;&gt;5.1x faster&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;AVG(LENGTH(URL))&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;38ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;170ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0; font-weight: 600;&quot;&gt;4.5x faster&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;H2O Q6 (STDDEV group-by)&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;30ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;81ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0; font-weight: 600;&quot;&gt;2.7x faster&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;H2O Q9 (CORR)&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;61ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;134ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0; font-weight: 600;&quot;&gt;2.2x faster&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;MEDIAN(price)&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;68ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;158ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0; font-weight: 600;&quot;&gt;2.3x faster&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;ROW_NUMBER window&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;316ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;426ms&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0; font-weight: 600;&quot;&gt;1.3x faster&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Stratum wins 35 of 46 queries at 10M rows (single-threaded, median of 10 runs). DuckDB wins on sparse-selectivity filters, window-based top-N, high-cardinality hash group-by at scale (1M+ unique groups where hash tables become DRAM-bound), and global COUNT(DISTINCT). Full methodology and raw results: &lt;a href=&quot;https://github.com/replikativ/stratum/blob/main/doc/benchmarks.md&quot;&gt;benchmark docs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;DuckDB is an excellent system. The point is that pure JVM code can compete with a mature native engine on the workloads that matter most, while adding semantics DuckDB doesn’t have.&lt;/p&gt;
&lt;h2 id=&quot;branching-where-it-diverges&quot;&gt;Branching: where it diverges&lt;/h2&gt;
&lt;p&gt;This is the part that doesn’t exist anywhere else.&lt;/p&gt;
&lt;p&gt;Each column is backed by a chunked B-tree (&lt;code&gt;PersistentColumnIndex&lt;/code&gt;) that implements Clojure’s &lt;code&gt;IPersistentCollection&lt;/code&gt; and &lt;code&gt;IEditableCollection&lt;/code&gt; protocols. When you call &lt;code&gt;(st/fork ds)&lt;/code&gt;, you get a new dataset that shares all unchanged chunks with the original. No data is copied - just a new root pointer into a shared tree. Mutations through the transient protocol copy only the chunks they touch. A billion-row dataset costs essentially nothing to fork.&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;require(&apos;[stratum.api :as st]
  &apos;[konserve.file-store :as fs]
  &apos;[clojure.core.async :refer [&amp;#x3C;!!]])

;; Open storage, load the orders dataset (10M rows)
def store: &amp;#x3C;!!(fs/new-fs-store(&quot;/data/stratum&quot;))
def orders: &amp;#x3C;!!(st/load(store &quot;orders&quot;))

;; Fork in O(1) - structural sharing, zero data copied
def experiment: st/fork(orders)

;; Persist the fork as a named branch
&amp;#x3C;!!(st/sync!(experiment store &quot;experiment&quot;))

;; Query both branches via SQL - pass column data as table map
st/q(&quot;SELECT SUM(price * qty) FROM t&quot; {&quot;t&quot; st/columns(orders)})
;; =&gt; {:SUM(price * qty) 4821903.40}   ← main branch

st/q(&quot;SELECT SUM(price * qty) FROM t&quot; {&quot;t&quot; st/columns(experiment)})
;; =&gt; {:SUM(price * qty) 4401238.66}   ← experiment branch

;; Time-travel: load any historical branch by name
def baseline: &amp;#x3C;!!(st/load(store &quot;orders-baseline&quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(require &apos;[stratum.api :as st]
         &apos;[konserve.file-store :as fs]
         &apos;[clojure.core.async :refer [&amp;#x3C;!!]])

;; Open storage, load the orders dataset (10M rows)
(def store  (&amp;#x3C;!! (fs/new-fs-store &quot;/data/stratum&quot;)))
(def orders (&amp;#x3C;!! (st/load store &quot;orders&quot;)))

;; Fork in O(1) - structural sharing, zero data copied
(def experiment (st/fork orders))

;; Persist the fork as a named branch
(&amp;#x3C;!! (st/sync! experiment store &quot;experiment&quot;))

;; Query both branches via SQL - pass column data as table map
(st/q &quot;SELECT SUM(price * qty) FROM t&quot; {&quot;t&quot; (st/columns orders)})
;; =&gt; {:SUM(price * qty) 4821903.40}   ← main branch

(st/q &quot;SELECT SUM(price * qty) FROM t&quot; {&quot;t&quot; (st/columns experiment)})
;; =&gt; {:SUM(price * qty) 4401238.66}   ← experiment branch

;; Time-travel: load any historical branch by name
(def baseline (&amp;#x3C;!! (st/load store &quot;orders-baseline&quot;)))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;From the server side, &lt;code&gt;register-live-table!&lt;/code&gt; lets you expose named branches as separate SQL tables - query them with plain SQL over the PostgreSQL connection without touching the Clojure API.&lt;/p&gt;
&lt;p&gt;The practical uses this unlocks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reproducible experiments&lt;/strong&gt;: fork a dataset, run your pipeline on the fork, compare results against the original without managing separate data copies or locking the source&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Audit trails&lt;/strong&gt;: every query result is tied to a specific database state - you can always recover the exact snapshot that produced a given answer&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What-if analysis&lt;/strong&gt;: branch before a bulk UPDATE, run your scenario, inspect the diff, discard - the original is untouched&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Zero-ETL&lt;/strong&gt;: &lt;a href=&quot;/datahike&quot;&gt;Datahike&lt;/a&gt; is the system-of-record; Stratum queries the same versioned snapshots directly, no extraction pipeline needed&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;for-clojure-developers&quot;&gt;For Clojure developers&lt;/h2&gt;
&lt;p&gt;If you’re coming from the Clojure ecosystem, Stratum datasets behave like ordinary Clojure values. They implement &lt;code&gt;IPersistentCollection&lt;/code&gt;, &lt;code&gt;ILookup&lt;/code&gt;, &lt;code&gt;IEditableCollection&lt;/code&gt; - tablecloth and tech.ml.dataset work with them directly as column maps. You can query with SQL strings or a Clojure DSL that composes programmatically:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;require(&apos;[stratum.api :as st])

;; DSL - composable, no string manipulation
st/q({:from {:price prices, :qty quantities, :region regions}
      :where [[:&gt; :price 100]]
      :group [:region]
      :agg [[:sum [:* :price :qty]] [:count]]})

;; SQL string - same engine underneath
st/q(&quot;SELECT region, SUM(price * qty), COUNT(*)
      FROM orders WHERE price &gt; 100 GROUP BY region&quot;
     {&quot;orders&quot; {:price prices, :qty quantities, :region regions}})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(require &apos;[stratum.api :as st])

;; DSL - composable, no string manipulation
(st/q {:from   {:price prices :qty quantities :region regions}
       :where  [[:&gt; :price 100]]
       :group  [:region]
       :agg    [[:sum [:* :price :qty]]
                [:count]]})

;; SQL string - same engine underneath
(st/q &quot;SELECT region, SUM(price * qty), COUNT(*)
       FROM orders WHERE price &gt; 100 GROUP BY region&quot;
      {&quot;orders&quot; {:price prices :qty quantities :region regions}})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The DSL is useful when building queries programmatically - no string interpolation, no injection risk, results are plain Clojure maps.&lt;/p&gt;
&lt;h2 id=&quot;the-origin&quot;&gt;The origin&lt;/h2&gt;
&lt;p&gt;This work started with &lt;a href=&quot;http://reluk.ca/project/Votorola/home-2013.html&quot;&gt;Votorola&lt;/a&gt;, a collaborative democracy project that needed distributed state. The limitations of imperative systems led to Clojure, then to &lt;a href=&quot;https://github.com/replikativ/replikativ&quot;&gt;replikativ&lt;/a&gt; for distributed replication, then to Datahike for immutable entity-level storage. Each step sharpened the same conviction: mutability is the core problem. When data changes in place, you lose history, auditability, and the ability to reason about what a system knew at any point in time.&lt;/p&gt;
&lt;p&gt;My PhD work on &lt;a href=&quot;https://scholar.google.com/citations?user=6foQfZwAAAAJ&quot;&gt;simulator-based inference&lt;/a&gt; at UBC’s PLAI lab reinforced this. Probabilistic systems need to fork hypotheses, accumulate evidence, and explain their reasoning - tracking not just the current state but the path that led to it. Stratum is the analytics piece of the infrastructure we’re building for that.&lt;/p&gt;
&lt;h2 id=&quot;the-ecosystem&quot;&gt;The ecosystem&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/replikativ/datahike&quot;&gt;Datahike&lt;/a&gt;&lt;/strong&gt; - immutable Datalog database: system-of-record for structured entity data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;/stratum&quot;&gt;Stratum&lt;/a&gt;&lt;/strong&gt; - SIMD-accelerated columnar SQL: analytics and scans over those same snapshots&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;/proximum&quot;&gt;Proximum&lt;/a&gt;&lt;/strong&gt; - version-controlled vector search (HNSW)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/replikativ/scriptum&quot;&gt;Scriptum&lt;/a&gt;&lt;/strong&gt; - git-like branching for full-text search (Lucene)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/replikativ/yggdrasil&quot;&gt;Yggdrasil&lt;/a&gt;&lt;/strong&gt; - unified branching across all of the above&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Via Yggdrasil you can fork a Datahike database, a Stratum dataset, and a Proximum index together - consistent snapshots across SQL, Datalog, and vector search at the same point in time.&lt;/p&gt;
&lt;h2 id=&quot;getting-started&quot;&gt;Getting started&lt;/h2&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;;; deps.edn - check https://clojars.org/org.replikativ/stratum for latest
{:deps {org.replikativ/stratum {:mvn/version &quot;0.1.7&quot;}}}

:jvm-opts [&quot;--add-modules=jdk.incubator.vector&quot;
           &quot;--enable-native-access=ALL-UNNAMED&quot;]&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;;; deps.edn - check https://clojars.org/org.replikativ/stratum for latest
{:deps {org.replikativ/stratum {:mvn/version &quot;0.1.7&quot;}}}

:jvm-opts [&quot;--add-modules=jdk.incubator.vector&quot;
           &quot;--enable-native-access=ALL-UNNAMED&quot;]&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Standalone server with built-in demo tables (lineitem, taxi - 100K rows each)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;java&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --add-modules&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; jdk.incubator.vector&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;     --enable-native-access=ALL-UNNAMED&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;     -jar&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; stratum-standalone.jar&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --demo&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Requirements:&lt;/strong&gt; JDK 21+&lt;/p&gt;
&lt;p&gt;Source: &lt;a href=&quot;https://github.com/replikativ/stratum&quot;&gt;github.com/replikativ/stratum&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Documentation: &lt;a href=&quot;https://github.com/replikativ/stratum/blob/main/doc/query-engine.md&quot;&gt;Query DSL&lt;/a&gt; · &lt;a href=&quot;https://github.com/replikativ/stratum/blob/main/doc/sql-interface.md&quot;&gt;SQL Interface&lt;/a&gt; · &lt;a href=&quot;https://github.com/replikativ/stratum/blob/main/doc/dataset.md&quot;&gt;Dataset API&lt;/a&gt; · &lt;a href=&quot;https://github.com/replikativ/stratum/blob/main/doc/anomaly-detection.md&quot;&gt;Anomaly Detection&lt;/a&gt; · &lt;a href=&quot;https://github.com/replikativ/stratum/blob/main/doc/benchmarks.md&quot;&gt;Benchmarks&lt;/a&gt; · &lt;a href=&quot;https://github.com/replikativ/stratum/blob/main/doc/architecture.md&quot;&gt;Architecture&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Feedback welcome on the &lt;a href=&quot;https://clojurians.slack.com/archives/CB7GJAN0L&quot;&gt;Clojurians #datahike channel&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;</content:encoded></item><item><title>The Git Model for Databases</title><link>https://datahike.io/notes/the-git-model-for-databases/</link><guid isPermaLink="true">https://datahike.io/notes/the-git-model-for-databases/</guid><description>Copy-on-write, structural sharing, and branching - applied to your data.</description><pubDate>Tue, 06 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;div class=&quot;container page prose&quot;&gt;
&lt;h1 id=&quot;the-git-model-for-databases&quot;&gt;The Git Model for Databases&lt;/h1&gt;
&lt;p&gt;Every commit is a snapshot. Branches are cheap. Merging is a first-class operation. Developers internalize this model for code - but it applies equally well to data.&lt;/p&gt;
&lt;h2 id=&quot;databases-as-values&quot;&gt;Databases as values&lt;/h2&gt;
&lt;p&gt;In a traditional database, you interact through a connection. The data may change between queries; the database is a service, not a thing you hold.&lt;/p&gt;
&lt;p&gt;Datahike inverts this. Dereference a connection (&lt;code&gt;@conn&lt;/code&gt;) and you get a &lt;strong&gt;database value&lt;/strong&gt;: a snapshot frozen at a particular transaction. That value won’t change. Pass it to a function, store it, compare it to another snapshot. Two threads reading the same snapshot always agree - no locks, no coordination. And because a snapshot is just a value, you can hand it to any number of workers across threads, processes, or machines. Read scaling is built in: spin up more readers, not more database connections.&lt;/p&gt;
&lt;h2 id=&quot;structural-sharing&quot;&gt;Structural sharing&lt;/h2&gt;
&lt;p&gt;If every write produces a new snapshot, won’t you run out of memory? No - because snapshots share structure. When you transact new data, Datahike creates new tree nodes only for changed portions; everything else is reused. A million-row database with one updated row shares 99.99% of its structure with the previous version.&lt;/p&gt;
&lt;p&gt;This is the same trick that powers Clojure’s persistent vectors and git’s object store. Overhead is logarithmic, not linear.&lt;/p&gt;
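Path copying can be sketched with a toy persistent tree (illustrative Java, not Datahike's actual index code): updating one leaf allocates only the nodes on the root-to-leaf path, and every other subtree is shared by reference between versions.

```java
// Toy persistent binary tree. `update` returns a NEW root; the old root
// still describes the old snapshot, and untouched subtrees are shared.
record Node(Node left, Node right, long value) {
    static Node leaf(long v) { return new Node(null, null, v); }

    // Copy the O(log n) path to the addressed leaf; share everything else.
    static Node update(Node n, int depth, int index, long v) {
        if (depth == 0) return leaf(v);
        int bit = (index >>> (depth - 1)) & 1;
        return bit == 0
            ? new Node(update(n.left(), depth - 1, index, v), n.right(), 0)
            : new Node(n.left(), update(n.right(), depth - 1, index, v), 0);
    }
}
```

Both snapshots remain fully usable after the update, which is exactly what makes "every commit is a snapshot" affordable.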
&lt;h2 id=&quot;branching&quot;&gt;Branching&lt;/h2&gt;
&lt;p&gt;Fork a database, make changes in isolation, merge back when ready. Unlike git (which merges text files), database merges operate on datoms with application-defined conflict resolution.&lt;/p&gt;
&lt;p&gt;This enables workflows that are awkward otherwise: feature branches for data migrations, parallel experiments with different schemas, per-tenant forks sharing a common ancestor. It’s also how coding assistants use git worktrees to isolate their edits - the same model applies to data.&lt;/p&gt;
&lt;h2 id=&quot;history-that-persists&quot;&gt;History that persists&lt;/h2&gt;
&lt;p&gt;Most databases offer snapshot isolation through MVCC, but those snapshots are ephemeral - garbage collected after the transaction commits. You can’t query “what was the value last Tuesday?”&lt;/p&gt;
&lt;p&gt;Datahike keeps history by default. Every past state is addressable. Query as-of a specific instant, diff two snapshots, audit when a fact changed. Useful for debugging, compliance, and any system that needs to explain itself.&lt;/p&gt;
&lt;h2 id=&quot;the-tradeoff&quot;&gt;The tradeoff&lt;/h2&gt;
&lt;p&gt;Immutability isn’t free. Write amplification is real: inserting a row touches multiple tree nodes. Storage grows with history, though compaction can prune what you don’t need.&lt;/p&gt;
&lt;p&gt;In practice, this cost is amortized. You don’t create a snapshot for every fact added during a bulk load. The underlying data structures support &lt;em&gt;transient&lt;/em&gt; modes - mutable during a batch, immutable at the boundary. Snapshots are created only when a batch commits, and only those become visible to external readers. The system can adaptively coarse-grain batches to balance write throughput against snapshot granularity.&lt;/p&gt;
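The transient pattern in miniature, as a hypothetical Java sketch (not the actual API): writes mutate a private buffer cheaply, and only the commit boundary publishes an immutable value to readers.

```java
import java.util.ArrayList;
import java.util.List;

// Transient-style batching (illustrative). Mutations are in-place and
// invisible to readers; commit() freezes the batch into an immutable
// snapshot, which is the only thing external readers ever observe.
final class Batch {
    private final ArrayList<Long> buf = new ArrayList<>();
    Batch add(long v) { buf.add(v); return this; }      // cheap in-place write
    List<Long> commit() { return List.copyOf(buf); }    // immutable snapshot
}
```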
&lt;p&gt;For systems that value auditability, reproducibility, and coordination-free reads, this model beats the connection-oriented one we inherited from the 1970s.&lt;/p&gt;
&lt;/div&gt;</content:encoded></item><item><title>Versioned Analytics for Regulated Industries</title><link>https://datahike.io/notes/versioned-analytics-regulated-industries/</link><guid isPermaLink="true">https://datahike.io/notes/versioned-analytics-regulated-industries/</guid><description>How immutable snapshots, copy-on-write branching, and cross-system consistency solve audit compliance, reproducibility, and scenario analysis in regulated environments.</description><pubDate>Mon, 06 Apr 2026 23:00:00 GMT</pubDate><content:encoded>&lt;div class=&quot;container page prose&quot;&gt;
&lt;h1 id=&quot;versioned-analytics-for-regulated-industries&quot;&gt;Versioned Analytics for Regulated Industries&lt;/h1&gt;
&lt;p&gt;Financial regulation — &lt;a href=&quot;https://www.bis.org/bcbs/basel3.htm&quot;&gt;Basel III&lt;/a&gt;, &lt;a href=&quot;https://www.esma.europa.eu/publications-and-data/interactive-single-rulebook/mifid-ii&quot;&gt;MiFID II&lt;/a&gt;, &lt;a href=&quot;https://www.eiopa.europa.eu/browse/regulation-and-policy/solvency-ii_en&quot;&gt;Solvency II&lt;/a&gt;, &lt;a href=&quot;https://www.congress.gov/bill/107th-congress/house-bill/3763&quot;&gt;SOX&lt;/a&gt; — requires that risk calculations, credit decisions, and compliance reports be reproducible. Not just the code, but the exact data state that produced them. When an auditor asks “show me the data behind this risk number from six months ago,” the answer can’t be “we’ll try to reconstruct it.”&lt;/p&gt;
&lt;p&gt;Version control solved this problem for source code decades ago. But analytical data infrastructure never caught up. Data warehouses don’t version tables. Temporal tables track row-level changes but don’t compose across tables or systems. Manual snapshots are expensive, fragile, and don’t support branching for scenario analysis.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/replikativ/stratum&quot;&gt;Stratum&lt;/a&gt; brings the &lt;a href=&quot;/notes/the-git-model-for-databases&quot;&gt;git model&lt;/a&gt; to analytical data: every write creates an immutable, content-addressed snapshot. Old states remain accessible by commit UUID. Branches are O(1). And via &lt;a href=&quot;/notes/yggdrasil-unified-cow-protocols&quot;&gt;Yggdrasil&lt;/a&gt;, you can tie entity databases, analytical datasets, and search indices into a single consistent, auditable snapshot.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The problem&lt;/h2&gt;
&lt;p&gt;A typical analytical pipeline at a regulated institution:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Transactional data flows into a warehouse (nightly ETL or streaming)&lt;/li&gt;
&lt;li&gt;Analysts run GROUP BY / SUM / STDDEV queries for risk models and reports&lt;/li&gt;
&lt;li&gt;Results feed regulatory submissions — capital adequacy, liquidity coverage, market risk&lt;/li&gt;
&lt;li&gt;Months later, an auditor asks: “What data produced risk report X on date Y?”&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Step 4 is where things break. The warehouse has been mutated since then. Maybe there’s a backup, maybe not. Reconstructing the exact state requires replaying ETL from source systems — if those logs still exist.&lt;/p&gt;
&lt;p&gt;Even if you can reconstruct the data, you can’t &lt;em&gt;prove&lt;/em&gt; it’s the same data. There’s no cryptographic link between the report and the state that produced it. The best you can offer is procedural trust: “our backup process is reliable, and we believe this is what the data looked like.” That’s a weak foundation for regulatory compliance.&lt;/p&gt;
&lt;h2 id=&quot;immutable-snapshots-as-audit-anchors&quot;&gt;Immutable snapshots as audit anchors&lt;/h2&gt;
&lt;p&gt;With Stratum, every table is a copy-on-write value. Writes create new snapshots; old snapshots remain addressable by commit UUID or branch name. The underlying storage is a content-addressed Merkle tree — each snapshot’s identity is derived from a hash of its data, providing a cryptographic chain of custody from report to source.&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;require(&apos;[stratum.api :as st])

;; Load the current production state
def trades: st/load(store &quot;trades&quot; {:branch &quot;production&quot;})

;; Run today&apos;s risk calculation
def risk-report: st/q({:from trades, :group [:desk :currency], :agg [[:sum :notional] [:stddev :pnl] [:count]]})

;; The commit UUID is your audit anchor — store it alongside the report
;; Six months later, reproduce exactly:
def historical-trades: st/load(store &quot;trades&quot; {:as-of #uuid &quot;a1b2c3d4-...&quot;})

def historical-report: st/q({:from historical-trades, :group [:desk :currency], :agg [[:sum :notional] [:stddev :pnl] [:count]]})
;; Identical results, guaranteed by content addressing&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(require &apos;[stratum.api :as st])

;; Load the current production state
(def trades (st/load store &quot;trades&quot; {:branch &quot;production&quot;}))

;; Run today&apos;s risk calculation
(def risk-report
  (st/q {:from trades
         :group [:desk :currency]
         :agg [[:sum :notional] [:stddev :pnl] [:count]]}))

;; The commit UUID is your audit anchor — store it alongside the report
;; Six months later, reproduce exactly:
(def historical-trades
  (st/load store &quot;trades&quot; {:as-of #uuid &quot;a1b2c3d4-...&quot;}))

(def historical-report
  (st/q {:from historical-trades
         :group [:desk :currency]
         :agg [[:sum :notional] [:stddev :pnl] [:count]]}))
;; Identical results, guaranteed by content addressing&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Or via SQL — connect any PostgreSQL client:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Today&apos;s report&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; desk, currency, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;SUM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(notional), STDDEV(pnl), &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;COUNT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; trades &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GROUP BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; desk, currency;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Historical report: same query, different snapshot&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- resolved server-side via branch/commit configuration&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once committed, data cannot be modified — every state is a value, addressable by its content hash. Historical snapshots load lazily from storage on demand, so keeping years of history doesn’t mean paying for it in memory. And because snapshots are immutable values, multiple analysts can query the same or different points in time concurrently without coordination or locks.&lt;/p&gt;
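Content addressing in miniature (a sketch of the general Merkle-tree technique, not Stratum's actual hashing scheme): a node's identity is a hash over its payload and its children's identities, so equal content always yields the same id and any change to a leaf propagates up to the root hash.

```java
import java.security.MessageDigest;
import java.util.HexFormat;

// Merkle-style content id: hash(payload, child ids). Tampering with any
// node changes every id on the path to the root, which is what gives a
// report-to-data chain of custody its cryptographic teeth.
final class ContentId {
    static String of(byte[] payload, String... childIds) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(payload);
            for (String id : childIds) md.update(id.getBytes());
            return HexFormat.of().formatHex(md.digest());
        } catch (Exception e) { throw new RuntimeException(e); }
    }
}
```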
&lt;h2 id=&quot;scenario-analysis-with-branching&quot;&gt;Scenario analysis with branching&lt;/h2&gt;
&lt;p&gt;Beyond audit compliance, regulated institutions need scenario analysis. Basel III &lt;a href=&quot;https://www.bis.org/bcbs/publ/d450.htm&quot;&gt;stress testing&lt;/a&gt; requires banks to evaluate capital adequacy under hypothetical adverse conditions — equity drawdowns, interest rate shocks, credit spread widening. Traditional approaches involve copying production data into staging environments, running scenarios, comparing results, and cleaning up. That process is slow, expensive, and error-prone.&lt;/p&gt;
&lt;p&gt;With copy-on-write branching, forking a dataset is O(1) regardless of size. A 100-million-row table branches in microseconds because the fork is just a new root pointer into the shared tree. Only chunks that are actually modified get copied.&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;;; Fork production data for stress testing — O(1) regardless of table size
def stress-scenario: st/fork(trades)

;; Apply adverse conditions — only modified chunks are copied
;; e.g. via SQL: UPDATE trades SET price = price * 0.7
;;               WHERE asset_class = &apos;equity&apos;

;; Compare risk metrics: production vs stressed
def baseline-risk: st/q({:from trades, :group [:desk], :agg [[:stddev :pnl] [:sum :notional]]})

def stressed-risk: st/q({:from stress-scenario, :group [:desk], :agg [[:stddev :pnl] [:sum :notional]]})

;; Run as many scenarios as needed — each is an independent branch
;; Baseline, adverse, severely adverse, custom scenarios
;; all sharing unmodified data via structural sharing&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;;; Fork production data for stress testing — O(1) regardless of table size
(def stress-scenario (st/fork trades))

;; Apply adverse conditions — only modified chunks are copied
;; e.g. via SQL: UPDATE trades SET price = price * 0.7
;;               WHERE asset_class = &apos;equity&apos;

;; Compare risk metrics: production vs stressed
(def baseline-risk
  (st/q {:from trades
         :group [:desk]
         :agg [[:stddev :pnl] [:sum :notional]]}))

(def stressed-risk
  (st/q {:from stress-scenario
         :group [:desk]
         :agg [[:stddev :pnl] [:sum :notional]]}))

;; Run as many scenarios as needed — each is an independent branch
;; Baseline, adverse, severely adverse, custom scenarios
;; all sharing unmodified data via structural sharing&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Each branch is fully isolated: modifications to the stress scenario can’t touch production data. You can maintain dozens of concurrent scenarios without multiplying storage costs — they share all unmodified data. When you stop referencing a branch, mark-and-sweep GC reclaims the storage. No staging environments, no cleanup scripts.&lt;/p&gt;
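&lt;p&gt;The mechanics can be sketched with a toy chunked table (not Stratum’s actual chunk format): a branch is just a fresh list of references to immutable chunks, so forking never touches the data, and an update clones only the one chunk it hits:&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of copy-on-write forking (not Stratum's actual chunk
// format). A version is a list of references to immutable chunks, so forking
// copies the reference list, never the data. Updating one value copies only
// the chunk that holds it; every other chunk stays shared between versions.
class CowTable {
    static final int CHUNK = 4;
    final List<int[]> chunks;

    CowTable(List<int[]> chunks) { this.chunks = chunks; }

    static CowTable of(int... values) {
        List<int[]> cs = new ArrayList<>();
        for (int i = 0; i < values.length; i += CHUNK) {
            int[] c = new int[Math.min(CHUNK, values.length - i)];
            System.arraycopy(values, i, c, 0, c.length);
            cs.add(c);
        }
        return new CowTable(cs);
    }

    // O(1) in data size: only the list of chunk references is copied.
    CowTable fork() { return new CowTable(new ArrayList<>(chunks)); }

    // Copy-on-write update: clone the single affected chunk.
    CowTable set(int index, int value) {
        CowTable next = fork();
        int[] copy = next.chunks.get(index / CHUNK).clone();
        copy[index % CHUNK] = value;
        next.chunks.set(index / CHUNK, copy);
        return next;
    }

    int get(int index) { return chunks.get(index / CHUNK)[index % CHUNK]; }

    // How many chunks two versions still share by reference.
    static long shared(CowTable a, CowTable b) {
        return a.chunks.stream()
                .filter(c -> b.chunks.stream().anyMatch(d -> d == c))
                .count();
    }
}
```

&lt;p&gt;Running dozens of scenarios against a table of N chunks therefore costs storage proportional to the chunks each scenario actually modifies, not to N.&lt;/p&gt;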
&lt;p&gt;This also applies to model validation. When a risk model is updated, you can run the new model against historical snapshots and compare its outputs to the original model’s results — same data, different code, verifiable divergence.&lt;/p&gt;
&lt;h2 id=&quot;cross-system-consistency&quot;&gt;Cross-system consistency&lt;/h2&gt;
&lt;p&gt;A real regulatory pipeline isn’t just one analytical table. Entity data (customers, counterparties, legal entities) lives in a transactional database. Analytical views (positions, P&amp;#x26;L, exposures) live in a columnar engine. Compliance documents and communications live in a search index. For an audit to be meaningful, all of these need to be at the same point in time.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/notes/yggdrasil-unified-cow-protocols&quot;&gt;Yggdrasil&lt;/a&gt; provides a shared branching protocol across these heterogeneous systems. You can compose a &lt;a href=&quot;https://github.com/replikativ/datahike&quot;&gt;Datahike&lt;/a&gt; entity database, a Stratum analytical dataset, and a &lt;a href=&quot;https://github.com/replikativ/scriptum&quot;&gt;Scriptum&lt;/a&gt; search index into a single composite system — branching, snapshotting, and time-traveling all of them together.&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;require(&apos;[yggdrasil.core :as ygg])

;; Compose entity database + analytics + search into one system
def system: ygg/composite-system({:entities datahike-conn, :analytics stratum-store, :search scriptum-index})

;; Branch the entire system for an investigation
ygg/branch!(system &quot;investigation-2026-Q1&quot;)

;; Every component is now at the same logical point in time
;; Query across all three with a single consistent snapshot&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(require &apos;[yggdrasil.core :as ygg])

;; Compose entity database + analytics + search into one system
(def system
  (ygg/composite-system
    {:entities datahike-conn    ;; customer records, counterparties
     :analytics stratum-store   ;; trade data, positions, P&amp;#x26;L
     :search scriptum-index}))  ;; compliance documents, communications

;; Branch the entire system for an investigation
(ygg/branch! system &quot;investigation-2026-Q1&quot;)

;; Every component is now at the same logical point in time
;; Query across all three with a single consistent snapshot&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;When an auditor needs the full picture — the trade data, the customer entity that placed the trade, and the compliance documents reviewed at the time — they get a single consistent view across all systems, tied to one branch identifier. No manual coordination, no hoping the timestamps line up.&lt;/p&gt;
&lt;h2 id=&quot;compliance-lifecycle&quot;&gt;Compliance lifecycle&lt;/h2&gt;
&lt;p&gt;Immutable systems raise an obvious question: what about &lt;a href=&quot;https://gdpr-info.eu/&quot;&gt;GDPR&lt;/a&gt; right-to-erasure, or data retention policies that require deletion?&lt;/p&gt;
&lt;p&gt;Immutability doesn’t mean data can never be removed — it means deletion is explicit and verifiable rather than implicit and unauditable. The Datahike ecosystem supports purge operations that remove specific data from all indices and all historical snapshots. Mark-and-sweep garbage collection, coordinated across systems via Yggdrasil, reclaims storage from unreachable snapshots.&lt;/p&gt;
&lt;p&gt;This is actually a stronger compliance story than mutable databases offer. In a mutable system, you &lt;code&gt;DELETE&lt;/code&gt; a row and trust that the storage layer eventually overwrites it — but you can’t prove it’s gone from backups, replicas, or caches. With explicit purge on content-addressed storage, you can verify that the data no longer exists in any reachable snapshot.&lt;/p&gt;
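&lt;p&gt;A hypothetical sketch of what that verification looks like on content-addressed storage. Every reachable snapshot root lists the addresses it references, so a sweep over the roots can prove that a purged address is gone everywhere (the names and shapes here are illustrative, not Datahike’s API):&lt;/p&gt;

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of erasure verification on a content-addressed store
// (not Datahike's actual API). Each snapshot root lists the addresses it
// references; after a purge, a sweep over all reachable roots can prove the
// address no longer exists in any reachable snapshot.
class PurgeCheck {
    // keys: addresses of snapshot roots; values: chunk addresses they reference
    static boolean isErased(Map<String, Set<String>> reachableRoots, String purgedAddress) {
        return reachableRoots.values().stream()
                .noneMatch(refs -> refs.contains(purgedAddress));
    }
}
```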
&lt;h2 id=&quot;production-ready-performance&quot;&gt;Production-ready performance&lt;/h2&gt;
&lt;p&gt;Versioning and immutability don’t come at the cost of query speed. Stratum uses SIMD-accelerated execution via the &lt;a href=&quot;https://openjdk.org/jeps/469&quot;&gt;Java Vector API&lt;/a&gt;, fused filter-aggregate pipelines, and zone-map pruning to skip entire data chunks. It runs standard OLAP benchmarks competitively with engines like &lt;a href=&quot;https://duckdb.org/&quot;&gt;DuckDB&lt;/a&gt; — while also providing branching, time travel, and content addressing that pure analytical engines don’t.&lt;/p&gt;
&lt;p&gt;Full SQL is supported via the PostgreSQL wire protocol: aggregates, window functions, joins, CTEs, subqueries. Connect with psql, JDBC, DBeaver, or any PostgreSQL-compatible client. See the &lt;a href=&quot;/notes/stratum-analytics-engine&quot;&gt;Stratum technical deep-dive&lt;/a&gt; for architecture details and &lt;a href=&quot;https://github.com/replikativ/stratum/blob/main/doc/benchmarks.md&quot;&gt;benchmark methodology&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;getting-started&quot;&gt;Getting started&lt;/h2&gt;
&lt;p&gt;Stratum runs as an in-process Clojure library or a standalone SQL server. Requires JDK 21+.&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;{:deps {org.replikativ/stratum {:mvn/version &quot;RELEASE&quot;}}}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;{:deps {org.replikativ/stratum {:mvn/version &quot;RELEASE&quot;}}}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;If you’re building analytical infrastructure in a regulated environment — or exploring how versioned data can simplify your compliance story — &lt;a href=&quot;mailto:contact@datahike.io&quot;&gt;get in touch&lt;/a&gt;. We work with teams in finance, insurance, and healthcare to design data architectures where auditability is built in, not bolted on.&lt;/p&gt;
&lt;/div&gt;</content:encoded></item><item><title>Why Search Needs Versioning</title><link>https://datahike.io/notes/why-search-needs-versioning/</link><guid isPermaLink="true">https://datahike.io/notes/why-search-needs-versioning/</guid><description>Immutable search indexes for reproducible retrieval and systems that can explain themselves.</description><pubDate>Mon, 05 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;div class=&quot;container page prose&quot;&gt;
&lt;h1 id=&quot;why-search-needs-versioning&quot;&gt;Why Search Needs Versioning&lt;/h1&gt;
&lt;p&gt;Search indexes are almost always mutable. You insert documents or embeddings, update them, delete them - the index reflects current state. This is fine when you’ll never need to query or audit past states, but breaks when retrieval feeds into reasoning.&lt;/p&gt;
&lt;p&gt;Once search results enter an LLM’s context window or guide an agent’s action, the index is effectively memory. If that memory overwrites itself on every update, you can’t reproduce or audit past retrieval results. This applies to vector search and full-text search alike.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The problem&lt;/h2&gt;
&lt;p&gt;A retrieval-augmented system in production: embeddings are indexed, queries retrieve context, responses are generated. A week later, someone asks why the system returned a particular result. In a mutable index, there’s no answer. The index changed. The embedding model may have been updated. The retrieval state that produced that response no longer exists.&lt;/p&gt;
&lt;p&gt;This isn’t theoretical. Any system where retrieval influences outcomes - recommendations, classifications, agent decisions - has this problem. The less human oversight there is, the more it matters.&lt;/p&gt;
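&lt;p&gt;With a versioned index, the remedy is to pin provenance at retrieval time: record the index commit alongside each response, so the exact retrieval state can be reopened later. A minimal illustrative sketch (the record shape here is hypothetical, not a Proximum API):&lt;/p&gt;

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: log each retrieval together with the index commit it
// ran against, so any past response can be traced back to the exact index
// state that produced it. (Record and field names are hypothetical.)
class RetrievalLog {
    record Retrieval(String query, UUID indexCommit, List<String> resultIds) {}

    private final Map<String, Retrieval> byResponseId = new ConcurrentHashMap<>();

    void record(String responseId, Retrieval r) { byResponseId.put(responseId, r); }

    // Which index commit produced this response? With a versioned index,
    // this commit id is enough to reopen the exact state and replay the query.
    UUID commitFor(String responseId) { return byResponseId.get(responseId).indexCommit(); }
}
```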
&lt;h2 id=&quot;proximum-git-semantics-for-search&quot;&gt;Proximum: git semantics for search&lt;/h2&gt;
&lt;p&gt;Proximum applies the same copy-on-write model that powers Datahike and Clojure’s persistent data structures to HNSW (Hierarchical Navigable Small World) vector indexes. Every insert returns a new index version. Previous versions remain valid and queryable:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;java&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Create and populate&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;var&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idx &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ProximumVectorStore.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;builder&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;()&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    .&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;dimensions&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1536&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    .&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;storagePath&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;/var/data/vectors&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    .&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;build&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;idx.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;addBatch&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(embeddings, ids);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;idx.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;sync&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;().&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;join&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;();  &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// persist and wait for completion&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;UUID v1 &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idx.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;getCommitId&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;idx.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;addBatch&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(moreEmbeddings, moreIds);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;idx.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;sync&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;().&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;join&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;UUID v2 &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idx.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;getCommitId&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Both versions remain searchable&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;var&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; storeConfig &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; Map.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;of&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;backend&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;:file&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;path&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;/var/data/vectors&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;var&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; oldIndex &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ProximumVectorStore.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;connectCommit&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(storeConfig, v1);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;oldIndex.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;search&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(query, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);  &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// original state&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;idx.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;search&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(query, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);       &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// current state&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;branch()&lt;/code&gt; operation is O(1) - it shares structure with the original. Two branches diverge independently without copying data. This makes A/B testing embeddings, bisecting regressions, and maintaining reproducible baselines cheap.&lt;/p&gt;
&lt;h2 id=&quot;how-it-works&quot;&gt;How it works&lt;/h2&gt;
&lt;p&gt;The core data structure is a &lt;code&gt;PersistentEdgeIndex&lt;/code&gt;: chunked copy-on-write arrays that hold HNSW graph edges. Layer 0 (the dense bottom layer) uses fixed-size chunks; upper layers use sparse per-node arrays. When you modify the graph, only affected chunks are copied. Unchanged structure is shared.&lt;/p&gt;
&lt;p&gt;Vectors themselves live in a memory-mapped store backed by Konserve, so the same index can be persisted to disk, S3, or any pluggable backend. The combination gives you SIMD-accelerated search with full version history and portable storage.&lt;/p&gt;
&lt;h2 id=&quot;scriptum-git-semantics-for-full-text-search&quot;&gt;Scriptum: git semantics for full-text search&lt;/h2&gt;
&lt;p&gt;The same versioning principles apply to traditional full-text search. Scriptum brings copy-on-write branching to Apache Lucene by sharing immutable segment files across branches:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;java&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Create an index&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;BranchIndexWriter main &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; BranchIndexWriter.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;create&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Path.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;of&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;/var/data/search&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;), &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;main&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;Document doc &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; new&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; Document&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;doc.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;add&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;new&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; TextField&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;content&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;searchable text&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, Field.Store.YES));&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;main.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;addDocument&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(doc);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;main.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;commit&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;Initial index&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Fork a branch (3-5ms regardless of index size)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;BranchIndexWriter experiment &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; main.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;fork&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;experiment&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Branches evolve independently&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;experiment.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;addDocument&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(anotherDoc);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;experiment.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;commit&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;Experimental changes&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Time travel - query past state&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;DirectoryReader historical &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; main.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;openReaderAt&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Merge back when ready&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;main.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;mergeFrom&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(experiment);&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Forking is near-instant because Scriptum copies only new data - segment files are shared read-only. New writes create branch-specific segments. The &lt;code&gt;BranchedDirectory&lt;/code&gt; overlay pattern routes reads to the base index while capturing writes in the branch overlay.&lt;/p&gt;
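&lt;p&gt;The overlay idea itself fits in a few lines (this is an illustration of the pattern, not Scriptum’s actual &lt;code&gt;BranchedDirectory&lt;/code&gt;): reads fall through to the shared read-only base unless the branch has written its own file, and writes land only in the branch overlay:&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the overlay pattern (not Scriptum's actual
// BranchedDirectory). Reads fall through to the shared read-only base unless
// the branch has written its own file; writes land only in the overlay, so
// the base is never modified by any branch.
class OverlayDirectory {
    final Map<String, byte[]> base;                       // shared, read-only segments
    final Map<String, byte[]> overlay = new HashMap<>();  // branch-local writes

    OverlayDirectory(Map<String, byte[]> base) { this.base = base; }

    void write(String name, byte[] data) { overlay.put(name, data); }

    byte[] read(String name) {
        byte[] local = overlay.get(name);
        return local != null ? local : base.get(name); // fall through to base
    }
}
```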
&lt;p&gt;This gives you the same capabilities for keyword search, faceted navigation, and document retrieval that Proximum provides for vector search: reproducible queries, safe experimentation, and full audit history.&lt;/p&gt;
&lt;h2 id=&quot;what-this-enables&quot;&gt;What this enables&lt;/h2&gt;
&lt;p&gt;With versioned indexes you can run the same query against the same index state and get the same results, which makes evaluation of embedding models and ranking algorithms reproducible. You can fork an index to test a new chunking strategy or analyzer configuration without risking production state. You can query the index as it existed at any past instant to answer “what could the system have retrieved when it made that decision?” And because a snapshot is a value, you can hand it to any number of reader threads or processes without coordination or locking.&lt;/p&gt;
&lt;h2 id=&quot;the-cost&quot;&gt;The cost&lt;/h2&gt;
&lt;p&gt;Immutable indexes have write amplification: inserting a vector touches multiple graph edges, each potentially triggering chunk copies. Storage grows with history.&lt;/p&gt;
&lt;p&gt;In practice, this cost is amortized. You don’t create a snapshot for every vector added during a bulk load. The &lt;code&gt;PersistentEdgeIndex&lt;/code&gt; supports &lt;em&gt;transient&lt;/em&gt; mode - mutable during batch insert, immutable at the boundary. Snapshots are created only when a batch commits, and only those become visible to readers. The system can adaptively coarse-grain batches to balance throughput against snapshot granularity.&lt;/p&gt;
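&lt;p&gt;The transient pattern can be sketched as a mutable batch with an immutable commit boundary (a deliberate simplification; the real &lt;code&gt;PersistentEdgeIndex&lt;/code&gt; is considerably more involved):&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of transient-mode batching (the real PersistentEdgeIndex
// is far more sophisticated): mutate freely inside a batch, then freeze an
// immutable snapshot at the commit boundary. Only frozen snapshots become
// visible to readers, so bulk loads pay no per-insert copy cost.
class TransientBatch {
    private final List<float[]> pending = new ArrayList<>();

    // Cheap in-place mutation during bulk load; no copying here.
    void add(float[] vector) { pending.add(vector); }

    // The persistence boundary: everything added so far becomes one
    // immutable, reader-visible snapshot.
    List<float[]> commit() { return List.copyOf(pending); }
}
```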
&lt;p&gt;If your search needs to be reproducible and auditable, versioned indexes are a good fit. &lt;a href=&quot;/proximum&quot;&gt;Proximum&lt;/a&gt; handles vector search, &lt;a href=&quot;/scriptum&quot;&gt;Scriptum&lt;/a&gt; handles full-text. Both use the same copy-on-write approach.&lt;/p&gt;
&lt;/div&gt;</content:encoded></item><item><title>Why We Built Datahike</title><link>https://datahike.io/notes/why-we-built-datahike/</link><guid isPermaLink="true">https://datahike.io/notes/why-we-built-datahike/</guid><description>A personal story about functional values, long-lived systems, and the memory layer AI needs.</description><pubDate>Sat, 14 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;div class=&quot;container page prose&quot;&gt;
&lt;h1 id=&quot;why-we-built-datahike&quot;&gt;Why We Built Datahike&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;February 2026&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I’ve been working toward this for over a decade. It started with &lt;a href=&quot;http://reluk.ca/project/Votorola/home-2013.html&quot;&gt;Votorola&lt;/a&gt; - collaborative liquid democracy software - where I first needed to distribute a memory model across systems. That led me to Clojure, which led me to a question that I’ve been chasing ever since: how do you build data infrastructure that doesn’t lose history?&lt;/p&gt;
&lt;p&gt;Most databases are designed for transactional business logic: process an order, update an account, move on. But many of the systems we’re building today are different. They run for weeks or months, accumulate knowledge, and need to reason about their own past. A database that overwrites state on every write doesn’t support that well.&lt;/p&gt;
&lt;p&gt;This is the story of why we built Datahike, and why I think immutable, versioned data is the right foundation for systems that need to last.&lt;/p&gt;
&lt;h2 id=&quot;the-problem-with-mutable-state&quot;&gt;The problem with mutable state&lt;/h2&gt;
&lt;p&gt;In 2013, I started &lt;a href=&quot;https://github.com/replikativ&quot;&gt;replikativ&lt;/a&gt; to explore distributed, cross-platform replication systems. The core challenge was always synchronization: how do you keep data consistent across nodes without losing the ability to reason about history? But the deeper I got, the more I realized the problem wasn’t distribution - it was mutability.&lt;/p&gt;
&lt;p&gt;When data changes in place, you lose the ability to ask “what did the system know last Tuesday?” You can’t fork an experiment, try something, and merge it back. You can’t audit what happened, because the evidence has been overwritten.&lt;/p&gt;
&lt;p&gt;In functional programming we solved this decades ago. Data structures are immutable - values don’t change, you get new values. Programs become easier to reason about and test. I kept wondering why databases didn’t work the same way.&lt;/p&gt;
&lt;h2 id=&quot;finding-the-pieces&quot;&gt;Finding the pieces&lt;/h2&gt;
&lt;p&gt;The answer, it turned out, was that they could - but the pieces weren’t assembled yet. &lt;a href=&quot;https://datomic.com&quot;&gt;Datomic&lt;/a&gt; had shown the way: immutable, versioned data with time travel. But Datomic was closed source and designed for centralized deployment. I wanted something open, distributed by design, and built for systems that live everywhere - from edge devices to cloud clusters.&lt;/p&gt;
&lt;p&gt;We needed the right combination of query engine, index structure, and persistence.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. A mature query engine&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Nikita Prokopov’s &lt;a href=&quot;https://github.com/tonsky/datascript&quot;&gt;DataScript&lt;/a&gt; provided this. It was an in-memory Datalog database with five years of development, a robust query engine, and a clean, well-designed codebase. The only problem: it was purely in-memory. No durability.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. A functional, persistent index&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We initially experimented with David Greenberg’s &lt;a href=&quot;https://github.com/datacrypt-project/hitchhiker-tree&quot;&gt;Hitchhiker Tree&lt;/a&gt;, which taught us a lot about immutable indexing. It combines B+ tree query performance with append-only write semantics - great for logs and write-heavy workloads. But database indices are read-dominated: the Hitchhiker Tree trades some read speed for write throughput, which wasn’t the right trade-off for our use case.&lt;/p&gt;
&lt;p&gt;So we extended &lt;a href=&quot;https://github.com/replikativ/persistent-sorted-set&quot;&gt;persistent-sorted-set&lt;/a&gt;, a functionally persistent sorted set optimized for database indices. It gives us excellent read performance while maintaining immutable semantics and efficient structural sharing. When you “update” the index, you don’t mutate nodes in place - you create new nodes that share structure with the old ones. The old version still exists, unchanged.&lt;/p&gt;
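&lt;p&gt;Plain Clojure’s persistent collections show the same semantics in miniature - this isn’t persistent-sorted-set itself, just a sketch of the structural-sharing behavior our index nodes rely on:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;;; &quot;Updating&quot; a persistent sorted set returns a new value that
;; shares structure with the old one; the old value is untouched.
(def v1 (sorted-set 1 2 3))
(def v2 (conj v1 4))

v1 ;; =&gt; #{1 2 3} - unchanged
v2 ;; =&gt; #{1 2 3 4} - shares structure with v1&lt;/code&gt;&lt;/pre&gt;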
&lt;p&gt;&lt;strong&gt;3. The glue to put them together&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is where Datahike came in. We forked DataScript, adjusted persistent-sorted-set, added storage backends (file, SQL, LMDB, S3, GCS and more via &lt;a href=&quot;https://github.com/replikativ/konserve&quot;&gt;Konserve&lt;/a&gt;), and kept going. &lt;a href=&quot;https://github.com/kordano&quot;&gt;Konrad Kühne&lt;/a&gt; and our former team at Lambdaforge UG contributed substantially in the early years - adding history indices, time travel support, and helping Datahike achieve temporal query parity with Datomic. Together we built out schema flexibility and the protocols that make Datahike extensible.&lt;/p&gt;
&lt;h2 id=&quot;the-realization-databases-should-be-values&quot;&gt;The realization: databases should be values&lt;/h2&gt;
&lt;p&gt;Here’s the thing that took me years to fully appreciate: in Datahike, a database is a value, not a service.&lt;/p&gt;
&lt;p&gt;In a traditional database, you connect to a server. The data changes between queries. You’re always interacting with “the database” as a mutable thing.&lt;/p&gt;
&lt;p&gt;In Datahike, you dereference a connection and get a database value: a snapshot frozen at a particular transaction. That value won’t change. You can pass it to a function. Store it. Compare it to another snapshot. Two threads reading the same database value always see the same thing - no locks, no coordination needed.&lt;/p&gt;
&lt;p&gt;This matters because it makes the database composable. You can hold a snapshot in a variable, hand it to a worker, serialize it, or compare two versions structurally. Read scaling becomes trivial: spin up more readers, not more database connections.&lt;/p&gt;
&lt;p&gt;But the real power is what this enables.&lt;/p&gt;
&lt;h2 id=&quot;git-semantics-for-data&quot;&gt;Git semantics for data&lt;/h2&gt;
&lt;p&gt;Once you have immutable snapshots, you can do things that are awkward or impossible with traditional databases:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Branching&lt;/strong&gt;: Fork a database, make changes in isolation, merge back when ready. Unlike git (which merges text files), database merges operate on datoms with application-defined conflict resolution. This enables feature branches for data migrations, parallel experiments with different schemas, and per-tenant forks sharing a common ancestor.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Time travel&lt;/strong&gt;: Query any past state. Not “last 7 days” - any specific instant. Diff two snapshots to see exactly what changed. Audit when a fact was added or retracted.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reproducibility&lt;/strong&gt;: Capture a snapshot, store it, query it later. Same snapshot always yields same results. This is essential for ML experiments, compliance systems, or anything that needs to explain its decisions.&lt;/p&gt;
&lt;h2 id=&quot;why-this-matters-for-ai&quot;&gt;Why this matters for AI&lt;/h2&gt;
&lt;p&gt;During my PhD, I developed inference systems that accumulate evidence over time. Probabilistic programs build up distributions, revise beliefs, maintain uncertainty. They need to fork hypotheses, evaluate alternatives, and keep track of the path that led to each conclusion. The database backing such a system needs to support that natively - not as a bolt-on.&lt;/p&gt;
&lt;p&gt;The same applies to any long-running system that accumulates knowledge: agent pipelines, compliance systems, scientific workflows. They all benefit from being able to fork state safely, roll back when something goes wrong, and answer “what did this system know when it made that decision?”&lt;/p&gt;
&lt;p&gt;Datahike provides this: knowledge survives restarts, you can fork and merge, every past state is queryable, and the same query on the same snapshot always returns the same result.&lt;/p&gt;
&lt;h2 id=&quot;what-weve-built&quot;&gt;What we’ve built&lt;/h2&gt;
&lt;p&gt;From those early experiments, Datahike has grown into more than just a database:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Core database&lt;/strong&gt;: Immutable Datalog with pluggable storage&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;/proximum&quot;&gt;Proximum&lt;/a&gt;&lt;/strong&gt;: Version-controlled vector indexing for semantic search&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;/scriptum&quot;&gt;Scriptum&lt;/a&gt;&lt;/strong&gt;: Git-like branching for full-text search (Apache Lucene extension)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/replikativ/yggdrasil&quot;&gt;Yggdrasil&lt;/a&gt;&lt;/strong&gt;: Protocols unifying branching semantics across storage systems&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each piece applies the same underlying idea: data should be immutable and versioned by default. We’re not done. Datalog is our starting point, and we’re working toward a broader programming model where persistent, versioned state is the default across distributed environments.&lt;/p&gt;
&lt;h2 id=&quot;where-were-going&quot;&gt;Where we’re going&lt;/h2&gt;
&lt;p&gt;I’m bootstrapping a company on top of Datahike. We’re looking for collaborators who want to push distributed immutable systems forward, and for early customers who need versioned data infrastructure in production.&lt;/p&gt;
&lt;p&gt;This work has always been collaborative. Konrad Kühne and our early team helped shape Datahike’s foundation. The broader open source community continues to push it forward through issues, PRs, and production deployments.&lt;/p&gt;
&lt;p&gt;If you’re building something where audit, reproducibility, or long-term memory matter, I’d like to hear about it.&lt;/p&gt;
&lt;p&gt;Christian Weilbach&lt;br&gt;
&lt;em&gt;Founder and maintainer&lt;/em&gt;&lt;/p&gt;
&lt;/div&gt;</content:encoded></item><item><title>Yggdrasil - Branching Protocols</title><link>https://datahike.io/notes/yggdrasil-unified-cow-protocols/</link><guid isPermaLink="true">https://datahike.io/notes/yggdrasil-unified-cow-protocols/</guid><description>A protocol stack that brings Git-like branching to any storage system.</description><pubDate>Sat, 24 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;div class=&quot;container page prose&quot;&gt;
&lt;h1 id=&quot;yggdrasil-branching-protocols&quot;&gt;Yggdrasil: Branching Protocols&lt;/h1&gt;
&lt;p&gt;What if every storage system spoke the same branching language? Yggdrasil is a protocol stack that brings Git-like semantics (snapshots, branches, merges, history) to heterogeneous storage backends.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;In Norse mythology, Yggdrasil is the World Tree connecting nine realms. This library connects storage systems under one unified API.&lt;/em&gt;&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The problem&lt;/h2&gt;
&lt;p&gt;Modern data systems are fragmented. Your vector index, your database, your filesystem, your container images - each has its own versioning model (or none at all). When you need reproducible pipelines across these systems, you’re left stitching together incompatible abstractions.&lt;/p&gt;
&lt;p&gt;Consider an ML training pipeline:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Datasets versioned in LakeFS&lt;/li&gt;
&lt;li&gt;Model weights on a filesystem&lt;/li&gt;
&lt;li&gt;Embeddings in a vector store&lt;/li&gt;
&lt;li&gt;Metadata in a database&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each system has different semantics for “create a snapshot” or “roll back to yesterday.” Coordinating them requires custom glue code that’s brittle and hard to reason about.&lt;/p&gt;
&lt;h2 id=&quot;the-solution-shared-protocols&quot;&gt;The solution: shared protocols&lt;/h2&gt;
&lt;p&gt;Yggdrasil defines a layered protocol stack that any storage system can implement. All operations use value semantics - mutating operations return new system values rather than modifying anything in place.&lt;/p&gt;
&lt;table style=&quot;width: 100%; border-collapse: collapse; margin: 1.5rem 0;&quot;&gt;
  &lt;thead&gt;
    &lt;tr style=&quot;text-align: left; border-bottom: 1px solid var(--border);&quot;&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Protocol&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 0;&quot;&gt;Operations&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Snapshotable&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;snapshot-id&lt;/code&gt;, &lt;code&gt;parent-ids&lt;/code&gt;, &lt;code&gt;as-of&lt;/code&gt;, &lt;code&gt;snapshot-meta&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Branchable&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;branches&lt;/code&gt;, &lt;code&gt;branch!&lt;/code&gt;, &lt;code&gt;checkout&lt;/code&gt;, &lt;code&gt;delete-branch!&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Graphable&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;history&lt;/code&gt;, &lt;code&gt;ancestors&lt;/code&gt;, &lt;code&gt;common-ancestor&lt;/code&gt;, &lt;code&gt;commit-graph&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Mergeable&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;merge!&lt;/code&gt;, &lt;code&gt;conflicts&lt;/code&gt;, &lt;code&gt;diff&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Overlayable&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;overlay&lt;/code&gt;, &lt;code&gt;advance!&lt;/code&gt;, &lt;code&gt;merge-down!&lt;/code&gt;, &lt;code&gt;discard!&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Watchable&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;watch!&lt;/code&gt;, &lt;code&gt;unwatch!&lt;/code&gt; - receives typed events on commit, branch, checkout&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;GarbageCollectable&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;gc-roots&lt;/code&gt;, &lt;code&gt;gc-sweep!&lt;/code&gt; - coordinated cross-system mark-and-sweep&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Addressable &lt;em&gt;(optional)&lt;/em&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;working-path&lt;/code&gt; - filesystem path for current branch (Git, ZFS, Btrfs, OverlayFS)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Committable &lt;em&gt;(optional)&lt;/em&gt;&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;&lt;code&gt;commit!&lt;/code&gt; - explicit commit, separated from snapshot reads&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;When multiple systems implement these protocols, you can compose them. Fork a database and a vector index together. Merge changes across both atomically. Query historical state consistently.&lt;/p&gt;
&lt;h2 id=&quot;twelve-adapters&quot;&gt;Twelve adapters&lt;/h2&gt;
&lt;table style=&quot;width: 100%; border-collapse: collapse; margin: 1.5rem 0;&quot;&gt;
  &lt;thead&gt;
    &lt;tr style=&quot;text-align: left; border-bottom: 1px solid var(--border);&quot;&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Adapter&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;System&lt;/th&gt;
      &lt;th style=&quot;padding: 0.5rem 0;&quot;&gt;Branching model&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Git&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Version control&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Native branches/commits&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;ZFS&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Filesystem&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Snapshots + clones&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Btrfs&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Filesystem&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Subvolumes + snapshots&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;OverlayFS&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Filesystem&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Layered directories&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Podman&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Containers&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Image layers&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;IPFS&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;P2P storage&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Content-addressed commits + IPNS branches&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Iceberg&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Table format&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Snapshots + native branches&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Datahike&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Database&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Native COW&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;LakeFS&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Data lake&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Git-like branches&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Dolt&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;SQL database&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Git-like branches&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Scriptum&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Full-text search&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Lucene segment sharing&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr style=&quot;border-bottom: 1px solid var(--border-subtle);&quot;&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0; font-weight: 600;&quot;&gt;Proximum&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 1rem 0.5rem 0;&quot;&gt;Vector search&lt;/td&gt;
      &lt;td style=&quot;padding: 0.5rem 0;&quot;&gt;Merkle-verified snapshots&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&quot;compositesystem-branching-multiple-systems-as-one&quot;&gt;CompositeSystem: branching multiple systems as one&lt;/h2&gt;
&lt;p&gt;The most significant recent addition is &lt;code&gt;CompositeSystem&lt;/code&gt; - a fiber product (pullback) over a shared branch space. Given systems A and B, the composite is the pair (A, B), with both always on the same branch. All protocol operations apply componentwise.&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;require(&apos;[yggdrasil.composite :as composite] &apos;[yggdrasil.protocols :as p])

;; Compose a database and a search index
def sys: composite/composite([datahike-sys scriptum-sys] :name &quot;my-app&quot; :branch :main :store-path &quot;/var/lib/yggdrasil/composite&quot;)

;; All protocol operations work on both systems simultaneously
def branched: sys .&gt; p/branch!(:experiment) .&gt; p/checkout(:experiment)

;; Commit both atomically - gets a deterministic composite snapshot-id
def committed: p/commit!(branched &quot;experimental run&quot;)

;; Merge back
def merged: committed .&gt; p/checkout(:main) .&gt; p/merge!(:experiment)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(require &apos;[yggdrasil.composite :as composite]
         &apos;[yggdrasil.protocols :as p])

;; Compose a database and a search index
(def sys (composite/composite [datahike-sys scriptum-sys]
           :name &quot;my-app&quot;
           :branch :main
           :store-path &quot;/var/lib/yggdrasil/composite&quot;))  ; optional persistence

;; All protocol operations work on both systems simultaneously
(def branched (-&gt; sys
                  (p/branch! :experiment)
                  (p/checkout :experiment)))

;; Commit both atomically - gets a deterministic composite snapshot-id
(def committed (p/commit! branched &quot;experimental run&quot;))

;; Merge back
(def merged (-&gt; committed
                (p/checkout :main)
                (p/merge! :experiment)))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;code&gt;snapshot-id&lt;/code&gt; on a composite returns a deterministic UUID derived from the combined state of all sub-systems - the same combination always yields the same ID. History, conflicts, and GC roots are all computed across the full set.&lt;/p&gt;
&lt;p&gt;Passing &lt;code&gt;:store-path&lt;/code&gt; persists the composite history via a persistent-sorted-set B-tree backed by Konserve, so history survives process restarts.&lt;/p&gt;
&lt;h2 id=&quot;workspace-hlc-coordinated-multi-system-operations&quot;&gt;Workspace: HLC-coordinated multi-system operations&lt;/h2&gt;
&lt;p&gt;The workspace layer adds Hybrid Logical Clock (HLC) coordination across independently managed systems. This enables temporal queries that span system boundaries.&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;require(&apos;[yggdrasil.workspace :as ws])

def w: ws/create-workspace({:store-path &quot;/var/lib/yggdrasil/my-app&quot;})

;; Manage systems - auto-installs commit hooks
;; Datahike uses d/listen (immediate); others fall back to polling
ws/manage!(w datahike-sys)
ws/manage!(w git-sys)

;; Query world state at any wall-clock time
let [world ws/as-of-time(w some-past-date.getTime())]:
  doseq [[[system-id branch] entry] world]:
    println(system-id branch &quot;was at snapshot&quot; :snapshot-id(entry))
  end
end&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(require &apos;[yggdrasil.workspace :as ws])

(def w (ws/create-workspace {:store-path &quot;/var/lib/yggdrasil/my-app&quot;}))

;; Manage systems - auto-installs commit hooks
;; Datahike uses d/listen (immediate); others fall back to polling
(ws/manage! w datahike-sys)
(ws/manage! w git-sys)

;; Query world state at any wall-clock time
(let [world (ws/as-of-time w (.getTime some-past-date))]
  (doseq [[[system-id branch] entry] world]
    (println system-id branch &quot;was at snapshot&quot; (:snapshot-id entry))))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Each commit in the registry carries an HLC timestamp. &lt;code&gt;as-of-time&lt;/code&gt; scans the index and returns the snapshot each system was at for any given moment - across all managed systems consistently.&lt;/p&gt;
&lt;h2 id=&quot;typed-diffs&quot;&gt;Typed diffs&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;diff&lt;/code&gt; returns system-specific records:&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;;; GitDiff: {:snapshot-a, :snapshot-b, :stat, :patch, :files [{:status :added/:modified/:deleted, :path}]}
;; DatahikeDiff: {:from, :to, :added [datoms], :removed [datoms], :summary {:added-datoms n, ...}}
;; DiffError: {:from, :to, :error}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;;; GitDiff: {:snapshot-a, :snapshot-b, :stat, :patch, :files [{:status :added/:modified/:deleted, :path}]}
;; DatahikeDiff: {:from, :to, :added [datoms], :removed [datoms], :summary {:added-datoms n, ...}}
;; DiffError: {:from, :to, :error}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Callers can pattern-match on record type for system-specific handling.&lt;/p&gt;
&lt;h2 id=&quot;compliance-testing&quot;&gt;Compliance testing&lt;/h2&gt;
&lt;p&gt;Every adapter passes the same compliance test suite, which checks the same behavioral contract across all systems.&lt;/p&gt;
&lt;div class=&quot;code-dual&quot;&gt;
&lt;div class=&quot;code-dual-tabs&quot;&gt;
&lt;button class=&quot;code-dual-tab active&quot; data-lang=&quot;superficie&quot;&gt;Superficie&lt;/button&gt;
&lt;button class=&quot;code-dual-tab&quot; data-lang=&quot;clojure&quot;&gt;Clojure&lt;/button&gt;
&lt;a class=&quot;code-dual-what&quot; href=&quot;https://github.com/replikativ/superficie&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;What is this syntax?&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel active&quot; data-lang=&quot;superficie&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-superficie&quot;&gt;compliance/run-compliance-tests({:create-system fn []:
  my-adapter/init!(config)
end
                                 :mutate fn [sys]:
  ...
end
                                 :commit fn [sys msg]:
  ...
end
                                 :close! fn [sys]:
  ...
end})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&quot;code-dual-panel&quot; data-lang=&quot;clojure&quot;&gt;
&lt;pre&gt;&lt;code class=&quot;language-clojure&quot;&gt;(compliance/run-compliance-tests
  {:create-system (fn [] (my-adapter/init! config))
   :mutate        (fn [sys] ...)
   :commit        (fn [sys msg] ...)
   :close!        (fn [sys] ...)})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h2 id=&quot;why-this-matters&quot;&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;The practical value comes from being able to treat heterogeneous systems as one versioned unit. An ML pipeline can version its datasets, model weights, and embeddings together under one composite snapshot, making any training run fully reproducible. An agent system can fork its complete environment - database, vector store, working directory - per agent, merge successful experiments back, and discard failures without cleanup. A test suite can fork production state across all systems in milliseconds.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;as-of-time&lt;/code&gt; query is particularly useful for audit: “what was the exact state of every system when this decision was made?” answered across heterogeneous backends with causal ordering.&lt;/p&gt;
&lt;h2 id=&quot;getting-started&quot;&gt;Getting started&lt;/h2&gt;
&lt;p&gt;See the &lt;a href=&quot;https://github.com/replikativ/yggdrasil&quot;&gt;GitHub repository&lt;/a&gt; for installation and adapter-specific setup. Licensed under Apache 2.0.&lt;/p&gt;
&lt;h2 id=&quot;part-of-the-datahike-ecosystem&quot;&gt;Part of the Datahike ecosystem&lt;/h2&gt;
&lt;p&gt;Yggdrasil is the protocol layer that connects:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/replikativ/datahike&quot;&gt;Datahike&lt;/a&gt; - Immutable Datalog database&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/replikativ/proximum&quot;&gt;Proximum&lt;/a&gt; - Version-controlled vector search&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/replikativ/scriptum&quot;&gt;Scriptum&lt;/a&gt; - Branching for Apache Lucene&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/stratum&quot;&gt;Stratum&lt;/a&gt; - Columnar SQL with CoW snapshots&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Yggdrasil provides the shared vocabulary that lets these systems branch together.&lt;/p&gt;
&lt;/div&gt;</content:encoded></item></channel></rss>