Versioned Analytics for Regulated Industries

Financial regulation — Basel III, MiFID II, Solvency II, SOX — requires that risk calculations, credit decisions, and compliance reports be reproducible. Not just the code, but the exact data state that produced them. When an auditor asks “show me the data behind this risk number from six months ago,” the answer can’t be “we’ll try to reconstruct it.”

Version control solved this problem for source code decades ago. But analytical data infrastructure never caught up. Data warehouses don’t version tables. Temporal tables track row-level changes but don’t compose across tables or systems. Manual snapshots are expensive, fragile, and don’t support branching for scenario analysis.

Stratum brings the git model to analytical data: every write creates an immutable, content-addressed snapshot. Old states remain accessible by commit UUID. Branches are O(1). And via Yggdrasil, you can tie entity databases, analytical datasets, and search indices into a single consistent, auditable snapshot.

The problem

A typical analytical pipeline at a regulated institution:

Transactional data flows into a warehouse (nightly ETL or streaming)
Analysts run GROUP BY / SUM / STDDEV queries for risk models and reports
Results feed regulatory submissions — capital adequacy, liquidity coverage, market risk
Months later, an auditor asks: “What data produced risk report X on date Y?”

Step 4 is where things break. The warehouse has been mutated since then. Maybe there’s a backup, maybe not. Reconstructing the exact state requires replaying ETL from source systems — if those logs still exist.

Even if you can reconstruct the data, you can’t prove it’s the same data. There’s no cryptographic link between the report and the state that produced it. The best you can offer is procedural trust: “our backup process is reliable, and we believe this is what the data looked like.” That’s a weak foundation for regulatory compliance.

Immutable snapshots as audit anchors

With Stratum, every table is a copy-on-write value. Writes create new snapshots; old snapshots remain addressable by commit UUID or branch name. The underlying storage is a content-addressed Merkle tree — each snapshot’s identity is derived from a hash of its data, providing a cryptographic chain of custody from report to source.

What is this syntax?

require('[stratum.api :as st])

;; Load the current production state
def trades: st/load(store "trades" {:branch "production"})

;; Run today's risk calculation
def risk-report: st/q({:from trades, :group [:desk :currency], :agg [[:sum :notional] [:stddev :pnl] [:count]]})

;; The commit UUID is your audit anchor — store it alongside the report
;; Six months later, reproduce exactly:
def historical-trades: st/load(store "trades" {:as-of #uuid "a1b2c3d4-..."})

def historical-report: st/q({:from historical-trades, :group [:desk :currency], :agg [[:sum :notional] [:stddev :pnl] [:count]]})
;; Identical results, guaranteed by content addressing

(require '[stratum.api :as st])

;; Load the current production state
(def trades (st/load store "trades" {:branch "production"}))

;; Run today's risk calculation
(def risk-report
  (st/q {:from trades
         :group [:desk :currency]
         :agg [[:sum :notional] [:stddev :pnl] [:count]]}))

;; The commit UUID is your audit anchor — store it alongside the report
;; Six months later, reproduce exactly:
(def historical-trades
  (st/load store "trades" {:as-of #uuid "a1b2c3d4-..."}))

(def historical-report
  (st/q {:from historical-trades
         :group [:desk :currency]
         :agg [[:sum :notional] [:stddev :pnl] [:count]]}))
;; Identical results, guaranteed by content addressing

Or via SQL — connect any PostgreSQL client:

-- Today's report
SELECT desk, currency, SUM(notional), STDDEV(pnl), COUNT(*)
FROM trades GROUP BY desk, currency;

-- Historical report: same query, different snapshot
-- resolved server-side via branch/commit configuration

Once committed, data cannot be modified — every state is a value, addressable by its content hash. Historical snapshots load lazily from storage on demand, so keeping years of history doesn’t mean paying for it in memory. And because snapshots are immutable values, multiple analysts can query the same or different points in time concurrently without coordination or locks.

Scenario analysis with branching

Beyond audit compliance, regulated institutions need scenario analysis. Basel III stress testing requires banks to evaluate capital adequacy under hypothetical adverse conditions — equity drawdowns, interest rate shocks, credit spread widening. Traditional approaches involve copying production data into staging environments, running scenarios, comparing results, and cleaning up. That process is slow, expensive, and error-prone.

With copy-on-write branching, forking a dataset is O(1) regardless of size. A 100-million-row table branches in microseconds because the fork is just a new root pointer into the shared tree. Only chunks that are actually modified get copied.

What is this syntax?

;; Fork production data for stress testing — O(1) regardless of table size
def stress-scenario: st/fork(trades)

;; Apply adverse conditions — only modified chunks are copied
;; e.g. via SQL: UPDATE trades SET price = price * 0.7
;;               WHERE asset_class = 'equity'

;; Compare risk metrics: production vs stressed
def baseline-risk: st/q({:from trades, :group [:desk], :agg [[:stddev :pnl] [:sum :notional]]})

def stressed-risk: st/q({:from stress-scenario, :group [:desk], :agg [[:stddev :pnl] [:sum :notional]]})

;; Run as many scenarios as needed — each is an independent branch
;; Baseline, adverse, severely adverse, custom scenarios
;; all sharing unmodified data via structural sharing

;; Fork production data for stress testing — O(1) regardless of table size
(def stress-scenario (st/fork trades))

;; Apply adverse conditions — only modified chunks are copied
;; e.g. via SQL: UPDATE trades SET price = price * 0.7
;;               WHERE asset_class = 'equity'

;; Compare risk metrics: production vs stressed
(def baseline-risk
  (st/q {:from trades
         :group [:desk]
         :agg [[:stddev :pnl] [:sum :notional]]}))

(def stressed-risk
  (st/q {:from stress-scenario
         :group [:desk]
         :agg [[:stddev :pnl] [:sum :notional]]}))

;; Run as many scenarios as needed — each is an independent branch
;; Baseline, adverse, severely adverse, custom scenarios
;; all sharing unmodified data via structural sharing

Each branch is fully isolated: modifications to the stress scenario can’t touch production data. You can maintain dozens of concurrent scenarios without multiplying storage costs — they share all unmodified data. When you stop referencing a branch, mark-and-sweep GC reclaims the storage. No staging environments, no cleanup scripts.

This also applies to model validation. When a risk model is updated, you can run the new model against historical snapshots and compare its outputs to the original model’s results — same data, different code, verifiable divergence.

Cross-system consistency

A real regulatory pipeline isn’t just one analytical table. Entity data (customers, counterparties, legal entities) lives in a transactional database. Analytical views (positions, P&L, exposures) live in a columnar engine. Compliance documents and communications live in a search index. For an audit to be meaningful, all of these need to be at the same point in time.

Yggdrasil provides a shared branching protocol across these heterogeneous systems. You can compose a Datahike entity database, a Stratum analytical dataset, and a Scriptum search index into a single composite system — branching, snapshotting, and time-traveling all of them together.

What is this syntax?

require('[yggdrasil.core :as ygg])

;; Compose entity database + analytics + search into one system
def system: ygg/composite-system({:entities datahike-conn, :analytics stratum-store, :search scriptum-index})

;; Branch the entire system for an investigation
ygg/branch!(system "investigation-2026-Q1")

;; Every component is now at the same logical point in time
;; Query across all three with a single consistent snapshot

(require '[yggdrasil.core :as ygg])

;; Compose entity database + analytics + search into one system
(def system
  (ygg/composite-system
    {:entities datahike-conn    ;; customer records, counterparties
     :analytics stratum-store   ;; trade data, positions, P&L
     :search scriptum-index}))  ;; compliance documents, communications

;; Branch the entire system for an investigation
(ygg/branch! system "investigation-2026-Q1")

;; Every component is now at the same logical point in time
;; Query across all three with a single consistent snapshot

When an auditor needs the full picture — the trade data, the customer entity that placed the trade, and the compliance documents reviewed at the time — they get a single consistent view across all systems, tied to one branch identifier. No manual coordination, no hoping the timestamps line up.

Compliance lifecycle

Immutable systems raise an obvious question: what about GDPR right-to-erasure, or data retention policies that require deletion?

Immutability doesn’t mean data can never be removed — it means deletion is explicit and verifiable rather than implicit and unauditable. The Datahike ecosystem supports purge operations that remove specific data from all indices and all historical snapshots. Mark-and-sweep garbage collection, coordinated across systems via Yggdrasil, reclaims storage from unreachable snapshots.

This is actually a stronger compliance story than mutable databases offer. In a mutable system, you DELETE a row and trust that the storage layer eventually overwrites it — but you can’t prove it’s gone from backups, replicas, or caches. With explicit purge on content-addressed storage, you can verify that the data no longer exists in any reachable snapshot.

Production-ready performance

Versioning and immutability don’t come at the cost of query speed. Stratum uses SIMD-accelerated execution via the Java Vector API, fused filter-aggregate pipelines, and zone-map pruning to skip entire data chunks. It runs standard OLAP benchmarks competitively with engines like DuckDB — while also providing branching, time travel, and content addressing that pure analytical engines don’t.

Full SQL is supported via the PostgreSQL wire protocol: aggregates, window functions, joins, CTEs, subqueries. Connect with psql, JDBC, DBeaver, or any PostgreSQL-compatible client. See the Stratum technical deep-dive for architecture details and benchmark methodology.

Getting started

Stratum runs as an in-process Clojure library or a standalone SQL server. Requires JDK 21+.

What is this syntax?

{:deps {org.replikativ/stratum {:mvn/version "RELEASE"}}}

{:deps {org.replikativ/stratum {:mvn/version "RELEASE"}}}

If you’re building analytical infrastructure in a regulated environment — or exploring how versioned data can simplify your compliance story — get in touch. We work with teams in finance, insurance, and healthcare to design data architectures where auditability is built in, not bolted on.