Anomaly Detection Belongs in Your Database

Every analytical database can aggregate, filter, and join. None of them can tell you “something is wrong with this data” as a first-class operation.

The standard workflow today: query your warehouse, serialize millions of rows into a DataFrame, import scikit-learn, fit an IsolationForest, write results back. You now maintain two systems, two runtimes, and a serialization boundary that adds seconds of latency per round-trip. For a fraud detection pipeline running against live transactions, those seconds matter. For a data engineer who just wants to flag outliers in a SELECT statement, the entire Python detour is unnecessary friction.

We built anomaly detection directly into Stratum — not as a UDF shim that calls Python under the hood, but as a native SIMD-accelerated implementation that runs inside the query engine. Train a model, score your data, all from SQL — no Python, no Clojure, no external runtime.

*Infographic: the Python export pipeline (seconds of latency, 2× memory) vs. Stratum's in-database approach (6 microseconds per transaction), and how isolation forests detect anomalies by isolating outliers in fewer tree splits.*

SELECT * FROM transactions
WHERE ANOMALY_SCORE('fraud_model') > 0.7;

No data leaves the database. No serialization. The query planner pushes down predicates and prunes chunks before the model ever sees a row. Scoring a single transaction takes 6 microseconds. A batch of 1,000 incoming transactions: 1.6 milliseconds. That’s fast enough to sit in the hot path of a payment gateway — not as a batch job that runs after the fact, but as a synchronous check before the transaction clears.

Why isolation forests

Most “anomaly detection in SQL” tutorials teach you to compute z-scores: (value - AVG(value)) / STDDEV(value) > 3. This works for Gaussian-distributed single columns. It fails everywhere else.

Real anomalies are multivariate. A transaction amount of $500 is normal. A frequency of 20 per hour is normal. Both together, at 3am, to a merchant in a country where the cardholder has never transacted — that’s the signal. Z-scores can’t see it. Neither can IQR-based methods or simple threshold rules. You need a model that captures the joint structure of your data.

Isolation forests (Liu, Ting & Zhou, 2008) take a fundamentally different approach. Instead of modeling what “normal” looks like — a density estimate, a distribution fit, a cluster boundary — they directly measure how easy it is to isolate a point from everything else. Build a tree of random splits across random features. Anomalous points, being few and different, get isolated in fewer splits. Normal points, packed into dense regions, require many splits to separate.
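The mechanism is easy to see in a toy sketch (illustrative only, not Stratum's implementation — one feature, one partition at a time): repeatedly pick a random split point and keep the side containing the target. An outlier ends up alone after far fewer splits than a point in the dense bulk.

```java
import java.util.Arrays;
import java.util.Random;

public class IsolationSketch {
    // Count random splits until `target` is alone in its partition.
    // Fewer splits = easier to isolate = more anomalous.
    static int isolationDepth(double[] values, double target, Random rnd) {
        double[] cur = values;
        int depth = 0;
        while (cur.length > 1 && depth < 64) {
            double min = Arrays.stream(cur).min().getAsDouble();
            double max = Arrays.stream(cur).max().getAsDouble();
            if (min == max) break;                  // identical values: cannot split further
            final double split = min + rnd.nextDouble() * (max - min);
            final boolean left = target < split;
            cur = Arrays.stream(cur)                // keep only the side the target fell on
                        .filter(v -> left ? v < split : v >= split)
                        .toArray();
            depth++;
        }
        return depth;
    }

    public static void main(String[] args) {
        double[] data = {10, 11, 12, 13, 14, 15, 16, 17, 200};
        double normal = 0, outlier = 0;
        int trials = 200;
        for (int t = 0; t < trials; t++) {
            normal  += isolationDepth(data, 13, new Random(t));
            outlier += isolationDepth(data, 200, new Random(t + 10_000));
        }
        System.out.printf("avg splits: normal=%.1f outlier=%.1f%n",
                          normal / trials, outlier / trials);
        System.out.println("outlier isolates faster: " + (outlier < normal));
    }
}
```

The real algorithm averages this depth over an ensemble of trees and across randomly chosen features, which is what makes it multivariate.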

The properties that make this algorithm uniquely suited to a columnar database:

No assumptions. Z-scores assume Gaussian distributions. DBSCAN assumes density clusters. Isolation forests are non-parametric — they work on any distribution shape, any number of dimensions, without tuning.

Subsampling. Each tree is trained on only 256 randomly sampled rows, regardless of total dataset size. Training 100 trees on 10M rows takes 6ms — it reads 25,600 rows total. This is the key insight from the original paper: anomalies are so different that a tiny sample is enough to characterize them.

Linear scoring. Scoring each row means traversing 100 trees of depth ≤8. That’s 800 comparisons per row — branch-free, cache-friendly, and trivially parallelizable. Stratum’s implementation packs each tree node into a single long (split feature in upper 32 bits, split value as float in lower 32), traverses with branchless node = 2*node + 1 + cmp, and processes rows in morsel-driven parallel batches sized to fit L1 cache.

Multivariate by construction. Every tree split randomly selects a feature. The ensemble naturally captures cross-feature interactions without the user specifying which features correlate.

Unsupervised. No labels needed. You don’t need a curated training set of “known fraud” — the algorithm finds whatever doesn’t fit the bulk distribution. This matters because in practice, labeled anomaly data is expensive, incomplete, and often biased toward known attack patterns.

What the landscape looks like

We surveyed what major analytical databases offer for built-in anomaly detection:

DuckDB has no native capability. The closest is anofox-tabular, a third-party community extension (BSL-licensed) that adds isolation forests to DuckDB. We read through the implementation — it's feature-rich (Extended IF, SCiForest, categorical columns, density scoring), but architecturally very different from what we built.

anofox-tabular retrains the forest on every query — there's no model persistence, so you can't train once and score cheaply at query time. Its C++ implementation is scalar (no SIMD), single-threaded (no parallelism in build or score), and uses recursive traversal with std::vector allocations at every tree node. It also copies all data out of DuckDB's columnar format into its own data structures before running. The README's "vectorized C++17" likely refers to DuckDB's general execution model rather than the isolation forest code itself.

For small datasets (the test suite uses 5-51 rows), none of this matters. For scoring a million rows inline with a query, or scoring 1,000 transactions in the hot path of a payment system, the architectural choices compound. We haven't benchmarked head-to-head, but the design differences — flat packed arrays vs. nested vectors, morsel-driven parallelism vs. single-threading, persistent models vs. retrain-per-query — point to a substantial gap at scale.

ClickHouse has seriesOutliersDetectTukey — a univariate IQR method for time-series. Useful for simple threshold alerts, but it’s one column at a time, one statistical method, no learning. Cloudflare built their anomaly detection platform on ClickHouse but implemented the actual detection logic (HBOS) in external microservices — ClickHouse stores and aggregates the data, it doesn’t run the models.

TimescaleDB has an open issue proposing ARIMA and DBSCAN anomaly detection. It remains unimplemented.

PostgreSQL MADlib offers in-database ML, but it’s a heavy extension that hasn’t seen active development recently.

The pattern is consistent: analytical databases treat anomaly detection as somebody else’s problem. The “solution” is always to export data to a separate ML runtime.

The cost of exporting

This isn’t just about convenience. The export-to-Python pattern has structural costs that compound in production:

Latency. Serializing 1M rows from a database into Python’s heap takes seconds. Add model inference, write-back, and you’re looking at minutes for a pipeline that should be a query. For fraud detection or infrastructure monitoring, that latency window is when damage happens.

Memory duplication. The data exists in the database AND in Python’s process. For large datasets, this means either paying for 2x RAM or batching with additional orchestration complexity.

Operational surface area. You now maintain a database AND a Python environment with scikit-learn, NumPy, and their transitive dependencies. Version pinning, compatibility testing, deployment coordination. Every additional system boundary is a place where things break.

Security perimeter. Moving data out of the database means it leaves whatever access controls, encryption, and audit logging the database provides. For regulated industries, this is a compliance headache.

Lost optimization. When anomaly scoring is a SQL function, the query engine can apply zone-map pruning, skip entire chunks where min/max statistics prove no rows will match downstream filters, and fuse the scoring into the execution pipeline. An external Python process sees a flat array with no metadata.
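The zone-map idea is simple enough to sketch (illustrative, not Stratum's planner code — the class and method names here are mine): per-chunk min/max statistics let a filter skip entire chunks without ever touching their rows.

```java
public class ZoneMapSketch {
    // Per-chunk min/max statistics, as a columnar engine would keep them.
    record Chunk(double min, double max, double[] values) {}

    // Count rows with value > threshold, skipping chunks whose max
    // proves no row inside can match.
    static long countAbove(Chunk[] chunks, double threshold) {
        long hits = 0;
        for (Chunk c : chunks) {
            if (c.max() <= threshold) continue;   // pruned: chunk is never scanned
            for (double v : c.values()) if (v > threshold) hits++;
        }
        return hits;
    }

    public static void main(String[] args) {
        Chunk[] chunks = {
            new Chunk(1, 9,  new double[]{1, 5, 9}),    // pruned for threshold 10
            new Chunk(2, 50, new double[]{2, 30, 50}),  // scanned
        };
        System.out.println(countAbove(chunks, 10));     // 2
    }
}
```

An external Python process never sees these statistics, so it must materialize and scan everything.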

How it works in Stratum

SQL interface

Stratum speaks the PostgreSQL wire protocol. Connect with psql, DBeaver, JDBC, or any PostgreSQL client — then train and query models entirely from SQL:

-- Train a model directly from SQL
CREATE MODEL fraud_model
  TYPE ISOLATION_FOREST
  OPTIONS (n_trees = 200, sample_size = 256, contamination = 0.05)
  AS SELECT amount, freq FROM transactions;

The AS SELECT query defines the training data — any valid SELECT works, including WHERE filters and JOINs. Column names become the model’s feature names. Once created, the model remembers its features — you don’t need to repeat them:

-- Short form: model knows its features from training
SELECT *, ANOMALY_SCORE('fraud_model') AS score
FROM transactions;

-- All four functions support both forms
SELECT *, ANOMALY_PREDICT('fraud_model') AS is_anomaly FROM transactions;
SELECT *, ANOMALY_PROBA('fraud_model') AS prob FROM transactions;
SELECT *, ANOMALY_CONFIDENCE('fraud_model') AS conf FROM transactions;

Need to score on different columns, computed expressions, or join results? Use the long form with explicit arguments (mapped positionally to the model’s features):

-- Explicit columns
SELECT *, ANOMALY_SCORE('fraud_model', amount, freq) AS score
FROM transactions;

-- Score on expressions
SELECT *, ANOMALY_SCORE('fraud_model', amount * 100, LOG(freq)) AS score
FROM transactions;

-- Score across JOINs
SELECT t.*, ANOMALY_SCORE('fraud_model', t.amount, r.rate) AS score
FROM transactions t JOIN rates r ON t.currency = r.code;

Model management is also SQL-native:

SHOW MODELS;                    -- list all registered models
DESCRIBE MODEL fraud_model;     -- features, hyperparameters, threshold
DROP MODEL fraud_model;         -- remove a model
DROP MODEL IF EXISTS old_model; -- remove only if it exists

The anomaly functions look and compose like any other SQL expression — filter on them, aggregate them, join them.

Clojure API

For programmatic workflows — custom training pipelines, model rotation, or embedding Stratum as a library — there’s a direct Clojure API:

(require '[stratum.api :as st])

;; Your data — plain Java arrays
(def amounts (double-array [10 15 12 11 14 200 13 11 300 12]))
(def freqs   (double-array [ 5  6  4  5  7   1  5  4   1  6]))

;; Train: 100 trees, 256 samples each, expect ~5% anomalies
(def model (st/train-iforest {:from {:amount amounts :freq freqs}
                              :contamination 0.05}))

;; Score: double[] in [0, 1] — higher = more anomalous
(st/iforest-score model {:amount amounts :freq freqs})

;; Binary prediction: long[] with 1 = anomaly, 0 = normal
(st/iforest-predict model {:amount amounts :freq freqs})

;; Confidence: how much do the trees agree? [0, 1]
(st/iforest-predict-confidence model {:amount amounts :freq freqs})

Scores integrate directly with the query engine — they’re just another column:

(def scores (st/iforest-score model data))
(st/q {:from   (assoc data :score scores)
       :where  [[:> :score 0.7]]
       :group  [:region]
       :agg    [[:avg :score] [:count]]
       :having [[:> :avg 0.5]]
       :order  [[:avg :desc]]})

Online adaptation

Data distributions shift. Fraud patterns evolve. A model trained last month may not catch today’s anomalies. Retraining from scratch is wasteful when only the recent distribution has changed.

iforest-rotate replaces the oldest k trees with new ones trained on fresh data. The original model is unchanged — copy-on-write semantics mean you can keep the old model for comparison:

;; Replace 10% of trees with new ones trained on this week's data
(def updated-model (st/iforest-rotate model this-week-data))

;; Score with recency bias: newer trees weighted higher
(st/iforest-score-weighted updated-model data 0.98)

This is a lightweight operation — training 10 new trees on 256 samples each costs microseconds. The resulting model maintains sensitivity to historical patterns (90 original trees) while adapting to recent distribution changes (10 new trees). In our temporal evaluation with synthetic concept drift (outlier region shifting at the midpoint), the rotating model maintains AUC above 0.95 across all segments where a static model degrades to 0.75.

Performance

Measured on an Intel Core Ultra 7 258V (8 cores, Lunar Lake), JDK 25, 100 trees with sample size 256:

Batch scoring (online processing)

| Batch size | Latency | Use case |
|---|---|---|
| 1 row | 6 μs | Single transaction check |
| 10 rows | 19 μs | Micro-batch |
| 100 rows | 163 μs | API batch |
| 1,000 rows | 1.6 ms | Payment gateway batch |
| 10,000 rows | 16 ms | Bulk ingest check |

At 6 microseconds per row, anomaly scoring adds negligible overhead to any transaction processing pipeline. A payment gateway checking 1,000 transactions per batch stays under 2ms — well within the latency budget that even real-time payment systems allow for fraud checks.

Full-table scoring (analytics)

| Operation | 1M rows | 10M rows |
|---|---|---|
| Train (100 trees × 256 samples) | ~1 ms | 6 ms |
| Score (parallel, 8 cores) | 448 ms | 4.6 s |
| Score (single-threaded) | ~1.7 s | 17 s |
| Model memory | ~2.5 MB (100 trees × 511 nodes × 8 bytes) | same (size-independent) |

Training is near-instant because it only reads 25,600 rows total (256 per tree), regardless of dataset size. Scoring scales linearly and parallelizes across cores with morsel-driven execution — each morsel sized to fit L1 cache for branchless tree traversal.

The isolation forest validates against standard ODDS benchmark datasets (Shuttle, Http, ForestCover, Mammography, CreditCard) with AUC-ROC scores matching or exceeding scikit-learn’s implementation at equivalent hyperparameters. The benchmark suite includes a head-to-head comparison with PyOD that you can run yourself: clj -M:iforest pyod.

Under the hood

The tree structure is packed for cache efficiency. Each node is a single long: the split feature index in the upper 32 bits, the split value as a float in the lower 32.

Scoring traverses each tree with a branchless comparison — node = 2*node + 1 + (val >= splitVal ? 1 : 0) — no branch misprediction, no pointer chasing. The anomaly score is 2^(-E(h(x)) / c(ψ)) where E(h(x)) is the mean path length across all trees and c(ψ) is the expected path length of an unsuccessful BST search, normalizing scores to [0, 1].
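A minimal sketch of the packed layout and score math described above (class and method names here are mine; the formula is from the original paper):

```java
public class NodeScoreSketch {
    // One node per long: split feature index in the upper 32 bits,
    // split value as raw float bits in the lower 32.
    static long pack(int feature, float splitVal) {
        return ((long) feature << 32) | (Float.floatToRawIntBits(splitVal) & 0xFFFF_FFFFL);
    }
    static int feature(long node)    { return (int) (node >>> 32); }
    static float splitVal(long node) { return Float.intBitsToFloat((int) node); }

    // Branchless traversal of a complete tree in a flat array:
    // the child index is computed arithmetically, no if/else.
    static int descend(long[] tree, double[] row, int depth) {
        int node = 0;
        for (int d = 0; d < depth; d++) {
            long n = tree[node];
            int cmp = row[feature(n)] >= splitVal(n) ? 1 : 0;
            node = 2 * node + 1 + cmp;   // left child at 2i+1, right at 2i+2
        }
        return node;
    }

    // c(psi): expected path length of an unsuccessful BST search over psi
    // points, via the harmonic-number approximation H(i) ≈ ln(i) + 0.5772...
    static double c(int psi) {
        double h = Math.log(psi - 1) + 0.5772156649015329;
        return 2.0 * h - 2.0 * (psi - 1) / (double) psi;
    }

    // score = 2^(-E(h(x)) / c(psi)): maps mean path length into [0, 1].
    static double score(double meanPathLength, int psi) {
        return Math.pow(2.0, -meanPathLength / c(psi));
    }

    public static void main(String[] args) {
        long n = pack(3, 42.5f);
        System.out.println(feature(n) + " / " + splitVal(n));   // 3 / 42.5
        long[] tree = { pack(0, 5.0f), 0L, 0L };
        System.out.println(descend(tree, new double[]{7.0}, 1)); // 2 (right: 7 >= 5)
        // Shallow mean path (easy to isolate) scores high; deep path scores low.
        System.out.printf("shallow=%.2f deep=%.2f%n", score(2.0, 256), score(10.0, 256));
    }
}
```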

Parallelism uses the same morsel-driven architecture as the rest of the query engine: the ForkJoinPool processes rows in 64K-row morsels, each morsel’s feature data fitting in L1 cache. No lock contention — each thread accumulates independently into its own score region.
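The lock-free pattern is easy to sketch (illustrative; Stratum's actual morsel scheduling is more involved): because each morsel owns a disjoint slice of the output array, threads never contend. Parallel streams run on the common ForkJoinPool, matching the architecture described above.

```java
import java.util.function.ToDoubleFunction;
import java.util.stream.IntStream;

public class MorselSketch {
    static final int MORSEL = 64 * 1024;   // 64K-row morsels, as in the post

    // Score all rows in parallel morsels. Each morsel writes only its own
    // disjoint slice of `out`, so no locks or atomics are needed.
    static void scoreAll(double[][] rows, double[] out, ToDoubleFunction<double[]> scoreFn) {
        int nMorsels = (rows.length + MORSEL - 1) / MORSEL;
        IntStream.range(0, nMorsels).parallel().forEach(m -> {
            int start = m * MORSEL, end = Math.min(start + MORSEL, rows.length);
            for (int i = start; i < end; i++) out[i] = scoreFn.applyAsDouble(rows[i]);
        });
    }

    public static void main(String[] args) {
        double[][] rows = new double[100_000][1];
        for (int i = 0; i < rows.length; i++) rows[i][0] = i;
        double[] out = new double[rows.length];
        scoreAll(rows, out, r -> r[0] * 2);     // stand-in for tree traversal
        System.out.println(out[99_999]);        // 199998.0
    }
}
```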

The confidence metric (predict-confidence) uses the coefficient of variation of per-tree path lengths. When trees agree on a point’s isolation depth, confidence is high. When they disagree — the point sits near a decision boundary — confidence is low. This gives you a principled way to triage uncertain predictions rather than trusting every score blindly.
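A sketch of the idea (the CV computation follows the description above; the mapping from CV to a [0, 1] confidence here is illustrative, not necessarily Stratum's exact formula):

```java
public class ConfidenceSketch {
    // Coefficient of variation (stddev / mean) of per-tree path lengths,
    // mapped to [0, 1]: low variation = trees agree = high confidence.
    static double confidence(double[] pathLengths) {
        double mean = 0;
        for (double p : pathLengths) mean += p;
        mean /= pathLengths.length;
        double var = 0;
        for (double p : pathLengths) var += (p - mean) * (p - mean);
        double cv = Math.sqrt(var / pathLengths.length) / mean;
        return 1.0 / (1.0 + cv);   // illustrative squashing of CV into [0, 1]
    }

    public static void main(String[] args) {
        double[] agree    = {2, 2, 3, 2, 3};    // trees agree: clearly isolated point
        double[] disagree = {1, 9, 2, 12, 4};   // trees disagree: boundary point
        System.out.printf("agree=%.2f disagree=%.2f%n",
                          confidence(agree), confidence(disagree));
    }
}
```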

What this enables

Online payment fraud detection. At 6μs per transaction, anomaly scoring can sit directly in the payment authorization path — not as a post-hoc batch job, but as a synchronous check before the charge clears. Train on your historical transaction data, register the model, and every SELECT against the transactions table can include ANOMALY_SCORE inline. For batch settlement processing, 1,000 transactions score in 1.6ms. The model stays in-process — no network hop to an external ML service, no serialization overhead, no additional point of failure in the payment critical path.

Data quality monitoring. Run ANOMALY_SCORE over your staging table before promoting to production. Flag rows that don’t fit the historical distribution. Catch data pipeline bugs before they propagate.

IoT sensor monitoring. Train on a baseline period of normal sensor readings. Score incoming data. When vibration, temperature, and power consumption are each individually normal but their combination is anomalous, the isolation forest catches it — z-scores don’t.

Versioned anomaly detection. Because Stratum datasets are immutable values with copy-on-write branching, you can score against historical snapshots. “What would this model have flagged last quarter?” is a query, not a data engineering project.

Try it yourself

Start the demo server — it loads 100K taxi ride rows and a pre-trained anomaly model:

java --add-modules jdk.incubator.vector \
     --enable-native-access=ALL-UNNAMED \
     -jar stratum-standalone.jar --demo

Connect with any PostgreSQL client and run real anomaly queries immediately:

psql -h localhost -p 5432 -U stratum
-- Find the most anomalous taxi rides
SELECT fare_amount, tip_amount, pickup_hour,
       ANOMALY_SCORE('taxi_anomaly', fare_amount, tip_amount,
                     total_amount, passenger_count, pickup_hour) AS score
FROM taxi
WHERE ANOMALY_SCORE('taxi_anomaly', fare_amount, tip_amount,
                    total_amount, passenger_count, pickup_hour) > 0.7
ORDER BY score DESC
LIMIT 20;

-- Binary prediction: which rides are anomalous?
SELECT fare_amount, tip_amount,
       ANOMALY_PREDICT('taxi_anomaly', fare_amount, tip_amount,
                       total_amount, passenger_count, pickup_hour) AS is_anomaly
FROM taxi
WHERE ANOMALY_PREDICT('taxi_anomaly', fare_amount, tip_amount,
                      total_amount, passenger_count, pickup_hour) = 1;

-- How confident is the model about each prediction?
SELECT fare_amount,
       ANOMALY_SCORE('taxi_anomaly', fare_amount, tip_amount,
                     total_amount, passenger_count, pickup_hour) AS score,
       ANOMALY_CONFIDENCE('taxi_anomaly', fare_amount, tip_amount,
                          total_amount, passenger_count, pickup_hour) AS confidence
FROM taxi
ORDER BY score DESC
LIMIT 10;

The demo dataset includes synthetic anomalies — high fares with zero tips late at night — that the model detects out of the box. But the model also finds natural outliers in the data: unusual combinations of fare, tip, passenger count, and hour that don’t match the bulk distribution.

Getting started with your own data

Start the server (requires JDK 21+):

java --add-modules jdk.incubator.vector \
     --enable-native-access=ALL-UNNAMED \
     -jar stratum-standalone.jar

Then connect with any PostgreSQL client and do everything from SQL:

-- Load your data
CREATE TABLE transactions (amount DOUBLE PRECISION, freq BIGINT, hour BIGINT);
INSERT INTO transactions VALUES (10.0, 5, 14), (15.0, 6, 9), ...;

-- Or query directly from files
SELECT * FROM read_csv('/path/to/transactions.csv');

-- Train a model
CREATE MODEL fraud_model
  TYPE ISOLATION_FOREST
  OPTIONS (n_trees = 200, contamination = 0.05)
  AS SELECT amount, freq, hour FROM transactions;

-- Score your data
SELECT *, ANOMALY_SCORE('fraud_model', amount, freq, hour) AS score
FROM transactions
ORDER BY score DESC;

For programmatic workflows, Stratum also has a Clojure API for model training, online rotation, and integration with the query engine. Add to deps.edn:

{:deps {org.replikativ/stratum {:mvn/version "RELEASE"}}}

Source and full documentation: github.com/replikativ/stratum. The anomaly detection guide has the complete API reference.

Feedback welcome on Clojurians #datahike or email.