Why Search Needs Versioning

Vector indexes are almost always mutable. You insert embeddings, update them, delete them, and the index reflects only the current state. That's fine when you'll never need to query or audit a past state, but it breaks down when retrieval feeds into reasoning.

The moment search results enter an LLM’s context window or guide an agent’s action, the index becomes memory. Memory that overwrites itself cannot be trusted.

The problem

Consider a retrieval-augmented system in production: embeddings are indexed, queries retrieve context, responses are generated. A week later, someone asks why the system returned a particular result. In a mutable index, there's no answer. The index changed. The embedding model may have been updated. The retrieval state that produced that response no longer exists.

This isn’t theoretical. Any system where retrieval influences outcomes—recommendations, classifications, agent actions—is subject to this failure. The more autonomous the system, the more consequential the gap.

Vectory takes the copy-on-write model that powers Datahike and applies it to HNSW (Hierarchical Navigable Small World) vector indexes. Every insert returns a new index version. Previous versions remain valid and queryable:

import java.util.UUID;

// Create and populate. embeddings, moreEmbeddings, and query
// are assumed to be defined elsewhere.
var idx = PersistentVectorIndex.builder()
    .dimensions(1536)
    .storagePath("/var/data/vectors")
    .build();

idx.addBatch(embeddings);
UUID v1 = idx.createSnapshot();   // pin the state after the first batch

idx.addBatch(moreEmbeddings);
UUID v2 = idx.createSnapshot();

// Both versions remain searchable
var oldIndex = idx.asOf(v1);
oldIndex.search(query, 10);  // original state
idx.search(query, 10);       // current state

The branch() operation is O(1)—it shares structure with the original. Two branches diverge independently without copying data. This makes A/B testing embeddings, bisecting regressions, and maintaining reproducible baselines cheap.
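
In code, the workflow might look like this (branch() comes from Vectory, but its exact signature, candidateEmbeddings, and the surrounding plumbing are illustrative):

// Fork the index in O(1); both handles share all existing chunks.
var experiment = idx.branch();

// Try a candidate embedding model on the fork only.
experiment.addBatch(candidateEmbeddings);

// Compare side by side; production is untouched.
var baseline = idx.search(query, 10);
var variant  = experiment.search(query, 10);

// Discarding the experiment is just dropping the reference;
// chunks shared with the original remain live.
experiment = null;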

How it works

The core data structure is a PersistentEdgeStore: chunked copy-on-write arrays that hold HNSW graph edges. Layer 0 (the dense bottom layer) uses fixed-size chunks; upper layers use sparse per-node arrays. When you modify the graph, only affected chunks are copied. Unchanged structure is shared.
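
A minimal self-contained sketch of the idea, in the spirit of PersistentEdgeStore (the class, names, and chunk size here are illustrative, not Vectory's actual code):

// Illustrative chunked copy-on-write array: set() copies only the
// affected chunk plus the chunk table; everything else is shared.
final class CowIntArray {
    private static final int CHUNK = 1024;   // illustrative chunk size
    private final int[][] chunks;            // the chunk table

    private CowIntArray(int[][] chunks) { this.chunks = chunks; }

    static CowIntArray of(int size) {
        int n = (size + CHUNK - 1) / CHUNK;
        int[][] c = new int[n][];
        for (int i = 0; i < n; i++) c[i] = new int[CHUNK];
        return new CowIntArray(c);
    }

    int get(int i) { return chunks[i / CHUNK][i % CHUNK]; }

    // Returns a new version; the old one stays valid and shares
    // every chunk except the one that changed.
    CowIntArray set(int i, int value) {
        int[][] table = chunks.clone();          // shallow copy of the table
        int[] chunk = table[i / CHUNK].clone();  // copy only this chunk
        chunk[i % CHUNK] = value;
        table[i / CHUNK] = chunk;
        return new CowIntArray(table);
    }
}

An edge update in layer 0 becomes a handful of set() calls, so a new version differs from its parent only by the chunks those edges live in.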

Vectors themselves live in a memory-mapped store backed by Konserve, so the same index can be persisted to disk, S3, or any pluggable backend. The combination gives you SIMD-accelerated search with full version history and portable storage.
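
The SIMD part is ordinary JDK machinery. A squared-L2 kernel along these lines shows the shape of the hot loop; this is a generic sketch using the incubating jdk.incubator.vector API (run with --add-modules jdk.incubator.vector), not Vectory's actual kernel:

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Generic SIMD squared-L2 distance between two equal-length vectors.
final class Distance {
    private static final VectorSpecies<Float> S = FloatVector.SPECIES_PREFERRED;

    static float l2(float[] a, float[] b) {
        var acc = FloatVector.zero(S);
        int i = 0;
        for (int bound = S.loopBound(a.length); i < bound; i += S.length()) {
            var d = FloatVector.fromArray(S, a, i)
                     .sub(FloatVector.fromArray(S, b, i));
            acc = d.fma(d, acc);              // acc += d * d, fused
        }
        float sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) {           // scalar tail
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }
}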

What this enables

Reproducible evaluation: Run the same query against the same index state, get the same results. Compare retrieval quality across embedding models with stable baselines (see the sketch below).

Safe experimentation: Fork an index, test a new chunking strategy or embedding model, merge or discard. Production state is never at risk.

Auditability: Query the index as it existed at any past instant. Answer “what could the system have retrieved when it made that decision?”

Concurrent access: Readers never block writers. A snapshot is a value you can hand to any number of workers—across threads, processes, or machines—without coordination. Read scaling comes free.
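
Concretely, a snapshot UUID is all an evaluation harness needs to pin state. A sketch reusing the API from above (the thread-pool plumbing and evalQueries are illustrative):

import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Pin a baseline by snapshot ID and fan reads out across threads.
// The snapshot is immutable, so no locks, no coordination.
UUID baseline = idx.createSnapshot();
var frozen = idx.asOf(baseline);   // a value, not a lock

try (ExecutorService pool = Executors.newFixedThreadPool(8)) {
    for (var query : evalQueries) {
        pool.submit(() -> frozen.search(query, 10));
    }
}
// Writers keep appending to idx in the meantime; frozen never
// changes, so tonight's run returns exactly what last week's did.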

The cost

Immutable indexes have write amplification: inserting a vector touches multiple graph edges, each potentially triggering a chunk copy. With, say, 16 neighbors per node, a single fully persistent insert can dirty a dozen or more chunks. Storage grows with history.

In practice, this cost is amortized. You don’t create a snapshot for every vector added during a bulk load. The PersistentEdgeStore supports transient mode—mutable during batch insert, immutable at the boundary. Snapshots are created only when a batch commits, and only those become visible to readers. The system can adaptively coarse-grain batches to balance throughput against snapshot granularity.
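
The pattern is the same as Clojure's transients. A sketch building on the chunked-array example above (method names are hypothetical, not the actual PersistentEdgeStore API):

import java.util.BitSet;

// Illustrative transient: each shared chunk is copied at most once,
// then mutated in place; freeze() marks the commit boundary.
final class TransientIntArray {
    private static final int CHUNK = 1024;
    private final int[][] chunks;               // starts shared with parent
    private final BitSet owned = new BitSet();  // chunks copied so far

    TransientIntArray(int[][] parentChunks) {
        this.chunks = parentChunks.clone();     // shallow: chunks still shared
    }

    void set(int i, int value) {
        int c = i / CHUNK;
        if (!owned.get(c)) {                    // first write to this chunk
            chunks[c] = chunks[c].clone();      // copy once...
            owned.set(c);
        }
        chunks[c][i % CHUNK] = value;           // ...then mutate in place
    }

    // Commit boundary: snapshot the chunk table for an immutable view.
    // Readers see one new version per batch, not one per edge write.
    int[][] freeze() {
        owned.clear();          // later writes must re-copy their chunk
        return chunks.clone();  // the frozen view owns its own table
    }
}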

For RAG, semantic search, and ML experimentation where retrieval must be reproducible and auditable, versioned indexes are the right foundation.