R10 — Vector Embeddings as First-Class ValueIDs

Research Proposal — Mapped to backlog in docs/RESEARCH_BACKLOG.md

🔬 What's Novel

ValueID scheme extended with locality-sensitive hashing (LSH) for approximate content addressing in a unified storage model
Unified storage where structured data and embeddings share the same ValueStore infrastructure
Hybrid query execution combining SQL predicates with approximate nearest neighbor search
Dependency tracking extended to embedding-derived data for automatic cache invalidation

🔧 Technical Approach

Phase 1 — LSH-ValueID Design

ValueID = (LSH_bucket, content_hash). Similar embeddings land in the same LSH bucket with high probability, so approximate lookup reduces to enumerating a bucket, while the content_hash component preserves exact-match deduplication for identical vectors.
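As a minimal sketch of how such a composite ID could be derived, the snippet below uses random-hyperplane (sign-of-projection) LSH for the bucket and SHA-256 over a float32 encoding for the content hash. The proposal does not fix an LSH family, bit width, or hash function, so `DIM`, `NUM_BITS`, the seed, and the helper names `lsh_bucket`, `content_hash`, and `value_id` are all assumptions made for illustration.

```python
import hashlib
import random
import struct

DIM = 8        # embedding dimensionality (assumed for the example)
NUM_BITS = 16  # number of random hyperplanes = bits in the bucket id (assumed)

# Fixed seed so every writer derives the same hyperplanes, hence the same buckets.
_rng = random.Random(42)
HYPERPLANES = [[_rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_BITS)]


def lsh_bucket(vec):
    """Sign-of-projection LSH: one bit per hyperplane, packed into an int."""
    bucket = 0
    for plane in HYPERPLANES:
        dot = sum(p * x for p, x in zip(plane, vec))
        bucket = (bucket << 1) | (1 if dot >= 0.0 else 0)
    return bucket


def content_hash(vec):
    """SHA-256 over the little-endian float32 encoding of the vector."""
    payload = struct.pack(f"<{len(vec)}f", *vec)
    return hashlib.sha256(payload).hexdigest()


def value_id(vec):
    """Hypothetical LSH-extended ValueID: (LSH_bucket, content_hash)."""
    return (lsh_bucket(vec), content_hash(vec))


a = [0.5, -1.0, 0.25, 2.0, -0.75, 1.5, 0.0, -2.0]
b = list(a)            # identical content
c = [-x for x in a]    # opposite direction

print(value_id(a) == value_id(b))          # True: identical vectors deduplicate
print(content_hash(a) == content_hash(c))  # False: different content
```

Because the bucket depends only on the signs of the projections, nearby vectors usually agree on most bits, while the content hash changes on any difference in the encoded floats. Keeping the two components separate is what would let one ID serve both exact and approximate lookups.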

Phase 2 — Unified Storage

Extend ValueStore to handle embedding-type values: compute the LSH bucket automatically on insert, answer approximate nearest neighbor queries by filtering on LSH buckets, then refine the surviving candidates with exact distance computation.
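The insert and lookup path described above can be sketched with an in-memory stand-in. `EmbeddingStore` is a hypothetical class, not the actual ValueStore API. The random-hyperplane LSH, the 1-bit multi-probe step, and Euclidean distance for refinement are likewise assumptions chosen to keep the example small.

```python
import hashlib
import math
import random
import struct
from collections import defaultdict


class EmbeddingStore:
    """Hypothetical in-memory stand-in for an embedding-aware ValueStore."""

    def __init__(self, dim, num_bits=8, seed=0):
        rng = random.Random(seed)
        self.num_bits = num_bits
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(num_bits)]
        self.values = {}                 # ValueID -> vector
        self.buckets = defaultdict(set)  # LSH bucket -> ValueIDs in that bucket

    def _bucket(self, vec):
        bits = 0
        for plane in self.planes:
            dot = sum(p * x for p, x in zip(plane, vec))
            bits = (bits << 1) | int(dot >= 0.0)
        return bits

    def insert(self, vec):
        """Compute the LSH bucket on insert; identical vectors share one ValueID."""
        digest = hashlib.sha256(struct.pack(f"<{len(vec)}f", *vec)).hexdigest()
        vid = (self._bucket(vec), digest)
        self.values.setdefault(vid, list(vec))
        self.buckets[vid[0]].add(vid)
        return vid

    def nearest(self, query, k=3):
        """Filter by bucket (plus 1-bit neighbours), then refine by exact distance."""
        home = self._bucket(query)
        probes = [home] + [home ^ (1 << i) for i in range(self.num_bits)]
        candidates = set()
        for bucket in probes:
            candidates |= self.buckets.get(bucket, set())
        ranked = sorted(candidates,
                        key=lambda vid: math.dist(query, self.values[vid]))
        return ranked[:k]


store = EmbeddingStore(dim=4)
v1 = store.insert([1.0, 0.0, 0.0, 0.0])
v2 = store.insert([1.0, 0.0, 0.0, 0.0])   # exact duplicate of v1
v3 = store.insert([-1.0, 0.0, 0.0, 0.0])  # opposite direction, different bucket

print(v1 == v2, len(store.values))                    # True 2
print(store.nearest([1.0, 0.0, 0.0, 0.0], k=1) == [v1])  # True
```

Probing a few neighbouring buckets trades extra candidate reads for better recall. A real design would have to tune that trade-off and decide how buckets map onto the on-disk layout.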

Phase 3 — Hybrid Queries

SkeinQL extensions combining SQL predicates with vector similarity: SELECT * FROM docs WHERE category = 'tech' ORDER BY embedding <-> query_vector LIMIT 10.
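To make the intended result of such a query concrete, here is a brute-force reference in Python: filter by the predicate, order by distance to the query vector, then truncate. It assumes `<->` denotes Euclidean distance, which the proposal does not state. The sample rows and the `hybrid_query` helper are invented for illustration. An LSH-backed plan would approximate this ordering rather than compute it exactly.

```python
import math

# Illustrative rows standing in for the `docs` table.
docs = [
    {"id": 1, "category": "tech",  "embedding": [0.9, 0.1]},
    {"id": 2, "category": "tech",  "embedding": [0.1, 0.9]},
    {"id": 3, "category": "sport", "embedding": [1.0, 0.0]},
    {"id": 4, "category": "tech",  "embedding": [0.7, 0.3]},
]


def hybrid_query(rows, category, query_vector, limit):
    """Exact reference semantics: WHERE predicate, ORDER BY distance, LIMIT."""
    matching = (r for r in rows if r["category"] == category)
    ranked = sorted(matching,
                    key=lambda r: math.dist(r["embedding"], query_vector))
    return ranked[:limit]


result = hybrid_query(docs, "tech", [1.0, 0.0], limit=2)
print([r["id"] for r in result])  # prints [1, 4]
```

Row 3 is the closest vector overall but is excluded by the category predicate. This is the interaction a hybrid planner has to get right whether it filters before or after the vector search.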

Phase 4 — Dependency Tracking

Extend dependency tracking to embedding relationships. If document D has embedding E, cached results of queries that read E are invalidated when D changes, which keeps embedding-backed results fresh automatically.
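The D to E to query chain can be modelled as a small dependency graph with transitive invalidation. The sketch below is only an illustration of the idea: `DependencyTracker`, its method names, and the string keys such as `"doc:D"` are hypothetical and do not reflect SkeinDB's existing dependency-tracking interface.

```python
from collections import defaultdict


class DependencyTracker:
    """Hypothetical tracker: edges point from a source to what is derived from it."""

    def __init__(self):
        self.dependents = defaultdict(set)  # node -> nodes derived from it
        self.cache = {}                     # query key -> cached result

    def add_dependency(self, source, derived):
        self.dependents[source].add(derived)

    def cache_result(self, query_key, result, reads):
        """Cache a query result and record every node the query read."""
        self.cache[query_key] = result
        for node in reads:
            self.add_dependency(node, query_key)

    def invalidate(self, node):
        """Drop every cached result transitively derived from `node`."""
        stack, seen = [node], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            self.cache.pop(current, None)
            stack.extend(self.dependents.get(current, ()))


tracker = DependencyTracker()
tracker.add_dependency("doc:D", "emb:E")  # E is derived from D
tracker.cache_result("q:similar-to-X", ["doc:D"], reads=["emb:E"])
tracker.cache_result("q:unrelated", ["doc:Z"], reads=["emb:F"])

tracker.invalidate("doc:D")               # D changed
print("q:similar-to-X" in tracker.cache)  # False: reached via D -> E -> query
print("q:unrelated" in tracker.cache)     # True: no path from D
```

A fuller design would also need to mark E itself as stale and schedule re-embedding. The sketch covers only the cache-invalidation half.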

🧪 Hypotheses

LSH extends ValueID semantics to support "similar" lookups while preserving exact-match deduplication for identical vectors.

Co-locating embeddings with source data in a unified ValueStore reduces the impedance mismatch introduced by running a separate vector database.

Dependency tracking extends to embedding-based queries for cache invalidation when source data changes.

🔗 SkeinDB Integration

ValueID Store

LSH Hashing

SkeinQL RPC

Dependency Tracking

LSM / Compaction

📚 Key References

Johnson et al. — "Billion-Scale Similarity Search with GPUs" (FAISS, 2019)
Malkov & Yashunin — "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs" (HNSW, 2018)
Wang et al. — "Milvus: A Purpose-Built Vector Data Management System" (2021)
