🔬 What's Novel
- LSH-extended ValueID scheme for approximate content addressing in a unified storage model
- Unified storage where structured data and embeddings share the same ValueStore infrastructure
- Hybrid query execution combining SQL predicates with approximate nearest neighbor search
- Dependency tracking extended to embedding-derived data for automatic cache invalidation
🔧 Technical Approach
Phase 1 — LSH-ValueID Design
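ValueID = (LSH_bucket, content_hash). Similar embeddings are likely, though not guaranteed, to share an LSH bucket, so approximate lookup reduces to enumerating the query's bucket and, optionally, neighboring buckets. The content_hash component preserves exact-match deduplication.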
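The two-part ValueID above can be sketched as follows. This is a minimal illustration, not the SkeinDB implementation: it assumes random-hyperplane (sign) LSH for the bucket and SHA-256 over the packed vector bytes for the content hash. The names `ValueID`, `make_hyperplanes`, `lsh_bucket`, and `value_id` are hypothetical.

```python
import hashlib
import random
import struct
from typing import List, NamedTuple, Sequence


class ValueID(NamedTuple):
    lsh_bucket: int    # locality-sensitive part: nearby vectors tend to collide
    content_hash: str  # exact part: identical bytes give identical hashes


def make_hyperplanes(dim: int, n_bits: int, seed: int = 0) -> List[List[float]]:
    """Draw n_bits random hyperplanes (assumed scheme: sign-of-projection LSH)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_bucket(vec: Sequence[float], planes: List[List[float]]) -> int:
    """Pack one bit per hyperplane: 1 if the vector lies on its non-negative side."""
    bucket = 0
    for plane in planes:
        dot = sum(p * x for p, x in zip(plane, vec))
        bucket = (bucket << 1) | (1 if dot >= 0.0 else 0)
    return bucket


def value_id(vec: Sequence[float], planes: List[List[float]]) -> ValueID:
    """Combine the approximate bucket with an exact hash of the vector bytes."""
    payload = struct.pack(f"<{len(vec)}d", *vec)
    return ValueID(lsh_bucket(vec, planes), hashlib.sha256(payload).hexdigest())


planes = make_hyperplanes(dim=4, n_bits=8)
v = [0.1, 0.9, -0.3, 0.5]
a = value_id(v, planes)
b = value_id(v, planes)                      # identical vector: identical ValueID
c = value_id([2.0 * x for x in v], planes)   # same direction, different bytes
```

Here `a == b`, so identical vectors deduplicate, while `c` lands in the same bucket as `a` with a different content hash. Sign-based LSH depends only on direction, which makes it a natural fit for cosine similarity; a different LSH family would be needed if Euclidean distance is the target metric.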
Phase 2 — Unified Storage
Extend ValueStore to handle embedding-type values: compute the LSH bucket automatically on insert, answer approximate nearest neighbor queries by filtering candidates to matching LSH buckets, then refine those candidates with exact distance computation.
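The insert, filter, and refine steps described above might look like the sketch below. It is a toy in-memory stand-in, not the actual ValueStore API: the class `EmbeddingStore`, its methods, and the `probe_neighbors` option are hypothetical, and sign-based LSH with Euclidean refinement is an assumption chosen for brevity.

```python
import hashlib
import math
import random
import struct
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple


class EmbeddingStore:
    """Toy stand-in for a ValueStore that accepts embedding values."""

    def __init__(self, dim: int, n_bits: int = 8, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.n_bits = n_bits
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
        self.values: Dict[str, Tuple[float, ...]] = {}        # content_hash -> vector
        self.buckets: Dict[int, Set[str]] = defaultdict(set)  # lsh_bucket -> hashes

    def _bucket(self, vec: Sequence[float]) -> int:
        bucket = 0
        for plane in self.planes:
            dot = sum(p * x for p, x in zip(plane, vec))
            bucket = (bucket << 1) | (1 if dot >= 0.0 else 0)
        return bucket

    def insert(self, vec: Sequence[float]) -> Tuple[int, str]:
        """Compute the LSH bucket on insert; identical vectors are stored once."""
        vec = tuple(float(x) for x in vec)
        digest = hashlib.sha256(struct.pack(f"<{len(vec)}d", *vec)).hexdigest()
        bucket = self._bucket(vec)
        self.values.setdefault(digest, vec)
        self.buckets[bucket].add(digest)
        return bucket, digest

    def nearest(self, query: Sequence[float], k: int = 5,
                probe_neighbors: bool = True) -> List[Tuple[float, str]]:
        """Filter candidates by bucket, then rank them by exact Euclidean distance."""
        query = tuple(float(x) for x in query)
        home = self._bucket(query)
        probes = [home]
        if probe_neighbors:
            # Also visit buckets whose code differs from the query's in one bit.
            probes += [home ^ (1 << i) for i in range(self.n_bits)]
        candidates: Set[str] = set()
        for b in probes:
            candidates |= self.buckets.get(b, set())
        scored = sorted((math.dist(query, self.values[h]), h) for h in candidates)
        return scored[:k]


store = EmbeddingStore(dim=3)
store.insert([1.0, 0.0, 0.0])
store.insert([0.9, 0.1, 0.0])
hits = store.nearest([1.0, 0.0, 0.0], k=2)  # list of (distance, content_hash)
```

Bucket filtering trades recall for speed: a true neighbor that falls in an unprobed bucket is never scored. Probing adjacent buckets, or using several hash tables, is the usual way to recover recall at the cost of more candidates.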
Phase 3 — Hybrid Queries
SkeinQL extensions combining SQL predicates with vector similarity: SELECT * FROM docs WHERE category = 'tech' ORDER BY embedding <-> query_vector LIMIT 10.
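The intended semantics of the query above can be pinned down with a brute-force reference: apply the SQL predicate first, then order the survivors by distance and keep the top rows. This sketch assumes `<->` means Euclidean distance (as it does in pgvector; the SkeinQL definition is not given here). The `Doc` type and `hybrid_query` function are hypothetical, and a real executor would replace the linear scan with the LSH bucket lookup.

```python
import heapq
import math
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence, Tuple


@dataclass(frozen=True)
class Doc:
    doc_id: int
    category: str
    embedding: Tuple[float, ...]


def hybrid_query(docs: Iterable[Doc],
                 predicate: Callable[[Doc], bool],
                 query_vector: Sequence[float],
                 limit: int = 10) -> List[Doc]:
    """WHERE predicate ORDER BY embedding <-> query_vector LIMIT limit."""
    matching = (d for d in docs if predicate(d))  # structured filter first
    return heapq.nsmallest(
        limit, matching, key=lambda d: math.dist(d.embedding, query_vector)
    )


docs = [
    Doc(1, "tech", (0.0, 0.0)),
    Doc(2, "tech", (1.0, 1.0)),
    Doc(3, "news", (0.1, 0.1)),  # closest overall, but excluded by the predicate
    Doc(4, "tech", (0.2, 0.1)),
]
top = hybrid_query(docs, lambda d: d.category == "tech", (0.1, 0.1), limit=2)
print([d.doc_id for d in top])  # -> [4, 1]
```

Filtering before ranking always returns `limit` rows when enough rows match. The alternative, running the approximate search first and filtering afterwards, can return fewer rows than requested when the predicate is selective, so the choice between the two is a planner decision.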
Phase 4 — Dependency Tracking
Extend dependency tracking to embedding relationships. If document D has embedding E, then E and the cached results of any query that reads E are invalidated when D changes, which keeps embeddings fresh without manual bookkeeping.
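One way to picture the D to E to query chain described above is a small dependency graph with transitive invalidation. The sketch below is illustrative only: `DependencyTracker` and its methods are hypothetical names, and it assumes dependencies are recorded explicitly when an embedding or cached query result is produced.

```python
from collections import defaultdict, deque
from typing import Any, Dict, Hashable, Set


class DependencyTracker:
    """Records which derived items depend on which sources."""

    def __init__(self) -> None:
        self.dependents: Dict[Hashable, Set[Hashable]] = defaultdict(set)
        self.cache: Dict[Hashable, Any] = {}

    def depends_on(self, derived: Hashable, source: Hashable) -> None:
        """Declare that `derived` must be recomputed when `source` changes."""
        self.dependents[source].add(derived)

    def put(self, key: Hashable, result: Any) -> None:
        self.cache[key] = result

    def invalidate(self, changed: Hashable) -> Set[Hashable]:
        """Drop everything reachable from `changed`; return the invalidated keys."""
        stale: Set[Hashable] = set()
        queue = deque([changed])
        while queue:
            node = queue.popleft()
            for dep in self.dependents.get(node, ()):
                if dep not in stale:  # also guards against dependency cycles
                    stale.add(dep)
                    queue.append(dep)
        for key in stale:
            self.cache.pop(key, None)
        return stale


tracker = DependencyTracker()
tracker.depends_on("embedding:E", "doc:D")            # E is computed from D
tracker.depends_on("query:top10-tech", "embedding:E")  # cached query reads E
tracker.put("embedding:E", (0.1, 0.9))
tracker.put("query:top10-tech", ["doc:D"])
stale = tracker.invalidate("doc:D")  # both E and the cached query are dropped
```

After the call, `stale` contains both the embedding and the cached query, and neither remains in `tracker.cache`. Whether stale embeddings are recomputed eagerly or lazily on next read is a separate policy choice that this sketch leaves open.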
🧪 Hypotheses
LSH extends ValueID semantics to support "similar" lookups while preserving exact-match deduplication for identical vectors.
Co-locating embeddings with source data in unified ValueStore reduces the impedance mismatch of separate vector databases.
Dependency tracking extends to embedding-based queries for cache invalidation when source data changes.
🔗 SkeinDB Integration
📚 Key References
- Johnson et al. — "Billion-Scale Similarity Search with GPUs (FAISS)" (2019)
- Malkov & Yashunin — "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs (HNSW)" (2018)
- Wang et al. — "Milvus: A Purpose-Built Vector Data Management System" (2021)