R10 · AI/ML Integration

Vector Embeddings as First-Class ValueIDs

SkeinDB's content-addressed ValueStore extends naturally to vector embeddings. Incorporating locality-sensitive hashing (LSH) into ValueIDs gives similar vectors related IDs, which enables deduplication of similar (not just identical) content. The result is a unified storage model in which text, embeddings, and structured data share infrastructure, and hybrid queries combine SQL predicates with approximate nearest neighbor search.

Research Proposal — Mapped to backlog in docs/RESEARCH_BACKLOG.md

🔬 What's Novel

🔧 Technical Approach

Phase 1 — LSH-ValueID Design

Each ValueID is a pair (LSH_bucket, content_hash). Similar embeddings share an LSH bucket with high probability, so approximate lookup reduces to bucket enumeration, while the content hash preserves exact-match deduplication.
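
A minimal sketch of how such a pair could be computed. The proposal does not name a specific LSH family or hash function, so this assumes random-hyperplane (sign) LSH for the bucket and SHA-256 over a float32 encoding for the content hash. The function names, the 16-bit bucket width, and the seed are all hypothetical.

```python
import hashlib
import random
import struct


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals (assumed LSH family: sign projections)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_bucket(vector, hyperplanes):
    """Pack the sign of each hyperplane projection into one integer bucket."""
    bucket = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bucket = (bucket << 1) | (1 if dot >= 0.0 else 0)
    return bucket


def content_hash(vector):
    """SHA-256 over the little-endian float32 encoding (assumed canonical form)."""
    payload = struct.pack(f"<{len(vector)}f", *vector)
    return hashlib.sha256(payload).hexdigest()


def value_id(vector, hyperplanes):
    """Hypothetical ValueID: (LSH_bucket, content_hash)."""
    return (lsh_bucket(vector, hyperplanes), content_hash(vector))


planes = make_hyperplanes(dim=4, n_bits=16, seed=42)
v = [0.1, -0.4, 0.8, 0.3]
w = [2 * x for x in v]  # same direction, different magnitude

same = value_id(v, planes) == value_id(list(v), planes)  # identical content
shared_bucket = value_id(v, planes)[0] == value_id(w, planes)[0]
```

Identical vectors produce identical pairs, which is the exact-match deduplication case. Positive rescaling never changes a projection's sign, so `v` and `w` land in the same bucket with different content hashes. Vectors that are merely close in angle share a bucket only with some probability, which depends on the number of bits.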

Phase 2 — Unified Storage

Extend ValueStore with embedding-type values: compute the LSH bucket automatically on insert, answer approximate nearest neighbor queries by filtering on LSH buckets, and refine the surviving candidates with exact distance computation.
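
The three steps can be sketched with a toy in-memory store. This is an illustration under assumptions, not SkeinDB's ValueStore API: the class `EmbeddingStore`, its method names, sign-based LSH, SHA-256, Euclidean distance, and probing buckets at Hamming distance 1 are all hypothetical choices.

```python
import hashlib
import math
import random
import struct
from collections import defaultdict


class EmbeddingStore:
    """Toy stand-in for a ValueStore holding embedding values (hypothetical API)."""

    def __init__(self, dim, n_bits=8, seed=0):
        rng = random.Random(seed)
        self.n_bits = n_bits
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(n_bits)]
        self.buckets = defaultdict(dict)  # bucket -> {content_hash: vector}

    def _bucket(self, vec):
        bits = 0
        for plane in self.planes:
            dot = sum(v * p for v, p in zip(vec, plane))
            bits = (bits << 1) | int(dot >= 0.0)
        return bits

    def insert(self, vec):
        """Compute the LSH bucket on insert; identical vectors are stored once."""
        bucket = self._bucket(vec)
        digest = hashlib.sha256(struct.pack(f"<{len(vec)}f", *vec)).hexdigest()
        self.buckets[bucket].setdefault(digest, list(vec))
        return (bucket, digest)

    def _candidate_buckets(self, bucket, radius):
        """Yield the query bucket, plus its one-bit neighbours if radius >= 1."""
        yield bucket
        if radius >= 1:
            for i in range(self.n_bits):
                yield bucket ^ (1 << i)

    def nearest(self, query, k=1, radius=1):
        """Filter by bucket, then refine candidates with exact Euclidean distance."""
        candidates = []
        for b in self._candidate_buckets(self._bucket(query), radius):
            for digest, vec in self.buckets.get(b, {}).items():
                candidates.append(((b, digest), math.dist(query, vec)))
        candidates.sort(key=lambda item: item[1])
        return candidates[:k]


store = EmbeddingStore(dim=3, n_bits=8, seed=1)
a = [0.5, -1.0, 0.25]
id_near = store.insert([2 * x for x in a])
id_dup = store.insert([2 * x for x in a])   # deduplicated: same ValueID
id_far = store.insert([4 * x for x in a])
hits = store.nearest(a, k=1, radius=0)      # closest stored value to `a`
```

Here `hits[0][0]` equals `id_near`, because all three vectors point in the same direction and therefore share a bucket. A larger `radius` trades extra bucket reads for better recall on vectors that fall just across a hyperplane.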

Phase 3 — Hybrid Queries

SkeinQL extensions combining SQL predicates with vector similarity: SELECT * FROM docs WHERE category = 'tech' ORDER BY embedding <-> query_vector LIMIT 10.
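
The intended semantics of that query can be sketched in a few lines. This assumes `<->` denotes Euclidean distance, as it does in pgvector; the proposal does not define the operator. The sample rows and the `hybrid_query` helper are hypothetical, and the ordering here is exact brute force rather than an approximate index.

```python
import math

# Hypothetical rows of a `docs` table with a two-dimensional embedding column.
docs = [
    {"id": 1, "category": "tech", "embedding": [0.9, 0.1]},
    {"id": 2, "category": "tech", "embedding": [0.2, 0.8]},
    {"id": 3, "category": "food", "embedding": [1.0, 0.0]},
    {"id": 4, "category": "tech", "embedding": [0.7, 0.3]},
]


def hybrid_query(rows, predicate, query_vector, limit):
    """Apply the SQL-style predicate first, then order by distance and truncate."""
    matching = [row for row in rows if predicate(row)]
    matching.sort(key=lambda row: math.dist(row["embedding"], query_vector))
    return matching[:limit]


# WHERE category = 'tech' ORDER BY embedding <-> query_vector LIMIT 2
result = hybrid_query(docs, lambda r: r["category"] == "tech", [1.0, 0.0], limit=2)
print([row["id"] for row in result])  # prints [1, 4]
```

Row 3 is the closest vector overall but is excluded by the predicate, which is the behavior a hybrid query has to preserve when the distance ordering is served from an approximate index instead.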

Phase 4 — Dependency Tracking

Extend dependency tracking to embedding relationships. If document D has embedding E, queries on E are invalidated when D changes, which keeps embedding-based results fresh without manual intervention.
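
One way to picture the invalidation rule is as a walk over a dependency graph. The sketch below assumes dependencies are stored as directed edges from a source to the values derived from it. The `DependencyTracker` class and the node labels are hypothetical and say nothing about how SkeinDB's existing tracker is implemented.

```python
from collections import defaultdict


class DependencyTracker:
    """Toy dependency graph: an edge source -> derived means 'derived depends on source'."""

    def __init__(self):
        self.dependents = defaultdict(set)
        self.stale = set()

    def add_dependency(self, source, derived):
        self.dependents[source].add(derived)

    def invalidate(self, node):
        """Mark everything transitively derived from `node` as stale."""
        stack = [node]
        while stack:
            current = stack.pop()
            for child in self.dependents.get(current, ()):
                if child not in self.stale:
                    self.stale.add(child)
                    stack.append(child)
        return self.stale


tracker = DependencyTracker()
tracker.add_dependency("doc:D", "embedding:E")
tracker.add_dependency("embedding:E", "query:nearest-to-E")
tracker.add_dependency("doc:other", "embedding:F")

tracker.invalidate("doc:D")  # D changed
```

After the call, `tracker.stale` contains `embedding:E` and `query:nearest-to-E`, while `embedding:F` is untouched because its source document did not change.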

🧪 Hypotheses

H1

LSH extends ValueID semantics to support "similar" lookups while preserving exact-match deduplication for identical vectors.

H2

Co-locating embeddings with source data in a unified ValueStore reduces the impedance mismatch introduced by separate vector databases.

H3

Dependency tracking extends to embedding-based queries for cache invalidation when source data changes.

🔗 SkeinDB Integration

ValueID Store
LSH Hashing
SkeinQL RPC
Dependency Tracking
LSM / Compaction
