All Research Tracks
R01 Β· Storage & Query Optimization

Learned Index Structures for ValueID Lookup

SkeinDB's ValueID-first encoding creates a unique opportunity for learned indexes. Traditional B-tree and hash indexes treat keys as opaque byte sequences, missing optimization opportunities when keys have exploitable structure. ValueIDs, being content-derived, exhibit distribution patterns that could be predicted by ML models, potentially replacing or augmenting traditional index structures with O(1) lookup and minimal space overhead.

Research Proposal β€” Mapped to backlog in docs/RESEARCH_BACKLOG.md

πŸ”¬ What's Novel

πŸ”§ Technical Approach

Phase 1 β€” Distribution Analysis

Instrument SkeinDB to collect ValueID distributions across diverse workloads (web apps, document stores, time-series). Analyze entropy, clustering, and temporal patterns to determine learnability.

Phase 2 β€” Model Selection

Evaluate learned index architectures (RMI, PGM-index, ALEX) for ValueID lookup. Implement recursive model indexes with piece-wise linear functions. Compare accuracy and latency.

Phase 3 β€” Hybrid Design

Learned models for the common case + B-tree segments for outliers. Online model updates during LSM compaction. Graceful degradation when model accuracy drops below threshold.

Phase 4 β€” Integration

Integrate into SkeinDB's ValueStore, replacing or augmenting existing lookup structures. Measure throughput, latency percentiles, and space savings in end-to-end benchmarks.

πŸ§ͺ Hypotheses

H1

Content-derived ValueIDs create predictable distribution patterns exploitable by learned models for sub-logarithmic lookups.

H2

Hybrid indexes (learned + traditional fallback) achieve better space-time tradeoffs than pure B-trees for content-addressed keys.

H3

Deduplication property creates clustering effects that improve learned index accuracy versus arbitrary key distributions.

πŸ”— SkeinDB Integration

ValueID Store
SkeinQL RPC
LSM / Compaction
Dependency Tracking
Hash-Chained WAL

πŸ“š Key References