R01 — Learned Index Structures for ValueID Lookup

Research Proposal — Mapped to backlog in docs/RESEARCH_BACKLOG.md

🔬 What's Novel

First application of learned indexes to content-addressed storage systems
Characterization of ValueID distributions and their learnability across diverse workloads
Hybrid learned/traditional index design optimized for LSM compaction workflows
Deduplication-induced clustering effects that improve learned index accuracy vs. arbitrary key distributions

🔧 Technical Approach

Phase 1 — Distribution Analysis

Instrument SkeinDB to collect ValueID distributions across diverse workloads (web apps, document stores, time-series). Analyze entropy, clustering, and temporal patterns to determine learnability.

Phase 2 — Model Selection

Evaluate learned index architectures (RMI, PGM-index, ALEX) for ValueID lookup. Implement recursive model indexes with piece-wise linear functions. Compare accuracy and latency.

Phase 3 — Hybrid Design

Learned models for the common case + B-tree segments for outliers. Online model updates during LSM compaction. Graceful degradation when model accuracy drops below threshold.

Phase 4 — Integration

Integrate into SkeinDB's ValueStore, replacing or augmenting existing lookup structures. Measure throughput, latency percentiles, and space savings in end-to-end benchmarks.

🧪 Hypotheses

Content-derived ValueIDs create predictable distribution patterns exploitable by learned models for sub-logarithmic lookups.

Hybrid indexes (learned + traditional fallback) achieve better space-time tradeoffs than pure B-trees for content-addressed keys.

Deduplication property creates clustering effects that improve learned index accuracy versus arbitrary key distributions.

🔗 SkeinDB Integration

ValueID Store

SkeinQL RPC

LSM / Compaction

Dependency Tracking

Hash-Chained WAL

📚 Key References

Kraska et al. — "The Case for Learned Index Structures" (2018)
Ferragina & Vinciguerra — "The PGM-index: a fully-dynamic compressed learned index" (2020)
Ding et al. — "ALEX: An Updatable Adaptive Learned Index" (2020)

R02 — Adaptive Row/Column Execution →