π¬ What's Novel
- First application of learned indexes to content-addressed storage systems
- Characterization of ValueID distributions and their learnability across diverse workloads
- Hybrid learned/traditional index design optimized for LSM compaction workflows
- Deduplication-induced clustering effects that improve learned index accuracy vs. arbitrary key distributions
π§ Technical Approach
Phase 1 β Distribution Analysis
Instrument SkeinDB to collect ValueID distributions across diverse workloads (web apps, document stores, time-series). Analyze entropy, clustering, and temporal patterns to determine learnability.
Phase 2 β Model Selection
Evaluate learned index architectures (RMI, PGM-index, ALEX) for ValueID lookup. Implement recursive model indexes with piece-wise linear functions. Compare accuracy and latency.
Phase 3 β Hybrid Design
Learned models for the common case + B-tree segments for outliers. Online model updates during LSM compaction. Graceful degradation when model accuracy drops below threshold.
Phase 4 β Integration
Integrate into SkeinDB's ValueStore, replacing or augmenting existing lookup structures. Measure throughput, latency percentiles, and space savings in end-to-end benchmarks.
π§ͺ Hypotheses
Content-derived ValueIDs create predictable distribution patterns exploitable by learned models for sub-logarithmic lookups.
Hybrid indexes (learned + traditional fallback) achieve better space-time tradeoffs than pure B-trees for content-addressed keys.
Deduplication property creates clustering effects that improve learned index accuracy versus arbitrary key distributions.
π SkeinDB Integration
π Key References
- Kraska et al. β "The Case for Learned Index Structures" (2018)
- Ferragina & Vinciguerra β "The PGM-index: a fully-dynamic compressed learned index" (2020)
- Ding et al. β "ALEX: An Updatable Adaptive Learned Index" (2020)