What's Novel
- Formal cost model for column snapshot materialization in LSM-based systems
- Integration of columnar storage with fine-grained dependency tracking for cache invalidation
- Online adaptive materialization decisions without offline workload analysis
- Empirical study of row-column tradeoffs using SkeinDB's unified storage model
Technical Approach
Phase 1: Cost Model
Formalize the cost of column snapshot creation (scan cost, storage overhead) vs. benefit (reduced I/O for projections). Include compaction interaction costs to model the full lifecycle.
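As a rough illustration of the shape such a model could take, the sketch below adds up the cost terms named above (scan, storage, compaction interaction) and compares them with the I/O saved on projection queries. Every field name, the abstract I/O unit, and the linear form are assumptions made for this example; they are not SkeinDB's actual model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotCost:
    """Hypothetical inputs for one candidate column snapshot, in abstract I/O units."""
    scan_cost: float           # one-time cost to scan rows and build the snapshot
    storage_overhead: float    # cost of holding the extra columnar copy over the horizon
    compaction_cost: float     # extra cost per compaction that touches the snapshot
    compactions: int           # expected number of such compactions over the horizon
    io_saved_per_query: float  # I/O avoided per projection query served by the snapshot
    queries: int               # expected number of such queries over the horizon


def total_cost(c: SnapshotCost) -> float:
    """Creation cost plus lifecycle (compaction) cost, assuming a simple linear model."""
    return c.scan_cost + c.storage_overhead + c.compaction_cost * c.compactions


def total_benefit(c: SnapshotCost) -> float:
    """I/O saved across all projection queries expected over the horizon."""
    return c.io_saved_per_query * c.queries


def net_benefit(c: SnapshotCost) -> float:
    """Positive values suggest the snapshot is worth materializing."""
    return total_benefit(c) - total_cost(c)


candidate = SnapshotCost(
    scan_cost=100.0,
    storage_overhead=20.0,
    compaction_cost=5.0,
    compactions=4,
    io_saved_per_query=2.0,
    queries=100,
)
print(net_benefit(candidate))  # cost 140.0, benefit 200.0
```

A real model would likely replace the fixed per-compaction constant with a term that depends on LSM level and write rate; the linear form is only a starting point.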
Phase 2: Pattern Detection
Analyze query patterns to identify frequently accessed column subsets, scan-heavy queries that benefit from a columnar format, and temporal access patterns that inform adaptive thresholds.
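The simplest piece of this phase, finding frequently accessed column subsets, can be sketched as a frequency count over a query log. The log format (one list of column names per query), the function name, and the `min_count` threshold are hypothetical; scan-heaviness and temporal patterns are not modeled here.

```python
from collections import Counter
from typing import FrozenSet, Iterable, List, Tuple


def frequent_column_subsets(
    queries: Iterable[Iterable[str]], min_count: int
) -> List[Tuple[FrozenSet[str], int]]:
    """Return column subsets projected by at least `min_count` queries, hottest first."""
    # frozenset makes the subset order-insensitive: (a, b) and (b, a) count together.
    counts = Counter(frozenset(cols) for cols in queries)
    hot = [(cols, n) for cols, n in counts.items() if n >= min_count]
    # Sort by descending count, then by column names for a deterministic order.
    hot.sort(key=lambda item: (-item[1], sorted(item[0])))
    return hot


# Hypothetical query log: each entry lists the columns one query projected.
log = [["a", "b"], ["b", "a"], ["c"], ["a", "b"], ["c"], ["d"]]
hot = frequent_column_subsets(log, min_count=2)
for cols, n in hot:
    print(sorted(cols), n)
```

Exact-subset counting treats `{a, b}` and `{a, b, c}` as unrelated; a fuller analysis might also credit a snapshot for queries that project any subset of its columns.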
Phase 3: Dependency Integration
Extend dependency tracking to column granularity. Mark affected column snapshots for incremental refresh or invalidation on row-level updates, avoiding full recomputation.
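One way to picture column-granularity tracking is a reverse index from column name to the snapshots built over that column. The sketch below is a minimal in-memory version; the class, its method names, and the snapshot identifiers are invented for illustration and do not reflect SkeinDB's existing dependency tracker.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class ColumnDependencyTracker:
    """Hypothetical reverse index: column name -> ids of snapshots that include it."""

    def __init__(self) -> None:
        self._snapshots_by_column: Dict[str, Set[str]] = defaultdict(set)
        self.stale: Set[str] = set()  # snapshots awaiting refresh or invalidation

    def register(self, snapshot_id: str, columns: Iterable[str]) -> None:
        """Record that a snapshot depends on each of the given columns."""
        for col in columns:
            self._snapshots_by_column[col].add(snapshot_id)

    def on_row_update(self, changed_columns: Iterable[str]) -> Set[str]:
        """Mark only the snapshots that cover a changed column; return that set."""
        affected: Set[str] = set()
        for col in changed_columns:
            # .get avoids creating empty entries for columns nobody depends on.
            affected |= self._snapshots_by_column.get(col, set())
        self.stale |= affected
        return affected


tracker = ColumnDependencyTracker()
tracker.register("snap_price", ["price", "qty"])
tracker.register("snap_name", ["name"])
affected = tracker.on_row_update(["qty"])
print(affected)  # only the snapshot covering "qty" is marked
```

Snapshots that do not include a changed column stay valid, which is the saving over invalidating every snapshot on the table whenever a row changes.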
Phase 4: Adaptive Materialization
Run a continuous controller that evaluates materialization decisions against recent query patterns, resource availability, and cost-benefit thresholds. No offline workload analysis is required.
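A toy version of such a controller, under heavy simplifying assumptions, is sketched below. It keeps a sliding window of recent queries, treats a column subset as worthwhile when its windowed savings exceed a fixed build cost, and caps the number of snapshots as a stand-in for resource availability. The class name, parameters, and decision rule are all hypothetical.

```python
from collections import Counter, deque
from typing import Deque, FrozenSet, Iterable, Set


class AdaptiveMaterializer:
    """Hypothetical online controller driven only by a window of recent queries."""

    def __init__(
        self, window: int, build_cost: float, saving_per_query: float, max_snapshots: int
    ) -> None:
        self._recent: Deque[FrozenSet[str]] = deque(maxlen=window)  # old queries fall off
        self._build_cost = build_cost
        self._saving = saving_per_query
        self._max_snapshots = max_snapshots  # crude proxy for resource availability
        self.materialized: Set[FrozenSet[str]] = set()

    def observe(self, columns: Iterable[str]) -> None:
        """Record one query's projected columns and re-evaluate decisions."""
        self._recent.append(frozenset(columns))
        self._reevaluate()

    def _reevaluate(self) -> None:
        counts = Counter(self._recent)
        # Keep subsets whose savings within the window exceed the build cost.
        worthwhile = [
            (n, cols) for cols, n in counts.items() if n * self._saving > self._build_cost
        ]
        # Hottest first; ties broken by column names for determinism.
        worthwhile.sort(key=lambda item: (-item[0], sorted(item[1])))
        self.materialized = {cols for _, cols in worthwhile[: self._max_snapshots]}


controller = AdaptiveMaterializer(
    window=4, build_cost=3.0, saving_per_query=2.0, max_snapshots=1
)
for cols in [["a"], ["a"], ["b"], ["b"], ["b"]]:
    controller.observe(cols)
print(controller.materialized)  # the workload shifted from "a" to "b"
```

Because decisions depend only on the window, the controller follows workload shifts without an offline pass. A practical design would also need hysteresis so that snapshots are not dropped and rebuilt when counts hover near the threshold.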
Hypotheses
Query pattern analysis can predict which column projections offer the highest benefit-to-cost ratio for materialization.
Dependency tracking can maintain column snapshot consistency with minimal overhead compared to full invalidation.
Adaptive materialization decisions can be made online without offline workload analysis, adapting to changing access patterns.
SkeinDB Integration
Key References
- Arulraj et al., "Bridging the Archipelago between Row-Stores and Column-Stores for Hybrid Workloads" (2016)
- Pavlo et al., "Self-Driving Database Management Systems" (2017)