🔬 What's Novel
- First application of LLMs to query parameterization classification in databases
- Labeled dataset of parameterizable vs. semantic-constant literals for training/evaluation
- Integration framework for LLM-assisted query processing within a database engine
- Analysis of accuracy-latency-cost tradeoffs for LLM-based classification vs. rule-based approaches
🔧 Technical Approach
Phase 1 — Dataset Construction
Collect query corpora from open-source applications. Manually label each literal as "parameterizable" (a user-supplied value that varies across executions) or "semantic constant" (an enum, status code, or type discriminator that is part of the query's meaning).
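A minimal sketch of what a labeled record might look like; the field names and example queries are assumptions, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass
class LabeledLiteral:
    query: str    # full SQL text containing the literal
    literal: str  # the literal as it appears in the query
    column: str   # column the literal is compared against
    label: str    # "parameterizable" or "semantic_constant"

# The same query can contribute both kinds of labels: user_id varies per
# execution, while status = 'shipped' selects a fixed enum state.
examples = [
    LabeledLiteral(
        query="SELECT * FROM orders WHERE user_id = 42 AND status = 'shipped'",
        literal="42", column="user_id", label="parameterizable"),
    LabeledLiteral(
        query="SELECT * FROM orders WHERE user_id = 42 AND status = 'shipped'",
        literal="'shipped'", column="status", label="semantic_constant"),
]
```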
Phase 2 — LLM Classification
Evaluate zero-shot (GPT-4/Claude), few-shot, and fine-tuned smaller models. Input context includes query text, schema definition, column names, and value statistics.
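For the zero-shot setting, the input context above could be assembled into a prompt along these lines (a sketch; the exact wording and field layout are assumptions):

```python
def build_prompt(query: str, literal: str, column: str,
                 schema: str, value_stats: str) -> str:
    """Assemble classification context for a single literal."""
    return (
        "Classify the SQL literal as PARAMETERIZABLE (a user-supplied value)\n"
        "or SEMANTIC_CONSTANT (an enum, status code, or type discriminator).\n\n"
        f"Schema: {schema}\n"
        f"Query: {query}\n"
        f"Literal: {literal} (column: {column})\n"
        f"Column value statistics: {value_stats}\n"
        "Answer with one word."
    )
```

The few-shot variant would prepend labeled examples from Phase 1; fine-tuned models would consume the same fields as structured input.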
Phase 3 — Integration
Classify literals in the MySQL compatibility layer. Cache classification results per query fingerprint. Fall back to rule-based classification when LLM confidence is low.
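The caching and fallback logic can be sketched as follows; `llm_classify`, `rule_classify`, and the 0.8 threshold are hypothetical stand-ins, not part of the actual engine:

```python
_cache: dict[str, str] = {}   # query fingerprint -> label
CONFIDENCE_THRESHOLD = 0.8    # assumed cutoff for trusting the LLM

def classify(fingerprint: str, literal: str, llm_classify, rule_classify) -> str:
    """Classify a literal, consulting the per-fingerprint cache first."""
    if fingerprint in _cache:
        return _cache[fingerprint]          # no LLM call on repeat queries
    label, confidence = llm_classify(literal)
    if confidence < CONFIDENCE_THRESHOLD:
        label = rule_classify(literal)      # conservative rule-based fallback
    _cache[fingerprint] = label
    return label
```

Caching by fingerprint keeps the LLM off the hot path: each distinct query shape pays the classification cost once.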
Phase 4 — Feedback Loop
Plan cache misses trigger reclassification with additional context. Over time, the system builds a corpus of per-application parameterization knowledge.
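One way the feedback loop might look; `reclassify` and the knowledge-base shape are illustrative assumptions:

```python
def on_plan_cache_miss(fingerprint: str, extra_context: dict,
                       reclassify, knowledge_base: dict) -> str:
    """A miss triggers reclassification with added context; the result is
    appended to a per-application knowledge base keyed by fingerprint."""
    label = reclassify(fingerprint, extra_context)
    knowledge_base.setdefault(fingerprint, []).append(label)
    return label
```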
🧪 Hypotheses
LLMs can classify literals with high accuracy based on context (column names, query structure, and value patterns).
Semantic parameterization achieves higher plan cache hit rates than purely syntactic parameterization.
Fine-tuned small models achieve acceptable latency for online classification in the query path.
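The second hypothesis rests on the difference between the two parameterization styles, illustrated below; the regex and the hand-chosen "keep `'shipped'`" decision are for illustration only (the classifier, not a human, would decide which literals to keep):

```python
import re

QUERY = "SELECT * FROM orders WHERE user_id = 42 AND status = 'shipped'"

# Purely syntactic parameterization replaces every literal, so queries that
# differ only in user_id share one fingerprint, but the optimizer loses the
# selectivity information carried by status = 'shipped'.
syntactic = re.sub(r"'[^']*'|\b\d+\b", "?", QUERY)

# Semantic parameterization keeps the semantic constant, so all user_ids
# still collapse to one cached plan specialized for the 'shipped' state.
semantic = "SELECT * FROM orders WHERE user_id = ? AND status = 'shipped'"
```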
🔗 SkeinDB Integration
📚 Key References
- Trummer — "From BERT to GPT-3 Codex: Harnessing the Potential of Very Large Language Models for Data Management" (2022)
- Sun et al. — "SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL" (2023)