🔬 What's Novel
- First application of LLMs to query parameterization classification in databases
- Labeled dataset of parameterizable vs. semantic-constant literals for training/evaluation
- Integration framework for LLM-assisted query processing within a database engine
- Analysis of accuracy-latency-cost tradeoffs for LLM-based classification vs. rule-based approaches
🔧 Technical Approach
Phase 1 — Dataset Construction
Collect query corpora from open-source applications. Manually label each literal as "parameterizable" (a user-supplied value that varies across executions) or "semantic constant" (an enum, status code, or type discriminator that is part of the query's meaning).
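A minimal sketch of what a labeled record might look like; the field names and example queries are assumptions, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass
class LabeledLiteral:
    query: str    # full SQL text containing the literal
    literal: str  # the literal as it appears in the query
    column: str   # column the literal is compared against
    label: str    # "parameterizable" or "semantic_constant"

# The same query can contribute both kinds of labels: user_id varies per
# execution, while status = 'shipped' selects a fixed enum state.
examples = [
    LabeledLiteral(
        query="SELECT * FROM orders WHERE user_id = 42 AND status = 'shipped'",
        literal="42", column="user_id", label="parameterizable"),
    LabeledLiteral(
        query="SELECT * FROM orders WHERE user_id = 42 AND status = 'shipped'",
        literal="'shipped'", column="status", label="semantic_constant"),
]
```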
Phase 2 — LLM Classification
Evaluate zero-shot (GPT-4/Claude), few-shot, and fine-tuned smaller models. Input context includes query text, schema definition, column names, and value statistics.
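For the zero-shot setting, the input context above could be assembled into a prompt along these lines (a sketch; the exact wording and field layout are assumptions):

```python
def build_prompt(query: str, literal: str, column: str,
                 schema: str, value_stats: str) -> str:
    """Assemble classification context for a single literal."""
    return (
        "Classify the SQL literal as PARAMETERIZABLE (a user-supplied value)\n"
        "or SEMANTIC_CONSTANT (an enum, status code, or type discriminator).\n\n"
        f"Schema: {schema}\n"
        f"Query: {query}\n"
        f"Literal: {literal} (column: {column})\n"
        f"Column value statistics: {value_stats}\n"
        "Answer with one word."
    )
```

The few-shot variant would prepend labeled examples from Phase 1; fine-tuned models would consume the same fields as structured input.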
Phase 3 — Integration
Classify literals in the MySQL compatibility layer. Cache classification results per query fingerprint. Fall back to rule-based classification when LLM confidence is low.
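The caching and fallback logic can be sketched as follows; `llm_classify`, `rule_classify`, and the 0.8 threshold are hypothetical stand-ins, not part of the actual engine:

```python
_cache: dict[str, str] = {}   # query fingerprint -> label
CONFIDENCE_THRESHOLD = 0.8    # assumed cutoff for trusting the LLM

def classify(fingerprint: str, literal: str, llm_classify, rule_classify) -> str:
    """Classify a literal, consulting the per-fingerprint cache first."""
    if fingerprint in _cache:
        return _cache[fingerprint]          # no LLM call on repeat queries
    label, confidence = llm_classify(literal)
    if confidence < CONFIDENCE_THRESHOLD:
        label = rule_classify(literal)      # conservative rule-based fallback
    _cache[fingerprint] = label
    return label
```

Caching by fingerprint keeps the LLM off the hot path: each distinct query shape pays the classification cost once.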
Phase 4 — Feedback Loop
Plan cache misses trigger reclassification with additional context. Over time, the system builds a corpus of per-application parameterization knowledge.
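One way the feedback loop might look; `reclassify` and the knowledge-base shape are illustrative assumptions:

```python
def on_plan_cache_miss(fingerprint: str, extra_context: dict,
                       reclassify, knowledge_base: dict) -> str:
    """A miss triggers reclassification with added context; the result is
    appended to a per-application knowledge base keyed by fingerprint."""
    label = reclassify(fingerprint, extra_context)
    knowledge_base.setdefault(fingerprint, []).append(label)
    return label
```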
🧪 Hypotheses
LLMs can classify literals with high accuracy based on context (column names, query structure, and value patterns).
Semantic parameterization achieves higher plan cache hit rates than purely syntactic parameterization.
Fine-tuned small models achieve acceptable latency for online classification in the query path.
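The second hypothesis rests on the difference between the two parameterization styles, illustrated below; the regex and the hand-chosen "keep `'shipped'`" decision are for illustration only (the classifier, not a human, would decide which literals to keep):

```python
import re

QUERY = "SELECT * FROM orders WHERE user_id = 42 AND status = 'shipped'"

# Purely syntactic parameterization replaces every literal, so queries that
# differ only in user_id share one fingerprint, but the optimizer loses the
# selectivity information carried by status = 'shipped'.
syntactic = re.sub(r"'[^']*'|\b\d+\b", "?", QUERY)

# Semantic parameterization keeps the semantic constant, so all user_ids
# still collapse to one cached plan specialized for the 'shipped' state.
semantic = "SELECT * FROM orders WHERE user_id = ? AND status = 'shipped'"
```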
🔗 SkeinDB Integration
📚 Key References
- Trummer — "From BERT to GPT-3 Codex: Harnessing the Potential of Very Large Language Models for Data Management" (2022)
- Sun et al. — "SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL" (2023)