R12 — Natural Language to SkeinQL with Verification

Research Proposal — Mapped to backlog in docs/RESEARCH_BACKLOG.md

🔬 What's Novel

Natural language interface to a structured JSON-RPC database API (not SQL)
Dependency-tracking-based query explanation generation for verifying AI-generated queries
Human-in-the-loop verification protocol showing explanations + sample results before execution
Empirical comparison of structured (SkeinQL) vs. SQL-based natural language interfaces

🔧 Technical Approach

Phase 1 — NL-to-SkeinQL Translation

LLM translation with schema context, worked examples, and SkeinQL documentation in prompts. Structured JSON-RPC output is easier to validate than free-form SQL.

Phase 2 — Explanation Generation

Dependency tracking generates natural language explanations: "This query will return rows from table X where column Y matches Z, touching these dependencies…"

Phase 3 — Verification Protocol

UI showing: generated query explanation in plain English, sample results (dry run on data subset), user actions to confirm, modify, or reject before full execution.

Phase 4 — Refinement Loop

Iterative refinement where user feedback is incorporated into subsequent generation attempts. Conversation history provides additional context for disambiguation.

🧪 Hypotheses

LLMs generate more accurate queries in SkeinQL's structured JSON-RPC format than in free-form SQL.

Dependency tracking generates explanations that help users verify query intent without understanding SkeinQL syntax.

Iterative refinement with dependency-based feedback converges to correct queries faster than direct SQL editing.

🔗 SkeinDB Integration

SkeinQL RPC

Dependency Tracking

Web Admin (SkeinAdmin)

Schema Introspection

LLM Gateway

📚 Key References

Scholak et al. — "PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models" (2021)
Rajkumar et al. — "Evaluating the Text-to-SQL Capabilities of Large Language Models" (2022)

← R11 — Autoparameterization R13 — Causal ETag Consistency →