SQL autoparameterization and query fingerprints¶
Status: Draft Last updated: 2026-01-17
Many applications (especially web apps) send large numbers of ad-hoc SQL statements that differ only in literal values. This increases parse/plan overhead, fragments telemetry, and prevents effective caching/coalescing.
This document specifies "autoparameterization" for SkeinDB's SQL/MySQL compatibility layer: - compute a normalized statement form (fingerprint/digest) - extract parameters (typed) - optionally route execution through the plan cache keyed by fingerprint
The design is inspired by existing normalized query tracking systems.
1. Goals¶
- Improve performance by reusing parse/plan work for repetitive statements.
- Provide stable query identifiers for telemetry, query coalescing, and ETag caching.
- Preserve MySQL protocol compatibility.
Non-goals: - Full SQL server optimizer equivalence. - Autoparameterizing statements where it changes semantics.
2. Normalization¶
Given a parsed SQL AST, compute a canonical form: - keywords lowercased - whitespace removed - identifiers normalized as they appear (no renaming) - constants replaced with parameters (?, $1...) depending on output format
Example: - Input: SELECT * FROM users WHERE id = 123 AND status = 'active' - Normalized: select * from users where id = ? and status = ?
3. Parameter extraction¶
From the AST, collect literals in left-to-right order and produce typed parameters: - int/float/string/bytes/date/time - NULL retains type "null"
4. Safety rules¶
Do NOT parameterize: - identifiers (table/column names) - ORDER BY direction (ASC/DESC) - LIMIT/OFFSET when used to control optimizer behavior (policy choice)
Parameterize cautiously: - IN (...) lists: optionally parameterize but cap list length - LIKE patterns: safe to parameterize, but may affect index usage; acceptable
5. Plan cache integration¶
- Key: (normalized_sql, schema_version, session_flags)
- Value: prepared plan / SkeinIR + physical plan
Sessions may opt in using: - SET @@skein.autoparameterize = 1
6. SkeinQL surfaces¶
- telemetry includes fingerprint and counts
- stats.top_queries groups by fingerprint
- query.prepare may accept sql text and return query_id + parameter schema
7. Observability¶
Expose: - autoparam.enabled - autoparam.hit_rate - autoparam.cache_entries
8. Testing¶
- normalization is stable across whitespace and literal changes
- execution results identical with and without autoparam (for supported statements)
Research extension: LLM-assisted semantic autoparameterization¶
Baseline autoparameterization is syntactic: it replaces literals with parameters.
The research agenda proposes a semantic classifier that can identify literals that should remain constants (enums, status codes, type tags) to improve plan quality and avoid poor grouping.
See: docs/research_agenda/R11_llm-assisted-query-autoparameterization.md.
Adaptation sketch: - Introduce a label taxonomy: - parameterize (default) - semantic_constant - unknown (defer) - Make the classifier pluggable: offline rules first; LLM-assisted mode as optional. - Cache classification results by query fingerprint and schema version. - Add feedback: frequent plan-cache misses or regressions trigger reclassification.
Prototype RPC:
- ai.autoparam.classifiers returns the supported classifier catalog. The first implementation is offline_rules_v1, a deterministic offline classifier using explicit rules, schema hints, and column-name heuristics.
- ai.autoparam.label_schema returns the versioned label schema, confidence bounds, and cache-key policy for each decision.
- ai.autoparam.classify accepts a list of literal contexts + optional rules/classifier selection and returns labels (see docs/SKEINQL.md).
- ai.autoparam.analyze accepts SQL text and returns normalized SQL, fingerprint, literals, and labels, with the same optional classifier selection.
- ai.autoparam.feedback records plan-cache-miss feedback, re-runs the selected classifier, and accumulates per-fingerprint miss/reclassification counts.
- ai.autoparam.metrics reports plan-cache hit rate beside classifier invocation/latency counters and feedback totals.