Observability and Server Statistics¶
Status: Baseline implemented, per-fingerprint histogram drill-down, CDC pressure telemetry, basic operator alerts, settings-backed alert routing, and stats.snapshot-driven HTTP(S) webhook delivery landed; standalone escalation automation is still evolving
Last updated: 2026-05-27
This document defines the observability surface of SkeinDB: - server load and resource stats - query and storage statistics - background task visibility (compaction, snapshots, replication) - metrics export
The goal is to make SkeinDB operable at higher loads without external tooling, while still integrating cleanly with standard monitoring stacks.
1) Principles¶
-
Low overhead by default - counters should be cheap (atomic increments) - sampling is used for expensive stats (latency histograms)
-
Two audiences - humans (SkeinAdmin dashboards) - machines (metrics endpoints / exporters)
-
Useful out of the box - no Prometheus required to see basic health - but Prometheus-compatible metrics should be available
2) Stats data model¶
2.1 Instantaneous system snapshot¶
A stats.snapshot should include:
- process uptime
- CPU usage (process and/or system)
- memory RSS, heap
- open file descriptors / handles
- disk usage for data directory
- network bytes in/out
2.2 Database workload snapshot¶
- active sessions
- active queries
- QPS (queries/sec) recent window
- TPS (transactions/sec)
- commit rate
- average latency plus
p50/p95/p99latency over the recent query sample window - cache hit rates (row cache, value cache)
- coalescing hit rate (if enabled)
- in-flight query groups
- autoparameterization hit rate (if enabled)
- plan cache entries (if enabled)
- index advisor: suggestions_pending (if enabled)
Current runtime surface in stats.snapshot.query:
- tracked_calls
- fingerprints
- recent_samples
- slow_count
- avg_latency_ms
- latency_ms.p50
- latency_ms.p95
- latency_ms.p99
- etag_hits
- coalesced
Current synthesized alert surface in stats.snapshot.alerts:
- overall status
- summary.critical
- summary.warning
- summary.total
- optional routing.configured
- optional routing.routed_alerts
- optional routing.matched_routes
- optional routing.delivery.delivered
- optional routing.delivery.suppressed
- optional routing.delivery.failed
- optional routing.delivery.unsupported
- per-item code, severity, component, panel, title, summary, and action
- optional per-item routes[].id
- optional per-item routes[].targets
- optional per-item routes[].delivery.delivered
- optional per-item routes[].delivery.suppressed
- optional per-item routes[].delivery.failed
- optional per-item routes[].delivery.unsupported
Current routing config surface:
- settings.set { "observability.alert_routes": [...] }
- each route can declare id, min_severity, components, panels, codes, and targets
- stats.snapshot.alerts resolves matching routes against the current synthesized alerts
- http:// and https:// targets receive a JSON POST the first time a matching alert is observed in stats.snapshot
- identical active alerts are suppressed on subsequent stats.snapshot calls until the alert clears
- non-HTTP targets are still annotated in the snapshot but are not delivered externally yet
Current drill-down surface in stats.query_fingerprint_latency:
- bounded recent per-fingerprint latency samples
- p50_ms, p95_ms, and p99_ms
- millisecond histogram buckets with overflow_count
- optional filtering by exact fingerprint for operator drill-down
2.3 Storage engine snapshot¶
- WAL size and append rate
- checkpoint age
- compaction state (running? stage?)
- compaction throughput
- compaction queue length + stall/backpressure events
- LSM levels sizes
- ValueStore:
- unique values
- logical bytes vs physical bytes
- dedup ratio
- encryption mode and encrypted objects count (if enabled)
- delta chain depth stats (if DELTA enabled)
History/replay (if enabled): - oldest retained commit_ts - retained history bytes - replay exports/imports + verify failures
CDC (if enabled): - active subscriptions - max lag (producer_offset - consumer_offset)
Current runtime surface in stats.snapshot.cdc:
- active_subscriptions
- table_subscriptions
- query_subscriptions
- total_lag
- max_lag
- paused_subscriptions
- pressured_subscriptions
- throttle_recommended_subscriptions
- warn_lag
- throttle_lag
- min_remaining_until_resnapshot
- resnapshot_subscriptions
- earliest_offset
- latest_offset
- retained_events
- retention_limit
- dropped_events_total
2.4 Cluster snapshot (if enabled)¶
- node role (primary/replica/router)
- replication lag (LSN difference)
- missing object fetch stats (CAS replication)
- CAS object hit rate and bytes_saved
- shard placement summary
3) Export surfaces¶
3.1 SkeinQL methods¶
stats.snapshot-> JSON summary for dashboardsstats.top_queries-> top by total time / p95 / rowsstats.query_fingerprint_latency-> per-fingerprint p50/p95/p99 plus bounded latency histogram bucketsstats.slow_queries-> recent slow query logstats.storage-> compaction + disk + dedupstats.cluster-> node + shard view
3.2 HTTP metrics endpoint (Prometheus-style)¶
Optional endpoint:
- GET /metrics
Design notes: - keep metric names stable - include labels: - db, table, node_id, shard_id, role
4) Query statistics¶
4.1 Query fingerprinting¶
- normalize query text (or SkeinIR form)
- compute
query_fingerprint(hash) - track:
- count
- total_time
- rows_returned
- bytes_returned
4.2 Sampling¶
To keep overhead low: - always count - sample timing at a configurable rate (e.g., 1/100)
5) UI requirements (SkeinAdmin)¶
The "Server Load & Stats" section should include:
1) Overview - CPU, RAM, disk, network - QPS/TPS - active sessions
2) Latency - average latency plus p50/p95/p99 over the recent query sample window - slow query list
3) Storage - WAL growth - compaction progress - dedup ratio - snapshot sizes
4) Cluster (if enabled) - node list with health - replication lag - shard placement
Delivered baseline in SkeinAdmin:
- overview cards for cache, average latency, and p95 tail latency
- overview Operational Alerts card synthesized from current query, CDC, and compaction telemetry
- active-session summary text with average, p95, and p99 latency
- telemetry panel table for per-fingerprint latency histograms with p95/p99 drill-down
- CDC panel summary cards for runtime subscription counts, lag, retention horizon, and dropped-event totals
Delivered operator routing baseline:
- persisted alert-route configuration via observability.alert_routes
- per-alert matched route IDs/targets in stats.snapshot.alerts.items[*].routes
- top-level route summary counters in stats.snapshot.alerts.routing
- HTTP(S) route targets receive JSON POST delivery once per active alert while stats.snapshot is being evaluated
- broader escalation policies, retries, and non-HTTP sinks remain backlog work
6) Backlog¶
- ST01: Implement basic counters +
stats.snapshot - ST02: Implement query fingerprint store +
stats.top_queries - ST03: Implement slow query log + UI
- ST04: Implement
/metrics - ST05: Expose storage stats (compaction, dedup, delta)
- ST06: Expose cluster stats (lag, shard placement)