Status: Draft v0.3 (v0.2 compatible)
Last updated: 2026-05-27
This document defines SkeinDB's on-disk storage layout and record formats.
All formats MUST be versioned. Any breaking change requires a format version bump.
Design goals:
- Append-only segments
- Crash safety with WAL + checkpoint
- MVCC row versioning
- Optional deduplicated ValueStore (content-addressed)
- Simple compaction + GC suitable for single-binary deployments
1) Directory layout
data/
  MANIFEST.log
  MANIFEST.snapshot (optional)
  snapshots.json (prototype snapshot metadata, format v1)
  dp_budgets.json (prototype DP budgets, format v1)
  dp_audit.json (prototype DP audit log, format v1)
  oblivious_policies.json (prototype oblivious policy store, format v1)
  forensic_chain.json (prototype forensic hash chain, format v3)
  wasm_catalog.json (prototype Wasm UDF catalog, format v1)
  merge_policies.json (prototype merge policies, format v1)
  merge_wasm_registry.json (prototype merge wasm registry, format v1)
  views.json (prototype materialized views, format v2)
  schema_versions.json (prototype schema versions, format v1)
  schema_changes.json (prototype schema change log, format v2)
  schema_flags.json (prototype schema flags, format v1)
  changes.json (prototype retained CDC change log, format v2)
  cdc_subscriptions.json (prototype CDC subscription cursors, format v8)
  advisor_patterns.json (prototype index advisor patterns, format v1)
  advisor_history.json (prototype index advisor history, format v2)
  security_state.json (security principals + API tokens, format v1)
  tables/
    /.json (prototype row store, format v3)
    /.rseg (prototype row segment container, format v1)
    /.sidx.json (prototype secondary index cache, format v1)
  wal/
    wal-000001.log
  rows/
    rows-000001.rseg
  vals/
    vals-000001.vseg
  idx/
    rowdir-L0-000001.run
    valdir-L0-000002.run
    pk_-L0-000003.run
  tmp/
2) Common encodings
2.1 Endianness
- Fixed-width integers in headers and record bodies are LITTLE-ENDIAN.
2.2 VarU (ULEB128)
- Variable-length unsigned integer encoding for u64.
- 7 bits payload per byte, MSB continuation.
2.3 Bytes and String
- Bytes: VarU length + N bytes
- String: VarU length + UTF-8 bytes
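The VarU, Bytes, and String encodings above can be sketched in a few lines of Python. This is a minimal illustration, not the skeindb-core implementation; the function names are hypothetical, and the 10-byte limit follows from the u64 domain stated in 2.2.

```python
def encode_varu(value: int) -> bytes:
    """Encode an unsigned 64-bit integer as ULEB128 (7 payload bits per byte)."""
    if not 0 <= value < 1 << 64:
        raise ValueError("VarU is defined for u64 values only")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # MSB set: more bytes follow
        else:
            out.append(low7)         # MSB clear: last byte
            return bytes(out)


def decode_varu(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode a VarU starting at buf[pos]; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated VarU")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
        if shift > 63:
            raise ValueError("VarU longer than 10 bytes")
    if value >= 1 << 64:
        raise ValueError("VarU overflows u64")
    return value, pos


def encode_bytes(data: bytes) -> bytes:
    """Bytes: VarU length followed by the raw bytes."""
    return encode_varu(len(data)) + data


def decode_bytes(buf: bytes, pos: int = 0) -> tuple[bytes, int]:
    """Decode a Bytes field; return (data, next_pos)."""
    length, pos = decode_varu(buf, pos)
    end = pos + length
    if end > len(buf):
        raise ValueError("truncated Bytes field")
    return buf[pos:end], end


def encode_string(text: str) -> bytes:
    """String: VarU byte length followed by UTF-8 bytes."""
    return encode_bytes(text.encode("utf-8"))
```

Note that the String length prefix counts UTF-8 bytes, not characters, so `encode_string("né")` carries a length of 3.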
2.4 Checksums
- CRC32C over record payload bytes (not including len/crc).
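For reference, a table-driven CRC32C can be written without third-party packages. The sketch below assumes the standard CRC-32C (Castagnoli) parameters: reflected polynomial 0x82F63B78, initial value 0xFFFFFFFF, and a final XOR of 0xFFFFFFFF. The spec does not spell these out, so treat them as an assumption; the function name is hypothetical.

```python
def _make_table() -> list[int]:
    """Build the 256-entry lookup table for the reflected Castagnoli polynomial."""
    poly = 0x82F63B78
    table = []
    for n in range(256):
        crc = n
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes, crc: int = 0) -> int:
    """Return the CRC32C of data; pass a previous result as crc to continue a stream."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF
```

The well-known check value for the ASCII string "123456789" is 0xE3069283, which is a quick way to confirm that an implementation uses the same parameters.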
2.5 ValueID
- ValueID = BLAKE3-128(value_bytes) (16 bytes)
- On lookup, verify bytes equality to eliminate collision risk.
3) File header
FileHeader (64 bytes)
magic[8] = ASCII "SKNDB\0\1"
file_kind = u8 (1=wal,2=rowseg,3=valseg,4=run,5=manifest)
endian = u8 (1=little)
header_len = u16 (64)
format_ver = u32 (1)
file_id = u32 (segment/run id)
created_unix_s = u64
reserved[32] = bytes (0)
header_crc32c = u32 (CRC32C over bytes 0..59)
4) Record framing
RecordFrame:
len u32 (LE)
crc32c u32 (LE)
payload [len] bytes
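A RecordFrame writer and a tolerant reader can be sketched as follows. The reader stops at the first frame that is truncated or fails its checksum, which matches the torn-tail behaviour described for WAL recovery in section 9; whether other file kinds treat a bad frame the same way is not stated here. Function names are hypothetical, and the checksum helper assumes standard CRC-32C parameters.

```python
import struct

_FRAME_HEAD = struct.Struct("<II")  # len u32 LE, crc32c u32 LE


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli); assumes the standard init/xor-out values."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF


def encode_frame(payload: bytes) -> bytes:
    """Wrap payload as len + crc32c(payload) + payload."""
    return _FRAME_HEAD.pack(len(payload), crc32c(payload)) + payload


def read_frames(buf: bytes, pos: int = 0) -> tuple[list[tuple[int, bytes]], int]:
    """Return ([(frame_offset, payload), ...], end_of_valid_data)."""
    frames = []
    while pos + _FRAME_HEAD.size <= len(buf):
        length, crc = _FRAME_HEAD.unpack_from(buf, pos)
        start = pos + _FRAME_HEAD.size
        end = start + length
        # A frame that runs past the buffer or fails its CRC ends the scan.
        if end > len(buf) or crc32c(buf[start:end]) != crc:
            break
        frames.append((pos, buf[start:end]))
        pos = end
    return frames, pos
```

The returned offsets point at the start of each frame, which is the same convention a FilePtr uses when it refers to a record.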
Each MANIFEST record is wrapped in a RecordFrame. The payload body starts
with a rec_type tag, then uses VarU fields.
ManifestRecordV1 payloads:
rec_type = 0x01 (AddFile)
file_kind u8 (1=wal,2=rowseg,3=valseg,4=run)
file_id VarU (u32 domain)
level VarU (u32 domain)
rec_type = 0x02 (RemoveFile)
file_kind u8
file_id VarU (u32 domain)
rec_type = 0x03 (SetCurrentVersion)
version VarU (u64)
rec_type = 0x04 (SetLastLsn)
lsn VarU (u64)
rec_type = 0x05 (CleanShutdown)
unix_s VarU (u64)
Replay semantics:
- AddFile adds or updates a live file entry (kind,file_id) with its level.
- RemoveFile deletes that entry from the live set.
- SetCurrentVersion, SetLastLsn, and CleanShutdown update their corresponding scalar state fields.
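Replaying the manifest amounts to folding the decoded records into a small state object. The sketch below decodes ManifestRecordV1 payloads (already unwrapped from their RecordFrames) and applies the semantics described above. It assumes rec_type is a single byte, which the text implies but does not state; the state field names are hypothetical, and the compact VarU decoder omits overflow checks.

```python
def decode_varu(buf: bytes, pos: int) -> tuple[int, int]:
    """Minimal ULEB128 decoder; raises IndexError on truncated input."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


_SCALARS = {0x03: "current_version", 0x04: "last_lsn", 0x05: "clean_shutdown"}


def decode_manifest_record(payload: bytes) -> tuple:
    """Decode one payload into ('add', kind, id, level), ('remove', kind, id),
    or (scalar_name, value)."""
    if not payload:
        raise ValueError("empty manifest payload")
    rec_type, pos = payload[0], 1
    if rec_type == 0x01:    # AddFile
        kind = payload[pos]
        file_id, pos = decode_varu(payload, pos + 1)
        level, pos = decode_varu(payload, pos)
        rec = ("add", kind, file_id, level)
    elif rec_type == 0x02:  # RemoveFile
        kind = payload[pos]
        file_id, pos = decode_varu(payload, pos + 1)
        rec = ("remove", kind, file_id)
    elif rec_type in _SCALARS:
        value, pos = decode_varu(payload, pos)
        rec = (_SCALARS[rec_type], value)
    else:
        raise ValueError(f"unknown manifest rec_type {rec_type:#x}")
    if pos != len(payload):
        raise ValueError("trailing bytes in manifest record")
    return rec


def replay(payloads) -> dict:
    """Fold manifest payloads, in log order, into the live state."""
    state = {"files": {}, "current_version": None,
             "last_lsn": None, "clean_shutdown": None}
    for payload in payloads:
        rec = decode_manifest_record(payload)
        if rec[0] == "add":
            state["files"][(rec[1], rec[2])] = rec[3]   # add or update level
        elif rec[0] == "remove":
            state["files"].pop((rec[1], rec[2]), None)  # drop from live set
        else:
            state[rec[0]] = rec[1]
    return state
```

Keying the live set by (file_kind, file_id) lets a later AddFile for the same pair act as an update, for example when a run moves to a different level.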
5) Pointers
FilePtr (12 bytes):
file_id u32
offset u64
6) Row segments (.rseg)
RV1 record payload:
RV1
rec_type u8 = 0x10
rec_ver u8 = 1
flags u16
table_id u32
row_id u64
begin_ts u64 // commit_ts; 0 allowed only in WAL staging
end_ts u64 // 0 means +INF
prev_ptr FilePtr // previous row version (or 0/0)
group_count VarU
repeated group_count:
group_id VarU
group_ref_kind u8 (0=inline_small, 1=value_id_ref)
if kind=0:
group_bytes Bytes // GroupObject bytes (GO1)
if kind=1:
group_vid[16] // ValueID of a GROUP in value store
Flags:
- bit0 IS_DELETE
Current skeindb-core implementation status for T014:
- .rseg files are append-only and use FileHeader(file_kind=RowSeg) followed by RecordFrame-wrapped RV1 payloads.
- RowSegmentWriter::append returns a FilePtr { file_id, offset } pointing at the start of the emitted RecordFrame, so callers can chain prev_ptr for MVCC version histories.
- RowGroupRef is encoded as group_ref_kind=0 (inline bytes) or group_ref_kind=1 (16-byte ValueID).
- RowSegmentReader supports sequential full scans (read_all) and random-access lookup by offset (read_at); decode strictly rejects unknown record types/versions, unknown group ref kinds, and trailing bytes.
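To make the RV1 layout concrete, here is a sketch of an encoder and a strict decoder for the payload (without the surrounding RecordFrame). It is an illustration under the field order listed above, not the skeindb-core code; the Python names and the tuple shape used for groups are hypothetical.

```python
import struct

# rec_type u8, rec_ver u8, flags u16, table_id u32, row_id u64,
# begin_ts u64, end_ts u64, prev_ptr (file_id u32, offset u64) -> 44 bytes
_FIXED = struct.Struct("<BBHIQQQIQ")
IS_DELETE = 0x0001  # flags bit0


def encode_varu(value: int) -> bytes:
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def decode_varu(buf: bytes, pos: int) -> tuple[int, int]:
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def encode_rv1(table_id, row_id, begin_ts, end_ts, prev_ptr, groups, flags=0):
    """groups: list of (group_id, kind, data); kind 0 = inline, 1 = ValueID."""
    out = bytearray(_FIXED.pack(0x10, 1, flags, table_id, row_id,
                                begin_ts, end_ts, *prev_ptr))
    out += encode_varu(len(groups))
    for group_id, kind, data in groups:
        out += encode_varu(group_id)
        out.append(kind)
        if kind == 0:
            out += encode_varu(len(data)) + data  # Bytes: VarU length + bytes
        elif kind == 1 and len(data) == 16:
            out += data                            # raw 16-byte ValueID
        else:
            raise ValueError("bad group reference")
    return bytes(out)


def decode_rv1(payload: bytes) -> dict:
    """Strict decode: rejects unknown type/version, unknown kinds, trailing bytes."""
    (rec_type, rec_ver, flags, table_id, row_id,
     begin_ts, end_ts, file_id, offset) = _FIXED.unpack_from(payload, 0)
    if (rec_type, rec_ver) != (0x10, 1):
        raise ValueError("not an RV1 record")
    count, pos = decode_varu(payload, _FIXED.size)
    groups = []
    for _ in range(count):
        group_id, pos = decode_varu(payload, pos)
        kind, pos = payload[pos], pos + 1
        if kind == 0:
            size, pos = decode_varu(payload, pos)
        elif kind == 1:
            size = 16
        else:
            raise ValueError("unknown group_ref_kind")
        data = payload[pos:pos + size]
        if len(data) != size:
            raise ValueError("truncated group")
        pos += size
        groups.append((group_id, kind, data))
    if pos != len(payload):
        raise ValueError("trailing bytes")
    return {"flags": flags, "table_id": table_id, "row_id": row_id,
            "begin_ts": begin_ts, "end_ts": end_ts,
            "prev_ptr": (file_id, offset), "groups": groups}
```

A first version of a row would pass `prev_ptr=(0, 0)`, and each later version would pass the FilePtr returned when the previous version was appended.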
7) Value segments (.vseg)
VE1 record payload:
VE1
rec_type u8 = 0x20
rec_ver u8 = 1
val_kind u8 (1=CELL, 2=GROUP, 3=BLOB_CHUNK, 4=BLOB_MANIFEST, 5=DELTA, 6=EMBEDDING)
codec u8 (0=RAW, 1=ZSTD)
value_id[16]
raw_len VarU
raw_bytes Bytes-or-compressed
GroupObject bytes GO1:
- See v0.1 GO1 spec (GroupObject is the dedup unit for a group of columns)
Current skeindb-core implementation status for T012:
- .vseg files are append-only and use FileHeader(file_kind=ValSeg) followed by framed VE1 records.
- ValueSegmentWriter / ValueSegmentReader support codec=0 (RAW) and codec=1 (ZSTD).
- The writer stores ZSTD only when it produces a smaller payload; otherwise it falls back to RAW.
- DELTA entries persist DELTA1 inside raw_bytes; skip patches are runtime-only metadata and are rebuilt lazily rather than stored on disk.
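The writer's codec fallback rule is simple enough to show in isolation. In the sketch below the compressor is passed in as a function so the example stays free of third-party packages; a real writer would supply a ZSTD compressor. The function name is hypothetical.

```python
from typing import Callable

CODEC_RAW = 0
CODEC_ZSTD = 1


def choose_codec(raw: bytes, compress: Callable[[bytes], bytes]) -> tuple[int, bytes]:
    """Return (codec, stored_bytes): compressed only when strictly smaller."""
    packed = compress(raw)
    if len(packed) < len(raw):
        return CODEC_ZSTD, packed
    return CODEC_RAW, raw
```

Short or already-compressed values typically grow under compression because of framing overhead, so they fall through to RAW, and readers only need to decompress when codec is 1.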
8) Sorted runs (.run)
A .run is an immutable sorted key/value table (SSTable-like), used for:
- rowdir: row_id -> FilePtr
- valdir: value_id -> FilePtr
- primary/secondary indexes
DataBlock payload:
block_type u8 = 0x40
block_ver u8 = 1
entry_count VarU
repeated entry_count:
key Bytes
value Bytes
IndexBlock payload:
block_type u8 = 0x41
block_ver u8 = 1
block_count VarU
repeated:
first_key Bytes
block_offset u64
Footer:
footer_magic[8] = "SKNRUN\0\1"
index_offset u64
file_crc32c u32 (optional)
Current skeindb-core implementation status for T013:
- .run files are immutable and use FileHeader(file_kind=Run) followed by one or more DataBlocks, a single IndexBlock, and a footer.
- RunWriter requires strictly increasing keys and splits data blocks by a configurable target block size.
- RunReader supports full scans and point lookups by binary-searching the loaded index block.
- SimpleLsm provides a minimal memtable + level0 implementation: puts land in a BTreeMap, flushes create new run-######.run files, and reads consult level0 runs newest-first.
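The block-splitting and point-lookup behaviour can be illustrated with an in-memory model. The sketch below keeps blocks as Python lists and uses list positions where the on-disk IndexBlock stores block_offset; the size accounting (key length plus value length) and all function names are assumptions made for the example.

```python
import bisect


def build_blocks(items, target_block_size: int) -> list[list[tuple[bytes, bytes]]]:
    """Split strictly increasing (key, value) pairs into data blocks."""
    blocks, current, size, prev = [], [], 0, None
    for key, value in items:
        if prev is not None and key <= prev:
            raise ValueError("keys must be strictly increasing")
        prev = key
        current.append((key, value))
        size += len(key) + len(value)
        if size >= target_block_size:  # close the block once the target is reached
            blocks.append(current)
            current, size = [], 0
    if current:
        blocks.append(current)
    return blocks


def build_index(blocks) -> list[bytes]:
    """One first_key per block, in block order."""
    return [block[0][0] for block in blocks]


def point_lookup(blocks, index, key: bytes):
    """Binary-search the index for the candidate block, then search inside it."""
    i = bisect.bisect_right(index, key) - 1  # last block whose first_key <= key
    if i < 0:
        return None
    keys = [k for k, _ in blocks[i]]
    j = bisect.bisect_left(keys, key)
    if j < len(keys) and keys[j] == key:
        return blocks[i][j][1]
    return None
```

Because keys are strictly increasing across the whole run, the block whose first_key is the greatest one not exceeding the search key is the only block that can contain it.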
8.1 RowDir runs (T015)
RowDir (crates/skeindb-core/src/rowdir.rs) reuses the .run format to
persist the row_id → FilePtr head-pointer directory with no new on-disk
format:
- Keys are 8-byte big-endian row_id, so ascending byte order matches ascending numeric order.
- Values are [tag:u8][payload]:
  tag = 0: live entry; payload is the 12-byte FilePtr (file_id:u32 LE, offset:u64 LE) pointing at the head of the row's version chain in a .rseg file.
  tag = 1: tombstone; no payload.
- On load_from_run, live entries upsert the in-memory map and tombstones remove the entry entirely, so newer-generation tombstones shadow older-generation live entries during merges.
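The key and value encodings for RowDir entries, and the load semantics, can be sketched as below. The helper names are hypothetical, and the sketch assumes runs are applied oldest generation first so that newer entries overwrite or remove older ones.

```python
import struct

TOMBSTONE = b"\x01"  # tag = 1, no payload


def encode_key(row_id: int) -> bytes:
    """8-byte big-endian key so byte order equals numeric order."""
    return struct.pack(">Q", row_id)


def encode_live(file_id: int, offset: int) -> bytes:
    """tag = 0 followed by the 12-byte little-endian FilePtr."""
    return b"\x00" + struct.pack("<IQ", file_id, offset)


def decode_value(value: bytes):
    """Return (file_id, offset) for a live entry, or None for a tombstone."""
    if len(value) == 13 and value[0] == 0:
        return struct.unpack("<IQ", value[1:])
    if value == TOMBSTONE:
        return None
    raise ValueError("malformed RowDir value")


def apply_run(directory: dict, entries) -> None:
    """Apply one run's (key, value) entries to the in-memory row_id -> FilePtr map."""
    for key, value in entries:
        (row_id,) = struct.unpack(">Q", key)
        ptr = decode_value(value)
        if ptr is None:
            directory.pop(row_id, None)  # tombstone removes the entry entirely
        else:
            directory[row_id] = ptr      # live entry upserts
```

The big-endian choice matters for sorted runs: `encode_key(1) < encode_key(256)` holds bytewise, whereas the little-endian encodings of the same numbers would sort in the opposite order.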
8.2 MVCC visibility (T016)
Current skeindb-core visibility rules over .rseg chains:
- Readers walk prev_ptr from a known head FilePtr until they find the first version visible at the chosen snapshot.
- A version is visible iff begin_ts != 0 && begin_ts <= snapshot_ts && (end_ts == 0 || end_ts > snapshot_ts).
- Snapshot::latest() behaves like reading at +INF, so the current head version wins when it is committed and not superseded.
- begin_ts == 0 means staged / not yet committed and is skipped, allowing a previous committed version in the chain to remain visible.
- If the first visible version is a delete marker (flags & IS_DELETE != 0), the row is considered deleted at that snapshot.
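The visibility rules translate into a short chain walk. In this sketch an in-memory `prev` reference stands in for following prev_ptr into a .rseg file, and the largest u64 value stands in for the +INF timestamp used by Snapshot::latest(); both are simplifications made for the example, and the Python names are hypothetical.

```python
from typing import NamedTuple, Optional

IS_DELETE = 0x0001
LATEST = (1 << 64) - 1  # stand-in for reading at +INF


class Version(NamedTuple):
    begin_ts: int
    end_ts: int                  # 0 means +INF
    flags: int
    prev: Optional["Version"]    # stand-in for prev_ptr


def is_visible(v: Version, snapshot_ts: int) -> bool:
    """Committed at or before the snapshot and not yet superseded at it."""
    return (v.begin_ts != 0
            and v.begin_ts <= snapshot_ts
            and (v.end_ts == 0 or v.end_ts > snapshot_ts))


def read_row(head: Optional[Version], snapshot_ts: int) -> Optional[Version]:
    """Return the visible version, or None if the row is absent or deleted."""
    v = head
    while v is not None:
        if is_visible(v, snapshot_ts):
            return None if v.flags & IS_DELETE else v
        v = v.prev  # staged or out-of-range versions are skipped
    return None
```

A staged head (begin_ts == 0) never satisfies the predicate, so readers fall through to the previous committed version without any special case.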
9) WAL (.log)
WALHeader prefix for all WAL records:
WALHeader
rec_type u8
rec_ver u8
flags u16
lsn u64
txn_id u64
Commit rule:
- A txn is committed iff a valid COMMIT_TXN record exists.
- Recovery replays only committed txns in LSN order.
Current v1 WAL body types:
rec_type = 0x01 (BEGIN_TXN)
- no body payload
rec_type = 0x02 (MUTATION)
op_bytes = Bytes (VarU length + opaque payload)
rec_type = 0x03 (COMMIT_TXN)
- no body payload
rec_type = 0x04 (ABORT_TXN)
- no body payload
Recovery rules for the v1 body vocabulary:
- A txn is replayable only if a valid COMMIT_TXN was decoded for that txn_id.
- ABORT_TXN discards any staged mutations for that txn_id.
- A torn or corrupt tail stops the scan at the last valid frame; recovery keeps
all earlier valid committed txns and discards the invalid tail bytes.
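The recovery rules can be sketched as a single pass over the decoded records. The input is assumed to be the sequence of records that survived frame validation, in log order, so the torn-tail rule has already been applied. Ordering the replayed mutations by their own LSN is one reading of "LSN order" and is an assumption of this sketch, as are the function name and tuple shape.

```python
BEGIN_TXN, MUTATION, COMMIT_TXN, ABORT_TXN = 0x01, 0x02, 0x03, 0x04


def recover(records) -> list[bytes]:
    """records: iterable of (rec_type, lsn, txn_id, op_bytes) in log order.

    Returns the op_bytes of committed transactions, ordered by LSN.
    """
    staged: dict[int, list[tuple[int, bytes]]] = {}
    replay: list[tuple[int, bytes]] = []
    for rec_type, lsn, txn_id, op_bytes in records:
        if rec_type == BEGIN_TXN:
            staged[txn_id] = []
        elif rec_type == MUTATION:
            staged.setdefault(txn_id, []).append((lsn, op_bytes))
        elif rec_type == COMMIT_TXN:
            replay.extend(staged.pop(txn_id, []))  # txn becomes replayable
        elif rec_type == ABORT_TXN:
            staged.pop(txn_id, None)               # discard staged mutations
        else:
            raise ValueError(f"unknown WAL rec_type {rec_type:#x}")
    # Anything still staged never saw a COMMIT_TXN and is dropped.
    replay.sort(key=lambda item: item[0])
    return [op for _, op in replay]
```

Transactions that were still open when the log ended are treated the same as aborted ones: their mutations never leave the staging map.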
10) Compaction and GC
- Compute safe_ts = oldest_active_snapshot_ts.
- Row compaction discards versions with end_ts != 0 and end_ts < safe_ts (end_ts == 0 means +INF, so such versions are always kept).
- Value GC is mark-and-sweep driven by live row versions.
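The three steps above can be sketched over simple in-memory tuples. The sketch treats end_ts == 0 as +INF, as defined for RV1 records, and requires at least one active snapshot because the behaviour with none is not specified here. The tuple shape and function names are hypothetical.

```python
def compact_versions(versions, active_snapshot_ts):
    """versions: list of (begin_ts, end_ts, value_ids).

    Keeps every version that some active snapshot could still see.
    """
    safe_ts = min(active_snapshot_ts)  # oldest active snapshot
    return [v for v in versions if v[1] == 0 or v[1] >= safe_ts]


def sweep_values(stored_value_ids: set, live_versions) -> set:
    """Mark every ValueID referenced by a live version; return the unmarked rest."""
    marked = set()
    for _, _, value_ids in live_versions:
        marked.update(value_ids)
    return stored_value_ids - marked
```

Running the sweep after row compaction means values referenced only by discarded versions become garbage in the same cycle.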
11) Prototype JSON metadata files
These JSON files are optional and may be ignored by older binaries.
Each file includes a format_version field; unknown versions should be ignored unless the file's compatibility notes say otherwise.
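A loader that follows this rule might look like the sketch below. It treats a missing file and an unrecognized format_version the same way, returning None so the caller can fall back to empty state. The function name is hypothetical, and malformed JSON is left to raise, since the handling of corrupt files is not described here.

```python
import json


def load_optional(path: str, supported_versions: set):
    """Return the parsed document, or None if the file is missing or its
    format_version is not one the caller understands."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            doc = json.load(fh)
    except FileNotFoundError:
        return None
    if not isinstance(doc, dict):
        return None
    version = doc.get("format_version")
    if not isinstance(version, int) or version not in supported_versions:
        return None
    return doc
```

A caller that still accepts an older layout would pass every version it can read, for example `load_optional(path, {1, 2})`, and upgrade the document in memory after loading.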
11.1 security_state.json
Persisted security principals for the HTTP/API bearer surface and protocol-level DB logins.
Format:
{
"format_version": 1,
"next_api_token_id": 3,
"api_tokens": [
{
"token_id": "tok_0000000000000001",
"secret_sha256": "4a44dc15364204a80fe80e9039455cc1...",
"role": "admin",
"label": "ci",
"created_at_ms": 1730000000000,
"expires_at_ms": 0
}
],
"db_users": [
{
"username": "alice",
"role": "read_write",
"created_at_ms": 1730000000000,
"grants": {
"app": ["SELECT", "INSERT"]
},
"password_sha1": "cbfdac6008f9cab4083784cbd1874f76618d2a97",
"password_sha256": "fcf730b6d95236ecd3c9fc2d92d7b6b2..."
}
]
}
Compatibility notes:
- Added in v0.3 as an optional metadata file.
- Raw API token secrets are not stored on disk; only secret_sha256 is persisted.
- Managed DB users persist digests (password_sha1 for MySQL native-password verification and password_sha256 for cleartext-password verification), never raw passwords.
- If the file is missing or has an unknown format_version, the server starts with no managed API tokens or DB users.
11.2 forensic_chain.json
Prototype forensic chain persistence used by maintenance.audit_status,
maintenance.audit_verify, and skeindb audit-verify.
Format:
{
"format_version": 3,
"next_id": 4,
"records": [
{
"id": 1,
"ts_ms": 1730000000000,
"db": "app",
"table": "logs",
"op": "insert",
"pk": [{"t":"u64","v":1}],
"change_seq": 1,
"prev_hash": "genesis",
"hash": "8b9d..."
}
],
"checkpoint_anchors": [
{
"checkpoint_id": "ckpt_1730000001000",
"ts_ms": 1730000001000,
"chain_len": 1,
"chain_head_hash": "8b9d...",
"change_seq": 1
}
],
"last_verified_ms": 1730000002000
}
Compatibility notes:
- Format v3 persists last_verified_ms so successful verification survives reopen.
- Older v1/v2 files load with last_verified_ms = 0.
- The prototype chain remains a stand-in for the future WAL-backed verifier.
11.3 merge_wasm_registry.json
Format:
{
"format_version": 1,
"modules": [
{
"module_id": "merge_sum",
"value_id": "deadbeef...",
"size_bytes": 1234,
"capabilities": {
"values_only": true,
"deterministic": true,
"max_fuel": 1000,
"max_memory_bytes": 65536,
"max_output_bytes": 4096
},
"name": "sum merge",
"wasm_b64": "AA==",
"created_at_ms": 1730000000000
}
]
}
Compatibility notes:
- Added in v0.2 as an optional metadata file.
- If the file is missing or has an unknown format_version, it is ignored.
11.4 views.json
Materialized-view state for view.create/drop/refresh/evaluate/status/explain_deps.
Format:
{
"format_version": 2,
"views": [
{
"db": "app",
"name": "city_scores",
"query": {"body": {"select": {"projection": [], "from": []}}, "with": [], "order_by": []},
"columns": ["city", "cnt", "total"],
"pk_columns": ["city"],
"deps": [
{
"db": "app",
"table": "users",
"columns": ["city", "score"],
"projection_columns": ["city", "score"],
"predicate_columns": [],
"group_by_columns": ["city"]
}
],
"rows": [
{
"pk": [{"t": "str", "v": "Oslo"}],
"values": [
{"t": "str", "v": "Oslo"},
{"t": "u64", "v": 2},
{"t": "f64", "v": 12.0}
]
}
],
"source_rows": [
{
"pk": [{"t": "u64", "v": 1}],
"row": {
"id": {"t": "u64", "v": 1},
"city": {"t": "str", "v": "Oslo"},
"score": {"t": "u64", "v": 5}
}
}
],
"last_refresh_ms": 1730000000000,
"last_change_seq": 42,
"stale": false,
"last_refresh_mode": "incremental"
}
]
}
Compatibility notes:
- Format v2 persists dependency-usage breakdown (projection_columns, predicate_columns, group_by_columns) per dependency plus grouped-view source_rows shadow state used by incremental maintenance.
- v0.3.15 adds the read-only view.evaluate oracle/benchmark and compatibility catalogs without changing this on-disk format.
- Older format v1 files still load; missing dependency-usage arrays are rebuilt from the stored query on load, and source_rows defaults to empty.
- If the file is missing or has an unknown format_version, it is ignored.
11.5 schema_changes.json
Prototype persisted schema-evolution proposals for schema.propose_change,
schema.merge_status, and schema.apply_merge.
Format:
{
"format_version": 2,
"next_id": 3,
"changes": [
{
"id": "sch_1",
"table": {"db": "app", "table": "users"},
"base_version": 1,
"changes": [
{"op": "add_column", "name": "region", "type": {"kind": "str"}, "nullable": true, "auto_increment": false},
{"op": "add_index", "name": "region_lookup", "columns": ["region"], "unique": false}
],
"message": "roll out region",
"created_at_ms": 1730000000000,
"status": "pending"
}
]
}
Compatibility notes:
- Format v2 is current because persisted schema changes may now include add_index operations as well as add_column.
- Persisted schema-change status values now include pending, applied, and rejected; deterministic losers are marked rejected during schema.apply_merge without changing the file shape.
- Legacy format v1 files are still accepted on load and are rewritten to format v2 on the next persist.
- Missing files mean there are no pending schema-change proposals.
- Unknown format_version values are ignored by the current loader.
11.6 schema_flags.json
Prototype schema metadata for opt-in execution hints that should survive reopen.
Format:
{
"format_version": 1,
"tables": [
{
"db": "app",
"table": "users",
"interned_columns": ["email", "city"]
}
]
}
Compatibility notes:
- Added in v0.3 as an optional metadata file for Phase 15 T150.
- Missing files mean no interned-column flags are active.
- Unknown format_version values are ignored by the current loader.
- Column names are normalized against the live catalog on load; dropped or renamed columns are pruned from the file on the next persist.
11.7 wasm_catalog.json
Prototype metadata catalog for general Wasm UDF modules. Module bytes are not
embedded here; they live in the ValueStore and are referenced by value_id.
Format:
{
"format_version": 1,
"modules": [
{
"module_id": "math_abs",
"name": "math abs",
"kind": "scalar",
"abi": "skein.wasm.udf.v1",
"entrypoint": "skein_scalar",
"value_id": "0123abcd0123abcd0123abcd0123abcd",
"size_bytes": 1234,
"capabilities": {
"allowed_hostcalls": ["log.debug"],
"allowed_tables": [
{ "db": "app", "table": "users", "read": true, "write": false }
],
"deterministic": true,
"max_fuel": 1000,
"max_memory_bytes": 65536,
"max_output_bytes": 4096
},
"created_at_ms": 1730000000000
}
]
}
Compatibility notes:
- Added in v0.3 as an optional metadata file for Phase 8 T080.
- The catalog stores only typed metadata plus value_id; the Wasm bytes are
stored separately in .vseg data managed by ValueStore.
- Unknown format_version values are rejected by the current core loader.