mirror of https://github.com/Kpa-clawbot/meshcore-analyzer.git synced 2026-07-02 20:51:37 +00:00

Files

T

Kpa-clawbot e438451dc9 feat(preflight): hard-fail gate on sync schema migrations + async runner (#1541 )

Closes the recurring "sync migration on large table" regression class
(#791-style, #1483-style).

## Problem

Pattern that keeps repeating:

1. A perf/feature PR adds `CREATE INDEX` / `ALTER TABLE` / `UPDATE ...
WHERE` in a migration file (typically `cmd/ingestor/db.go`).
2. Local dev DB has ~100 rows. Migration returns in milliseconds. CI is
green.
3. Reviewers approve on plan correctness; nobody knows what the prod
table size is.
4. First prod boot at scale (Cascadia: ~2600 nodes, 80K+ obs; previous
prod: 1.9M+ obs) pins the ingestor at `[migration] Adding index...` for
minutes.
5. Healthcheck times out → container restart → loop. Operator pages.
Hotfix.

Most recent case: `obs_observer_ts_idx_v1` in v3.8.3 — release notes
already document an "expect a longer first boot" warning because we knew
it would hit prod hard.

## What this PR adds

**Async helper (`cmd/ingestor/async_migration.go`):**
- `Store.RunAsyncMigration(ctx, name, fn)` — registers the migration as
`pending_async` in a new `_async_migrations` bookkeeping table, returns
to caller immediately, schedules `fn` in a goroutine on the shared
backfill `WaitGroup`, transitions to `done` (or `failed` with error
captured) on completion.
- `Store.AsyncMigrationStatus(name)` and
`Store.WaitForAsyncMigrations()` for tests/shutdown.
- Idempotent: `done` rows short-circuit; `pending_async`/`failed` rows
are retried on next boot.

**Retroactive #1483 conversion (`cmd/ingestor/db.go`):**
- `obs_observer_ts_idx_v1` (the composite `(observer_idx, timestamp)`
index build on `observations`) is now scheduled via `RunAsyncMigration`
from `OpenStore()` so the ingestor accepts packets immediately while the
index builds in the background.
- Legacy `_migrations` gate is preserved by the async fn → DBs that
already completed the sync build stay no-op.

**Annotation convention (`MIGRATIONS.md`):**
Every new `CREATE INDEX` / `ALTER TABLE` / data-rewrite in a migration
file must do ONE of:
1. Run via `Store.RunAsyncMigration(...)` (preferred for backfills).
2. Carry a `// PREFLIGHT: async=true reason="..."` comment directly
above the migration block.
3. Include a `PREFLIGHT-MIGRATION-SCALE: <30s N=<scale>` line in the PR
body.

**TDD pair:**
- Red commit `2c6744cc` — `TestRunAsyncMigration_PendingThenDone`
against a stub helper. Build passes, assertion fails (`async migration
fn did not start within 2s`).
- Green commit `38354f32` — real helper + retroactive fix + docs. Test
green.

**Fixtures (`cmd/ingestor/testdata/preflight-migrations/`):**
- `bad_sync_migration.go` — known-bad sample with no annotation.
- `good_annotated_migration.go` — known-good sample with annotation.
The preflight gate script can be unit-tested against these.

## Gate location (NOT in this PR)

The actual `check-async-migrations.sh` lives in the OpenClaw skills
directory at `~/.openclaw/skills/pr-preflight/scripts/` (separate from
the repo) and is wired into `run-all.sh`. It greps the diff for
new/modified migration blocks and hard-fails (exit 1) on any sync schema
mutation lacking one of the three opt-outs above. The fixtures in this
PR give maintainers a reproducible target.

## Why annotation-discipline, not size detection

You cannot determine table size from a diff. The gate enforces that
every author who adds a schema migration must consciously decide which
bucket it falls into and write that down. That is the cheapest possible
intervention that breaks the cycle.

## Testing

- `go test ./...` in `cmd/ingestor` — all tests pass including the new
`TestRunAsyncMigration_PendingThenDone`.
- Manual: red commit fails on assertion (not build), green commit passes
— verifiable by `git checkout 2c6744cc --
cmd/ingestor/async_migration.go && go test -run TestRunAsync
./cmd/ingestor` from the green commit.

## Preflight overrides

None — clean run after the convention is applied.

---------

Co-authored-by: clawbot <bot@openclaw.local>
Co-authored-by: clawbot <bot@openclaw>

2026-06-03 21:03:59 +00:00

6.2 KiB

Raw Blame History

MIGRATIONS — async vs sync policy

CoreScope's ingestor applies schema/data migrations inline at boot in cmd/ingestor/db.go. Every migration that runs synchronously blocks the ingestor from accepting packets until it returns. On a dev DB that's milliseconds; at prod scale (1.9M+ observations, 80K+ adverts, 2600+ nodes on Cascadia) it can pin the boot for minutes and trigger restart loops — the "upgrade broke prod" failure class (#791, #1483, and others).

The rule

Any new CREATE INDEX, ALTER TABLE, or data-rewriting UPDATE/DELETE in a migration file MUST do ONE of the following:

Option 1 — Run via `Store.RunAsyncMigration` (preferred for backfills)

// Scheduled in OpenStore() AFTER the *Store is constructed.
if err := s.RunAsyncMigration(ctx, "my_migration_v1",
    func(ctx context.Context, db *sql.DB) error {
        _, err := db.ExecContext(ctx, `CREATE INDEX IF NOT EXISTS ...`)
        return err
    }); err != nil {
    log.Printf("[migration/async] scheduling failed: %v", err)
}

The migration is recorded as pending_async in the _async_migrations table immediately — the ingestor boots and starts ingesting.
fn runs in a goroutine; the WaitGroup is shared with the rest of the ingestor (Store.WaitForAsyncMigrations() waits for everything).
On success the row flips to done; on error/panic to failed with the error message captured.
Idempotent: rows in done state short-circuit; failed/pending_async rows are retried on the next boot.

Reference implementations: Store.BackfillPathJSONAsync (path_json backfill) and the converted obs_observer_ts_idx_v1 index build in OpenStore.

Option 2 — Annotate as preflight-cheap

Some migrations are genuinely cheap at any scale (e.g. ALTER TABLE ADD COLUMN, CREATE INDEX on a table you know is bounded to a few thousand rows). Annotate the migration block with a comment on the line immediately above the migration block so the preflight gate recognises the opt-out:

// PREFLIGHT: async=true reason="ALTER ADD COLUMN — O(1) sqlite operation"
if r := db.QueryRow("SELECT 1 FROM _migrations WHERE name = 'foo_v1'"); ...

The reason MUST be a real one-line justification you can defend in review. "It's fine" is not a reason.

Option 3 — Opt out per PR

If the migration is genuinely safe and you don't want to add an inline annotation, put a single line in the PR body:

PREFLIGHT-MIGRATION-SCALE: <30s N=80K verified on Cascadia staging snapshot

This must include both <30s and N=<some scale> so a reviewer can challenge the measurement.

The gate

~/.openclaw/skills/pr-preflight/scripts/check-async-migrations.sh runs on every PR via the preflight orchestrator. It greps the diff for new or modified migration blocks (files matching cmd/ingestor/db.go, cmd/ingestor/maintenance.go, internal/dbschema/**, **/migrations/**, **/*.sql, plus any Go file touching CREATE INDEX / ALTER TABLE / CREATE UNIQUE INDEX). For each hit it requires one of the three opt-outs above. Hard-fail (exit 1) — no warning-only mode.

Concurrency model

CoreScope runs one ingestor process per deployment (cmd/ingestor/, single binary, single *Store). There is no cluster mode, no leader election, no second writer. SQLite is opened with SetMaxOpenConns(1) and a 5s busy_timeout; all writes (live MQTT ingest + async migration goroutines + maintenance backfills) serialize through the one connection in a single process.

What this means for async migrations:

No cross-process race to worry about. Two ingestor instances running against the same DB is not a supported deployment shape.
Within a single process, concurrent RunAsyncMigration(name=X) callers race the initial SELECT status → UPDATE/INSERT step. The current implementation re-schedules fn on a pending/failed row so a duplicate caller may legitimately re-run it; once status is done all further calls short-circuit. See TestRunAsyncMigration_ConcurrentSameNameSerialized for the contract.
fn runs concurrently with live ingest writers. Because MaxOpenConns=1, a long CREATE INDEX will serialize behind / ahead of insert batches via SQLite's busy-timeout. This is acceptable for index builds (the boot path is unblocked, which was the whole point), but it means long migrations DO add latency to live writes. Document expected runtime in the reason= annotation and prefer batched/chunked fn implementations for multi-minute work (see BackfillPathJSONAsync for the canonical batched pattern with inter-batch time.Sleep).

Scale budgets

Per-migration target: <30s at current prod scale (Cascadia: ~2,600 nodes, ~80K observations; previous prod snapshot: ~1.9M observations).

Worked example (#1483, obs_observer_ts_idx_v1): composite index build on observations(observer_idx, timestamp). At ~1.9M rows the sync build pinned ingestor boot for several minutes → restart loop. Converted to async via RunAsyncMigration in OpenStore so boot returns immediately and the index materializes in the background; the existing _migrations short-circuit at the top of the migration block ensures DBs that already completed the sync v3.8.3 build do NOT re-run it through the goroutine path on subsequent boots.

If you cannot meet the <30s budget, document the expected upper bound and operator runbook expectation (e.g. "index build expected ~10 min on a 5M-row table; ingestor remains responsive; monitor via SELECT status, error FROM _async_migrations WHERE name = ...").

Why this exists

Pattern that keeps repeating:

Author writes CREATE INDEX foo ON observations(...) in a migration.
Local dev DB has ~100 rows. Migration returns in 1ms. CI is green.
Reviewer focuses on plan correctness, not scale.
Ship.
Prod boots, sqlite scans 1.9M rows, the ingestor sits at [migration] Adding index... for 8 minutes, healthcheck times out, container restarts, loops.
Operator pages. Hotfix. Apology.

The gate doesn't try to detect table size (undecidable from a diff). It enforces annotation discipline: every author who adds a migration must consciously decide which bucket it falls into and write that down. That is the cheapest possible intervention that breaks the cycle.

6.2 KiB Raw Blame History