meshcore-analyzer

mirror of https://github.com/Kpa-clawbot/meshcore-analyzer.git synced 2026-06-06 11:21:37 +00:00

Author	SHA1	Message	Date
efiten	52f131e2dc	fix(ingestor): add hourly WAL checkpoint to prevent unbounded WAL growth (#1435 ) Fixes #1434. ## Problem The ingestor's `Checkpoint()` (`PRAGMA wal_checkpoint(TRUNCATE)`) was only called on shutdown. SQLite's built-in auto-checkpoint runs in PASSIVE mode which cannot truncate the WAL while the server holds an active read connection. Result: the WAL grows at ~40–50 MB/hour and is never reset during a running instance. Observed on analyzer.on8ar.eu: 183.4 MB WAL after ~4h uptime. ## Changes `cmd/ingestor/main.go` - Add a periodic goroutine that calls `Checkpoint()` every hour, staggered 30s after startup - Hoist `walCheckpointTicker` to function scope so it is stopped cleanly at shutdown alongside all other tickers `cmd/ingestor/db.go` - Switch `Checkpoint()` from `Exec` to `QueryRow(...).Scan` to capture SQLite's 3-column result (`busy`, `log`, `checkpointed`) - Return the checkpointed frame count (callers that discard it are unaffected) - Log only when `walFrames > 0` — silent when WAL is already empty, avoiding log spam - Log `blocked=true/false` instead of raw `busy` integer to make it clear when the server's read lock is preventing full truncation ## Behaviour after fix Each hourly tick flushes all WAL frames not held by an active server reader. Worst-case WAL size is now bounded to roughly one hour of write traffic (~45 MB) instead of unbounded growth. If the server holds a read lock at checkpoint time, the log shows `blocked=true` and remaining frames are retried on the next tick. ## Test plan - [x] `go build ./...` (ingestor module) - [x] `go test ./...` passes - [x] Code review addressed (ticker stop on shutdown, log message clarity) 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 15:01:54 -07:00
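The logging rule described in this commit (stay silent when the WAL is already empty, and report `blocked` instead of the raw `busy` integer) could be sketched roughly as follows. The `walResult` type and `describeCheckpoint` helper are hypothetical names for illustration, not the ingestor's actual code; the three fields mirror the `busy`, `log`, `checkpointed` columns that `PRAGMA wal_checkpoint` returns.

```go
package main

import "fmt"

// walResult mirrors the three columns returned by PRAGMA wal_checkpoint.
// Hypothetical type, used only for this sketch.
type walResult struct {
	Busy         int // non-zero when the checkpoint could not run to completion
	Log          int // frames in the WAL
	Checkpointed int // frames successfully moved into the database file
}

// describeCheckpoint returns a log line and true only when the WAL held
// frames; an empty WAL produces no log output at all.
func describeCheckpoint(r walResult) (string, bool) {
	if r.Log <= 0 {
		return "", false
	}
	msg := fmt.Sprintf("wal checkpoint: frames=%d checkpointed=%d blocked=%t",
		r.Log, r.Checkpointed, r.Busy != 0)
	return msg, true
}
```

A caller would run this after scanning the pragma result and pass the message to its logger only when the second return value is true.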
Eric Muehlstein	29432d4fe0	feat(ingestor): document and test ws:// / wss:// WebSocket MQTT broker support (#902 ) ## Summary CoreScope's ingestor already supports WebSocket MQTT connections today — `paho.mqtt.golang` v1.5.0 handles `ws://` and `wss://` natively via gorilla/websocket. However this support was undocumented, untested, and had a TLS gap for `wss://` connections. This PR closes those gaps without any breaking changes. ## Changes ### `cmd/ingestor/config.go` - Added godoc comment to `ResolvedSources()` explaining all four supported schemes and which ones require translation vs. pass-through - `ws://` and `wss://` explicitly documented as native paho schemes requiring no mapping ### `cmd/ingestor/main.go` - Extended TLS config to cover `wss://` in addition to `ssl://` - Before: `wss://` connections would use paho's default TLS (no explicit `tls.Config` set), which works for valid certs but doesn't apply the same predictable setup as `ssl://` - After: both `ssl://` and `wss://` get `tls.Config{}` (system CA pool), matching behavior; `rejectUnauthorized: false` still works for self-signed certs on both schemes ### `cmd/ingestor/config_test.go` Two new tests: - `TestResolvedSourcesSchemeMapping`: validates all six scheme variations (`mqtt://`, `mqtts://`, `tcp://`, `ssl://`, `ws://`, `wss://`) including paths like `wss://host/mqtt` - `TestLoadConfigWSSource`: full round-trip of a dual-source config (TCP + wss:// with username/password), verifies scheme unchanged through `LoadConfig` and `ResolvedSources` ### `config.example.json` - Added `wsmqtt` example entry showing `wss://` with username/password - Updated `_comment_mqttSources` to enumerate all supported schemes: `mqtt://`, `mqtts://`, `ws://`, `wss://` ## Motivation We run [meshcore-mqtt-broker](https://github.com/andrewjfreyer/meshcore-mqtt-broker) (a WebSocket MQTT bridge with JWT auth) alongside Mosquitto, and subscribe to both via `mqttSources`. 
The dual-source config works in production but nothing in the docs or example config made this discoverable for other operators. ## Testing ``` cd cmd/ingestor && go test ./... ok github.com/corescope/ingestor 1.568s ``` All existing tests pass. Two new tests added. ## No breaking changes - Existing configs: no change in behavior - `ws://` / `wss://` configs that were already working: same behavior + explicit TLS setup for `wss://`	2026-05-28 14:58:52 -07:00
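A minimal sketch of the translate-versus-pass-through idea from this PR, under stated assumptions: the commit message does not spell out the exact mapping, so the `mqtt://` to `tcp://` and `mqtts://` to `ssl://` translations below are assumed, and `resolveScheme` and `wantsExplicitTLS` are hypothetical helper names rather than the ingestor's functions.

```go
package main

import "strings"

// resolveScheme rewrites the schemes assumed to need translation and
// passes every other URL (tcp://, ssl://, ws://, wss://) through unchanged.
func resolveScheme(broker string) string {
	switch {
	case strings.HasPrefix(broker, "mqtt://"):
		return "tcp://" + strings.TrimPrefix(broker, "mqtt://")
	case strings.HasPrefix(broker, "mqtts://"):
		return "ssl://" + strings.TrimPrefix(broker, "mqtts://")
	}
	return broker
}

// wantsExplicitTLS reports whether a resolved URL should receive an
// explicit TLS configuration, which per the PR covers ssl:// and wss://.
func wantsExplicitTLS(resolved string) bool {
	return strings.HasPrefix(resolved, "ssl://") ||
		strings.HasPrefix(resolved, "wss://")
}
```

Paths survive untouched in this sketch, so `wss://host/mqtt` resolves to itself and is flagged for TLS setup.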
Kpa-clawbot	2627bd053b	fix(#1465 ): observer.last_seen always uses ingest time, not envelope (#1466 ) ## Summary `observer.last_seen` (and `last_packet_at`) answer "when did the analyzer last hear from this observer" — fundamentally an ingest-time question. Previously both the status-message handler and the packet-message handler passed the MQTT envelope timestamp into `UpsertObserverAt` / `stmtUpdateObserverLastSeen`, which let buggy observer clocks drag `last_seen` hours into the past even when the timestamp parsed cleanly as RFC3339 (so #1464's naive-clamp didn't catch it). California observers on `analyzer.00id.net` consistently appeared 3-7h stale for this reason. ## Fix - `cmd/ingestor/main.go` status handler: pass `""` to `UpsertObserverAt` so it falls back to `time.Now()`. - `cmd/ingestor/main.go` packet-path observer upsert: same. - `cmd/ingestor/db.go` `InsertTransmission`'s `stmtUpdateObserverLastSeen.Exec` call: use `ingestNow` for both `last_seen` and `last_packet_at` (was `rxTime`). Per-packet rxTime semantics (`transmissions.first_seen`, `observations.timestamp`) are unchanged — those continue to use envelope time with the naive-clamp / 14h-future / 30d-past guards from #1463 / #1464. Per-hop SNR-vs-time analysis still works. ## TDD - Red: `test(#1465): observer.last_seen uses ingest time even with well-formed envelope (red)` - 3 new tests in `observer_lastseen_1465_test.go`: status-past, status-future, packet-path-past. - Status-past and packet-path-past assertions failed on master (envelope time stored verbatim). - Green: `fix(#1465): observer.last_seen always uses ingest time, not envelope` - All 3 new tests pass. - Pre-existing `TestInsertTransmissionUpdatesObserverLastSeen` and `TestLastPacketAtUpdatedOnPacketOnly` were encoding the buggy behavior; updated to assert ingest-time semantics. - Full `go test ./cmd/ingestor/...` green. 
## Refs - Refs #1463 (root-cause investigation) - Refs #1464 (naive-clamp fix that handled malformed timestamps) - Closes #1465 --------- Co-authored-by: openclaw-bot <bot@openclaw.local>	2026-05-28 12:16:29 -07:00
Kpa-clawbot	7106e1921e	fix(#1463 ): clamp naive envelope timestamps symmetrically (#1464 ) Red commit: `fc6ed65f` (CI fails on `TestResolveRxTimeNaiveTimestampClamp`) Green commit: `80bf1285` ## Problem California observers (UTC−7) had `last_seen` perpetually pinned ~7h behind wall-clock and rendered "Stale" in the UI despite active MQTT status traffic. Root cause: `parseEnvelopeTime` parses zone-less ISO timestamps (python `datetime.now().isoformat()`) as UTC, leaving a residual offset equal to the observer's UTC offset. The existing soft-clamp at `resolveRxTime` only caught the future-skew (UTC+N) mirror case. ## Fix — Option B (symmetric clamp) - `parseEnvelopeTime` now returns a `(time.Time, naive bool, error)` tuple so callers can tell zone-aware from zone-less parses. - `resolveRxTime` applies a 15-minute symmetric tolerance window for `naive==true` values: anything further off than 15 min collapses to ingest time and emits a warning log. - Well-behaved observers (Z-suffixed or explicit `±HH:MM` offset) are completely untouched regardless of skew — legitimate buffered uploads remain accurate to the second. Chose option B over option A (reject naive outright) because some observers may be sending naive UTC strings — those would suddenly lose their own time. Symmetric clamp preserves the well-synced naive case (< 15 min off) and rescues every other zone. ## Tests - New `TestResolveRxTimeNaiveTimestampClamp` covers naive past, naive future, naive w/ microseconds, Z-suffixed past (verbatim), offset-suffixed (canonicalized to UTC), naive within tolerance (verbatim). - `TestParseEnvelopeTime` updated for new signature, asserts `naive` flag. - All existing rxtime tests preserved (factory date, 30-day floor, 14h future, plausible past). - Red commit ran first, failed on assertions, then green commit makes everything pass. ## Operator visibility `naive timestamp "..." 
off by 7h, using ingest time` now appears in the ingestor log so operators can identify upstream observer scripts that should switch to `datetime.now(timezone.utc).isoformat()`. Fixes #1463 --------- Co-authored-by: openclaw-bot <bot@openclaw.local>	2026-05-28 09:00:12 -07:00
Kpa-clawbot	f15d2efe81	fix(#1386 ): #1324 follow-up — test coverage + RWMutex + lock-hold-time + dead code + cadence (#1390 ) # #1324 follow-up — test coverage + RWMutex + lock-hold-time + dead code + cadence Addresses the post-merge audit findings in #1386 on PR #1324 (multi-byte capability persistence). Two independent audits (Kent Beck test-quality + Carmack perf) surfaced one top-level test-coverage gap and three perf concerns. This PR closes all of them; cadence cleanup is included. Red commit: `<RED_SHA>` (CI: `<RED_URL>`) ## What 1. Tests (`cmd/ingestor/multibyte_persist_test.go`): - `TestRunMultibyteCapPersist_RoundTrip` — end-to-end persist → close store → reopen → assert DB state survived. - `TestRunMultibyteCapPersist_MalformedSnapshot` — corrupt snapshot must log + no-op, not crash. - `TestRunMultibyteCapPersist_MissingSchemaColumns` — legacy DB without `multibyte_sup` cols must skip with explicit log, not panic / silently swallow. - `TestRunMultibyteCapPersist_PreservesConfirmedOnUnknown` — status=`unknown` MUST NOT clobber an existing `confirmed` row (mutation guard for the data-destruction check). 2. `cmd/server/store.go` - `cacheMu sync.Mutex` → `sync.RWMutex`. The per-node `GetMultibyteCapFor` read path in `/api/nodes` (`routes.go:1215`) uses `RLock` now; no longer serializes against itself or against analytics readers. - Build the multi-byte index map OUTSIDE `cacheMu`, then swap the pointer inside. Removes a 2400-iteration allocation hold from the analytics-cycle critical section. - Drop the dead `GetMultiByteCapMap` (zero callers confirmed by `rg`) and the stale `multibyteStatusToInt` tombstone comment. 3. `cmd/ingestor/multibyte_persist.go` - Replace the per-entry pair of `UPDATE nodes` + `UPDATE inactive_nodes` (50% guaranteed-miss) with a single dispatch-by-table-membership `UPDATE` per entry. ~50% fewer prepared-stmt round-trips. - Explicit `MalformedSnapshot` log line distinct from cold-start. 
- Defensive schema-presence check via `PRAGMA table_info` once at start; logs `[multibyte-persist] schema missing` and returns clean stats on legacy DBs. 4. `cmd/server/analytics_recomputer.go` / `config.example.json` — bump default snapshot cadence from 15s to 1m (the snapshot is a derived cache the ingestor only reads every 5 min; 4× less disk churn, no observable freshness loss). ## Why Direct quotes from the audit (#1386): > "No end-to-end persist→restart→load round-trip — the documented > value prop of the PR ('survives restart') has no single test > exercising the full path." (Kent Beck) > "`cacheMu` is `sync.Mutex` not `sync.RWMutex` + per-node read in > `handleNodes` — 2400 serialized lock acquisitions per `/api/nodes` > call, contended against every analytics-cache reader/writer. > The O(1) win is consumed by lock contention." (Carmack #1) > "Map construction held under shared `cacheMu` — every 15s > analytics cycle blocks every API cache read for the duration of a > 2400-entry map build. Build outside the lock, swap pointer > inside." (Carmack #2) > "`UPDATE nodes` + `UPDATE inactive_nodes` per entry … 4800 > prepared-stmt round-trips, 2400 guaranteed-empty." (Carmack #3) > "Server writes 20 snapshots for every one the ingestor reads. > Cadence mismatch — server could publish every 1 min and lose > nothing." (Carmack §2) ## TDD Red commit adds the four tests above. Two of the four (`MalformedSnapshot`, `MissingSchemaColumns`) fail on assertions against the pre-fix `multibyte_persist.go`; the other two (`RoundTrip`, `PreservesConfirmedOnUnknown`) are regression coverage of behaviour the original implementation already honoured but never exercised — they exist to guard future mutation (the audit's mutation-suggestion lens). Green commit lands the implementation. ## Bench `go test -bench BenchmarkGetMultibyteCapFor -benchmem -count=10` (local, idle laptop, n=2400-entry index, 8 reader goroutines vs. 
one analytics writer): \| variant \| ns/op \| allocs/op \| \|--------------------\|------:\|----------:\| \| `sync.Mutex` (pre) \| n/a — see note \| — \| \| `sync.RWMutex` \| n/a — see note \| — \| Note: did not produce a concurrent benchmark in this PR (would require non-trivial test scaffolding around the cache lifecycle). The win is structural — `RLock` allows the ~2400 per-`/api/nodes` reads to proceed in parallel rather than serializing on the same mutex held by every analytics writer. Documenting honestly per AGENTS.md "perf claims require proof": full microbench deferred to a follow-up. ## Manual verification (staging) - New tests: `go test ./... -count=1 -timeout 300s` in `cmd/ingestor` and `cmd/server` — green. - All multibyte-area tests (`#1366`, `#1368`, `#1372` regression suites in `multibyte_capability_test.go`, `multibyte_enrich_test.go`, `multibyte_region_filter_test.go`): green. - Preflight: `bash ~/.openclaw/skills/pr-preflight/scripts/run-all.sh origin/master` — exit 0. Fixes #1386 --------- Co-authored-by: claw <claw@openclaw.local>	2026-05-25 23:29:35 -07:00
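The "build outside the lock, swap pointer inside" change combined with `sync.RWMutex` can be sketched as follows. The `capIndex` type and its methods are hypothetical stand-ins for the server's cache, and the `int` value type is a simplification.

```go
package main

import "sync"

// capIndex is a hypothetical read-mostly cache keyed by node public key.
type capIndex struct {
	mu    sync.RWMutex
	index map[string]int
}

// Get takes only a read lock, so concurrent readers do not serialize
// against each other.
func (c *capIndex) Get(pk string) (int, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.index[pk]
	return v, ok
}

// Replace builds the new map before acquiring the write lock, then holds
// the lock only for the pointer swap.
func (c *capIndex) Replace(entries map[string]int) {
	next := make(map[string]int, len(entries))
	for k, v := range entries {
		next[k] = v
	}
	c.mu.Lock()
	c.index = next
	c.mu.Unlock()
}
```

The write lock is held for a single assignment, so the length of the critical section no longer depends on the number of entries.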
Joel Claw	95d7916530	fix(channels): normalize known channel display names (public → Public) (#777 ) Normalizes well-known channel display names (currently only `public` → `Public`) so existing deployments with pre-#761 lowercase config keys show the canonical firmware-default name `Public` in the UI. Behavior: - `knownChannelCasing` lookup (`decoder.go`) — single-entry map, easy to extend. - `normalizeChannelName()` applied at config load (`loadChannelKeys`) AND at decode time (defense in depth). - One-shot SQLite migration `channel_hash_casing_v1` backfills `channel_hash='public'` → `'Public'` on `payload_type=5` rows so channel-grouping queries don't split across the upgrade boundary. - Hardcoded list intentionally tiny (1 entry); custom/user channels left untouched. Safety: - Channel-hash derivation (`SHA256(channelName)[:16]` for `#`-prefixed `HashChannels`) is unchanged — normalization only renames map keys for explicit `ChannelKeys` entries (which don't feed `deriveHashtagChannelKey`). - PSK lookup is by hash byte, not by name — mesh interop preserved. - Migration is gated by `_migrations.name='channel_hash_casing_v1'`, idempotent. Tests (`cmd/ingestor/normalize_channel_test.go`): - `TestNormalizeChannelName` covers known + hashtag + custom + empty. - `TestLoadChannelKeys_NormalizesKnownDisplayNames` — verifies `public` → `Public` at load. - `TestLoadChannelKeys_LeavesCustomNamesUntouched` — custom names not auto-capitalized. - `TestLoadChannelKeys_DuplicateCasingLogsWarning` — config containing both casings resolves deterministically (canonical wins). Mutation test confirmed: reverting load-time normalize → `TestLoadChannelKeys_NormalizesKnownDisplayNames` and `_DuplicateCasingLogsWarning` both fail on assertions. Related: #761	2026-05-25 23:05:07 -07:00
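A minimal sketch of the single-entry lookup described above. The map contents follow the commit message (`public` → `Public`); the exact-match lookup is an assumption, since the message does not say whether other casings such as `PUBLIC` are also normalized, and this is not a copy of `decoder.go`.

```go
package main

// knownChannelCasing maps a known lowercase config key to its canonical
// display name. Only one entry is described in the commit message.
var knownChannelCasing = map[string]string{
	"public": "Public",
}

// normalizeChannelName returns the canonical name for a known channel and
// leaves hashtag, custom, and empty names untouched.
func normalizeChannelName(name string) string {
	if canonical, ok := knownChannelCasing[name]; ok {
		return canonical
	}
	return name
}
```

Because the function is idempotent, applying it at config load and again at decode time is harmless.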
efiten	0b35c7eef3	feat(server): persist multi-byte capability across restart + O(1) per-key lookup (#903 ) (#1324 ) ## Summary Follows the reconciliation recommendation in #916 — extracts only the NET-NEW persistence layer from that PR (which is now superseded by #1002 for the overlay UI) into a focused 6-file change against current master. What this adds: - `multibyte_sup_v1` migration: `multibyte_sup INTEGER NOT NULL DEFAULT 0` + `multibyte_evidence TEXT` on `nodes`/`inactive_nodes` so capability survives restart - `hasMultibyteSupCols` schema detection gates the persist/load paths - `loadMultibyteCapFromDB()`: pre-populates `mbCapSnapshot`/`mbCapIndex` at startup — cold starts serve last-known capability without waiting for the first ~15s analytics cycle - `maybePersistMultibyteCapability()` + `persistMultibyteCapability()`: after each analytics cycle; TryLock-gated (concurrent cycles coalesce); skips `sup==0` entries (data-destruction guard) - `GetMultibyteCapFor(pk)`: O(1) map lookup; both `handleNodes` and node-detail call sites updated from the O(N)-alloc `GetMultiByteCapMap()` What this explicitly does NOT change: - API field names (`multi_byte_status`, `multi_byte_evidence`, `multi_byte_max_hash_size`) - `EnrichNodeWithMultiByte` — unchanged - `GetMultiByteCapMap` — still present for any external callers - `public/map.js`, `public/live.css`, `Dockerfile`, `docs/` — zero frontend churn ## Test plan - [x] `TestMultibyteCapPersistRoundTrip` — confirmed values survive persist → fresh-store load - [x] `TestMultibyteCapPersistSkipsUnknown` — data-destruction guard: `sup==0` entry does not overwrite DB-confirmed value - [x] `TestMultibyteCapMaybePersistCoalesces` — TryLock coalesces 10 concurrent callers without deadlock - [x] `TestMultibyteCapGetMultibyteCapForO1` — O(1) index returns correct entry / false for unknown pubkey - [x] `TestMultibyteCapLoadFromDB` — only `sup>0` rows loaded; `sup==0` row excluded - [x] `TestSchemaMultibyteSupColumns` — migration adds 
columns to both tables; idempotent on second `OpenStore` - [x] All existing `TestMultiByteCapability_*` tests pass unchanged - [x] Full ingestor test suite: `ok` in 27s - [x] `go build ./cmd/server/ && go build ./cmd/ingestor/` clean 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: openclaw-bot <bot@openclaw>	2026-05-25 22:35:35 -07:00
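Two of the mechanisms in this PR, TryLock-gated coalescing and the `sup==0` data-destruction guard, can be sketched as below. `persister`, `maybePersist`, and `confirmedOnly` are hypothetical names chosen for the example, and `sync.Mutex.TryLock` requires Go 1.18 or later.

```go
package main

import "sync"

// persister coalesces concurrent persist attempts: a caller that cannot
// take the lock immediately gives up instead of queueing.
type persister struct {
	mu   sync.Mutex
	runs int
}

// maybePersist runs persist only when no other persist is in flight and
// reports whether it ran.
func (p *persister) maybePersist(persist func()) bool {
	if !p.mu.TryLock() {
		return false
	}
	defer p.mu.Unlock()
	persist()
	p.runs++
	return true
}

// confirmedOnly drops entries whose capability value is 0 (unknown) so
// they can never overwrite a previously confirmed value in the database.
func confirmedOnly(snapshot map[string]int) map[string]int {
	out := make(map[string]int, len(snapshot))
	for pk, sup := range snapshot {
		if sup > 0 {
			out[pk] = sup
		}
	}
	return out
}
```

A skipped caller loses nothing in this design, because the persist already in flight writes a snapshot at least as fresh as the one the skipped caller would have written moments earlier or will be superseded by the next cycle.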
Kpa-clawbot	dc6c79cff8	fix(mqtt): watchdog forces paho reconnect on stall — recovers from half-open TCP (closes #1335 ) (#1336 ) RED `f06887` — GREEN `8f53c1`. CI: (will populate on PR open) `Fixes #1335` ## Problem PR #1216 added per-source stall detection (`LivenessStalled`) but only logged. Staging's `lincomatic` source has been silently losing ~14k pkts/hr behind a half-open TCP socket the Azure NAT abandons: paho reports `IsConnected==true`, no messages arrive for 1h+, container restart is the only known recovery. Prod (MikroTik networking) doesn't see it. ## Fix Make the watchdog actually recover. - `SourceLivenessState.ForceReconnectFn` — per-source closure wired in `main.go` next to `IsConnectedFn`, wraps `client.Disconnect(250) + client.Connect()`. - `processLivenessTransition` — on the `LivenessStalled` edge AND on every heartbeat re-emit while still Stalled, invoke `maybeForceReconnect`. `LivenessNeverReceived` (cold-start ACL deny / wrong hash) is deliberately not force-reconnected — a new TCP socket won't fix an ACL deny and would just churn the broker. - `maybeForceReconnect` — throttled at `forceReconnectThrottle = 60s` per source so a stall→reconnect→re-stall loop self-recovers without hammering the broker. The Disconnect+Connect runs in a goroutine so a single slow source can't stall the watchdog tick. - `buildMQTTOpts` — explicit `SetKeepAlive(30 * time.Second)`. paho's default happens to be 30s, but the #1335 RCA called this out — making it explicit so it can't drift and so operators reading the code know it's intentional. - Telemetry — `WATCHDOG forcing reconnect` (intent), `WATCHDOG reconnect attempt issued` (post-goroutine), `WATCHDOG suppressing forced reconnect` (throttle window). ## TDD - RED `f06887` — `mqtt_watchdog_force_reconnect_test.go`. Stub field + constant added so the file compiles; assertions fail because `processLivenessTransition` never invokes `ForceReconnectFn`. 
Reverting just the `s.ForceReconnectFn()` call line from GREEN re-fails the same assertion (mutation verified). - GREEN `8f53c1` — wiring + throttle + keepalive. ## Scope discipline Additive only. No regression to currently-flowing sources: `LivenessOK`, `LivenessRecovered`, `LivenessDisconnected`, `LivenessHeartbeat`, and `LivenessNeverReceived` transitions are unchanged. Throttle bound = ≤1 reconnect/min/source = ≤60/hr worst-case across all sources, well within any broker rate limit. Preflight: clean (all gates pass). --------- Co-authored-by: openclaw-bot <bot@openclaw.local>	2026-05-25 22:31:56 -07:00
Kpa-clawbot	0f7cce3a5f	fix(#1370 ): revert ingestor envelope-timestamp path — server ingest time for packet/observation storage (counters #1233 ) (#1372 ) ## Summary Reverts the part of PR #1233 (commit `498fbc03`) that routed the MQTT envelope's `timestamp` field into `PacketData.Timestamp` for `transmissions.first_seen` and `observations.timestamp`. Packet ordering is restored to server ingest time — the client clock is untrusted. `UpsertObserverAt` + `MAX(MIN(existing, ingestNow), rxTime)` for observer/node `last_seen` (PR #1233's other half) is preserved unchanged. `parseEnvelopeTime` / `resolveRxTime` helpers are preserved — they still feed the observer.last_seen path. ## Diagnosis — Voodoo3 tx 304114 on staging Staging `tx_id = 304114` in channel `#test` has 5 observations: \| # \| observer \| reported timestamp \| comment \| \|---\|-----------\|--------------------\|---------\| \| 1 \| Voodoo3 \| 18:42 \| broken client RTC — ingested first, locks `first_seen` \| \| 2 \| Voodoo3 \| 18:42 \| broken client RTC \| \| 3 \| Voodoo3 \| 18:42 \| broken client RTC \| \| 4 \| Voodoo3 \| 18:42 \| broken client RTC \| \| 5 \| other obs \| 01:42 \| genuine receive time \| 4 of 5 observations carry stale 18:42 timestamps from Voodoo3's own broken clock. Because Voodoo3 ingested first, PR #1233's code wrote `transmissions.first_seen = 18:42` (envelope value). Downstream aggregators that compute `MAX(first_seen)` per channel saw 18:42 as the latest activity, and `/api/channels` for `#test` displayed `lastActivity` ~7h+ in the past plus a stale heartbeat in the row preview — hiding the genuinely-newest message (Voodoo3's `tst hmdpt` at 01:42). ## Why PR #1233's premise fails PR #1233 assumed: > Uploaders stamp `timestamp` when the radio receives the frame and > freeze it; the MQTT message is published late, but the timestamp > field is not re-stamped at publish. A buffered packet uploaded > hours late still carries its true receive time. 
That holds ONLY when the uploader's wall clock is correct. Observers in the field (Voodoo3 here, surely others) have broken local clocks. Their envelope timestamps are not a true receive time — they're a broken-clock receive time, which is just garbage with extra steps. The server clock is the only one we control, so packet ordering must use it. ## Fix ### `cmd/ingestor/db.go` - `BuildPacketData`: `PacketData.Timestamp = time.Now().UTC().Format(time.RFC3339)`, NOT `msg.Timestamp`. Docstring updated to cite #1370 and explain why `msg.Timestamp` is no longer read here. ### `cmd/ingestor/main.go` - Channel-companion path: `Timestamp: ingestNow` (was `rxTime`). - DM-companion path: `Timestamp: ingestNow` (was `rxTime`). - Local `rxTime := resolveRxTime(msg, tag)` removed from both paths (no remaining consumers in those scopes). ### Preserved (NOT touched) - `resolveRxTime`, `parseEnvelopeTime` — still used by `handleMessage` to populate `mqttMsg.Timestamp` and to call `UpsertObserverAt`, which feeds `observer.last_seen` and `observer.last_packet_at`. - All three `MAX(MIN(existing, ingestNow), rxTime)` guards (#1233 observer.last_seen, observer.last_packet_at, node.last_seen). - `MQTTPacketMessage.Timestamp` struct field. ## Tests \| File \| Asserts \| \|------\|---------\| \| `cmd/ingestor/ingest_time_regression_1370_test.go` (3 cases) \| Raw-packet, channel-companion, and DM-companion `handleMessage` paths. Feed envelope `timestamp = T_now - 7h`; assert stored `transmissions.first_seen` (RFC3339) and `observations.timestamp` (epoch) are server wall clock (±5s). Each case fails on master under PR #1233's premise. \| ### Adjusted test - `cmd/ingestor/db_test.go::TestBuildPacketData` — PR #1233 had asserted `pkt.Timestamp == "2026-05-16T10:00:00Z"` (the envelope value propagating). Now asserts the opposite: `pkt.Timestamp` is non-empty AND is NOT the envelope value. Comment cites #1370 and why the expectation flipped. 
### Verified still-green - `cmd/ingestor/rxtime_test.go` (`TestParseEnvelopeTime`, `TestResolveRxTime`) — helpers untouched, still cover envelope parsing for the observer.last_seen path. - `cmd/server/channels_message_order_1366_test.go` (#1366). - `cmd/server/db_channel_messages_perf_test.go` (#1368 perf budget). ## Commits - `a9b7efc3` — RED: 3 `handleMessage` assertion-fail tests + test name collision check. - `5a0891f0` — GREEN: revert envelope→PacketData.Timestamp plumbing in `cmd/ingestor/{db,main}.go` + flip `TestBuildPacketData`. Fixes #1370 --------- Co-authored-by: corescope-bot <bot@corescope.dev>	2026-05-25 19:56:49 -07:00
Kpa-clawbot	eeddf46bc9	fix(ingestor): neighbor-builder delta scan + watermark — recovers 97% packet loss from #1289 (fixes #1339 ) (#1341 ) ## Summary PR #1289 moved neighbor-graph construction into the ingestor with a 60s ticker. `buildAndPersistNeighborEdges` then issued an unbounded `SELECT … FROM observations o JOIN transmissions t …` every tick. On staging (3.7M observations) one tick took ~2 minutes; with `max_open_conns=1`, the SQLite single-writer was held continuously and MQTT ingest collapsed (~6,500 tx/day → ~180 tx/day, 97% loss). ## Fix Watermark-bounded delta scan. Each call derives the watermark from `MAX(neighbor_edges.last_seen)` and restricts the SELECT to `WHERE o.timestamp > ? ORDER BY o.timestamp LIMIT 50000`. `neighbor_edges` itself is the persistence — no new metadata table, no in-memory state, restarts resume cleanly from whatever the table reflects. - Empty edges table → watermark 0 → full warm-up scan (preserves #1289's synchronous warm-up intent). - Warm-up loops the builder until a call returns fewer than the batch cap, so the first server snapshot load sees a fully-populated table even on fresh DBs. - 50k batch cap stops any single tick from monopolising the writer; a backlog drains over successive ticks. - Per-tick wallclock is logged (`tick: N edges in DUR`); a tick >5s is logged loudly as a possible regression of #1339. Broader instrumentation is tracked in #1340. - Output schema unchanged — server's `neighbor_recomputer.go` is unaffected. ## Trade-off An anomalously-old observation that arrives after its timestamp has been crossed by the watermark will be skipped. Acceptable for an approximate neighbor graph; a periodic full-rebuild can land later if needed. ## TDD - RED (`d88e2522`): `TestNeighborEdgesBuilderDeltaScan` seeds 100k observations, asserts an empty-delta tick is a no-op (<1s), and a 100-row delta is upserted in <500ms with no rescan of baseline rows. 
Baseline builder fails the empty-delta assertion (sees all 200k baseline edges). - GREEN (`cf6fbb4e`): watermark + LIMIT — all assertions pass. - Mutation: revert the `WHERE o.timestamp > ?` clause → the test hangs to lock-contention timeout, confirming the WHERE actually gates the behavior. ## Benchmark (synthetic, 100k observations, local sqlite) \| \| Scan duration \| \|---\|---\| \| Baseline builder, full scan every tick \| ~40s \| \| Patched builder, empty-delta tick \| <50ms \| \| Patched builder, 100-row delta \| <50ms \| Staging projection: 2–3 min ticks → <1s ticks; SQLite writer freed for MQTT ingest. Fixes #1339 --------- Co-authored-by: openclaw-bot <bot@openclaw.local>	2026-05-23 20:54:16 -07:00
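The watermark-bounded delta scan and the warm-up loop can be modelled in memory as a rough sketch. The slice stands in for the `observations` table, `deltaBatch` mimics `WHERE o.timestamp > ? ORDER BY o.timestamp LIMIT n`, and all names here are hypothetical. One simplification to note: the sketch advances the watermark from the last row of each batch, whereas the PR derives it from `MAX(neighbor_edges.last_seen)`.

```go
package main

import "sort"

// observation is a hypothetical stand-in for a row of the observations table.
type observation struct {
	Timestamp int64
	Edge      string
}

// deltaBatch returns at most limit rows newer than watermark, oldest first.
func deltaBatch(all []observation, watermark int64, limit int) []observation {
	var out []observation
	for _, o := range all {
		if o.Timestamp > watermark {
			out = append(out, o)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Timestamp < out[j].Timestamp })
	if len(out) > limit {
		out = out[:limit]
	}
	return out
}

// drain repeats the delta scan until a call returns fewer rows than the
// batch cap, which is the warm-up behaviour described in the PR.
func drain(all []observation, limit int) (watermark int64, calls int) {
	for {
		batch := deltaBatch(all, watermark, limit)
		calls++
		if len(batch) > 0 {
			watermark = batch[len(batch)-1].Timestamp
		}
		if len(batch) < limit {
			return watermark, calls
		}
	}
}
```

In this simplified model, rows that share a timestamp with the last row of a full batch would be skipped by the strict `>` comparison; the sketch assumes distinct timestamps.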
Marcel Verdult	498fbc0321	fix: ingestor uses ingest-time now() instead of observer receive time (#1233 ) ## Problem The ingestor stamps every stored packet with its own ingest-time `time.Now()` (`BuildPacketData` in `db.go`; channel/DM paths in `main.go`), discarding the observer receive time the uploader already puts in the MQTT envelope's `timestamp` field. `MQTTPacketMessage` had no `Timestamp` field and `handleMessage` parsed every envelope field except that one. Observers that buffer packets offline and upload hours later get every buffered packet displayed at upload time, not receive time — a 5-hour deferred upload shows packets 5 hours late. Retained messages and broker backlog hit the same skew. ## Why the envelope timestamp is trustworthy Uploaders stamp `timestamp` when the radio receives the frame and freeze it; the MQTT message is published late, but the `timestamp` field is not re-stamped at publish. A buffered packet uploaded hours late still carries its true receive time. ## Fix New `resolveRxTime` helper reads `msg["timestamp"]` and falls back to `time.Now()` only when it is missing, unparseable, or implausibly in the future. Applied to all three ingest paths (raw packet, channel, DM). No wire-format change — the field already exists. Channel/DM dedup hashes intentionally stay on ingest time, since those bridge messages carry no real packet hash and need ingest-unique input. ## Observer/node last_seen correction Packet timestamps must reflect receive time, but observer/node `last_seen` must not. `InsertTransmission` fed `data.Timestamp` (now rxTime) into `observers.last_seen` and `UpsertNode`'s `last_seen`, so a buffered upload could drag both fields backwards, and retained-message replay on MQTT reconnect could flash long-offline observers as Online. - `UpsertObserverAt` takes an explicit `lastSeen`; the status-packet and BLE companion handlers pass the resolved rxTime. `UpsertObserver` keeps its wall-clock behaviour for other callers. 
- All three `last_seen` writes are guarded with `MAX(MIN(existing, ingestNow), rxTime)`: `last_seen` never moves backwards from a stale retained message, and never locks in a future value. ## Naive UTC+N timestamps `resolveRxTime` rejects a timestamp only when it is >14h ahead (UTC+14 is the maximum standard offset — anything further is a genuine clock error). A timestamp that is merely in the future is soft-clamped to ingest time: a future rxTime means a live packet from a UTC+N observer whose naive local clock parses as-if UTC, not a buffered packet, so ingest time is correct and no future timestamp reaches the DB. For buffered packets from naive-clock uploaders a bounded residual offset remains (equal to the observer's UTC offset); uploaders emitting zone-aware ISO8601 everywhere would be the full cure but is a separate format change. ## Test `cmd/ingestor/rxtime_test.go` covers `parseEnvelopeTime` (zone-aware, naive, microseconds, garbage, empty) and `resolveRxTime` (plausible past used verbatim, missing/garbage/future → ingest-time fallback). The existing `TestBuildPacketData` is updated to supply an envelope timestamp and assert it propagates, since `BuildPacketData` no longer self-stamps.	2026-05-23 11:22:51 -07:00
Kpa-clawbot	d9ba9937a6	fix(dbschema): canonical source for optional column migrations — fixes startup race (closes #1321 ) (#1322 ) Red commit `2a8102b9` (failing test) → green commit `bb957c9f`. CI: https://github.com/Kpa-clawbot/CoreScope/actions/workflows/ci.yml?query=branch%3Afix%2Fissue-1321 Fixes #1321. ## Why On staging `/api/scope-stats` 500'd with `scope_name column not present` despite the ingestor adding the column ~0.5s after server startup. `cmd/server/db.go detectSchema()` runs in `OpenDB` and caches `hasScopeName`/`hasDefaultScope`/`hasObsRawHex` booleans. With supervisord launching server + ingestor simultaneously, the server's PRAGMA can fire BEFORE the ingestor's `ALTER TABLE` completes — and the boolean stays false until the server restarts. Same race class as #1283; #1289 moved server-side ensures to `dbschema` but the optional columns the ingestor still owned were left out. ## Fix — option (c) from the issue Made `internal/dbschema/dbschema.go` the single source of truth for the optional columns the server detects. Migrations moved from `cmd/ingestor/db.go applySchema` into `dbschema.Apply`: - `transmissions.scope_name` + `idx_tx_scope_name` partial index - `nodes.default_scope` - `inactive_nodes.default_scope` - `observations.raw_hex` `AssertReady` now asserts every one of those columns. The server cannot start with stale-false booleans because `AssertReady` will fatal first if the columns are missing. The ingestor's old gated blocks are replaced with pointer comments so anyone hunting for them lands in `dbschema.go`. The `_migrations` marker rows are preserved (`INSERT OR IGNORE`) to keep legacy DBs idempotent. Documented invariant in the package doc: any new optional column the server PRAGMA-detects belongs in `internal/dbschema/dbschema.go`, NOT in `cmd/ingestor/db.go applySchema`. 
## Tests Added `internal/dbschema/dbschema_test.go` (RED in `2a8102b9`): - `TestApplyAddsOptionalColumns_CanonicalSource` — post-`Apply`, all four columns must exist. - `TestAssertReady_RequiresOptionalColumns` — `AssertReady` must refuse a DB missing them AND pass after full `Apply`. `cmd/ingestor` and `cmd/server` full suites green. --------- Co-authored-by: openclaw-bot <bot@openclaw.local>	2026-05-23 08:33:21 -07:00
efiten	2329639f45	feat: scoped/unscoped transport-route statistics (#899 ) (#915 ) ## What this PR does Implements region-scoped transport-route packet tracking with two sub-features: ### Feature 1 — Scope statistics (`scope_name`) - At ingest, transport-route packets (route_type 0/3) with Code1 != `0000` are HMAC-matched against configured `hashRegions` keys (mirroring the `hashChannels` pattern). Matched region name (or `""` for unknown) stored in new `transmissions.scope_name` column via migration `scope_name_v1`. - New `GET /api/scope-stats?window=` endpoint (1h/24h/7d, 30s server-side TTL) returning transport totals, scoped/unscoped counts, per-region breakdown, and time-series. - New Scopes tab in Analytics with summary cards, per-region table, and two-line SVG chart. Auto-refreshes every 60s. ### Feature 2 — Node default scope (`default_scope`) - Per-node `default_scope` column on `nodes`/`inactive_nodes` (migration `nodes_default_scope_v1`) tracks the most recently matched region for each node, derived from transport-scoped ADVERT packets. - `GET /api/nodes` response includes `default_scope` field when column is present. - Node detail panel displays the default scope badge. - Async startup backfill (`BackfillDefaultScopeAsync`) populates the column for nodes with pre-existing ADVERT data. ### Config Add `hashRegions` to `config.json` (see `config.example.json`). One entry per region name (with or without leading `#`). --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Kpa-clawbot <kpaclawbot@outlook.com> Co-authored-by: openclaw-bot <bot@openclaw.local>	2026-05-21 14:00:06 -07:00
efiten	51f823bf7e	feat: one-click prune nodes outside geofilter (#669 M4) (#738 ) ## Summary - Adds `POST /api/admin/prune-geo-filter` endpoint — dry-run by default, `?confirm=true` to permanently delete nodes outside the current geofilter polygon + buffer. Requires `X-API-Key` header. - Adds Prune nodes section inside the GeoFilter customizer tab (write-access only, same `writeEnabled` gate as PUT). Preview lists affected nodes; Confirm delete removes them. - Adds `GetNodesForGeoPrune` and `DeleteNodesByPubkeys` DB helpers. - Updates `docs/user-guide/geofilter.md` — documents the UI button as primary workflow, CLI script as alternative. > Depends on M3 (`feat/geofilter-m3-customizer`, PR #736). Merge M3 first. ## Test plan - [x] `cd cmd/server && go test ./...` — all pass - [x] Customizer GeoFilter tab without `apiKey` — Prune section not visible - [x] With `apiKey` + polygon active — Prune section visible - [x] Preview returns list of nodes outside polygon (no deletions) - [x] Confirm delete removes nodes, list clears - [x] `POST /api/admin/prune-geo-filter` without `X-API-Key` → 401 - [x] `POST /api/admin/prune-geo-filter` with no polygon configured → 400 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 03:19:31 +00:00
Kpa-clawbot	9383201c07	refactor(db): finish #1283 — Option 4: ingestor owns neighbor-graph + schema migrations; server is read-only (fixes #1287 ) (#1289 ) Red commit: https://github.com/Kpa-clawbot/CoreScope/commit/eae179b99b5fd34924547632aa8f8025c405aa53 (CI: pending — opens with this PR) Finishes #1283. RED test `TestServerSourceHasNoCachedRWCalls` goes from failing (13 writer call-sites) to GREEN (zero). Per #1287 Option 4 (https://github.com/Kpa-clawbot/CoreScope/issues/1287#issuecomment-4485099992): ingestor owns the neighbor graph build + persist; server reads the snapshot. Category A — Schema migrations → new `internal/dbschema` package. `dbschema.Apply(rw)` runs in `cmd/ingestor` startup (in `OpenStore`). `dbschema.AssertReady(ro)` runs in `cmd/server/main.go` and FATAL-LOG-EXITS if any expected column/index/table is missing — the operator must restart the ingestor first. Covers indexes, `neighbor_edges`, `observations.resolved_path`, `observers.{inactive,last_packet_at,iata}`, `(inactive_)nodes.foreign_advert`, `transmissions.from_pubkey`. Category B — Backfill → ingestor. `BackfillFromPubkey` and observer-blacklist soft-delete moved to `cmd/ingestor/maintenance.go`. Server keeps an inert `fromPubkeyBackfillSnapshot` stub for `/api/healthz` API compatibility. Category C — Neighbor-graph persistence (Option 4) → ingestor writes, server reads. - Ingestor (`cmd/ingestor/neighbor_builder.go`): every 60s scans `observations + transmissions`, extracts edges (originator↔first-hop for ADVERTs; observer↔last-hop for all), resolves hop prefixes via a node-table prefix index, upserts into `neighbor_edges`. - Server (`cmd/server/neighbor_recomputer.go`): every 60s re-reads `neighbor_edges` and atomic-swaps the resulting `NeighborGraph` into `s.graph`. Initial load is synchronous on startup. All server-side incremental edge writers (the two `asyncPersistResolvedPathsAndEdges` paths in `cmd/server/store.go`) are gone. 
- Neighbor-edge daily prune (`PruneNeighborEdges`) moved to ingestor. Why Option 4: clean read/write separation, no startup CPU spike (server loads existing snapshot instead of rebuilding from history), no IPC/delta-protocol churn. Staleness budget ~60s — same model as the analytics recomputers in #1240 / #1248 / #672 axis 2. Recomputer interval default for neighbor graph: 60s (`NeighborGraphRecomputerDefaultInterval`, `NeighborEdgesBuilderInterval`). Invariants added: - `TestServerSourceHasNoCachedRWCalls` (RED commit `eae179b9`): grep enforces zero `cachedRW(`, `mode=rw`, or `sql.Open(_journal_mode=WAL…)` in non-test `cmd/server/` sources. - `TestServerStartupRequiresMigratedSchema`: server refuses to start against an unmigrated DB. - `TestNeighborGraphRecomputerLoadsSnapshot`: post-write snapshot is picked up on the next refresh. - `TestNeighborEdgesBuilderUpsertsFromObservations`: end-to-end pipeline writes the expected edge. `grep cachedRW cmd/server/*.go \| grep -v _test.go` → 0 matches. Fixes #1287. --------- Co-authored-by: MeshCore Bot <bot@meshcore.local> Co-authored-by: Kpa-clawbot <Kpa-clawbot@users.noreply.github.com> Co-authored-by: corescope-bot <bot@corescope.local>	2026-05-19 23:53:41 -07:00
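The server-side "re-read and atomic-swap" step described in the entry above might look roughly like this. It is a sketch under assumptions: `graphHolder`, `refresh`, `current`, `errNilSnapshot`, and the shape of `NeighborGraph` are invented for illustration, and the real `cmd/server/neighbor_recomputer.go` may be organised differently. The 60s ticker that would call `refresh` is left out.

```go
package main

import (
	"errors"
	"sync/atomic"
)

// NeighborGraph is a placeholder shape; the real type is not shown in the commit message.
type NeighborGraph struct {
	Edges map[string][]string
}

// errNilSnapshot is a hypothetical sentinel for a loader that produced nothing.
var errNilSnapshot = errors.New("neighbor graph: loader returned no snapshot")

// graphHolder lets readers see either the old or the new snapshot, never a partial one.
type graphHolder struct {
	graph atomic.Pointer[NeighborGraph]
}

// refresh builds a complete snapshot off to the side, then publishes it in one step.
// On failure the previously published snapshot stays in place.
func (h *graphHolder) refresh(load func() (*NeighborGraph, error)) error {
	g, err := load()
	if err != nil {
		return err
	}
	if g == nil {
		return errNilSnapshot
	}
	h.graph.Store(g)
	return nil
}

// current returns the most recently published snapshot (nil before the first load).
func (h *graphHolder) current() *NeighborGraph {
	return h.graph.Load()
}
```

The point of the pattern is that request handlers only ever call something like `current()` and never hold a lock while the next snapshot is being read from `neighbor_edges`.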
Kpa-clawbot	749fdc114f	feat(decoder+ui): close remaining P2 items from #1279 — payloadTypeNames, legend, TransportCodes, Feat1/2, RAW_CUSTOM, sensor docs (#1291 ) RED commit: `dc4c0800` — CI: https://github.com/Kpa-clawbot/CoreScope/actions?query=branch%3Afix%2Fissue-1279-p2 Closes the remaining six 🟢 P2 items in umbrella #1279 (PR #1280 shipped P0+P1, PR #1276 shipped ACK/RESPONSE/PATH legend rows). ### Item-by-item \| # \| Item \| Where \| Test \| \|---\|---\|---\|---\| \| 1 \| `payloadTypeNames` parity \| `cmd/server/store.go` \| `cmd/server/issue1279_p2_test.go::TestPayloadTypeNamesAll13` \| \| 2 \| Legend rows: Anon Req / Grp Data / Multipart / Control / Raw Custom \| `public/live.js` \| `test-issue-1279-legend-p2-e2e.js` (Playwright) \| \| 3 \| TransportCodes detail-row + `code1=` / `code2=` filter grammar \| `public/packets.js`, `public/packet-filter.js` \| `test-issue-1279-p2-code-filter.js` (6 cases) \| \| 4 \| Multibyte capability badge on node detail/list rows \| `public/nodes.js::renderNodeBadges` \| `n.hash_size >= 2` (observable Feat1/Feat2 proxy; firmware `AdvertDataHelpers.h:14-16`) \| \| 5 \| RAW_CUSTOM (0x0F) `{rawLength, firstByteTag}` decode + detail-row \| `cmd/server/decoder.go`, `cmd/ingestor/decoder.go`, `public/packets.js` \| `TestDecodeRawCustomExposesLengthAndTag` × 2 + updated `TestDecodePayloadRAWCustom` \| \| 6 \| Sensor advert telemetry firmware-derivation comments \| `cmd/ingestor/decoder.go:363-380` \| pure comments — exempt per AGENTS \| ### Firmware refs cited inline - `firmware/src/Packet.h:19-32` — PAYLOAD_TYPE_* constants - `firmware/src/Packet.h:46` — TransportCodes wire layout - `firmware/src/Mesh.cpp:577` — `createRawData` - `firmware/src/helpers/SensorMesh.{h,cpp}` — sensor advert telemetry derivation - `firmware/src/helpers/AdvertDataHelpers.h:14-16` — Feat1/Feat2 ### TDD Red `dc4c0800` proves the assertions gate behavior: - `payloadTypeNames` had only 12 entries (no 0x0F). 
- RAW_CUSTOM decoded as `UNKNOWN` with no envelope fields. Green `<HEAD>` makes both green; per-item tests included. ### Cross-stack note Cross-stack: justified — items 1/5 add decoder output fields; items 2/3/4/5 surface those fields in the UI in the same PR per #1279 acceptance. ### Out of scope Item 4 surfaces the observable multibyte capability via the persisted `hash_size` (Feat1/Feat2 wire bits are only on transient adverts and not stored per-node today); persisting raw Feat1/Feat2 per-node is left for a follow-up. Fixes #1279 --------- Co-authored-by: bot <bot@corescope>	2026-05-19 08:08:28 -07:00
Kpa-clawbot	1da2034341	refactor(db): move all writes from server to ingestor; server truly read-only (fixes #1283 ) (#1286 ) Red commit: `f6290b63` — CI run will appear at https://github.com/Kpa-clawbot/CoreScope/actions Fixes #1283. ## What Moves all four DB write operations out of `cmd/server/` into `cmd/ingestor/`, making the server truly read-only and eliminating the SQLITE_BUSY VACUUM bug at its root: the server can no longer race the ingestor for the write lock because the server has no write path. ## The four operations \| # \| Was in \| Now in \| \|---\|--------\|--------\| \| 1 \| `cmd/server/vacuum.go` (`checkAutoVacuum`, full VACUUM + `auto_vacuum=INCREMENTAL` migration) \| `cmd/ingestor/db.go` `Store.CheckAutoVacuum` (already existed; ingestor runs it at startup before the MQTT subscriber starts → no contention) \| \| 2 \| `cmd/server/db.go` `PruneOldPackets` (`DELETE FROM transmissions`) \| `cmd/ingestor/maintenance.go` `Store.PruneOldPackets` (new) + 24h ticker in `cmd/ingestor/main.go` \| \| 3 \| `cmd/server/db.go` `PruneOldMetrics` (`DELETE FROM observer_metrics`) \| `cmd/ingestor/db.go` `Store.PruneOldMetrics` (already existed) \| \| 4 \| `cmd/server/db.go` `RemoveStaleObservers` (`UPDATE observers SET inactive=1`) \| `cmd/ingestor/db.go` `Store.RemoveStaleObservers` (already existed) \| ## HTTP surface - Removed: `POST /api/admin/prune` (`handleAdminPrune`, route, openapi entry). Operators trigger an ad-hoc prune by restarting the ingestor. - Kept: `GET /api/backup` — uses `VACUUM INTO` which writes to a separate file, not the live DB; read-only-safe. ## Tests - `cmd/server/readonly_invariant_test.go` (RED gate) — reflect-asserts `PruneOldPackets`/`PruneOldMetrics`/`RemoveStaleObservers` are NOT methods on the server's `DB`. Fails on master, passes after this PR. - `cmd/ingestor/issue1283_test.go` — exercises `Store.PruneOldPackets` and the auto_vacuum=NONE → INCREMENTAL migration through `Store.CheckAutoVacuum` with `vacuumOnStartup=true`. 
## Why the bug is gone The SQLITE_BUSY VACUUM failure happened because supervisord launched both ingestor + server in one container; the ingestor took the write lock for INSERTs and the server's `checkAutoVacuum` then failed to acquire it within `busy_timeout=5000`. After this PR, only the ingestor ever opens a writable connection, and it runs `CheckAutoVacuum` before spawning the MQTT subscriber → no contention possible. ## Scope notes - `cachedRW()` still has three pre-existing callers in `cmd/server/` (`neighbor_persist.go`, `ensure_indexes.go`, `from_pubkey_migration.go`). These pre-date #1283 and are not in the issue's four-operation list. Leaving them for follow-up keeps this PR honest about scope; AGENTS.md documents the invariant so new write paths can't sneak in. - PII preflight reports false positives on the Go method name `requireAPIKey` in `routes.go` diff context — no real PII. - Server-side neighbor-edge prune (`PruneNeighborEdges`) intentionally left in place — out of scope of #1283. --------- Co-authored-by: MeshCore Bot <bot@meshcore.local>	2026-05-18 23:52:27 -07:00
Kpa-clawbot	e6c30e1a7e	feat(decoder): GRP_DATA + MULTIPART + advertRole fix + CONTROL flags (#1279 P0+P1) (#1280 ) Addresses the four P0+P1 firmware reconciliation gaps from the umbrella audit (issue #1279). RED commit: `0a4c084e` (asserts on stub returns; all 13 assertions fail). GREEN commit: `13867681`. ## What's in this PR ### P0 — silently dropped data - #1 GRP_DATA (0x06) decoder. Outer envelope is the same shape as GRP_TXT (`channel_hash(1)+MAC(2)+ciphertext`) per `firmware/src/helpers/BaseChatMesh.cpp:476,500`. Factored `decryptChannelBlock(...)` helper used by both 5 and 6. When a channel key matches, the inner is parsed per `firmware/src/helpers/BaseChatMesh.cpp:382-385` as `data_type(uint16 LE) + data_len(1) + blob(data_len)`. Surfaces `{channelHash, MAC, dataType, dataLen, decryptedBlob}` on decrypt or `{channelHash, MAC, encryptedData}` otherwise. Server-side decoder surfaces envelope only (no key store). - #2 MULTIPART (0x0A) decoder. Per `firmware/src/Mesh.cpp:289`, byte0 = `(remaining<<4) \| inner_type`. When `inner_type == PAYLOAD_TYPE_ACK (0x03)`, next 4 bytes are the LE ack_crc per `firmware/src/Mesh.cpp:292-307`. Surfaces `{remaining, innerType, innerTypeName, innerAckCrc \| innerPayload}`. ### P1 — mis-classified / opaque - #3 `advertRole()` raw-type fix. Per `firmware/src/helpers/AdvertDataHelpers.h:7-12`, ADV_TYPE_NONE = 0 and 5-15 are FUTURE. The previous boolean fallback collapsed both into `"companion"`, silently relabelling unknown/reserved types. New behaviour: type 0 → `none`, 1 → `companion`, 2-4 → `repeater`/`room`/`sensor`, 5-15 → `type-N`. `ValidateAdvert` accepts the new labels. - #4 CONTROL (0x0B) byte0 flags + length. Per `firmware/src/Mesh.cpp:69` + `createControlData` at `Mesh.cpp:609`, byte0 high-bit marks the zero-hop direct subset. Surfaces `{ctrlFlags, ctrlZeroHop, ctrlLength}`. 
### Drift fix - `cmd/server/store.go` `payloadTypeNames` now includes `6: GRP_DATA` and `10: MULTIPART` (previously omitted; canonical decoder map already had them). ## Lockstep & TDD Both `cmd/ingestor/decoder.go` and `cmd/server/decoder.go` updated in the same commits — same wire-vector tests live in both packages (`cmd/{ingestor,server}/issue1279_test.go`). Per-item RED→GREEN visible in `git log`. \| Item \| Tests \| RED proof \| \|---\|---\|---\| \| #1 GRP_DATA \| ingestor: NoKey + DecryptedInner; server: Envelope \| 6 assertions failed pre-impl \| \| #2 MULTIPART \| ingestor + server: Ack + NonAck \| 8 assertions failed pre-impl \| \| #3 advertRole \| ingestor + server: 7-row table \| 3 assertions failed pre-impl \| \| #4 CONTROL \| ingestor + server: ZeroHop + MultiHop \| 6 assertions failed pre-impl \| ## What's NOT in this PR The umbrella issue lists P2 items that ship in follow-up PRs: - Live + compare legend entries for the long tail of newly-named types (#1274 + others). - TransportCodes UI surface + filter grammar. - feat1/feat2 capability badges. - `payloadTypeNames` consolidation across server/ingestor (drift-prevention). Leave the umbrella open after this merges. Refs #1279 --------- Co-authored-by: OpenClaw Bot <bot@openclaw.local>	2026-05-18 23:19:27 -07:00
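The MULTIPART byte layout and the `advertRole()` mapping from the entry above can be illustrated with a short Go sketch. The function and struct names here (`decodeMultipartSketch`, `multipartSketch`, `advertRoleSketch`) are hypothetical and the error handling is an assumption. Only the bit layout `(remaining<<4) | inner_type`, the 4-byte little-endian `ack_crc` for ACK, and the role labels are taken from the commit message.

```go
package main

import (
	"encoding/binary"
	"errors"
	"strconv"
)

// payloadTypeACK is PAYLOAD_TYPE_ACK (0x03) as quoted in the commit message.
const payloadTypeACK = 0x03

// multipartSketch holds the fields the commit message says are surfaced.
type multipartSketch struct {
	Remaining    uint8
	InnerType    uint8
	HasAckCRC    bool
	InnerAckCRC  uint32
	InnerPayload []byte
}

// decodeMultipartSketch splits byte0 into remaining (high nibble) and
// inner type (low nibble), then reads a little-endian ack_crc for ACKs.
func decodeMultipartSketch(payload []byte) (multipartSketch, error) {
	if len(payload) < 1 {
		return multipartSketch{}, errors.New("multipart: empty payload")
	}
	m := multipartSketch{Remaining: payload[0] >> 4, InnerType: payload[0] & 0x0F}
	rest := payload[1:]
	if m.InnerType == payloadTypeACK {
		if len(rest) < 4 {
			return multipartSketch{}, errors.New("multipart: truncated ack_crc")
		}
		m.InnerAckCRC = binary.LittleEndian.Uint32(rest[:4])
		m.HasAckCRC = true
		return m, nil
	}
	m.InnerPayload = rest
	return m, nil
}

// advertRoleSketch maps the raw advert type to a label without collapsing
// unknown values into "companion". Values above 15 are not expected on the
// wire; the default branch simply labels whatever it is given.
func advertRoleSketch(advType uint8) string {
	switch advType {
	case 0:
		return "none"
	case 1:
		return "companion"
	case 2:
		return "repeater"
	case 3:
		return "room"
	case 4:
		return "sensor"
	default:
		return "type-" + strconv.Itoa(int(advType))
	}
}
```

For example, a payload starting with `0x23` has two fragments remaining and an inner type of ACK, so the next four bytes are read as the CRC.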
Kpa-clawbot	170f0ac66d	fix(#1212 ): MQTT per-attempt logging + stall watchdog — prevent silent reconnect-loop death (#1216 ) RED commit: `1cd25f7b` — CI (failing on assertion): https://github.com/Kpa-clawbot/CoreScope/actions?query=sha%3A1cd25f7b1bdd0091f689dd64ce1bfec6d031191f Fixes #1212 ## Root cause NOT that `AutoReconnect` was off — it was set; `MaxReconnectInterval=30s` was set (PR #949); a `SetReconnectingHandler` was wired. The defect was an observability gap: `SetReconnectingHandler` fires only INSIDE paho's reconnect goroutine. If that goroutine never iterates (status race after the recovered handler panic at 21:07:13, or an internal abort), operators see ONLY the `disconnected: pingresp not received` line and then total silence. They cannot distinguish "paho is patiently retrying" from "paho gave up and the goroutine is gone." That ambiguity is what turned a 30s blip into 6h of downtime. ## Changes ### `cmd/ingestor/main.go` — `SetConnectionAttemptHandler` Fires on every TCP/TLS dial — the initial `Connect()` AND every reconnect — independent of paho's internal reconnect-loop state. Logs: ``` MQTT [staging] connection attempt #1 to tcp://broker:1883 MQTT [staging] connection attempt #2 to tcp://broker:1883 ``` Per-source attempt counter via `atomic.AddInt64`. ### `cmd/ingestor/mqtt_watchdog.go` (new) — per-source stall watchdog Satisfies the watchdog acceptance criterion. Even when paho reports `connected`, if no MQTT messages have flowed for >5m, log a WARN line every 60s: ``` MQTT [staging] WATCHDOG: client reports connected to tcp://broker:1883 but no messages received for 7m30s (threshold 5m) — possible half-open socket or upstream stall ``` Catches half-open TCP and broker-accepted-but-not-forwarding scenarios that look "connected" to paho. Hot-path cost: one `atomic.StoreInt64` per inbound message. Watchdog scans the registry once a minute. 
### Tests (`cmd/ingestor/mqtt_reconnect_test.go`, new) - `TestBuildMQTTOpts_InstrumentsConnectionAttempt` — asserts `OnConnectAttempt` is wired in `buildMQTTOpts`. - `TestMQTTStallWatchdog_FiresOnSilentSource` — connected + 10m silent + 5m threshold → stall flagged. - `TestMQTTStallWatchdog_QuietWhenRecent` — recent message → no stall. - `TestMQTTStallWatchdog_QuietWhenDisconnected` — disconnected → no stall (paho's reconnect logging covers it). ## TDD - RED `1cd25f7b` — 2 assertion failures (compile OK, stub returns no-stall, `OnConnectAttempt` nil). - GREEN `2527be6f` — implementation; all ingestor tests pass. ## Out of scope - Slice-bounds decode panic (#1211, separate PR). - A full in-process MQTT broker integration test would require a new dep (mochi-mqtt) — the observability and watchdog behaviors are independently verifiable by the unit tests above, and the reconnect path itself is paho's responsibility (we already test it's configured via `mqtt_opts_test.go`). --------- Co-authored-by: bot <bot@example.com> Co-authored-by: OpenClaw Bot <bot@openclaw.local> Co-authored-by: corescope-bot <bot@corescope.local> Co-authored-by: openclaw-bot <openclaw-bot@users.noreply.github.com>	2026-05-15 22:46:29 -07:00
Kpa-clawbot	85e97d2f37	fix(#1211 ): bounds-check path length to prevent slice [218:15] panic in MQTT decode (#1214 ) RED commit: `65d9f57b` (CI run will appear at https://github.com/Kpa-clawbot/CoreScope/actions after PR opens) Fixes #1211 ## Root cause `decodePath()` returns `bytesConsumed = hash_size * hash_count` where both come straight from the wire-supplied `pathByte` (upper 2 bits → `hash_size`, lower 6 bits → `hash_count`). Max claimable: 4 × 63 = 252 bytes. A malformed packet on the wire claimed `pathByte=0xF6` (hash_size=4, hash_count=54 → 216 path bytes) inside a 15-byte buffer. The inner hop-extraction loop in `decodePath` did break early on overflow — but `bytesConsumed` was still returned at face value (216). `DecodePacket` then did `offset += 216` (offset=218) and `payloadBuf := buf[offset:]` panicked with the prod-observed signature: ``` runtime error: slice bounds out of range [218:15] ``` The handler-level `defer/recover` at `cmd/ingestor/main.go:258-263` caught it, but the message was silently dropped with no usable diagnostic. ## Fix Add a `if offset > len(buf)` guard at BOTH decoder sites (same pattern, same panic potential): - `cmd/ingestor/decoder.go` — DecodePacket after decodePath - `cmd/server/decoder.go` — DecodePacket after decodePath Return a descriptive error citing the claimed length and pathByte hex so operators can reproduce. Also: `cmd/ingestor/main.go` decode-error log now includes `topic`, `observer`, and `rawHexLen` so future malformed packets are reproducible without needing to attach a debugger. ## Tests (TDD red → green) Both packages got two new tests: - `TestDecodePacketBoundsFromWire_Issue1211` — feeds the exact wire shape from the prod log (`pathByte=0xF6` inside a 15-byte buf). Asserts `DecodePacket` does NOT panic and returns an error. - `TestDecodePacketFuzzTruncated_Issue1211` — sweeps every `(header, pathByte)` combination with tails 0..19 bytes (≈1.3M inputs). Asserts zero panics. 
### Red commit proof On commit `65d9f57b` (RED), both tests fail with the panic: ``` === RUN TestDecodePacketBoundsFromWire_Issue1211 decoder_test.go:1996: DecodePacket panicked on malformed input: runtime error: slice bounds out of range [218:15] --- FAIL: TestDecodePacketBoundsFromWire_Issue1211 (0.00s) === RUN TestDecodePacketFuzzTruncated_Issue1211 decoder_test.go:2010: DecodePacket panicked during fuzz: runtime error: slice bounds out of range [3:2] --- FAIL: TestDecodePacketFuzzTruncated_Issue1211 (0.01s) ``` On commit `7a6ae52c` (GREEN), full suites pass: - `cmd/ingestor`: `ok 53.988s` - `cmd/server`: `ok 29.456s` ## Acceptance criteria - [x] Identify the slice op producing `[218:15]` — `payloadBuf := buf[offset:]` in `DecodePacket` (decoder.go), where `offset` had been advanced by an unchecked `bytesConsumed` from `decodePath()`. - [x] Bounds check added at the identified site(s) — both ingestor and server decoders. - [x] Test with crafted payload (length-field > remaining buffer) — `TestDecodePacketBoundsFromWire_Issue1211`. - [x] Log topic, observer ID, payload byte length on drop — updated `MQTT [%s] decode error` log line. - [x] Existing tests stay green — confirmed both packages. ## Out of scope Reconnect-after-disconnect (#1212) — handled by a separate subagent. This PR touches NO reconnect logic. --------- Co-authored-by: corescope-bot <bot@corescope.local> Co-authored-by: openclaw-bot <bot@openclaw.local> Co-authored-by: corescope-bot <bot@corescope>	2026-05-15 22:34:21 -07:00
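The arithmetic behind the `[218:15]` panic and the added guard can be reproduced in a few lines of Go. This is a sketch, not the decoder itself: `pathLenSketch` and `payloadAfterPathSketch` are hypothetical names, and the mapping "upper two bits plus one equals `hash_size`" is inferred from the commit's own example (`0xF6` giving `hash_size=4`, `hash_count=54`) rather than read from the source.

```go
package main

import "fmt"

// pathLenSketch splits the wire-supplied path byte.
// Assumption inferred from the example in the commit message:
// hash_size = (upper 2 bits) + 1, hash_count = lower 6 bits.
func pathLenSketch(pathByte byte) (hashSize, hashCount int) {
	hashSize = int(pathByte>>6) + 1
	hashCount = int(pathByte & 0x3F)
	return hashSize, hashCount
}

// payloadAfterPathSketch advances past the path and returns the payload,
// refusing to slice when the claimed path length exceeds the buffer.
func payloadAfterPathSketch(buf []byte, pathByteOffset int) ([]byte, error) {
	if pathByteOffset < 0 || pathByteOffset >= len(buf) {
		return nil, fmt.Errorf("path byte offset %d outside %d-byte buffer", pathByteOffset, len(buf))
	}
	pathByte := buf[pathByteOffset]
	size, count := pathLenSketch(pathByte)
	claimed := size * count
	offset := pathByteOffset + 1 + claimed
	if offset > len(buf) {
		// Without this guard, buf[offset:] panics with "slice bounds out of range".
		return nil, fmt.Errorf("path claims %d bytes (pathByte=0x%02X) but only %d remain",
			claimed, pathByte, len(buf)-(pathByteOffset+1))
	}
	return buf[offset:], nil
}
```

Feeding a 15-byte buffer whose second byte is `0xF6` yields a claimed length of 216 and an offset of 218, which the guard turns into an error instead of a panic.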
Kpa-clawbot	f4cf2acbc0	perf: cancelled writes + ingestor I/O + threshold tests (#1120 follow-up) (#1167 ) Red commit: `e964ec9c46` (CI run: pending — workflow only triggers on PR open) Partial fix for #1120 — finishes the four follow-up items left open after PR #1123 (cancelled writes, ingestor I/O, threshold-flag tests, docs). ## What's done - `cancelledWriteBytesPerSec` — server `/proc/self/io` parser handles `cancelled_write_bytes`; `/api/perf/io` exposes the per-second rate; Perf page renders it next to Read/Write with ⚠️ when sustained >1 MB/s. - Ingestor `/proc/<pid>/io` — `cmd/ingestor/stats_file.go` samples its own `/proc/self/io` each tick and includes `procIO` in the snapshot. The server's `/api/perf/io` reads it and surfaces `.ingestor`. Frontend renders an `Ingestor process` Disk I/O block alongside the existing `server process` block (issue mockup: "Both ingestor and server"). - Threshold + anomaly tests — `test-perf-disk-io-1120.js` now asserts ⚠️ fires/suppresses on WAL>100MB, cache_hit<90%, and the backfill-rate-vs-tx-rate guard with the `tx_inserted >= 100` baseline floor. Drops the tautological `\|\| ... === false` short-circuits flagged in MINOR m4. - Docs (m8) — `config.example.json` adds `_comment_ingestorStats` (env var, default path, shared-tmp security note); `cmd/ingestor/README.md` adds `CORESCOPE_INGESTOR_STATS` to the env-var table plus a `Stats file` section. ## What's NOT done (deferred) m1 sync.Map → map+RWMutex, m2 perfIOMu rate caching, m3 negative cacheSize translation, m5 deterministic-write test, m7 ctx-aware shutdown — pure polish; will file a follow-up issue if the operator wants them tracked. ## TDD - Red: `e964ec9` — adds failing tests + stub field/handler shape (cancelled missing from struct, ingestor stub returns nil, ingestor procIO absent). - Green: `1240703` — wires up the parser case, ingestor sampler, frontend rendering, docs. 
E2E assertion added: test-perf-disk-io-1120.js:108 --------- Co-authored-by: clawbot <clawbot@users.noreply.github.com> Co-authored-by: Kpa-clawbot <bot@kpa-clawbot.local> Co-authored-by: Kpa-clawbot <bot@kpa-clawbot>	2026-05-08 16:29:23 -07:00
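For readers unfamiliar with `/proc/self/io`, the parsing and per-second rate described in the entry above might be sketched as follows. The struct, function names, and the decision to return zero on counter regressions are assumptions for illustration. The key names (`read_bytes`, `write_bytes`, `cancelled_write_bytes`) are the ones the Linux kernel exposes in that file.

```go
package main

import (
	"bufio"
	"strconv"
	"strings"
)

// procIOSketch holds the three byte counters discussed in the commit message.
type procIOSketch struct {
	ReadBytes           uint64
	WriteBytes          uint64
	CancelledWriteBytes uint64
}

// parseProcIOSketch reads "key: value" lines in the /proc/<pid>/io format
// and ignores keys it does not know or values it cannot parse.
func parseProcIOSketch(text string) procIOSketch {
	var out procIOSketch
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		key, val, ok := strings.Cut(sc.Text(), ":")
		if !ok {
			continue
		}
		n, err := strconv.ParseUint(strings.TrimSpace(val), 10, 64)
		if err != nil {
			continue
		}
		switch strings.TrimSpace(key) {
		case "read_bytes":
			out.ReadBytes = n
		case "write_bytes":
			out.WriteBytes = n
		case "cancelled_write_bytes":
			out.CancelledWriteBytes = n
		}
	}
	return out
}

// ratePerSec converts two cumulative samples into a per-second rate.
// It returns 0 if the interval is empty or the counter went backwards.
func ratePerSec(prev, cur uint64, elapsedSeconds float64) float64 {
	if elapsedSeconds <= 0 || cur < prev {
		return 0
	}
	return float64(cur-prev) / elapsedSeconds
}
```

Sampling the file on each tick and passing consecutive `CancelledWriteBytes` values to `ratePerSec` gives a figure comparable to the `cancelledWriteBytesPerSec` field named above.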
Kpa-clawbot	fb744d895f	fix(#1143 ): structural pubkey attribution via from_pubkey column (#1152 ) Fixes #1143. ## Summary Replaces the structurally unsound `decoded_json LIKE '%pubkey%'` (and `OR LIKE '%name%'`) attribution path with an exact-match lookup on a dedicated, indexed `transmissions.from_pubkey` column. This closes both holes documented in #1143: - Hole 1 — same-name false positives via `OR LIKE '%name%'` - Hole 2a — adversarial spoofing: a malicious node names itself with another node's pubkey and gets attributed to the victim - Hole 2b — accidental false positive when any free-text field (path elements, channel names, message bodies) contains a 64-char hex substring matching a real pubkey - Perf — query now uses an index instead of a full-table scan against `LIKE '%substring%'` ## TDD Two-commit history shows red-then-green: \| Commit \| Status \| Purpose \| \|---\|---\|---\| \| `7f0f08e` \| RED — tests assertion-fail on master behaviour \| Adversarial fixtures + spec \| \| `59327db` \| GREEN — schema + ingestor + server + migration \| Implementation \| The red commit's test schema includes the new column so the file compiles, but the production code still uses LIKE — the assertions fail because the malicious / same-name / free-text rows are returned. The green commit changes the query plus adds the migration/ingest path. ## Changes ### Schema - new column `transmissions.from_pubkey TEXT` - new index `idx_transmissions_from_pubkey` ### Ingestor (`cmd/ingestor/`) - `PacketData.FromPubkey` populated from decoded ADVERT `pubKey` at write time. Cheap — already parsing `decoded_json`. Non-ADVERTs stay NULL. - `stmtInsertTransmission` writes the column. - Migration `from_pubkey_v1` ALTERs legacy DBs to add the column + index. - Bonus: rewrote the recipe in the gated one-shot `advert_count_unique_v1` migration to use `from_pubkey` (already marked done on existing DBs; kept correct for fresh installs). 
### Server (`cmd/server/`) - `ensureFromPubkeyColumn` mirrors the ingestor migration so the server can boot against a DB the ingestor has never touched (e2e fixture, fresh installs). - `backfillFromPubkeyAsync` runs after HTTP starts. Scans `WHERE from_pubkey IS NULL AND payload_type = 4` in 5000-row chunks with a 100ms yield between chunks. Cannot block boot even on prod-sized DBs (100K+ transmissions). Queries handle NULL gracefully (return empty for that pubkey, same as today's unknown-pubkey path). - All in-scope LIKE call sites switched to exact match: \| Site \| Before \| After \| \|---\|---\|---\| \| `buildPacketWhere` (was db.go:582) \| `decoded_json LIKE '%pubkey%'` \| `from_pubkey = ?` \| \| `buildTransmissionWhere` (was db.go:626) \| `t.decoded_json LIKE '%pubkey%'` \| `t.from_pubkey = ?` \| \| `GetRecentTransmissionsForNode` (was db.go:910) \| `LIKE '%pubkey%' OR LIKE '%name%'` \| `t.from_pubkey = ?` \| \| `QueryMultiNodePackets` (was db.go:1785) \| `decoded_json LIKE '%pubkey%' OR ...` \| `t.from_pubkey IN (?, ?, ...)` \| \| `advert_count_unique_v1` (was ingestor/db.go:257) \| `decoded_json LIKE '%' \\|\\| nodes.public_key \\|\\| '%'` \| `t.from_pubkey = nodes.public_key` \| `GetRecentTransmissionsForNode` signature simplifies: the `name` parameter is gone (it was only ever used for the legacy `OR LIKE '%name%'` fallback). Sole caller in `routes.go:1243` updated. ### Tests - `cmd/server/from_pubkey_attribution_test.go` — adversarial fixtures + Hole 1/2a/2b/QueryMultiNodePackets exact-match assertions, EXPLAIN QUERY PLAN index check, migration backfill correctness. - `cmd/ingestor/from_pubkey_test.go` — write-time correctness (BuildPacketData populates FromPubkey for ADVERT only; InsertTransmission persists it; non-ADVERTs stay NULL). - Existing test schemas (server v2, server v3, coverage) get the new column plus a SQLite trigger that auto-populates `from_pubkey` from `decoded_json` on ADVERT inserts. 
This means existing fixtures (which only seed `decoded_json`) keep attributing correctly without per-test edits. - `seedTestData`'s ADVERTs explicitly set `from_pubkey`. ## Performance — index is used ``` $ EXPLAIN QUERY PLAN SELECT id FROM transmissions WHERE from_pubkey = ? SEARCH transmissions USING INDEX idx_transmissions_from_pubkey (from_pubkey=?) ``` Asserted in `TestFromPubkeyIndexUsed`. ## Migration approach - Sync at boot: `ALTER TABLE transmissions ADD COLUMN from_pubkey TEXT` is a metadata-only operation in SQLite — microseconds regardless of table size. `CREATE INDEX IF NOT EXISTS idx_transmissions_from_pubkey` is not metadata-only: it scans the table once. Empirically a few hundred ms on a 100K-row table; expect a few seconds on a 10M-row table (one-time cost, blocking boot during that window). Subsequent boots no-op via `IF NOT EXISTS`. If this boot delay becomes an operational concern at prod scale we can defer the `CREATE INDEX` to a goroutine — for now a few-second one-time delay is acceptable. - Async: row-level backfill of legacy NULL ADVERTs (chunked 5000 / 100ms yield). On a 100K-ADVERT prod DB, this completes in seconds in the background; HTTP is fully available throughout. - Safety: queries handle NULL gracefully — a node whose ADVERTs haven't backfilled yet returns empty, identical to today's behaviour for unknown pubkeys. No half-state regression. ## Out of scope (intentionally) The free-text `LIKE` paths the issue explicitly leaves alone (e.g. user-typed packet search) are untouched. Only the pubkey-attribution sites get the column treatment. 
## Cycle-3 review fixes \| Finding \| Status \| Commit \| \|---\|---\|---\| \| M1c — async-contract test was tautological (test's own `go`, not production's) \| Fixed \| `23ace71` (red) → `a05b50c` (green) \| \| m1c — package-global atomic resets unsafe under `t.Parallel()` \| Fixed (`// DO NOT t.Parallel` comment + `Reset()` helper) \| rolled into `23ace71` / `241ec69` \| \| m2c — `/api/healthz` read 3 atomics non-atomically (torn snapshot) \| Fixed (single RWMutex-guarded snapshot + race test) \| `241ec69` \| \| n3c.m1 — vestigial OR-scaffolding in `QueryMultiNodePackets` \| Fixed (cleanup) \| `5a53ceb` \| \| n3c.m2 — verify PR body language about `ALTER` vs `CREATE INDEX` \| Verified accurate (already corrected in cycle 2) \| (no change) \| \| n3c.m3 — `json.Unmarshal` per row in backfill → could use SQL `json_extract` \| Deferred as known followup — pure perf optimization (current per-row Unmarshal is correct, just slower); SQL rewrite would unwind the chunked-yield architecture and is non-trivial. Acceptable for one-time backfill at boot on legacy DBs. \| ### M1c implementation detail `startFromPubkeyBackfill(dbPath, chunkSize, yieldDuration)` is now the single production entry point used by `main.go`. It internally does `go backfillFromPubkeyAsync(...)`. The test calls `startFromPubkeyBackfill` (no `go` prefix) and asserts the dispatch returns within 50ms — so if anyone removes the `go` keyword inside the wrapper, the test fails. Manually verified: removing the `go` keyword causes `TestBackfillFromPubkey_DoesNotBlockBoot` to fail with "backfill dispatch took ~1s (>50ms): not async — would block boot." ### m2c implementation detail `fromPubkeyBackfillTotal/Processed/Done` are now plain `int64`/`bool` package globals guarded by a single `sync.RWMutex`. `fromPubkeyBackfillSnapshot()` returns all three under one RLock. 
`TestHealthzFromPubkeyBackfillConsistentSnapshot` races a writer (lock-step total/processed updates with periodic done flips) against 8 readers hammering `/api/healthz`, asserting `processed<=total` and `(done => processed==total)` on every response. Verified the test catches torn reads (manually injected a 3-RLock implementation; test failed within milliseconds with "processed>total" and "done=true but processed!=total" errors). --------- Co-authored-by: openclaw-bot <bot@openclaw.local> Co-authored-by: openclaw-bot <bot@openclaw.dev>	2026-05-06 23:50:44 -07:00
Kpa-clawbot	5a5df5d92b	revert: group commit M1 (#1117 ) — starves MQTT, refs #1129 (#1130 ) ## Why Diagnostic on #1129 shows PR #1117 (group commit M1 for #1115) is fundamentally broken: it starves the MQTT goroutine via `gcMu` lock contention, causing pingresp disconnects and lost packets at modest ingest rates. ## Three structural defects 1. Lock held across `sql.Stmt.Exec` — every concurrent `InsertTransmission` blocks for the full SQLite write latency, not just the brief queue mutation. 2. Lock held across `tx.Commit` — the WAL fsync runs under `gcMu`, so any backlog blocks all ingest writers AND the flusher ticker, snowballing under load. 3. Single-conn DB (`MaxOpenConns=1`) — the flusher and the ingest path serialise on one connection, turning the lock into a global ingest stall. Net effect: at modest packet rates the MQTT client loop misses its own pingresp deadline, the broker drops the connection, and packets received during the stall are lost. ## What this PR removes - `Store.SetGroupCommit`, `Store.FlushGroupTx`, `Store.flushLocked`, `Store.GroupCommitMs` - `gcMu`, `activeTx`, `pendingRows`, `groupCommitMs`, `groupCommitMaxRows` Store fields - `groupCommitMs` / `groupCommitMaxRows` config fields and `GroupCommitMsOrDefault` / `GroupCommitMaxRowsOrDefault` accessors - The flusher goroutine in `cmd/ingestor/main.go` - `cmd/ingestor/group_commit_test.go` - The `if s.activeTx != nil { … pendingRows … }` branch in `InsertTransmission` — reverts to plain prepared-stmt usage ## What this PR keeps (merged after #1117) - #1119 `BackfillPathJSON` `path_json='[]'` fix - #1120/#1123 perf metrics endpoints — `WALCommits` counter retained - `GroupCommitFlushes` JSON field on `/api/perf/write-sources` is kept as always-0 for API stability (server `perf_io.go` references it as a string field name; no client breakage) - `DBStats.GroupCommitFlushes` atomic field is removed from the Go struct ## Tests `cd cmd/ingestor && go test ./... -run "Test"` → `ok` (47.8s). 
`cd cmd/server && go build ./...` → clean. ## #1115 stays open The group-commit idea is sound — batching observation INSERTs would meaningfully reduce WAL fsync rate. But it needs a redesign that does not hold a mutex across blocking SQLite calls. Suggested directions for a future M1: - Channel-fed writer goroutine (single owner of the tx, ingest path is non-blocking enqueue) - Per-batch DB handle so the flusher doesn't serialise the ingest connection - Bounded queue with backpressure rather than a shared lock Refs #1117 #1129	2026-05-05 19:02:43 -07:00
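The "channel-fed writer goroutine" and "bounded queue with backpressure" directions suggested in the revert above (5a5df5d92b) might look roughly like this. It is a sketch under stated assumptions, not a proposed implementation: `row`, `batchWriter`, `enqueue`, and `stop` are hypothetical names, and the `flush` callback stands in for the `BEGIN`/`COMMIT` work a real writer would do.

```go
package main

import (
	"fmt"
	"time"
)

// row is a hypothetical stand-in for one pending INSERT.
type row struct{ hash string }

// batchWriter: producers only touch the channel; one goroutine owns the batch.
type batchWriter struct {
	queue chan row
	done  chan struct{}
}

// newBatchWriter starts the single writer goroutine. flush receives each
// batch and is the only place a real implementation would touch the DB,
// so no mutex is held across a blocking SQLite call.
func newBatchWriter(capacity, maxRows int, window time.Duration, flush func([]row)) *batchWriter {
	w := &batchWriter{queue: make(chan row, capacity), done: make(chan struct{})}
	go func() {
		defer close(w.done)
		ticker := time.NewTicker(window)
		defer ticker.Stop()
		batch := make([]row, 0, maxRows)
		emit := func() {
			if len(batch) > 0 {
				flush(batch)
				batch = make([]row, 0, maxRows)
			}
		}
		for {
			select {
			case r, ok := <-w.queue:
				if !ok { // queue closed: final flush, then exit
					emit()
					return
				}
				batch = append(batch, r)
				if len(batch) >= maxRows { // eager flush at the row cap
					emit()
				}
			case <-ticker.C: // time-window flush
				emit()
			}
		}
	}()
	return w
}

// enqueue never blocks: it reports false when the bounded queue is full,
// leaving the backpressure policy (drop, retry, count) to the caller.
func (w *batchWriter) enqueue(r row) bool {
	select {
	case w.queue <- r:
		return true
	default:
		return false
	}
}

// stop closes the queue and waits for the final flush.
func (w *batchWriter) stop() {
	close(w.queue)
	<-w.done
}

func main() {
	var batches [][]row
	w := newBatchWriter(16, 3, time.Hour, func(b []row) { batches = append(batches, b) })
	for i := 0; i < 7; i++ {
		w.enqueue(row{hash: fmt.Sprintf("h%d", i)})
	}
	w.stop()
	fmt.Println(len(batches)) // 3 batches: 3 + 3 + 1 rows
}
```

The MQTT callback would call only `enqueue`, so its latency is bounded by a channel send rather than by SQLite write or fsync time.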
Kpa-clawbot	74dffa2fb7	feat(perf): per-component disk I/O + write source metrics on Perf page (#1120 ) (#1123 ) ## Summary Implements per-component disk I/O + write source metrics on the Perf page so operators can self-diagnose write-volume anomalies (cf. the BackfillPathJSON loop debugged in #1119) without SSHing in to run iotop/fatrace. Partial fix for #1120 ## What's done (4/6 ACs) - ✅ `/api/perf/io`: server-process `/proc/self/io` delta rates (read/write bytes per sec, syscalls) - ✅ `/api/perf/sqlite`: WAL size, page count, page size, cache hit rate - ✅ `/api/perf/write-sources`: per-component counters from ingestor (tx/obs/upserts/backfill_) - ✅ Frontend Perf page: three new sections with anomaly thresholds + per-second rate columns ## What's NOT done (deferred to follow-up) - ❌ `cancelledWriteBytesPerSec` field: issue #1120 lists this under server-process I/O ("writes the kernel discarded; interesting signal"); not exposed in this PR - ❌ Ingestor `/proc/<pid>/io`: issue #1120 says "Both ingestor and server"; only server-process I/O lands here. Adding ingestor I/O requires either a unix socket back to the server, or surfacing the ingestor pid through the stats file. Doable without changing the existing API shape. - ❌ Adaptive baselining: anomaly thresholds remain static (10×, 100 MB, 90%); steady-state baselining can come once we have enough deployed Perf-page telemetry Per AGENTS.md rule 34, this PR uses "Partial fix for #1120" rather than "Fixes #1120" so the issue stays open until the remaining ACs land. ## Backend Server (`cmd/server/perf_io.go`) - `GET /api/perf/io`: reads `/proc/self/io` and returns delta-rate `{readBytesPerSec, writeBytesPerSec, syscallsRead, syscallsWrite}` since last call (in-memory tracker, no allocation per sample). - `GET /api/perf/sqlite`: returns `{walSize, walSizeMB, pageCount, pageSize, cacheSize, cacheHitRate}`. 
`cacheHitRate` is proxied from the in-process row cache (closest available signal under the modernc sqlite driver). - `GET /api/perf/write-sources` — reads the ingestor's stats JSON file and returns a flat `{sources: {...}, sampleAt}` payload. Ingestor (`cmd/ingestor/`) - `DBStats` gains `WALCommits atomic.Int64` (incremented on every successful `tx.Commit()` and on every auto-commit `InsertTransmission` write) and `BackfillUpdates sync.Map` keyed by backfill name with `IncBackfill(name)` / `SnapshotBackfills()` helpers. - `BackfillPathJSONAsync` now increments `BackfillUpdates["path_json"]` per row write — the BackfillPathJSON-style infinite loop becomes immediately visible at `backfill_path_json` in the Write Sources table. - New `StartStatsFileWriter` publishes a JSON snapshot to `/tmp/corescope-ingestor-stats.json` (override via `CORESCOPE_INGESTOR_STATS`) every second using atomic tmp+rename. The tmp file is opened with `O_CREATE\|O_WRONLY\|O_TRUNC\|O_NOFOLLOW` mode `0o600` so a pre-planted symlink in a world-writable `/tmp` cannot redirect the write to an arbitrary file. ## Frontend (`public/perf.js`) Three new sections on the Perf page, all auto-refreshed via the existing 5s interval: - Disk I/O (server process) — read/write rates (formatted B/KB/MB-per-sec) + syscall counts. Write rate >10 MB/s flags ⚠️. - Write Sources — sorted table of per-component counters with a per-second rate column derived from snapshot deltas. Backfill rows show ⚠️ only when `tx_inserted >= 100` (meaningful baseline) AND the backfill's per-second rate exceeds 10× the live tx rate. Avoids the startup-spurious-alarm where cumulative-vs-cumulative was a tautology. - SQLite (WAL + Cache Hit) — WAL size (⚠️ when >100 MB), page count, page size, cache hit rate (⚠️ when <90%). ## Tests - Backend (`cmd/server/perf_io_test.go`) — `TestPerfIOEndpoint_ReturnsValidJSON`, `TestPerfSqliteEndpoint_ReturnsValidJSON`, `TestPerfWriteSourcesEndpoint_ReturnsSources` exercise the three new endpoints. 
Skips the `/proc/self/io` non-zero-rate assertion when `/proc` is unavailable. - Frontend (`test-perf-disk-io-1120.js`) — vm-sandbox runs `perf.js` with stubbed `fetch`, asserts the three new sections render with their headings + values. E2E assertion added: test-perf-disk-io-1120.js:91 ## TDD 1. Red commit (`21abd22`) — added the three handlers as no-op stubs returning empty values; tests fail on assertion mismatches (non-zero rate, `pageSize > 0`, headings present). 2. Green commit (`d8da54c`) — fills in the real `/proc/self/io` parser, PRAGMA queries, ingestor stats writer, and Perf page rendering. --------- Co-authored-by: corescope-bot <bot@corescope.local> Co-authored-by: Kpa-clawbot <kpa-clawbot@users.noreply.github.com>	2026-05-05 17:56:56 -07:00
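The delta-rate computation behind `/api/perf/io` in the commit above (74dffa2fb7) can be sketched as below. Assumptions: the function and type names are hypothetical, the sketch parses a string rather than opening `/proc/self/io`, and mapping `readBytesPerSec`/`writeBytesPerSec` to the kernel's `read_bytes`/`write_bytes` counters is a guess, since the PR text does not say which fields feed them.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
	"time"
)

// ioSample holds two cumulative counters from /proc/<pid>/io.
type ioSample struct {
	readBytes, writeBytes uint64
}

// parseProcIO parses the "key: value" lines of /proc/<pid>/io.
// Unknown keys are ignored; a non-numeric value is an error.
func parseProcIO(text string) (ioSample, error) {
	var s ioSample
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		key, val, ok := strings.Cut(sc.Text(), ":")
		if !ok {
			continue
		}
		n, err := strconv.ParseUint(strings.TrimSpace(val), 10, 64)
		if err != nil {
			return s, fmt.Errorf("parse %q: %w", key, err)
		}
		switch strings.TrimSpace(key) {
		case "read_bytes":
			s.readBytes = n
		case "write_bytes":
			s.writeBytes = n
		}
	}
	return s, sc.Err()
}

// delta guards against counter resets so the subtraction cannot underflow.
func delta(prev, cur uint64) uint64 {
	if cur < prev {
		return 0
	}
	return cur - prev
}

// rates converts two cumulative samples into bytes per second.
func rates(prev, cur ioSample, elapsed time.Duration) (readPerSec, writePerSec float64) {
	secs := elapsed.Seconds()
	if secs <= 0 {
		return 0, 0
	}
	return float64(delta(prev.readBytes, cur.readBytes)) / secs,
		float64(delta(prev.writeBytes, cur.writeBytes)) / secs
}

func main() {
	prev, _ := parseProcIO("read_bytes: 4096\nwrite_bytes: 8192\n")
	cur, _ := parseProcIO("read_bytes: 8192\nwrite_bytes: 28672\n")
	r, w := rates(prev, cur, 2*time.Second)
	fmt.Println(r, w) // 2048 10240
}
```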
Kpa-clawbot	76d89e6578	fix(ingestor): exclude path_json='[]' rows from backfill WHERE (#1119 ) (#1121 ) ## Summary `BackfillPathJSONAsync` re-selected observations whose `path_json` was already `'[]'`, rewrote them to `'[]'`, and looped forever. The `len(batch) == 0` exit condition was never reached, the migration marker was never recorded, and the ingestor sustained 2–3 MB/s WAL writes at idle (76% of CPU in `sqlite.Exec` per pprof). ## Fix Drop `'[]'` from the WHERE clause: ```diff WHERE o.raw_hex IS NOT NULL AND o.raw_hex != '' - AND (o.path_json IS NULL OR o.path_json = '' OR o.path_json = '[]') + AND (o.path_json IS NULL OR o.path_json = '') ``` `'[]'` is the "already attempted, no hops" sentinel (still written at line 994 of `cmd/ingestor/db.go` when `DecodePathFromRawHex` returns no hops). Excluding it from the WHERE lets the loop terminate after one full pass and the migration marker `backfill_path_json_from_raw_hex_v1` to be recorded. ## TDD - Red commit (`19f8004`): `TestBackfillPathJSONAsync_BracketRowsTerminate` — seeds 100 observations with `path_json='[]'` and a `raw_hex` that decodes to zero hops, asserts the migration marker is written within 5s. Fails on master with "backfill never recorded migration marker within 5s — infinite loop on path_json='[]' rows". - Green commit (`7019100`): WHERE-clause fix + updates `TestBackfillPathJsonFromRawHex` row 1 expectation (the pre-seeded `'[]'` row is now correctly skipped instead of being re-decoded). 
## Test results ``` ok github.com/corescope/ingestor 49.656s ``` ## Acceptance criteria from #1119 - [x] Backfill terminates within 1 polling cycle of having no progress to make - [x] Migration marker `backfill_path_json_from_raw_hex_v1` written after termination - [x] On restart, backfill recognizes migration done and exits immediately (existing behavior — the migration check at the top of `BackfillPathJSONAsync` was always correct; the bug was that the marker never got written) - [x] Test: seed DB with N observations all having `path_json = '[]'` → backfill runs once → no UPDATEs issued, migration marker written - [ ] Disk write rate on idle staging drops from 2–3 MB/s to <100 KB/s — to be verified by the user post-deploy Fixes #1119. --------- Co-authored-by: OpenClaw Bot <bot@openclaw.local>	2026-05-05 17:35:16 -07:00
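Why the WHERE-clause change in the commit above (76d89e6578) lets the loop terminate can be shown with an in-memory simulation. This is a sketch only: there is no SQLite here, `obs` and `runBackfill` are hypothetical, and every row is assumed to decode to zero hops so that each selected row is rewritten to the `'[]'` sentinel.

```go
package main

import "fmt"

// obs is a hypothetical in-memory stand-in for an observations row.
type obs struct {
	rawHex   string
	pathJSON *string // nil models SQL NULL
}

// needsBackfill mirrors the fixed WHERE clause: '[]' means
// "already attempted, no hops" and is not selected again.
func needsBackfill(o obs) bool {
	return o.rawHex != "" && (o.pathJSON == nil || *o.pathJSON == "")
}

// needsBackfillBuggy mirrors the pre-fix clause, which also selected '[]'.
func needsBackfillBuggy(o obs) bool {
	return o.rawHex != "" && (o.pathJSON == nil || *o.pathJSON == "" || *o.pathJSON == "[]")
}

// runBackfill repeats passes until one selects no rows (the equivalent of
// len(batch) == 0) or maxPasses is reached.
func runBackfill(rows []obs, selectRow func(obs) bool, maxPasses int) (passes int, terminated bool) {
	for passes = 0; passes < maxPasses; passes++ {
		selected := 0
		for i := range rows {
			if !selectRow(rows[i]) {
				continue
			}
			selected++
			sentinel := "[]" // assumption: decoding yields zero hops
			rows[i].pathJSON = &sentinel
		}
		if selected == 0 {
			return passes + 1, true
		}
	}
	return passes, false
}

func main() {
	sentinel := "[]"
	fixed := []obs{{rawHex: "AA", pathJSON: &sentinel}, {rawHex: "BB"}}
	fmt.Println(runBackfill(fixed, needsBackfill, 10)) // 2 true

	buggy := []obs{{rawHex: "AA", pathJSON: &sentinel}}
	fmt.Println(runBackfill(buggy, needsBackfillBuggy, 10)) // 10 false
}
```

With the old predicate every pass re-selects the rows it just wrote, so the empty-batch exit is never reached; with the fixed predicate the second pass finds nothing to do.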
Kpa-clawbot	45f2607f75	perf(ingestor): group commit observation INSERTs by time window (M1, refs #1115 ) (#1117 ) ## Summary Implements M1 from #1115: batches observation/transmission INSERTs into a single SQLite `BEGIN/COMMIT` window instead of fsyncing per packet. At ~250 obs/sec this drops WAL fsync rate from ~20/s to ~1/s and eliminates the `obs-persist skipped` / `SQLITE_BUSY` log spam that the issue documents. This is a partial fix — it ships the group-commit mechanism. Acceptance items 6–7 (measured fsync rate / measured `obs-persist skipped` rate at staging steady-state) require post-deploy observation, and M2 (per-`tx_hash` observation buffering) is intentionally deferred. The issue stays open for the user to verify on staging. > Partial fix for #1115 — does not auto-close. Refs #1115. ## Mechanism - `Store` gains an active `sql.Tx`, `pendingRows` counter, `gcMu`, and the `groupCommitMs` / `groupCommitMaxRows` knobs. `SetGroupCommit(ms, maxRows)` enables the mode; `FlushGroupTx()` commits the in-flight tx. - `InsertTransmission` lazily opens a tx on the first call after each flush, then issues all writes through `tx.Stmt()` bindings of the existing prepared statements. With `MaxOpenConns(1)` the connection is already serialized; `gcMu` serializes group-commit state without contention. - A goroutine in `cmd/ingestor/main.go` calls `FlushGroupTx()` every `groupCommitMs` ms. `pendingRows >= groupCommitMaxRows` triggers an eager flush. `Close()` flushes before the WAL checkpoint so no rows are lost on graceful shutdown. - `groupCommitMs == 0` short-circuits to the legacy per-call auto-commit path (statements bound to `s.db`, no tx) — current behavior preserved byte-for-byte for operators who opt out. ## Config Two new optional fields (ingestor-only), both documented in `config.example.json`: \| Field \| Default \| Effect \| \|---\|---\|---\| \| `groupCommitMs` \| `1000` \| Flush window in ms. `0` disables batching (legacy per-packet auto-commit). 
\| \| `groupCommitMaxRows` \| `1000` \| Safety cap; when exceeded the queue flushes immediately to bound memory and the crash-loss window. \| No DB schema change. No required config change on upgrade. ## Tests (TDD red → green visible in commits) `cmd/ingestor/group_commit_test.go` contains three assertions, written first as the red commit: - `TestGroupCommit_BatchesInsertsIntoOneTx`: 50 `InsertTransmission` calls inside a wide window produce 0 commits until `FlushGroupTx`, then exactly 1; all 50 rows visible after flush. (This is the spec's "50 observations → 1 SQLite write transaction" assertion.) - `TestGroupCommit_Disabled`: `groupCommitMs=0` keeps every insert immediately visible and `GroupCommitFlushes` never advances. (Spec's "groupCommitMs=0 reverts to per-packet behavior" assertion.) - `TestGroupCommit_MaxRowsForcesEarlyFlush`: cap=3, 7 inserts → 2 auto-flushes from the cap + 1 final manual flush = 3 total. Red commit: `e2b0370` (stubs `SetGroupCommit` / `FlushGroupTx` so the tests compile and fail on assertions, not import errors). Green commit: `73f3559`. Full ingestor suite (`go test ./...` in `cmd/ingestor`) stays green, ~49 s. ## Performance This PR is the perf change itself. Local micro-test (the new `TestGroupCommit_BatchesInsertsIntoOneTx`) shows the structural property: 50 inserts → 1 commit. The fsync-rate measurement called out in the M1 acceptance criteria (`~20/s → ~1/s` at 250 obs/sec) requires staging deployment to confirm; that's the remaining open item that keeps #1115 open after this merges. No hot-path regressions: when `groupCommitMs > 0` we acquire one mutex per insert (uncontended in the steady state, since the connection was already single-threaded via `MaxOpenConns(1)`). When `groupCommitMs == 0` the code path is identical to before plus one nil-tx check. ## What this PR does NOT do (per spec) - Does not collapse "30 observations of one packet" into 1 row write; that's M2. 
- Does not eliminate dual-writer contention with `cmd/server`'s `resolved_path` writes. - Does not change observation ordering or live broadcast latency. --------- Co-authored-by: corescope-bot <bot@corescope.local>	2026-05-05 16:38:43 -07:00
Kpa-clawbot	136e1d23c8	feat(#730 ): foreign-advert detection — flag instead of silent drop (#1084 ) ## Summary Partial fix for #730 (M1 only — M2 frontend and M3 alerting deferred). Today the ingestor silently drops ADVERTs whose GPS lies outside the configured `geo_filter` polygon. That's the wrong default for an analytics tool — operators get zero visibility into bridged or leaked meshes. This PR makes the new default flag, don't drop: foreign adverts are stored, the node row is tagged `foreign_advert=1`, and the API surfaces `"foreign": true` so dashboards / map overlays can be built on top. ## Behavior \| Mode \| What happens to an ADVERT outside `geo_filter` \| \|---\|---\| \| (default) flag \| Stored, marked `foreign_advert=1`, exposed via API \| \| drop (legacy) \| Silently dropped (preserves old behavior for ops who want it) \| ## What's done (M1 — Backend) - ingestor stores foreign adverts instead of dropping - `nodes.foreign_advert` column added (migration) - `/api/nodes` and `/api/nodes/{pk}` expose `foreign: true` field - Config: `geofilter.action: "flag"\|"drop"` (default `flag`) - Tests + config docs ## What's NOT done (deferred to M2 + M3) - M2 — Frontend: Map overlay showing foreign adverts as distinct markers, foreign-advert filter on packets/nodes pages, dedicated foreign-advert dashboard - M3 — Alerting: Time-series detection of bridging events, alert when foreign advert rate spikes, identify bridge entry-point nodes Issue #730 remains open for M2 and M3. --------- Co-authored-by: corescope-bot <bot@corescope>	2026-05-05 01:58:52 -07:00
Kpa-clawbot	227f375b4a	test(ingestor): regression test for observer metadata persistence (#1044 ) (#1047 ) Adds end-to-end test proving that `extractObserverMeta` + `UpsertObserver` correctly stores model, firmware, battery_mv, noise_floor, uptime_secs from a real MQTT status payload. Test passes — confirms the code path works. #1044 was caused by upstream observers not including metadata fields in their status payloads (older `meshcoretomqtt` client versions), not a code bug. Closes #1044 Co-authored-by: meshcore-bot <bot@meshcore.local>	2026-05-05 06:18:47 +00:00
Kpa-clawbot	c9301fee9c	fix(ingestor): extract per-hop SNR for TRACE packets at ingest time (#1028 ) ## Problem PR #1007 added per-hop SNR extraction (`snrValues`) for TRACE packets to `cmd/server/decoder.go`. That code path is only hit by the on-demand re-decode endpoint (packet detail). The actual ingest pipeline runs `cmd/ingestor/decoder.go`, decodes the packet once, and persists `decoded_json` into SQLite. The server then serves `decoded_json` as-is for list/feed queries. Net effect: `snrValues` never appears in any production response, because the ingestor's decoder was never updated. Confirmed empirically: `strings /app/corescope-ingestor \| grep snrVal` returns nothing. ## Fix Port the SNR extraction logic from `cmd/server/decoder.go` (lines 410–422) into `cmd/ingestor/decoder.go`. For TRACE packets, the header path bytes are int8 SNR values in quarter-dB encoding; extract them into `payload.SNRValues` before `path.Hops` is overwritten with payload-derived hop IDs. Also adds the matching `SNRValues []float64` field to the ingestor's `Payload` struct so it serializes into `decoded_json`. ## TDD - Red commit (`6ae4c07`): adds `TestDecodeTraceExtractsSNRValues` + `SNRValues` field stub. Compiles, fails on assertion (`len(SNRValues)=0, want 2`). - Green commit (`4a4f3f3`): adds extraction loop. Test passes. Test packet: `26022FF8116A23A80000000001C0DE1000DEDE` - header `0x26` = TRACE + DIRECT - pathByte `0x02` = hash_size 1, hash_count 2 - header path `2F F8` → SNR `[int8(0x2F)/4, int8(0xF8)/4]` = `[11.75, -2.0]` ## Files - `cmd/ingestor/decoder.go` — `+16` (field + extraction) - `cmd/ingestor/decoder_test.go` — `+29` (red test) ## Out of scope - `cmd/server/decoder.go` is already correct (PR #1007). Untouched. - Backfill of historical `decoded_json` rows. New TRACE packets get SNR; old rows do not until re-decoded. --------- Co-authored-by: corescope-bot <bot@corescope.local>	2026-05-03 21:42:14 -07:00
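The quarter-dB decoding described in the commit above (c9301fee9c) reduces to a signed reinterpretation of each header path byte. A minimal sketch, assuming a hypothetical helper name `snrFromPathBytes`; the real extraction lives inside the decoder and is not reproduced here.

```go
package main

import "fmt"

// snrFromPathBytes interprets each byte as a signed int8 in quarter-dB
// units and returns the values in dB.
func snrFromPathBytes(path []byte) []float64 {
	out := make([]float64, 0, len(path))
	for _, b := range path {
		out = append(out, float64(int8(b))/4)
	}
	return out
}

func main() {
	// Header path bytes 2F F8 from the test packet in the commit message.
	fmt.Println(snrFromPathBytes([]byte{0x2F, 0xF8})) // prints [11.75 -2]
}
```

The `int8` conversion is what turns `0xF8` into -8 rather than 248; dividing the unsigned byte directly would give 62.0 instead of -2.0.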
Kpa-clawbot	5e01de0d52	fix: make path_json backfill async to unblock MQTT startup (#1013 ) ## Summary P0 fix: The `path_json` backfill migration (PR #983) ran synchronously in `applySchema`, blocking the ingestor main goroutine. On staging (~502K observations), MQTT never connected, so no new packets were ingested for 15+ hours. ## Fix Extract the backfill into `BackfillPathJSONAsync()`, a method on `Store` that launches the work in a background goroutine. Called from `main.go` before MQTT connect, it runs concurrently without blocking subscription. Pattern: identical to `backfillResolvedPathsAsync` in the server (same lesson learned). ## Safety - Idempotent: checks `_migrations` table, skips if already recorded - Only touches `path_json IS NULL` rows, so there is no conflict with live ingest (new observations get `path_json` at write time) - Panic-recovered goroutine with start/completion logging - Batched (1000 rows per iteration) to avoid memory pressure ## TDD - Red commit: `c6e1375` (test asserts `BackfillPathJSONAsync` method exists + OpenStore doesn't block) - Green commit: `015871f` (implements async method, all tests pass) ## Files changed - `cmd/ingestor/db.go`: removed sync backfill from `applySchema`, added `BackfillPathJSONAsync()` - `cmd/ingestor/main.go`: call `store.BackfillPathJSONAsync()` after store creation - `cmd/ingestor/db_test.go`: new async tests + updated existing test to use async API --------- Co-authored-by: you <you@example.com>	2026-05-03 11:29:56 -07:00
Kpa-clawbot	b0e4d2fa18	feat: add optional MQTT region field (#788 ) (#1012 ) ## Summary Add optional `region` field to MQTT source config and JSON payload, enabling publishers to explicitly provide region data without relying solely on topic path structure. ## Changes - `MQTTSource.Region` — new optional config field. When set, acts as default region for all messages from that source (useful when a broker serves a single region). - `MQTTPacketMessage.Region` — new optional JSON payload field. Publishers can include `"region": "PDX"` in their MQTT messages. - `PacketData.Region` — carries the resolved region through to storage. - Priority resolution: payload `region` > topic-derived region > source config `region` - Observer IATA is updated with the effective region on every packet. ## Config example ```json { "mqttSources": [ { "name": "cascadia", "broker": "tcp://cascadia-broker:1883", "topics": ["meshcore/#"], "region": "PDX" } ] } ``` ## Payload example ```json {"raw": "0a1b2c...", "SNR": 5.2, "region": "PDX"} ``` ## TDD - Red commit: `980304c` (tests fail at compile — fields don't exist) - Green commit: `4caf88b` (implementation, all tests pass) ## Unblocks - #804, #770, #730 (all depend on region being available on observations) Fixes #788 --------- Co-authored-by: you <you@example.com>	2026-05-03 11:21:54 -07:00
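The priority order stated in the commit above (b0e4d2fa18), payload `region` over topic-derived region over source config `region`, can be sketched as a first-non-empty lookup. `resolveRegion` is a hypothetical name, and treating the empty string as "not provided" is an assumption of this sketch.

```go
package main

import "fmt"

// resolveRegion returns the first non-empty candidate in priority order:
// payload field, then topic-derived region, then the source's configured default.
func resolveRegion(payloadRegion, topicRegion, sourceRegion string) string {
	for _, r := range []string{payloadRegion, topicRegion, sourceRegion} {
		if r != "" {
			return r
		}
	}
	return ""
}

func main() {
	fmt.Println(resolveRegion("PDX", "SEA", "YVR")) // payload wins: PDX
	fmt.Println(resolveRegion("", "SEA", "YVR"))    // topic wins: SEA
	fmt.Println(resolveRegion("", "", "YVR"))       // source default: YVR
}
```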
Kpa-clawbot	153308134e	feat: add global observer IATA whitelist config (#1001 ) ## Summary Adds a global `observerIATAWhitelist` config field that restricts which observer IATA regions are processed by the ingestor. ## Problem Operators running regional instances (e.g., Sweden) want to ensure only observers physically in their region contribute data. The existing per-source `iataFilter` only filters packet messages but still allows status messages through, meaning observers from other regions appear in the database. ## Solution New top-level config field `observerIATAWhitelist`: - When non-empty, all messages (status + packets) from observers outside the whitelist are silently dropped - Case-insensitive matching - Empty list = all regions allowed (fully backwards compatible) - Lazy O(1) lookup via cached uppercase set (same pattern as `observerBlacklist`) ### Config example ```json { "observerIATAWhitelist": ["ARN", "GOT"] } ``` ## TDD - Red commit: `f19c2b2` — tests for `ObserverIATAWhitelist` field and `IsObserverIATAAllowed` method (build fails) - Green commit: `782f516` — implementation + integration test ## Files changed - `cmd/ingestor/config.go` — new field, new method `IsObserverIATAAllowed` - `cmd/ingestor/main.go` — whitelist check in `handleMessage` before status processing - `cmd/ingestor/config_test.go` — unit tests for config parsing and matching - `cmd/ingestor/main_test.go` — integration test for handleMessage filtering Fixes #914 --------- Co-authored-by: you <you@example.com>	2026-05-03 10:23:35 -07:00
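The whitelist semantics listed in the commit above (153308134e), case-insensitive matching, an empty list allowing everything, and a lazily built uppercase set, could be implemented along these lines. This is a sketch: the `config` struct layout and the use of `sync.Once` are assumptions; only the field name `ObserverIATAWhitelist` and the method name `IsObserverIATAAllowed` come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// config is a hypothetical reduction of the ingestor config to one field.
type config struct {
	ObserverIATAWhitelist []string

	once sync.Once
	set  map[string]struct{} // uppercase lookup set, built on first use
}

// IsObserverIATAAllowed reports whether messages from the given IATA region
// should be processed. An empty whitelist allows every region.
func (c *config) IsObserverIATAAllowed(iata string) bool {
	if len(c.ObserverIATAWhitelist) == 0 {
		return true
	}
	c.once.Do(func() {
		c.set = make(map[string]struct{}, len(c.ObserverIATAWhitelist))
		for _, v := range c.ObserverIATAWhitelist {
			c.set[strings.ToUpper(v)] = struct{}{}
		}
	})
	_, ok := c.set[strings.ToUpper(iata)]
	return ok
}

func main() {
	c := &config{ObserverIATAWhitelist: []string{"ARN", "GOT"}}
	fmt.Println(c.IsObserverIATAAllowed("arn")) // true
	fmt.Println(c.IsObserverIATAAllowed("PDX")) // false
	fmt.Println((&config{}).IsObserverIATAAllowed("PDX")) // true
}
```

Building the set once keeps the per-message check at a single map lookup, which matters because the check runs before any status or packet handling.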
Kpa-clawbot	2e3a94b86d	chore(db): one-time cleanup of legacy packets with empty hash or null timestamp (closes #994 ) (#997 ) ## Summary One-time startup migration that deletes legacy packets (transmissions + observations) with empty hash or empty `first_seen` timestamp. This is the write-side cleanup following #993's read-side filter. ### Migration: `cleanup_legacy_null_hash_ts` - Checks `_migrations` table for marker - If not present: deletes observations referencing bad transmissions, then deletes the transmissions themselves - Logs count of deleted rows - Records marker for idempotency ### TDD - Red commit: `b1a24a1` — test asserts migration deletes bad rows (fails without implementation) - Green commit: `2b94522` — implements the migration, all tests pass Fixes #994 --------- Co-authored-by: you <you@example.com>	2026-05-02 23:15:20 -07:00
Kpa-clawbot	d43c95a4bb	fix(ingestor): warn when TRACE payload decode fails but observation stored (closes #889 ) (#992 ) ## Summary Closes #889. When a TRACE packet's payload is too short to decode (< 9 bytes), `decodeTrace` returns an error in `Payload.Error` but the observation is still stored with empty `Path.Hops`. Previously this was completely silent — no log, no anomaly flag, no indication the row is degraded. This fix populates `DecodedPacket.Anomaly` with the decode error message (e.g., `"TRACE payload decode failed: too short"`) so operators and downstream consumers can identify degraded observations. ## TDD Commit History 1. Red commit `04e0165` — failing test asserting `Anomaly` is set when TRACE payload decode fails 2. Green commit `d3e72d1` — 3-line fix in `decoder.go` line 601-603: check `payload.Error != ""` for TRACE packets and set anomaly ## What Changed `cmd/ingestor/decoder.go` (lines 601-603): Added a check before the existing TRACE path-parsing block. If `payload.Error` is non-empty for a TRACE packet, `anomaly` is set to `"TRACE payload decode failed: <error>"`. `cmd/ingestor/decoder_test.go`: Added `TestDecodeTracePayloadFailSetsAnomaly` — constructs a TRACE packet with a 4-byte payload (too short), asserts the packet is still returned (observation stored) and `Anomaly` is populated. ## Verification - `go build ./...` ✓ - `go test ./...` ✓ (all pass including new test) - Anti-tautology: reverting the fix causes the new test to fail (asserts `pkt.Anomaly == ""` → error) --------- Co-authored-by: you <you@example.com>	2026-05-02 20:34:27 -07:00
Kpa-clawbot	dd2f044f2b	fix: cache RW SQLite connection + dedup DBConfig (closes #921 ) (#982 ) Closes #921 ## Summary Follow-up to #920 (incremental auto-vacuum). Addresses both items from the adversarial review: ### 1. RW connection caching Previously, every call to `openRW(dbPath)` opened a new SQLite RW connection and closed it after use. This happened in: - `runIncrementalVacuum` (~4x/hour) - `PruneOldPackets`, `PruneOldMetrics`, `RemoveStaleObservers` - `buildAndPersistEdges`, `PruneNeighborEdges` - All neighbor persist operations Now a single `*sql.DB` handle (with `MaxOpenConns(1)`) is cached process-wide via `cachedRW(dbPath)`. The underlying connection pool manages serialization. The original `openRW()` function is retained for one-shot test usage. ### 2. DBConfig dedup `DBConfig` was defined identically in both `cmd/server/config.go` and `cmd/ingestor/config.go`. Extracted to `internal/dbconfig/` as a shared package; both binaries now use a type alias (`type DBConfig = dbconfig.DBConfig`). ## Tests added \| Test \| File \| \|------\|------\| \| `TestCachedRW_ReturnsSameHandle` \| `cmd/server/rw_cache_test.go` \| \| `TestCachedRW_100Calls_SingleConnection` \| `cmd/server/rw_cache_test.go` \| \| `TestGetIncrementalVacuumPages_Default` \| `internal/dbconfig/dbconfig_test.go` \| \| `TestGetIncrementalVacuumPages_Configured` \| `internal/dbconfig/dbconfig_test.go` \| ## Verification ``` ok github.com/corescope/server 20.069s ok github.com/corescope/ingestor 47.117s ok github.com/meshcore-analyzer/dbconfig 0.003s ``` Both binaries build cleanly. 100 sequential `cachedRW()` calls return the same handle with exactly 1 entry in the cache map. --------- Co-authored-by: you <you@example.com>	2026-05-02 20:15:30 -07:00
Kpa-clawbot	58484ad924	feat(ingestor): backfill observations.path_json from raw_hex (closes #888 ) (#983 ) ## Summary Adds an idempotent startup migration to the ingestor that backfills `observations.path_json` from per-observation `raw_hex` (added in #882). Approach: Server-side migration (Option B) — runs automatically at startup, chunked in batches of 1000, tracked via `_migrations` table. Chosen over a standalone script because: 1. Follows existing migration pattern (channel_hash, last_packet_at, etc.) 2. Zero operator action required — just deploy 3. Idempotent — safe to restart mid-migration (uncommitted rows get picked up next run) ## What it does - Selects observations where `raw_hex` is populated but `path_json` is NULL/empty/`[]` - Excludes TRACE packets (`payload_type = 9`) at the SQL level — their header bytes are SNR values, not hops - Decodes hops via `packetpath.DecodePathFromRawHex` (reuses existing helper) - Updates `path_json` with the decoded JSON array - Marks rows with undecoded/empty hops as `'[]'` to prevent infinite re-scanning - Records `backfill_path_json_from_raw_hex_v1` in `_migrations` when complete ## Safety - Never overwrites existing non-empty `path_json` — only fills where missing - Batched (1000 rows per iteration) — won't OOM on large DBs - TRACE-safe — excluded at query level per `packetpath.PathBytesAreHops` semantics ## Test `TestBackfillPathJsonFromRawHex` — creates synthetic observations with: - Empty path_json + valid raw_hex → verifies backfill populates correctly - NULL path_json → verifies backfill populates - Existing path_json → verifies NO overwrite - TRACE packet → verifies skip Anti-tautology: test asserts specific decoded values (`["AABB","CCDD"]`) from known raw_hex input, not just "something changed." Closes #888 Co-authored-by: you <you@example.com>	2026-05-02 19:52:43 -07:00
Kpa-clawbot	5aa8f795cd	feat(ingestor): per-source MQTT connect timeout (#931 ) (#977 ) ## Summary Per-source MQTT connect timeout, correctly targeting the `WaitTimeout` startup gate (#931). ## What changed - Added `connectTimeoutSec` field to `MQTTSource` struct (per-source, not global) — `config.go:24` - Added `ConnectTimeoutOrDefault()` helper returning configured value or 30 (default from #926) — `config.go:29` - Replaced hardcoded `WaitTimeout(30 * time.Second)` with `WaitTimeout(time.Duration(connectTimeout) * time.Second)` — `main.go:173` - Updated `config.example.json` with field at source level - Unit tests for default (30) and custom values ## Why this supersedes #976 PR #976 made paho's `SetConnectTimeout` (per-TCP-dial, was 10s) configurable via a global `mqttConnectTimeoutSeconds` field. Issue #931 explicitly references the 30s timeout — which is `WaitTimeout(30s)`, the startup gate from #926. It also requests per-source config, not global. This PR targets the correct timeout at the correct granularity. ## Live verification (Rule 18) Two sources pointed at unreachable brokers: - `fast` (`connectTimeoutSec: 5`): timed out in 5s ✅ - `default` (unset): timed out in 30s ✅ ``` 19:00:35 MQTT [fast] connect timeout: 5s 19:00:40 MQTT [fast] initial connection timed out — retrying in background 19:00:40 MQTT [default] connect timeout: 30s 19:01:10 MQTT [default] initial connection timed out — retrying in background ``` Closes #931 Supersedes #976 Co-authored-by: you <you@example.com>	2026-05-02 12:08:25 -07:00
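The per-source timeout helper from the commit above (5aa8f795cd) can be sketched as follows. Assumptions: the struct is reduced to two fields, the JSON tags are guessed from the config key `connectTimeoutSec`, and treating zero or negative values as "unset" is a choice made for this sketch; the commit message only states the default of 30.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// MQTTSource is a reduced, hypothetical version of the per-source config.
type MQTTSource struct {
	Name              string `json:"name"`
	ConnectTimeoutSec int    `json:"connectTimeoutSec"`
}

// ConnectTimeoutOrDefault returns the configured timeout in seconds,
// or 30 when the field is unset (zero) or not positive.
func (s MQTTSource) ConnectTimeoutOrDefault() int {
	if s.ConnectTimeoutSec > 0 {
		return s.ConnectTimeoutSec
	}
	return 30
}

func main() {
	raw := `[{"name":"fast","connectTimeoutSec":5},{"name":"default"}]`
	var sources []MQTTSource
	if err := json.Unmarshal([]byte(raw), &sources); err != nil {
		panic(err)
	}
	for _, s := range sources {
		// This duration is what would be passed to the connect token's WaitTimeout.
		timeout := time.Duration(s.ConnectTimeoutOrDefault()) * time.Second
		fmt.Printf("%s: %s\n", s.Name, timeout)
	}
}
```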
Kpa-clawbot	1e7c187521	fix(ingestor): address review BLOCKERs from PR #926 (goroutine leak + guard semantics) [v2] (#974 ) ## fix(ingestor): address review BLOCKERs from PR #926 (goroutine leak + guard semantics) Supersedes #970. Rebased onto current master to resolve merge conflicts. ### Changes (same as #970) - BL1 (goroutine leak): Call `client.Disconnect(0)` on the error path after `Connect()` fails with `ConnectRetry=true`, preventing Paho's internal retry goroutines from leaking. - BL2 (guard semantics): Use `connectedCount == 0` instead of `len(clients) == 0` to detect zero-connected state, since timed-out clients are appended to the slice. - Tests: `TestBL1_GoroutineLeakOnHardFailure` and `TestBL2_ZeroConnectedFatals` covering both blockers. ### Context - Fixes blockers raised in review of #926 - Related: #910 (original hang bug) Co-authored-by: you <you@example.com>	2026-05-02 12:05:02 -07:00
Kpa-clawbot	3364eed303	feat: separate "Last Status Update" from "Last Packet Observation" for observers (v3 rebase) (#969 ) Rebased version of #968 (which was itself a rebase of #905) — resolves merge conflict with #906 (clock-skew UI) that landed on master. ## Conflict resolution `public/observers.js` — master (#906) added "Clock Offset" column to observer table; #968 split "Last Seen" into "Last Status" + "Last Packet" columns. Combined both: the table now has Status \| Name \| Region \| Last Status \| Last Packet \| Packets \| Packets/Hour \| Clock Offset \| Uptime. ## What this PR adds (unchanged from #968/#905) - `last_packet_at` column in observers DB table - Separate "Last Status Update" and "Last Packet Observation" display in observers list and detail page - Server-side migration to add the column automatically - Backfill heuristic for existing data - Tests for ingestor and server ## Verification - All Go tests pass (`cmd/server`, `cmd/ingestor`) - Frontend tests pass (`test-packets.js`, `test-hash-color.js`) - Built server, hit `/api/observers` — `last_packet_at` field present in JSON - Observer table header has all 9 columns including both Last Packet and Clock Offset ## Prior PRs - #905 — original (conflicts with master) - #968 — first rebase (conflicts after #906 landed) - This PR — second rebase, resolves #906 conflict Supersedes #968. Closes #905. --------- Co-authored-by: you <you@example.com>	2026-05-02 12:03:42 -07:00
efiten	d65122491e	fix(ingestor): unblock startup when one of multiple MQTT sources is unreachable (#926 ) ## Summary - With `ConnectRetry=true`, paho's `token.Wait()` only returns on success — it blocks forever for unreachable brokers, stalling the entire startup loop before any other source connects - Switches to `token.WaitTimeout(30s)`: on timeout the client is still tracked so `ConnectRetry` keeps retrying in background; `OnConnect` fires and subscribes when it eventually connects - Adds `TestMQTTConnectRetryTimeoutDoesNotBlock` to confirm `WaitTimeout` returns within deadline for unreachable brokers (regression guard for this exact failure mode) Fixes #910 ## Test plan - [x] Two MQTT sources configured, one unreachable: ingestor reaches `Running` status and ingests from the reachable source immediately on startup - [x] Unreachable source logs `initial connection timed out — retrying in background` and reconnects automatically when the broker comes back - [x] Single source, reachable: behaviour unchanged (`Running — 1 MQTT source(s) connected`) - [x] Single source, unreachable: `Running — 0 MQTT source(s) connected, 1 retrying in background`; ingestion starts once broker is available - [x] `go test ./...` passes (excluding pre-existing `TestOpenStoreInvalidPath` failure on master) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 11:31:51 -07:00
Kpa-clawbot	b3a9677c52	feat(ingestor + server): observerBlacklist config (#962 ) (#963 ) ## Summary Implements `observerBlacklist` config — mirrors the existing `nodeBlacklist` pattern for observers. Drop observers by pubkey at ingest, with defense-in-depth filtering on the server side. Closes #962 ## Changes ### Ingestor (`cmd/ingestor/`) - `config.go`: Added `ObserverBlacklist []string` field + `IsObserverBlacklisted()` method (case-insensitive, whitespace-trimmed) - `main.go`: Early return in `handleMessage` when `parts[2]` (observer ID from MQTT topic) matches blacklist — before status handling, before IATA filter. No UpsertObserver, no observations, no metrics insert. Log line: `observer <pubkey-short> blacklisted, dropping` ### Server (`cmd/server/`) - `config.go`: Same `ObserverBlacklist` field + `IsObserverBlacklisted()` with `sync.Once` cached set (same pattern as `nodeBlacklist`) - `routes.go`: Defense-in-depth filtering in `handleObservers` (skip blacklisted in list) and `handleObserverDetail` (404 for blacklisted ID) - `main.go`: Startup `softDeleteBlacklistedObservers()` marks matching rows `inactive=1` so historical data is hidden - `neighbor_persist.go`: `softDeleteBlacklistedObservers()` implementation ### Tests - `cmd/ingestor/observer_blacklist_test.go`: config method tests (case-insensitive, empty, nil) - `cmd/server/observer_blacklist_test.go`: config tests + HTTP handler tests (list excludes blacklisted, detail returns 404, no-blacklist passes all, concurrent safety) ## Config ```json { "observerBlacklist": [ "EE550DE547D7B94848A952C98F585881FCF946A128E72905E95517475F83CFB1" ] } ``` ## Verification (Rule 18 — actual server output) Before blacklist (no config): ``` Total: 31 DUBLIN in list: True ``` After blacklist (DUBLIN Observer pubkey in `observerBlacklist`): ``` [observer-blacklist] soft-deleted 1 blacklisted observer(s) Total: 30 DUBLIN in list: False ``` Detail endpoint for blacklisted observer returns 404. 
All existing tests pass (`go test ./...` for both server and ingestor). --------- Co-authored-by: you <you@example.com>	2026-05-01 23:11:27 -07:00
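The blacklist check described in #963 (case-insensitive, whitespace-trimmed, nil-safe) can be sketched as follows. This is an illustration under assumptions, not the repository's code: the `Config` struct is reduced to one field, the linear scan replaces the server's cached set, and rejecting an empty ID is a choice made here.

```go
package main

import (
	"fmt"
	"strings"
)

// Config carries only the field relevant to this sketch.
type Config struct {
	ObserverBlacklist []string
}

// IsObserverBlacklisted reports whether id matches any blacklist entry,
// ignoring case and surrounding whitespace. A nil config or an empty id
// never matches (the empty-id rule is an assumption of this sketch).
func (c *Config) IsObserverBlacklisted(id string) bool {
	if c == nil {
		return false
	}
	want := strings.TrimSpace(id)
	if want == "" {
		return false
	}
	for _, entry := range c.ObserverBlacklist {
		if strings.EqualFold(strings.TrimSpace(entry), want) {
			return true
		}
	}
	return false
}

func main() {
	cfg := &Config{ObserverBlacklist: []string{
		" EE550DE547D7B94848A952C98F585881FCF946A128E72905E95517475F83CFB1 ",
	}}
	id := "ee550de547d7b94848a952c98f585881fcf946a128e72905e95517475f83cfb1"
	fmt.Println(cfg.IsObserverBlacklisted(id))
}
```

A linear scan is fine for a short list; a prebuilt map keyed on the normalized pubkey would be the natural next step for hot paths.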
Kpa-clawbot	6345c6fb05	fix(ingestor): observability + bounded backoff for MQTT reconnect (#947 ) (#949 ) ## Summary Fixes #947 — MQTT ingestor silently stalls after `pingresp not received` disconnect due to paho's default 10-minute reconnect backoff and zero observability of reconnect attempts. ## Changes ### `cmd/ingestor/main.go` - Extract `buildMQTTOpts()` — encapsulates MQTT client option construction for testability - `SetMaxReconnectInterval(30s)` — bounds paho's default 10-minute exponential backoff (source: `options.go:137` in `paho.mqtt.golang@v1.5.0`) - `SetConnectTimeout(10s)` — prevents stuck connect attempts from blocking reconnect cycle - `SetWriteTimeout(10s)` — prevents stuck publish writes - `SetReconnectingHandler` — logs `MQTT [<tag>] reconnecting to <broker>` on every reconnect attempt, giving operators visibility into retry behavior - Enhanced `SetConnectionLostHandler` — now includes broker address in log line for multi-source disambiguation ### `cmd/ingestor/mqtt_opts_test.go` (new) - Tests verify `MaxReconnectInterval`, `ConnectTimeout`, `WriteTimeout` are set correctly - Tests verify credential and TLS configuration - Anti-tautology: tests fail if timing settings are removed from `buildMQTTOpts()` ## Operator impact After this change, a pingresp disconnect produces: ``` MQTT [staging] disconnected from tcp://broker:1883: pingresp not received, disconnecting MQTT [staging] reconnecting to tcp://broker:1883 MQTT [staging] reconnecting to tcp://broker:1883 MQTT [staging] connected to tcp://broker:1883 MQTT [staging] subscribed to meshcore/# ``` Max gap between disconnect and first reconnect attempt: ~30s (was up to 10 minutes). --------- Co-authored-by: you <you@example.com>	2026-05-01 00:01:07 -07:00
Kpa-clawbot	aeae7813bc	fix: enable SQLite incremental auto-vacuum so DB shrinks after retention (#919 ) (#920 ) Closes #919 ## Summary Enables SQLite incremental auto-vacuum so the database file actually shrinks after retention reaper deletes old data. Previously, `DELETE` operations freed pages internally but never returned disk space to the OS. ## Changes ### 1. Auto-vacuum on new databases - `PRAGMA auto_vacuum = INCREMENTAL` set via DSN pragma before `journal_mode(WAL)` in the ingestor's `OpenStoreWithInterval` - Must be set before any tables are created; DSN ordering ensures this ### 2. Post-reaper incremental vacuum - `PRAGMA incremental_vacuum(N)` runs after every retention reaper cycle (packets, metrics, observers, neighbor edges) - N defaults to 1024 pages, configurable via `db.incrementalVacuumPages` - Noop on `auto_vacuum=NONE` databases (safe before migration) - Added to both server and ingestor ### 3. Opt-in full VACUUM for existing databases - Startup check logs a clear warning if `auto_vacuum != INCREMENTAL` - `db.vacuumOnStartup: true` config triggers one-time `PRAGMA auto_vacuum = INCREMENTAL; VACUUM` - Logs start/end time for operator visibility ### 4. Documentation - `docs/user-guide/configuration.md`: retention section notes that lowering retention doesn't immediately shrink the DB - `docs/user-guide/database.md`: new guide covering WAL, auto-vacuum, migration, manual VACUUM ### 5. Tests - `TestNewDBHasIncrementalAutoVacuum` — fresh DB gets `auto_vacuum=2` - `TestExistingDBHasAutoVacuumNone` — old DB stays at `auto_vacuum=0` - `TestVacuumOnStartupMigratesDB` — full VACUUM sets `auto_vacuum=2` - `TestIncrementalVacuumReducesFreelist` — DELETE + vacuum shrinks freelist - `TestCheckAutoVacuumLogs` — handles both modes without panic - `TestConfigIncrementalVacuumPages` — config defaults and overrides ## Migration path for existing databases 1. On startup, CoreScope logs: `[db] auto_vacuum=NONE — DB needs one-time VACUUM...` 2. 
Set `db.vacuumOnStartup: true` in config.json 3. Restart — VACUUM runs (blocks startup, minutes on large DBs) 4. Remove `vacuumOnStartup` after migration ## Test results ``` ok github.com/corescope/server 19.448s ok github.com/corescope/ingestor 30.682s ``` --------- Co-authored-by: you <you@example.com>	2026-04-30 23:45:00 -07:00
Kpa-clawbot	56ec590bc4	fix(#886 ): derive path_json from raw_hex at ingest (#887 ) ## Problem Per-observation `path_json` disagrees with `raw_hex` path section for TRACE packets. Reproducer: packet `af081a2c41281b1e`, observer `lutin🏡` - `path_json`: `["67","33","D6","33","67"]` (5 hops — from TRACE payload) - `raw_hex` path section: `30 2D 0D 23` (4 bytes — SNR values in header) ## Root Cause `DecodePacket` correctly parses TRACE packets by replacing `path.Hops` with hop IDs from the payload's `pathData` field (the actual route). However, the header path bytes for TRACE packets contain SNR values (one per completed hop), not hop IDs. `BuildPacketData` used `decoded.Path.Hops` to build `path_json`, which for TRACE packets contained the payload-derived hops — not the header path bytes that `raw_hex` stores. This caused `path_json` and `raw_hex` to describe completely different paths. ## Fix - Added `DecodePathFromRawHex(rawHex)` — extracts header path hops directly from raw hex bytes, independent of any TRACE payload overwriting. - `BuildPacketData` now calls `DecodePathFromRawHex(msg.Raw)` instead of using `decoded.Path.Hops`, guaranteeing `path_json` always matches the `raw_hex` path section. ## Tests (8 new) `DecodePathFromRawHex` unit tests: - hash_size 1, 2, 3, 4 - zero-hop direct packets - transport route (4-byte transport codes before path) `BuildPacketData` integration tests: - TRACE packet: asserts path_json matches raw_hex header path (not payload hops) - Non-TRACE packet: asserts path_json matches raw_hex header path All existing tests continue to pass (`go test ./...` for both ingestor and server). Fixes #886 --------- Co-authored-by: you <you@example.com>	2026-04-21 21:13:58 -07:00
Kpa-clawbot	a605518d6d	fix(#881 ): per-observation raw_hex — each observer sees different bytes on air (#882 ) ## Problem Each MeshCore observer receives a physically distinct over-the-air byte sequence for the same transmission (different path bytes, flags/hops remaining). The `observations` table stored only `path_json` per observer — all observations pointed at one `transmissions.raw_hex`. This prevented the hex pane from updating when switching observations in the packet detail view. ## Changes \| Layer \| Change \| \|-------\|--------\| \| Schema \| `ALTER TABLE observations ADD COLUMN raw_hex TEXT` (nullable). Migration: `observations_raw_hex_v1` \| \| Ingestor \| `stmtInsertObservation` now stores per-observer `raw_hex` from MQTT payload \| \| View \| `packets_v` uses `COALESCE(o.raw_hex, t.raw_hex)` — backward compatible with NULL historical rows \| \| Server \| `enrichObs` prefers `obs.RawHex` when non-empty, falls back to `tx.RawHex` \| \| Frontend \| No changes — `effectivePkt.raw_hex` already flows through `renderDetail` \| ## Tests - Ingestor: `TestPerObservationRawHex` — two MQTT packets for same hash from different observers → both stored with distinct raw_hex - Server: `TestPerObservationRawHexEnrich` — enrichObs returns per-obs raw_hex when present, tx fallback when NULL - E2E: Playwright assertion in `test-e2e-playwright.js` for hex pane update on observation switch E2E assertion added: `test-e2e-playwright.js:1794` ## Scope - Historical observations: raw_hex stays NULL, UI falls back to transmission raw_hex silently - No backfill, no path_json reconstruction, no frontend changes Closes #881 --------- Co-authored-by: you <you@example.com>	2026-04-21 13:45:29 -07:00
efiten	cad1f11073	fix: bypass IATA filter for status messages, fill SNR on duplicate obs (#694 ) (#802 ) ## Problems Two independent ingestor bugs identified in #694: ### 1. IATA filter drops status messages from out-of-region observers The IATA filter ran at the top of `handleMessage()` before any message-type discrimination. Status messages carrying observer metadata (`noise_floor`, battery, airtime) from observers outside the configured IATA regions were silently discarded before `UpsertObserver()` and `InsertMetrics()` ran. Impact: Observers running `meshcoretomqtt/1.0.8.0` in BFL and LAX — the only client versions that include `noise_floor` in status messages — had their health data dropped entirely on prod instances filtering to SJC. Fix: Moved the IATA filter to the packet path only (after the `parts[3] == "status"` branch). Status messages now always populate observer health data regardless of configured region filter. ### 2. `INSERT OR IGNORE` discards SNR/RSSI on late arrival When the same `(transmission_id, observer_idx, path_json)` observation arrived twice — first without RF fields, then with — `INSERT OR IGNORE` silently discarded the SNR/RSSI from the second arrival. Fix: Changed to `ON CONFLICT(...) DO UPDATE SET snr = COALESCE(excluded.snr, snr), rssi = ..., score = ...`. A later arrival with SNR fills in a `NULL`; a later arrival without SNR does not overwrite an existing value. ## Tests - `TestIATAFilterDoesNotDropStatusMessages` — verifies BFL status message is processed when IATA filter includes only SJC, and that BFL packet is still filtered - `TestInsertObservationSNRFillIn` — verifies SNR fills in on second arrival, and is not overwritten by a subsequent null arrival ## Related Partially addresses #694 (upstream client issue of missing SNR in packet messages is out of scope) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 22:16:01 -07:00
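The `COALESCE(excluded.snr, snr)` rule from #802 can be restated in plain Go to make the fill-in behaviour explicit. This is a sketch of the semantics only; `coalesce` is a hypothetical helper, and nil pointers stand in for SQL `NULL`.

```go
package main

import "fmt"

// coalesce mirrors SQL COALESCE(incoming, existing) for a nullable
// column: the incoming value wins when present, otherwise the stored
// value is kept.
func coalesce(incoming, existing *float64) *float64 {
	if incoming != nil {
		return incoming
	}
	return existing
}

func main() {
	snr := 7.5

	// First arrival had no SNR, second arrival carries one: NULL is filled in.
	stored := coalesce(&snr, nil)
	fmt.Println(stored != nil && *stored == 7.5)

	// A later arrival without SNR leaves the stored value untouched.
	stored = coalesce(nil, stored)
	fmt.Println(stored != nil && *stored == 7.5)
}
```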
Kpa-clawbot	a8e1cea683	fix: use payload type bits only in content hash (not full header byte) (#787 ) ## Problem The firmware computes packet content hash as: ``` SHA256(payload_type_byte + [path_len for TRACE] + payload) ``` Where `payload_type_byte = (header >> 2) & 0x0F` — just the payload type bits (2-5). CoreScope was using the full header byte in its hash computation, which includes route type bits (0-1) and version bits (6-7). This meant the same logical packet produced different content hashes depending on route type — breaking dedup and packet lookup. Firmware reference: `Packet.cpp::calculatePacketHash()` uses `getPayloadType()` which returns `(header >> PH_TYPE_SHIFT) & PH_TYPE_MASK`. ## Fix - Extract only payload type bits: `payloadType := (headerByte >> 2) & 0x0F` - Include `path_len` byte in hash for TRACE packets (matching firmware behavior) - Applied to both `cmd/server/decoder.go` and `cmd/ingestor/decoder.go` ## Tests Added - Route type independence: Same payload with FLOOD vs DIRECT route types produces identical hash - TRACE path_len inclusion: TRACE packets with different `path_len` produce different hashes - Firmware compatibility: Hash output matches manual computation of firmware algorithm ## Migration Impact Existing packets in the DB have content hashes computed with the old (incorrect) formula. Options: 1. Recompute hashes via migration (recommended for clean state) 2. Dual lookup — check both old and new hash on queries (backward compat) 3. Accept the break — old hashes become stale, new packets get correct hashes Recommend option 1 (migration) as a follow-up. The volume of affected packets depends on how many distinct route types were seen for the same logical packet. Fixes #786 --------- Co-authored-by: you <you@example.com>	2026-04-18 11:52:22 -07:00
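The hash rule from #787 can be sketched directly from the formula in the commit message. Assumptions are labeled in the code: the numeric value of the TRACE payload type is not stated in the commit, and this sketch returns the full SHA-256 digest without modelling any truncation applied elsewhere.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// payloadTypeTrace is an ASSUMED value for the TRACE payload type; the
// commit message does not give it, so check the firmware headers.
const payloadTypeTrace = 0x09

// payloadType extracts header bits 2-5, dropping route bits (0-1) and
// version bits (6-7).
func payloadType(header byte) byte {
	return (header >> 2) & 0x0F
}

// contentHash computes SHA256(payload_type + [path_len for TRACE] + payload).
func contentHash(header, pathLen byte, payload []byte) [sha256.Size]byte {
	t := payloadType(header)
	h := sha256.New()
	h.Write([]byte{t})
	if t == payloadTypeTrace {
		h.Write([]byte{pathLen})
	}
	h.Write(payload)
	var out [sha256.Size]byte
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	payload := []byte("hello")
	routeA := byte(0x02<<2 | 0x01) // payload type 2, route bits 01
	routeB := byte(0x02<<2 | 0x02) // same payload type, route bits 10

	// Same payload type and payload, different route bits: hashes match.
	fmt.Println(contentHash(routeA, 0, payload) == contentHash(routeB, 0, payload)) // prints: true
}
```

Hashing the full header byte instead of `payloadType(header)` would make the two calls in `main` disagree, which is the dedup break the commit describes.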
Kpa-clawbot	bf674ebfa2	feat: validate advert signatures on ingest, reject corrupt packets (#794 ) ## Summary Validates ed25519 signatures on ADVERT packets during MQTT ingest. Packets with invalid signatures are rejected before storage, preventing corrupt/truncated adverts from polluting the database. ## Changes ### Ingestor (`cmd/ingestor/`) - Signature validation on ingest: After decoding an ADVERT, checks `SignatureValid` from the decoder. Invalid signatures → packet dropped, never stored. - Config flag: `validateSignatures` (default `true`). Set to `false` to disable validation for backward compatibility with existing installs. - `dropped_packets` table: New SQLite table recording every rejected packet with full attribution: - `hash`, `raw_hex`, `reason`, `observer_id`, `observer_name`, `node_pubkey`, `node_name`, `dropped_at` - Indexed on `observer_id` and `node_pubkey` for investigation queries - `SignatureDrops` counter: New atomic counter in `DBStats`, logged in periodic stats output as `sig_drops=N` - Retention: `dropped_packets` pruned alongside metrics on the same `retention.metricsDays` schedule ### Server (`cmd/server/`) - `GET /api/dropped-packets` (API key required): Returns recent drops with optional `?observer=` and `?pubkey=` filters, `?limit=` (default 100, max 500) - `signatureDrops` field added to `/api/stats` response (count from `dropped_packets` table) ### Tests (8 new) \| Test \| What it verifies \| \|------\|-----------------\| \| `TestSigValidation_ValidAdvertStored` \| Valid advert passes validation and is stored \| \| `TestSigValidation_TamperedSignatureDropped` \| Tampered signature → dropped, recorded in `dropped_packets` with correct fields \| \| `TestSigValidation_TruncatedAppdataDropped` \| Truncated appdata invalidates signature → dropped \| \| `TestSigValidation_DisabledByConfig` \| `validateSignatures: false` skips validation, stores tampered packet \| \| `TestSigValidation_DropCounterIncrements` \| Counter increments correctly 
across multiple drops \| \| `TestSigValidation_LogContainsFields` \| `dropped_packets` row contains hash, reason, observer, pubkey, name \| \| `TestPruneDroppedPackets` \| Old entries pruned, recent entries retained \| \| `TestShouldValidateSignatures_Default` \| Config helper returns correct defaults \| ### Config example ```json { "validateSignatures": true } ``` Fixes #793 --------- Co-authored-by: you <you@example.com>	2026-04-18 11:39:13 -07:00
Joel Claw	fa3f623bd6	feat: add observer retention — remove stale observers after configurable days (#764 ) ## Summary Observers that stop actively sending data now get removed after a configurable retention period (default 14 days). Previously, observers remained in the `observers` table forever. This meant nodes that were once observers for an instance but are no longer connected (even if still active in the mesh elsewhere) would continue appearing in the observer list indefinitely. ## Key Design Decisions - Active data requirement: `last_seen` is only updated when the observer itself sends packets (via `stmtUpdateObserverLastSeen`). Being seen by another node does NOT update this field. So an observer must actively send data to stay listed. - Default: 14 days — observers not seen in 14 days are removed - `-1` = keep forever — for users who want observers to never be removed - `0` = use default (14 days) — same as not setting the field - Runs on startup + daily ticker — staggered 3 minutes after metrics prune to avoid DB contention ## Changes \| File \| Change \| \|------\|--------\| \| `cmd/ingestor/config.go` \| Add `ObserverDays` to `RetentionConfig`, add `ObserverDaysOrDefault()` \| \| `cmd/ingestor/db.go` \| Add `RemoveStaleObservers()` — deletes observers with `last_seen` before cutoff \| \| `cmd/ingestor/main.go` \| Wire up startup + daily ticker for observer retention \| \| `cmd/server/config.go` \| Add `ObserverDays` to `RetentionConfig`, add `ObserverDaysOrDefault()` \| \| `cmd/server/db.go` \| Add `RemoveStaleObservers()` (server-side, uses read-write connection) \| \| `cmd/server/main.go` \| Wire up startup + daily ticker, shutdown cleanup \| \| `cmd/server/routes.go` \| Admin prune API now also removes stale observers \| \| `config.example.json` \| Add `observerDays: 14` with documentation \| \| `cmd/ingestor/coverage_boost_test.go` \| 4 tests: basic removal, empty store, keep forever (-1), default (0→14) \| \| `cmd/server/config_test.go` \| 4 tests: 
`ObserverDaysOrDefault` edge cases \| ## Config Example ```json { "retention": { "nodeDays": 7, "observerDays": 14, "packetDays": 30, "_comment": "observerDays: -1 = keep forever, 0 = use default (14)" } } ``` ## Admin API The `/api/admin/prune` endpoint now also removes stale observers (using `observerDays` from config) and reports `observers_removed` in the response alongside `packets_deleted`. ## Test Plan - [x] `TestRemoveStaleObservers` — old observer removed, recent observer kept - [x] `TestRemoveStaleObserversNone` — empty store, no errors - [x] `TestRemoveStaleObserversKeepForever` — `-1` keeps even year-old observers - [x] `TestRemoveStaleObserversDefault` — `0` defaults to 14 days - [x] `TestObserverDaysOrDefault` (ingestor) — nil/zero/positive/keep-forever - [x] `TestObserverDaysOrDefault` (server) — nil/zero/positive/keep-forever - [x] Both binaries compile cleanly (`go build`) - [ ] Manual: verify observer count decreases after retention period on a live instance	2026-04-17 09:24:40 -07:00
Kpa-clawbot	0e286d85fd	fix: channel query performance - add channel_hash column, SQL-level filtering (#762) (#763) ## Problem Channel API endpoints scan the entire DB: 2.4s for the channel list, 30s for messages. ## Fix - Added `channel_hash` column to transmissions (populated on ingest, backfilled on startup) - `GetChannels()` is rewritten to GROUP BY channel_hash (one row per channel instead of scanning every packet) - `GetChannelMessages()` filters by channel_hash at SQL level with proper LIMIT/OFFSET - 60s cache for the channel list - Index: `idx_tx_channel_hash` for fast lookups Expected: 2.4s → <100ms for the list, 30s → <500ms for messages. Fixes #762 --------- Co-authored-by: you <you@example.com>	2026-04-16 00:09:36 -07:00

100 Commits