meshcore-analyzer

mirror of https://github.com/Kpa-clawbot/meshcore-analyzer.git synced 2026-06-07 19:51:43 +00:00

Author	SHA1	Message	Date
Kpa-clawbot	749fdc114f	feat(decoder+ui): close remaining P2 items from #1279 — payloadTypeNames, legend, TransportCodes, Feat1/2, RAW_CUSTOM, sensor docs (#1291 ) RED commit: `dc4c0800` — CI: https://github.com/Kpa-clawbot/CoreScope/actions?query=branch%3Afix%2Fissue-1279-p2 Closes the remaining six 🟢 P2 items in umbrella #1279 (PR #1280 shipped P0+P1, PR #1276 shipped ACK/RESPONSE/PATH legend rows). ### Item-by-item \| # \| Item \| Where \| Test \| \|---\|---\|---\|---\| \| 1 \| `payloadTypeNames` parity \| `cmd/server/store.go` \| `cmd/server/issue1279_p2_test.go::TestPayloadTypeNamesAll13` \| \| 2 \| Legend rows: Anon Req / Grp Data / Multipart / Control / Raw Custom \| `public/live.js` \| `test-issue-1279-legend-p2-e2e.js` (Playwright) \| \| 3 \| TransportCodes detail-row + `code1=` / `code2=` filter grammar \| `public/packets.js`, `public/packet-filter.js` \| `test-issue-1279-p2-code-filter.js` (6 cases) \| \| 4 \| Multibyte capability badge on node detail/list rows \| `public/nodes.js::renderNodeBadges` \| `n.hash_size >= 2` (observable Feat1/Feat2 proxy; firmware `AdvertDataHelpers.h:14-16`) \| \| 5 \| RAW_CUSTOM (0x0F) `{rawLength, firstByteTag}` decode + detail-row \| `cmd/server/decoder.go`, `cmd/ingestor/decoder.go`, `public/packets.js` \| `TestDecodeRawCustomExposesLengthAndTag` × 2 + updated `TestDecodePayloadRAWCustom` \| \| 6 \| Sensor advert telemetry firmware-derivation comments \| `cmd/ingestor/decoder.go:363-380` \| pure comments — exempt per AGENTS \| ### Firmware refs cited inline - `firmware/src/Packet.h:19-32` — PAYLOAD_TYPE_* constants - `firmware/src/Packet.h:46` — TransportCodes wire layout - `firmware/src/Mesh.cpp:577` — `createRawData` - `firmware/src/helpers/SensorMesh.{h,cpp}` — sensor advert telemetry derivation - `firmware/src/helpers/AdvertDataHelpers.h:14-16` — Feat1/Feat2 ### TDD Red `dc4c0800` proves the assertions gate behavior: - `payloadTypeNames` had only 12 entries (no 0x0F). 
- RAW_CUSTOM decoded as `UNKNOWN` with no envelope fields. Green `<HEAD>` makes both green; per-item tests included. ### Cross-stack note Cross-stack: justified — items 1/5 add decoder output fields; items 2/3/4/5 surface those fields in the UI in the same PR per #1279 acceptance. ### Out of scope Item 4 surfaces the observable multibyte capability via the persisted `hash_size` (Feat1/Feat2 wire bits are only on transient adverts and not stored per-node today); persisting raw Feat1/Feat2 per-node is left for a follow-up. Fixes #1279 --------- Co-authored-by: bot <bot@corescope>	2026-05-19 08:08:28 -07:00
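Item 5 above describes RAW_CUSTOM (0x0F) as exposing only `{rawLength, firstByteTag}`. A minimal sketch of that envelope follows. The Go type, the field names, and the `HasTag` flag are hypothetical, and treating `firstByteTag` as simply the first payload byte is an assumption rather than the decoder's confirmed behaviour.

```go
package main

import "fmt"

// rawCustomEnvelope is a hypothetical stand-in for the {rawLength, firstByteTag}
// fields named in the PR; the real decoder's types may differ.
type rawCustomEnvelope struct {
	RawLength    int
	FirstByteTag byte
	HasTag       bool // false when the payload is empty and no tag exists
}

// decodeRawCustom records the payload length and, when present, the first byte.
// It makes no attempt to interpret the application-defined body.
func decodeRawCustom(payload []byte) rawCustomEnvelope {
	env := rawCustomEnvelope{RawLength: len(payload)}
	if len(payload) > 0 {
		env.FirstByteTag = payload[0]
		env.HasTag = true
	}
	return env
}

func main() {
	env := decodeRawCustom([]byte{0xA5, 0x01, 0x02})
	fmt.Printf("rawLength=%d firstByteTag=0x%02X\n", env.RawLength, env.FirstByteTag)
}
```

Keeping the body opaque matches the PR's framing: the payload is application-defined, so the decoder reports its shape without interpreting it.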
Kpa-clawbot	1da2034341	refactor(db): move all writes from server to ingestor; server truly read-only (fixes #1283 ) (#1286 ) Red commit: `f6290b63` — CI run will appear at https://github.com/Kpa-clawbot/CoreScope/actions Fixes #1283. ## What Moves all four DB write operations out of `cmd/server/` into `cmd/ingestor/`, making the server truly read-only and eliminating the SQLITE_BUSY VACUUM bug at its root: the server can no longer race the ingestor for the write lock because the server has no write path. ## The four operations \| # \| Was in \| Now in \| \|---\|--------\|--------\| \| 1 \| `cmd/server/vacuum.go` (`checkAutoVacuum`, full VACUUM + `auto_vacuum=INCREMENTAL` migration) \| `cmd/ingestor/db.go` `Store.CheckAutoVacuum` (already existed; ingestor runs it at startup before the MQTT subscriber starts → no contention) \| \| 2 \| `cmd/server/db.go` `PruneOldPackets` (`DELETE FROM transmissions`) \| `cmd/ingestor/maintenance.go` `Store.PruneOldPackets` (new) + 24h ticker in `cmd/ingestor/main.go` \| \| 3 \| `cmd/server/db.go` `PruneOldMetrics` (`DELETE FROM observer_metrics`) \| `cmd/ingestor/db.go` `Store.PruneOldMetrics` (already existed) \| \| 4 \| `cmd/server/db.go` `RemoveStaleObservers` (`UPDATE observers SET inactive=1`) \| `cmd/ingestor/db.go` `Store.RemoveStaleObservers` (already existed) \| ## HTTP surface - Removed: `POST /api/admin/prune` (`handleAdminPrune`, route, openapi entry). Operators trigger an ad-hoc prune by restarting the ingestor. - Kept: `GET /api/backup` — uses `VACUUM INTO` which writes to a separate file, not the live DB; read-only-safe. ## Tests - `cmd/server/readonly_invariant_test.go` (RED gate) — reflect-asserts `PruneOldPackets`/`PruneOldMetrics`/`RemoveStaleObservers` are NOT methods on the server's `DB`. Fails on master, passes after this PR. - `cmd/ingestor/issue1283_test.go` — exercises `Store.PruneOldPackets` and the auto_vacuum=NONE → INCREMENTAL migration through `Store.CheckAutoVacuum` with `vacuumOnStartup=true`. 
## Why the bug is gone The SQLITE_BUSY VACUUM failure happened because supervisord launched both ingestor + server in one container; the ingestor took the write lock for INSERTs and the server's `checkAutoVacuum` then failed to acquire it within `busy_timeout=5000`. After this PR, only the ingestor ever opens a writable connection, and it runs `CheckAutoVacuum` before spawning the MQTT subscriber → no contention possible. ## Scope notes - `cachedRW()` still has three pre-existing callers in `cmd/server/` (`neighbor_persist.go`, `ensure_indexes.go`, `from_pubkey_migration.go`). These pre-date #1283 and are not in the issue's four-operation list. Leaving them for follow-up keeps this PR honest about scope; AGENTS.md documents the invariant so new write paths can't sneak in. - PII preflight reports false positives on the Go method name `requireAPIKey` in `routes.go` diff context — no real PII. - Server-side neighbor-edge prune (`PruneNeighborEdges`) intentionally left in place — out of scope of #1283. --------- Co-authored-by: MeshCore Bot <bot@meshcore.local>	2026-05-18 23:52:27 -07:00
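The read-only invariant test described above reflect-asserts that the three write methods are absent from the server's `DB` type. A rough sketch of that kind of check is below. The `DB` and `leakyDB` types and the `writeMethodsOn` helper are hypothetical stand-ins, not the contents of `readonly_invariant_test.go`.

```go
package main

import (
	"fmt"
	"reflect"
)

// DB stands in for the server's read-only handle (hypothetical shape).
type DB struct{}

// GetRecentPackets is an illustrative read path; it is allowed to exist.
func (db *DB) GetRecentPackets() []string { return nil }

// leakyDB shows what a regression would look like: a write method reappears.
type leakyDB struct{}

func (db *leakyDB) PruneOldPackets() {}

// forbiddenWriteMethods lists the operations the PR moved to the ingestor.
var forbiddenWriteMethods = []string{"PruneOldPackets", "PruneOldMetrics", "RemoveStaleObservers"}

// writeMethodsOn returns every forbidden method name found in v's method set.
func writeMethodsOn(v any) []string {
	t := reflect.TypeOf(v)
	var found []string
	for _, name := range forbiddenWriteMethods {
		if _, ok := t.MethodByName(name); ok {
			found = append(found, name)
		}
	}
	return found
}

func main() {
	fmt.Println("DB write methods:", writeMethodsOn(&DB{}))
	fmt.Println("leakyDB write methods:", writeMethodsOn(&leakyDB{}))
}
```

A check like this fails as soon as someone re-adds one of the named methods, which is what makes it usable as a RED gate. It does not catch writes issued through other paths such as the `cachedRW()` callers mentioned in the scope notes.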
Kpa-clawbot	e6c30e1a7e	feat(decoder): GRP_DATA + MULTIPART + advertRole fix + CONTROL flags (#1279 P0+P1) (#1280 ) Addresses the four P0+P1 firmware reconciliation gaps from the umbrella audit (issue #1279). RED commit: `0a4c084e` (asserts on stub returns; all 13 assertions fail). GREEN commit: `13867681`. ## What's in this PR ### P0 — silently dropped data - #1 GRP_DATA (0x06) decoder. Outer envelope is the same shape as GRP_TXT (`channel_hash(1)+MAC(2)+ciphertext`) per `firmware/src/helpers/BaseChatMesh.cpp:476,500`. Factored `decryptChannelBlock(...)` helper used by both 5 and 6. When a channel key matches, the inner is parsed per `firmware/src/helpers/BaseChatMesh.cpp:382-385` as `data_type(uint16 LE) + data_len(1) + blob(data_len)`. Surfaces `{channelHash, MAC, dataType, dataLen, decryptedBlob}` on decrypt or `{channelHash, MAC, encryptedData}` otherwise. Server-side decoder surfaces envelope only (no key store). - #2 MULTIPART (0x0A) decoder. Per `firmware/src/Mesh.cpp:289`, byte0 = `(remaining<<4) \| inner_type`. When `inner_type == PAYLOAD_TYPE_ACK (0x03)`, next 4 bytes are the LE ack_crc per `firmware/src/Mesh.cpp:292-307`. Surfaces `{remaining, innerType, innerTypeName, innerAckCrc \| innerPayload}`. ### P1 — mis-classified / opaque - #3 `advertRole()` raw-type fix. Per `firmware/src/helpers/AdvertDataHelpers.h:7-12`, ADV_TYPE_NONE = 0 and 5-15 are FUTURE. The previous boolean fallback collapsed both into `"companion"`, silently relabelling unknown/reserved types. New behaviour: type 0 → `none`, 1 → `companion`, 2-4 → `repeater`/`room`/`sensor`, 5-15 → `type-N`. `ValidateAdvert` accepts the new labels. - #4 CONTROL (0x0B) byte0 flags + length. Per `firmware/src/Mesh.cpp:69` + `createControlData` at `Mesh.cpp:609`, byte0 high-bit marks the zero-hop direct subset. Surfaces `{ctrlFlags, ctrlZeroHop, ctrlLength}`. 
### Drift fix - `cmd/server/store.go` `payloadTypeNames` now includes `6: GRP_DATA` and `10: MULTIPART` (previously omitted; canonical decoder map already had them). ## Lockstep & TDD Both `cmd/ingestor/decoder.go` and `cmd/server/decoder.go` updated in the same commits — same wire-vector tests live in both packages (`cmd/{ingestor,server}/issue1279_test.go`). Per-item RED→GREEN visible in `git log`. \| Item \| Tests \| RED proof \| \|---\|---\|---\| \| #1 GRP_DATA \| ingestor: NoKey + DecryptedInner; server: Envelope \| 6 assertions failed pre-impl \| \| #2 MULTIPART \| ingestor + server: Ack + NonAck \| 8 assertions failed pre-impl \| \| #3 advertRole \| ingestor + server: 7-row table \| 3 assertions failed pre-impl \| \| #4 CONTROL \| ingestor + server: ZeroHop + MultiHop \| 6 assertions failed pre-impl \| ## What's NOT in this PR The umbrella issue lists P2 items that ship in follow-up PRs: - Live + compare legend entries for the long tail of newly-named types (#1274 + others). - TransportCodes UI surface + filter grammar. - feat1/feat2 capability badges. - `payloadTypeNames` consolidation across server/ingestor (drift-prevention). Leave the umbrella open after this merges. Refs #1279 --------- Co-authored-by: OpenClaw Bot <bot@openclaw.local>	2026-05-18 23:19:27 -07:00
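The MULTIPART layout and the `advertRole()` mapping described above can be sketched as follows. This is only an illustration built from the PR text: the struct, the function names, and the error handling for truncated input are assumptions, and the real decoders in `cmd/ingestor/decoder.go` and `cmd/server/decoder.go` may differ.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const payloadTypeACK = 0x03 // PAYLOAD_TYPE_ACK, as cited in the PR

// multipart is a hypothetical result type for the MULTIPART (0x0A) envelope.
type multipart struct {
	Remaining    uint8
	InnerType    uint8
	InnerAckCrc  uint32 // meaningful only when HasAckCrc is true
	HasAckCrc    bool
	InnerPayload []byte
}

// decodeMultipart splits byte0 into (remaining<<4)|inner_type and, for an
// inner ACK, reads the next 4 bytes as a little-endian ack_crc.
func decodeMultipart(payload []byte) (multipart, error) {
	if len(payload) < 1 {
		return multipart{}, errors.New("multipart: empty payload")
	}
	m := multipart{Remaining: payload[0] >> 4, InnerType: payload[0] & 0x0F}
	rest := payload[1:]
	if m.InnerType == payloadTypeACK {
		if len(rest) < 4 {
			return multipart{}, errors.New("multipart: truncated ack_crc")
		}
		m.InnerAckCrc = binary.LittleEndian.Uint32(rest[:4])
		m.HasAckCrc = true
		return m, nil
	}
	m.InnerPayload = rest
	return m, nil
}

// advertRole maps the raw advert type to a label without collapsing
// unknown or reserved values into "companion".
func advertRole(advType uint8) string {
	switch advType {
	case 0:
		return "none"
	case 1:
		return "companion"
	case 2:
		return "repeater"
	case 3:
		return "room"
	case 4:
		return "sensor"
	default:
		return fmt.Sprintf("type-%d", advType)
	}
}

func main() {
	m, err := decodeMultipart([]byte{0x23, 0x78, 0x56, 0x34, 0x12})
	fmt.Println(m.Remaining, m.InnerType, m.InnerAckCrc, err)
	fmt.Println(advertRole(0), advertRole(3), advertRole(9))
}
```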
Kpa-clawbot	170f0ac66d	fix(#1212 ): MQTT per-attempt logging + stall watchdog — prevent silent reconnect-loop death (#1216 ) RED commit: `1cd25f7b` — CI (failing on assertion): https://github.com/Kpa-clawbot/CoreScope/actions?query=sha%3A1cd25f7b1bdd0091f689dd64ce1bfec6d031191f Fixes #1212 ## Root cause NOT that `AutoReconnect` was off — it was set; `MaxReconnectInterval=30s` was set (PR #949); a `SetReconnectingHandler` was wired. The defect was an observability gap: `SetReconnectingHandler` fires only INSIDE paho's reconnect goroutine. If that goroutine never iterates (status race after the recovered handler panic at 21:07:13, or an internal abort), operators see ONLY the `disconnected: pingresp not received` line and then total silence. They cannot distinguish "paho is patiently retrying" from "paho gave up and the goroutine is gone." That ambiguity is what turned a 30s blip into 6h of downtime. ## Changes ### `cmd/ingestor/main.go` — `SetConnectionAttemptHandler` Fires on every TCP/TLS dial — the initial `Connect()` AND every reconnect — independent of paho's internal reconnect-loop state. Logs: ``` MQTT [staging] connection attempt #1 to tcp://broker:1883 MQTT [staging] connection attempt #2 to tcp://broker:1883 ``` Per-source attempt counter via `atomic.AddInt64`. ### `cmd/ingestor/mqtt_watchdog.go` (new) — per-source stall watchdog Satisfies the watchdog acceptance criterion. Even when paho reports `connected`, if no MQTT messages have flowed for >5m, log a WARN line every 60s: ``` MQTT [staging] WATCHDOG: client reports connected to tcp://broker:1883 but no messages received for 7m30s (threshold 5m) — possible half-open socket or upstream stall ``` Catches half-open TCP and broker-accepted-but-not-forwarding scenarios that look "connected" to paho. Hot-path cost: one `atomic.StoreInt64` per inbound message. Watchdog scans the registry once a minute. 
### Tests (`cmd/ingestor/mqtt_reconnect_test.go`, new) - `TestBuildMQTTOpts_InstrumentsConnectionAttempt` — asserts `OnConnectAttempt` is wired in `buildMQTTOpts`. - `TestMQTTStallWatchdog_FiresOnSilentSource` — connected + 10m silent + 5m threshold → stall flagged. - `TestMQTTStallWatchdog_QuietWhenRecent` — recent message → no stall. - `TestMQTTStallWatchdog_QuietWhenDisconnected` — disconnected → no stall (paho's reconnect logging covers it). ## TDD - RED `1cd25f7b` — 2 assertion failures (compile OK, stub returns no-stall, `OnConnectAttempt` nil). - GREEN `2527be6f` — implementation; all ingestor tests pass. ## Out of scope - Slice-bounds decode panic (#1211, separate PR). - A full in-process MQTT broker integration test would require a new dep (mochi-mqtt) — the observability and watchdog behaviors are independently verifiable by the unit tests above, and the reconnect path itself is paho's responsibility (we already test it's configured via `mqtt_opts_test.go`). --------- Co-authored-by: bot <bot@example.com> Co-authored-by: OpenClaw Bot <bot@openclaw.local> Co-authored-by: corescope-bot <bot@corescope.local> Co-authored-by: openclaw-bot <openclaw-bot@users.noreply.github.com>	2026-05-15 22:46:29 -07:00
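The stall watchdog rule above (connected, but silent for longer than the threshold) reduces to a small predicate plus an atomically updated timestamp. The sketch below assumes hypothetical names (`sourceState`, `markMessage`, `isStalled`) and is not the code in `mqtt_watchdog.go`; only the 5-minute threshold and the one-atomic-store-per-message cost come from the PR text.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

const stallThreshold = 5 * time.Minute // threshold quoted in the PR

// sourceState tracks one MQTT source. Field and method names are hypothetical.
type sourceState struct {
	lastMsgUnixNano int64 // updated with one atomic store per inbound message
	connected       int32 // 1 while the client reports connected
}

// markMessage is the hot-path hook: a single atomic store.
func (s *sourceState) markMessage(now time.Time) {
	atomic.StoreInt64(&s.lastMsgUnixNano, now.UnixNano())
}

func (s *sourceState) setConnected(c bool) {
	var v int32
	if c {
		v = 1
	}
	atomic.StoreInt32(&s.connected, v)
}

// silence reports how long it has been since the last message.
func (s *sourceState) silence(now time.Time) time.Duration {
	return now.Sub(time.Unix(0, atomic.LoadInt64(&s.lastMsgUnixNano)))
}

// isStalled flags only sources that claim to be connected; a disconnected
// source is left to the reconnect logging.
func isStalled(connected bool, silence, threshold time.Duration) bool {
	return connected && silence > threshold
}

func (s *sourceState) stalled(now time.Time) bool {
	return isStalled(atomic.LoadInt32(&s.connected) == 1, s.silence(now), stallThreshold)
}

func main() {
	start := time.Now()
	s := &sourceState{}
	s.setConnected(true)
	s.markMessage(start)
	later := start.Add(7*time.Minute + 30*time.Second)
	if s.stalled(later) {
		fmt.Printf("WATCHDOG: no messages for %s (threshold %s)\n", s.silence(later), stallThreshold)
	}
}
```

In a real ingestor a ticker would call `stalled` for each registered source once a minute; the sketch passes the clock in explicitly so the rule can be tested without sleeping.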
Kpa-clawbot	85e97d2f37	fix(#1211 ): bounds-check path length to prevent slice [218:15] panic in MQTT decode (#1214 ) RED commit: `65d9f57b` (CI run will appear at https://github.com/Kpa-clawbot/CoreScope/actions after PR opens) Fixes #1211 ## Root cause `decodePath()` returns `bytesConsumed = hash_size * hash_count` where both come straight from the wire-supplied `pathByte` (upper 2 bits → `hash_size`, lower 6 bits → `hash_count`). Max claimable: 4 × 63 = 252 bytes. A malformed packet on the wire claimed `pathByte=0xF6` (hash_size=4, hash_count=54 → 216 path bytes) inside a 15-byte buffer. The inner hop-extraction loop in `decodePath` did break early on overflow — but `bytesConsumed` was still returned at face value (216). `DecodePacket` then did `offset += 216` (offset=218) and `payloadBuf := buf[offset:]` panicked with the prod-observed signature: ``` runtime error: slice bounds out of range [218:15] ``` The handler-level `defer/recover` at `cmd/ingestor/main.go:258-263` caught it, but the message was silently dropped with no usable diagnostic. ## Fix Add a `if offset > len(buf)` guard at BOTH decoder sites (same pattern, same panic potential): - `cmd/ingestor/decoder.go` — DecodePacket after decodePath - `cmd/server/decoder.go` — DecodePacket after decodePath Return a descriptive error citing the claimed length and pathByte hex so operators can reproduce. Also: `cmd/ingestor/main.go` decode-error log now includes `topic`, `observer`, and `rawHexLen` so future malformed packets are reproducible without needing to attach a debugger. ## Tests (TDD red → green) Both packages got two new tests: - `TestDecodePacketBoundsFromWire_Issue1211` — feeds the exact wire shape from the prod log (`pathByte=0xF6` inside a 15-byte buf). Asserts `DecodePacket` does NOT panic and returns an error. - `TestDecodePacketFuzzTruncated_Issue1211` — sweeps every `(header, pathByte)` combination with tails 0..19 bytes (≈1.3M inputs). Asserts zero panics. 
### Red commit proof On commit `65d9f57b` (RED), both tests fail with the panic: ``` === RUN TestDecodePacketBoundsFromWire_Issue1211 decoder_test.go:1996: DecodePacket panicked on malformed input: runtime error: slice bounds out of range [218:15] --- FAIL: TestDecodePacketBoundsFromWire_Issue1211 (0.00s) === RUN TestDecodePacketFuzzTruncated_Issue1211 decoder_test.go:2010: DecodePacket panicked during fuzz: runtime error: slice bounds out of range [3:2] --- FAIL: TestDecodePacketFuzzTruncated_Issue1211 (0.01s) ``` On commit `7a6ae52c` (GREEN), full suites pass: - `cmd/ingestor`: `ok 53.988s` - `cmd/server`: `ok 29.456s` ## Acceptance criteria - [x] Identify the slice op producing `[218:15]` — `payloadBuf := buf[offset:]` in `DecodePacket` (decoder.go), where `offset` had been advanced by an unchecked `bytesConsumed` from `decodePath()`. - [x] Bounds check added at the identified site(s) — both ingestor and server decoders. - [x] Test with crafted payload (length-field > remaining buffer) — `TestDecodePacketBoundsFromWire_Issue1211`. - [x] Log topic, observer ID, payload byte length on drop — updated `MQTT [%s] decode error` log line. - [x] Existing tests stay green — confirmed both packages. ## Out of scope Reconnect-after-disconnect (#1212) — handled by a separate subagent. This PR touches NO reconnect logic. --------- Co-authored-by: corescope-bot <bot@corescope.local> Co-authored-by: openclaw-bot <bot@openclaw.local> Co-authored-by: corescope-bot <bot@corescope>	2026-05-15 22:34:21 -07:00
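The arithmetic behind the `[218:15]` panic and the guard that prevents it can be sketched as below. Two things here are assumptions: that `hash_size` is the upper two bits plus one (inferred from the PR's `0xF6` → 4 × 54 example), and that the path byte sits at `buf[1]` after a one-byte header. The function names are hypothetical.

```go
package main

import "fmt"

// pathLen decodes the wire-supplied path byte.
// Assumption: hash_size = (upper 2 bits) + 1, hash_count = lower 6 bits.
func pathLen(pathByte byte) (hashSize, hashCount int) {
	return int(pathByte>>6) + 1, int(pathByte & 0x3F)
}

// payloadAfterPath returns the bytes following the path, or an error when the
// claimed path length runs past the end of the buffer.
// Assumption: a 1-byte header is followed by the path byte at buf[1].
func payloadAfterPath(buf []byte) ([]byte, error) {
	if len(buf) < 2 {
		return nil, fmt.Errorf("packet too short: %d bytes", len(buf))
	}
	pathByte := buf[1]
	hashSize, hashCount := pathLen(pathByte)
	claimed := hashSize * hashCount
	offset := 2 + claimed
	if offset > len(buf) { // the guard: never slice past the buffer
		return nil, fmt.Errorf("path claims %d bytes (pathByte=0x%02X) but only %d remain",
			claimed, pathByte, len(buf)-2)
	}
	return buf[offset:], nil
}

func main() {
	bad := make([]byte, 15)
	bad[1] = 0xF6 // claims 4 * 54 = 216 path bytes inside a 15-byte buffer
	_, err := payloadAfterPath(bad)
	fmt.Println("malformed:", err)

	good := []byte{0x00, 0x02, 0xAA, 0xBB, 0x10} // two 1-byte hops, then payload
	payload, err := payloadAfterPath(good)
	fmt.Println("well-formed:", payload, err)
}
```

With the guard in place the malformed packet yields a descriptive error instead of reaching `buf[218:]` on a 15-byte slice.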
Kpa-clawbot	f4cf2acbc0	perf: cancelled writes + ingestor I/O + threshold tests (#1120 follow-up) (#1167 ) Red commit: `e964ec9c46` (CI run: pending — workflow only triggers on PR open) Partial fix for #1120 — finishes the four follow-up items left open after PR #1123 (cancelled writes, ingestor I/O, threshold-flag tests, docs). ## What's done - `cancelledWriteBytesPerSec` — server `/proc/self/io` parser handles `cancelled_write_bytes`; `/api/perf/io` exposes the per-second rate; Perf page renders it next to Read/Write with ⚠️ when sustained >1 MB/s. - Ingestor `/proc/<pid>/io` — `cmd/ingestor/stats_file.go` samples its own `/proc/self/io` each tick and includes `procIO` in the snapshot. The server's `/api/perf/io` reads it and surfaces `.ingestor`. Frontend renders an `Ingestor process` Disk I/O block alongside the existing `server process` block (issue mockup: "Both ingestor and server"). - Threshold + anomaly tests — `test-perf-disk-io-1120.js` now asserts ⚠️ fires/suppresses on WAL>100MB, cache_hit<90%, and the backfill-rate-vs-tx-rate guard with the `tx_inserted >= 100` baseline floor. Drops the tautological `\|\| ... === false` short-circuits flagged in MINOR m4. - Docs (m8) — `config.example.json` adds `_comment_ingestorStats` (env var, default path, shared-tmp security note); `cmd/ingestor/README.md` adds `CORESCOPE_INGESTOR_STATS` to the env-var table plus a `Stats file` section. ## What's NOT done (deferred) m1 sync.Map → map+RWMutex, m2 perfIOMu rate caching, m3 negative cacheSize translation, m5 deterministic-write test, m7 ctx-aware shutdown — pure polish; will file a follow-up issue if the operator wants them tracked. ## TDD - Red: `e964ec9` — adds failing tests + stub field/handler shape (cancelled missing from struct, ingestor stub returns nil, ingestor procIO absent). - Green: `1240703` — wires up the parser case, ingestor sampler, frontend rendering, docs. 
E2E assertion added: test-perf-disk-io-1120.js:108 --------- Co-authored-by: clawbot <clawbot@users.noreply.github.com> Co-authored-by: Kpa-clawbot <bot@kpa-clawbot.local> Co-authored-by: Kpa-clawbot <bot@kpa-clawbot>	2026-05-08 16:29:23 -07:00
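The `cancelled_write_bytes` handling above amounts to one more case in a `/proc/self/io` parser plus a delta-rate calculation. A minimal sketch follows, parsing from a string so it runs anywhere. The `procIO` struct, the function names, and the sample values are illustrative assumptions; only the `key: value` line format and the field names come from the kernel's `/proc/<pid>/io` file.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sampleProcIO imitates the "key: value" layout of /proc/self/io.
// The numbers are made up for illustration.
const sampleProcIO = `rchar: 323934931
wchar: 323929600
syscr: 632687
syscw: 632675
read_bytes: 0
write_bytes: 323932160
cancelled_write_bytes: 4096
`

// procIO holds the counters this sketch cares about (hypothetical type).
type procIO struct {
	ReadBytes           uint64
	WriteBytes          uint64
	CancelledWriteBytes uint64
}

// parseProcIO extracts the byte counters, ignoring unknown or malformed lines.
func parseProcIO(text string) procIO {
	var s procIO
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		key, val, ok := strings.Cut(sc.Text(), ":")
		if !ok {
			continue
		}
		n, err := strconv.ParseUint(strings.TrimSpace(val), 10, 64)
		if err != nil {
			continue
		}
		switch strings.TrimSpace(key) {
		case "read_bytes":
			s.ReadBytes = n
		case "write_bytes":
			s.WriteBytes = n
		case "cancelled_write_bytes":
			s.CancelledWriteBytes = n
		}
	}
	return s
}

// ratePerSec converts two cumulative samples into a per-second rate.
// A counter that went backwards or a non-positive interval yields 0.
func ratePerSec(prev, cur uint64, elapsedSec float64) float64 {
	if elapsedSec <= 0 || cur < prev {
		return 0
	}
	return float64(cur-prev) / elapsedSec
}

func main() {
	s := parseProcIO(sampleProcIO)
	fmt.Printf("%+v\n", s)
	fmt.Println("cancelled B/s:", ratePerSec(s.CancelledWriteBytes, s.CancelledWriteBytes+2048, 2))
}
```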
Kpa-clawbot	fb744d895f	fix(#1143 ): structural pubkey attribution via from_pubkey column (#1152 ) Fixes #1143. ## Summary Replaces the structurally unsound `decoded_json LIKE '%pubkey%'` (and `OR LIKE '%name%'`) attribution path with an exact-match lookup on a dedicated, indexed `transmissions.from_pubkey` column. This closes both holes documented in #1143: - Hole 1 — same-name false positives via `OR LIKE '%name%'` - Hole 2a — adversarial spoofing: a malicious node names itself with another node's pubkey and gets attributed to the victim - Hole 2b — accidental false positive when any free-text field (path elements, channel names, message bodies) contains a 64-char hex substring matching a real pubkey - Perf — query now uses an index instead of a full-table scan against `LIKE '%substring%'` ## TDD Two-commit history shows red-then-green: \| Commit \| Status \| Purpose \| \|---\|---\|---\| \| `7f0f08e` \| RED — tests assertion-fail on master behaviour \| Adversarial fixtures + spec \| \| `59327db` \| GREEN — schema + ingestor + server + migration \| Implementation \| The red commit's test schema includes the new column so the file compiles, but the production code still uses LIKE — the assertions fail because the malicious / same-name / free-text rows are returned. The green commit changes the query plus adds the migration/ingest path. ## Changes ### Schema - new column `transmissions.from_pubkey TEXT` - new index `idx_transmissions_from_pubkey` ### Ingestor (`cmd/ingestor/`) - `PacketData.FromPubkey` populated from decoded ADVERT `pubKey` at write time. Cheap — already parsing `decoded_json`. Non-ADVERTs stay NULL. - `stmtInsertTransmission` writes the column. - Migration `from_pubkey_v1` ALTERs legacy DBs to add the column + index. - Bonus: rewrote the recipe in the gated one-shot `advert_count_unique_v1` migration to use `from_pubkey` (already marked done on existing DBs; kept correct for fresh installs). 
### Server (`cmd/server/`) - `ensureFromPubkeyColumn` mirrors the ingestor migration so the server can boot against a DB the ingestor has never touched (e2e fixture, fresh installs). - `backfillFromPubkeyAsync` runs after HTTP starts. Scans `WHERE from_pubkey IS NULL AND payload_type = 4` in 5000-row chunks with a 100ms yield between chunks. Cannot block boot even on prod-sized DBs (100K+ transmissions). Queries handle NULL gracefully (return empty for that pubkey, same as today's unknown-pubkey path). - All in-scope LIKE call sites switched to exact match: \| Site \| Before \| After \| \|---\|---\|---\| \| `buildPacketWhere` (was db.go:582) \| `decoded_json LIKE '%pubkey%'` \| `from_pubkey = ?` \| \| `buildTransmissionWhere` (was db.go:626) \| `t.decoded_json LIKE '%pubkey%'` \| `t.from_pubkey = ?` \| \| `GetRecentTransmissionsForNode` (was db.go:910) \| `LIKE '%pubkey%' OR LIKE '%name%'` \| `t.from_pubkey = ?` \| \| `QueryMultiNodePackets` (was db.go:1785) \| `decoded_json LIKE '%pubkey%' OR ...` \| `t.from_pubkey IN (?, ?, ...)` \| \| `advert_count_unique_v1` (was ingestor/db.go:257) \| `decoded_json LIKE '%' \\|\\| nodes.public_key \\|\\| '%'` \| `t.from_pubkey = nodes.public_key` \| `GetRecentTransmissionsForNode` signature simplifies: the `name` parameter is gone (it was only ever used for the legacy `OR LIKE '%name%'` fallback). Sole caller in `routes.go:1243` updated. ### Tests - `cmd/server/from_pubkey_attribution_test.go` — adversarial fixtures + Hole 1/2a/2b/QueryMultiNodePackets exact-match assertions, EXPLAIN QUERY PLAN index check, migration backfill correctness. - `cmd/ingestor/from_pubkey_test.go` — write-time correctness (BuildPacketData populates FromPubkey for ADVERT only; InsertTransmission persists it; non-ADVERTs stay NULL). - Existing test schemas (server v2, server v3, coverage) get the new column plus a SQLite trigger that auto-populates `from_pubkey` from `decoded_json` on ADVERT inserts. 
This means existing fixtures (which only seed `decoded_json`) keep attributing correctly without per-test edits.
- `seedTestData`'s ADVERTs explicitly set `from_pubkey`.

## Performance — index is used

```
$ EXPLAIN QUERY PLAN SELECT id FROM transmissions WHERE from_pubkey = ?
SEARCH transmissions USING INDEX idx_transmissions_from_pubkey (from_pubkey=?)
```

Asserted in `TestFromPubkeyIndexUsed`.

## Migration approach

- Sync at boot: `ALTER TABLE transmissions ADD COLUMN from_pubkey TEXT` is a metadata-only operation in SQLite — microseconds regardless of table size. `CREATE INDEX IF NOT EXISTS idx_transmissions_from_pubkey` is not metadata-only: it scans the table once. Empirically a few hundred ms on a 100K-row table; expect a few seconds on a 10M-row table (one-time cost, blocking boot during that window). Subsequent boots no-op via `IF NOT EXISTS`. If this boot delay becomes an operational concern at prod scale we can defer the `CREATE INDEX` to a goroutine — for now a few-second one-time delay is acceptable.
- Async: row-level backfill of legacy NULL ADVERTs (chunked 5000 / 100ms yield). On a 100K-ADVERT prod DB, this completes in seconds in the background; HTTP is fully available throughout.
- Safety: queries handle NULL gracefully — a node whose ADVERTs haven't backfilled yet returns empty, identical to today's behaviour for unknown pubkeys. No half-state regression.

## Out of scope (intentionally)

The free-text `LIKE` paths the issue explicitly leaves alone (e.g. user-typed packet search) are untouched. Only the pubkey-attribution sites get the column treatment.
## Cycle-3 review fixes \| Finding \| Status \| Commit \| \|---\|---\|---\| \| M1c — async-contract test was tautological (test's own `go`, not production's) \| Fixed \| `23ace71` (red) → `a05b50c` (green) \| \| m1c — package-global atomic resets unsafe under `t.Parallel()` \| Fixed (`// DO NOT t.Parallel` comment + `Reset()` helper) \| rolled into `23ace71` / `241ec69` \| \| m2c — `/api/healthz` read 3 atomics non-atomically (torn snapshot) \| Fixed (single RWMutex-guarded snapshot + race test) \| `241ec69` \| \| n3c.m1 — vestigial OR-scaffolding in `QueryMultiNodePackets` \| Fixed (cleanup) \| `5a53ceb` \| \| n3c.m2 — verify PR body language about `ALTER` vs `CREATE INDEX` \| Verified accurate (already corrected in cycle 2) \| (no change) \| \| n3c.m3 — `json.Unmarshal` per row in backfill → could use SQL `json_extract` \| Deferred as known followup — pure perf optimization (current per-row Unmarshal is correct, just slower); SQL rewrite would unwind the chunked-yield architecture and is non-trivial. Acceptable for one-time backfill at boot on legacy DBs. \| ### M1c implementation detail `startFromPubkeyBackfill(dbPath, chunkSize, yieldDuration)` is now the single production entry point used by `main.go`. It internally does `go backfillFromPubkeyAsync(...)`. The test calls `startFromPubkeyBackfill` (no `go` prefix) and asserts the dispatch returns within 50ms — so if anyone removes the `go` keyword inside the wrapper, the test fails. Manually verified: removing the `go` keyword causes `TestBackfillFromPubkey_DoesNotBlockBoot` to fail with "backfill dispatch took ~1s (>50ms): not async — would block boot." ### m2c implementation detail `fromPubkeyBackfillTotal/Processed/Done` are now plain `int64`/`bool` package globals guarded by a single `sync.RWMutex`. `fromPubkeyBackfillSnapshot()` returns all three under one RLock. 
`TestHealthzFromPubkeyBackfillConsistentSnapshot` races a writer (lock-step total/processed updates with periodic done flips) against 8 readers hammering `/api/healthz`, asserting `processed<=total` and `(done => processed==total)` on every response. Verified the test catches torn reads (manually injected a 3-RLock implementation; test failed within milliseconds with "processed>total" and "done=true but processed!=total" errors). --------- Co-authored-by: openclaw-bot <bot@openclaw.local> Co-authored-by: openclaw-bot <bot@openclaw.dev>	2026-05-06 23:50:44 -07:00
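The m2c fix above (three values read under one `RLock` instead of three separate reads) can be illustrated with a small race harness. The type and function names below are hypothetical and the writer pattern is simplified; it mirrors the invariants the PR's test asserts, `processed<=total` and `done => processed==total`, without involving `/api/healthz`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// backfillProgress guards all three fields with one RWMutex (hypothetical type).
type backfillProgress struct {
	mu        sync.RWMutex
	total     int64
	processed int64
	done      bool
}

// set updates all three fields in one critical section.
func (p *backfillProgress) set(total, processed int64, done bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.total, p.processed, p.done = total, processed, done
}

// snapshot returns all three fields under a single RLock, so a reader can
// never observe values from two different writes.
func (p *backfillProgress) snapshot() (total, processed int64, done bool) {
	p.mu.RLock()
	defer p.mu.RUnlock()
	return p.total, p.processed, p.done
}

// countTornReads races one writer against several readers and returns how
// many snapshots violated the invariants.
func countTornReads(iterations, readers int) int64 {
	var p backfillProgress
	var violations int64
	stop := make(chan struct{})
	var wg sync.WaitGroup
	for r := 0; r < readers; r++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-stop:
					return
				default:
				}
				total, processed, done := p.snapshot()
				if processed > total || (done && processed != total) {
					atomic.AddInt64(&violations, 1)
				}
			}
		}()
	}
	for i := int64(1); i <= int64(iterations); i++ {
		p.set(i, i-1, false) // in progress
		p.set(i, i, true)    // caught up
	}
	close(stop)
	wg.Wait()
	return atomic.LoadInt64(&violations)
}

func main() {
	fmt.Println("torn reads:", countTornReads(1000, 8))
}
```

Splitting `snapshot` into three independently locked reads would let a writer slip in between them, which is the torn-read failure the PR describes injecting by hand.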
Kpa-clawbot	5a5df5d92b	revert: group commit M1 (#1117 ) — starves MQTT, refs #1129 (#1130 ) ## Why Diagnostic on #1129 shows PR #1117 (group commit M1 for #1115) is fundamentally broken: it starves the MQTT goroutine via `gcMu` lock contention, causing pingresp disconnects and lost packets at modest ingest rates. ## Three structural defects 1. Lock held across `sql.Stmt.Exec` — every concurrent `InsertTransmission` blocks for the full SQLite write latency, not just the brief queue mutation. 2. Lock held across `tx.Commit` — the WAL fsync runs under `gcMu`, so any backlog blocks all ingest writers AND the flusher ticker, snowballing under load. 3. Single-conn DB (`MaxOpenConns=1`) — the flusher and the ingest path serialise on one connection, turning the lock into a global ingest stall. Net effect: at modest packet rates the MQTT client loop misses its own pingresp deadline, the broker drops the connection, and packets received during the stall are lost. ## What this PR removes - `Store.SetGroupCommit`, `Store.FlushGroupTx`, `Store.flushLocked`, `Store.GroupCommitMs` - `gcMu`, `activeTx`, `pendingRows`, `groupCommitMs`, `groupCommitMaxRows` Store fields - `groupCommitMs` / `groupCommitMaxRows` config fields and `GroupCommitMsOrDefault` / `GroupCommitMaxRowsOrDefault` accessors - The flusher goroutine in `cmd/ingestor/main.go` - `cmd/ingestor/group_commit_test.go` - The `if s.activeTx != nil { … pendingRows … }` branch in `InsertTransmission` — reverts to plain prepared-stmt usage ## What this PR keeps (merged after #1117) - #1119 `BackfillPathJSON` `path_json='[]'` fix - #1120/#1123 perf metrics endpoints — `WALCommits` counter retained - `GroupCommitFlushes` JSON field on `/api/perf/write-sources` is kept as always-0 for API stability (server `perf_io.go` references it as a string field name; no client breakage) - `DBStats.GroupCommitFlushes` atomic field is removed from the Go struct ## Tests `cd cmd/ingestor && go test ./... -run "Test"` → `ok` (47.8s). 
`cd cmd/server && go build ./...` → clean. ## #1115 stays open The group-commit idea is sound — batching observation INSERTs would meaningfully reduce WAL fsync rate. But it needs a redesign that does not hold a mutex across blocking SQLite calls. Suggested directions for a future M1: - Channel-fed writer goroutine (single owner of the tx, ingest path is non-blocking enqueue) - Per-batch DB handle so the flusher doesn't serialise the ingest connection - Bounded queue with backpressure rather than a shared lock Refs #1117 #1129	2026-05-05 19:02:43 -07:00
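One of the suggested directions above, a channel-fed writer goroutine with a bounded queue, might look roughly like the sketch below. Everything here is hypothetical: the `row` type, the `flush` callback standing in for a transaction commit, and the queue depth of 4. It is a shape to discuss, not a proposed implementation for #1115.

```go
package main

import (
	"fmt"
	"time"
)

// row is a placeholder for one observation to insert.
type row struct{ id int }

// runWriter is the single owner of the pending batch. Producers only send on
// the channel, so no mutex is ever held across a blocking flush.
func runWriter(in <-chan row, maxRows int, every time.Duration, flush func([]row)) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	batch := make([]row, 0, maxRows)
	emit := func() {
		if len(batch) == 0 {
			return
		}
		flush(batch) // in a real writer: one transaction and one commit
		batch = make([]row, 0, maxRows)
	}
	for {
		select {
		case r, ok := <-in:
			if !ok { // channel closed: flush what is left and exit
				emit()
				return
			}
			batch = append(batch, r)
			if len(batch) >= maxRows {
				emit()
			}
		case <-ticker.C:
			emit()
		}
	}
}

// batchSizes enqueues n rows and returns the size of every flushed batch.
func batchSizes(n, maxRows int) []int {
	in := make(chan row, 4) // bounded queue: senders block when it is full
	done := make(chan struct{})
	var sizes []int
	go func() {
		defer close(done)
		runWriter(in, maxRows, 50*time.Millisecond, func(b []row) {
			sizes = append(sizes, len(b))
		})
	}()
	for i := 0; i < n; i++ {
		in <- row{id: i}
	}
	close(in)
	<-done // sizes is only read after the writer has finished
	return sizes
}

func main() {
	fmt.Println("batch sizes:", batchSizes(7, 3))
}
```

The bounded channel gives backpressure without a shared lock: a slow flush delays producers only once the queue fills. Whether blocking the MQTT callback at that point is acceptable, or whether rows should be dropped or spilled instead, is a design question the sketch leaves open.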
Kpa-clawbot	74dffa2fb7	feat(perf): per-component disk I/O + write source metrics on Perf page (#1120) (#1123) ## Summary Implements per-component disk I/O + write source metrics on the Perf page so operators can self-diagnose write-volume anomalies (cf. the BackfillPathJSON loop debugged in #1119) without SSHing in to run iotop/fatrace. Partial fix for #1120 ## What's done (4/6 ACs) - ✅ `/api/perf/io` — server-process `/proc/self/io` delta rates (read/write bytes per sec, syscalls) - ✅ `/api/perf/sqlite` — WAL size, page count, page size, cache hit rate - ✅ `/api/perf/write-sources` — per-component counters from ingestor (tx/obs/upserts/backfill_*) - ✅ Frontend Perf page — three new sections with anomaly thresholds + per-second rate columns ## What's NOT done (deferred to follow-up) - ❌ `cancelledWriteBytesPerSec` field — issue #1120 lists this under server-process I/O ("writes the kernel discarded — interesting signal"); not exposed in this PR - ❌ Ingestor `/proc/<pid>/io` — issue #1120 says "Both ingestor and server"; only server-process I/O lands here. Adding ingestor I/O requires either a unix socket back to the server, or surfacing the ingestor pid through the stats file. Doable without changing the existing API shape. - ❌ Adaptive baselining — anomaly thresholds remain static (10×, 100 MB, 90%); steady-state baselining can come once we have enough deployed Perf-page telemetry Per AGENTS.md rule 34, this PR uses "Partial fix for #1120" rather than "Fixes #1120" so the issue stays open until the remaining ACs land. ## Backend Server (`cmd/server/perf_io.go`) - `GET /api/perf/io` — reads `/proc/self/io` and returns delta-rate `{readBytesPerSec, writeBytesPerSec, syscallsRead, syscallsWrite}` since last call (in-memory tracker, no allocation per sample). - `GET /api/perf/sqlite` — returns `{walSize, walSizeMB, pageCount, pageSize, cacheSize, cacheHitRate}`. 
`cacheHitRate` is proxied from the in-process row cache (closest available signal under the modernc sqlite driver). - `GET /api/perf/write-sources` — reads the ingestor's stats JSON file and returns a flat `{sources: {...}, sampleAt}` payload. Ingestor (`cmd/ingestor/`) - `DBStats` gains `WALCommits atomic.Int64` (incremented on every successful `tx.Commit()` and on every auto-commit `InsertTransmission` write) and `BackfillUpdates sync.Map` keyed by backfill name with `IncBackfill(name)` / `SnapshotBackfills()` helpers. - `BackfillPathJSONAsync` now increments `BackfillUpdates["path_json"]` per row write — the BackfillPathJSON-style infinite loop becomes immediately visible at `backfill_path_json` in the Write Sources table. - New `StartStatsFileWriter` publishes a JSON snapshot to `/tmp/corescope-ingestor-stats.json` (override via `CORESCOPE_INGESTOR_STATS`) every second using atomic tmp+rename. The tmp file is opened with `O_CREATE\|O_WRONLY\|O_TRUNC\|O_NOFOLLOW` mode `0o600` so a pre-planted symlink in a world-writable `/tmp` cannot redirect the write to an arbitrary file. ## Frontend (`public/perf.js`) Three new sections on the Perf page, all auto-refreshed via the existing 5s interval: - Disk I/O (server process) — read/write rates (formatted B/KB/MB-per-sec) + syscall counts. Write rate >10 MB/s flags ⚠️. - Write Sources — sorted table of per-component counters with a per-second rate column derived from snapshot deltas. Backfill rows show ⚠️ only when `tx_inserted >= 100` (meaningful baseline) AND the backfill's per-second rate exceeds 10× the live tx rate. Avoids the startup-spurious-alarm where cumulative-vs-cumulative was a tautology. - SQLite (WAL + Cache Hit) — WAL size (⚠️ when >100 MB), page count, page size, cache hit rate (⚠️ when <90%). ## Tests - Backend (`cmd/server/perf_io_test.go`) — `TestPerfIOEndpoint_ReturnsValidJSON`, `TestPerfSqliteEndpoint_ReturnsValidJSON`, `TestPerfWriteSourcesEndpoint_ReturnsSources` exercise the three new endpoints. 
Skips the `/proc/self/io` non-zero-rate assertion when `/proc` is unavailable. - Frontend (`test-perf-disk-io-1120.js`) — vm-sandbox runs `perf.js` with stubbed `fetch`, asserts the three new sections render with their headings + values. E2E assertion added: test-perf-disk-io-1120.js:91 ## TDD 1. Red commit (`21abd22`) — added the three handlers as no-op stubs returning empty values; tests fail on assertion mismatches (non-zero rate, `pageSize > 0`, headings present). 2. Green commit (`d8da54c`) — fills in the real `/proc/self/io` parser, PRAGMA queries, ingestor stats writer, and Perf page rendering. --------- Co-authored-by: corescope-bot <bot@corescope.local> Co-authored-by: Kpa-clawbot <kpa-clawbot@users.noreply.github.com>	2026-05-05 17:56:56 -07:00
Kpa-clawbot	76d89e6578	fix(ingestor): exclude path_json='[]' rows from backfill WHERE (#1119 ) (#1121 ) ## Summary `BackfillPathJSONAsync` re-selected observations whose `path_json` was already `'[]'`, rewrote them to `'[]'`, and looped forever. The `len(batch) == 0` exit condition was never reached, the migration marker was never recorded, and the ingestor sustained 2–3 MB/s WAL writes at idle (76% of CPU in `sqlite.Exec` per pprof). ## Fix Drop `'[]'` from the WHERE clause: ```diff WHERE o.raw_hex IS NOT NULL AND o.raw_hex != '' - AND (o.path_json IS NULL OR o.path_json = '' OR o.path_json = '[]') + AND (o.path_json IS NULL OR o.path_json = '') ``` `'[]'` is the "already attempted, no hops" sentinel (still written at line 994 of `cmd/ingestor/db.go` when `DecodePathFromRawHex` returns no hops). Excluding it from the WHERE lets the loop terminate after one full pass and the migration marker `backfill_path_json_from_raw_hex_v1` to be recorded. ## TDD - Red commit (`19f8004`): `TestBackfillPathJSONAsync_BracketRowsTerminate` — seeds 100 observations with `path_json='[]'` and a `raw_hex` that decodes to zero hops, asserts the migration marker is written within 5s. Fails on master with "backfill never recorded migration marker within 5s — infinite loop on path_json='[]' rows". - Green commit (`7019100`): WHERE-clause fix + updates `TestBackfillPathJsonFromRawHex` row 1 expectation (the pre-seeded `'[]'` row is now correctly skipped instead of being re-decoded). 
## Test results ``` ok github.com/corescope/ingestor 49.656s ``` ## Acceptance criteria from #1119 - [x] Backfill terminates within 1 polling cycle of having no progress to make - [x] Migration marker `backfill_path_json_from_raw_hex_v1` written after termination - [x] On restart, backfill recognizes migration done and exits immediately (existing behavior — the migration check at the top of `BackfillPathJSONAsync` was always correct; the bug was that the marker never got written) - [x] Test: seed DB with N observations all having `path_json = '[]'` → backfill runs once → no UPDATEs issued, migration marker written - [ ] Disk write rate on idle staging drops from 2–3 MB/s to <100 KB/s — to be verified by the user post-deploy Fixes #1119. --------- Co-authored-by: OpenClaw Bot <bot@openclaw.local>	2026-05-05 17:35:16 -07:00
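Why dropping `'[]'` from the WHERE clause lets the loop terminate can be shown with an in-memory stand-in for the observations table. This is only a sketch under stated assumptions: `obs`, `needsBackfill`, and `runBackfill` are hypothetical names, every selected row is assumed to decode to zero hops, and no SQLite is involved.

```go
package main

import "fmt"

// obs is an in-memory stand-in for one observations row.
type obs struct {
	rawHex   string
	pathJSON *string // nil models SQL NULL
}

// needsBackfill mirrors the fixed WHERE clause: NULL or '' only.
// Rows already holding the '[]' sentinel are never re-selected.
func needsBackfill(o obs) bool {
	return o.rawHex != "" && (o.pathJSON == nil || *o.pathJSON == "")
}

// runBackfill processes rows in batches until a pass selects nothing.
// For simplicity every selected row is assumed to decode to zero hops,
// so it is marked with the '[]' sentinel.
func runBackfill(rows []obs, batchSize, maxPasses int) (passes, updates int) {
	for passes < maxPasses {
		passes++
		n := 0
		for i := range rows {
			if n == batchSize {
				break
			}
			if !needsBackfill(rows[i]) {
				continue
			}
			sentinel := "[]"
			rows[i].pathJSON = &sentinel
			updates++
			n++
		}
		if n == 0 {
			return // empty batch: the migration marker would be written here
		}
	}
	return
}

func main() {
	sentinel := "[]"
	rows := make([]obs, 100)
	for i := range rows {
		rows[i] = obs{rawHex: "26022FF8", pathJSON: &sentinel}
	}
	passes, updates := runBackfill(rows, 1000, 50)
	fmt.Println(passes, updates) // prints 1 0
}
```

With the old predicate (which also matched `'[]'`), the same input would select all 100 rows on every pass and only stop at `maxPasses`.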
Kpa-clawbot	45f2607f75	perf(ingestor): group commit observation INSERTs by time window (M1, refs #1115 ) (#1117 ) ## Summary Implements M1 from #1115: batches observation/transmission INSERTs into a single SQLite `BEGIN/COMMIT` window instead of fsyncing per packet. At ~250 obs/sec this drops WAL fsync rate from ~20/s to ~1/s and eliminates the `obs-persist skipped` / `SQLITE_BUSY` log spam that the issue documents. This is a partial fix — it ships the group-commit mechanism. Acceptance items 6–7 (measured fsync rate / measured `obs-persist skipped` rate at staging steady-state) require post-deploy observation, and M2 (per-`tx_hash` observation buffering) is intentionally deferred. The issue stays open for the user to verify on staging. > Partial fix for #1115 — does not auto-close. Refs #1115. ## Mechanism - `Store` gains an active `sql.Tx`, `pendingRows` counter, `gcMu`, and the `groupCommitMs` / `groupCommitMaxRows` knobs. `SetGroupCommit(ms, maxRows)` enables the mode; `FlushGroupTx()` commits the in-flight tx. - `InsertTransmission` lazily opens a tx on the first call after each flush, then issues all writes through `tx.Stmt()` bindings of the existing prepared statements. With `MaxOpenConns(1)` the connection is already serialized; `gcMu` serializes group-commit state without contention. - A goroutine in `cmd/ingestor/main.go` calls `FlushGroupTx()` every `groupCommitMs` ms. `pendingRows >= groupCommitMaxRows` triggers an eager flush. `Close()` flushes before the WAL checkpoint so no rows are lost on graceful shutdown. - `groupCommitMs == 0` short-circuits to the legacy per-call auto-commit path (statements bound to `s.db`, no tx) — current behavior preserved byte-for-byte for operators who opt out. ## Config Two new optional fields (ingestor-only), both documented in `config.example.json`: \| Field \| Default \| Effect \| \|---\|---\|---\| \| `groupCommitMs` \| `1000` \| Flush window in ms. `0` disables batching (legacy per-packet auto-commit). 
\| \| `groupCommitMaxRows` \| `1000` \| Safety cap; when reached the queue flushes immediately to bound memory and the crash-loss window. \| No DB schema change. No required config change on upgrade. ## Tests (TDD red → green visible in commits) `cmd/ingestor/group_commit_test.go` — three assertions, written first as the red commit: - `TestGroupCommit_BatchesInsertsIntoOneTx` — 50 `InsertTransmission` calls inside a wide window produce 0 commits until `FlushGroupTx`, then exactly 1; all 50 rows visible after flush. (This is the spec's "50 observations → 1 SQLite write transaction" assertion.) - `TestGroupCommit_Disabled` — `groupCommitMs=0` keeps every insert immediately visible and `GroupCommitFlushes` never advances. (Spec's "groupCommitMs=0 reverts to per-packet behavior" assertion.) - `TestGroupCommit_MaxRowsForcesEarlyFlush` — cap=3, 7 inserts → 2 auto-flushes from the cap + 1 final manual flush = 3 total. Red commit: `e2b0370` (stubs `SetGroupCommit` / `FlushGroupTx` so the tests compile and fail on assertions, not import errors). Green commit: `73f3559`. Full ingestor suite (`go test ./...` in `cmd/ingestor`) stays green, ~49 s. ## Performance This PR is the perf change itself. Local micro-test (the new `TestGroupCommit_BatchesInsertsIntoOneTx`) shows the structural property: 50 inserts → 1 commit. The fsync-rate measurement called out in the M1 acceptance criteria (`~20/s → ~1/s` at 250 obs/sec) requires staging deployment to confirm — that's the remaining open item that keeps #1115 open after this merges. No hot-path regressions: when `groupCommitMs > 0` we acquire one mutex per insert (uncontended in the steady state — the connection was already single-threaded via `MaxOpenConns(1)`). When `groupCommitMs == 0` the code path is identical to before plus one nil-tx check. ## What this PR does NOT do (per spec) - Does not collapse "30 observations of one packet" into 1 row write — that's M2. 
- Does not eliminate dual-writer contention with `cmd/server`'s `resolved_path` writes. - Does not change observation ordering or live broadcast latency. --------- Co-authored-by: corescope-bot <bot@corescope.local>	2026-05-05 16:38:43 -07:00
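The group-commit bookkeeping described above (rows accumulate, a cap forces an eager flush, a manual flush commits the rest) can be sketched without a database. The `groupCommitter` type below is a hypothetical in-memory stand-in, not the `Store` implementation; the periodic timer goroutine is left out.

```go
package main

import (
	"fmt"
	"sync"
)

// groupCommitter buffers rows and "commits" them in batches.
// A slice stands in for the open SQLite transaction.
type groupCommitter struct {
	mu        sync.Mutex
	maxRows   int
	pending   []string
	committed []string
	flushes   int
}

// Insert queues a row and flushes eagerly once the cap is reached.
func (g *groupCommitter) Insert(row string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.pending = append(g.pending, row)
	if g.maxRows > 0 && len(g.pending) >= g.maxRows {
		g.flushLocked()
	}
}

// Flush commits whatever is pending; a timer would call this periodically.
func (g *groupCommitter) Flush() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.flushLocked()
}

// flushLocked must be called with g.mu held. Empty flushes are not counted.
func (g *groupCommitter) flushLocked() {
	if len(g.pending) == 0 {
		return
	}
	g.committed = append(g.committed, g.pending...)
	g.pending = g.pending[:0]
	g.flushes++
}

func main() {
	g := &groupCommitter{maxRows: 3}
	for i := 0; i < 7; i++ {
		g.Insert(fmt.Sprintf("row-%d", i))
	}
	g.Flush()
	fmt.Println(g.flushes, len(g.committed)) // prints 3 7
}
```

This reproduces the arithmetic of `TestGroupCommit_MaxRowsForcesEarlyFlush`: with a cap of 3 and 7 inserts there are two cap-triggered flushes plus one manual flush.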
Kpa-clawbot	136e1d23c8	feat(#730 ): foreign-advert detection — flag instead of silent drop (#1084 ) ## Summary Partial fix for #730 (M1 only — M2 frontend and M3 alerting deferred). Today the ingestor silently drops ADVERTs whose GPS lies outside the configured `geo_filter` polygon. That's the wrong default for an analytics tool — operators get zero visibility into bridged or leaked meshes. This PR makes the new default flag, don't drop: foreign adverts are stored, the node row is tagged `foreign_advert=1`, and the API surfaces `"foreign": true` so dashboards / map overlays can be built on top. ## Behavior \| Mode \| What happens to an ADVERT outside `geo_filter` \| \|---\|---\| \| (default) flag \| Stored, marked `foreign_advert=1`, exposed via API \| \| drop (legacy) \| Silently dropped (preserves old behavior for ops who want it) \| ## What's done (M1 — Backend) - ingestor stores foreign adverts instead of dropping - `nodes.foreign_advert` column added (migration) - `/api/nodes` and `/api/nodes/{pk}` expose `foreign: true` field - Config: `geofilter.action: "flag"\|"drop"` (default `flag`) - Tests + config docs ## What's NOT done (deferred to M2 + M3) - M2 — Frontend: Map overlay showing foreign adverts as distinct markers, foreign-advert filter on packets/nodes pages, dedicated foreign-advert dashboard - M3 — Alerting: Time-series detection of bridging events, alert when foreign advert rate spikes, identify bridge entry-point nodes Issue #730 remains open for M2 and M3. --------- Co-authored-by: corescope-bot <bot@corescope>	2026-05-05 01:58:52 -07:00
Kpa-clawbot	227f375b4a	test(ingestor): regression test for observer metadata persistence (#1044 ) (#1047 ) Adds end-to-end test proving that `extractObserverMeta` + `UpsertObserver` correctly stores model, firmware, battery_mv, noise_floor, uptime_secs from a real MQTT status payload. Test passes — confirms the code path works. #1044 was caused by upstream observers not including metadata fields in their status payloads (older `meshcoretomqtt` client versions), not a code bug. Closes #1044 Co-authored-by: meshcore-bot <bot@meshcore.local>	2026-05-05 06:18:47 +00:00
Kpa-clawbot	c9301fee9c	fix(ingestor): extract per-hop SNR for TRACE packets at ingest time (#1028 ) ## Problem PR #1007 added per-hop SNR extraction (`snrValues`) for TRACE packets to `cmd/server/decoder.go`. That code path is only hit by the on-demand re-decode endpoint (packet detail). The actual ingest pipeline runs `cmd/ingestor/decoder.go`, decodes the packet once, and persists `decoded_json` into SQLite. The server then serves `decoded_json` as-is for list/feed queries. Net effect: `snrValues` never appears in any production response, because the ingestor's decoder was never updated. Confirmed empirically: `strings /app/corescope-ingestor \| grep snrVal` returns nothing. ## Fix Port the SNR extraction logic from `cmd/server/decoder.go` (lines 410–422) into `cmd/ingestor/decoder.go`. For TRACE packets, the header path bytes are int8 SNR values in quarter-dB encoding; extract them into `payload.SNRValues` before `path.Hops` is overwritten with payload-derived hop IDs. Also adds the matching `SNRValues []float64` field to the ingestor's `Payload` struct so it serializes into `decoded_json`. ## TDD - Red commit (`6ae4c07`): adds `TestDecodeTraceExtractsSNRValues` + `SNRValues` field stub. Compiles, fails on assertion (`len(SNRValues)=0, want 2`). - Green commit (`4a4f3f3`): adds extraction loop. Test passes. Test packet: `26022FF8116A23A80000000001C0DE1000DEDE` - header `0x26` = TRACE + DIRECT - pathByte `0x02` = hash_size 1, hash_count 2 - header path `2F F8` → SNR `[int8(0x2F)/4, int8(0xF8)/4]` = `[11.75, -2.0]` ## Files - `cmd/ingestor/decoder.go` — `+16` (field + extraction) - `cmd/ingestor/decoder_test.go` — `+29` (red test) ## Out of scope - `cmd/server/decoder.go` is already correct (PR #1007). Untouched. - Backfill of historical `decoded_json` rows. New TRACE packets get SNR; old rows do not until re-decoded. --------- Co-authored-by: corescope-bot <bot@corescope.local>	2026-05-03 21:42:14 -07:00
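The quarter-dB conversion in the test packet above can be checked in a few lines. This is a sketch of the arithmetic only; `snrFromHeaderPath` is a hypothetical helper name, not the function in `cmd/ingestor/decoder.go`.

```go
package main

import "fmt"

// snrFromHeaderPath interprets each header path byte of a TRACE packet
// as a signed 8-bit value in quarter-dB units.
func snrFromHeaderPath(path []byte) []float64 {
	out := make([]float64, len(path))
	for i, b := range path {
		out[i] = float64(int8(b)) / 4.0
	}
	return out
}

func main() {
	// Header path bytes 2F F8 from the test packet in the commit message.
	fmt.Println(snrFromHeaderPath([]byte{0x2F, 0xF8})) // prints [11.75 -2]
}
```

The cast to `int8` matters: `0xF8` read as unsigned would give 62 dB instead of -2 dB.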
Kpa-clawbot	5e01de0d52	fix: make path_json backfill async to unblock MQTT startup (#1013) ## Summary P0 fix: The `path_json` backfill migration (PR #983) ran synchronously in `applySchema`, blocking the ingestor main goroutine. On staging (~502K observations), MQTT never connected — no new packets ingested for 15+ hours. ## Fix Extract the backfill into `BackfillPathJSONAsync()` — a method on `Store` that launches the work in a background goroutine. Called from `main.go` before MQTT connect, it runs concurrently without blocking subscription. Pattern: identical to `backfillResolvedPathsAsync` in the server (same lesson learned). ## Safety - Idempotent: checks `_migrations` table, skips if already recorded - Only touches `path_json IS NULL` rows — no conflict with live ingest (new observations get `path_json` at write time) - Panic-recovered goroutine with start/completion logging - Batched (1000 rows per iteration) to avoid memory pressure ## TDD - Red commit: `c6e1375` — test asserts `BackfillPathJSONAsync` method exists + OpenStore doesn't block - Green commit: `015871f` — implements async method, all tests pass ## Files changed - `cmd/ingestor/db.go` — removed sync backfill from `applySchema`, added `BackfillPathJSONAsync()` - `cmd/ingestor/main.go` — call `store.BackfillPathJSONAsync()` after store creation - `cmd/ingestor/db_test.go` — new async tests + updated existing test to use async API --------- Co-authored-by: you <you@example.com>	2026-05-03 11:29:56 -07:00
Kpa-clawbot	b0e4d2fa18	feat: add optional MQTT region field (#788 ) (#1012 ) ## Summary Add optional `region` field to MQTT source config and JSON payload, enabling publishers to explicitly provide region data without relying solely on topic path structure. ## Changes - `MQTTSource.Region` — new optional config field. When set, acts as default region for all messages from that source (useful when a broker serves a single region). - `MQTTPacketMessage.Region` — new optional JSON payload field. Publishers can include `"region": "PDX"` in their MQTT messages. - `PacketData.Region` — carries the resolved region through to storage. - Priority resolution: payload `region` > topic-derived region > source config `region` - Observer IATA is updated with the effective region on every packet. ## Config example ```json { "mqttSources": [ { "name": "cascadia", "broker": "tcp://cascadia-broker:1883", "topics": ["meshcore/#"], "region": "PDX" } ] } ``` ## Payload example ```json {"raw": "0a1b2c...", "SNR": 5.2, "region": "PDX"} ``` ## TDD - Red commit: `980304c` (tests fail at compile — fields don't exist) - Green commit: `4caf88b` (implementation, all tests pass) ## Unblocks - #804, #770, #730 (all depend on region being available on observations) Fixes #788 --------- Co-authored-by: you <you@example.com>	2026-05-03 11:21:54 -07:00
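The priority order stated above (payload `region` > topic-derived region > source config `region`) amounts to a first-non-empty lookup. A minimal sketch, where `resolveRegion` is a hypothetical helper name rather than the ingestor's actual function:

```go
package main

import "fmt"

// resolveRegion returns the first non-empty value in priority order:
// payload field, then topic-derived region, then the source's configured default.
func resolveRegion(payloadRegion, topicRegion, sourceRegion string) string {
	for _, r := range []string{payloadRegion, topicRegion, sourceRegion} {
		if r != "" {
			return r
		}
	}
	return ""
}

func main() {
	fmt.Println(resolveRegion("", "SEA", "PDX"))    // prints SEA
	fmt.Println(resolveRegion("PDX", "SEA", "YVR")) // prints PDX
}
```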
Kpa-clawbot	153308134e	feat: add global observer IATA whitelist config (#1001 ) ## Summary Adds a global `observerIATAWhitelist` config field that restricts which observer IATA regions are processed by the ingestor. ## Problem Operators running regional instances (e.g., Sweden) want to ensure only observers physically in their region contribute data. The existing per-source `iataFilter` only filters packet messages but still allows status messages through, meaning observers from other regions appear in the database. ## Solution New top-level config field `observerIATAWhitelist`: - When non-empty, all messages (status + packets) from observers outside the whitelist are silently dropped - Case-insensitive matching - Empty list = all regions allowed (fully backwards compatible) - Lazy O(1) lookup via cached uppercase set (same pattern as `observerBlacklist`) ### Config example ```json { "observerIATAWhitelist": ["ARN", "GOT"] } ``` ## TDD - Red commit: `f19c2b2` — tests for `ObserverIATAWhitelist` field and `IsObserverIATAAllowed` method (build fails) - Green commit: `782f516` — implementation + integration test ## Files changed - `cmd/ingestor/config.go` — new field, new method `IsObserverIATAAllowed` - `cmd/ingestor/main.go` — whitelist check in `handleMessage` before status processing - `cmd/ingestor/config_test.go` — unit tests for config parsing and matching - `cmd/ingestor/main_test.go` — integration test for handleMessage filtering Fixes #914 --------- Co-authored-by: you <you@example.com>	2026-05-03 10:23:35 -07:00
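The lazily built, case-insensitive lookup can be sketched as below. The field and method names follow the commit message, but the struct layout and the use of `sync.Once` are assumptions for illustration, not a copy of `cmd/ingestor/config.go`.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// config is a cut-down stand-in for the ingestor config.
type config struct {
	ObserverIATAWhitelist []string

	once sync.Once
	set  map[string]struct{} // uppercase IATA codes, built on first use
}

// IsObserverIATAAllowed reports whether an observer's IATA code passes the
// whitelist. An empty whitelist allows every region.
func (c *config) IsObserverIATAAllowed(iata string) bool {
	if len(c.ObserverIATAWhitelist) == 0 {
		return true
	}
	c.once.Do(func() {
		c.set = make(map[string]struct{}, len(c.ObserverIATAWhitelist))
		for _, code := range c.ObserverIATAWhitelist {
			c.set[strings.ToUpper(code)] = struct{}{}
		}
	})
	_, ok := c.set[strings.ToUpper(iata)]
	return ok
}

func main() {
	c := &config{ObserverIATAWhitelist: []string{"ARN", "GOT"}}
	fmt.Println(c.IsObserverIATAAllowed("arn"), c.IsObserverIATAAllowed("PDX")) // prints true false
}
```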
Kpa-clawbot	2e3a94b86d	chore(db): one-time cleanup of legacy packets with empty hash or null timestamp (closes #994 ) (#997 ) ## Summary One-time startup migration that deletes legacy packets (transmissions + observations) with empty hash or empty `first_seen` timestamp. This is the write-side cleanup following #993's read-side filter. ### Migration: `cleanup_legacy_null_hash_ts` - Checks `_migrations` table for marker - If not present: deletes observations referencing bad transmissions, then deletes the transmissions themselves - Logs count of deleted rows - Records marker for idempotency ### TDD - Red commit: `b1a24a1` — test asserts migration deletes bad rows (fails without implementation) - Green commit: `2b94522` — implements the migration, all tests pass Fixes #994 --------- Co-authored-by: you <you@example.com>	2026-05-02 23:15:20 -07:00
Kpa-clawbot	d43c95a4bb	fix(ingestor): warn when TRACE payload decode fails but observation stored (closes #889 ) (#992 ) ## Summary Closes #889. When a TRACE packet's payload is too short to decode (< 9 bytes), `decodeTrace` returns an error in `Payload.Error` but the observation is still stored with empty `Path.Hops`. Previously this was completely silent — no log, no anomaly flag, no indication the row is degraded. This fix populates `DecodedPacket.Anomaly` with the decode error message (e.g., `"TRACE payload decode failed: too short"`) so operators and downstream consumers can identify degraded observations. ## TDD Commit History 1. Red commit `04e0165` — failing test asserting `Anomaly` is set when TRACE payload decode fails 2. Green commit `d3e72d1` — 3-line fix in `decoder.go` line 601-603: check `payload.Error != ""` for TRACE packets and set anomaly ## What Changed `cmd/ingestor/decoder.go` (lines 601-603): Added a check before the existing TRACE path-parsing block. If `payload.Error` is non-empty for a TRACE packet, `anomaly` is set to `"TRACE payload decode failed: <error>"`. `cmd/ingestor/decoder_test.go`: Added `TestDecodeTracePayloadFailSetsAnomaly` — constructs a TRACE packet with a 4-byte payload (too short), asserts the packet is still returned (observation stored) and `Anomaly` is populated. ## Verification - `go build ./...` ✓ - `go test ./...` ✓ (all pass including new test) - Anti-tautology: reverting the fix causes the new test to fail (asserts `pkt.Anomaly == ""` → error) --------- Co-authored-by: you <you@example.com>	2026-05-02 20:34:27 -07:00
Kpa-clawbot	dd2f044f2b	fix: cache RW SQLite connection + dedup DBConfig (closes #921 ) (#982 ) Closes #921 ## Summary Follow-up to #920 (incremental auto-vacuum). Addresses both items from the adversarial review: ### 1. RW connection caching Previously, every call to `openRW(dbPath)` opened a new SQLite RW connection and closed it after use. This happened in: - `runIncrementalVacuum` (~4x/hour) - `PruneOldPackets`, `PruneOldMetrics`, `RemoveStaleObservers` - `buildAndPersistEdges`, `PruneNeighborEdges` - All neighbor persist operations Now a single `*sql.DB` handle (with `MaxOpenConns(1)`) is cached process-wide via `cachedRW(dbPath)`. The underlying connection pool manages serialization. The original `openRW()` function is retained for one-shot test usage. ### 2. DBConfig dedup `DBConfig` was defined identically in both `cmd/server/config.go` and `cmd/ingestor/config.go`. Extracted to `internal/dbconfig/` as a shared package; both binaries now use a type alias (`type DBConfig = dbconfig.DBConfig`). ## Tests added \| Test \| File \| \|------\|------\| \| `TestCachedRW_ReturnsSameHandle` \| `cmd/server/rw_cache_test.go` \| \| `TestCachedRW_100Calls_SingleConnection` \| `cmd/server/rw_cache_test.go` \| \| `TestGetIncrementalVacuumPages_Default` \| `internal/dbconfig/dbconfig_test.go` \| \| `TestGetIncrementalVacuumPages_Configured` \| `internal/dbconfig/dbconfig_test.go` \| ## Verification ``` ok github.com/corescope/server 20.069s ok github.com/corescope/ingestor 47.117s ok github.com/meshcore-analyzer/dbconfig 0.003s ``` Both binaries build cleanly. 100 sequential `cachedRW()` calls return the same handle with exactly 1 entry in the cache map. --------- Co-authored-by: you <you@example.com>	2026-05-02 20:15:30 -07:00
Kpa-clawbot	58484ad924	feat(ingestor): backfill observations.path_json from raw_hex (closes #888 ) (#983 ) ## Summary Adds an idempotent startup migration to the ingestor that backfills `observations.path_json` from per-observation `raw_hex` (added in #882). Approach: Server-side migration (Option B) — runs automatically at startup, chunked in batches of 1000, tracked via `_migrations` table. Chosen over a standalone script because: 1. Follows existing migration pattern (channel_hash, last_packet_at, etc.) 2. Zero operator action required — just deploy 3. Idempotent — safe to restart mid-migration (uncommitted rows get picked up next run) ## What it does - Selects observations where `raw_hex` is populated but `path_json` is NULL/empty/`[]` - Excludes TRACE packets (`payload_type = 9`) at the SQL level — their header bytes are SNR values, not hops - Decodes hops via `packetpath.DecodePathFromRawHex` (reuses existing helper) - Updates `path_json` with the decoded JSON array - Marks rows with undecoded/empty hops as `'[]'` to prevent infinite re-scanning - Records `backfill_path_json_from_raw_hex_v1` in `_migrations` when complete ## Safety - Never overwrites existing non-empty `path_json` — only fills where missing - Batched (1000 rows per iteration) — won't OOM on large DBs - TRACE-safe — excluded at query level per `packetpath.PathBytesAreHops` semantics ## Test `TestBackfillPathJsonFromRawHex` — creates synthetic observations with: - Empty path_json + valid raw_hex → verifies backfill populates correctly - NULL path_json → verifies backfill populates - Existing path_json → verifies NO overwrite - TRACE packet → verifies skip Anti-tautology: test asserts specific decoded values (`["AABB","CCDD"]`) from known raw_hex input, not just "something changed." Closes #888 Co-authored-by: you <you@example.com>	2026-05-02 19:52:43 -07:00
Kpa-clawbot	5aa8f795cd	feat(ingestor): per-source MQTT connect timeout (#931 ) (#977 ) ## Summary Per-source MQTT connect timeout, correctly targeting the `WaitTimeout` startup gate (#931). ## What changed - Added `connectTimeoutSec` field to `MQTTSource` struct (per-source, not global) — `config.go:24` - Added `ConnectTimeoutOrDefault()` helper returning configured value or 30 (default from #926) — `config.go:29` - Replaced hardcoded `WaitTimeout(30 * time.Second)` with `WaitTimeout(time.Duration(connectTimeout) * time.Second)` — `main.go:173` - Updated `config.example.json` with field at source level - Unit tests for default (30) and custom values ## Why this supersedes #976 PR #976 made paho's `SetConnectTimeout` (per-TCP-dial, was 10s) configurable via a global `mqttConnectTimeoutSeconds` field. Issue #931 explicitly references the 30s timeout — which is `WaitTimeout(30s)`, the startup gate from #926. It also requests per-source config, not global. This PR targets the correct timeout at the correct granularity. ## Live verification (Rule 18) Two sources pointed at unreachable brokers: - `fast` (`connectTimeoutSec: 5`): timed out in 5s ✅ - `default` (unset): timed out in 30s ✅ ``` 19:00:35 MQTT [fast] connect timeout: 5s 19:00:40 MQTT [fast] initial connection timed out — retrying in background 19:00:40 MQTT [default] connect timeout: 30s 19:01:10 MQTT [default] initial connection timed out — retrying in background ``` Closes #931 Supersedes #976 Co-authored-by: you <you@example.com>	2026-05-02 12:08:25 -07:00
Kpa-clawbot	1e7c187521	fix(ingestor): address review BLOCKERs from PR #926 (goroutine leak + guard semantics) [v2] (#974 ) ## fix(ingestor): address review BLOCKERs from PR #926 (goroutine leak + guard semantics) Supersedes #970. Rebased onto current master to resolve merge conflicts. ### Changes (same as #970) - BL1 (goroutine leak): Call `client.Disconnect(0)` on the error path after `Connect()` fails with `ConnectRetry=true`, preventing Paho's internal retry goroutines from leaking. - BL2 (guard semantics): Use `connectedCount == 0` instead of `len(clients) == 0` to detect zero-connected state, since timed-out clients are appended to the slice. - Tests: `TestBL1_GoroutineLeakOnHardFailure` and `TestBL2_ZeroConnectedFatals` covering both blockers. ### Context - Fixes blockers raised in review of #926 - Related: #910 (original hang bug) Co-authored-by: you <you@example.com>	2026-05-02 12:05:02 -07:00
Kpa-clawbot	3364eed303	feat: separate "Last Status Update" from "Last Packet Observation" for observers (v3 rebase) (#969 ) Rebased version of #968 (which was itself a rebase of #905) — resolves merge conflict with #906 (clock-skew UI) that landed on master. ## Conflict resolution `public/observers.js` — master (#906) added "Clock Offset" column to observer table; #968 split "Last Seen" into "Last Status" + "Last Packet" columns. Combined both: the table now has Status \| Name \| Region \| Last Status \| Last Packet \| Packets \| Packets/Hour \| Clock Offset \| Uptime. ## What this PR adds (unchanged from #968/#905) - `last_packet_at` column in observers DB table - Separate "Last Status Update" and "Last Packet Observation" display in observers list and detail page - Server-side migration to add the column automatically - Backfill heuristic for existing data - Tests for ingestor and server ## Verification - All Go tests pass (`cmd/server`, `cmd/ingestor`) - Frontend tests pass (`test-packets.js`, `test-hash-color.js`) - Built server, hit `/api/observers` — `last_packet_at` field present in JSON - Observer table header has all 9 columns including both Last Packet and Clock Offset ## Prior PRs - #905 — original (conflicts with master) - #968 — first rebase (conflicts after #906 landed) - This PR — second rebase, resolves #906 conflict Supersedes #968. Closes #905. --------- Co-authored-by: you <you@example.com>	2026-05-02 12:03:42 -07:00
efiten	d65122491e	fix(ingestor): unblock startup when one of multiple MQTT sources is unreachable (#926 ) ## Summary - With `ConnectRetry=true`, paho's `token.Wait()` only returns on success — it blocks forever for unreachable brokers, stalling the entire startup loop before any other source connects - Switches to `token.WaitTimeout(30s)`: on timeout the client is still tracked so `ConnectRetry` keeps retrying in background; `OnConnect` fires and subscribes when it eventually connects - Adds `TestMQTTConnectRetryTimeoutDoesNotBlock` to confirm `WaitTimeout` returns within deadline for unreachable brokers (regression guard for this exact failure mode) Fixes #910 ## Test plan - [x] Two MQTT sources configured, one unreachable: ingestor reaches `Running` status and ingests from the reachable source immediately on startup - [x] Unreachable source logs `initial connection timed out — retrying in background` and reconnects automatically when the broker comes back - [x] Single source, reachable: behaviour unchanged (`Running — 1 MQTT source(s) connected`) - [x] Single source, unreachable: `Running — 0 MQTT source(s) connected, 1 retrying in background`; ingestion starts once broker is available - [x] `go test ./...` passes (excluding pre-existing `TestOpenStoreInvalidPath` failure on master) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 11:31:51 -07:00
Kpa-clawbot	b3a9677c52	feat(ingestor + server): observerBlacklist config (#962 ) (#963 ) ## Summary Implements `observerBlacklist` config — mirrors the existing `nodeBlacklist` pattern for observers. Drop observers by pubkey at ingest, with defense-in-depth filtering on the server side. Closes #962 ## Changes ### Ingestor (`cmd/ingestor/`) - `config.go`: Added `ObserverBlacklist []string` field + `IsObserverBlacklisted()` method (case-insensitive, whitespace-trimmed) - `main.go`: Early return in `handleMessage` when `parts[2]` (observer ID from MQTT topic) matches blacklist — before status handling, before IATA filter. No UpsertObserver, no observations, no metrics insert. Log line: `observer <pubkey-short> blacklisted, dropping` ### Server (`cmd/server/`) - `config.go`: Same `ObserverBlacklist` field + `IsObserverBlacklisted()` with `sync.Once` cached set (same pattern as `nodeBlacklist`) - `routes.go`: Defense-in-depth filtering in `handleObservers` (skip blacklisted in list) and `handleObserverDetail` (404 for blacklisted ID) - `main.go`: Startup `softDeleteBlacklistedObservers()` marks matching rows `inactive=1` so historical data is hidden - `neighbor_persist.go`: `softDeleteBlacklistedObservers()` implementation ### Tests - `cmd/ingestor/observer_blacklist_test.go`: config method tests (case-insensitive, empty, nil) - `cmd/server/observer_blacklist_test.go`: config tests + HTTP handler tests (list excludes blacklisted, detail returns 404, no-blacklist passes all, concurrent safety) ## Config ```json { "observerBlacklist": [ "EE550DE547D7B94848A952C98F585881FCF946A128E72905E95517475F83CFB1" ] } ``` ## Verification (Rule 18 — actual server output) Before blacklist (no config): ``` Total: 31 DUBLIN in list: True ``` After blacklist (DUBLIN Observer pubkey in `observerBlacklist`): ``` [observer-blacklist] soft-deleted 1 blacklisted observer(s) Total: 30 DUBLIN in list: False ``` Detail endpoint for blacklisted observer returns 404. 
All existing tests pass (`go test ./...` for both server and ingestor). --------- Co-authored-by: you <you@example.com>	2026-05-01 23:11:27 -07:00
Kpa-clawbot	6345c6fb05	fix(ingestor): observability + bounded backoff for MQTT reconnect (#947 ) (#949 ) ## Summary Fixes #947 — MQTT ingestor silently stalls after `pingresp not received` disconnect due to paho's default 10-minute reconnect backoff and zero observability of reconnect attempts. ## Changes ### `cmd/ingestor/main.go` - Extract `buildMQTTOpts()` — encapsulates MQTT client option construction for testability - `SetMaxReconnectInterval(30s)` — bounds paho's default 10-minute exponential backoff (source: `options.go:137` in `paho.mqtt.golang@v1.5.0`) - `SetConnectTimeout(10s)` — prevents stuck connect attempts from blocking reconnect cycle - `SetWriteTimeout(10s)` — prevents stuck publish writes - `SetReconnectingHandler` — logs `MQTT [<tag>] reconnecting to <broker>` on every reconnect attempt, giving operators visibility into retry behavior - Enhanced `SetConnectionLostHandler` — now includes broker address in log line for multi-source disambiguation ### `cmd/ingestor/mqtt_opts_test.go` (new) - Tests verify `MaxReconnectInterval`, `ConnectTimeout`, `WriteTimeout` are set correctly - Tests verify credential and TLS configuration - Anti-tautology: tests fail if timing settings are removed from `buildMQTTOpts()` ## Operator impact After this change, a pingresp disconnect produces: ``` MQTT [staging] disconnected from tcp://broker:1883: pingresp not received, disconnecting MQTT [staging] reconnecting to tcp://broker:1883 MQTT [staging] reconnecting to tcp://broker:1883 MQTT [staging] connected to tcp://broker:1883 MQTT [staging] subscribed to meshcore/# ``` Max gap between disconnect and first reconnect attempt: ~30s (was up to 10 minutes). --------- Co-authored-by: you <you@example.com>	2026-05-01 00:01:07 -07:00
Kpa-clawbot	aeae7813bc	fix: enable SQLite incremental auto-vacuum so DB shrinks after retention (#919 ) (#920 ) Closes #919 ## Summary Enables SQLite incremental auto-vacuum so the database file actually shrinks after retention reaper deletes old data. Previously, `DELETE` operations freed pages internally but never returned disk space to the OS. ## Changes ### 1. Auto-vacuum on new databases - `PRAGMA auto_vacuum = INCREMENTAL` set via DSN pragma before `journal_mode(WAL)` in the ingestor's `OpenStoreWithInterval` - Must be set before any tables are created; DSN ordering ensures this ### 2. Post-reaper incremental vacuum - `PRAGMA incremental_vacuum(N)` runs after every retention reaper cycle (packets, metrics, observers, neighbor edges) - N defaults to 1024 pages, configurable via `db.incrementalVacuumPages` - Noop on `auto_vacuum=NONE` databases (safe before migration) - Added to both server and ingestor ### 3. Opt-in full VACUUM for existing databases - Startup check logs a clear warning if `auto_vacuum != INCREMENTAL` - `db.vacuumOnStartup: true` config triggers one-time `PRAGMA auto_vacuum = INCREMENTAL; VACUUM` - Logs start/end time for operator visibility ### 4. Documentation - `docs/user-guide/configuration.md`: retention section notes that lowering retention doesn't immediately shrink the DB - `docs/user-guide/database.md`: new guide covering WAL, auto-vacuum, migration, manual VACUUM ### 5. Tests - `TestNewDBHasIncrementalAutoVacuum` — fresh DB gets `auto_vacuum=2` - `TestExistingDBHasAutoVacuumNone` — old DB stays at `auto_vacuum=0` - `TestVacuumOnStartupMigratesDB` — full VACUUM sets `auto_vacuum=2` - `TestIncrementalVacuumReducesFreelist` — DELETE + vacuum shrinks freelist - `TestCheckAutoVacuumLogs` — handles both modes without panic - `TestConfigIncrementalVacuumPages` — config defaults and overrides ## Migration path for existing databases 1. On startup, CoreScope logs: `[db] auto_vacuum=NONE — DB needs one-time VACUUM...` 2. 
Set `db.vacuumOnStartup: true` in config.json 3. Restart — VACUUM runs (blocks startup, minutes on large DBs) 4. Remove `vacuumOnStartup` after migration ## Test results ``` ok github.com/corescope/server 19.448s ok github.com/corescope/ingestor 30.682s ``` --------- Co-authored-by: you <you@example.com>	2026-04-30 23:45:00 -07:00
Kpa-clawbot	56ec590bc4	fix(#886 ): derive path_json from raw_hex at ingest (#887 ) ## Problem Per-observation `path_json` disagrees with `raw_hex` path section for TRACE packets. Reproducer: packet `af081a2c41281b1e`, observer `lutin🏡` - `path_json`: `["67","33","D6","33","67"]` (5 hops — from TRACE payload) - `raw_hex` path section: `30 2D 0D 23` (4 bytes — SNR values in header) ## Root Cause `DecodePacket` correctly parses TRACE packets by replacing `path.Hops` with hop IDs from the payload's `pathData` field (the actual route). However, the header path bytes for TRACE packets contain SNR values (one per completed hop), not hop IDs. `BuildPacketData` used `decoded.Path.Hops` to build `path_json`, which for TRACE packets contained the payload-derived hops — not the header path bytes that `raw_hex` stores. This caused `path_json` and `raw_hex` to describe completely different paths. ## Fix - Added `DecodePathFromRawHex(rawHex)` — extracts header path hops directly from raw hex bytes, independent of any TRACE payload overwriting. - `BuildPacketData` now calls `DecodePathFromRawHex(msg.Raw)` instead of using `decoded.Path.Hops`, guaranteeing `path_json` always matches the `raw_hex` path section. ## Tests (8 new) `DecodePathFromRawHex` unit tests: - hash_size 1, 2, 3, 4 - zero-hop direct packets - transport route (4-byte transport codes before path) `BuildPacketData` integration tests: - TRACE packet: asserts path_json matches raw_hex header path (not payload hops) - Non-TRACE packet: asserts path_json matches raw_hex header path All existing tests continue to pass (`go test ./...` for both ingestor and server). Fixes #886 --------- Co-authored-by: you <you@example.com>	2026-04-21 21:13:58 -07:00
Kpa-clawbot	a605518d6d	fix(#881 ): per-observation raw_hex — each observer sees different bytes on air (#882 ) ## Problem Each MeshCore observer receives a physically distinct over-the-air byte sequence for the same transmission (different path bytes, flags/hops remaining). The `observations` table stored only `path_json` per observer — all observations pointed at one `transmissions.raw_hex`. This prevented the hex pane from updating when switching observations in the packet detail view. ## Changes \| Layer \| Change \| \|-------\|--------\| \| Schema \| `ALTER TABLE observations ADD COLUMN raw_hex TEXT` (nullable). Migration: `observations_raw_hex_v1` \| \| Ingestor \| `stmtInsertObservation` now stores per-observer `raw_hex` from MQTT payload \| \| View \| `packets_v` uses `COALESCE(o.raw_hex, t.raw_hex)` — backward compatible with NULL historical rows \| \| Server \| `enrichObs` prefers `obs.RawHex` when non-empty, falls back to `tx.RawHex` \| \| Frontend \| No changes — `effectivePkt.raw_hex` already flows through `renderDetail` \| ## Tests - Ingestor: `TestPerObservationRawHex` — two MQTT packets for same hash from different observers → both stored with distinct raw_hex - Server: `TestPerObservationRawHexEnrich` — enrichObs returns per-obs raw_hex when present, tx fallback when NULL - E2E: Playwright assertion in `test-e2e-playwright.js` for hex pane update on observation switch E2E assertion added: `test-e2e-playwright.js:1794` ## Scope - Historical observations: raw_hex stays NULL, UI falls back to transmission raw_hex silently - No backfill, no path_json reconstruction, no frontend changes Closes #881 --------- Co-authored-by: you <you@example.com>	2026-04-21 13:45:29 -07:00
efiten	cad1f11073	fix: bypass IATA filter for status messages, fill SNR on duplicate obs (#694 ) (#802 ) ## Problems Two independent ingestor bugs identified in #694: ### 1. IATA filter drops status messages from out-of-region observers The IATA filter ran at the top of `handleMessage()` before any message-type discrimination. Status messages carrying observer metadata (`noise_floor`, battery, airtime) from observers outside the configured IATA regions were silently discarded before `UpsertObserver()` and `InsertMetrics()` ran. Impact: Observers running `meshcoretomqtt/1.0.8.0` in BFL and LAX — the only client versions that include `noise_floor` in status messages — had their health data dropped entirely on prod instances filtering to SJC. Fix: Moved the IATA filter to the packet path only (after the `parts[3] == "status"` branch). Status messages now always populate observer health data regardless of configured region filter. ### 2. `INSERT OR IGNORE` discards SNR/RSSI on late arrival When the same `(transmission_id, observer_idx, path_json)` observation arrived twice — first without RF fields, then with — `INSERT OR IGNORE` silently discarded the SNR/RSSI from the second arrival. Fix: Changed to `ON CONFLICT(...) DO UPDATE SET snr = COALESCE(excluded.snr, snr), rssi = ..., score = ...`. A later arrival with SNR fills in a `NULL`; a later arrival without SNR does not overwrite an existing value. ## Tests - `TestIATAFilterDoesNotDropStatusMessages` — verifies BFL status message is processed when IATA filter includes only SJC, and that BFL packet is still filtered - `TestInsertObservationSNRFillIn` — verifies SNR fills in on second arrival, and is not overwritten by a subsequent null arrival ## Related Partially addresses #694 (upstream client issue of missing SNR in packet messages is out of scope) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 22:16:01 -07:00
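A minimal sketch of the ordering in the first fix, where the status branch is checked before the region filter. `shouldProcess` is a hypothetical helper, not the real `handleMessage()`. The topic layout `meshcore/<region>/<observer_id>/<kind>` follows the status topic quoted elsewhere in this log, while the `packets` kind name and the rule that an empty filter allows everything are assumptions.

```go
package main

import "strings"

// shouldProcess reports whether a message on topic should be handled,
// given the set of allowed IATA region codes.
func shouldProcess(topic string, allowed map[string]bool) bool {
	parts := strings.Split(topic, "/")
	if len(parts) < 4 {
		return false
	}
	// Status messages carry observer health data and bypass the filter.
	if parts[3] == "status" {
		return true
	}
	// Assumption: no configured regions means no filtering.
	if len(allowed) == 0 {
		return true
	}
	// Only the packet path is subject to the region filter.
	return allowed[strings.ToUpper(parts[1])]
}
```

With a filter of only SJC, a BFL status message passes while a BFL packet is still dropped, which mirrors what `TestIATAFilterDoesNotDropStatusMessages` is described as checking.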
Kpa-clawbot	a8e1cea683	fix: use payload type bits only in content hash (not full header byte) (#787 ) ## Problem The firmware computes packet content hash as: ``` SHA256(payload_type_byte + [path_len for TRACE] + payload) ``` Where `payload_type_byte = (header >> 2) & 0x0F` — just the payload type bits (2-5). CoreScope was using the full header byte in its hash computation, which includes route type bits (0-1) and version bits (6-7). This meant the same logical packet produced different content hashes depending on route type — breaking dedup and packet lookup. Firmware reference: `Packet.cpp::calculatePacketHash()` uses `getPayloadType()` which returns `(header >> PH_TYPE_SHIFT) & PH_TYPE_MASK`. ## Fix - Extract only payload type bits: `payloadType := (headerByte >> 2) & 0x0F` - Include `path_len` byte in hash for TRACE packets (matching firmware behavior) - Applied to both `cmd/server/decoder.go` and `cmd/ingestor/decoder.go` ## Tests Added - Route type independence: Same payload with FLOOD vs DIRECT route types produces identical hash - TRACE path_len inclusion: TRACE packets with different `path_len` produce different hashes - Firmware compatibility: Hash output matches manual computation of firmware algorithm ## Migration Impact Existing packets in the DB have content hashes computed with the old (incorrect) formula. Options: 1. Recompute hashes via migration (recommended for clean state) 2. Dual lookup — check both old and new hash on queries (backward compat) 3. Accept the break — old hashes become stale, new packets get correct hashes Recommend option 1 (migration) as a follow-up. The volume of affected packets depends on how many distinct route types were seen for the same logical packet. Fixes #786 --------- Co-authored-by: you <you@example.com>	2026-04-18 11:52:22 -07:00
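The corrected hash input can be sketched as follows. `contentHash` is a hypothetical name, the numeric value `0x09` for TRACE is an assumption, and any truncation of the digest that the real code may apply is left out.

```go
package main

import "crypto/sha256"

// payloadTypeTrace is the assumed numeric payload type for TRACE.
const payloadTypeTrace = 0x09

// contentHash hashes the payload type bits (header bits 2-5), the
// path_len byte for TRACE packets only, and then the payload.
func contentHash(header, pathLen byte, payload []byte) [32]byte {
	payloadType := (header >> 2) & 0x0F // drop route and version bits
	h := sha256.New()
	h.Write([]byte{payloadType})
	if payloadType == payloadTypeTrace {
		h.Write([]byte{pathLen})
	}
	h.Write(payload)
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}
```

Headers `0x11` and `0x12` share payload type 4 and differ only in the route bits, so they produce the same digest for the same payload; that is the route-type independence the new tests assert.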
Kpa-clawbot	bf674ebfa2	feat: validate advert signatures on ingest, reject corrupt packets (#794 ) ## Summary Validates ed25519 signatures on ADVERT packets during MQTT ingest. Packets with invalid signatures are rejected before storage, preventing corrupt/truncated adverts from polluting the database. ## Changes ### Ingestor (`cmd/ingestor/`) - Signature validation on ingest: After decoding an ADVERT, checks `SignatureValid` from the decoder. Invalid signatures → packet dropped, never stored. - Config flag: `validateSignatures` (default `true`). Set to `false` to disable validation for backward compatibility with existing installs. - `dropped_packets` table: New SQLite table recording every rejected packet with full attribution: - `hash`, `raw_hex`, `reason`, `observer_id`, `observer_name`, `node_pubkey`, `node_name`, `dropped_at` - Indexed on `observer_id` and `node_pubkey` for investigation queries - `SignatureDrops` counter: New atomic counter in `DBStats`, logged in periodic stats output as `sig_drops=N` - Retention: `dropped_packets` pruned alongside metrics on the same `retention.metricsDays` schedule ### Server (`cmd/server/`) - `GET /api/dropped-packets` (API key required): Returns recent drops with optional `?observer=` and `?pubkey=` filters, `?limit=` (default 100, max 500) - `signatureDrops` field added to `/api/stats` response (count from `dropped_packets` table) ### Tests (8 new) \| Test \| What it verifies \| \|------\|-----------------\| \| `TestSigValidation_ValidAdvertStored` \| Valid advert passes validation and is stored \| \| `TestSigValidation_TamperedSignatureDropped` \| Tampered signature → dropped, recorded in `dropped_packets` with correct fields \| \| `TestSigValidation_TruncatedAppdataDropped` \| Truncated appdata invalidates signature → dropped \| \| `TestSigValidation_DisabledByConfig` \| `validateSignatures: false` skips validation, stores tampered packet \| \| `TestSigValidation_DropCounterIncrements` \| Counter increments correctly 
across multiple drops \| \| `TestSigValidation_LogContainsFields` \| `dropped_packets` row contains hash, reason, observer, pubkey, name \| \| `TestPruneDroppedPackets` \| Old entries pruned, recent entries retained \| \| `TestShouldValidateSignatures_Default` \| Config helper returns correct defaults \| ### Config example ```json { "validateSignatures": true } ``` Fixes #793 --------- Co-authored-by: you <you@example.com>	2026-04-18 11:39:13 -07:00
Joel Claw	fa3f623bd6	feat: add observer retention — remove stale observers after configurable days (#764 ) ## Summary Observers that stop actively sending data now get removed after a configurable retention period (default 14 days). Previously, observers remained in the `observers` table forever. This meant nodes that were once observers for an instance but are no longer connected (even if still active in the mesh elsewhere) would continue appearing in the observer list indefinitely. ## Key Design Decisions - Active data requirement: `last_seen` is only updated when the observer itself sends packets (via `stmtUpdateObserverLastSeen`). Being seen by another node does NOT update this field. So an observer must actively send data to stay listed. - Default: 14 days — observers not seen in 14 days are removed - `-1` = keep forever — for users who want observers to never be removed - `0` = use default (14 days) — same as not setting the field - Runs on startup + daily ticker — staggered 3 minutes after metrics prune to avoid DB contention ## Changes \| File \| Change \| \|------\|--------\| \| `cmd/ingestor/config.go` \| Add `ObserverDays` to `RetentionConfig`, add `ObserverDaysOrDefault()` \| \| `cmd/ingestor/db.go` \| Add `RemoveStaleObservers()` — deletes observers with `last_seen` before cutoff \| \| `cmd/ingestor/main.go` \| Wire up startup + daily ticker for observer retention \| \| `cmd/server/config.go` \| Add `ObserverDays` to `RetentionConfig`, add `ObserverDaysOrDefault()` \| \| `cmd/server/db.go` \| Add `RemoveStaleObservers()` (server-side, uses read-write connection) \| \| `cmd/server/main.go` \| Wire up startup + daily ticker, shutdown cleanup \| \| `cmd/server/routes.go` \| Admin prune API now also removes stale observers \| \| `config.example.json` \| Add `observerDays: 14` with documentation \| \| `cmd/ingestor/coverage_boost_test.go` \| 4 tests: basic removal, empty store, keep forever (-1), default (0→14) \| \| `cmd/server/config_test.go` \| 4 tests: 
`ObserverDaysOrDefault` edge cases \| ## Config Example ```json { "retention": { "nodeDays": 7, "observerDays": 14, "packetDays": 30, "_comment": "observerDays: -1 = keep forever, 0 = use default (14)" } } ``` ## Admin API The `/api/admin/prune` endpoint now also removes stale observers (using `observerDays` from config) and reports `observers_removed` in the response alongside `packets_deleted`. ## Test Plan - [x] `TestRemoveStaleObservers` — old observer removed, recent observer kept - [x] `TestRemoveStaleObserversNone` — empty store, no errors - [x] `TestRemoveStaleObserversKeepForever` — `-1` keeps even year-old observers - [x] `TestRemoveStaleObserversDefault` — `0` defaults to 14 days - [x] `TestObserverDaysOrDefault` (ingestor) — nil/zero/positive/keep-forever - [x] `TestObserverDaysOrDefault` (server) — nil/zero/positive/keep-forever - [x] Both binaries compile cleanly (`go build`) - [ ] Manual: verify observer count decreases after retention period on a live instance	2026-04-17 09:24:40 -07:00
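The `observerDays` semantics (nil or `0` uses the default, `-1` keeps observers forever) can be sketched like this. The struct and function names mirror those in the change table, but the bodies are a guess at the behaviour described, not the committed code; `shouldPrune` is a hypothetical helper.

```go
package main

// defaultObserverDays is the retention default named in the commit.
const defaultObserverDays = 14

// RetentionConfig holds only the field relevant to this sketch.
type RetentionConfig struct {
	ObserverDays int `json:"observerDays"`
}

// observerDaysOrDefault maps a missing or zero setting to the default
// and passes every other value through, including -1 (keep forever).
func observerDaysOrDefault(r *RetentionConfig) int {
	if r == nil || r.ObserverDays == 0 {
		return defaultObserverDays
	}
	return r.ObserverDays
}

// shouldPrune reports whether stale-observer removal should run at all.
// Assumption: any non-positive resolved value disables removal.
func shouldPrune(days int) bool {
	return days > 0
}
```

Keeping the "keep forever" decision in one small predicate means the startup run, the daily ticker, and the admin prune endpoint can all share the same rule.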
Kpa-clawbot	0e286d85fd	fix: channel query performance — add channel_hash column, SQL-level filtering (#762 ) (#763 ) ## Problem Channel API endpoints scan entire DB — 2.4s for channel list, 30s for messages. ## Fix - Added `channel_hash` column to transmissions (populated on ingest, backfilled on startup) - `GetChannels()` rewrites to GROUP BY channel_hash (one row per channel vs scanning every packet) - `GetChannelMessages()` filters by channel_hash at SQL level with proper LIMIT/OFFSET - 60s cache for channel list - Index: `idx_tx_channel_hash` for fast lookups Expected: 2.4s → <100ms for list, 30s → <500ms for messages. Fixes #762 --------- Co-authored-by: you <you@example.com>	2026-04-16 00:09:36 -07:00
Kpa-clawbot	14367488e2	fix: TRACE path_json uses path_sz from flags byte, not header hash_size (#732 ) ## Summary TRACE packets encode their route hash size in the flags byte (`flags & 0x03`), not the header path byte. The decoder was using `path.HashSize` from the header, which could be wrong or zero for direct-route TRACEs, producing incorrect hop counts in `path_json`. ## Protocol Note Per firmware, TRACE packets are always direct-routed (route_type 2 = DIRECT, or 3 = TRANSPORT_DIRECT). FLOOD-routed TRACEs (route_type 1) are anomalous — firmware explicitly rejects TRACE via flood. The decoder handles these gracefully without crashing. ## Changes `cmd/server/decoder.go` and `cmd/ingestor/decoder.go`: - Read `pathSz` from TRACE flags byte: `(traceFlags & 0x03) + 1` (0→1byte, 1→2byte, 2→3byte) - Use `pathSz` instead of `path.HashSize` for splitting TRACE payload path data into hops - Update `path.HashSize` to reflect the actual TRACE path size - Added `HopsCompleted` field to ingestor `Path` struct for parity with server - Updated comments to clarify TRACE is always direct-routed per firmware `cmd/server/decoder_test.go` — 5 new tests: - `TraceFlags1_TwoBytePathSz`: flags=1 → 2-byte hashes via DIRECT route - `TraceFlags2_ThreeBytePathSz`: flags=2 → 3-byte hashes via DIRECT route - `TracePathSzUnevenPayload`: payload not evenly divisible by path_sz - `TraceTransportDirect`: route_type=3 with transport codes + TRACE path parsing - `TraceFloodRouteGraceful`: anomalous FLOOD+TRACE handled without crash All existing TRACE tests (flags=0, 1-byte hashes) continue to pass. Fixes #731 --------- Co-authored-by: you <you@example.com>	2026-04-13 08:20:09 -07:00
copelaje	922ebe54e7	BYOP Advert signature validation (#686) For BYOP mode in the packet analyzer, perform signature validation on advert packets and display whether validation succeeded. This was added because we observed many corrupted advert packets that signature validation would easily detect. For now this MR only adds the status in BYOP mode, so the impact on the application is minimal and there is no performance penalty from running these checks on all packets. Going forward it probably makes sense to run these checks on all advert packets so that corrupt packets can be ignored in several contexts (node lists, for example). Let me know what you think and I can adjust as needed. --------- Co-authored-by: you <you@example.com>	2026-04-12 04:02:17 +00:00
Kpa-clawbot	2e1a4a2e0d	fix: handle companion nodes without adverts in My Mesh health cards (#696 ) ## Summary Fixes #665 — companion nodes claimed in "My Mesh" showed "Could not load data" because they never sent an advert, so they had no `nodes` table entry, causing the health API to return 404. ## Three-Layer Fix ### 1. API Resilience (`cmd/server/store.go`) `GetNodeHealth()` now falls back to building a partial response from the in-memory packet store when `GetNodeByPubkey()` returns nil. Returns a synthetic node stub (`role: "unknown"`, `name: "Unknown"`) with whatever stats exist from packets, instead of returning nil → 404. ### 2. Ingestor Cleanup (`cmd/ingestor/main.go`) Removed phantom sender node creation that used `"sender-" + name` as the pubkey. Channel messages don't carry the sender's real pubkey, so these synthetic entries were unreachable from the claiming/health flow — they just polluted the nodes table with unmatchable keys. ### 3. Frontend UX (`public/home.js`) The catch block in `loadMyNodes()` now distinguishes 404 (node not in DB yet) from other errors: - 404: Shows 📡 "Waiting for first advert — this node has been seen in channel messages but hasn't advertised yet" - Other errors: Shows ❓ "Could not load data" (unchanged) ## Tests - Added `TestNodeHealthPartialFromPackets` — verifies a node with packets but no DB entry returns 200 with synthetic node stub and stats - Updated `TestHandleMessageChannelMessage` — verifies channel messages no longer create phantom sender nodes - All existing tests pass (`cmd/server`, `cmd/ingestor`) Co-authored-by: you <you@example.com>	2026-04-09 20:03:52 -07:00
efiten	34e7366d7c	test: add RouteTransportDirect zero-hop cases to ingestor decoder tests (#684 ) ## Summary Closes the symmetry gap flagged as a nit in PR #653 review: > The ingestor decoder tests omit `RouteTransportDirect` zero-hop tests — only the server decoder has those. Since the logic is identical, this is not a blocker, but adding them would make the test suites symmetric. - Adds `TestZeroHopTransportDirectHashSize` — `pathByte=0x00`, expects `HashSize=0` - Adds `TestZeroHopTransportDirectHashSizeWithNonZeroUpperBits` — `pathByte=0xC0` (hash_size bits set, hash_count=0), expects `HashSize=0` Both mirror the equivalent tests already present in `cmd/server/decoder_test.go`. ## Test plan - [ ] `cd cmd/ingestor && go test -run TestZeroHopTransportDirect -v` → both new tests pass - [ ] `cd cmd/ingestor && go test ./...` → no regressions 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 17:36:34 -07:00
efiten	144e98bcdf	fix: hide hash size for zero-hop direct adverts (#649 ) (#653 ) ## Fix: Zero-hop DIRECT packets report bogus hash_size Closes #649 ### Problem When a DIRECT packet has zero hops (pathByte lower 6 bits = 0), the generic `hash_size = (pathByte >> 6) + 1` formula produces a bogus value (1-4) instead of 0/unknown. This causes incorrect hash size displays and analytics for zero-hop direct adverts. ### Solution Frontend (JS): - `packets.js` and `nodes.js` now check `(pathByte & 0x3F) === 0` to detect zero-hop packets and suppress bogus hash_size display. Backend (Go): - Both `cmd/server/decoder.go` and `cmd/ingestor/decoder.go` reset `HashSize=0` for DIRECT packets where `pathByte & 0x3F == 0` (hash_count is zero). - TRACE packets are excluded since they use hashSize to parse hop data from the payload. - The condition uses `pathByte & 0x3F == 0` (not `pathByte == 0x00`) to correctly handle the case where hash_size bits are non-zero but hash_count is zero — matching the JS frontend approach. ### Testing Backend: - Added 4 tests each in `cmd/server/decoder_test.go` and `cmd/ingestor/decoder_test.go`: - DIRECT + pathByte 0x00 → HashSize=0 ✅ - DIRECT + pathByte 0x40 (hash_size bits set, hash_count=0) → HashSize=0 ✅ - Non-DIRECT + pathByte 0x00 → HashSize=1 (unchanged) ✅ - DIRECT + pathByte 0x01 (1 hop) → HashSize=1 (unchanged) ✅ - All existing tests pass (`go test ./...` in both cmd/server and cmd/ingestor) Frontend: - Verified hash size display is suppressed for zero-hop direct adverts --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: you <you@example.com>	2026-04-07 19:39:15 -07:00
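The backend rule can be sketched as one small function. `hashSize` is a hypothetical name; the route type values 2 and 3 come from a later commit in this log, and `0x09` for TRACE is an assumption.

```go
package main

const (
	routeDirect          = 2
	routeTransportDirect = 3
	payloadTrace         = 0x09 // assumed numeric value for TRACE
)

// hashSize returns the hash size encoded in the path byte, or 0 when a
// direct-routed, non-TRACE packet carries zero hops and the size bits
// are therefore meaningless.
func hashSize(routeType, payloadType, pathByte byte) int {
	size := int(pathByte>>6) + 1
	direct := routeType == routeDirect || routeType == routeTransportDirect
	if direct && payloadType != payloadTrace && pathByte&0x3F == 0 {
		return 0
	}
	return size
}
```

Masking with `0x3F` rather than comparing against `0x00` is what catches path bytes such as `0x40` or `0xC0`, where the size bits are set but the hop count is zero.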
Kpa-clawbot	a068e3e086	feat: zero-config defaults + deployment docs (M3-M4, #610 ) (#631 ) ## Zero-Config Defaults + Deployment Docs Make CoreScope start with zero configuration — no `config.json` required. The ingestor falls back to sensible defaults (local MQTT broker, standard topics, default DB path) when no config file exists. ### What changed `cmd/ingestor/config.go` — `LoadConfig` no longer errors on missing config file. Instead it logs a message and uses defaults. If no MQTT sources are configured (from file or env), defaults to `mqtt://localhost:1883` with `meshcore/#` topic. `cmd/ingestor/main.go` — Removed redundant "no MQTT sources" fatal (now handled in config layer). Improved the "no connections established" fatal with actionable hints. `README.md` — Replaced "Docker (Recommended)" section with a one-command quickstart using the pre-built image. No build step, no config file, just `docker run`. `docs/deployment.md` — New comprehensive deployment guide covering Docker, Compose, config reference, MQTT setup, TLS/HTTPS, monitoring, backup, and troubleshooting. ### Zero-config flow ``` docker run -d -p 80:80 -v corescope-data:/app/data ghcr.io/kpa-clawbot/corescope:latest ``` 1. No config.json found → defaults used, log message printed 2. No MQTT sources → defaults to `mqtt://localhost:1883` 3. Internal Mosquitto broker already running in container → connection succeeds 4. Dashboard shows empty, ready for packets ### Review fixes (commit `13b89bb`) - Removed `DISABLE_CADDY` references from all docs — this env var was never implemented in the entrypoint - Fixed `/api/stats` example in deployment guide — showed nonexistent fields (`mqttConnected`, `uptimeSeconds`, `activeNodes`) - Improved MQTT connection failure message with actionable troubleshooting hints Closes #610 --------- Co-authored-by: you <you@example.com>	2026-04-05 15:04:49 -07:00
Kpa-clawbot	232770a858	feat(rf-health): M2 — airtime, error rate, battery charts with delta computation (#605 ) ## M2: Airtime + Channel Quality + Battery Charts Implements M2 of #600 — server-side delta computation and three new charts in the RF Health detail view. ### Backend Changes Delta computation for cumulative counters (`tx_air_secs`, `rx_air_secs`, `recv_errors`): - Computes per-interval deltas between consecutive samples - Reboot handling: detects counter reset (current < previous), skips that delta, records reboot timestamp - Gap handling: if time between samples > 2× interval, inserts null (no interpolation) - Returns `tx_airtime_pct` and `rx_airtime_pct` as percentages (delta_secs / interval_secs × 100) - Returns `recv_error_rate` as delta_errors / (delta_recv + delta_errors) × 100 `resolution` query param on `/api/observers/{id}/metrics`: - `5m` (default) — raw samples - `1h` — hourly aggregates (GROUP BY hour with AVG/MAX) - `1d` — daily aggregates Schema additions: - `packets_sent` and `packets_recv` columns added to `observer_metrics` (migration) - Ingestor parses these fields from MQTT stats messages API response now includes: - `tx_airtime_pct`, `rx_airtime_pct`, `recv_error_rate` (computed deltas) - `reboots` array with timestamps of detected reboots - `is_reboot_sample` flag on affected samples ### Frontend Changes Three new charts in the RF Health detail view, stacked vertically below noise floor: 1. Airtime chart — TX (red) + RX (blue) as separate SVG lines, Y-axis 0-100%, direct labels at endpoints 2. Error Rate chart — `recv_error_rate` line, shown only when data exists 3. 
Battery chart — voltage line with 3.3V low reference, shown only when battery_mv > 0 All charts: - Share X-axis and time range (aligned vertically) - Reboot markers as vertical hairlines spanning all charts - Direct labels on data (no legends) - Resolution auto-selected: `1h` for 7d/30d ranges - Charts hidden when no data exists ### Tests - `TestComputeDeltas`: normal deltas, reboot detection, gap detection - `TestGetObserverMetricsResolution`: 5m/1h/1d downsampling verification - Updated `TestGetObserverMetrics` for new API signature --------- Co-authored-by: you <you@example.com>	2026-04-04 23:17:17 -07:00
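The delta rules for cumulative counters (skip on counter reset, null on long gaps, percentage otherwise) can be sketched as follows. The type and function names are hypothetical, only the TX airtime counter is modelled, and dividing by the actual elapsed time between samples rather than the nominal interval is an assumption.

```go
package main

// sample is one stored reading of a cumulative counter.
type sample struct {
	ts        int64   // unix seconds
	txAirSecs float64 // cumulative TX airtime in seconds
}

// point is one computed output row; txPct is nil for gaps and reboots.
type point struct {
	ts       int64
	txPct    *float64
	isReboot bool
}

// computeDeltas turns consecutive cumulative samples into airtime
// percentages, one output point per sample after the first.
func computeDeltas(samples []sample, intervalSecs int64) []point {
	var out []point
	for i := 1; i < len(samples); i++ {
		prev, cur := samples[i-1], samples[i]
		p := point{ts: cur.ts}
		elapsed := cur.ts - prev.ts
		switch {
		case cur.txAirSecs < prev.txAirSecs:
			// Counter went backwards: treat as a reboot, skip the delta.
			p.isReboot = true
		case elapsed > 2*intervalSecs:
			// Gap longer than twice the interval: leave the value null.
		case elapsed > 0:
			pct := (cur.txAirSecs - prev.txAirSecs) * 100 / float64(elapsed)
			p.txPct = &pct
		}
		out = append(out, p)
	}
	return out
}
```

Using a pointer for the percentage keeps "no value" distinct from a genuine 0% reading, which matters when the chart must show a break rather than a dip.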
Kpa-clawbot	6f35d4d417	feat: RF Health Dashboard M1 — observer metrics + small multiples grid (#604 ) ## RF Health Dashboard — M1: Observer Metrics Storage, API & Small Multiples Grid Implements M1 of #600. ### What this does Adds a complete RF health monitoring pipeline: MQTT stats ingestion → SQLite storage → REST API → interactive dashboard with small multiples grid. ### Backend Changes Ingestor (`cmd/ingestor/`) - New `observer_metrics` table via migration system (`_migrations` pattern) - Parse `tx_air_secs`, `rx_air_secs`, `recv_errors` from MQTT status messages (same pattern as existing `noise_floor` and `battery_mv`) - `INSERT OR REPLACE` with timestamps rounded to nearest 5-min interval boundary (using ingestor wall clock, not observer timestamps) - Missing fields stored as NULLs — partial data is always better than no data - Configurable retention pruning: `retention.metricsDays` (default 30), runs on startup + every 24h Server (`cmd/server/`) - `GET /api/observers/{id}/metrics?since=...&until=...` — per-observer time-series data - `GET /api/observers/metrics/summary?window=24h` — fleet summary with current NF, avg/max NF, sample count - `parseWindowDuration()` supports `1h`, `24h`, `3d`, `7d`, `30d` etc. 
- Server-side metrics retention pruning (same config, staggered 2min after packet prune) ### Frontend Changes RF Health tab (`public/analytics.js`, `public/style.css`) - Small multiples grid showing all observers simultaneously — anomalies pop out visually - Per-observer cell: name, current NF value, battery voltage, sparkline, avg/max stats - NF status coloring: warning (amber) at ≥-100 dBm, critical (red) at ≥-85 dBm — text color only, no background fills - Click any cell → expanded detail view with full noise floor line chart - Reference lines with direct text labels (`-100 warning`, `-85 critical`) — not color bands - Min/max points labeled directly on the chart - Time range selector: preset buttons (1h/3h/6h/12h/24h/3d/7d/30d) + custom from/to datetime picker - Deep linking: `#/analytics?tab=rf-health&observer=...&range=...` - All charts use SVG, matching existing analytics.js patterns - Responsive: 3-4 columns on desktop, 1 on mobile ### Design Decisions (from spec) - Labels directly on data, not in legends - Reference lines with text labels, not color bands - Small multiples grid, not card+accordion (Tufte: instant visual fleet comparison) - Ingestor wall clock for all timestamps (observer clocks may drift) ### Tests Added Ingestor tests: - `TestRoundToInterval` — 5 cases for rounding to 5-min boundaries - `TestInsertMetrics` — basic insertion with all fields - `TestInsertMetricsIdempotent` — INSERT OR REPLACE deduplication - `TestInsertMetricsNullFields` — partial data with NULLs - `TestPruneOldMetrics` — retention pruning - `TestExtractObserverMetaNewFields` — parsing tx_air_secs, rx_air_secs, recv_errors Server tests: - `TestGetObserverMetrics` — time-series query with since/until filters, NULL handling - `TestGetMetricsSummary` — fleet summary aggregation - `TestObserverMetricsAPIEndpoints` — DB query verification - `TestMetricsAPIEndpoints` — HTTP endpoint response shape - `TestParseWindowDuration` — duration parsing for h/d formats ### Test Results ``` 
cd cmd/ingestor && go test ./... → PASS (26s) cd cmd/server && go test ./... → PASS (5s) ``` ### What's NOT in this PR (deferred to M2+) - Server-side delta computation for cumulative counters - Airtime charts (TX/RX percentage lines) - Channel quality chart (recv_error_rate) - Battery voltage chart - Reboot detection and chart annotations - Resolution downsampling (1h, 1d aggregates) - Pattern detection / automated diagnosis --------- Co-authored-by: you <you@example.com>	2026-04-04 22:21:35 -07:00
Kpa-clawbot	2755dc3875	test: push ingestor coverage from 70% to 84% (#344 ) (#492 ) ## Summary Push Go ingestor test coverage from 70.2% → 84.0% (92.8% excluding the untestable `main()` and `init()` functions). Part of #344 — ingestor coverage ## What Changed Added `coverage_boost_test.go` with 60+ new test functions covering previously untested code paths: ### Coverage Before → After by Function \| Function \| Before \| After \| \|----------\|--------\|-------\| \| `NodeDaysOrDefault` \| 0% \| 100% \| \| `MoveStaleNodes` \| 0% \| 76.5% \| \| `NodePassesGeoFilter` \| 40% \| 100% \| \| `handleMessage` \| 41.4% \| 92.1% \| \| `ResolvedSources` \| 71.4% \| 100% \| \| `extractObserverMeta` \| 100% \| 100% \| \| `decodeAdvert` \| 88.2% \| 94.1% \| \| `decryptChannelMessage` \| 88.4% \| 93.0% \| \| Total \| 70.2% \| 84.0% \| ### Test Categories Added - Config: `NodeDaysOrDefault` all branches, broker scheme normalization (`mqtt://` → `tcp://`, `mqtts://` → `ssl://`) - Database: `MoveStaleNodes` (stale/fresh/replace), duplicate transmission handling, default timestamps, telemetry updates, schema migration verification - Decoder: Sensor telemetry parsing, location + features with truncated data, `countNonPrintable` with invalid UTF-8, `decryptChannelMessage` error paths (invalid key/MAC/ciphertext/alignment), short payload handling - Geo Filter: All branches (nil filter, nil coords, inside/outside) - Message Handler: Channel messages (with/without sender, empty text), direct messages, geo-filtered adverts, corrupted adverts (all-zero pubkey), non-advert packets, `Score`/`Direction` case-insensitive fallbacks, status messages with full hardware metadata ### Why Not 90%+ The remaining ~16% uncovered statements are: - `main()` function (68 blocks) — program entry point with MQTT client setup, signal handling, goroutines — not unit-testable without major refactoring - `init()` function — `--version` flag + `os.Exit(0)` — kills the test process - `prepareStatements()` error 
returns — only trigger on corrupted/incompatible SQLite databases - `applySchema()` migration error paths — only trigger on filesystem/SQLite failures Excluding `main()` and `init()`, effective coverage is 92.8%. ## Test Results All 100+ tests pass (existing + new): ``` ok github.com/corescope/ingestor 25.945s coverage: 84.0% of statements ``` --------- Co-authored-by: you <you@example.com>	2026-04-02 17:31:47 -07:00
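The broker scheme normalization exercised by these tests (`mqtt://` to `tcp://`, `mqtts://` to `ssl://`) can be sketched in a few lines. `normalizeBroker` is a hypothetical name, and leaving any other scheme untouched is an assumption.

```go
package main

import "strings"

// normalizeBroker rewrites MQTT-style URL schemes into the transport
// schemes named in the commit and leaves other URLs unchanged.
func normalizeBroker(broker string) string {
	switch {
	case strings.HasPrefix(broker, "mqtts://"):
		return "ssl://" + strings.TrimPrefix(broker, "mqtts://")
	case strings.HasPrefix(broker, "mqtt://"):
		return "tcp://" + strings.TrimPrefix(broker, "mqtt://")
	}
	return broker
}
```

Checking the `mqtts://` prefix first is not strictly required, since `mqtt://` is not a prefix of `mqtts://`, but it keeps the more specific case visibly ahead of the general one.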
Kpa-clawbot	f9cfad9cd4	fix: update observer last_seen on packet ingestion (#479 ) ## Summary Related to #463 (partial fix — addresses packet path, status message path still needs investigation) — Observers incorrectly showing as offline despite actively forwarding packets. ## Root Cause Observer `last_seen` was only updated when status topic messages (`meshcore/<region>/<observer_id>/status`) were received via `UpsertObserver`. When packets were ingested from an observer, the observer's `last_seen` was not updated — only the `observer_idx` was resolved for the observation record. This meant observers with low traffic that published status messages less frequently than the 10-minute online threshold would appear offline on the observers page, even though they were clearly alive and forwarding packets. ## Changes `cmd/ingestor/db.go`: - Added `stmtUpdateObserverLastSeen` prepared statement: `UPDATE observers SET last_seen = ? WHERE rowid = ?` - In `InsertTransmission`, after resolving `observer_idx`, update the observer's `last_seen` to the packet timestamp - This ensures any observer actively forwarding traffic stays marked as online `cmd/ingestor/db_test.go`: - Added `TestInsertTransmissionUpdatesObserverLastSeen` — verifies that inserting a packet from an observer updates its `last_seen` from a backdated value to the packet timestamp ## Performance The added `UPDATE` is a single-row update by `rowid` (primary key) — O(1) with no index overhead. It runs once per packet insertion when an observer is resolved, which was already doing a `SELECT` by `rowid` anyway. No measurable impact on ingestion throughput. ## Test Results All existing tests pass: - `cmd/ingestor`: 26.6s ✅ - `cmd/server`: 3.7s ✅ --------- Co-authored-by: you <you@example.com>	2026-04-01 23:43:47 -07:00
Kpa-clawbot	f87eb3601c	fix: graceful container shutdown for reliable deployments (#453 ) ## Summary Fixes #450 — staging deployment flaky due to container not shutting down cleanly. ## Root Causes 1. Server never closed DB on shutdown — SQLite WAL lock held indefinitely, blocking new container startup 2. `httpServer.Close()` instead of `Shutdown()` — abruptly kills connections instead of draining them 3. No `stop_grace_period` in compose configs — Docker sends SIGTERM then immediately SIGKILL (default 10s is often not enough for WAL checkpoint) 4. Supervisor didn't forward SIGTERM — missing `stopsignal`/`stopwaitsecs` meant Go processes got SIGKILL instead of graceful shutdown 5. Deploy scripts used default `docker stop` timeout — only 10s grace period ## Changes ### Go Server (`cmd/server/`) - Graceful HTTP shutdown: `httpServer.Shutdown(ctx)` with 15s context timeout — drains in-flight requests before closing - WebSocket cleanup: New `Hub.Close()` method sends `CloseGoingAway` frames to all connected clients - DB close on shutdown: Explicitly closes DB after HTTP server stops (was never closed before) - WAL checkpoint: `PRAGMA wal_checkpoint(TRUNCATE)` before DB close — flushes WAL to main DB file and removes WAL/SHM lock files ### Go Ingestor (`cmd/ingestor/`) - WAL checkpoint on shutdown: New `Store.Checkpoint()` method, called before `Close()` - Longer MQTT disconnect timeout: 5s (was 1s) to allow in-flight messages to drain ### Docker Compose (all 4 variants) - Added `stop_grace_period: 30s` and `stop_signal: SIGTERM` ### Supervisor Configs (both variants) - Added `stopsignal=TERM` and `stopwaitsecs=20` to server and ingestor programs ### Deploy Scripts - `deploy-staging.sh`: `docker stop -t 30` with explicit grace period - `deploy-live.sh`: `docker stop -t 30` with explicit grace period ## Shutdown Sequence (after fix) 1. Docker sends SIGTERM to supervisord (PID 1) 2. Supervisord forwards SIGTERM to server + ingestor (waits up to 20s each) 3. 
Server: stops poller → drains HTTP (15s) → closes WS clients → checkpoints WAL → closes DB 4. Ingestor: stops tickers → disconnects MQTT (5s) → checkpoints WAL → closes DB 5. Docker waits up to 30s total before SIGKILL ## Tests All existing tests pass: - `cd cmd/server && go test ./...` ✅ - `cd cmd/ingestor && go test ./...` ✅ --------- Co-authored-by: you <you@example.com> Co-authored-by: Kpa-clawbot <kpabap+clawdbot@gmail.com>	2026-04-01 12:19:20 -07:00
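The server-side ordering described in f87eb3601c (drain HTTP, then close WebSocket clients, checkpoint the WAL, close the DB) can be sketched with the standard library alone. This is an assumption-laden illustration: `step`, `shutdown`, `serveUntil` and `demo` are hypothetical names, the cleanup steps are no-ops standing in for the real hub and database calls, and the demo cancels itself after 100 ms instead of waiting for a real SIGTERM.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// step is a hypothetical named cleanup action that runs after HTTP draining.
type step struct {
	name string
	run  func() error
}

// shutdown drains in-flight requests within drainTimeout, then runs the
// remaining steps in order. It returns the completed step names and the
// first error encountered.
func shutdown(srv *http.Server, drainTimeout time.Duration, after []step) ([]string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), drainTimeout)
	defer cancel()

	var done []string
	var firstErr error
	if err := srv.Shutdown(ctx); err != nil {
		firstErr = err
	} else {
		done = append(done, "drain HTTP")
	}
	for _, s := range after {
		if err := s.run(); err != nil {
			if firstErr == nil {
				firstErr = err
			}
			continue
		}
		done = append(done, s.name)
	}
	return done, firstErr
}

// serveUntil serves on an ephemeral loopback port until ctx is done, then
// performs the ordered shutdown.
func serveUntil(ctx context.Context) ([]string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	srv := &http.Server{Handler: http.NewServeMux()}
	serveErr := make(chan error, 1)
	go func() { serveErr <- srv.Serve(ln) }()

	<-ctx.Done()
	noop := func() error { return nil } // placeholders for real cleanup calls
	done, err := shutdown(srv, 15*time.Second, []step{
		{"close WebSocket clients", noop},
		{"checkpoint WAL", noop},
		{"close DB", noop},
	})
	if serr := <-serveErr; !errors.Is(serr, http.ErrServerClosed) && err == nil {
		err = serr
	}
	return done, err
}

// demo waits for SIGTERM or Ctrl-C, but also cancels itself after 100 ms so
// the example terminates on its own.
func demo() ([]string, error) {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer stop()
	time.AfterFunc(100*time.Millisecond, stop)
	return serveUntil(ctx)
}

func main() {
	steps, err := demo()
	fmt.Println(steps, err)
}
```

`http.Server.Shutdown` stops accepting new connections and waits for active ones to finish, whereas `Close` drops them immediately. That difference is why the commit swaps one for the other.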
Kpa-clawbot	be313f60cb	fix: extract score/direction from MQTT, strip units, fix type safety issues (#371 ) ## Summary Fixes #353 — addresses all 5 findings from the CoreScope code analysis. ## Changes ### Finding 1 (Major): `score` field never extracted from MQTT - Added `Score float64` field to `PacketData` and `MQTTPacketMessage` structs - Extract `msg["score"]` with `msg["Score"]` case fallback via `toFloat64` in all three MQTT handlers (raw packet, channel message, direct message) - Pass through to DB observation insert instead of hardcoded `nil` ### Finding 2 (Major): `direction` field never extracted from MQTT - Added `Direction string` field to `PacketData` and `MQTTPacketMessage` structs - Extract `msg["direction"]` with `msg["Direction"]` case fallback as string in all three MQTT handlers - Pass through to DB observation insert instead of hardcoded `nil` ### Finding 3 (Minor): `toFloat64` doesn't strip units - Added `stripUnitSuffix()` that removes common RF/signal unit suffixes (dBm, dB, mW, km, mi, m) case-insensitively before `ParseFloat` - Values like `"-110dBm"` or `"5.5dB"` now parse correctly ### Finding 4 (Minor): Bare type assertions in store.go - Changed `firstSeen` and `lastSeen` from `interface{}` to typed `string` variables at `store.go:5020` - Removed unsafe `.(string)` type assertions in comparisons ### Finding 5 (Minor): `distHopRecord.SNR` typed as `interface{}` - Changed `distHopRecord.SNR` from `interface{}` to `*float64` - Updated assignment (removed intermediate `snrVal` variable, pass `tx.SNR` directly) - Updated output serialization to use `floatPtrOrNil(h.SNR)` for consistent JSON output ## Tests Added - `TestBuildPacketDataScoreAndDirection` — verifies Score/Direction flow through BuildPacketData - `TestBuildPacketDataNilScoreDirection` — verifies nil handling when fields absent - `TestInsertTransmissionWithScoreAndDirection` — end-to-end: inserts with score/direction, verifies DB values - `TestStripUnitSuffix` — covers all 
supported suffixes, case insensitivity, and passthrough - `TestToFloat64WithUnits` — verifies unit-bearing strings parse correctly All existing tests pass. Co-authored-by: you <you@example.com>	2026-04-01 07:26:23 -07:00
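Finding 3 of be313f60cb (strip a unit suffix before `ParseFloat`) could look roughly like the sketch below. The suffix list and the name `stripUnitSuffix` come from the commit message. The body, the suffix ordering and the `parseWithUnits` wrapper are assumptions made for illustration, and the project's actual implementation may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// unitSuffixes lists the suffixes named in the commit message. Longer
// suffixes come first so "dBm" is tried before "dB" and "m".
var unitSuffixes = []string{"dBm", "dB", "mW", "km", "mi", "m"}

// stripUnitSuffix removes one trailing unit suffix, ignoring case, and trims
// surrounding whitespace. Strings without a known suffix pass through.
func stripUnitSuffix(s string) string {
	s = strings.TrimSpace(s)
	for _, suf := range unitSuffixes {
		if len(s) >= len(suf) && strings.EqualFold(s[len(s)-len(suf):], suf) {
			return strings.TrimSpace(s[:len(s)-len(suf)])
		}
	}
	return s
}

// parseWithUnits is a hypothetical wrapper: strip the suffix, then parse.
func parseWithUnits(s string) (float64, bool) {
	f, err := strconv.ParseFloat(stripUnitSuffix(s), 64)
	if err != nil {
		return 0, false
	}
	return f, true
}

func main() {
	for _, in := range []string{"-110dBm", "5.5dB", "12 km", "42", "hello"} {
		v, ok := parseWithUnits(in)
		fmt.Printf("%q -> %v %v\n", in, v, ok)
	}
}
```

Ordering the list from longest to shortest suffix matters: with "m" first, `"-110dBm"` would be cut to `"-110dB"` and fail to parse.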
efiten	8a0862523d	fix: add migration for missing observations.timestamp index (#332) ## Problem On installations where the database predates the `idx_observations_timestamp` index, `/api/stats` takes 30s+ because `GetStoreStats()` runs two full table scans: ```sql SELECT COUNT(*) FROM observations WHERE timestamp > ? -- last hour SELECT COUNT(*) FROM observations WHERE timestamp > ? -- last 24h ``` The index is only created in the `if !obsExists` block, so any database where the `observations` table already existed before that code was added never gets it. ## Fix Adds a one-time migration (`obs_timestamp_index_v1`) that runs at ingestor startup: ```sql CREATE INDEX IF NOT EXISTS idx_observations_timestamp ON observations(timestamp) ``` On large installations this index creation may take a few seconds on first startup after the upgrade, but subsequent stats queries become instant. ## Test plan - [ ] Restart ingestor on an older database and confirm `[migration] observations timestamp index created` appears in logs - [ ] Confirm `/api/stats` response time drops from 30s+ to <100ms 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-01 07:06:54 -07:00
Kpa-clawbot	b2279b230b	fix: handle string, uint, and uint64 types in toFloat64 (#352 ) ## Summary Fixes #350 — `toFloat64()` silently drops SNR/RSSI values when bridges send strings instead of numbers. ## Problem Some MQTT bridges serialize numeric fields (SNR, RSSI, battery_mv, etc.) as JSON strings like `"-7.5"` instead of numbers. The existing `toFloat64()` switch only handled `float64`, `float32`, `int`, `int64`, and `json.Number`, so string values fell through to the default case returning `(0, false)` — silently dropping the data. ## Changes - `cmd/ingestor/main.go`: Added `string`, `uint`, and `uint64` cases to `toFloat64()` - `string`: uses `strconv.ParseFloat(strings.TrimSpace(n), 64)` to handle whitespace-padded numeric strings - `uint` / `uint64`: straightforward numeric conversion - Added `strconv` import - `cmd/ingestor/main_test.go`: Updated `TestToFloat64` with new cases: - Valid string (`"3.14"`), string with spaces (`" -7.5 "`), string integer (`"42"`) - Invalid string (`"hello"`), empty string - `uint(10)`, `uint64(999)` ## Testing All ingestor tests pass (`go test ./...`). Co-authored-by: you <you@example.com>	2026-04-01 06:58:27 -07:00
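The type switch described in b2279b230b can be sketched as follows. The list of handled types and the `strings.TrimSpace` plus `strconv.ParseFloat` treatment of strings are taken from the commit message. The exact body is an assumption, and `decodeSNR` is a hypothetical helper added only to show the function on a decoded MQTT-style payload.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// toFloat64 converts the numeric shapes a decoded JSON value may take,
// including numbers that a bridge serialized as strings.
func toFloat64(v interface{}) (float64, bool) {
	switch n := v.(type) {
	case float64:
		return n, true
	case float32:
		return float64(n), true
	case int:
		return float64(n), true
	case int64:
		return float64(n), true
	case uint:
		return float64(n), true
	case uint64:
		return float64(n), true
	case json.Number:
		f, err := n.Float64()
		if err != nil {
			return 0, false
		}
		return f, true
	case string:
		f, err := strconv.ParseFloat(strings.TrimSpace(n), 64)
		if err != nil {
			return 0, false
		}
		return f, true
	default:
		return 0, false
	}
}

// decodeSNR is a hypothetical helper: it parses a JSON payload and converts
// its "SNR" field with toFloat64.
func decodeSNR(payload string) (float64, bool) {
	var msg map[string]interface{}
	if err := json.Unmarshal([]byte(payload), &msg); err != nil {
		return 0, false
	}
	return toFloat64(msg["SNR"])
}

func main() {
	fmt.Println(decodeSNR(`{"SNR": -7.5}`))
	fmt.Println(decodeSNR(`{"SNR": "-7.5"}`))
	fmt.Println(decodeSNR(`{"SNR": "hello"}`))
}
```

Returning `(0, false)` for unparseable strings keeps the caller able to distinguish "no usable value" from a genuine zero reading.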
Kpa-clawbot	4898541bce	fix(ingestor): observer metadata nested stats + SNR/RSSI case fallback (#336 ) ## Problem Two data integrity bugs in the Go ingestor cause observer metadata and signal quality data to be missing for all Go-backend users. ### #320 — Observer metadata never populated `extractObserverMeta()` reads `battery_mv`, `uptime_secs`, and `noise_floor` from the top level of the MQTT status message. However, the actual MQTT payload nests these under a `stats` object: ```json { "status": "online", "origin": "ObserverName", "model": "Heltec V3", "firmware_version": "v1.14.0-9f1a3ea", "stats": { "battery_mv": 4174, "uptime_secs": 80277, "noise_floor": -110 } } ``` Result: battery, uptime, and noise floor are always NULL in the database. ### #321 — SNR and RSSI always missing on raw packets The raw packet handler reads `msg["SNR"]` and `msg["RSSI"]` (uppercase only). Some MQTT bridges send these as lowercase `snr`/`rssi`. The companion BLE handler already has a case-insensitive fallback — the raw packet path did not. Result: SNR/RSSI are NULL for all raw packet observations from bridges that use lowercase keys. 
## Fix ### #320 — Nested stats with top-level fallback - Added `nestedOrTopLevel()` helper that checks `msg["stats"][key]` first, then `msg[key]` - `extractObserverMeta` now uses this helper for `battery_mv`, `uptime_secs`, `noise_floor` - Top-level fallback preserved for backward compatibility with bridges that flatten the structure - Safe type assertion: `stats, _ := msg["stats"].(map[string]interface{})` — no crash if stats is missing or wrong type ### #321 — Lowercase SNR/RSSI fallback - Raw packet handler now uses `else if` to check lowercase `snr`/`rssi` when uppercase keys are absent - Matches the pattern already used in the companion channel and direct message handlers ## Tests 10 new test cases added: \| Test \| What it verifies \| \|------\|-----------------\| \| `TestExtractObserverMetaNestedStats` \| All 5 fields populated from nested stats object \| \| `TestExtractObserverMetaNestedStatsPrecedence` \| Nested stats wins over top-level when both present \| \| `TestExtractObserverMetaFlatFallback` \| Flat structure still works (backward compat) \| \| `TestExtractObserverMetaEmptyStats` \| Empty stats object — no crash, model still works \| \| `TestExtractObserverMetaStatsNotAMap` \| stats is a string — no crash, falls back to top-level \| \| `TestExtractObserverMetaNoiseFloorFloat` \| Float precision preserved (noise_floor REAL migration) \| \| `TestHandleMessageWithLowercaseSNRRSSI` \| Lowercase snr/rssi both stored correctly \| \| `TestHandleMessageSNRRSSIUppercaseWins` \| When both cases present, uppercase takes precedence \| \| `TestHandleMessageNoSNRRSSI` \| Neither key present — nil, no crash \| \| Existing `TestExtractObserverMeta` \| Still passes (flat structure backward compat) \| All tests pass: `go test ./... -count=1` and `go vet ./...` clean. 
Closes #320 Closes #321 --------- Co-authored-by: Kpa-clawbot <259247574+Kpa-clawbot@users.noreply.github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-31 17:53:04 -07:00
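The two lookups from 4898541bce can be illustrated with a short sketch. The helper name `nestedOrTopLevel` and the safe assertion on `msg["stats"]` come from the commit message. Its signature here is a guess, and `signalValue` and `decode` are hypothetical helpers that are not part of the project.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nestedOrTopLevel looks for key under msg["stats"] first and falls back to
// the top level. If "stats" is missing or not an object, the assertion
// yields a nil map, and reading from a nil map is safe in Go.
func nestedOrTopLevel(msg map[string]interface{}, key string) (interface{}, bool) {
	stats, _ := msg["stats"].(map[string]interface{})
	if v, ok := stats[key]; ok {
		return v, true
	}
	v, ok := msg[key]
	return v, ok
}

// signalValue returns the uppercase key when present and otherwise the
// lowercase one, so uppercase wins when both exist.
func signalValue(msg map[string]interface{}, upper, lower string) (interface{}, bool) {
	if v, ok := msg[upper]; ok {
		return v, true
	}
	v, ok := msg[lower]
	return v, ok
}

// decode parses a JSON object, returning nil on malformed input.
func decode(payload string) map[string]interface{} {
	var msg map[string]interface{}
	if err := json.Unmarshal([]byte(payload), &msg); err != nil {
		return nil
	}
	return msg
}

func main() {
	status := decode(`{"model":"Heltec V3","stats":{"battery_mv":4174,"noise_floor":-110}}`)
	fmt.Println(nestedOrTopLevel(status, "battery_mv"))

	packet := decode(`{"snr":9.5,"rssi":-90}`)
	fmt.Println(signalValue(packet, "SNR", "snr"))
}
```

Keeping the top-level fallback means bridges that flatten the status payload continue to work, which matches the backward-compatibility note in the commit.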


85 Commits