meshcore-analyzer

mirror of https://github.com/Kpa-clawbot/meshcore-analyzer.git synced 2026-07-29 05:59:19 +00:00

Author	SHA1	Message	Date
efiten GitHub Claude Sonnet 4.6	7c40e24a35	feat(server): warn at startup when GOMEMLIMIT < 50% of container memory limit (#1264 ) (#1429 ) ## Summary - Adds `readCgroupMemoryMB()` to detect container memory ceiling from cgroup v2 (`/sys/fs/cgroup/memory.max`) and v1 (`/sys/fs/cgroup/memory.limit_in_bytes`) - Adds `warnIfMemlimitUnderprovisioned()` called once from `main()` after the existing memlimit block — logs a `[memlimit] WARN` at startup if the effective GOMEMLIMIT is below 50% of the container limit - Works whether the limit was set via `GOMEMLIMIT` env var or derived from `packetStore.maxMemoryMB` - Adds `readCgroupMemoryMBFn` package-level hook for test injection (same pattern as `readProcSelfIOFn` in the ingestor) Fixes #1264. In the reported incident, GOMEMLIMIT was 1536 MiB on a 7.7 GB container; GC consumed 82% of CPU and all endpoints were 3–100× slower. This warning fires at startup so operators catch the misconfiguration before it causes an incident. ## Test plan - [ ] `TestWarnIfMemlimitUnderprovisioned_EmitsWarning` — warning fires when effective < 50% of cgroup - [ ] `TestWarnIfMemlimitUnderprovisioned_NoWarnWhenAdequate` — no warning at boundary (effective = 1024 MiB, cgroup = 1536 MiB) - [ ] `TestWarnIfMemlimitUnderprovisioned_NoCgroupNoLog` — silent on non-container hosts - [ ] `TestWarnIfMemlimitUnderprovisioned_NoneSource` — no warning when `source="none"` (no limit configured, runtime returns math.MaxInt64) - [ ] `TestMemlimitUnderprovisioned` — boundary table for the comparison helper - [ ] All existing `TestApplyMemoryLimit_*` still pass 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 15:06:30 -07:00
efiten GitHub Claude Sonnet 4.6	ad45a774d7	test(paths): regression test for #1144 — hop name mis-resolution on prefix collision (#1433 ) ## Summary - Adds `TestHandleNodePaths_HopName_CanonicalPathShowsTarget_1144` as a regression test for issue #1144 - When two nodes share a short pubkey prefix (e.g. `"37"`), the biased hop resolver (`resolveWithContext`) could pick a GPS-having sibling over the actual target node, producing the wrong name in hop display - The bug was already fixed during the #1352 canonical-path work: the canonical-path branch (Option A) uses `lookupNode(resolvedPK)` with the full pubkey from `resolved_path`, bypassing the biased resolver entirely - This PR documents and locks in the correct behaviour with a targeted test ## Test setup - `targetPK` (`37cf...`): no GPS - `siblingPK` (`37bb...`): has GPS — the biased resolver's tier-3 picks this without the fix - One TX with `resolved_path = [targetPK]` → Option A fires → `lookupNode(targetPK)` → hop shows `"CJS SF Mission"`, not `"Templeton Hills"` If Option A were removed (bug re-introduced), `resolveWithContext("37", ...)` on the two candidates would return the GPS-having sibling, triggering the test failure. ## Test plan - [x] `go test -run TestHandleNodePaths_HopName -v` passes - [x] Full `go test ./...` passes - [x] Code review addressed (collapsed redundant error checks) Closes #1144 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 15:02:59 -07:00
efiten GitHub Claude Sonnet 4.6	981664528e	perf(server): serve stale repeater enrich cache instead of inline rebuild (#1272 ) (#1436 ) ## Summary - Removes the TTL-based inline rebuild from `GetRepeaterRelayInfoMap` and `GetRepeaterUsefulnessScoreMap` - When the cache is non-nil it is returned immediately, regardless of age — no more 700ms on-request recompute - Inline compute is retained only as a nil-cache guard (edge case: tests without a running recomputer) - Fixes the stale `// 15s-TTL gate` comment in `recomputeRepeaterEnrichmentSafe` Root cause: `computeRepeaterRelayInfoMap` runs inline when the TTL expires, taking ~700ms on a busy instance. `StartRepeaterEnrichmentRecomputer` (introduced in #1262) already keeps the cache warm via synchronous prewarm at startup + 5-min ticks, making the inline path dead code that fires only when the TTL is shorter than the recomputer interval (e.g. custom `analytics.defaultIntervalSeconds > 600`). ## Test plan - [ ] `TestGetRepeaterRelayInfoMap_ServesStaleOnTTLExpiry` — regression guard: stale sentinel is returned without recompute - [ ] `TestGetRepeaterUsefulnessScoreMap_ServesStaleOnTTLExpiry` — same for usefulness score map - [ ] `TestGetRepeaterRelayInfoMap_BuildsWhenNil` — nil-cache fallback still works - [ ] Full `-short` suite passes (`go test -short ./...`) Closes #1272 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 15:01:58 -07:00
efiten GitHub Claude Sonnet 4.6	52f131e2dc	fix(ingestor): add hourly WAL checkpoint to prevent unbounded WAL growth (#1435 ) Fixes #1434. ## Problem The ingestor's `Checkpoint()` (`PRAGMA wal_checkpoint(TRUNCATE)`) was only called on shutdown. SQLite's built-in auto-checkpoint runs in PASSIVE mode which cannot truncate the WAL while the server holds an active read connection. Result: the WAL grows at ~40–50 MB/hour and is never reset during a running instance. Observed on analyzer.on8ar.eu: 183.4 MB WAL after ~4h uptime. ## Changes `cmd/ingestor/main.go` - Add a periodic goroutine that calls `Checkpoint()` every hour, staggered 30s after startup - Hoist `walCheckpointTicker` to function scope so it is stopped cleanly at shutdown alongside all other tickers `cmd/ingestor/db.go` - Switch `Checkpoint()` from `Exec` to `QueryRow(...).Scan` to capture SQLite's 3-column result (`busy`, `log`, `checkpointed`) - Return the checkpointed frame count (callers that discard it are unaffected) - Log only when `walFrames > 0` — silent when WAL is already empty, avoiding log spam - Log `blocked=true/false` instead of raw `busy` integer to make it clear when the server's read lock is preventing full truncation ## Behaviour after fix Each hourly tick flushes all WAL frames not held by an active server reader. Worst-case WAL size is now bounded to roughly one hour of write traffic (~45 MB) instead of unbounded growth. If the server holds a read lock at checkpoint time, the log shows `blocked=true` and remaining frames are retried on the next tick. ## Test plan - [x] `go build ./...` (ingestor module) - [x] `go test ./...` passes - [x] Code review addressed (ticker stop on shutdown, log message clarity) 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 15:01:54 -07:00
Eric MuehlsteinandGitHub	29432d4fe0	feat(ingestor): document and test ws:// / wss:// WebSocket MQTT broker support (#902 ) ## Summary CoreScope's ingestor already supports WebSocket MQTT connections today — `paho.mqtt.golang` v1.5.0 handles `ws://` and `wss://` natively via gorilla/websocket. However this support was undocumented, untested, and had a TLS gap for `wss://` connections. This PR closes those gaps without any breaking changes. ## Changes ### `cmd/ingestor/config.go` - Added godoc comment to `ResolvedSources()` explaining all four supported schemes and which ones require translation vs. pass-through - `ws://` and `wss://` explicitly documented as native paho schemes requiring no mapping ### `cmd/ingestor/main.go` - Extended TLS config to cover `wss://` in addition to `ssl://` - Before: `wss://` connections would use paho's default TLS (no explicit `tls.Config` set), which works for valid certs but doesn't apply the same predictable setup as `ssl://` - After: both `ssl://` and `wss://` get `tls.Config{}` (system CA pool), matching behavior; `rejectUnauthorized: false` still works for self-signed certs on both schemes ### `cmd/ingestor/config_test.go` Two new tests: - `TestResolvedSourcesSchemeMapping`: validates all six scheme variations (`mqtt://`, `mqtts://`, `tcp://`, `ssl://`, `ws://`, `wss://`) including paths like `wss://host/mqtt` - `TestLoadConfigWSSource`: full round-trip of a dual-source config (TCP + wss:// with username/password), verifies scheme unchanged through `LoadConfig` and `ResolvedSources` ### `config.example.json` - Added `wsmqtt` example entry showing `wss://` with username/password - Updated `_comment_mqttSources` to enumerate all supported schemes: `mqtt://`, `mqtts://`, `ws://`, `wss://` ## Motivation We run [meshcore-mqtt-broker](https://github.com/andrewjfreyer/meshcore-mqtt-broker) (a WebSocket MQTT bridge with JWT auth) alongside Mosquitto, and subscribe to both via `mqttSources`. The dual-source config works in production but nothing in the docs or example config made this discoverable for other operators. ## Testing ``` cd cmd/ingestor && go test ./... ok github.com/corescope/ingestor 1.568s ``` All existing tests pass. Two new tests added. ## No breaking changes - Existing configs: no change in behavior - `ws://` / `wss://` configs that were already working: same behavior + explicit TLS setup for `wss://`	2026-05-28 14:58:52 -07:00
efiten GitHub Claude Sonnet 4.6	b3e55ae8d5	fix(nodes): sort paths-through-node by recency, count as tiebreaker (#1145 ) (#1431 ) ## Summary - `/api/nodes/{pk}/paths` returned paths in non-deterministic map iteration order; with many paths the UI showed a random ordering on each page load - Now sorted by `LastSeen` descending (newest-first), with `Count` as a tiebreaker (higher first) - Nil `LastSeen` sorts last (treated as oldest) - `LastSeen` is an RFC 3339 string so lexicographic comparison is correct Closes #1145. ## Test plan - [ ] `TestHandleNodePaths_SortByRecency_1145` — 3 distinct paths (via relay1, relay2, direct), verifies newest appears first - [ ] `TestHandleNodePaths_SortCountTiebreaker_1145` — two paths with identical `LastSeen`, verifies higher-count path wins the tiebreak - [ ] All existing `TestHandleNodePaths_*` tests still pass 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 14:55:59 -07:00
Kpa-clawbot GitHub openclaw-bot	2627bd053b	fix(#1465 ): observer.last_seen always uses ingest time, not envelope (#1466 ) ## Summary `observer.last_seen` (and `last_packet_at`) answer "when did the analyzer last hear from this observer" — fundamentally an ingest-time question. Previously both the status-message handler and the packet-message handler passed the MQTT envelope timestamp into `UpsertObserverAt` / `stmtUpdateObserverLastSeen`, which let buggy observer clocks drag `last_seen` hours into the past even when the timestamp parsed cleanly as RFC3339 (so #1464's naive-clamp didn't catch it). California observers on `analyzer.00id.net` consistently appeared 3-7h stale for this reason. ## Fix - `cmd/ingestor/main.go` status handler: pass `""` to `UpsertObserverAt` so it falls back to `time.Now()`. - `cmd/ingestor/main.go` packet-path observer upsert: same. - `cmd/ingestor/db.go` `InsertTransmission`'s `stmtUpdateObserverLastSeen.Exec` call: use `ingestNow` for both `last_seen` and `last_packet_at` (was `rxTime`). Per-packet rxTime semantics (`transmissions.first_seen`, `observations.timestamp`) are unchanged — those continue to use envelope time with the naive-clamp / 14h-future / 30d-past guards from #1463 / #1464. Per-hop SNR-vs-time analysis still works. ## TDD - Red: `test(#1465): observer.last_seen uses ingest time even with well-formed envelope (red)` - 3 new tests in `observer_lastseen_1465_test.go`: status-past, status-future, packet-path-past. - Status-past and packet-path-past assertions failed on master (envelope time stored verbatim). - Green: `fix(#1465): observer.last_seen always uses ingest time, not envelope` - All 3 new tests pass. - Pre-existing `TestInsertTransmissionUpdatesObserverLastSeen` and `TestLastPacketAtUpdatedOnPacketOnly` were encoding the buggy behavior; updated to assert ingest-time semantics. - Full `go test ./cmd/ingestor/...` green. ## Refs - Refs #1463 (root-cause investigation) - Refs #1464 (naive-clamp fix that handled malformed timestamps) - Closes #1465 --------- Co-authored-by: openclaw-bot <bot@openclaw.local>	2026-05-28 12:16:29 -07:00
Kpa-clawbot GitHub openclaw-bot	7106e1921e	fix(#1463 ): clamp naive envelope timestamps symmetrically (#1464 ) Red commit: `fc6ed65f` (CI fails on `TestResolveRxTimeNaiveTimestampClamp`) Green commit: `80bf1285` ## Problem California observers (UTC−7) had `last_seen` perpetually pinned ~7h behind wall-clock and rendered "Stale" in the UI despite active MQTT status traffic. Root cause: `parseEnvelopeTime` parses zone-less ISO timestamps (python `datetime.now().isoformat()`) as UTC, leaving a residual offset equal to the observer's UTC offset. The existing soft-clamp at `resolveRxTime` only caught the future-skew (UTC+N) mirror case. ## Fix — Option B (symmetric clamp) - `parseEnvelopeTime` now returns a `(time.Time, naive bool, error)` tuple so callers can tell zone-aware from zone-less parses. - `resolveRxTime` applies a 15-minute symmetric tolerance window for `naive==true` values: anything further off than 15 min collapses to ingest time and emits a warning log. - Well-behaved observers (Z-suffixed or explicit `±HH:MM` offset) are completely untouched regardless of skew — legitimate buffered uploads remain accurate to the second. Chose option B over option A (reject naive outright) because some observers may be sending naive UTC strings — those would suddenly lose their own time. Symmetric clamp preserves the well-synced naive case (< 15 min off) and rescues every other zone. ## Tests - New `TestResolveRxTimeNaiveTimestampClamp` covers naive past, naive future, naive w/ microseconds, Z-suffixed past (verbatim), offset-suffixed (canonicalized to UTC), naive within tolerance (verbatim). - `TestParseEnvelopeTime` updated for new signature, asserts `naive` flag. - All existing rxtime tests preserved (factory date, 30-day floor, 14h future, plausible past). - Red commit ran first, failed on assertions, then green commit makes everything pass. ## Operator visibility `naive timestamp "..." off by 7h, using ingest time` now appears in the ingestor log so operators can identify upstream observer scripts that should switch to `datetime.now(timezone.utc).isoformat()`. Fixes #1463 --------- Co-authored-by: openclaw-bot <bot@openclaw.local>	2026-05-28 09:00:12 -07:00
Kpa-clawbot GitHub clawbot	d24246395d	fix(#1456 ): rename Usefulness → Traffic share + add traffic_share_score field (#1457 ) ## Summary Rename the "Usefulness" UI label to "Traffic share", add hover tooltips for both Traffic share and Bridge score, and introduce a new `traffic_share_score` field on `/api/nodes` (alongside the legacy `usefulness_score`, kept for API back-compat). Closes #1456. ## Why The "Usefulness" label implied a composite score that doesn't exist yet — only the Traffic-share axis (axis 1 of 4 from #672) and the Bridge axis (axis 2 of 4 from #1275) are wired today. A node with low traffic but critical structural position read as "not useful" — exactly wrong. Neither score had a tooltip explaining what it measured. ## Changes ### Frontend (`public/nodes.js`) - Visible label `Usefulness` → `Traffic share` (with ⓘ glyph) - Tooltip explains traffic-share semantics, cross-references Bridge for structural importance, points at #672 for the 4-axis roadmap - Bridge row gets a parallel ⓘ glyph and a tooltip naming "betweenness centrality" + the "quiet but irreplaceable chokepoint" interpretation - Prefers new `traffic_share_score` with graceful fallback to legacy `usefulness_score` ### Backend (`cmd/server/routes.go`) - `/api/nodes` and `/api/nodes/{pubkey}` now emit BOTH `usefulness_score` (kept for API compat) AND `traffic_share_score` (new canonical name), populated with the same value - Inline comment documents the deprecation path: when the #672 composite ships, `usefulness_score` becomes the composite and `traffic_share_score` keeps the per-axis value ## Tests - `test-issue-1456-score-labels.js` — file-grep pins on `nodes.js` (label, tooltip fragments, percent formatting, dual-field read with fallback) - `cmd/server/traffic_share_score_test.go` — `/api/nodes` + `/api/nodes/{pk}` responses contain both fields with equal values TDD: red commit (`8bd235a0`) added failing tests; green commit (`c4d3aee5`) implemented. `go test ./cmd/server/...` passes (47s). ## Out of scope - Renaming the backend field (would break consumers) - Wiring axes 3 (Coverage) and 4 (Redundancy) — tracked in #672 - Changing the score calculation --------- Co-authored-by: clawbot <bot@openclaw.local>	2026-05-28 05:22:08 -07:00
Kpa-clawbot GitHub openclaw	777f77a451	feat(#1420 ): dark-tile provider picker in customizer (4 variants) (#1430 ) # feat(#1420): dark-tile provider picker in customizer (4 variants) Closes #1420. ## What Operator pick: don't force a single dark-tile choice on everyone. Wire 4 candidates into the customizer + server config so users can choose which dark basemap they want, with per-browser persistence. ## Providers shipped \| ID \| Source \| Filter \| \|---\|---\|---\| \| `carto-dark` (default) \| `https://{s}.basemaps.cartocdn.com/dark_all/{z}/{x}/{y}{r}.png` \| none \| \| `esri-darkgray-labels` \| Esri Dark Gray Base + Reference (two stacked layers) \| none \| \| `voyager-inverted` \| Carto Voyager + CSS `invert(1) hue-rotate(180deg) brightness(0.9) contrast(1.05)` on `.leaflet-tile-pane` \| applied in dark, cleared in light \| \| `positron-inverted` \| Carto Positron + same CSS invert \| applied in dark, cleared in light \| No new dependencies — all providers are URL-only. ## Architecture - `public/map-tile-providers.js` — registry + 5 public helpers (`MC_TILE_PROVIDERS`, `MC_setDarkTileProvider`, `MC_getDarkTileProvider`, `MC_setServerDefaultTileProvider`, `MC_applyTileFilter`). Persists to `localStorage['mc-dark-tile-provider']`. Dispatches `mc-tile-provider-changed` on user pick. - `public/map.js` / `public/live.js` — resolve the active dark provider via the registry, manage the Esri labels overlay lifecycle (add when needed, remove cleanly so we don't leak layers on repeated theme toggles), and apply/clear the CSS filter on `.leaflet-tile-pane`. Listen for both `data-theme` mutations AND `mc-tile-provider-changed`. - `public/customize-v2.js` — new "Dark Map Tiles" dropdown in the Display tab. On change, calls `MC_setDarkTileProvider(id)`; the maps re-render live without reload. - `public/roles.js` — hydrates the server default via `MC_setServerDefaultTileProvider` from `/api/config/client`. - Server (`cmd/server/`) — new `mapDarkTileProvider` string on `Config` + surfaced in `ClientConfigResponse`. Default empty → client uses `carto-dark`. - `config.example.json` — documents the new field with all allowed values. ## Behavior guarantees (from the acceptance criteria) - ✅ Light mode is completely unchanged — `_resolveTileUrl(false)` short-circuits to `TILE_LIGHT` with no filter and no overlay logic. - ✅ Switching dark→light always clears the CSS filter, even if an inverted provider remains selected (`MC_applyTileFilter` is called on every theme change and early-returns to `style.filter = ''` when not dark). - ✅ Switching light→dark with an inverted provider re-applies the filter. - ✅ Attribution is updated per provider (Esri credit for Esri, CartoDB credit for the others); the Leaflet attribution control is refreshed. - ✅ Esri uses two stacked layers (base + reference labels). The reference layer is added/removed cleanly so repeat toggles do not leak. - ✅ Customizer change → immediate re-render, no reload. Uses the same "live setting + persist + dispatch event" pattern as cb-presets (#1361). ## TDD - Red commit: `148b71c3` — `test(#1420): add failing tests for dark-tile provider registry (red)` — 6/7 assertions fail (stub only returns nulls). - Green commit: `49ffb230` — `feat(#1420): dark-tile provider picker — 4 variants wired into customizer` — 7/7 pass. ## Tests `test-issue-1420-tile-providers.js` (wired into `test-all.sh` and `.github/workflows/deploy.yml` JS-unit step): ``` ── #1420 Dark-tile provider registry ── ✅ MC_TILE_PROVIDERS has all 4 IDs with url + attribution ✅ Inverted providers have non-null invertFilter; non-inverted have null ✅ MC_setDarkTileProvider persists to localStorage and dispatches mc-tile-provider-changed ✅ MC_setDarkTileProvider rejects unknown IDs (no persistence, no dispatch) ✅ MC_getDarkTileProvider falls back to server default, then carto-dark ✅ Apply filter for inverted provider in dark mode; clear when switching to non-inverted ✅ Light mode always clears the CSS filter even if inverted provider is selected 7 passed, 0 failed ``` `cd cmd/server && go build ./... && go vet ./...` — clean. ## CDP verification Not run in this PR — the sandbox does not have a Chrome CDP endpoint reachable, and staging cannot exercise this code path until this branch is deployed. The issue body's "CDP-verified candidate set" table covers prior provider-URL validation; the new code path (registry lookup + filter swap + Esri overlay lifecycle) is covered by the unit tests above. Recommend operator run a quick manual verification on staging post-deploy: dark mode → open customizer → cycle through all 4 providers, confirm tiles render and the CSS filter is applied for `voyager-inverted` / `positron-inverted` (verify via `getComputedStyle(document.querySelector('.leaflet-tile-pane')).filter`). ## Files touched - `public/map-tile-providers.js` (new) - `public/map.js`, `public/live.js`, `public/customize-v2.js`, `public/roles.js`, `public/index.html` - `cmd/server/config.go`, `cmd/server/routes.go`, `cmd/server/types.go` - `config.example.json` - `test-issue-1420-tile-providers.js` (new), `test-all.sh`, `.github/workflows/deploy.yml` - `.eslintrc.json` (register new `MC_*` globals) --------- Co-authored-by: openclaw <bot@openclaw.local>	2026-05-27 14:37:51 +00:00
efiten GitHub Claude Sonnet 4.6	f0c69d5fe7	perf(server): fix repeaterEnrichTTL mismatch causing 18s /api/nodes latency (#1425 ) ## Root cause `repeaterEnrichTTL` was 15 seconds, but the background recomputer (`StartRepeaterEnrichmentRecomputer`) runs every 5 minutes. After each recomputer tick, the relay/usefulness caches were valid for 15 seconds. For the remaining 4m45s, every `/api/nodes` request hit a stale TTL gate in `GetRepeaterRelayInfoMap` / `GetRepeaterUsefulnessScoreMap` and fell through to `computeRepeaterRelayInfoMap` on the request goroutine. On production (16k+ transmissions, 240k hop records) that rebuild takes ~18 seconds, making `/api/nodes?limit=5000` freeze on virtually every page load. The pattern was: ``` recomputer runs at T=0 → cache valid T=15s → TTL expires T=15s … T=5min → every request rebuilds on-thread (18s each) T=5min → recomputer runs again → 15s valid window repeat ``` ## Fix One line in `repeater_enrich_bulk.go`: ```go // Before const repeaterEnrichTTL = 15 * time.Second // After const repeaterEnrichTTL = 10 * time.Minute ``` The TTL now exceeds the recomputer interval so the cache is always warm between background ticks. The TTL remains as a safety net for cases where the recomputer isn't running (tests, early startup edge cases) — it just no longer expires between ticks. ## Production results (analyzer.on8ar.eu) Tested with binary injection on the live server before opening this PR. \| Metric \| Before \| After \| \|--------\|--------\|-------\| \| TTFB (`/api/nodes?limit=5000`) \| 18.6 s \| 0.47–0.54 s \| \| Total response time \| 18.9 s \| 1.55–1.73 s \| \| Improvement \| — \| 34–39× \| Confirmed still fast at t+60s (well past the old 15s window). ## Test results ``` TestHandleNodesPerfLargeFleet elapsed=1.9ms budget=2s PASS TestHandleNodesLimit2000ColdMiss elapsed=5.3ms budget=2s PASS ``` Both existing perf regression tests pass unchanged — the TTL change doesn't affect their behavior (they test the cold-prewarm path, not TTL expiry). ## Why this wasn't caught by tests `TestHandleNodesLimit2000ColdMiss` only tests the cold-startup path (cache nil → on-thread build → cache hit). It doesn't test the TTL-expiry path (cache exists but stale → on-thread rebuild). A test covering the latter would need to fast-forward time past the TTL, which the existing fixture doesn't do. Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 01:28:46 -07:00
Kpa-clawbot GitHub claw	f15d2efe81	fix(#1386 ): #1324 follow-up — test coverage + RWMutex + lock-hold-time + dead code + cadence (#1390 ) # #1324 follow-up — test coverage + RWMutex + lock-hold-time + dead code + cadence Addresses the post-merge audit findings in #1386 on PR #1324 (multi-byte capability persistence). Two independent audits (Kent Beck test-quality + Carmack perf) surfaced one top-level test-coverage gap and three perf concerns. This PR closes all of them; cadence cleanup is included. Red commit: `<RED_SHA>` (CI: `<RED_URL>`) ## What 1. Tests (`cmd/ingestor/multibyte_persist_test.go`): - `TestRunMultibyteCapPersist_RoundTrip` — end-to-end persist → close store → reopen → assert DB state survived. - `TestRunMultibyteCapPersist_MalformedSnapshot` — corrupt snapshot must log + no-op, not crash. - `TestRunMultibyteCapPersist_MissingSchemaColumns` — legacy DB without `multibyte_sup` cols must skip with explicit log, not panic / silently swallow. - `TestRunMultibyteCapPersist_PreservesConfirmedOnUnknown` — status=`unknown` MUST NOT clobber an existing `confirmed` row (mutation guard for the data-destruction check). 2. `cmd/server/store.go` - `cacheMu sync.Mutex` → `sync.RWMutex`. The per-node `GetMultibyteCapFor` read path in `/api/nodes` (`routes.go:1215`) uses `RLock` now; no longer serializes against itself or against analytics readers. - Build the multi-byte index map OUTSIDE `cacheMu`, then swap the pointer inside. Removes a 2400-iteration allocation hold from the analytics-cycle critical section. - Drop the dead `GetMultiByteCapMap` (zero callers confirmed by `rg`) and the stale `multibyteStatusToInt` tombstone comment. 3. `cmd/ingestor/multibyte_persist.go` - Replace the per-entry pair of `UPDATE nodes` + `UPDATE inactive_nodes` (50% guaranteed-miss) with a single dispatch-by-table-membership `UPDATE` per entry. ~50% fewer prepared-stmt round-trips. - Explicit `MalformedSnapshot` log line distinct from cold-start. - Defensive schema-presence check via `PRAGMA table_info` once at start; logs `[multibyte-persist] schema missing` and returns clean stats on legacy DBs. 4. `cmd/server/analytics_recomputer.go` / `config.example.json` — bump default snapshot cadence from 15s to 1m (the snapshot is a derived cache the ingestor only reads every 5 min; 4× less disk churn, no observable freshness loss). ## Why Direct quotes from the audit (#1386): > "No end-to-end persist→restart→load round-trip — the documented > value prop of the PR ('survives restart') has no single test > exercising the full path." (Kent Beck) > "`cacheMu` is `sync.Mutex` not `sync.RWMutex` + per-node read in > `handleNodes` — 2400 serialized lock acquisitions per `/api/nodes` > call, contended against every analytics-cache reader/writer. > The O(1) win is consumed by lock contention." (Carmack #1) > "Map construction held under shared `cacheMu` — every 15s > analytics cycle blocks every API cache read for the duration of a > 2400-entry map build. Build outside the lock, swap pointer > inside." (Carmack #2) > "`UPDATE nodes` + `UPDATE inactive_nodes` per entry … 4800 > prepared-stmt round-trips, 2400 guaranteed-empty." (Carmack #3) > "Server writes 20 snapshots for every one the ingestor reads. > Cadence mismatch — server could publish every 1 min and lose > nothing." (Carmack §2) ## TDD Red commit adds the four tests above. Two of the four (`MalformedSnapshot`, `MissingSchemaColumns`) fail on assertions against the pre-fix `multibyte_persist.go`; the other two (`RoundTrip`, `PreservesConfirmedOnUnknown`) are regression coverage of behaviour the original implementation already honoured but never exercised — they exist to guard future mutation (the audit's mutation-suggestion lens). Green commit lands the implementation. ## Bench `go test -bench BenchmarkGetMultibyteCapFor -benchmem -count=10` (local, idle laptop, n=2400-entry index, 8 reader goroutines vs. one analytics writer): \| variant \| ns/op \| allocs/op \| \|--------------------\|------:\|----------:\| \| `sync.Mutex` (pre) \| n/a — see note \| — \| \| `sync.RWMutex` \| n/a — see note \| — \| Note: did not produce a concurrent benchmark in this PR (would require non-trivial test scaffolding around the cache lifecycle). The win is structural — `RLock` allows the ~2400 per-`/api/nodes` reads to proceed in parallel rather than serializing on the same mutex held by every analytics writer. Documenting honestly per AGENTS.md "perf claims require proof": full microbench deferred to a follow-up. ## Manual verification (staging) - New tests: `go test ./... -count=1 -timeout 300s` in `cmd/ingestor` and `cmd/server` — green. - All multibyte-area tests (`#1366`, `#1368`, `#1372` regression suites in `multibyte_capability_test.go`, `multibyte_enrich_test.go`, `multibyte_region_filter_test.go`): green. - Preflight: `bash ~/.openclaw/skills/pr-preflight/scripts/run-all.sh origin/master` — exit 0. Fixes #1386 --------- Co-authored-by: claw <claw@openclaw.local>	2026-05-25 23:29:35 -07:00
Joel ClawandGitHub	95d7916530	fix(channels): normalize known channel display names (public → Public) (#777 ) Normalizes well-known channel display names (currently only `public` → `Public`) so existing deployments with pre-#761 lowercase config keys show the canonical firmware-default name `Public` in the UI. Behavior: - `knownChannelCasing` lookup (`decoder.go`) — single-entry map, easy to extend. - `normalizeChannelName()` applied at config load (`loadChannelKeys`) AND at decode time (defense in depth). - One-shot SQLite migration `channel_hash_casing_v1` backfills `channel_hash='public'` → `'Public'` on `payload_type=5` rows so channel-grouping queries don't split across the upgrade boundary. - Hardcoded list intentionally tiny (1 entry); custom/user channels left untouched. Safety: - Channel-hash derivation (`SHA256(channelName)[:16]` for `#`-prefixed `HashChannels`) is unchanged — normalization only renames map keys for explicit `ChannelKeys` entries (which don't feed `deriveHashtagChannelKey`). - PSK lookup is by hash byte, not by name — mesh interop preserved. - Migration is gated by `_migrations.name='channel_hash_casing_v1'`, idempotent. Tests (`cmd/ingestor/normalize_channel_test.go`): - `TestNormalizeChannelName` covers known + hashtag + custom + empty. - `TestLoadChannelKeys_NormalizesKnownDisplayNames` — verifies `public` → `Public` at load. - `TestLoadChannelKeys_LeavesCustomNamesUntouched` — custom names not auto-capitalized. - `TestLoadChannelKeys_DuplicateCasingLogsWarning` — config containing both casings resolves deterministically (canonical wins). Mutation test confirmed: reverting load-time normalize → `TestLoadChannelKeys_NormalizesKnownDisplayNames` and `_DuplicateCasingLogsWarning` both fail on assertions. Related: #761	2026-05-25 23:05:07 -07:00
efiten GitHub Claude Sonnet 4.6 openclaw-bot	0b35c7eef3	feat(server): persist multi-byte capability across restart + O(1) per-key lookup (#903 ) (#1324 ) ## Summary Follows the reconciliation recommendation in #916 — extracts only the NET-NEW persistence layer from that PR (which is now superseded by #1002 for the overlay UI) into a focused 6-file change against current master. What this adds: - `multibyte_sup_v1` migration: `multibyte_sup INTEGER NOT NULL DEFAULT 0` + `multibyte_evidence TEXT` on `nodes`/`inactive_nodes` so capability survives restart - `hasMultibyteSupCols` schema detection gates the persist/load paths - `loadMultibyteCapFromDB()`: pre-populates `mbCapSnapshot`/`mbCapIndex` at startup — cold starts serve last-known capability without waiting for the first ~15s analytics cycle - `maybePersistMultibyteCapability()` + `persistMultibyteCapability()`: after each analytics cycle; TryLock-gated (concurrent cycles coalesce); skips `sup==0` entries (data-destruction guard) - `GetMultibyteCapFor(pk)`: O(1) map lookup; both `handleNodes` and node-detail call sites updated from the O(N)-alloc `GetMultiByteCapMap()` What this explicitly does NOT change: - API field names (`multi_byte_status`, `multi_byte_evidence`, `multi_byte_max_hash_size`) - `EnrichNodeWithMultiByte` — unchanged - `GetMultiByteCapMap` — still present for any external callers - `public/map.js`, `public/live.css`, `Dockerfile`, `docs/` — zero frontend churn ## Test plan - [x] `TestMultibyteCapPersistRoundTrip` — confirmed values survive persist → fresh-store load - [x] `TestMultibyteCapPersistSkipsUnknown` — data-destruction guard: `sup==0` entry does not overwrite DB-confirmed value - [x] `TestMultibyteCapMaybePersistCoalesces` — TryLock coalesces 10 concurrent callers without deadlock - [x] `TestMultibyteCapGetMultibyteCapForO1` — O(1) index returns correct entry / false for unknown pubkey - [x] `TestMultibyteCapLoadFromDB` — only `sup>0` rows loaded; `sup==0` row excluded - [x] `TestSchemaMultibyteSupColumns` — migration adds columns to both tables; idempotent on second `OpenStore` - [x] All existing `TestMultiByteCapability_*` tests pass unchanged - [x] Full ingestor test suite: `ok` in 27s - [x] `go build ./cmd/server/ && go build ./cmd/ingestor/` clean 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: openclaw-bot <bot@openclaw>	2026-05-25 22:35:35 -07:00
Kpa-clawbot GitHub openclaw-bot	9d3dd8df0a	fix(packets): order by ingest id, not rxTime — fresh activity visible on packets page (#1345 ) (#1349 ) ## Summary Fixes #1345 — the packets page shows "no recent activity" while MQTT ingest is healthy because the default `/api/packets` query was `ORDER BY first_seen DESC`, and PR #1233 redefined `first_seen` as the observer's radio receive time (rxTime). When an observer buffers offline and uploads hours later, its packets land with hours-old `first_seen` values; older-ingested packets with fresher rxTime then crowd the top of the list and the visually freshest activity disappears. ## Fix Switch the default ordering to `t.id DESC` (ingest order) on `/api/packets` and the closely-related endpoints. `id` is monotonic with ingest time and immune to buffered uploads. Endpoints changed (all use the same fix for the same reason): \| Path \| Function \| File \| \|------\|----------\|------\| \| `GET /api/packets` (default) \| `DB.QueryPackets`, `Store.QueryPackets` \| `cmd/server/db.go`, `cmd/server/store.go` \| \| `GET /api/packets?nodes=…` \| `DB.QueryMultiNodePackets`, `Store.QueryMultiNodePackets` \| same \| \| Node detail "recent transmissions" \| `DB.GetRecentTransmissionsForNode` \| `cmd/server/db.go` \| ## `since=` semantic — preserved `since=` still filters by `first_seen` (RFC3339 path uses the observations.timestamp subquery), i.e. "packets the network received since X." Buffered uploads of older packets are still excluded from a `since=15m` view even if they were ingested in the last 15 minutes. Only the display order changes; filtering by receive time is unchanged. ## Audit — NOT changed - `Store.QueryGroupedPackets` already sorts by `LatestSeen` (max observation timestamp), which is correct for the grouped view and immune to the buffered-upload regression. - `GetChannelMessages` and channel `sample_json` subqueries keep `first_seen DESC` — channel message chronology is meaningful for message UX; if buffered uploads become a problem here too it's a separate UX call (out of scope for #1345). - `s.packets` insertion ordering (Load + ingest) — untouched. The fix sorts at query time so we don't perturb `oldestLoaded` invariants. ## Tests — TDD red → green - Red: `508f4371` adds `cmd/server/packets_order_test.go` with two cases — order assertion (failed on master with `[fresh, buffered]`) and since-filter semantic (RFC3339 path uses observation timestamps). - Green: `0fd685e7` switches the SQL + in-memory ordering. Tests pass; full `cmd/server` suite green locally (44s). ## Out of scope - Re-thinking #1233's first_seen semantics - Adding a UI sort toggle (issue's option 2) - Channel-message page ordering ## Preflight Clean (`bash ~/.openclaw/skills/pr-preflight/scripts/run-all.sh origin/master`). --------- Co-authored-by: openclaw-bot <bot@openclaw.local>	2026-05-25 22:32:00 -07:00
Kpa-clawbot GitHub openclaw-bot	dc6c79cff8	fix(mqtt): watchdog forces paho reconnect on stall — recovers from half-open TCP (closes #1335 ) (#1336 ) RED `f06887` — GREEN `8f53c1`. CI: (will populate on PR open) `Fixes #1335` ## Problem PR #1216 added per-source stall detection (`LivenessStalled`) but only logged. Staging's `lincomatic` source has been silently losing ~14k pkts/hr behind a half-open TCP socket the Azure NAT abandons: paho reports `IsConnected==true`, no messages arrive for 1h+, container restart is the only known recovery. Prod (MikroTik networking) doesn't see it. ## Fix Make the watchdog actually recover. - `SourceLivenessState.ForceReconnectFn` — per-source closure wired in `main.go` next to `IsConnectedFn`, wraps `client.Disconnect(250) + client.Connect()`. - `processLivenessTransition` — on the `LivenessStalled` edge AND on every heartbeat re-emit while still Stalled, invoke `maybeForceReconnect`. `LivenessNeverReceived` (cold-start ACL deny / wrong hash) is deliberately not force-reconnected — a new TCP socket won't fix an ACL deny and would just churn the broker. - `maybeForceReconnect` — throttled at `forceReconnectThrottle = 60s` per source so a stall→reconnect→re-stall loop self-recovers without hammering the broker. The Disconnect+Connect runs in a goroutine so a single slow source can't stall the watchdog tick. - `buildMQTTOpts` — explicit `SetKeepAlive(30 * time.Second)`. paho's default happens to be 30s, but the #1335 RCA called this out — making it explicit so it can't drift and so operators reading the code know it's intentional. - Telemetry — `WATCHDOG forcing reconnect` (intent), `WATCHDOG reconnect attempt issued` (post-goroutine), `WATCHDOG suppressing forced reconnect` (throttle window). ## TDD - RED `f06887` — `mqtt_watchdog_force_reconnect_test.go`. Stub field + constant added so the file compiles; assertions fail because `processLivenessTransition` never invokes `ForceReconnectFn`. Reverting just the `s.ForceReconnectFn()` call line from GREEN re-fails the same assertion (mutation verified). - GREEN `8f53c1` — wiring + throttle + keepalive. ## Scope discipline Additive only. No regression to currently-flowing sources: `LivenessOK`, `LivenessRecovered`, `LivenessDisconnected`, `LivenessHeartbeat`, and `LivenessNeverReceived` transitions are unchanged. Throttle bound = ≤1 reconnect/min/source = ≤60/hr worst-case across all sources, well within any broker rate limit. Preflight: clean (all gates pass). --------- Co-authored-by: openclaw-bot <bot@openclaw.local>	2026-05-25 22:31:56 -07:00
Kpa-clawbot GitHub openclaw-bot	88bc5d9d3b	fix(#1373 ): drop ghost "unknown" channel bucket from /api/channels for encrypted-no-key packets (#1377 ) ## What Drops the ghost `unknown` channel bucket from `/api/channels` for encrypted GRP_TXT packets whose decoded JSON sets `channel=""` (server has no PSK to decrypt). Fix A from issue #1373 — cosmetic / immediate. Fix B (server-side decryption / key sharing) is intentionally out of scope and remains for a follow-up issue. ## Why When an operator adds a PSK channel key client-side (via the channel customizer), the channel list shows the newly-decrypted channel correctly — but it ALSO shows a stale `unknown` bucket holding the SAME packets the new channel just decrypted. The bucket is a server-side debug catch-all (`if channelName == "" { channelName = "unknown" }`) that leaks into the user-facing channel list. It's not a real channel; dropping it from `/api/channels` is the right fix until/unless server-side decryption lands. Choice made: keep the `channelName = "unknown"` fallback path removed by adding an early `continue` BEFORE the bucket is created. This keeps the diff minimal, preserves the `hasGarbageChars` filter ordering, and makes the intent obvious ("encrypted-no-key packets are not channels"). The DB path (`cmd/server/db.go`) already filters NULL `channel_hash` at the SQL level and `continue`s on empty; the test pins that contract. ## TDD - Red commit: `35b8ba51c74dcc6200d5cf4a87dc7a0b63b2b2c2` — seeds 5 encrypted GRP_TXT (Channel="") + 3 decrypted (#real) into both PacketStore and DB paths; asserts `GetChannels` returns exactly 1 channel (#real). Fails on assertions, not compile. - Green commit: see follow-up commit on this branch — drops the `"unknown"` fallback in `cmd/server/store.go` `GetChannels`; DB path unchanged (already correct, test pins it). ## Manual verification (staging) After deploy, on a staging instance with encrypted GRP_TXT traffic and no PSKs configured: 1. `curl -s https://staging/api/channels \| jq '[.[] \| select(.name == "unknown")] \| length'` → `0` 2. Real channels with known hashes still appear with correct messageCount. ## Files changed - `cmd/server/store.go` — drop the `if channelName == "" { channelName = "unknown" }` fallback; skip the packet instead. - `cmd/server/channels_no_unknown_bucket_1373_test.go` — new test covering both code paths. Fixes #1373 --------- Co-authored-by: openclaw-bot <bot@openclaw.local>	2026-05-25 22:16:14 -07:00
Kpa-clawbot GitHub corescope-bot	0f7cce3a5f	fix(#1370 ): revert ingestor envelope-timestamp path — server ingest time for packet/observation storage (counters #1233 ) (#1372 ) ## Summary Reverts the part of PR #1233 (commit `498fbc03`) that routed the MQTT envelope's `timestamp` field into `PacketData.Timestamp` for `transmissions.first_seen` and `observations.timestamp`. Packet ordering is restored to server ingest time — the client clock is untrusted. `UpsertObserverAt` + `MAX(MIN(existing, ingestNow), rxTime)` for observer/node `last_seen` (PR #1233's other half) is preserved unchanged. `parseEnvelopeTime` / `resolveRxTime` helpers are preserved — they still feed the observer.last_seen path. ## Diagnosis — Voodoo3 tx 304114 on staging Staging `tx_id = 304114` in channel `#test` has 5 observations: \| # \| observer \| reported timestamp \| comment \| \|---\|-----------\|--------------------\|---------\| \| 1 \| Voodoo3 \| 18:42 \| broken client RTC — ingested first, locks `first_seen` \| \| 2 \| Voodoo3 \| 18:42 \| broken client RTC \| \| 3 \| Voodoo3 \| 18:42 \| broken client RTC \| \| 4 \| Voodoo3 \| 18:42 \| broken client RTC \| \| 5 \| other obs \| 01:42 \| genuine receive time \| 4 of 5 observations carry stale 18:42 timestamps from Voodoo3's own broken clock. Because Voodoo3 ingested first, PR #1233's code wrote `transmissions.first_seen = 18:42` (envelope value). Downstream aggregators that compute `MAX(first_seen)` per channel saw 18:42 as the latest activity, and `/api/channels` for `#test` displayed `lastActivity` ~7h+ in the past plus a stale heartbeat in the row preview — hiding the genuinely-newest message (Voodoo3's `tst hmdpt` at 01:42). ## Why PR #1233's premise fails PR #1233 assumed: > Uploaders stamp `timestamp` when the radio receives the frame and > freeze it; the MQTT message is published late, but the timestamp > field is not re-stamped at publish. A buffered packet uploaded > hours late still carries its true receive time. That holds ONLY when the uploader's wall clock is correct. Observers in the field (Voodoo3 here, surely others) have broken local clocks. Their envelope timestamps are not a true receive time — they're a broken-clock receive time, which is just garbage with extra steps. The server clock is the only one we control, so packet ordering must use it. ## Fix ### `cmd/ingestor/db.go` - `BuildPacketData`: `PacketData.Timestamp = time.Now().UTC().Format(time.RFC3339)`, NOT `msg.Timestamp`. Docstring updated to cite #1370 and explain why `msg.Timestamp` is no longer read here. ### `cmd/ingestor/main.go` - Channel-companion path: `Timestamp: ingestNow` (was `rxTime`). - DM-companion path: `Timestamp: ingestNow` (was `rxTime`). - Local `rxTime := resolveRxTime(msg, tag)` removed from both paths (no remaining consumers in those scopes). ### Preserved (NOT touched) - `resolveRxTime`, `parseEnvelopeTime` — still used by `handleMessage` to populate `mqttMsg.Timestamp` and to call `UpsertObserverAt`, which feeds `observer.last_seen` and `observer.last_packet_at`. - All three `MAX(MIN(existing, ingestNow), rxTime)` guards (#1233 observer.last_seen, observer.last_packet_at, node.last_seen). - `MQTTPacketMessage.Timestamp` struct field. ## Tests \| File \| Asserts \| \|------\|---------\| \| `cmd/ingestor/ingest_time_regression_1370_test.go` (3 cases) \| Raw-packet, channel-companion, and DM-companion `handleMessage` paths. Feed envelope `timestamp = T_now - 7h`; assert stored `transmissions.first_seen` (RFC3339) and `observations.timestamp` (epoch) are server wall clock (±5s). Each case fails on master under PR #1233's premise. \| ### Adjusted test - `cmd/ingestor/db_test.go::TestBuildPacketData` — PR #1233 had asserted `pkt.Timestamp == "2026-05-16T10:00:00Z"` (the envelope value propagating). Now asserts the opposite: `pkt.Timestamp` is non-empty AND is NOT the envelope value. Comment cites #1370 and why the expectation flipped. ### Verified still-green - `cmd/ingestor/rxtime_test.go` (`TestParseEnvelopeTime`, `TestResolveRxTime`) — helpers untouched, still cover envelope parsing for the observer.last_seen path. - `cmd/server/channels_message_order_1366_test.go` (#1366). - `cmd/server/db_channel_messages_perf_test.go` (#1368 perf budget). ## Commits - `a9b7efc3` — RED: 3 `handleMessage` assertion-fail tests + test name collision check. - `5a0891f0` — GREEN: revert envelope→PacketData.Timestamp plumbing in `cmd/ingestor/{db,main}.go` + flip `TestBuildPacketData`. Fixes #1370 --------- Co-authored-by: corescope-bot <bot@corescope.dev>	2026-05-25 19:56:49 -07:00
Kpa-clawbot GitHub CoreScope Bot bot openclaw-bot	c7ab5f3eb9	fix(#1366 ): channels view shows latest message time — backend emits LatestSeen, not FirstSeen (#1368 ) Red commit: `702d82eb5e` (CI: see Actions tab for fix/issue-1366) ## What Channel view emits the max observation timestamp (`tx.LatestSeen`) instead of the analyzer's first-observation time (`tx.FirstSeen`) as the rendered `timestamp` field. A new `first_seen` field is exposed alongside for debug surfaces. `sender_timestamp` continues to be returned in the JSON response but is intentionally NOT used as the rendered time (client clocks are unreliable). ## Root cause Two parallel call sites both emitted the wrong field: - `cmd/server/store.go` — `GetChannelMessages` (~line 4807): set `entry.Data["timestamp"] = strOrNil(tx.FirstSeen)` for every new dedup entry. `tx.FirstSeen` is the analyzer's first-ever observation time of a `transmissions.hash` row; for heartbeat-style packets (e.g. `BlorkoBot 🤖` posting the same status line periodically), the hash is stable, so FirstSeen stays pinned at the very first observation while the message keeps retransmitting hours later. Operator sees "old" message timestamps for live messages. - `cmd/server/db.go` — `GetChannelMessages` (~line 1757): same problem against the SQLite-backed query path. Used `nullStr(fs)` (where `fs` is `t.first_seen`) for the `timestamp` field. ### Repro from staging Same packet, same hash `aba4f0493249de57`, sender `BlorkoBot 🤖`: - `/api/channels/%23test/messages` → `timestamp: "2026-05-25T15:53:20Z"` (FirstSeen, 7h+ in the past) - `/api/packets?hash=aba4f0493249de57` → `first_seen: "2026-05-25T22:53:19Z"` (latest obs), `observation_count: 84` The packets view used max-obs correctly; the channels view did not. 7h gap matches operator screenshot. ## TDD red → green Red: `cmd/server/channels_message_order_1366_test.go` — three tests: - `TestChannelMessages_TimestampUsesLatestSeen`: seeds a CHAN tx with observations 7h apart, asserts returned `timestamp` ≈ latest observation epoch (±1s). Fails under FirstSeen with Δ=−25200s. - `TestChannelMessages_TimestampNotSenderTimestamp`: seeds a CHAN tx whose decoded `sender_timestamp` is year-2000 (bad RTC). Asserts the rendered `timestamp` parses to current year — guards against the tempting "just use sender_timestamp" alt-fix that would let bad client clocks corrupt the view. - `TestChannelMessages_TimestampIsUTCZ`: asserts the emitted string is unambiguously UTC (suffix `Z` or `+00:00`) so browsers don't apply a local-zone shift. Green commit changes: - `store.go`: emit `tx.LatestSeen` (with FirstSeen fallback if no obs); add `first_seen` field. - `db.go`: join `o.timestamp` per-observation, track max epoch per tx, emit RFC3339 UTC at the end; add `first_seen` field. `sender_timestamp` remains in the response — unchanged shape, frontend never read it for the rendered time (verified: only `msg.timestamp` is consumed in `public/channels.js:1902`). ## Manual verification (post-merge) 1. Deploy to staging. 2. Curl `/api/channels/%23test/messages?limit=5` and `/api/packets?hash=<recent>`. The channel `timestamp` field MUST equal the packets `first_seen` (max obs) for the same hash, NOT lag it. 3. Send a fresh GRP_TXT via a MeshCore client into a watched channel. Within 15s, refresh the Channels view at `/channels`. The new message MUST render at the bottom with the correct (current) time. ## Why not `sender_timestamp`? It's a per-client field, decoded from the payload. Many MeshCore firmware builds run without RTC/NTP/GPS and report bogus values. Trusting it for display would propagate bad client clocks into the analyzer UI — the analyzer is the source of truth for UTC, not the client. Fixes #1366 --------- Co-authored-by: CoreScope Bot <bot@corescope> Co-authored-by: bot <bot@kpa-clawbot.dev> Co-authored-by: openclaw-bot <bot@openclaw.local>	2026-05-25 17:45:32 -07:00
Kpa-clawbot GitHub bot corescope-bot	de583f9df4	fix(paths-through): use canonical resolved_path instead of naive prefix match — fixes wrong-node attribution (#1352 ) (#1353 ) ## Summary `/api/nodes/{pk}/paths` (paths-through-node) attributed the same transmission to every prefix-sibling when their hop bytes collided (e.g. 5 nodes with `c0…` on staging). Querying any of them returned the tx — visible bug per #1352 where Kpa Roof Solar's view included a packet whose actual relay was C0ffee SF. ## Root cause `handleNodePaths` has two branches: 1. Canonical resolved_path branch (#1278) — when a tx has a persisted `resolved_path`, membership is decided from the stored pubkeys. This branch is correct. 2. Fallback branch — when `resolved_path` is NULL/missing, the code invoked `pm.resolveWithContext(hop, []string{lowerPK}, graph)` to re-resolve hops. The `hopContext=[lowerPK]` anchors the resolver on the queried target, so the tier-2 (geo-proximity) / tier-3 (GPS+observation-count) tiers preferentially pick the target. Every `paths-through-X` call for any `X` in the sibling set then resolved the colliding hop to `X` and counted the tx — wrong-node attribution across the whole sibling set. ## Fix Server-side, query-time only. No DB writes (`#1289` read-only invariant preserved). No canonical-branch changes — only the fallback path. In the fallback branch, accept a biased-resolver match as evidence of target membership only when either: - (a) the tx is already pre-confirmed via the resolved_path index hit or SQL `INSTR(resolved_path, pubkey)` check, or - (b) the hop's prefix candidate set is unique (`len(pm.m[hop]) <= 1`) — no collision, no bias possible. Multi-candidate prefix hops without independent SQL/index confirmation are now treated as ambiguous and excluded from paths-through. Same rule applied to the unresolvable-hop sub-case (when `resolveHop` returns nil but the prefix could match the target). ## Which canonical resolved_path source is used This PR does not introduce a new resolved_path source. It piggybacks on what's already in place: - Canonical branch: `s.store.fetchResolvedPathForTxBest(tx)` → SQLite `observations.resolved_path` (populated upstream by the hop-disambiguator from #1198/#1200/#1235). - Pre-confirmation in fallback: `confirmedByFullKey` (membership index `s.store.byPathHop[lowerPK]`) and `confirmedBySQL` (`s.store.confirmResolvedPathContains` → `INSTR(LOWER(resolved_path), "pubkey")`). So when canonical data exists, attribution is purely persisted-path driven; when it doesn't, attribution requires either a SQL pubkey hit or a unique prefix candidate. Biased resolution alone is no longer sufficient. ## TDD — red, then green Two new tests in `cmd/server/paths_through_collision_1352_test.go`: 1. `TestHandleNodePaths_PrefixCollision_1352` — canonical branch (already green via #1278). 3 nodes share `c0`, tx canonical resolved_path = [B]. Only paths-through-B includes the tx. 2. `TestHandleNodePaths_PrefixCollision_1352_FallbackBranch` — red before the fix. 3 GPS-having `c0` siblings, NULL resolved_path. Before: A=1 B=1 C=1 (wrong-node attribution on all). After: ≤1 attribution. Mutation: reverting the `len(pm.m[hop]) <= 1` guard in `routes.go` restores the failing red state. Existing tests preserved: - `TestHandleNodePaths_PrefixCollisionExclusion` (#929) — still green. - `TestHandleNodePaths_AnchorBiasInconsistency_Issue1278` (#1278) — still green. - Full `go test ./...` on `cmd/server` and `cmd/ingestor`: green. ## Acceptance criteria (from #1352) - [x] On node detail for Kpa Roof Solar-shape, packet where actual relay is C0ffee SF does NOT appear in paths-through (canonical branch test). - [x] On node detail for C0ffee SF-shape, that same packet DOES appear (canonical branch test). - [x] Ambiguous fallback case (NULL resolved_path, multi-prefix-collision) attributes to ≤1 node (fallback test). - [x] Mutation test: removing the uniqueness guard makes the fallback test fail. ## Out of scope - Frontend UX for "ambiguous (N candidates)" badge (separate UX issue). - Wider hop-disambiguator changes (#1198 family). Fixes #1352 --------- Co-authored-by: bot <bot@example.com> Co-authored-by: corescope-bot <bot@corescope>	2026-05-25 06:03:10 +00:00
Kpa-clawbot GitHub openclaw-bot	eeddf46bc9	fix(ingestor): neighbor-builder delta scan + watermark — recovers 97% packet loss from #1289 (fixes #1339 ) (#1341 ) ## Summary PR #1289 moved neighbor-graph construction into the ingestor with a 60s ticker. `buildAndPersistNeighborEdges` then issued an unbounded `SELECT … FROM observations o JOIN transmissions t …` every tick. On staging (3.7M observations) one tick took ~2 minutes; with `max_open_conns=1`, the SQLite single-writer was held continuously and MQTT ingest collapsed (~6,500 tx/day → ~180 tx/day, 97% loss). ## Fix Watermark-bounded delta scan. Each call derives the watermark from `MAX(neighbor_edges.last_seen)` and restricts the SELECT to `WHERE o.timestamp > ? ORDER BY o.timestamp LIMIT 50000`. `neighbor_edges` itself is the persistence — no new metadata table, no in-memory state, restarts resume cleanly from whatever the table reflects. - Empty edges table → watermark 0 → full warm-up scan (preserves #1289's synchronous warm-up intent). - Warm-up loops the builder until a call returns fewer than the batch cap, so the first server snapshot load sees a fully-populated table even on fresh DBs. - 50k batch cap stops any single tick from monopolising the writer; a backlog drains over successive ticks. - Per-tick wallclock is logged (`tick: N edges in DUR`); a tick >5s is logged loudly as a possible regression of #1339. Broader instrumentation is tracked in #1340. - Output schema unchanged — server's `neighbor_recomputer.go` is unaffected. ## Trade-off An anomalously-old observation that arrives after its timestamp has been crossed by the watermark will be skipped. Acceptable for an approximate neighbor graph; a periodic full-rebuild can land later if needed. ## TDD - RED (`d88e2522`): `TestNeighborEdgesBuilderDeltaScan` seeds 100k observations, asserts an empty-delta tick is a no-op (<1s), and a 100-row delta is upserted in <500ms with no rescan of baseline rows. Baseline builder fails the empty-delta assertion (sees all 200k baseline edges). - GREEN (`cf6fbb4e`): watermark + LIMIT — all assertions pass. - Mutation: revert the `WHERE o.timestamp > ?` clause → the test hangs to lock-contention timeout, confirming the WHERE actually gates the behavior. ## Benchmark (synthetic, 100k observations, local sqlite) \| \| Scan duration \| \|---\|---\| \| Baseline builder, full scan every tick \| ~40s \| \| Patched builder, empty-delta tick \| <50ms \| \| Patched builder, 100-row delta \| <50ms \| Staging projection: 2–3 min ticks → <1s ticks; SQLite writer freed for MQTT ingest. Fixes #1339 --------- Co-authored-by: openclaw-bot <bot@openclaw.local>	2026-05-23 20:54:16 -07:00
Marcel VerdultandGitHub	498fbc0321	fix: ingestor uses ingest-time now() instead of observer receive time (#1233 ) ## Problem The ingestor stamps every stored packet with its own ingest-time `time.Now()` (`BuildPacketData` in `db.go`; channel/DM paths in `main.go`), discarding the observer receive time the uploader already puts in the MQTT envelope's `timestamp` field. `MQTTPacketMessage` had no `Timestamp` field and `handleMessage` parsed every envelope field except that one. Observers that buffer packets offline and upload hours later get every buffered packet displayed at upload time, not receive time — a 5-hour deferred upload shows packets 5 hours late. Retained messages and broker backlog hit the same skew. ## Why the envelope timestamp is trustworthy Uploaders stamp `timestamp` when the radio receives the frame and freeze it; the MQTT message is published late, but the `timestamp` field is not re-stamped at publish. A buffered packet uploaded hours late still carries its true receive time. ## Fix New `resolveRxTime` helper reads `msg["timestamp"]` and falls back to `time.Now()` only when it is missing, unparseable, or implausibly in the future. Applied to all three ingest paths (raw packet, channel, DM). No wire-format change — the field already exists. Channel/DM dedup hashes intentionally stay on ingest time, since those bridge messages carry no real packet hash and need ingest-unique input. ## Observer/node last_seen correction Packet timestamps must reflect receive time, but observer/node `last_seen` must not. `InsertTransmission` fed `data.Timestamp` (now rxTime) into `observers.last_seen` and `UpsertNode`'s `last_seen`, so a buffered upload could drag both fields backwards, and retained-message replay on MQTT reconnect could flash long-offline observers as Online. - `UpsertObserverAt` takes an explicit `lastSeen`; the status-packet and BLE companion handlers pass the resolved rxTime. `UpsertObserver` keeps its wall-clock behaviour for other callers. - All three `last_seen` writes are guarded with `MAX(MIN(existing, ingestNow), rxTime)`: `last_seen` never moves backwards from a stale retained message, and never locks in a future value. ## Naive UTC+N timestamps `resolveRxTime` rejects a timestamp only when it is >14h ahead (UTC+14 is the maximum standard offset — anything further is a genuine clock error). A timestamp that is merely in the future is soft-clamped to ingest time: a future rxTime means a live packet from a UTC+N observer whose naive local clock parses as-if UTC, not a buffered packet, so ingest time is correct and no future timestamp reaches the DB. For buffered packets from naive-clock uploaders a bounded residual offset remains (equal to the observer's UTC offset); uploaders emitting zone-aware ISO8601 everywhere would be the full cure but is a separate format change. ## Test `cmd/ingestor/rxtime_test.go` covers `parseEnvelopeTime` (zone-aware, naive, microseconds, garbage, empty) and `resolveRxTime` (plausible past used verbatim, missing/garbage/future → ingest-time fallback). The existing `TestBuildPacketData` is updated to supply an envelope timestamp and assert it propagates, since `BuildPacketData` no longer self-stamps.	2026-05-23 11:22:51 -07:00
Kpa-clawbot GitHub openclaw-bot	d9ba9937a6	fix(dbschema): canonical source for optional column migrations — fixes startup race (closes #1321 ) (#1322 ) Red commit `2a8102b9` (failing test) → green commit `bb957c9f`. CI: https://github.com/Kpa-clawbot/CoreScope/actions/workflows/ci.yml?query=branch%3Afix%2Fissue-1321 Fixes #1321. ## Why On staging `/api/scope-stats` 500'd with `scope_name column not present` despite the ingestor adding the column ~0.5s after server startup. `cmd/server/db.go detectSchema()` runs in `OpenDB` and caches `hasScopeName`/`hasDefaultScope`/`hasObsRawHex` booleans. With supervisord launching server + ingestor simultaneously, the server's PRAGMA can fire BEFORE the ingestor's `ALTER TABLE` completes — and the boolean stays false until the server restarts. Same race class as #1283; #1289 moved server-side ensures to `dbschema` but the optional columns the ingestor still owned were left out. ## Fix — option (c) from the issue Made `internal/dbschema/dbschema.go` the single source of truth for the optional columns the server detects. Migrations moved from `cmd/ingestor/db.go applySchema` into `dbschema.Apply`: - `transmissions.scope_name` + `idx_tx_scope_name` partial index - `nodes.default_scope` - `inactive_nodes.default_scope` - `observations.raw_hex` `AssertReady` now asserts every one of those columns. The server cannot start with stale-false booleans because `AssertReady` will fatal first if the columns are missing. The ingestor's old gated blocks are replaced with pointer comments so anyone hunting for them lands in `dbschema.go`. The `_migrations` marker rows are preserved (`INSERT OR IGNORE`) to keep legacy DBs idempotent. Documented invariant in the package doc: any new optional column the server PRAGMA-detects belongs in `internal/dbschema/dbschema.go`, NOT in `cmd/ingestor/db.go applySchema`. ## Tests Added `internal/dbschema/dbschema_test.go` (RED in `2a8102b9`): - `TestApplyAddsOptionalColumns_CanonicalSource` — post-`Apply`, all four columns must exist. - `TestAssertReady_RequiresOptionalColumns` — `AssertReady` must refuse a DB missing them AND pass after full `Apply`. `cmd/ingestor` and `cmd/server` full suites green. --------- Co-authored-by: openclaw-bot <bot@openclaw.local>	2026-05-23 08:33:21 -07:00
efiten GitHub Claude Sonnet 4.6 Kpa-clawbot openclaw-bot	317b59ab10	feat: area-based visual node filter — attribute packets by transmitter GPS (#804 ) (#839 ) ## Summary - Adds configurable GPS polygon areas to `config.json`; nodes are attributed to an area if their last-known position falls inside the polygon - New `Area: …` dropdown filter (matching the existing region filter style) appears on all analytics, nodes, packets, map, and live screens when areas are configured - Backend resolves area membership with a 30s TTL cache; area filter bypasses the 500-node cap on `/api/bulk-health` so all area nodes are always returned - Includes a polygon builder tool (`/area-map.html`) for drawing and exporting area boundaries ## Changes Backend - `AreaEntry` type + `Areas` config field - `GetNodePubkeysInArea` DB query + `resolveAreaNodes` (30s TTL, `areaNodeMu` RWMutex) - `PacketQuery.Area` + `filterPackets` polygon check - `?area=` param propagated through all analytics, topology, clock-health, and bulk-health routes - `/api/config/areas` endpoint Frontend - `area-filter.js`: single-select dropdown, persists to localStorage, cleans up stale keys on load - Wired into analytics, nodes, packets, channels, map, and live pages - Live map clears node markers on area change Docs & tools - `docs/user-guide/area-filter.md` — configuration and usage guide - `docs/api-spec.md` — updated with new endpoint and `?area=` param table - `tools/area-map.html` — polygon builder for defining area boundaries - Demo areas added to `config.example.json` ## Test plan - [x] No areas configured → filter dropdown does not appear on any page - [x] Areas configured → dropdown appears, "All" selected by default - [x] Selecting an area filters nodes/packets/topology/map correctly - [x] Selecting "All" restores unfiltered view - [x] Selection persists across page reloads (localStorage) - [x] Stale localStorage key (area removed from config) is cleared on load - [x] `/api/bulk-health?area=X` returns all nodes in area (no 500-node cap) - [x] `/api/config/areas` returns correct list 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Kpa-clawbot <kpaclawbot@outlook.com> Co-authored-by: openclaw-bot <bot@openclaw.local>	2026-05-21 14:00:15 -07:00
efiten GitHub Claude Sonnet 4.6 Kpa-clawbot openclaw-bot	2329639f45	feat: scoped/unscoped transport-route statistics (#899 ) (#915 ) @ ## What this PR does Implements region-scoped transport-route packet tracking with two sub-features: ### Feature 1 — Scope statistics (`scope_name`) - At ingest, transport-route packets (route_type 0/3) with Code1 != `0000` are HMAC-matched against configured `hashRegions` keys (mirroring the `hashChannels` pattern). Matched region name (or `""` for unknown) stored in new `transmissions.scope_name` column via migration `scope_name_v1`. - New `GET /api/scope-stats?window=` endpoint (1h/24h/7d, 30s server-side TTL) returning transport totals, scoped/unscoped counts, per-region breakdown, and time-series. - New Scopes tab in Analytics with summary cards, per-region table, and two-line SVG chart. Auto-refreshes every 60s. ### Feature 2 — Node default scope (`default_scope`) - Per-node `default_scope` column on `nodes`/`inactive_nodes` (migration `nodes_default_scope_v1`) tracks the most recently matched region for each node, derived from transport-scoped ADVERT packets. - `GET /api/nodes` response includes `default_scope` field when column is present. - Node detail panel displays the default scope badge. - Async startup backfill (`BackfillDefaultScopeAsync`) populates the column for nodes with pre-existing ADVERT data. ### Config Add `hashRegions` to `config.json` (see `config.example.json`). One entry per region name (with or without leading `#`). @ --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Kpa-clawbot <kpaclawbot@outlook.com> Co-authored-by: openclaw-bot <bot@openclaw.local>	2026-05-21 14:00:06 -07:00
efiten GitHub Claude Sonnet 4.6 OpenClaw Bot efiten-bot	caf3851ff8	feat(server): add opt-in HTTP gzip and WebSocket permessage-deflate compression (#934 ) ## Summary - Adds `"compression": {"gzip": true, "websocket": true}` config option (both `false` by default — no behavior change) - HTTP gzip middleware wraps the entire router; skips WebSocket upgrade requests and clients without `Accept-Encoding: gzip` - WebSocket permessage-deflate enabled via `hub.upgrader.EnableCompression` when `websocket: true` - `CompressionConfig` struct and `GZipEnabled()` / `WSCompressionEnabled()` helpers on `Config` - `Hub.upgrader` moved from package-level var to struct field so tests using `NewHub()` don't need changes ## Why opt-in / off by default Operators behind a reverse proxy that already compresses (nginx, Caddy with `encode gzip`) should leave this off to avoid double-compression. Only enable when the proxy does not compress. ## Test plan - [x] `TestCompressionConfigDefaults` — both helpers return false when `Compression` is nil - [x] `TestCompressionConfigExplicitFalse` — both helpers return false when set to false - [x] `TestCompressionConfigEnabled` — both helpers return true when set to true - [x] `TestGZipMiddlewareCompresses` — response body is valid gzip, headers set correctly - [x] `TestGZipMiddlewareSkipsNoAcceptEncoding` — passthrough when client doesn't send Accept-Encoding: gzip - [x] `TestGZipMiddlewareSkipsWebSocket` — WebSocket upgrades are never gzip-wrapped All 6 tests pass (`go test ./...` in `cmd/server`). 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: OpenClaw Bot <bot@openclaw.local> Co-authored-by: efiten-bot <bot@efiten.dev>	2026-05-21 11:39:49 -07:00
efiten GitHub Claude Sonnet 4.6 openclaw-bot	ba6c2ac6ba	feat: repeater liveness indicator with relay stats (#662 ) (#755 ) ## Summary - Backend: adds `relayTimes` in-memory index (sorted unix-millis per repeater pubkey), maintained in lockstep with `byPathHop`. Populated at startup from all packet observations (not just best), updated on ingest/evict/backfill. Exposes `relay_count_1h`, `relay_count_24h`, `last_relayed` in both `/api/nodes` (for repeaters) and `/api/nodes/{pubkey}/health`. - Frontend: `getNodeStatus` extended to three-state (`relaying` / `active` / `stale`) for repeaters based on relay_count_24h. `getStatusInfo` is the single source of truth for status label, explanation, and relay stats. Detail pane shows relay counts and last relayed time. Nodes list gets a status emoji column with hover tooltip showing relay info. - Correctness fixes: relay index scans all observations per packet (not just best); backfill now updates relay index after resolving paths; pubkeys lowercased consistently throughout index. ## Changes ### `cmd/server/store.go` - `relayTimes map[string][]int64` field added to `PacketStore` - `addTxToRelayTimeIndex` / `removeFromRelayTimeIndex`: scan all observations, idempotent sorted insert, lowercase keys - `relayMetrics(times, nowMs)`: returns `(count1h, count24h, lastRelayed)` - `buildPathHopIndex`: populates `relayTimes` at startup - `pollAndMerge`: updates relay index on ingest and eviction; new `else` branch for path-unchanged observations - `addTxToPathHopIndex` / `removeTxFromPathHopIndex`: lowercase resolved pubkeys (fixes casing mismatch with lookup) ### `cmd/server/routes.go` - `GetBulkHealth` / `GetNodeHealth`: include relay stats for repeater nodes - `handleNodes`: enriches repeater nodes with relay stats from `relayTimes` so list view has same data as detail pane ### `cmd/server/neighbor_persist.go` - `backfillResolvedPathsAsync`: calls `addTxToRelayTimeIndex` after `pickBestObservation` to capture newly resolved pubkeys ### `public/roles.js` - `getNodeStatus(role, lastSeenMs, relayCount24h)`: three-state logic for repeaters - `getStatusInfo(n)`: single source of truth returning status, label, explanation, relay counts, last relayed ### `public/nodes.js` - Detail pane: `n.stats` populated from health endpoint before `getStatusInfo` call - Nodes list: status emoji column with relay hover tooltip; status filter uses `getStatusInfo` ### Tests - `relay_liveness_test.go`: index functions, relay metrics, wiring integration, bulk/single health endpoints - `test-repeater-liveness.js`: three-state frontend logic, backward compat ## Test plan - [x] Repeater with recent relay traffic shows green relaying emoji in list and detail pane - [x] Repeater with no relay traffic in 24h shows yellow idle in both views - [x] Repeater not heard recently shows grey stale in both views - [x] Non-repeater nodes unaffected (no relay stats, no status change) - [x] Hover tooltip on list emoji shows relay count and last relayed time - [x] `go test ./...` passes - [x] `node test-repeater-liveness.js` passes 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: openclaw-bot <bot@openclaw.local>	2026-05-21 11:39:43 -07:00
efiten GitHub Claude Sonnet 4.6 openclaw-bot	38eb7103b3	perf(nodes): batch relay stats to fix O(N×M) /api/nodes regression (#1164 ) ## Problem `handleNodes` enriches each repeater/room node by calling `GetRepeaterRelayInfo` and `GetRepeaterUsefulnessScore` per node inside a loop. `GetRepeaterUsefulnessScore` acquires `s.mu.RLock()` and then iterates all `byPayloadType` entries to compute the non-advert denominator — once per node. On a deployment with ~1500 repeater/room nodes and ~145K transmissions in memory, this is ~220M iterations per `/api/nodes` request, plus ~3000 separate lock acquisitions. Response times of 18–44 seconds have been observed in production, especially during startup backfill when write-lock contention compounds the issue. ## Fix Add `GetRepeaterNodeStatsBatch(pubkeys []string, windowHours float64) map[string]RepeaterNodeStats` to `repeater_usefulness.go`: - Takes one `s.mu.RLock()` for the entire node list - Computes the non-advert denominator once (shared across all nodes) - Snapshots `byPathHop` slice headers for all requested pubkeys under that single lock - Processes timestamps and counts outside the lock Update `handleNodes` to collect repeater/room pubkeys first, call the batch method once, and apply results. Complexity: O(M + N) instead of O(N × M) per request (M = total transmissions, N = repeater nodes). `GetRepeaterRelayInfo` and `GetRepeaterUsefulnessScore` are unchanged — they are still correct for single-node calls (e.g. `handleNodeDetail`). ## Test plan - [ ] `go build ./cmd/server` passes - [ ] `/api/nodes` response is correct (relay_active, relay_count_1h/24h, usefulness_score fields present for repeaters) - [ ] No change in output for `/api/nodes/{pubkey}` (uses existing single-node methods) - [ ] CI passes 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: openclaw-bot <bot@openclaw.local>	2026-05-20 20:57:02 -07:00
efiten GitHub Claude Sonnet 4.6	d0d1657b5c	fix: re-index relay hops in byNode after Load() picks best observation (#692 ) (#801 ) ## Problem `indexByNode()` was called during `Load()` immediately when each `StoreTx` was created — before observations were appended and before `pickBestObservation()` set `tx.ResolvedPath`. The resolved_path indexing branch added in #708 was effectively dead code on every server restart. Symptom: After any restart, `byNode[relay_pubkey]` was empty for relay-only nodes even when `resolved_path` was correctly persisted in the DB. Analytics showed `totalPackets = 0` for repeater nodes despite active relay traffic. ## Fix Call `s.indexByNode(tx)` again in the post-load loop after `pickBestObservation()`, where `ResolvedPath` is populated. Same fix applied to `backfillResolvedPathsAsync()`, which also called `pickBestObservation()` without re-indexing afterward. The dedup in `nodeHashes` prevents double-counting: pubkeys already indexed from decoded JSON fields are skipped; only the relay hop pubkeys from `resolved_path` are new additions. ## Test `TestLoadIndexesRelayHopsFromResolvedPath` — inserts a packet with `resolved_path` containing a relay pubkey that does not appear in `decoded_json`, calls `Load()`, and verifies `byNode[relay_pubkey]` is populated. ## Related Closes #692 (together with #707, #708, #711 already merged) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 20:56:54 -07:00
efiten GitHub Claude Sonnet 4.6	51f823bf7e	feat: one-click prune nodes outside geofilter (#669 M4) (#738 ) ## Summary - Adds `POST /api/admin/prune-geo-filter` endpoint — dry-run by default, `?confirm=true` to permanently delete nodes outside the current geofilter polygon + buffer. Requires `X-API-Key` header. - Adds Prune nodes section inside the GeoFilter customizer tab (write-access only, same `writeEnabled` gate as PUT). Preview lists affected nodes; Confirm delete removes them. - Adds `GetNodesForGeoPrune` and `DeleteNodesByPubkeys` DB helpers. - Updates `docs/user-guide/geofilter.md` — documents the UI button as primary workflow, CLI script as alternative. > Depends on M3 (`feat/geofilter-m3-customizer`, PR #736). Merge M3 first. ## Test plan - [x] `cd cmd/server && go test ./...` — all pass - [x] Customizer GeoFilter tab without `apiKey` — Prune section not visible - [x] With `apiKey` + polygon active — Prune section visible - [x] Preview returns list of nodes outside polygon (no deletions) - [x] Confirm delete removes nodes, list clears - [x] `POST /api/admin/prune-geo-filter` without `X-API-Key` → 401 - [x] `POST /api/admin/prune-geo-filter` with no polygon configured → 400 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 03:19:31 +00:00
Kpa-clawbot GitHub MeshCore Bot Kpa-clawbot corescope-bot	9383201c07	refactor(db): finish #1283 — Option 4: ingestor owns neighbor-graph + schema migrations; server is read-only (fixes #1287 ) (#1289 ) Red commit: https://github.com/Kpa-clawbot/CoreScope/commit/eae179b99b5fd34924547632aa8f8025c405aa53 (CI: pending — opens with this PR) Finishes #1283. RED test `TestServerSourceHasNoCachedRWCalls` goes from failing (13 writer call-sites) to GREEN (zero). Per #1287 Option 4 (https://github.com/Kpa-clawbot/CoreScope/issues/1287#issuecomment-4485099992): ingestor owns the neighbor graph build + persist; server reads the snapshot. Category A — Schema migrations → new `internal/dbschema` package. `dbschema.Apply(rw)` runs in `cmd/ingestor` startup (in `OpenStore`). `dbschema.AssertReady(ro)` runs in `cmd/server/main.go` and FATAL-LOG-EXITS if any expected column/index/table is missing — the operator must restart the ingestor first. Covers indexes, `neighbor_edges`, `observations.resolved_path`, `observers.{inactive,last_packet_at,iata}`, `(inactive_)nodes.foreign_advert`, `transmissions.from_pubkey`. Category B — Backfill → ingestor. `BackfillFromPubkey` and observer-blacklist soft-delete moved to `cmd/ingestor/maintenance.go`. Server keeps an inert `fromPubkeyBackfillSnapshot` stub for `/api/healthz` API compatibility. Category C — Neighbor-graph persistence (Option 4) → ingestor writes, server reads. - Ingestor (`cmd/ingestor/neighbor_builder.go`): every 60s scans `observations + transmissions`, extracts edges (originator↔first-hop for ADVERTs; observer↔last-hop for all), resolves hop prefixes via a node-table prefix index, upserts into `neighbor_edges`. - Server (`cmd/server/neighbor_recomputer.go`): every 60s re-reads `neighbor_edges` and atomic-swaps the resulting `NeighborGraph` into `s.graph`. Initial load is synchronous on startup. All server-side incremental edge writers (the two `asyncPersistResolvedPathsAndEdges` paths in `cmd/server/store.go`) are gone. - Neighbor-edge daily prune (`PruneNeighborEdges`) moved to ingestor. Why Option 4: clean read/write separation, no startup CPU spike (server loads existing snapshot instead of rebuilding from history), no IPC/delta-protocol churn. Staleness budget ~60s — same model as the analytics recomputers in #1240 / #1248 / #672 axis 2. Recomputer interval default for neighbor graph: 60s (`NeighborGraphRecomputerDefaultInterval`, `NeighborEdgesBuilderInterval`). Invariants added: - `TestServerSourceHasNoCachedRWCalls` (RED commit eae179b9): grep enforces zero `cachedRW(`, `mode=rw`, or `sql.Open(_journal_mode=WAL…)` in non-test `cmd/server/` sources. - `TestServerStartupRequiresMigratedSchema`: server refuses to start against an unmigrated DB. - `TestNeighborGraphRecomputerLoadsSnapshot`: post-write snapshot is picked up on the next refresh. - `TestNeighborEdgesBuilderUpsertsFromObservations`: end-to-end pipeline writes the expected edge. `grep cachedRW cmd/server/*.go \| grep -v _test.go` → 0 matches. Fixes #1287. --------- Co-authored-by: MeshCore Bot <bot@meshcore.local> Co-authored-by: Kpa-clawbot <Kpa-clawbot@users.noreply.github.com> Co-authored-by: corescope-bot <bot@corescope.local>	2026-05-19 23:53:41 -07:00
Kpa-clawbot GitHub bot	749fdc114f	feat(decoder+ui): close remaining P2 items from #1279 — payloadTypeNames, legend, TransportCodes, Feat1/2, RAW_CUSTOM, sensor docs (#1291 ) RED commit: `dc4c0800` — CI: https://github.com/Kpa-clawbot/CoreScope/actions?query=branch%3Afix%2Fissue-1279-p2 Closes the remaining six 🟢 P2 items in umbrella #1279 (PR #1280 shipped P0+P1, PR #1276 shipped ACK/RESPONSE/PATH legend rows). ### Item-by-item \| # \| Item \| Where \| Test \| \|---\|---\|---\|---\| \| 1 \| `payloadTypeNames` parity \| `cmd/server/store.go` \| `cmd/server/issue1279_p2_test.go::TestPayloadTypeNamesAll13` \| \| 2 \| Legend rows: Anon Req / Grp Data / Multipart / Control / Raw Custom \| `public/live.js` \| `test-issue-1279-legend-p2-e2e.js` (Playwright) \| \| 3 \| TransportCodes detail-row + `code1=` / `code2=` filter grammar \| `public/packets.js`, `public/packet-filter.js` \| `test-issue-1279-p2-code-filter.js` (6 cases) \| \| 4 \| Multibyte capability badge on node detail/list rows \| `public/nodes.js::renderNodeBadges` \| `n.hash_size >= 2` (observable Feat1/Feat2 proxy; firmware `AdvertDataHelpers.h:14-16`) \| \| 5 \| RAW_CUSTOM (0x0F) `{rawLength, firstByteTag}` decode + detail-row \| `cmd/server/decoder.go`, `cmd/ingestor/decoder.go`, `public/packets.js` \| `TestDecodeRawCustomExposesLengthAndTag` × 2 + updated `TestDecodePayloadRAWCustom` \| \| 6 \| Sensor advert telemetry firmware-derivation comments \| `cmd/ingestor/decoder.go:363-380` \| pure comments — exempt per AGENTS \| ### Firmware refs cited inline - `firmware/src/Packet.h:19-32` — PAYLOAD_TYPE_* constants - `firmware/src/Packet.h:46` — TransportCodes wire layout - `firmware/src/Mesh.cpp:577` — `createRawData` - `firmware/src/helpers/SensorMesh.{h,cpp}` — sensor advert telemetry derivation - `firmware/src/helpers/AdvertDataHelpers.h:14-16` — Feat1/Feat2 ### TDD Red `dc4c0800` proves the assertions gate behavior: - `payloadTypeNames` had only 12 entries (no 0x0F). - RAW_CUSTOM decoded as `UNKNOWN` with no envelope fields. Green `<HEAD>` makes both green; per-item tests included. ### Cross-stack note Cross-stack: justified — items 1/5 add decoder output fields; items 2/3/4/5 surface those fields in the UI in the same PR per #1279 acceptance. ### Out of scope Item 4 surfaces the observable multibyte capability via the persisted `hash_size` (Feat1/Feat2 wire bits are only on transient adverts and not stored per-node today); persisting raw Feat1/Feat2 per-node is left for a follow-up. Fixes #1279 --------- Co-authored-by: bot <bot@corescope>	2026-05-19 08:08:28 -07:00
Kpa-clawbot GitHub corescope-bot	467b01a1b3	fix(#1285 ): exclude RTC-reset outliers from clock-skew hash median + recent bad count (#1288 ) Red commit: `97c9a22a55` (CI: https://github.com/Kpa-clawbot/CoreScope/commit/97c9a22a55b07d1576c579aa9d23b290dad33eb6/checks) Fixes #1285. ## What was broken Bug A — outlier-dominated hash-evidence median. On the per-hash evidence panel a single observer reporting an RTC-reset advert (firmware emitting factory timestamp, ~700d off) dragged the displayed median to "median corrected: -704d 18h" even when every other observer of that hash saw a normal value. Bug B — false "N of last K had nonsense timestamps" warning. `recentBadSampleCount` lumped RTC-reset adverts in with "bimodal-bad" samples. On the repro node every recent skew was -16…-22s (healthy), but the lone RTC-reset advert that landed inside the recent window was counted as bad → "3 of last 5 adverts had nonsense timestamps" fired and the node was misclassified `bimodal_clock`. Root cause of B: the recent-window split (`cmd/server/clock_skew.go` ~L575) classified anything `\|corrected skew\| > 1h` as "bad". That conflates true bimodal RTC oscillation (1h…24h) with factory-timestamp resets (>24h, already surfaced via the RTC-reset badge). ## Fix - New `rtcResetOutlierThresholdSec = 24h`. Rationale: real µC drift is sub-second/advert; real bimodal RTC misbehaves in the hours range; anything >1d is not a drift signal. - Recent-window split puts `\|skew\| > 24h` in a third bucket excluded from both `recentSampleCount` and `recentBadCount`. - New `hashEvidenceMedian()` filters outliers before computing the per-hash median. UI labels the hash "insufficient data (N RTC-reset outliers excluded)" when every observer saw a reset-shaped advert. - Three pre-existing #845 tests used -50M-sec "bad" samples (RTC-reset range) — re-pointed to -7200s (true bimodal range), what `bimodal_clock` actually models. ## Preflight overrides - check-branch-clean: cross-stack: justified — backend computes counts/median; frontend renders the new label. ## Browser verification Confirmed staging node `c0dedad…` repro matches the test fixture. No new CSS vars. ## E2E assertion added `cmd/server/clock_skew_issue1285_test.go:81` and `:103`. --------- Co-authored-by: corescope-bot <bot@corescope.local>	2026-05-19 01:17:12 -07:00
Kpa-clawbot GitHub MeshCore Bot	1da2034341	refactor(db): move all writes from server to ingestor; server truly read-only (fixes #1283 ) (#1286 ) Red commit: `f6290b63` — CI run will appear at https://github.com/Kpa-clawbot/CoreScope/actions Fixes #1283. ## What Moves all four DB write operations out of `cmd/server/` into `cmd/ingestor/`, making the server truly read-only and eliminating the SQLITE_BUSY VACUUM bug at its root: the server can no longer race the ingestor for the write lock because the server has no write path. ## The four operations \| # \| Was in \| Now in \| \|---\|--------\|--------\| \| 1 \| `cmd/server/vacuum.go` (`checkAutoVacuum`, full VACUUM + `auto_vacuum=INCREMENTAL` migration) \| `cmd/ingestor/db.go` `Store.CheckAutoVacuum` (already existed; ingestor runs it at startup before the MQTT subscriber starts → no contention) \| \| 2 \| `cmd/server/db.go` `PruneOldPackets` (`DELETE FROM transmissions`) \| `cmd/ingestor/maintenance.go` `Store.PruneOldPackets` (new) + 24h ticker in `cmd/ingestor/main.go` \| \| 3 \| `cmd/server/db.go` `PruneOldMetrics` (`DELETE FROM observer_metrics`) \| `cmd/ingestor/db.go` `Store.PruneOldMetrics` (already existed) \| \| 4 \| `cmd/server/db.go` `RemoveStaleObservers` (`UPDATE observers SET inactive=1`) \| `cmd/ingestor/db.go` `Store.RemoveStaleObservers` (already existed) \| ## HTTP surface - Removed: `POST /api/admin/prune` (`handleAdminPrune`, route, openapi entry). Operators trigger an ad-hoc prune by restarting the ingestor. - Kept: `GET /api/backup` — uses `VACUUM INTO` which writes to a separate file, not the live DB; read-only-safe. ## Tests - `cmd/server/readonly_invariant_test.go` (RED gate) — reflect-asserts `PruneOldPackets`/`PruneOldMetrics`/`RemoveStaleObservers` are NOT methods on the server's `DB`. Fails on master, passes after this PR. - `cmd/ingestor/issue1283_test.go` — exercises `Store.PruneOldPackets` and the auto_vacuum=NONE → INCREMENTAL migration through `Store.CheckAutoVacuum` with `vacuumOnStartup=true`. ## Why the bug is gone The SQLITE_BUSY VACUUM failure happened because supervisord launched both ingestor + server in one container; the ingestor took the write lock for INSERTs and the server's `checkAutoVacuum` then failed to acquire it within `busy_timeout=5000`. After this PR, only the ingestor ever opens a writable connection, and it runs `CheckAutoVacuum` before* spawning the MQTT subscriber → no contention possible. ## Scope notes - `cachedRW()` still has three pre-existing callers in `cmd/server/` (`neighbor_persist.go`, `ensure_indexes.go`, `from_pubkey_migration.go`). These pre-date #1283 and are not in the issue's four-operation list. Leaving them for follow-up keeps this PR honest about scope; AGENTS.md documents the invariant so new write paths can't sneak in. - PII preflight reports false positives on the Go method name `requireAPIKey` in `routes.go` diff context — no real PII. - Server-side neighbor-edge prune (`PruneNeighborEdges`) intentionally left in place — out of scope of #1283. --------- Co-authored-by: MeshCore Bot <bot@meshcore.local>	2026-05-18 23:52:27 -07:00
Kpa-clawbot GitHub corescope-bot	d667dc0a74	fix(#1278 ): /api/nodes/{pk}/paths uses canonical persisted resolved_path (drop anchor-bias inconsistency) (#1282 ) First failing (RED) commit: `c994c5a7` — CI: https://github.com/Kpa-clawbot/CoreScope/actions Fixes #1278. ## Root cause `handleNodePaths` (`cmd/server/routes.go`) anchored the disambiguator with the queried node as `hopContext` (`hopContext := []string{lowerPK}`). For ambiguous short-prefix hops (e.g. two nodes sharing the 1-byte prefix `C0`), tier-1/2 hop-context resolution then biased the resolver to pick the queried node — even though the CANONICAL persisted `resolved_path` (what `/api/packets/{hash}` shows via `fetchResolvedPathForTxBest`) had picked the OTHER colliding node at ingest time. The `containsTarget` gate accepted those packets and rendered the queried node into the displayed hop, while the packets page (reading the canonical resolved_path) showed a different node. The two pages disagreed. Confirmed on staging: `/api/nodes/c0dedad…/paths` returned `sampleHash 6c4af39ee4b7e202`; `/api/packets/6c4af39ee4b7e202.resolved_path[3]` = `c0ffeec7…`, not `c0dedad…`. ## Option chosen — A For each candidate tx, read the canonical persisted `resolved_path` via `fetchResolvedPathForTxBest`. When present, use it for BOTH: - the `containsTarget` membership decision (queried pubkey must appear in the canonical resolved hops), and - the displayed hop names (zipped parallel to `tx.PathJSON`). When absent (older data / async backfill not yet complete) the legacy biased re-resolve is kept as a fallback — there's no canonical answer to be consistent with, and dropping the bias unconditionally would regress #1197. ## Why not B / C - B (drop bias only for membership): still re-resolves display with bias → display vs packets page can still diverge for hop names. Option A fixes both. - C (drop `hopContext` entirely): regresses #1197 / breaks the `resolve_context_callsites_test.go` gate. ## Performance Same O(N) walk over candidates; one extra `fetchResolvedPathForTxBest` per candidate, LRU-cached, worst case a single SQL row. ## Tests - RED: `cmd/server/paths_anchor_bias_test.go` — seeds two `c0…` nodes + a tx whose best-obs resolved_path picks the GPS node; asserts the no-GPS node's `/paths` excludes the tx and the GPS node's includes it. Mutation-verified (fails on master). - All existing tests green (including #1197 callsite gate and #929 prefix-collision exclusion). --------- Co-authored-by: corescope-bot <bot@corescope>	2026-05-18 23:19:30 -07:00
Kpa-clawbot GitHub OpenClaw Bot	e6c30e1a7e	feat(decoder): GRP_DATA + MULTIPART + advertRole fix + CONTROL flags (#1279 P0+P1) (#1280 ) Addresses the four P0+P1 firmware reconciliation gaps from the umbrella audit (issue #1279). RED commit: `0a4c084e` (asserts on stub returns; all 13 assertions fail). GREEN commit: `13867681`. ## What's in this PR ### P0 — silently dropped data - #1 GRP_DATA (0x06) decoder. Outer envelope is the same shape as GRP_TXT (`channel_hash(1)+MAC(2)+ciphertext`) per `firmware/src/helpers/BaseChatMesh.cpp:476,500`. Factored `decryptChannelBlock(...)` helper used by both 5 and 6. When a channel key matches, the inner is parsed per `firmware/src/helpers/BaseChatMesh.cpp:382-385` as `data_type(uint16 LE) + data_len(1) + blob(data_len)`. Surfaces `{channelHash, MAC, dataType, dataLen, decryptedBlob}` on decrypt or `{channelHash, MAC, encryptedData}` otherwise. Server-side decoder surfaces envelope only (no key store). - #2 MULTIPART (0x0A) decoder. Per `firmware/src/Mesh.cpp:289`, byte0 = `(remaining<<4) \| inner_type`. When `inner_type == PAYLOAD_TYPE_ACK (0x03)`, next 4 bytes are the LE ack_crc per `firmware/src/Mesh.cpp:292-307`. Surfaces `{remaining, innerType, innerTypeName, innerAckCrc \| innerPayload}`. ### P1 — mis-classified / opaque - #3 `advertRole()` raw-type fix. Per `firmware/src/helpers/AdvertDataHelpers.h:7-12`, ADV_TYPE_NONE = 0 and 5-15 are FUTURE. The previous boolean fallback collapsed both into `"companion"`, silently relabelling unknown/reserved types. New behaviour: type 0 → `none`, 1 → `companion`, 2-4 → `repeater`/`room`/`sensor`, 5-15 → `type-N`. `ValidateAdvert` accepts the new labels. - #4 CONTROL (0x0B) byte0 flags + length. Per `firmware/src/Mesh.cpp:69` + `createControlData` at `Mesh.cpp:609`, byte0 high-bit marks the zero-hop direct subset. Surfaces `{ctrlFlags, ctrlZeroHop, ctrlLength}`. ### Drift fix - `cmd/server/store.go` `payloadTypeNames` now includes `6: GRP_DATA` and `10: MULTIPART` (previously omitted; canonical decoder map already had them). ## Lockstep & TDD Both `cmd/ingestor/decoder.go` and `cmd/server/decoder.go` updated in the same commits — same wire-vector tests live in both packages (`cmd/{ingestor,server}/issue1279_test.go`). Per-item RED→GREEN visible in `git log`. \| Item \| Tests \| RED proof \| \|---\|---\|---\| \| #1 GRP_DATA \| ingestor: NoKey + DecryptedInner; server: Envelope \| 6 assertions failed pre-impl \| \| #2 MULTIPART \| ingestor + server: Ack + NonAck \| 8 assertions failed pre-impl \| \| #3 advertRole \| ingestor + server: 7-row table \| 3 assertions failed pre-impl \| \| #4 CONTROL \| ingestor + server: ZeroHop + MultiHop \| 6 assertions failed pre-impl \| ## What's NOT in this PR The umbrella issue lists P2 items that ship in follow-up PRs: - Live + compare legend entries for the long tail of newly-named types (#1274 + others). - TransportCodes UI surface + filter grammar. - feat1/feat2 capability badges. - `payloadTypeNames` consolidation across server/ingestor (drift-prevention). Leave the umbrella open after this merges. Refs #1279 --------- Co-authored-by: OpenClaw Bot <bot@openclaw.local>	2026-05-18 23:19:27 -07:00
Kpa-clawbot GitHub clawbot	8bf7709970	feat(repeater): usefulness score — bridge axis (#672 axis 2 of 4) (#1275 ) RED test commit: `fd661569` — CI will fail on this (stub returns empty map; assertions fail by design). GREEN: `bf4b8592`. ## What Implements axis 2 of 4 for the repeater usefulness score per #672 ([status comment](https://github.com/Kpa-clawbot/CoreScope/issues/672#issuecomment-4484635378)). The Bridge axis measures structural importance: how many shortest paths between other nodes route through this one. A high-traffic redundant node and a low-traffic critical bridge will no longer look identical. ## Algorithm Brandes' weighted betweenness centrality with Dijkstra for shortest paths (`cmd/server/bridge_score.go`). - Nodes: pubkeys in the `neighbor_edges` graph - Edge weight: `Score(now) * Confidence()` — per the convention from #1235 (count + recency decay scaled by observer-diversity confidence). Geo-rejected edges already excluded at graph build time (#1230) so we don't re-filter here. - Dijkstra distance: `1 / max(epsilon, weight)` — high affinity = cheap cost. - Normalize: divide by max observed centrality so output is in `[0, 1]`. Cost: `O(V · (E + V log V))`. Staging-scale (~600 nodes / ~2 000 edges) ≈ ~4.8M ops, completes in milliseconds. ## Where it lives - `cmd/server/bridge_score.go` — pure algorithm, no locks - `cmd/server/bridge_recomputer.go` — background recomputer (mirrors #1240/#1262 pattern), 5-min default interval, initial sync prewarm, snapshot stored in `s.bridgeScoreMap atomic.Pointer[map[string]float64]` - `cmd/server/routes.go` — `handleNodes` adds `node["bridge_score"]` on repeater/room rows; node-detail handler adds it on the single-node path - `public/nodes.js` — separate Bridge row in the node detail panel, alongside the existing Usefulness (Traffic) row. Distinct colour-coded bar. ## What's NOT in this PR (still pending for #672) - Coverage axis (axis 3) — unique observer-pair connectivity - Redundancy axis (axis 4) — simulated node-removal impact - Composite — once all 4 axes ship, swap the `usefulness_score` formula from "traffic-only" to the weighted composite `Refs #672` (not `Fixes` — issue stays open until all 4 axes + composite ship). ## Tests - `TestComputeBridgeScores_LineGraph` — 4-node line: middles non-zero, leaves zero, max normalized to 1.0 - `TestComputeBridgeScores_TriangleNoBridge` — clique has zero bridges - `TestComputeBridgeScores_Empty` — defensive nil-safety - `TestComputeBridgeScores_WeightSensitive` — mutation guard: revert the `1/w` inversion and this test fails - `TestBridgeScore_HandleNodesSurface` — integration: `/api/nodes` returns `bridge_score` on repeater rows; middle nodes > 0, ends == 0 --------- Co-authored-by: clawbot <bot@meshcore.local>	2026-05-18 22:51:23 -07:00
Kpa-clawbot GitHub CoreScope Bot	4cd8445233	perf(#1265 ): wire /api/observers/clock-skew + /api/nodes/clock-skew into analytics recomputer (#1266 ) RED: `97f49a0c` · CI: https://github.com/Kpa-clawbot/CoreScope/actions/runs/26046530920 Fixes #1265. ## Problem On staging two clock-skew endpoints serve compute-on-request: - `/api/observers/clock-skew` — 3.3s - `/api/nodes/clock-skew` — 8.9s Both drive a full `clockSkew.Recompute` over 100k+ adverts while holding `s.mu.RLock`, blocking under concurrent reader load. ## Fix Wire both endpoints into the established `analytics_recomputer.go` pattern (PRs #1248 / #1259 / #1263). Two new slots: - `recompObserversClockSkew` — wraps `computeObserverCalibrations()` - `recompNodesClockSkew` — wraps `computeFleetClockSkew()` Accessors `GetObserverCalibrations` / `GetFleetClockSkew` now prefer the atomic-pointer snapshot; on-request compute is fallback-only for the brief window before initial sync compute lands (and for tests that skip the recomputer). Default interval 300s, overridable via: ```json "analytics": { "recomputeIntervalSeconds": { "observersClockSkew": 300, "nodesClockSkew": 300 } } ``` `config.example.json` + the `_comment_analytics` doc updated. ## TDD - RED `97f49a0c` — `TestClockSkewRecomputersRegistered` + `TestClockSkewHandlersSteadyStateLatency` (8 concurrent readers × 25 reqs per endpoint, p99 < 100ms gate). Fails on master: recomputer slots nil. - GREEN `19599375` — wire + accessor switch. p99 well under 5ms on the test fixture. ## Verification ``` cd cmd/server && go test ./... -count=1 # ok 42s bash ~/.openclaw/skills/pr-preflight/scripts/run-all.sh origin/master # all gates pass ``` --------- Co-authored-by: CoreScope Bot <bot@corescope.local>	2026-05-18 12:27:44 -07:00
Kpa-clawbot GitHub corescope-bot	ae17a2be12	perf(#1262 ): /api/nodes?limit=2000 cold-miss 15.7s → <100ms — prewarm repeater enrichment cache (#1263 ) RED commit: `22ce5736066142583017cad7303fa48d9e00ccf0` — CI on red: https://github.com/Kpa-clawbot/CoreScope/actions?query=branch%3Afix%2Fissue-1262 ## Problem After #1260 added a 15s-TTL bulk cache for repeater enrichment in `handleNodes`, `/api/nodes` (default limit) dropped to ~500ms. But `/api/nodes?limit=2000` — called by `public/live.js` at SPA startup for hop resolution — still took 15.7s cold on staging (75k tx, 600 nodes). Warm hits were ~40ms. Root cause: the bulk cache was lazily populated on the first request after TTL expiry. The rebuild ran on the request-serving goroutine. Every cold SPA load triggered the rebuild and ate 15s. ## Fix Add `StartRepeaterEnrichmentRecomputer` — a steady-state background recomputer that mirrors the `analytics_recomputer.go` pattern from #1240: - Prewarm: initial synchronous compute on Start so the first request hits a populated cache. - Steady-state: ticker refreshes the snapshot every 5min (configurable via the existing analytics recompute interval knob). - Panic-safe + idempotent Start. Wired into `main.go` right after `StartAnalyticsRecomputers`, using `cfg.GetHealthThresholds().RelayActiveHours` as the window. ## Test `TestHandleNodesLimit2000ColdMiss` — seeds 600 nodes + 150k non-advert tx with repeaters indexed under a shared 1-byte hop prefix (matches production hop-prefix collisions), starts the recomputer, then issues `/api/nodes?limit=2000` with no HTTP warmup. \| State \| Latency \| \|---\|---\| \| Before (master, on-thread rebuild) \| 3.37s \| \| After (prewarm + steady-state) \| 56ms \| \| Budget \| 2s \| Staging end-to-end: 15.7s → expected sub-100ms on the same call path. Red commit (`22ce5736066142583017cad7303fa48d9e00ccf0`) compiles with a no-op stub of the new method so the test fails on the latency assertion, not a missing symbol. Fixes #1262 --------- Co-authored-by: corescope-bot <bot@corescope.local>	2026-05-18 09:22:27 -07:00
Kpa-clawbot GitHub corescope-bot	1efe93d7f6	perf(#1257 ): bulk-cache repeater enrichment in /api/nodes — 32s → <500ms (#1260 ) RED commit `a2879e12` — perf regression test; CI run: see Actions tab. Fixes #1257. ## Root cause `handleNodes` looped over the response page and called `store.GetRepeaterRelayInfo(pk, win)` + `store.GetRepeaterUsefulnessScore(pk)` for every repeater/room. Each call: - grabbed its own `s.mu.RLock`, - walked `byPathHop[pk]` (+ the matching 1-byte raw-prefix bucket, which on busy networks fans out to nearly the entire non-advert tx set), - and re-parsed every `tx.FirstSeen` with `parseRelayTS`. Default page is the 50 most-recently-seen nodes — almost all hot repeaters — so the request did O(50) lock acquisitions and hundreds of thousands of timestamp parses on the same set of txs. That's the classic load-then-paginate / per-row N+1 shape called out in the issue (same family as #1226). The `?limit=2000` variant looks faster relatively only because per-node enrichment dwarfs serialization; on staging both still bottleneck on the same loop. ## Fix Two new bulk methods on `PacketStore`: - `GetRepeaterRelayInfoMap(windowHours)` → `pubkey → RepeaterRelayInfo` - `GetRepeaterUsefulnessScoreMap()` → `pubkey → 0..1` Both snapshot `byPathHop` under a single `RLock`, pre-parse each `FirstSeen` exactly once (a tx that appears in N hop buckets used to be parsed N times), and emit one entry per hop key. Cached 15s — same TTL as `GetNodeHashSizeInfo` / `GetMultiByteCapMap`, same status-column freshness budget. `handleNodes` is one map-lookup per node; behavior, output schema, and `RelayActive` / `RelayCount{1h,24h}` / `LastRelayed` / `usefulness_score` semantics are preserved. ## Why no `limit` default change The issue mentioned a default-limit knob. Investigated: `queryInt(r, "limit", 50)` already defaults to 50 — frontends calling `/api/nodes` (no limit) get a 50-row page today. Capping further would change behavior (live.js already passes `?limit=2000` when it wants more); the cost was per-repeater enrichment, not page size. Fixing the N+1 is the correct lever and preserves backward compat. ## Perf Regression test `TestHandleNodesPerfLargeFleet` (600 nodes, 150k non-advert tx, repeaters indexed under `byPathHop`): \| \| elapsed \| vs 2s budget \| \|---\|---\|---\| \| before (master) \| 4.72s \| ✗ \| \| after \| ~4ms \| ✓ (~1000×) \| ## TDD - RED: `a2879e12` — test fails at 4.72s on master. - GREEN: `c529d29a` — fix; full `cmd/server` + `cmd/ingestor` suites green. --------- Co-authored-by: corescope-bot <bot@corescope>	2026-05-18 07:36:33 -07:00
Kpa-clawbot GitHub openclaw-bot	f81ed5b3cf	perf(#1256 ): wire /api/analytics/roles into steady-state recomputer (#1259 ) RED commit: `0190466d` — failing CI: https://github.com/Kpa-clawbot/CoreScope/actions (will populate after PR creation) ## Problem On staging (commit `d69d9fb`, 78k tx, 2.3M obs), `curl http://localhost/api/analytics/roles` times out at 60s with 0 bytes — the Roles tab is unusable. Issue #1256. PR #1248's steady-state recomputer fan-out (topology / rf / distance / channels / hash-collisions / hash-sizes) didn't include roles. The legacy handler: 1. Holds `s.mu.RLock` for the entire compute. 2. Calls `GetFleetClockSkew()`, which drives `clockSkew.Recompute(s)` over all ADVERT transmissions — O(78k) per request. 3. Concurrent ingest writers compound the latency through writer-starvation. Result: every request hits the cold path; the response never comes back inside the 60 s HTTP budget. ## Fix Add `roles` as the 7th endpoint in the recomputer fan-out — same pattern as #1248: - `PacketStore.recompRoles` slot, registered in `StartAnalyticsRecomputers` with default 5-min interval. - `PacketStore.GetAnalyticsRoles()` → atomic-pointer load from the snapshot (sub-ms), with a `computeAnalyticsRoles()` fallback only for the brief startup window before the initial sync compute completes. - Handler is now a thin wrapper — no lock-held work on the request path. - New optional `roles` key under `analytics.recomputeIntervalSeconds` in config; `config.example.json` and `_comment_analytics` updated. ## Latency (unit-scope benchmark) - Worst-of-50 handler latency: <100 ms (test budget; well under the 2 s p99 acceptance). - Compute itself is bounded by the existing 5-min recompute window — it runs once in the background, never on the request path. ## Tests - RED `0190466d`: asserts `recompRoles` is registered and the handler returns under the latency budget. Fails on master with `recompRoles not registered`. - GREEN `d7784f76`: registers the recomputer + snapshot accessor — both tests pass. Fixes #1256 --------- Co-authored-by: openclaw-bot <bot@openclaw.local>	2026-05-18 07:36:28 -07:00
Kpa-clawbot GitHub openclaw-bot	d69d9fbf8e	perf(#1247 ): surgical fix for resolveWithContext tier-1 hot path (4.6× speedup) (#1253 ) ## Summary Surgical fix for #1247: analytics endpoints regressed 3-9× between prod `d818527` and master. pprof against staging traced the regression to `resolveWithContext` tier-1 affinity loop running on every analytics `resolveHop` call (post-#1198 plumbing) with redundant per-(cand, ctx) work. Result: 4.6× speedup on the synthetic hot-shape benchmark (202µs → 44µs / op). ## Root cause - PR #1198 (`353c5264`) lit up `resolveWithContext` tier 1 from every analytics resolveHop closure (previously they passed `contextPubkeys=nil` and short-circuited the entire tier-1 block). - The inner loop did `N_cand × N_ctx` iterations where each one did: - `graph.Neighbors(strings.ToLower(ctxPK))` — graph RLock + ToLower allocation per candidate, redundantly - `strings.ToLower(cand.PublicKey)` per `ctxPK` - `strings.EqualFold(otherPK, ctxPK)` + `EqualFold(otherPK, candPK)` — both sides were already lowercased (`NeighborEdge.NodeA/B` via `makeEdgeKey`; `contextPubkeys` via `buildHopContextPubkeys`) - At staging scale (5k+ contextPubkeys × 30k+ resolveHop calls) this dominated `computeAnalyticsTopology` (37% of its CPU) and `computeAnalyticsRF` (55%). ## pprof attribution (staging, region-keyed queries bypassing #1240 cache) ``` computeAnalyticsTopology cum: 19.24% (5.45s / 28.32s sampled) └─ resolveWithContext 37% ├─ strings.ToLower 41% ├─ strings.EqualFold 28% └─ graph.Neighbors 24% computeAnalyticsRF cum: 10.38% ``` ## Fix (~80 LoC in `cmd/server/store.go`) 1. Lowercase `contextPubkeys` once per call, skipped entirely when already lowercased (the analytics fast path). 2. Lowercase candidate pubkeys once per call. 3. Invert the loop nesting: outer-ctx / inner-edge / candidate-map lookup. `graph.Neighbors` is called once per context pubkey instead of `N_cand` times. 4. Raw `==` instead of `strings.EqualFold` for pubkey comparisons (both sides lowercased by step 1/2). 5. Added a tiny `hasUpperASCII` byte-loop helper next to `isHexLower` for the fast-path check. Behavior preserved: same `Score × Confidence` formula, same tier-1 ratio + min-observations gate, same per-candidate "best edge wins" semantics. No change to tiers 2/3/4. ## TDD evidence - Red commit (`5f8d1564`): `TestResolveWithContextTier1Floor` asserts `<100 µs/call` on the hot shape. 199 µs/call on regressed master → FAIL. - Green commit (`e3bdbc65`): surgical fix lands. 44 µs/call → PASS. - Reverification: locally stashed the fix, ran the test → 199.5 µs FAIL; popped fix → 44 µs PASS. `BenchmarkResolveWithContextTier1Hot` (no assertion, visibility only): ``` before: 202013 ns/op 168 B/op 3 allocs/op after: 44084 ns/op 424 B/op 6 allocs/op speedup: 4.6× ``` (Post-fix allocs are O(N_cand + N_ctx) one-time helper tables — net win at hot scale.) ## Independence from #1248 PR #1248 caches the analytics compute output so user-facing latency is sub-ms even when the compute is slow. That's correct for UX but it masks the regression. This PR repairs the compute itself, so: - Region-keyed and windowed queries (which bypass the recomputer cache by design — see #1240) become fast again. - Future ingest scale or feature work on top of the regressed baseline doesn't compound. ## Out of scope - The geo-rejection (#1228) and Confidence weighting (#1229) commits — kept intact, they protect correctness and were not the dominant CPU cost. - Reverting any suspect commit — surgical only. ## Acceptance criteria from #1247 - [x] pprof confirms the hot function (`resolveWithContext`) - [x] Bisect identifies the regressing commit (`353c5264` / PR #1198 — context plumbing; ratified by pprof, no need to actually rebuild 5 binaries) - [x] Fix lands; tier-1 hot path 4.6× faster - [x] No regression in disambiguator correctness — full `go test ./...` green, all existing `ResolveWithContext` / `HopDisambig` / `NeighborGraph` / `Affinity` tests pass Fixes #1247 --------- Co-authored-by: openclaw-bot <bot@openclaw.local>	2026-05-17 16:42:01 -07:00
Kpa-clawbot GitHub OpenClaw Bot	356f001027	perf(#1240 ): steady-state background recompute for analytics endpoints (#1248 ) RED commit: `27630f6a` — adds latency test that fails on master (p99=225ms > 50ms budget) and a stub `StartAnalyticsRecomputers` that returns a no-op so the assertion (not a build error) gates the change. GREEN commit: `20fbbceb` — wires real background recompute infrastructure. Test passes at p99=~1µs. ## What changed Replaces the on-request "compute-then-cache" pattern for the default-shape analytics queries with a steady-state background recompute loop. Reads always hit an `atomic.Value` snapshot in <1µs regardless of compute cost or writer contention. Operator principle: serving slightly stale data quickly beats real-time data slowly. ## Endpoints converted (default 5min interval each) \| Endpoint \| Cold compute \| Recomputer interval \| \|---\|---\|---\| \| `/api/analytics/topology` \| ~5s \| 5 min \| \| `/api/analytics/rf` \| ~4s \| 5 min \| \| `/api/analytics/distance` \| ~3s \| 5 min \| \| `/api/analytics/channels` \| ~0.5s \| 5 min \| \| `/api/analytics/hash-collisions` \| ~0.5s \| 5 min \| \| `/api/analytics/hash-sizes` \| ~22ms \| 5 min \| All intervals configurable per-endpoint via `analytics.recomputeIntervalSeconds.<name>` in `config.json`; documented in `config.example.json`. Default override via `analytics.defaultIntervalSeconds`. ## Scope: default query only Only the canonical shape `(region="", window=zero)` is precomputed. Region- or window-filtered requests fall back to the legacy TTL cache + on-request compute — keeps recomputer count bounded (6, not 6×N×M). ## Latency Test `TestAnalyticsRecomputerSteadyStateLatency`: 100 concurrent readers + 4 writers churning `s.mu.Lock` on 20k distHops. - Before: p50=188ms p99=225ms (assertion failed) - After: p50=240ns p99=1.1µs (atomic load + map return) ## Shutdown integration `StartAnalyticsRecomputers` returns a stop closure invoked from `main.go`'s SIGTERM handler BEFORE `dbClose()` so any in-flight SQLite compute drains cleanly. `TestAnalyticsRecomputerShutdownNoLeak` confirms all 6 goroutines are reaped (Δ=6 within 2s). ## Safety details - Initial compute is synchronous in `Start()` — first read after startup never sees nil. - `recover()` inside `runOnce` keeps a compute panic from killing the goroutine; previous snapshot remains valid. - `analyticsRecomputerMu` is a sync.RWMutex; recomputer pointers are read-locked in the hot path. The atomic.Value swap inside `runOnce` is lock-free. Fixes #1240. --------- Co-authored-by: OpenClaw Bot <bot@openclaw.local>	2026-05-17 17:33:30 +00:00
Kpa-clawbot GitHub OpenClaw Bot openclaw-bot	b881a09f02	feat(#1188 ): show observer IATA on packets + filter grammar (#1189 ) Red commit: `4ed272761b` (CI run: https://github.com/Kpa-clawbot/CoreScope/actions/runs/25651898290) Fixes #1188 — observer IATA on packets in three UI surfaces + filter grammar. cross-stack: justified — feature spans API shape (Go), store, filter grammar (JS), three packets UI surfaces. ## Scope shipped - Packets table row: `.badge-iata` pill inline next to observer name - Expanded observation rows: per-observation IATA badge - Detail pane: Observer dd + per-observation list both render the badge - Filter grammar: `observer_iata` field + `iata` alias; `==`/`!=`/`contains`, plus a new `in (a, b, c)` list operator. Both names appear in autocomplete with descriptions. ## TDD red→green pairs 1. `271d72f` filter-grammar tests → `2c182eb` evaluator + suggest entries 2. `4ed2727` backend `observer_iata` API tests → `7856914` SQL join + struct/store wiring 3. `0e09371` display E2E → `7a3f45d` packets.js + style.css badge (E2E swapped for string-contract unit test in `ee414b4` — fixture `observations.observer_idx` stores text pubkeys, blocking the join the badge depends on) ## Backend - `cmd/server/db.go`: SELECT `obs.iata AS observer_iata` in `transmissionBaseSQL`, grouped query, observations-by-transmissions - `cmd/server/store.go`: `ObserverIATA` on `StoreTx`/`StoreObs`, load via all three ingest paths, surface in `txToMap`/`enrichObs`/`groupedTxsToPage` - `cmd/server/types.go`: field added to `TransmissionResp`/`ObservationResp`/`GroupedPacketResp` - Test fixture schemas declare `iata` on observers ## Perf Per #383, `obsIataBadge(packet)` reads `packet.observer_iata` directly (server-joined). Falls back to `observerMap.get(id).iata` only if absent — hot row-render loop avoids per-row Map lookup on fresh data. ## Display rules Missing IATA: nothing inline (Region column still shows `—`). No new hex — `.badge-iata` uses `var(--nav-bg)` / `var(--nav-text)`. E2E assertion added: test-observer-iata-1188.js:51 --------- Co-authored-by: OpenClaw Bot <bot@openclaw.dev> Co-authored-by: openclaw-bot <bot@openclaw.local>	2026-05-17 16:13:11 +00:00
Kpa-clawbot GitHub openclaw-bot	2754251a53	perf(#1239 ): /api/analytics/distance — TTL 15s→60s + drop main RLock around compute (#1241 ) ## Summary Fixes #1239 — `/api/analytics/distance` 15s cold on staging under heavy ingest. Two independent fixes. First commit on this branch is the RED test for Fix B (`a539882`), demonstrating reader/writer contention against the main store lock. CI: see Actions tab for the run on the test-only commit — it asserts >150µs avg writer cycle and fails at 82367µs pre-fix. GREEN commit (`d3938f1`) brings it to 1µs. ## Fix A — TTL bump 15s → 60s (`5eae1e0`) - `rfCacheTTL` default in `cmd/server/store.go` changed from `15 * time.Second` to `60 * time.Second`. This is the shared TTL for RF / topology / distance / hash-sizes / subpath / channel analytics caches. - Per operator clarification (issue thread): distance analytics IS viewed live during analysis sessions, not background-glanced. 60s smooths the cold-miss churn during heavy ingest without freezing data. - `config.example.json`: documented `cacheTTL.analyticsRF` with new default + caveat. - Existing assertions (`TestCacheTTLDefaults`, `TestHashCollisionsCacheTTL`) updated to the new default. ## Fix B — Drop main RLock around compute (`a539882` red, `d3938f1` green) `computeAnalyticsDistance` previously held `s.mu.RLock()` for the entire iteration: region match-set construction, hop/path filtering, sort, dedup, histogram, category stats, time series. Readers serialized writers (ingest, `buildDistanceIndex`). Refactor: hold the RLock only long enough to snapshot the `distHops`/`distPaths` slice headers AND build the region match-set (which reads `tx.Observations`, mutated under `s.mu.Lock`). For `region=""` (the hot cold-call path) the lock hold is just the header snapshot — microseconds. Everything else runs on the locally-captured slices outside the lock. Safety: `distHops`/`distPaths` are append-only via re-slice in `buildDistanceIndex` / `updateDistanceIndexForTxs` (both under `s.mu.Lock`). If the backing array reallocates after the snapshot, the snapshot still references the prior array (GC-pinned) at the consistent length captured under the lock. Records are value types — no torn writes. ## Test results `cmd/server/distance_lock_contention_test.go` (8 reader goroutines × 20k synthetic distHops × 200 writer Lock/Unlock cycles): - pre-fix avg writer cycle: 82367µs (16.5s for 200 cycles) - post-fix avg writer cycle: 1µs (279µs for 200 cycles) - ~82000× reduction in writer contention; reader result shape unchanged Full `go test ./cmd/server/...` green with `-race`. ## Out of scope (per issue) - Same lock pattern in topology / RF / hash / subpath analytics — file separately if needed. - Per-region cache key sharding. - WebSocket-driven cache invalidation. --------- Co-authored-by: openclaw-bot <bot@openclaw.local>	2026-05-16 20:56:52 +00:00
Kpa-clawbot GitHub bot	2e28aa3e04	fix(#1229 ): source-diversity confidence weighting in neighbor-graph tier-1 resolver (#1235 ) RED `235b65b4` (CI will surface URL after PR open) — `test(#1229): tier-1 must prefer multi-observer edges`. Green: `841fc5de`. ## Summary Implements Option C from issue #1229: edge source-diversity confidence weighting. Each neighbor-graph edge already tracks the set of distinct observers that contributed to it (`NeighborEdge.Observers`). This PR is the first to consume that signal in the disambiguator. Tier-1 score in `pm.resolveWithContext` becomes `Score(now) × Confidence()` where: ``` Confidence() = min(1.0, max(1, \|Observers\|) / 3.0) ``` - 1 observer → 1/3 weight (single-source, suspect) - 2 observers → 2/3 weight - ≥3 observers → 1.0 (saturated, full historical weight) A 6-observer edge (30 obs) now beats a 1-observer edge (25 obs) by 3.6× (vs. 1.2× before) — enough to clear `affinityConfidenceRatio` and skip the tier-2 geo fallback that was misresolving in cross-region cases. Stacks with the geo-rejection filter merged in #1228/#1230 to give two independent defenses against cross-region prefix-collision pollution. ## Why C over A/B - A (per-observer graphs): N×memory cost, biggest refactor surface. - B (per-region/IATA segmented): requires region attribution on every packet + per-region cache plumbing; deferred follow-up. - C: smallest diff (~30 lines), no schema migration, leverages an existing field, composes additively with #1228. A and B remain valid follow-ups if C proves insufficient. ## Backward compatibility (persistence) `neighbor_edges` schema is unchanged. `Observers` is rebuilt by `BuildFromStoreWithOptions` from live observations on every graph refresh (5-min TTL). Persisted rows carry an empty set only during the post-restart warm-up; `Confidence()` defaults n→1 when `\|Observers\|==0`, so legacy rows resolve as single-observer (degraded but non-zero) confidence rather than disappearing. Defensive. ## Tests - `cmd/server/hop_disambig_confidence_test.go:48` — RED-then-GREEN E2E: two `8a` candidates from the same anchor, candX placed geo-near with 1 observer × 25 obs, candY placed geo-far with 6 observers × 5 obs. Without confidence weighting tier-1 falls through (1.2× ratio) and tier-2 picks the wrong (geo-near) candX. With confidence weighting tier-1 fires and picks candY. Asserts `method == "neighbor_affinity"` to pin the resolver path. - `TestNeighborEdge_ObserverSetIsDistinct` — guards the source-diversity counter against double-counting same-observer contributions and pins the `Confidence()` formula at both endpoints (single → fractional, ≥3 → 1.0). All existing tier-1 tests (`hop_disambig_tier1_test.go`) continue to pass — they seed with a single observer, so their weights drop from 1.0 to 1/3 uniformly across candidates, preserving the ratio guard outcome. Fixes #1229 --------- Co-authored-by: bot <bot@corescope.local>	2026-05-16 19:55:00 +00:00
Kpa-clawbot GitHub clawbot	b21badbcbd	fix(#1225 ): paginate channel messages at SQL level — 30s → <500ms (#1226 ) ## Summary Fixes #1225 — channel messages endpoint took ~30s on staging. ## Root cause `(DB).GetChannelMessages` SELECTed every observation row for the channel (one row per observation, not per transmission), JSON-unmarshalled each row into a Go map, dedupe-folded by `(sender, packetHash)`, then sliced the tail in Go for pagination. On staging `#wardriving`: - `transmissions` rows with `channel_hash='#wardriving' AND payload_type=5`: 5,703* - `observations` joined to those: 274,632 (~48× amplification) - `time curl /api/channels/%23wardriving/messages?limit=50`: 30.04s / 31.41s / 31.48s / 35.33s / 34.05s (5 calls before I killed the loop) `EXPLAIN QUERY PLAN` showed the index `idx_tx_channel_hash` was being used — the cost was entirely in fetching, unmarshalling, and folding the full observation set per request even for `limit=50`. Hypothesis #1 from the issue (full table scan on `messages/decoded`) is rejected; #2 (missing index) is rejected; the actual cause was pagination in Go instead of SQL — request cost was O(observations) not O(limit). ## Fix Move pagination into SQL on the `transmissions` table. Because `transmissions.hash` is `UNIQUE` and the original dedup key was `(sender, hash)`, each transmission collapses to exactly one logical message — paginating on transmissions is semantically equivalent to the prior in-Go dedup + tail slice. New shape: 1. `COUNT()` on transmissions for total (uses `idx_tx_channel_hash`). 2. `SELECT id FROM transmissions … ORDER BY first_seen DESC LIMIT ? OFFSET ?` to pick the page of newest transmissions. 3. `SELECT … FROM observations WHERE transmission_id IN (…page ids…)` — typically 50 ids → a few hundred observation rows. 4. Reassemble in pageIDs order, preserving the ASC-by-`first_seen` API contract. Region filtering, observation-count-as-`repeats`, and "first observation wins for hops/snr/observer" semantics are preserved (observations are scanned `ORDER BY o.id ASC`). ## Perf measurements Before* (staging `#wardriving`, limit=50, 5 samples killed mid-loop): 30.04s, 31.41s, 31.48s, 35.33s, 34.05s. Synthetic regression test (`TestGetChannelMessagesPerfLargeChannel`): 3000 tx × 50 obs. - Broken impl: ~4.5s (test fails the 500ms budget — the RED commit). - Fixed impl: well under 500ms (test passes). After (staging): will measure post-deploy and post-comment on issue with numbers. Synthetic scaling: staging is ~2× the test's transmission count, fixed-path cost scales with `limit` (50) + `COUNT()` (~5k rows on index) — expect <100ms p99. ## TDD - RED: `697c290d` — perf test asserts <500ms on 3k×50 dataset; fails at ~4.5s. - GREEN: `3f1f82d3` — fix; full suite green, perf test passes. ## Hypotheses status \| # \| Hypothesis \| Verdict \| \|---\|---\|---\| \| 1 \| Endpoint slow on prod-sized data \| CONFIRMED* (different mechanism — see root cause) \| \| 2 \| Missing channel_hash index \| Rejected (`idx_tx_channel_hash` exists & used) \| \| 3 \| Frontend re-render storm \| Not investigated (backend was clearly the bottleneck) \| \| 4 \| Decode in request path \| Rejected (decode is at ingest time; JSON unmarshal of cached `decoded_json` is the cost, addressed by reducing row count) \| \| 5 \| WS subscription failure \| Rejected \| \| 6 \| Staging artifact \| Rejected (reproducible) \| ## Out of scope - The in-memory `(*PacketStore).GetChannelMessages` path (used when `s.db == nil`) has the same shape but operates on bounded in-memory data; not touched. If we ever fall back to it in production we'll revisit. --------- Co-authored-by: clawbot <bot@corescope>	2026-05-16 17:28:40 +00:00
Kpa-clawbot GitHub corescope-bot	7179afcfde	feat(#1228 ): reject geo-implausible neighbor-graph edges at build time (#1230 ) Fixes #1228 — geo-implausible neighbor-graph edges are rejected at build time. Red commit: `5a6d9660` — failing tests for 4 cases (reject SF↔Berlin, accept local CA, accept no-GPS endpoint, counter increments). Live CI run (latest commit): https://github.com/Kpa-clawbot/CoreScope/actions?query=branch%3Afix%2Fissue-1228 ## Why The disambiguator's tier-1 affinity graph is built blindly from path co-occurrence. On wide-geo MQTT deployments, a single bad hop disambiguation seeds an edge across geographically impossible distances (e.g. Bay Area ↔ Berlin), which then reinforces the same wrong resolution next time. Self-poisoning spiral. ## What changed - `upsertEdge` now consults a per-graph GPS index. When both endpoints have known GPS and their haversine distance exceeds the threshold, the edge is dropped and `NeighborGraph.RejectedEdgesGeoFar` (atomic) is incremented. - Either endpoint missing GPS ⇒ accept (no signal to reject), per acceptance criteria. - Threshold is configurable via `neighborGraph.maxEdgeKm` (default 500 km — well above any plausible terrestrial LoRa hop, including satellite-assisted). 0 ⇒ use default; negative ⇒ disable the filter. Exposed via `Config.NeighborMaxEdgeKm()`. - New `BuildFromStoreWithOptions` carrying the threshold; `BuildFromStore` and `BuildFromStoreWithLog` are kept as thin wrappers. - Stats are surfaced under `GET /api/analytics/neighbor-graph` as `stats.rejected_edges_geo_far`. - All rejection logs PII-truncate pubkeys to 8 hex chars (public repo discipline). - `config.example.json` updated with the new field + comment. ## Follow-up #1229 (per-region scoped affinity graphs) depends on this landing first. --------- Co-authored-by: corescope-bot <bot@corescope.local>	2026-05-16 10:14:44 -07:00
Kpa-clawbot GitHub bot OpenClaw Bot corescope-bot openclaw-bot	170f0ac66d	fix(#1212 ): MQTT per-attempt logging + stall watchdog — prevent silent reconnect-loop death (#1216 ) RED commit: `1cd25f7b` — CI (failing on assertion): https://github.com/Kpa-clawbot/CoreScope/actions?query=sha%3A1cd25f7b1bdd0091f689dd64ce1bfec6d031191f Fixes #1212 ## Root cause NOT that `AutoReconnect` was off — it was set; `MaxReconnectInterval=30s` was set (PR #949); a `SetReconnectingHandler` was wired. The defect was an observability gap: `SetReconnectingHandler` fires only INSIDE paho's reconnect goroutine. If that goroutine never iterates (status race after the recovered handler panic at 21:07:13, or an internal abort), operators see ONLY the `disconnected: pingresp not received` line and then total silence. They cannot distinguish "paho is patiently retrying" from "paho gave up and the goroutine is gone." That ambiguity is what turned a 30s blip into 6h of downtime. ## Changes ### `cmd/ingestor/main.go` — `SetConnectionAttemptHandler` Fires on every TCP/TLS dial — the initial `Connect()` AND every reconnect — independent of paho's internal reconnect-loop state. Logs: ``` MQTT [staging] connection attempt #1 to tcp://broker:1883 MQTT [staging] connection attempt #2 to tcp://broker:1883 ``` Per-source attempt counter via `atomic.AddInt64`. ### `cmd/ingestor/mqtt_watchdog.go` (new) — per-source stall watchdog Satisfies the watchdog acceptance criterion. Even when paho reports `connected`, if no MQTT messages have flowed for >5m, log a WARN line every 60s: ``` MQTT [staging] WATCHDOG: client reports connected to tcp://broker:1883 but no messages received for 7m30s (threshold 5m) — possible half-open socket or upstream stall ``` Catches half-open TCP and broker-accepted-but-not-forwarding scenarios that look "connected" to paho. Hot-path cost: one `atomic.StoreInt64` per inbound message. Watchdog scans the registry once a minute. ### Tests (`cmd/ingestor/mqtt_reconnect_test.go`, new) - `TestBuildMQTTOpts_InstrumentsConnectionAttempt` — asserts `OnConnectAttempt` is wired in `buildMQTTOpts`. - `TestMQTTStallWatchdog_FiresOnSilentSource` — connected + 10m silent + 5m threshold → stall flagged. - `TestMQTTStallWatchdog_QuietWhenRecent` — recent message → no stall. - `TestMQTTStallWatchdog_QuietWhenDisconnected` — disconnected → no stall (paho's reconnect logging covers it). ## TDD - RED `1cd25f7b` — 2 assertion failures (compile OK, stub returns no-stall, `OnConnectAttempt` nil). - GREEN `2527be6f` — implementation; all ingestor tests pass. ## Out of scope - Slice-bounds decode panic (#1211, separate PR). - A full in-process MQTT broker integration test would require a new dep (mochi-mqtt) — the observability and watchdog behaviors are independently verifiable by the unit tests above, and the reconnect path itself is paho's responsibility (we already test it's configured via `mqtt_opts_test.go`). --------- Co-authored-by: bot <bot@example.com> Co-authored-by: OpenClaw Bot <bot@openclaw.local> Co-authored-by: corescope-bot <bot@corescope.local> Co-authored-by: openclaw-bot <openclaw-bot@users.noreply.github.com>	2026-05-15 22:46:29 -07:00
Kpa-clawbot GitHub corescope-bot corescope-bot	eba9e89a72	fix(#1203 ): path-inspector — singleflight + stale-while-revalidate (#1208 ) Red commit: `c84a8f575a` (CI run: pending push) Fixes #1203 — path-inspector 503 storm. Three sub-fixes, each shipped as red→green per AGENTS TDD: A. Singleflight on rebuild (`ensureNeighborGraph`) Hand-rolled `sync.Mutex + chan` singleflight — no new deps (x/sync was not in cmd/server's go.mod). Concurrent callers attach to one in-flight rebuild instead of N parallel `BuildFromStore` goroutines. - Red: `7340f23b` — test asserts ≤1 build under 10 concurrent callers (saw 10 on master) - Green: `abac6b3c` B. Stale-while-revalidate (`handlePathInspect`) Stale non-nil graph is served immediately with `"stale": true` while a background rebuild runs (deduped by A). The 2s synchronous gate is gone. Stale responses are not cached, so the next request after rebuild lands fresh. - Red: `c84a8f57` — test asserts 200+`stale:true`+rebuild-kickoff (master returned 503) - Green: `5eb86975` C. Cold-start 503 still kicks rebuild True cold start (`graph == nil`) is the only path that still returns 503 `{"retry": true}`, but it now spawns an async `ensureNeighborGraph` so the very next request warms up. - Green test: `f5ac7059` (passed on top of A+B) Singleflight verified: `TestEnsureNeighborGraph_Singleflight` Stale-while-revalidate verified: `TestHandlePathInspect_StaleWhileRevalidate` Cold-start verified: `TestHandlePathInspect_ColdStartKicksRebuild` Acceptance criteria (issue #1203): - [x] Concurrent requests share ONE rebuild - [x] Stale non-nil graph served with `stale:true` async - [x] 503 only on true cold-start - [x] Cold-start 503 kicks rebuild → follow-up warm - [ ] p99 < 500ms under load (not unit-testable; design satisfies it) - [x] No regression in existing tests Out of scope (per issue): 5-min TTL constant, `BuildFromStore` perf, `/api/analytics/topology`, persist-lock contention. No new deps. --------- Co-authored-by: corescope-bot <bot@corescope.local> Co-authored-by: corescope-bot <bot@corescope.dev>	2026-05-15 22:46:28 -07:00

1 2 3 4 5 ...