mirror of
https://github.com/Kpa-clawbot/meshcore-analyzer.git
synced 2026-06-04 11:01:23 +00:00
c0ca47e9be64aba76a87a8e4aa25c1810c848a10
273 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
13bdee57d4 |
perf: P0 hot-path fixes (observers, neighbor-graph, observer-analytics) (#1481) (#1483)
## What Three of the four P0s from #1481's scale-test findings. Each cuts a distinct hot path; together they target /api/observers, /api/analytics/neighbor-graph, and /api/observers/{id}/analytics — the top three live offenders. ### P0-1: 5-min atomic-pointer cache for default neighbor-graph response - Live p95 10.8s on the most-trafficked organic endpoint. - Background recomputer (5-min cadence per operator directive) builds the default-filter (`minCount=5 minScore=0.1`, no region, no role) `NeighborGraphResponse` and stores it via `atomic.Pointer`. - `handleNeighborGraph` short-circuits on the default shape; non-default filters take the extracted `computeNeighborGraphResponse` path (identical semantics to the previous inline build). ### P0-2: cache parsed `StoreObs.Timestamp` + drop RLock window - `handleObserverAnalytics` re-parsed the RFC3339 timestamp three times per observation, for 60k+ observations per active observer, under `s.store.mu.RLock` — blocking writers for the full scan. - `StoreObs.ParsedTime()` parses once via `sync.Once` (mirrors `StoreTx.ParsedDecoded`). - Handler snapshots the `byObserver[id]` pointer slice, releases the RLock immediately, then iterates locally. ### P0-3: 30s cache for `/api/observers` + sargable `IN` + covering index - Three SQL queries on every request → ~1.7s p50 at 50-concurrent. - Atomic-pointer 30s cache for the default (no-filter) query. - `GetNodeLocationsByKeys` drops `LOWER(public_key) IN (...)` (non-sargable); callers pre-lowercase in Go and the plain `IN` matches the existing `public_key` index. - New ingestor migration `obs_observer_ts_idx_v1` adds composite index `idx_observations_observer_idx_timestamp(observer_idx, timestamp)` so `GetObserverPacketCounts` can resolve its GROUP-BY + range filter from the index without scanning the 1.9M-row observations table. ### P0-4: deferred `perfMiddleware`'s global mutex was claimed to serialize every API request. A direct test (`50 concurrent requests through the middleware, handler sleeps 20ms each`) shows total elapsed ≈ 25ms, not 1s — the lock is held only for the post-handler bookkeeping (a few µs). Real impact is below measurement noise. Skipping to avoid invasive churn on PerfStats consumers without a demonstrable win. ## Test plan Red → green per P0: - `observers_cache_test.go` — handler reads `s.observersCache` before SQL, TTL boundary, atomic.Pointer (no mutex contention). - `storeobs_parsedtime_test.go` — parses three timestamp shapes, caches result, no race under concurrent readers. - `neighbor_graph_cache_test.go` — handler serves from atomic pointer when set, bypasses cache when `?region=` (or any non-default filter) is passed. Full server + ingestor suites pass: `go test -count=1 ./...`. ## Perf proof Before/after p50/p95/p99 (50 requests × 50 concurrent) against prod (before) and staging once CI deploys (after) will be posted as a PR comment per the operator's "no merge without proof of improvement" gate. Closes #1481 ## TDD exemption — P0-1 and P0-2 (net-new surfaces, AGENTS.md) Per CoreScope `AGENTS.md` § "Exemptions": **net-new code surfaces with no prior tests to break** may land tests in the same PR without a strict test-first → impl commit split. - **P0-1 (neighbor-graph atomic-pointer cache)** — `neighborGraphCache`, `recomputeNeighborGraphCache`, `loadNeighborGraphCacheBytes`, `startNeighborGraphRecomputer` and the default-shape short-circuit in `handleNeighborGraph` were brand-new code with no pre-existing assertions covering them. There was no green test to first turn red. - **P0-2 (cached `StoreObs.Timestamp` + RLock window drop)** — `StoreObs.ParsedTime()` and the snapshot+release pattern in `handleObserverAnalytics` were new surfaces; the prior code did the parse inline per call with no behavioural test to break. P0-3 was authored properly red-then-green (commit `6e63ec6a` red, then `83ae129b` green) and does NOT use this exemption. ## Default-filter detection vs frontend reality (#1483 follow-up) The Neighbor Graph analytics tab in `public/analytics.js` fetches `/analytics/neighbor-graph?min_count=1&min_score=0` because the client-side sliders need the full edge set to filter from. That shape did NOT match the `(5, 0.1)` cached default, so the UI tab still paid the cold compute cost despite #1481 P0-1. The #1483 follow-up commit caches BOTH shapes in the same recomputer pass: - `(minCount=5, minScore=0.1, no region, no role)` — `live.js` affinity-scoring consumer. - `(minCount=1, minScore=0, no region, no role)` — analytics tab. Both are served from `atomic.Pointer` with an `X-Cache-Age-Seconds` header. The per-shape cost in the background goroutine is roughly linear in edge count; total recompute time stays well under the 5-minute cadence on prod-scale graphs. --------- Co-authored-by: openclaw-bot <bot@openclaw.dev> Co-authored-by: mc-bot <mc-bot@users.noreply.github.com> |
||
|
|
43b93c6bb9 |
feat(observers): surface naive-clock observers as ⚠️ chip + detail banner (#1478) (#1480)
## Summary Issue #1478 — surface observers whose envelope timestamps are being clamped because they're emitting zone-less local-time strings (UTC-N observers showed up perpetually as "Stale" before #1466, and per-packet rxTime is still clamped to ingest time for them, muddying propagation-delay analytics). Now the UI tells operators which observers are misconfigured + how to fix it. ## What changed ### Ingestor (cmd/ingestor) - New `observers_clock_naive_v1` migration adds three columns to `observers`: - `clock_skew_seconds INTEGER` (signed: negative = behind UTC, positive = ahead) - `clock_skew_count_24h INTEGER` (rolling 24h event count) - `clock_last_naive_at TEXT` (RFC3339 timestamp of last clamp) - `resolveRxTime` now returns `(rxTime, naiveSkewSec)`. The packet-handler call site invokes `store.RecordNaiveSkew(observerID, deltaSec)` whenever a naive envelope is clamped (the existing >15 min naive-tolerance path). The counter resets to 1 if no event in the prior 24h, else increments. Single INSERT-or-UPDATE round trip per clamp. ### Server (cmd/server) - `Observer` struct + `GetObservers` / `GetObserverByID` extended to scan the three new columns. - `ObserverResp` gains four JSON fields exposed by `/api/observers` and `/api/observers/{id}`: - `clock_naive` (bool, derived from `clock_last_naive_at` being within 24h) - `clock_skew_seconds`, `clock_skew_count_24h`, `clock_last_naive_at` - Decay is **read-side**: a stale event yields `clock_naive=false` with zero counts. No background sweep, no writes from the read-only server, no race with the ingestor. ### Frontend (public) - `window.ObserversNaiveChip.render(o)` — total render helper, returns ⚠️ chip HTML when `o.clock_naive===true`, `""` otherwise. Used inline in the observers-list `name` cell and in the row-detail slide-over. Tooltip explains magnitude + direction + count + fix. - `window.ObserverDetailNaiveBanner.render(obs)` — yellow alert banner at the top of the observer-detail page with the skew magnitude, last-event timestamp, and the actionable fix ("Set host clock to UTC, OR emit Z-suffixed/offset-aware timestamps from the observer script"). ## TDD trail - `5ddd5b42` red: backend `cmd/server/observer_naive_clock_1478_test.go` (3 tests asserting JSON fields + 24h decay) + frontend `test-observer-naive-clock-1478.js` (8 jsdom-style tests asserting helpers exist and render correctly). Both failed on master with field-missing / export-missing assertions. - `4ecc79c8` green backend: schema + Observer / GetObservers / ObserverResp / handler decay. - `2137ab81` green frontend: chip + banner helpers and call sites. ## Tests - `cd cmd/server && go test ./...` → all green (full suite, 46s) - `cd cmd/ingestor && go test ./...` → all green (full suite, 98s) - `node test-observer-naive-clock-1478.js` → 8/8 pass - `node test-frontend-helpers.js` → unchanged from master (pre-existing failures only) ## Acceptance (issue #1478) - ✅ Observer running with `python datetime.now().isoformat()` (naive, off by N hours) → `clock_naive=true` after the next clamp → UI shows ⚠️ chip + banner. - ✅ Observer with `datetime.now(timezone.utc).isoformat()` (Z-suffixed) → never clamped → never flagged. - ✅ Observer that fixed its clock → `clock_naive` returns to `false` 24h after the last clamp event (read-side decay). Closes #1478. --------- Co-authored-by: openclaw <bot@openclaw.local> |
||
|
|
462cb2cb5a |
chore: update MeshCore URLs to use new site (#1445)
# Summary The main MeshCore website is https://meshcore.io. Reasons for the new website are listed here: https://blog.meshcore.io/2026/04/23/the-split # Changes Any occurrence of `meshcore.co.uk` was replaced with `meshcore.io`. No logic was changed, only updated strings. Co-authored-by: hrtndev <hrtndev@users.noreply.github.com> |
||
|
|
7c40e24a35 |
feat(server): warn at startup when GOMEMLIMIT < 50% of container memory limit (#1264) (#1429)
## Summary - Adds `readCgroupMemoryMB()` to detect container memory ceiling from cgroup v2 (`/sys/fs/cgroup/memory.max`) and v1 (`/sys/fs/cgroup/memory.limit_in_bytes`) - Adds `warnIfMemlimitUnderprovisioned()` called once from `main()` after the existing memlimit block — logs a `[memlimit] WARN` at startup if the effective GOMEMLIMIT is below 50% of the container limit - Works whether the limit was set via `GOMEMLIMIT` env var or derived from `packetStore.maxMemoryMB` - Adds `readCgroupMemoryMBFn` package-level hook for test injection (same pattern as `readProcSelfIOFn` in the ingestor) Fixes #1264. In the reported incident, GOMEMLIMIT was 1536 MiB on a 7.7 GB container; GC consumed 82% of CPU and all endpoints were 3–100× slower. This warning fires at startup so operators catch the misconfiguration before it causes an incident. ## Test plan - [ ] `TestWarnIfMemlimitUnderprovisioned_EmitsWarning` — warning fires when effective < 50% of cgroup - [ ] `TestWarnIfMemlimitUnderprovisioned_NoWarnWhenAdequate` — no warning at boundary (effective = 1024 MiB, cgroup = 1536 MiB) - [ ] `TestWarnIfMemlimitUnderprovisioned_NoCgroupNoLog` — silent on non-container hosts - [ ] `TestWarnIfMemlimitUnderprovisioned_NoneSource` — no warning when `source="none"` (no limit configured, runtime returns math.MaxInt64) - [ ] `TestMemlimitUnderprovisioned` — boundary table for the comparison helper - [ ] All existing `TestApplyMemoryLimit_*` still pass 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |
||
|
|
ad45a774d7 |
test(paths): regression test for #1144 — hop name mis-resolution on prefix collision (#1433)
## Summary - Adds `TestHandleNodePaths_HopName_CanonicalPathShowsTarget_1144` as a regression test for issue #1144 - When two nodes share a short pubkey prefix (e.g. `"37"`), the biased hop resolver (`resolveWithContext`) could pick a GPS-having sibling over the actual target node, producing the wrong name in hop display - The bug was already fixed during the #1352 canonical-path work: the canonical-path branch (Option A) uses `lookupNode(resolvedPK)` with the full pubkey from `resolved_path`, bypassing the biased resolver entirely - This PR documents and locks in the correct behaviour with a targeted test ## Test setup - `targetPK` (`37cf...`): no GPS - `siblingPK` (`37bb...`): has GPS — the biased resolver's tier-3 picks this without the fix - One TX with `resolved_path = [targetPK]` → Option A fires → `lookupNode(targetPK)` → hop shows `"CJS SF Mission"`, not `"Templeton Hills"` If Option A were removed (bug re-introduced), `resolveWithContext("37", ...)` on the two candidates would return the GPS-having sibling, triggering the test failure. ## Test plan - [x] `go test -run TestHandleNodePaths_HopName -v` passes - [x] Full `go test ./...` passes - [x] Code review addressed (collapsed redundant error checks) Closes #1144 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |
||
|
|
981664528e |
perf(server): serve stale repeater enrich cache instead of inline rebuild (#1272) (#1436)
## Summary - Removes the TTL-based inline rebuild from `GetRepeaterRelayInfoMap` and `GetRepeaterUsefulnessScoreMap` - When the cache is non-nil it is returned immediately, regardless of age — no more 700ms on-request recompute - Inline compute is retained only as a nil-cache guard (edge case: tests without a running recomputer) - Fixes the stale `// 15s-TTL gate` comment in `recomputeRepeaterEnrichmentSafe` **Root cause:** `computeRepeaterRelayInfoMap` runs inline when the TTL expires, taking ~700ms on a busy instance. `StartRepeaterEnrichmentRecomputer` (introduced in #1262) already keeps the cache warm via synchronous prewarm at startup + 5-min ticks, making the inline path dead code that fires only when the TTL is shorter than the recomputer interval (e.g. custom `analytics.defaultIntervalSeconds > 600`). ## Test plan - [ ] `TestGetRepeaterRelayInfoMap_ServesStaleOnTTLExpiry` — regression guard: stale sentinel is returned without recompute - [ ] `TestGetRepeaterUsefulnessScoreMap_ServesStaleOnTTLExpiry` — same for usefulness score map - [ ] `TestGetRepeaterRelayInfoMap_BuildsWhenNil` — nil-cache fallback still works - [ ] Full `-short` suite passes (`go test -short ./...`) Closes #1272 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |
||
|
|
b3e55ae8d5 |
fix(nodes): sort paths-through-node by recency, count as tiebreaker (#1145) (#1431)
## Summary
- `/api/nodes/{pk}/paths` returned paths in non-deterministic map
iteration order; with many paths the UI showed a random ordering on each
page load
- Now sorted by `LastSeen` descending (newest-first), with `Count` as a
tiebreaker (higher first)
- Nil `LastSeen` sorts last (treated as oldest)
- `LastSeen` is an RFC 3339 string so lexicographic comparison is
correct
Closes #1145.
## Test plan
- [ ] `TestHandleNodePaths_SortByRecency_1145` — 3 distinct paths (via
relay1, relay2, direct), verifies newest appears first
- [ ] `TestHandleNodePaths_SortCountTiebreaker_1145` — two paths with
identical `LastSeen`, verifies higher-count path wins the tiebreak
- [ ] All existing `TestHandleNodePaths_*` tests still pass
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
|
||
|
|
d24246395d |
fix(#1456): rename Usefulness → Traffic share + add traffic_share_score field (#1457)
## Summary Rename the "Usefulness" UI label to "Traffic share", add hover tooltips for both Traffic share and Bridge score, and introduce a new `traffic_share_score` field on `/api/nodes` (alongside the legacy `usefulness_score`, kept for API back-compat). Closes #1456. ## Why The "Usefulness" label implied a composite score that doesn't exist yet — only the Traffic-share axis (axis 1 of 4 from #672) and the Bridge axis (axis 2 of 4 from #1275) are wired today. A node with low traffic but critical structural position read as "not useful" — exactly wrong. Neither score had a tooltip explaining what it measured. ## Changes ### Frontend (`public/nodes.js`) - Visible label `Usefulness` → `Traffic share` (with ⓘ glyph) - Tooltip explains traffic-share semantics, cross-references Bridge for structural importance, points at #672 for the 4-axis roadmap - Bridge row gets a parallel ⓘ glyph and a tooltip naming "betweenness centrality" + the "quiet but irreplaceable chokepoint" interpretation - Prefers new `traffic_share_score` with graceful fallback to legacy `usefulness_score` ### Backend (`cmd/server/routes.go`) - `/api/nodes` and `/api/nodes/{pubkey}` now emit BOTH `usefulness_score` (kept for API compat) AND `traffic_share_score` (new canonical name), populated with the same value - Inline comment documents the deprecation path: when the #672 composite ships, `usefulness_score` becomes the composite and `traffic_share_score` keeps the per-axis value ## Tests - `test-issue-1456-score-labels.js` — file-grep pins on `nodes.js` (label, tooltip fragments, percent formatting, dual-field read with fallback) - `cmd/server/traffic_share_score_test.go` — `/api/nodes` + `/api/nodes/{pk}` responses contain both fields with equal values TDD: red commit (`8bd235a0`) added failing tests; green commit (`c4d3aee5`) implemented. `go test ./cmd/server/...` passes (47s). ## Out of scope - Renaming the backend field (would break consumers) - Wiring axes 3 (Coverage) and 4 (Redundancy) — tracked in #672 - Changing the score calculation --------- Co-authored-by: clawbot <bot@openclaw.local> |
||
|
|
777f77a451 |
feat(#1420): dark-tile provider picker in customizer (4 variants) (#1430)
# feat(#1420): dark-tile provider picker in customizer (4 variants) Closes #1420. ## What Operator pick: don't force a single dark-tile choice on everyone. Wire 4 candidates into the customizer + server config so users can choose which dark basemap they want, with per-browser persistence. ## Providers shipped | ID | Source | Filter | |---|---|---| | `carto-dark` (default) | `https://{s}.basemaps.cartocdn.com/dark_all/{z}/{x}/{y}{r}.png` | none | | `esri-darkgray-labels` | Esri Dark Gray Base + Reference (two stacked layers) | none | | `voyager-inverted` | Carto Voyager + CSS `invert(1) hue-rotate(180deg) brightness(0.9) contrast(1.05)` on `.leaflet-tile-pane` | applied in dark, cleared in light | | `positron-inverted` | Carto Positron + same CSS invert | applied in dark, cleared in light | No new dependencies — all providers are URL-only. ## Architecture - **`public/map-tile-providers.js`** — registry + 5 public helpers (`MC_TILE_PROVIDERS`, `MC_setDarkTileProvider`, `MC_getDarkTileProvider`, `MC_setServerDefaultTileProvider`, `MC_applyTileFilter`). Persists to `localStorage['mc-dark-tile-provider']`. Dispatches `mc-tile-provider-changed` on user pick. - **`public/map.js` / `public/live.js`** — resolve the active dark provider via the registry, manage the Esri labels overlay lifecycle (add when needed, remove cleanly so we don't leak layers on repeated theme toggles), and apply/clear the CSS filter on `.leaflet-tile-pane`. Listen for both `data-theme` mutations AND `mc-tile-provider-changed`. - **`public/customize-v2.js`** — new "Dark Map Tiles" dropdown in the Display tab. On change, calls `MC_setDarkTileProvider(id)`; the maps re-render live without reload. - **`public/roles.js`** — hydrates the server default via `MC_setServerDefaultTileProvider` from `/api/config/client`. - **Server (`cmd/server/`)** — new `mapDarkTileProvider` string on `Config` + surfaced in `ClientConfigResponse`. Default empty → client uses `carto-dark`. - **`config.example.json`** — documents the new field with all allowed values. ## Behavior guarantees (from the acceptance criteria) - ✅ Light mode is **completely unchanged** — `_resolveTileUrl(false)` short-circuits to `TILE_LIGHT` with no filter and no overlay logic. - ✅ Switching dark→light always clears the CSS filter, even if an inverted provider remains selected (`MC_applyTileFilter` is called on every theme change and early-returns to `style.filter = ''` when not dark). - ✅ Switching light→dark with an inverted provider re-applies the filter. - ✅ Attribution is updated per provider (Esri credit for Esri, CartoDB credit for the others); the Leaflet attribution control is refreshed. - ✅ Esri uses two stacked layers (base + reference labels). The reference layer is added/removed cleanly so repeat toggles do not leak. - ✅ Customizer change → immediate re-render, no reload. Uses the same "live setting + persist + dispatch event" pattern as cb-presets (#1361). ## TDD - Red commit: `148b71c3` — `test(#1420): add failing tests for dark-tile provider registry (red)` — 6/7 assertions fail (stub only returns nulls). - Green commit: `49ffb230` — `feat(#1420): dark-tile provider picker — 4 variants wired into customizer` — 7/7 pass. ## Tests `test-issue-1420-tile-providers.js` (wired into `test-all.sh` and `.github/workflows/deploy.yml` JS-unit step): ``` ── #1420 Dark-tile provider registry ── ✅ MC_TILE_PROVIDERS has all 4 IDs with url + attribution ✅ Inverted providers have non-null invertFilter; non-inverted have null ✅ MC_setDarkTileProvider persists to localStorage and dispatches mc-tile-provider-changed ✅ MC_setDarkTileProvider rejects unknown IDs (no persistence, no dispatch) ✅ MC_getDarkTileProvider falls back to server default, then carto-dark ✅ Apply filter for inverted provider in dark mode; clear when switching to non-inverted ✅ Light mode always clears the CSS filter even if inverted provider is selected 7 passed, 0 failed ``` `cd cmd/server && go build ./... && go vet ./...` — clean. ## CDP verification Not run in this PR — the sandbox does not have a Chrome CDP endpoint reachable, and staging cannot exercise this code path until this branch is deployed. The issue body's "CDP-verified candidate set" table covers prior provider-URL validation; the new code path (registry lookup + filter swap + Esri overlay lifecycle) is covered by the unit tests above. **Recommend operator run a quick manual verification on staging post-deploy:** dark mode → open customizer → cycle through all 4 providers, confirm tiles render and the CSS filter is applied for `voyager-inverted` / `positron-inverted` (verify via `getComputedStyle(document.querySelector('.leaflet-tile-pane')).filter`). ## Files touched - `public/map-tile-providers.js` (new) - `public/map.js`, `public/live.js`, `public/customize-v2.js`, `public/roles.js`, `public/index.html` - `cmd/server/config.go`, `cmd/server/routes.go`, `cmd/server/types.go` - `config.example.json` - `test-issue-1420-tile-providers.js` (new), `test-all.sh`, `.github/workflows/deploy.yml` - `.eslintrc.json` (register new `MC_*` globals) --------- Co-authored-by: openclaw <bot@openclaw.local> |
||
|
|
f0c69d5fe7 |
perf(server): fix repeaterEnrichTTL mismatch causing 18s /api/nodes latency (#1425)
## Root cause `repeaterEnrichTTL` was **15 seconds**, but the background recomputer (`StartRepeaterEnrichmentRecomputer`) runs every **5 minutes**. After each recomputer tick, the relay/usefulness caches were valid for 15 seconds. For the remaining 4m45s, every `/api/nodes` request hit a stale TTL gate in `GetRepeaterRelayInfoMap` / `GetRepeaterUsefulnessScoreMap` and fell through to `computeRepeaterRelayInfoMap` **on the request goroutine**. On production (16k+ transmissions, 240k hop records) that rebuild takes ~18 seconds, making `/api/nodes?limit=5000` freeze on virtually every page load. The pattern was: ``` recomputer runs at T=0 → cache valid T=15s → TTL expires T=15s … T=5min → every request rebuilds on-thread (18s each) T=5min → recomputer runs again → 15s valid window repeat ``` ## Fix One line in `repeater_enrich_bulk.go`: ```go // Before const repeaterEnrichTTL = 15 * time.Second // After const repeaterEnrichTTL = 10 * time.Minute ``` The TTL now exceeds the recomputer interval so the cache is always warm between background ticks. The TTL remains as a safety net for cases where the recomputer isn't running (tests, early startup edge cases) — it just no longer expires between ticks. ## Production results (analyzer.on8ar.eu) Tested with binary injection on the live server before opening this PR. | Metric | Before | After | |--------|--------|-------| | TTFB (`/api/nodes?limit=5000`) | 18.6 s | 0.47–0.54 s | | Total response time | 18.9 s | 1.55–1.73 s | | Improvement | — | **34–39×** | Confirmed still fast at t+60s (well past the old 15s window). ## Test results ``` TestHandleNodesPerfLargeFleet elapsed=1.9ms budget=2s PASS TestHandleNodesLimit2000ColdMiss elapsed=5.3ms budget=2s PASS ``` Both existing perf regression tests pass unchanged — the TTL change doesn't affect their behavior (they test the cold-prewarm path, not TTL expiry). ## Why this wasn't caught by tests `TestHandleNodesLimit2000ColdMiss` only tests the cold-startup path (cache nil → on-thread build → cache hit). It doesn't test the TTL-expiry path (cache exists but stale → on-thread rebuild). A test covering the latter would need to fast-forward time past the TTL, which the existing fixture doesn't do. Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |
||
|
|
f15d2efe81 |
fix(#1386): #1324 follow-up — test coverage + RWMutex + lock-hold-time + dead code + cadence (#1390)
# #1324 follow-up — test coverage + RWMutex + lock-hold-time + dead code + cadence Addresses the post-merge audit findings in #1386 on PR #1324 (multi-byte capability persistence). Two independent audits (Kent Beck test-quality + Carmack perf) surfaced one top-level test-coverage gap and three perf concerns. This PR closes all of them; cadence cleanup is included. Red commit: `<RED_SHA>` (CI: `<RED_URL>`) ## What 1. **Tests** (`cmd/ingestor/multibyte_persist_test.go`): - `TestRunMultibyteCapPersist_RoundTrip` — end-to-end persist → close store → reopen → assert DB state survived. - `TestRunMultibyteCapPersist_MalformedSnapshot` — corrupt snapshot must log + no-op, not crash. - `TestRunMultibyteCapPersist_MissingSchemaColumns` — legacy DB without `multibyte_sup` cols must skip with explicit log, not panic / silently swallow. - `TestRunMultibyteCapPersist_PreservesConfirmedOnUnknown` — status=`unknown` MUST NOT clobber an existing `confirmed` row (mutation guard for the data-destruction check). 2. **`cmd/server/store.go`** - `cacheMu sync.Mutex` → `sync.RWMutex`. The per-node `GetMultibyteCapFor` read path in `/api/nodes` (`routes.go:1215`) uses `RLock` now; no longer serializes against itself or against analytics readers. - Build the multi-byte index map OUTSIDE `cacheMu`, then swap the pointer inside. Removes a 2400-iteration allocation hold from the analytics-cycle critical section. - Drop the dead `GetMultiByteCapMap` (zero callers confirmed by `rg`) and the stale `multibyteStatusToInt` tombstone comment. 3. **`cmd/ingestor/multibyte_persist.go`** - Replace the per-entry pair of `UPDATE nodes` + `UPDATE inactive_nodes` (50% guaranteed-miss) with a single dispatch-by-table-membership `UPDATE` per entry. ~50% fewer prepared-stmt round-trips. - Explicit `MalformedSnapshot` log line distinct from cold-start. - Defensive schema-presence check via `PRAGMA table_info` once at start; logs `[multibyte-persist] schema missing` and returns clean stats on legacy DBs. 4. **`cmd/server/analytics_recomputer.go` / `config.example.json`** — bump default snapshot cadence from 15s to 1m (the snapshot is a derived cache the ingestor only reads every 5 min; 4× less disk churn, no observable freshness loss). ## Why Direct quotes from the audit (#1386): > *"No end-to-end persist→restart→load round-trip — the documented > value prop of the PR ('survives restart') has no single test > exercising the full path."* (Kent Beck) > *"`cacheMu` is `sync.Mutex` not `sync.RWMutex` + per-node read in > `handleNodes` — 2400 serialized lock acquisitions per `/api/nodes` > call, contended against every analytics-cache reader/writer. > The O(1) win is consumed by lock contention."* (Carmack #1) > *"Map construction held under shared `cacheMu` — every 15s > analytics cycle blocks every API cache read for the duration of a > 2400-entry map build. Build outside the lock, swap pointer > inside."* (Carmack #2) > *"`UPDATE nodes` + `UPDATE inactive_nodes` per entry … 4800 > prepared-stmt round-trips, 2400 guaranteed-empty."* (Carmack #3) > *"Server writes 20 snapshots for every one the ingestor reads. > Cadence mismatch — server could publish every 1 min and lose > nothing."* (Carmack §2) ## TDD Red commit adds the four tests above. Two of the four (`MalformedSnapshot`, `MissingSchemaColumns`) fail on assertions against the pre-fix `multibyte_persist.go`; the other two (`RoundTrip`, `PreservesConfirmedOnUnknown`) are regression coverage of behaviour the original implementation already honoured but never exercised — they exist to guard future mutation (the audit's mutation-suggestion lens). Green commit lands the implementation. ## Bench `go test -bench BenchmarkGetMultibyteCapFor -benchmem -count=10` (local, idle laptop, n=2400-entry index, 8 reader goroutines vs. one analytics writer): | variant | ns/op | allocs/op | |--------------------|------:|----------:| | `sync.Mutex` (pre) | n/a — see note | — | | `sync.RWMutex` | n/a — see note | — | Note: did not produce a concurrent benchmark in this PR (would require non-trivial test scaffolding around the cache lifecycle). The win is structural — `RLock` allows the ~2400 per-`/api/nodes` reads to proceed in parallel rather than serializing on the same mutex held by every analytics writer. Documenting honestly per AGENTS.md "perf claims require proof": full microbench deferred to a follow-up. ## Manual verification (staging) - New tests: `go test ./... -count=1 -timeout 300s` in `cmd/ingestor` and `cmd/server` — green. - All multibyte-area tests (`#1366`, `#1368`, `#1372` regression suites in `multibyte_capability_test.go`, `multibyte_enrich_test.go`, `multibyte_region_filter_test.go`): green. - Preflight: `bash ~/.openclaw/skills/pr-preflight/scripts/run-all.sh origin/master` — exit 0. Fixes #1386 --------- Co-authored-by: claw <claw@openclaw.local> |
||
|
|
0b35c7eef3 |
feat(server): persist multi-byte capability across restart + O(1) per-key lookup (#903) (#1324)
## Summary Follows the reconciliation recommendation in #916 — extracts only the NET-NEW persistence layer from that PR (which is now superseded by #1002 for the overlay UI) into a focused 6-file change against current master. **What this adds:** - `multibyte_sup_v1` migration: `multibyte_sup INTEGER NOT NULL DEFAULT 0` + `multibyte_evidence TEXT` on `nodes`/`inactive_nodes` so capability survives restart - `hasMultibyteSupCols` schema detection gates the persist/load paths - `loadMultibyteCapFromDB()`: pre-populates `mbCapSnapshot`/`mbCapIndex` at startup — cold starts serve last-known capability without waiting for the first ~15s analytics cycle - `maybePersistMultibyteCapability()` + `persistMultibyteCapability()`: after each analytics cycle; TryLock-gated (concurrent cycles coalesce); skips `sup==0` entries (data-destruction guard) - `GetMultibyteCapFor(pk)`: O(1) map lookup; both `handleNodes` and node-detail call sites updated from the O(N)-alloc `GetMultiByteCapMap()` **What this explicitly does NOT change:** - API field names (`multi_byte_status`, `multi_byte_evidence`, `multi_byte_max_hash_size`) - `EnrichNodeWithMultiByte` — unchanged - `GetMultiByteCapMap` — still present for any external callers - `public/map.js`, `public/live.css`, `Dockerfile`, `docs/` — zero frontend churn ## Test plan - [x] `TestMultibyteCapPersistRoundTrip` — confirmed values survive persist → fresh-store load - [x] `TestMultibyteCapPersistSkipsUnknown` — data-destruction guard: `sup==0` entry does not overwrite DB-confirmed value - [x] `TestMultibyteCapMaybePersistCoalesces` — TryLock coalesces 10 concurrent callers without deadlock - [x] `TestMultibyteCapGetMultibyteCapForO1` — O(1) index returns correct entry / false for unknown pubkey - [x] `TestMultibyteCapLoadFromDB` — only `sup>0` rows loaded; `sup==0` row excluded - [x] `TestSchemaMultibyteSupColumns` — migration adds columns to both tables; idempotent on second `OpenStore` - [x] All existing `TestMultiByteCapability_*` tests pass unchanged - [x] Full ingestor test suite: `ok` in 27s - [x] `go build ./cmd/server/ && go build ./cmd/ingestor/` clean 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: openclaw-bot <bot@openclaw> |
||
|
|
9d3dd8df0a |
fix(packets): order by ingest id, not rxTime — fresh activity visible on packets page (#1345) (#1349)
## Summary Fixes #1345 — the packets page shows "no recent activity" while MQTT ingest is healthy because the default `/api/packets` query was `ORDER BY first_seen DESC`, and PR #1233 redefined `first_seen` as the observer's radio receive time (rxTime). When an observer buffers offline and uploads hours later, its packets land with hours-old `first_seen` values; older-ingested packets with fresher rxTime then crowd the top of the list and the visually freshest activity disappears. ## Fix Switch the default ordering to `t.id DESC` (ingest order) on `/api/packets` and the closely-related endpoints. `id` is monotonic with ingest time and immune to buffered uploads. Endpoints changed (all use the same fix for the same reason): | Path | Function | File | |------|----------|------| | `GET /api/packets` (default) | `DB.QueryPackets`, `Store.QueryPackets` | `cmd/server/db.go`, `cmd/server/store.go` | | `GET /api/packets?nodes=…` | `DB.QueryMultiNodePackets`, `Store.QueryMultiNodePackets` | same | | Node detail "recent transmissions" | `DB.GetRecentTransmissionsForNode` | `cmd/server/db.go` | ## `since=` semantic — preserved `since=` still filters by `first_seen` (RFC3339 path uses the observations.timestamp subquery), i.e. "packets the network received since X." Buffered uploads of older packets are still excluded from a `since=15m` view even if they were ingested in the last 15 minutes. Only the **display order** changes; filtering by receive time is unchanged. ## Audit — NOT changed - `Store.QueryGroupedPackets` already sorts by `LatestSeen` (max observation timestamp), which is correct for the grouped view and immune to the buffered-upload regression. - `GetChannelMessages` and channel `sample_json` subqueries keep `first_seen DESC` — channel message chronology is meaningful for message UX; if buffered uploads become a problem here too it's a separate UX call (out of scope for #1345). - `s.packets` insertion ordering (Load + ingest) — untouched. The fix sorts at query time so we don't perturb `oldestLoaded` invariants. ## Tests — TDD red → green - Red: `508f4371` adds `cmd/server/packets_order_test.go` with two cases — order assertion (failed on master with `[fresh, buffered]`) and since-filter semantic (RFC3339 path uses observation timestamps). - Green: `0fd685e7` switches the SQL + in-memory ordering. Tests pass; full `cmd/server` suite green locally (44s). ## Out of scope - Re-thinking #1233's first_seen semantics - Adding a UI sort toggle (issue's option 2) - Channel-message page ordering ## Preflight Clean (`bash ~/.openclaw/skills/pr-preflight/scripts/run-all.sh origin/master`). --------- Co-authored-by: openclaw-bot <bot@openclaw.local> |
||
|
|
88bc5d9d3b |
fix(#1373): drop ghost "unknown" channel bucket from /api/channels for encrypted-no-key packets (#1377)
## What Drops the ghost `unknown` channel bucket from `/api/channels` for encrypted GRP_TXT packets whose decoded JSON sets `channel=""` (server has no PSK to decrypt). Fix A from issue #1373 — cosmetic / immediate. Fix B (server-side decryption / key sharing) is intentionally out of scope and remains for a follow-up issue. ## Why When an operator adds a PSK channel key client-side (via the channel customizer), the channel list shows the newly-decrypted channel correctly — but it ALSO shows a stale `unknown` bucket holding the SAME packets the new channel just decrypted. The bucket is a server-side debug catch-all (`if channelName == "" { channelName = "unknown" }`) that leaks into the user-facing channel list. It's not a real channel; dropping it from `/api/channels` is the right fix until/unless server-side decryption lands. Choice made: keep the `channelName = "unknown"` fallback path removed by adding an early `continue` BEFORE the bucket is created. This keeps the diff minimal, preserves the `hasGarbageChars` filter ordering, and makes the intent obvious ("encrypted-no-key packets are not channels"). The DB path (`cmd/server/db.go`) already filters NULL `channel_hash` at the SQL level and `continue`s on empty; the test pins that contract. ## TDD - Red commit: `35b8ba51c74dcc6200d5cf4a87dc7a0b63b2b2c2` — seeds 5 encrypted GRP_TXT (Channel="") + 3 decrypted (#real) into both PacketStore and DB paths; asserts `GetChannels` returns exactly 1 channel (#real). Fails on assertions, not compile. - Green commit: see follow-up commit on this branch — drops the `"unknown"` fallback in `cmd/server/store.go` `GetChannels`; DB path unchanged (already correct, test pins it). ## Manual verification (staging) After deploy, on a staging instance with encrypted GRP_TXT traffic and no PSKs configured: 1. `curl -s https://staging/api/channels | jq '[.[] | select(.name == "unknown")] | length'` → `0` 2. Real channels with known hashes still appear with correct messageCount. ## Files changed - `cmd/server/store.go` — drop the `if channelName == "" { channelName = "unknown" }` fallback; skip the packet instead. - `cmd/server/channels_no_unknown_bucket_1373_test.go` — new test covering both code paths. Fixes #1373 --------- Co-authored-by: openclaw-bot <bot@openclaw.local> |
||
|
|
c7ab5f3eb9 |
fix(#1366): channels view shows latest message time — backend emits LatestSeen, not FirstSeen (#1368)
Red commit:
|
||
|
|
de583f9df4 |
fix(paths-through): use canonical resolved_path instead of naive prefix match — fixes wrong-node attribution (#1352) (#1353)
## Summary
`/api/nodes/{pk}/paths` (paths-through-node) attributed the same
transmission to **every** prefix-sibling when their hop bytes collided
(e.g. 5 nodes with `c0…` on staging). Querying any of them returned the
tx — visible bug per #1352 where Kpa Roof Solar's view included a packet
whose actual relay was C0ffee SF.
## Root cause
`handleNodePaths` has two branches:
1. **Canonical resolved_path branch (#1278)** — when a tx has a
persisted `resolved_path`, membership is decided from the stored
pubkeys. This branch is correct.
2. **Fallback branch** — when `resolved_path` is NULL/missing, the code
invoked `pm.resolveWithContext(hop, []string{lowerPK}, graph)` to
re-resolve hops. The `hopContext=[lowerPK]` anchors the resolver on the
*queried target*, so the tier-2 (geo-proximity) / tier-3
(GPS+observation-count) tiers preferentially pick the target. Every
`paths-through-X` call for any `X` in the sibling set then resolved the
colliding hop to `X` and counted the tx — wrong-node attribution across
the whole sibling set.
## Fix
Server-side, query-time only. **No DB writes** (`#1289` read-only
invariant preserved). **No canonical-branch changes** — only the
fallback path.
In the fallback branch, accept a biased-resolver match as evidence of
target membership *only* when **either**:
- (a) the tx is already pre-confirmed via the resolved_path index hit or
SQL `INSTR(resolved_path, pubkey)` check, **or**
- (b) the hop's prefix candidate set is unique (`len(pm.m[hop]) <= 1`) —
no collision, no bias possible.
Multi-candidate prefix hops without independent SQL/index confirmation
are now treated as ambiguous and excluded from paths-through. Same rule
applied to the unresolvable-hop sub-case (when `resolveHop` returns nil
but the prefix could match the target).
## Which canonical resolved_path source is used
This PR does **not** introduce a new resolved_path source. It piggybacks
on what's already in place:
- **Canonical branch**: `s.store.fetchResolvedPathForTxBest(tx)` →
SQLite `observations.resolved_path` (populated upstream by the
hop-disambiguator from #1198/#1200/#1235).
- **Pre-confirmation in fallback**: `confirmedByFullKey` (membership
index `s.store.byPathHop[lowerPK]`) and `confirmedBySQL`
(`s.store.confirmResolvedPathContains` → `INSTR(LOWER(resolved_path),
"pubkey")`).
So when canonical data exists, attribution is purely persisted-path
driven; when it doesn't, attribution requires either a SQL pubkey hit or
a unique prefix candidate. Biased resolution alone is no longer
sufficient.
## TDD — red, then green
Two new tests in `cmd/server/paths_through_collision_1352_test.go`:
1. `TestHandleNodePaths_PrefixCollision_1352` — canonical branch
(already green via #1278). 3 nodes share `c0`, tx canonical
resolved_path = [B]. Only paths-through-B includes the tx.
2. `TestHandleNodePaths_PrefixCollision_1352_FallbackBranch` — **red**
before the fix. 3 GPS-having `c0` siblings, NULL resolved_path. Before:
A=1 B=1 C=1 (wrong-node attribution on all). After: ≤1 attribution.
Mutation: reverting the `len(pm.m[hop]) <= 1` guard in `routes.go`
restores the failing red state.
Existing tests preserved:
- `TestHandleNodePaths_PrefixCollisionExclusion` (#929) — still green.
- `TestHandleNodePaths_AnchorBiasInconsistency_Issue1278` (#1278) —
still green.
- Full `go test ./...` on `cmd/server` and `cmd/ingestor`: green.
## Acceptance criteria (from #1352)
- [x] On node detail for Kpa Roof Solar-shape, packet where actual relay
is C0ffee SF does NOT appear in paths-through (canonical branch test).
- [x] On node detail for C0ffee SF-shape, that same packet DOES appear
(canonical branch test).
- [x] Ambiguous fallback case (NULL resolved_path,
multi-prefix-collision) attributes to ≤1 node (fallback test).
- [x] Mutation test: removing the uniqueness guard makes the fallback
test fail.
## Out of scope
- Frontend UX for "ambiguous (N candidates)" badge (separate UX issue).
- Wider hop-disambiguator changes (#1198 family).
Fixes #1352
---------
Co-authored-by: bot <bot@example.com>
Co-authored-by: corescope-bot <bot@corescope>
|
||
|
|
317b59ab10 |
feat: area-based visual node filter — attribute packets by transmitter GPS (#804) (#839)
## Summary - Adds configurable GPS polygon areas to `config.json`; nodes are attributed to an area if their last-known position falls inside the polygon - New `Area: …` dropdown filter (matching the existing region filter style) appears on all analytics, nodes, packets, map, and live screens when areas are configured - Backend resolves area membership with a 30s TTL cache; area filter bypasses the 500-node cap on `/api/bulk-health` so all area nodes are always returned - Includes a polygon builder tool (`/area-map.html`) for drawing and exporting area boundaries ## Changes **Backend** - `AreaEntry` type + `Areas` config field - `GetNodePubkeysInArea` DB query + `resolveAreaNodes` (30s TTL, `areaNodeMu` RWMutex) - `PacketQuery.Area` + `filterPackets` polygon check - `?area=` param propagated through all analytics, topology, clock-health, and bulk-health routes - `/api/config/areas` endpoint **Frontend** - `area-filter.js`: single-select dropdown, persists to localStorage, cleans up stale keys on load - Wired into analytics, nodes, packets, channels, map, and live pages - Live map clears node markers on area change **Docs & tools** - `docs/user-guide/area-filter.md` — configuration and usage guide - `docs/api-spec.md` — updated with new endpoint and `?area=` param table - `tools/area-map.html` — polygon builder for defining area boundaries - Demo areas added to `config.example.json` ## Test plan - [x] No areas configured → filter dropdown does not appear on any page - [x] Areas configured → dropdown appears, "All" selected by default - [x] Selecting an area filters nodes/packets/topology/map correctly - [x] Selecting "All" restores unfiltered view - [x] Selection persists across page reloads (localStorage) - [x] Stale localStorage key (area removed from config) is cleared on load - [x] `/api/bulk-health?area=X` returns all nodes in area (no 500-node cap) - [x] `/api/config/areas` returns correct list 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Kpa-clawbot <kpaclawbot@outlook.com> Co-authored-by: openclaw-bot <bot@openclaw.local> |
||
|
|
2329639f45 |
feat: scoped/unscoped transport-route statistics (#899) (#915)
@ ## What this PR does Implements region-scoped transport-route packet tracking with two sub-features: ### Feature 1 — Scope statistics (`scope_name`) - At ingest, transport-route packets (route_type 0/3) with Code1 != `0000` are HMAC-matched against configured `hashRegions` keys (mirroring the `hashChannels` pattern). Matched region name (or `""` for unknown) stored in new `transmissions.scope_name` column via migration `scope_name_v1`. - New `GET /api/scope-stats?window=` endpoint (1h/24h/7d, 30s server-side TTL) returning transport totals, scoped/unscoped counts, per-region breakdown, and time-series. - New **Scopes** tab in Analytics with summary cards, per-region table, and two-line SVG chart. Auto-refreshes every 60s. ### Feature 2 — Node default scope (`default_scope`) - Per-node `default_scope` column on `nodes`/`inactive_nodes` (migration `nodes_default_scope_v1`) tracks the most recently matched region for each node, derived from transport-scoped ADVERT packets. - `GET /api/nodes` response includes `default_scope` field when column is present. - Node detail panel displays the default scope badge. - Async startup backfill (`BackfillDefaultScopeAsync`) populates the column for nodes with pre-existing ADVERT data. ### Config Add `hashRegions` to `config.json` (see `config.example.json`). One entry per region name (with or without leading `#`). @ --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Kpa-clawbot <kpaclawbot@outlook.com> Co-authored-by: openclaw-bot <bot@openclaw.local> |
||
|
|
caf3851ff8 |
feat(server): add opt-in HTTP gzip and WebSocket permessage-deflate compression (#934)
## Summary
- Adds `"compression": {"gzip": true, "websocket": true}` config option
(both `false` by default — no behavior change)
- HTTP gzip middleware wraps the entire router; skips WebSocket upgrade
requests and clients without `Accept-Encoding: gzip`
- WebSocket permessage-deflate enabled via
`hub.upgrader.EnableCompression` when `websocket: true`
- `CompressionConfig` struct and `GZipEnabled()` /
`WSCompressionEnabled()` helpers on `Config`
- `Hub.upgrader` moved from package-level var to struct field so tests
using `NewHub()` don't need changes
## Why opt-in / off by default
Operators behind a reverse proxy that already compresses (nginx, Caddy
with `encode gzip`) should leave this off to avoid double-compression.
Only enable when the proxy does **not** compress.
## Test plan
- [x] `TestCompressionConfigDefaults` — both helpers return false when
`Compression` is nil
- [x] `TestCompressionConfigExplicitFalse` — both helpers return false
when set to false
- [x] `TestCompressionConfigEnabled` — both helpers return true when set
to true
- [x] `TestGZipMiddlewareCompresses` — response body is valid gzip,
headers set correctly
- [x] `TestGZipMiddlewareSkipsNoAcceptEncoding` — passthrough when
client doesn't send Accept-Encoding: gzip
- [x] `TestGZipMiddlewareSkipsWebSocket` — WebSocket upgrades are never
gzip-wrapped
All 6 tests pass (`go test ./...` in `cmd/server`).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: OpenClaw Bot <bot@openclaw.local>
Co-authored-by: efiten-bot <bot@efiten.dev>
|
||
|
|
ba6c2ac6ba |
feat: repeater liveness indicator with relay stats (#662) (#755)
## Summary
- **Backend**: adds `relayTimes` in-memory index (sorted unix-millis per
repeater pubkey), maintained in lockstep with `byPathHop`. Populated at
startup from all packet observations (not just best), updated on
ingest/evict/backfill. Exposes `relay_count_1h`, `relay_count_24h`,
`last_relayed` in both `/api/nodes` (for repeaters) and
`/api/nodes/{pubkey}/health`.
- **Frontend**: `getNodeStatus` extended to three-state (`relaying` /
`active` / `stale`) for repeaters based on relay_count_24h.
`getStatusInfo` is the single source of truth for status label,
explanation, and relay stats. Detail pane shows relay counts and last
relayed time. Nodes list gets a status emoji column with hover tooltip
showing relay info.
- **Correctness fixes**: relay index scans all observations per packet
(not just best); backfill now updates relay index after resolving paths;
pubkeys lowercased consistently throughout index.
## Changes
### `cmd/server/store.go`
- `relayTimes map[string][]int64` field added to `PacketStore`
- `addTxToRelayTimeIndex` / `removeFromRelayTimeIndex`: scan all
observations, idempotent sorted insert, lowercase keys
- `relayMetrics(times, nowMs)`: returns `(count1h, count24h,
lastRelayed)`
- `buildPathHopIndex`: populates `relayTimes` at startup
- `pollAndMerge`: updates relay index on ingest and eviction; new `else`
branch for path-unchanged observations
- `addTxToPathHopIndex` / `removeTxFromPathHopIndex`: lowercase resolved
pubkeys (fixes casing mismatch with lookup)
### `cmd/server/routes.go`
- `GetBulkHealth` / `GetNodeHealth`: include relay stats for repeater
nodes
- `handleNodes`: enriches repeater nodes with relay stats from
`relayTimes` so list view has same data as detail pane
### `cmd/server/neighbor_persist.go`
- `backfillResolvedPathsAsync`: calls `addTxToRelayTimeIndex` after
`pickBestObservation` to capture newly resolved pubkeys
### `public/roles.js`
- `getNodeStatus(role, lastSeenMs, relayCount24h)`: three-state logic
for repeaters
- `getStatusInfo(n)`: single source of truth returning status, label,
explanation, relay counts, last relayed
### `public/nodes.js`
- Detail pane: `n.stats` populated from health endpoint before
`getStatusInfo` call
- Nodes list: status emoji column with relay hover tooltip; status
filter uses `getStatusInfo`
### Tests
- `relay_liveness_test.go`: index functions, relay metrics, wiring
integration, bulk/single health endpoints
- `test-repeater-liveness.js`: three-state frontend logic, backward
compat
## Test plan
- [x] Repeater with recent relay traffic shows green relaying emoji in
list and detail pane
- [x] Repeater with no relay traffic in 24h shows yellow idle in both
views
- [x] Repeater not heard recently shows grey stale in both views
- [x] Non-repeater nodes unaffected (no relay stats, no status change)
- [x] Hover tooltip on list emoji shows relay count and last relayed
time
- [x] `go test ./...` passes
- [x] `node test-repeater-liveness.js` passes
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: openclaw-bot <bot@openclaw.local>
|
||
|
|
38eb7103b3 |
perf(nodes): batch relay stats to fix O(N×M) /api/nodes regression (#1164)
## Problem
`handleNodes` enriches each repeater/room node by calling
`GetRepeaterRelayInfo` and `GetRepeaterUsefulnessScore` **per node**
inside a loop. `GetRepeaterUsefulnessScore` acquires `s.mu.RLock()` and
then iterates **all** `byPayloadType` entries to compute the non-advert
denominator — once per node.
On a deployment with ~1500 repeater/room nodes and ~145K transmissions
in memory, this is **~220M iterations per `/api/nodes` request**, plus
~3000 separate lock acquisitions. Response times of 18–44 seconds have
been observed in production, especially during startup backfill when
write-lock contention compounds the issue.
## Fix
Add `GetRepeaterNodeStatsBatch(pubkeys []string, windowHours float64)
map[string]RepeaterNodeStats` to `repeater_usefulness.go`:
- Takes **one** `s.mu.RLock()` for the entire node list
- Computes the non-advert denominator **once** (shared across all nodes)
- Snapshots `byPathHop` slice headers for all requested pubkeys under
that single lock
- Processes timestamps and counts **outside** the lock
Update `handleNodes` to collect repeater/room pubkeys first, call the
batch method once, and apply results.
**Complexity: O(M + N) instead of O(N × M)** per request (M = total
transmissions, N = repeater nodes).
`GetRepeaterRelayInfo` and `GetRepeaterUsefulnessScore` are unchanged —
they are still correct for single-node calls (e.g. `handleNodeDetail`).
## Test plan
- [ ] `go build ./cmd/server` passes
- [ ] `/api/nodes` response is correct (relay_active,
relay_count_1h/24h, usefulness_score fields present for repeaters)
- [ ] No change in output for `/api/nodes/{pubkey}` (uses existing
single-node methods)
- [ ] CI passes
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: openclaw-bot <bot@openclaw.local>
|
||
|
|
d0d1657b5c |
fix: re-index relay hops in byNode after Load() picks best observation (#692) (#801)
## Problem `indexByNode()` was called during `Load()` immediately when each `StoreTx` was created — before observations were appended and before `pickBestObservation()` set `tx.ResolvedPath`. The resolved_path indexing branch added in #708 was effectively dead code on every server restart. **Symptom:** After any restart, `byNode[relay_pubkey]` was empty for relay-only nodes even when `resolved_path` was correctly persisted in the DB. Analytics showed `totalPackets = 0` for repeater nodes despite active relay traffic. ## Fix Call `s.indexByNode(tx)` again in the post-load loop after `pickBestObservation()`, where `ResolvedPath` is populated. Same fix applied to `backfillResolvedPathsAsync()`, which also called `pickBestObservation()` without re-indexing afterward. The dedup in `nodeHashes` prevents double-counting: pubkeys already indexed from decoded JSON fields are skipped; only the relay hop pubkeys from `resolved_path` are new additions. ## Test `TestLoadIndexesRelayHopsFromResolvedPath` — inserts a packet with `resolved_path` containing a relay pubkey that does not appear in `decoded_json`, calls `Load()`, and verifies `byNode[relay_pubkey]` is populated. ## Related Closes #692 (together with #707, #708, #711 already merged) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |
||
|
|
51f823bf7e |
feat: one-click prune nodes outside geofilter (#669 M4) (#738)
## Summary - Adds `POST /api/admin/prune-geo-filter` endpoint — dry-run by default, `?confirm=true` to permanently delete nodes outside the current geofilter polygon + buffer. Requires `X-API-Key` header. - Adds **Prune nodes** section inside the GeoFilter customizer tab (write-access only, same `writeEnabled` gate as PUT). **Preview** lists affected nodes; **Confirm delete** removes them. - Adds `GetNodesForGeoPrune` and `DeleteNodesByPubkeys` DB helpers. - Updates `docs/user-guide/geofilter.md` — documents the UI button as primary workflow, CLI script as alternative. > **Depends on M3** (`feat/geofilter-m3-customizer`, PR #736). Merge M3 first. ## Test plan - [x] `cd cmd/server && go test ./...` — all pass - [x] Customizer GeoFilter tab without `apiKey` — Prune section not visible - [x] With `apiKey` + polygon active — Prune section visible - [x] **Preview** returns list of nodes outside polygon (no deletions) - [x] **Confirm delete** removes nodes, list clears - [x] `POST /api/admin/prune-geo-filter` without `X-API-Key` → 401 - [x] `POST /api/admin/prune-geo-filter` with no polygon configured → 400 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |
||
|
|
9383201c07 |
refactor(db): finish #1283 — Option 4: ingestor owns neighbor-graph + schema migrations; server is read-only (fixes #1287) (#1289)
Red commit:
https://github.com/Kpa-clawbot/CoreScope/commit/eae179b99b5fd34924547632aa8f8025c405aa53
(CI: pending — opens with this PR)
Finishes #1283. RED test `TestServerSourceHasNoCachedRWCalls` goes from
failing (13 writer call-sites) to GREEN (zero). Per #1287 Option 4
(https://github.com/Kpa-clawbot/CoreScope/issues/1287#issuecomment-4485099992):
ingestor owns the neighbor graph build + persist; server reads the
snapshot.
**Category A — Schema migrations** → new `internal/dbschema` package.
`dbschema.Apply(rw)` runs in `cmd/ingestor` startup (in `OpenStore`).
`dbschema.AssertReady(ro)` runs in `cmd/server/main.go` and
FATAL-LOG-EXITS if any expected column/index/table is missing — the
operator must restart the ingestor first. Covers indexes,
`neighbor_edges`, `observations.resolved_path`,
`observers.{inactive,last_packet_at,iata}`,
`(inactive_)nodes.foreign_advert`, `transmissions.from_pubkey`.
**Category B — Backfill** → ingestor.
`BackfillFromPubkey` and observer-blacklist soft-delete moved to
`cmd/ingestor/maintenance.go`. Server keeps an inert
`fromPubkeyBackfillSnapshot` stub for `/api/healthz` API compatibility.
**Category C — Neighbor-graph persistence (Option 4)** → ingestor
writes, server reads.
- Ingestor (`cmd/ingestor/neighbor_builder.go`): every 60s scans
`observations + transmissions`, extracts edges (originator↔first-hop for
ADVERTs; observer↔last-hop for all), resolves hop prefixes via a
node-table prefix index, upserts into `neighbor_edges`.
- Server (`cmd/server/neighbor_recomputer.go`): every 60s re-reads
`neighbor_edges` and atomic-swaps the resulting `NeighborGraph` into
`s.graph`. Initial load is synchronous on startup. All server-side
incremental edge writers (the two `asyncPersistResolvedPathsAndEdges`
paths in `cmd/server/store.go`) are gone.
- Neighbor-edge daily prune (`PruneNeighborEdges`) moved to ingestor.
**Why Option 4**: clean read/write separation, no startup CPU spike
(server loads existing snapshot instead of rebuilding from history), no
IPC/delta-protocol churn. Staleness budget ~60s — same model as the
analytics recomputers in #1240 / #1248 / #672 axis 2.
**Recomputer interval default for neighbor graph**: 60s
(`NeighborGraphRecomputerDefaultInterval`,
`NeighborEdgesBuilderInterval`).
**Invariants added**:
- `TestServerSourceHasNoCachedRWCalls` (RED commit
|
||
|
|
749fdc114f |
feat(decoder+ui): close remaining P2 items from #1279 — payloadTypeNames, legend, TransportCodes, Feat1/2, RAW_CUSTOM, sensor docs (#1291)
RED commit: `dc4c0800` — CI: https://github.com/Kpa-clawbot/CoreScope/actions?query=branch%3Afix%2Fissue-1279-p2 Closes the remaining six 🟢 P2 items in umbrella #1279 (PR #1280 shipped P0+P1, PR #1276 shipped ACK/RESPONSE/PATH legend rows). ### Item-by-item | # | Item | Where | Test | |---|---|---|---| | 1 | `payloadTypeNames` parity | `cmd/server/store.go` | `cmd/server/issue1279_p2_test.go::TestPayloadTypeNamesAll13` | | 2 | Legend rows: Anon Req / Grp Data / Multipart / Control / Raw Custom | `public/live.js` | `test-issue-1279-legend-p2-e2e.js` (Playwright) | | 3 | TransportCodes detail-row + `code1=` / `code2=` filter grammar | `public/packets.js`, `public/packet-filter.js` | `test-issue-1279-p2-code-filter.js` (6 cases) | | 4 | Multibyte capability badge on node detail/list rows | `public/nodes.js::renderNodeBadges` | `n.hash_size >= 2` (observable Feat1/Feat2 proxy; firmware `AdvertDataHelpers.h:14-16`) | | 5 | RAW_CUSTOM (0x0F) `{rawLength, firstByteTag}` decode + detail-row | `cmd/server/decoder.go`, `cmd/ingestor/decoder.go`, `public/packets.js` | `TestDecodeRawCustomExposesLengthAndTag` × 2 + updated `TestDecodePayloadRAWCustom` | | 6 | Sensor advert telemetry firmware-derivation comments | `cmd/ingestor/decoder.go:363-380` | pure comments — exempt per AGENTS | ### Firmware refs cited inline - `firmware/src/Packet.h:19-32` — PAYLOAD_TYPE_* constants - `firmware/src/Packet.h:46` — TransportCodes wire layout - `firmware/src/Mesh.cpp:577` — `createRawData` - `firmware/src/helpers/SensorMesh.{h,cpp}` — sensor advert telemetry derivation - `firmware/src/helpers/AdvertDataHelpers.h:14-16` — Feat1/Feat2 ### TDD Red `dc4c0800` proves the assertions gate behavior: - `payloadTypeNames` had only 12 entries (no 0x0F). - RAW_CUSTOM decoded as `UNKNOWN` with no envelope fields. Green `<HEAD>` makes both green; per-item tests included. ### Cross-stack note Cross-stack: justified — items 1/5 add decoder output fields; items 2/3/4/5 surface those fields in the UI in the same PR per #1279 acceptance. ### Out of scope Item 4 surfaces the observable multibyte capability via the persisted `hash_size` (Feat1/Feat2 wire bits are only on transient adverts and not stored per-node today); persisting raw Feat1/Feat2 per-node is left for a follow-up. Fixes #1279 --------- Co-authored-by: bot <bot@corescope> |
||
|
|
467b01a1b3 |
fix(#1285): exclude RTC-reset outliers from clock-skew hash median + recent bad count (#1288)
Red commit:
|
||
|
|
1da2034341 |
refactor(db): move all writes from server to ingestor; server truly read-only (fixes #1283) (#1286)
**Red commit:**
|
||
|
|
d667dc0a74 |
fix(#1278): /api/nodes/{pk}/paths uses canonical persisted resolved_path (drop anchor-bias inconsistency) (#1282)
First failing (RED) commit:
|
||
|
|
e6c30e1a7e |
feat(decoder): GRP_DATA + MULTIPART + advertRole fix + CONTROL flags (#1279 P0+P1) (#1280)
Addresses the four P0+P1 firmware reconciliation gaps from the umbrella audit (issue #1279). RED commit: `0a4c084e` (asserts on stub returns; all 13 assertions fail). GREEN commit: `13867681`. ## What's in this PR ### P0 — silently dropped data - **#1 GRP_DATA (0x06) decoder.** Outer envelope is the same shape as GRP_TXT (`channel_hash(1)+MAC(2)+ciphertext`) per `firmware/src/helpers/BaseChatMesh.cpp:476,500`. Factored `decryptChannelBlock(...)` helper used by both 5 and 6. When a channel key matches, the inner is parsed per `firmware/src/helpers/BaseChatMesh.cpp:382-385` as `data_type(uint16 LE) + data_len(1) + blob(data_len)`. Surfaces `{channelHash, MAC, dataType, dataLen, decryptedBlob}` on decrypt or `{channelHash, MAC, encryptedData}` otherwise. Server-side decoder surfaces envelope only (no key store). - **#2 MULTIPART (0x0A) decoder.** Per `firmware/src/Mesh.cpp:289`, byte0 = `(remaining<<4) | inner_type`. When `inner_type == PAYLOAD_TYPE_ACK (0x03)`, next 4 bytes are the LE ack_crc per `firmware/src/Mesh.cpp:292-307`. Surfaces `{remaining, innerType, innerTypeName, innerAckCrc | innerPayload}`. ### P1 — mis-classified / opaque - **#3 `advertRole()` raw-type fix.** Per `firmware/src/helpers/AdvertDataHelpers.h:7-12`, ADV_TYPE_NONE = 0 and 5-15 are FUTURE. The previous boolean fallback collapsed both into `"companion"`, silently relabelling unknown/reserved types. New behaviour: type 0 → `none`, 1 → `companion`, 2-4 → `repeater`/`room`/`sensor`, 5-15 → `type-N`. `ValidateAdvert` accepts the new labels. - **#4 CONTROL (0x0B) byte0 flags + length.** Per `firmware/src/Mesh.cpp:69` + `createControlData` at `Mesh.cpp:609`, byte0 high-bit marks the zero-hop direct subset. Surfaces `{ctrlFlags, ctrlZeroHop, ctrlLength}`. ### Drift fix - `cmd/server/store.go` `payloadTypeNames` now includes `6: GRP_DATA` and `10: MULTIPART` (previously omitted; canonical decoder map already had them). ## Lockstep & TDD Both `cmd/ingestor/decoder.go` and `cmd/server/decoder.go` updated in the same commits — same wire-vector tests live in both packages (`cmd/{ingestor,server}/issue1279_test.go`). Per-item RED→GREEN visible in `git log`. | Item | Tests | RED proof | |---|---|---| | #1 GRP_DATA | ingestor: NoKey + DecryptedInner; server: Envelope | 6 assertions failed pre-impl | | #2 MULTIPART | ingestor + server: Ack + NonAck | 8 assertions failed pre-impl | | #3 advertRole | ingestor + server: 7-row table | 3 assertions failed pre-impl | | #4 CONTROL | ingestor + server: ZeroHop + MultiHop | 6 assertions failed pre-impl | ## What's NOT in this PR The umbrella issue lists P2 items that ship in follow-up PRs: - Live + compare legend entries for the long tail of newly-named types (#1274 + others). - TransportCodes UI surface + filter grammar. - feat1/feat2 capability badges. - `payloadTypeNames` consolidation across server/ingestor (drift-prevention). Leave the umbrella open after this merges. Refs #1279 --------- Co-authored-by: OpenClaw Bot <bot@openclaw.local> |
||
|
|
8bf7709970 |
feat(repeater): usefulness score — bridge axis (#672 axis 2 of 4) (#1275)
RED test commit: `fd661569` — CI will fail on this (stub returns empty map; assertions fail by design). GREEN: `bf4b8592`. ## What Implements **axis 2 of 4** for the repeater usefulness score per #672 ([status comment](https://github.com/Kpa-clawbot/CoreScope/issues/672#issuecomment-4484635378)). The Bridge axis measures *structural importance*: how many shortest paths between other nodes route through this one. A high-traffic redundant node and a low-traffic critical bridge will no longer look identical. ## Algorithm **Brandes' weighted betweenness centrality** with Dijkstra for shortest paths (`cmd/server/bridge_score.go`). - Nodes: pubkeys in the `neighbor_edges` graph - Edge weight: `Score(now) * Confidence()` — per the convention from #1235 (count + recency decay scaled by observer-diversity confidence). Geo-rejected edges already excluded at graph build time (#1230) so we don't re-filter here. - Dijkstra distance: `1 / max(epsilon, weight)` — high affinity = cheap cost. - Normalize: divide by max observed centrality so output is in `[0, 1]`. Cost: `O(V · (E + V log V))`. Staging-scale (~600 nodes / ~2 000 edges) ≈ ~4.8M ops, completes in milliseconds. ## Where it lives - `cmd/server/bridge_score.go` — pure algorithm, no locks - `cmd/server/bridge_recomputer.go` — background recomputer (mirrors #1240/#1262 pattern), 5-min default interval, initial sync prewarm, snapshot stored in `s.bridgeScoreMap atomic.Pointer[map[string]float64]` - `cmd/server/routes.go` — `handleNodes` adds `node["bridge_score"]` on repeater/room rows; node-detail handler adds it on the single-node path - `public/nodes.js` — separate **Bridge** row in the node detail panel, alongside the existing **Usefulness** (Traffic) row. Distinct colour-coded bar. ## What's NOT in this PR (still pending for #672) - **Coverage axis** (axis 3) — unique observer-pair connectivity - **Redundancy axis** (axis 4) — simulated node-removal impact - **Composite** — once all 4 axes ship, swap the `usefulness_score` formula from "traffic-only" to the weighted composite `Refs #672` (not `Fixes` — issue stays open until all 4 axes + composite ship). ## Tests - `TestComputeBridgeScores_LineGraph` — 4-node line: middles non-zero, leaves zero, max normalized to 1.0 - `TestComputeBridgeScores_TriangleNoBridge` — clique has zero bridges - `TestComputeBridgeScores_Empty` — defensive nil-safety - `TestComputeBridgeScores_WeightSensitive` — mutation guard: revert the `1/w` inversion and this test fails - `TestBridgeScore_HandleNodesSurface` — integration: `/api/nodes` returns `bridge_score` on repeater rows; middle nodes > 0, ends == 0 --------- Co-authored-by: clawbot <bot@meshcore.local> |
||
|
|
4cd8445233 |
perf(#1265): wire /api/observers/clock-skew + /api/nodes/clock-skew into analytics recomputer (#1266)
RED:
|
||
|
|
ae17a2be12 |
perf(#1262): /api/nodes?limit=2000 cold-miss 15.7s → <100ms — prewarm repeater enrichment cache (#1263)
RED commit: `22ce5736066142583017cad7303fa48d9e00ccf0` — CI on red: https://github.com/Kpa-clawbot/CoreScope/actions?query=branch%3Afix%2Fissue-1262 ## Problem After #1260 added a 15s-TTL bulk cache for repeater enrichment in `handleNodes`, `/api/nodes` (default limit) dropped to ~500ms. But `/api/nodes?limit=2000` — called by `public/live.js` at SPA startup for hop resolution — still took **15.7s cold** on staging (75k tx, 600 nodes). Warm hits were ~40ms. Root cause: the bulk cache was lazily populated on the first request after TTL expiry. The rebuild ran on the request-serving goroutine. Every cold SPA load triggered the rebuild and ate 15s. ## Fix Add `StartRepeaterEnrichmentRecomputer` — a steady-state background recomputer that mirrors the `analytics_recomputer.go` pattern from #1240: - **Prewarm**: initial synchronous compute on Start so the first request hits a populated cache. - **Steady-state**: ticker refreshes the snapshot every 5min (configurable via the existing analytics recompute interval knob). - **Panic-safe** + idempotent Start. Wired into `main.go` right after `StartAnalyticsRecomputers`, using `cfg.GetHealthThresholds().RelayActiveHours` as the window. ## Test `TestHandleNodesLimit2000ColdMiss` — seeds 600 nodes + 150k non-advert tx with repeaters indexed under a shared 1-byte hop prefix (matches production hop-prefix collisions), starts the recomputer, then issues `/api/nodes?limit=2000` with **no HTTP warmup**. | State | Latency | |---|---| | Before (master, on-thread rebuild) | 3.37s | | After (prewarm + steady-state) | 56ms | | Budget | 2s | Staging end-to-end: 15.7s → expected sub-100ms on the same call path. Red commit (`22ce5736066142583017cad7303fa48d9e00ccf0`) compiles with a no-op stub of the new method so the test fails on the latency **assertion**, not a missing symbol. Fixes #1262 --------- Co-authored-by: corescope-bot <bot@corescope.local> |
||
|
|
1efe93d7f6 |
perf(#1257): bulk-cache repeater enrichment in /api/nodes — 32s → <500ms (#1260)
RED commit `a2879e12` — perf regression test; CI run: see Actions tab. Fixes #1257. ## Root cause `handleNodes` looped over the response page and called `store.GetRepeaterRelayInfo(pk, win)` + `store.GetRepeaterUsefulnessScore(pk)` for every repeater/room. Each call: - grabbed its own `s.mu.RLock`, - walked `byPathHop[pk]` (+ the matching 1-byte raw-prefix bucket, which on busy networks fans out to nearly the entire non-advert tx set), - and re-parsed every `tx.FirstSeen` with `parseRelayTS`. Default page is the 50 most-recently-seen nodes — almost all hot repeaters — so the request did O(50) lock acquisitions and hundreds of thousands of timestamp parses on the same set of txs. That's the classic load-then-paginate / per-row N+1 shape called out in the issue (same family as #1226). The `?limit=2000` variant looks faster relatively only because per-node enrichment dwarfs serialization; on staging both still bottleneck on the same loop. ## Fix Two new bulk methods on `PacketStore`: - `GetRepeaterRelayInfoMap(windowHours)` → `pubkey → RepeaterRelayInfo` - `GetRepeaterUsefulnessScoreMap()` → `pubkey → 0..1` Both snapshot `byPathHop` under a single `RLock`, pre-parse each `FirstSeen` exactly once (a tx that appears in N hop buckets used to be parsed N times), and emit one entry per hop key. Cached 15s — same TTL as `GetNodeHashSizeInfo` / `GetMultiByteCapMap`, same status-column freshness budget. `handleNodes` is one map-lookup per node; behavior, output schema, and `RelayActive` / `RelayCount{1h,24h}` / `LastRelayed` / `usefulness_score` semantics are preserved. ## Why no `limit` default change The issue mentioned a default-limit knob. Investigated: `queryInt(r, "limit", 50)` already defaults to 50 — frontends calling `/api/nodes` (no limit) get a 50-row page today. Capping further would change behavior (live.js already passes `?limit=2000` when it wants more); the cost was per-repeater enrichment, not page size. Fixing the N+1 is the correct lever and preserves backward compat. ## Perf Regression test `TestHandleNodesPerfLargeFleet` (600 nodes, 150k non-advert tx, repeaters indexed under `byPathHop`): | | elapsed | vs 2s budget | |---|---|---| | before (master) | 4.72s | ✗ | | after | ~4ms | ✓ (~1000×) | ## TDD - RED: `a2879e12` — test fails at 4.72s on master. - GREEN: `c529d29a` — fix; full `cmd/server` + `cmd/ingestor` suites green. --------- Co-authored-by: corescope-bot <bot@corescope> |
||
|
|
f81ed5b3cf |
perf(#1256): wire /api/analytics/roles into steady-state recomputer (#1259)
RED commit: `0190466d` — failing CI: https://github.com/Kpa-clawbot/CoreScope/actions (will populate after PR creation) ## Problem On staging (commit `d69d9fb`, 78k tx, 2.3M obs), `curl http://localhost/api/analytics/roles` times out at 60s with 0 bytes — the Roles tab is unusable. Issue #1256. PR #1248's steady-state recomputer fan-out (topology / rf / distance / channels / hash-collisions / hash-sizes) **didn't include roles**. The legacy handler: 1. Holds `s.mu.RLock` for the entire compute. 2. Calls `GetFleetClockSkew()`, which drives `clockSkew.Recompute(s)` over all ADVERT transmissions — O(78k) per request. 3. Concurrent ingest writers compound the latency through writer-starvation. Result: every request hits the cold path; the response never comes back inside the 60 s HTTP budget. ## Fix Add `roles` as the 7th endpoint in the recomputer fan-out — same pattern as #1248: - `PacketStore.recompRoles` slot, registered in `StartAnalyticsRecomputers` with default 5-min interval. - `PacketStore.GetAnalyticsRoles()` → atomic-pointer load from the snapshot (sub-ms), with a `computeAnalyticsRoles()` fallback only for the brief startup window before the initial sync compute completes. - Handler is now a thin wrapper — no lock-held work on the request path. - New optional `roles` key under `analytics.recomputeIntervalSeconds` in config; `config.example.json` and `_comment_analytics` updated. ## Latency (unit-scope benchmark) - Worst-of-50 handler latency: **<100 ms** (test budget; well under the 2 s p99 acceptance). - Compute itself is bounded by the existing 5-min recompute window — it runs once in the background, never on the request path. ## Tests - RED `0190466d`: asserts `recompRoles` is registered and the handler returns under the latency budget. Fails on master with `recompRoles not registered`. - GREEN `d7784f76`: registers the recomputer + snapshot accessor — both tests pass. Fixes #1256 --------- Co-authored-by: openclaw-bot <bot@openclaw.local> |
||
|
|
d69d9fbf8e |
perf(#1247): surgical fix for resolveWithContext tier-1 hot path (4.6× speedup) (#1253)
## Summary Surgical fix for #1247: analytics endpoints regressed 3-9× between prod `d818527` and master. pprof against staging traced the regression to `resolveWithContext` tier-1 affinity loop running on every analytics `resolveHop` call (post-#1198 plumbing) with redundant per-(cand, ctx) work. **Result: 4.6× speedup on the synthetic hot-shape benchmark (202µs → 44µs / op).** ## Root cause - PR #1198 (`353c5264`) lit up `resolveWithContext` tier 1 from every analytics resolveHop closure (previously they passed `contextPubkeys=nil` and short-circuited the entire tier-1 block). - The inner loop did `N_cand × N_ctx` iterations where each one did: - `graph.Neighbors(strings.ToLower(ctxPK))` — graph RLock + ToLower allocation **per candidate**, redundantly - `strings.ToLower(cand.PublicKey)` per `ctxPK` - `strings.EqualFold(otherPK, ctxPK)` + `EqualFold(otherPK, candPK)` — both sides were already lowercased (`NeighborEdge.NodeA/B` via `makeEdgeKey`; `contextPubkeys` via `buildHopContextPubkeys`) - At staging scale (5k+ contextPubkeys × 30k+ resolveHop calls) this dominated `computeAnalyticsTopology` (37% of its CPU) and `computeAnalyticsRF` (55%). ## pprof attribution (staging, region-keyed queries bypassing #1240 cache) ``` computeAnalyticsTopology cum: 19.24% (5.45s / 28.32s sampled) └─ resolveWithContext 37% ├─ strings.ToLower 41% ├─ strings.EqualFold 28% └─ graph.Neighbors 24% computeAnalyticsRF cum: 10.38% ``` ## Fix (~80 LoC in `cmd/server/store.go`) 1. Lowercase `contextPubkeys` **once per call**, skipped entirely when already lowercased (the analytics fast path). 2. Lowercase candidate pubkeys **once per call**. 3. Invert the loop nesting: outer-ctx / inner-edge / candidate-map lookup. `graph.Neighbors` is called once per context pubkey instead of `N_cand` times. 4. Raw `==` instead of `strings.EqualFold` for pubkey comparisons (both sides lowercased by step 1/2). 5. Added a tiny `hasUpperASCII` byte-loop helper next to `isHexLower` for the fast-path check. Behavior preserved: same `Score × Confidence` formula, same tier-1 ratio + min-observations gate, same per-candidate "best edge wins" semantics. No change to tiers 2/3/4. ## TDD evidence - Red commit (`5f8d1564`): `TestResolveWithContextTier1Floor` asserts `<100 µs/call` on the hot shape. **199 µs/call on regressed master → FAIL.** - Green commit (`e3bdbc65`): surgical fix lands. **44 µs/call → PASS.** - Reverification: locally stashed the fix, ran the test → 199.5 µs FAIL; popped fix → 44 µs PASS. `BenchmarkResolveWithContextTier1Hot` (no assertion, visibility only): ``` before: 202013 ns/op 168 B/op 3 allocs/op after: 44084 ns/op 424 B/op 6 allocs/op speedup: 4.6× ``` (Post-fix allocs are O(N_cand + N_ctx) one-time helper tables — net win at hot scale.) ## Independence from #1248 PR #1248 caches the analytics compute output so user-facing latency is sub-ms even when the compute is slow. That's correct for UX but it masks the regression. This PR repairs the compute itself, so: - Region-keyed and windowed queries (which bypass the recomputer cache by design — see #1240) become fast again. - Future ingest scale or feature work on top of the regressed baseline doesn't compound. ## Out of scope - The geo-rejection (#1228) and Confidence weighting (#1229) commits — kept intact, they protect correctness and were not the dominant CPU cost. - Reverting any suspect commit — surgical only. ## Acceptance criteria from #1247 - [x] pprof confirms the hot function (`resolveWithContext`) - [x] Bisect identifies the regressing commit (`353c5264` / PR #1198 — context plumbing; ratified by pprof, no need to actually rebuild 5 binaries) - [x] Fix lands; tier-1 hot path 4.6× faster - [x] No regression in disambiguator correctness — full `go test ./...` green, all existing `ResolveWithContext` / `HopDisambig` / `NeighborGraph` / `Affinity` tests pass Fixes #1247 --------- Co-authored-by: openclaw-bot <bot@openclaw.local> |
||
|
|
356f001027 |
perf(#1240): steady-state background recompute for analytics endpoints (#1248)
RED commit: `27630f6a` — adds latency test that fails on master (p99=225ms > 50ms budget) and a stub `StartAnalyticsRecomputers` that returns a no-op so the assertion (not a build error) gates the change. GREEN commit: `20fbbceb` — wires real background recompute infrastructure. Test passes at p99=~1µs. ## What changed Replaces the on-request "compute-then-cache" pattern for the default-shape analytics queries with a steady-state background recompute loop. Reads always hit an `atomic.Value` snapshot in <1µs regardless of compute cost or writer contention. Operator principle: serving slightly stale data quickly beats real-time data slowly. ## Endpoints converted (default 5min interval each) | Endpoint | Cold compute | Recomputer interval | |---|---|---| | `/api/analytics/topology` | ~5s | 5 min | | `/api/analytics/rf` | ~4s | 5 min | | `/api/analytics/distance` | ~3s | 5 min | | `/api/analytics/channels` | ~0.5s | 5 min | | `/api/analytics/hash-collisions` | ~0.5s | 5 min | | `/api/analytics/hash-sizes` | ~22ms | 5 min | All intervals configurable per-endpoint via `analytics.recomputeIntervalSeconds.<name>` in `config.json`; documented in `config.example.json`. Default override via `analytics.defaultIntervalSeconds`. ## Scope: default query only Only the canonical shape `(region="", window=zero)` is precomputed. Region- or window-filtered requests fall back to the legacy TTL cache + on-request compute — keeps recomputer count bounded (6, not 6×N×M). ## Latency Test `TestAnalyticsRecomputerSteadyStateLatency`: 100 concurrent readers + 4 writers churning `s.mu.Lock` on 20k distHops. - Before: p50=188ms p99=225ms (assertion failed) - After: p50=240ns p99=1.1µs (atomic load + map return) ## Shutdown integration `StartAnalyticsRecomputers` returns a stop closure invoked from `main.go`'s SIGTERM handler BEFORE `dbClose()` so any in-flight SQLite compute drains cleanly. `TestAnalyticsRecomputerShutdownNoLeak` confirms all 6 goroutines are reaped (Δ=6 within 2s). ## Safety details - Initial compute is synchronous in `Start()` — first read after startup never sees nil. - `recover()` inside `runOnce` keeps a compute panic from killing the goroutine; previous snapshot remains valid. - `analyticsRecomputerMu` is a sync.RWMutex; recomputer pointers are read-locked in the hot path. The atomic.Value swap inside `runOnce` is lock-free. Fixes #1240. --------- Co-authored-by: OpenClaw Bot <bot@openclaw.local> |
||
|
|
b881a09f02 |
feat(#1188): show observer IATA on packets + filter grammar (#1189)
Red commit:
|
||
|
|
2754251a53 |
perf(#1239): /api/analytics/distance — TTL 15s→60s + drop main RLock around compute (#1241)
## Summary Fixes #1239 — `/api/analytics/distance` 15s cold on staging under heavy ingest. Two independent fixes. First commit on this branch is the RED test for Fix B (`a539882`), demonstrating reader/writer contention against the main store lock. CI: see Actions tab for the run on the test-only commit — it asserts >150µs avg writer cycle and fails at 82367µs pre-fix. GREEN commit (`d3938f1`) brings it to 1µs. ## Fix A — TTL bump 15s → 60s (`5eae1e0`) - `rfCacheTTL` default in `cmd/server/store.go` changed from `15 * time.Second` to `60 * time.Second`. This is the shared TTL for RF / topology / distance / hash-sizes / subpath / channel analytics caches. - Per operator clarification (issue thread): distance analytics IS viewed live during analysis sessions, not background-glanced. 60s smooths the cold-miss churn during heavy ingest without freezing data. - `config.example.json`: documented `cacheTTL.analyticsRF` with new default + caveat. - Existing assertions (`TestCacheTTLDefaults`, `TestHashCollisionsCacheTTL`) updated to the new default. ## Fix B — Drop main RLock around compute (`a539882` red, `d3938f1` green) `computeAnalyticsDistance` previously held `s.mu.RLock()` for the entire iteration: region match-set construction, hop/path filtering, sort, dedup, histogram, category stats, time series. Readers serialized writers (ingest, `buildDistanceIndex`). Refactor: hold the RLock only long enough to snapshot the `distHops`/`distPaths` slice headers AND build the region match-set (which reads `tx.Observations`, mutated under `s.mu.Lock`). For `region=""` (the hot cold-call path) the lock hold is just the header snapshot — microseconds. Everything else runs on the locally-captured slices outside the lock. Safety: `distHops`/`distPaths` are append-only via re-slice in `buildDistanceIndex` / `updateDistanceIndexForTxs` (both under `s.mu.Lock`). If the backing array reallocates after the snapshot, the snapshot still references the prior array (GC-pinned) at the consistent length captured under the lock. Records are value types — no torn writes. ## Test results `cmd/server/distance_lock_contention_test.go` (8 reader goroutines × 20k synthetic distHops × 200 writer Lock/Unlock cycles): - pre-fix avg writer cycle: **82367µs** (16.5s for 200 cycles) - post-fix avg writer cycle: **1µs** (279µs for 200 cycles) - ~82000× reduction in writer contention; reader result shape unchanged Full `go test ./cmd/server/...` green with `-race`. ## Out of scope (per issue) - Same lock pattern in topology / RF / hash / subpath analytics — file separately if needed. - Per-region cache key sharding. - WebSocket-driven cache invalidation. --------- Co-authored-by: openclaw-bot <bot@openclaw.local> |
||
|
|
2e28aa3e04 |
fix(#1229): source-diversity confidence weighting in neighbor-graph tier-1 resolver (#1235)
RED |
||
|
|
b21badbcbd |
fix(#1225): paginate channel messages at SQL level — 30s → <500ms (#1226)
## Summary Fixes #1225 — channel messages endpoint took ~30s on staging. ## Root cause `(*DB).GetChannelMessages` SELECTed every observation row for the channel (one row per observation, not per transmission), JSON-unmarshalled each row into a Go map, dedupe-folded by `(sender, packetHash)`, then sliced the tail in Go for pagination. On staging `#wardriving`: - `transmissions` rows with `channel_hash='#wardriving' AND payload_type=5`: **5,703** - `observations` joined to those: **274,632** (~48× amplification) - `time curl /api/channels/%23wardriving/messages?limit=50`: **30.04s / 31.41s / 31.48s / 35.33s / 34.05s** (5 calls before I killed the loop) `EXPLAIN QUERY PLAN` showed the index `idx_tx_channel_hash` was being used — the cost was entirely in fetching, unmarshalling, and folding the full observation set per request even for `limit=50`. Hypothesis #1 from the issue (full table scan on `messages/decoded`) is rejected; #2 (missing index) is rejected; the actual cause was **pagination in Go instead of SQL** — request cost was O(observations) not O(limit). ## Fix Move pagination into SQL on the `transmissions` table. Because `transmissions.hash` is `UNIQUE` and the original dedup key was `(sender, hash)`, each transmission collapses to exactly one logical message — paginating on transmissions is semantically equivalent to the prior in-Go dedup + tail slice. New shape: 1. `COUNT(*)` on transmissions for total (uses `idx_tx_channel_hash`). 2. `SELECT id FROM transmissions … ORDER BY first_seen DESC LIMIT ? OFFSET ?` to pick the page of newest transmissions. 3. `SELECT … FROM observations WHERE transmission_id IN (…page ids…)` — typically 50 ids → a few hundred observation rows. 4. Reassemble in pageIDs order, preserving the ASC-by-`first_seen` API contract. Region filtering, observation-count-as-`repeats`, and "first observation wins for hops/snr/observer" semantics are preserved (observations are scanned `ORDER BY o.id ASC`). ## Perf measurements **Before** (staging `#wardriving`, limit=50, 5 samples killed mid-loop): 30.04s, 31.41s, 31.48s, 35.33s, 34.05s. **Synthetic regression test** (`TestGetChannelMessagesPerfLargeChannel`): 3000 tx × 50 obs. - Broken impl: ~4.5s (test fails the 500ms budget — the RED commit). - Fixed impl: well under 500ms (test passes). **After (staging)**: will measure post-deploy and post-comment on issue with numbers. Synthetic scaling: staging is ~2× the test's transmission count, fixed-path cost scales with `limit` (50) + `COUNT(*)` (~5k rows on index) — expect <100ms p99. ## TDD - RED: `697c290d` — perf test asserts <500ms on 3k×50 dataset; fails at ~4.5s. - GREEN: `3f1f82d3` — fix; full suite green, perf test passes. ## Hypotheses status | # | Hypothesis | Verdict | |---|---|---| | 1 | Endpoint slow on prod-sized data | **CONFIRMED** (different mechanism — see root cause) | | 2 | Missing channel_hash index | Rejected (`idx_tx_channel_hash` exists & used) | | 3 | Frontend re-render storm | Not investigated (backend was clearly the bottleneck) | | 4 | Decode in request path | Rejected (decode is at ingest time; JSON unmarshal of cached `decoded_json` is the cost, addressed by reducing row count) | | 5 | WS subscription failure | Rejected | | 6 | Staging artifact | Rejected (reproducible) | ## Out of scope - The in-memory `(*PacketStore).GetChannelMessages` path (used when `s.db == nil`) has the same shape but operates on bounded in-memory data; not touched. If we ever fall back to it in production we'll revisit. --------- Co-authored-by: clawbot <bot@corescope> |
||
|
|
7179afcfde |
feat(#1228): reject geo-implausible neighbor-graph edges at build time (#1230)
Fixes #1228 — geo-implausible neighbor-graph edges are rejected at build time. Red commit: `5a6d9660` — failing tests for 4 cases (reject SF↔Berlin, accept local CA, accept no-GPS endpoint, counter increments). Live CI run (latest commit): https://github.com/Kpa-clawbot/CoreScope/actions?query=branch%3Afix%2Fissue-1228 ## Why The disambiguator's tier-1 affinity graph is built blindly from path co-occurrence. On wide-geo MQTT deployments, a single bad hop disambiguation seeds an edge across geographically impossible distances (e.g. Bay Area ↔ Berlin), which then reinforces the same wrong resolution next time. Self-poisoning spiral. ## What changed - `upsertEdge` now consults a per-graph GPS index. When **both** endpoints have known GPS and their haversine distance exceeds the threshold, the edge is dropped and `NeighborGraph.RejectedEdgesGeoFar` (atomic) is incremented. - Either endpoint missing GPS ⇒ accept (no signal to reject), per acceptance criteria. - Threshold is configurable via `neighborGraph.maxEdgeKm` (default **500 km** — well above any plausible terrestrial LoRa hop, including satellite-assisted). 0 ⇒ use default; negative ⇒ disable the filter. Exposed via `Config.NeighborMaxEdgeKm()`. - New `BuildFromStoreWithOptions` carrying the threshold; `BuildFromStore` and `BuildFromStoreWithLog` are kept as thin wrappers. - Stats are surfaced under `GET /api/analytics/neighbor-graph` as `stats.rejected_edges_geo_far`. - All rejection logs PII-truncate pubkeys to 8 hex chars (public repo discipline). - `config.example.json` updated with the new field + comment. ## Follow-up #1229 (per-region scoped affinity graphs) depends on this landing first. --------- Co-authored-by: corescope-bot <bot@corescope.local> |
||
|
|
eba9e89a72 |
fix(#1203): path-inspector — singleflight + stale-while-revalidate (#1208)
Red commit:
|
||
|
|
11d2026bb1 |
feat(startup): hot startup — load hotStartupHours synchronously, fill retentionHours in background (#1187)
Closes #1183 ## Summary - Adds `packetStore.hotStartupHours` config key (float64, default 0 = disabled). When set, `Load()` loads only that many hours of data synchronously, reducing startup time on large DBs. Background goroutine fills the remaining `retentionHours` window in daily chunks after startup completes. - A background goroutine (`loadBackgroundChunks`) fills the remaining `retentionHours` window in daily chunks after startup completes. Analytics indexes are rebuilt once at the end. - `QueryPackets` and `QueryGroupedPackets` check `oldestLoaded` and fall back to `db.QueryPackets()` for any query whose `Since`/`Until` predates the in-memory window — covering days 8–30 permanently (beyond `retentionHours`) and the background-fill gap during startup. - `/api/perf` gains `hotStartupHours`, `backgroundLoadComplete`, and `backgroundLoadProgress` fields inside `packetStore` so operators can monitor the fill. ### Drive-by fixes - E2E: added `gotoPackets` navigation helper used across packet-related tests - E2E: rewrote stripe assertion to check per-row stripe parity rather than a fragile computed-style comparison - E2E: theme test updated to use `#/home` as the initial route (was `#/`) - `db.go`: removed the RFC3339→unix-timestamp subquery path in `buildTransmissionWhere`; `t.first_seen` is now always compared directly as a string for both RFC3339 and non-RFC3339 inputs ## Configuration ```json "packetStore": { "retentionHours": 168, "hotStartupHours": 24 } ``` `hotStartupHours: 0` (default) preserves existing behavior exactly. Recommended for large DBs to reduce startup time; set to 0 to disable (loads full retentionHours at startup, legacy behavior). ## Test plan - [x] `TestHotStartupConfig_Clamp` — clamping when `hotStartupHours > retentionHours` - [x] `TestHotStartupConfig_ZeroIsDisabled` — zero leaves feature disabled - [x] `TestHotStartup_LoadsOnlyHotWindow` — only hot-window packets in memory after `Load()` - [x] `TestHotStartup_DisabledWhenZero` — all retention packets loaded when disabled - [x] `TestHotStartup_loadChunk_AddsOlderData` — chunk merges correctly, ASC order maintained - [x] `TestHotStartup_BackgroundFillsToRetention` — background goroutine fills to `retentionHours` - [x] `TestHotStartup_ChunkErrorRecovery` — chunk SQL failure logged and skipped, loop terminates - [x] `TestHotStartup_SQLFallback_TriggeredForOldDate` — query before `oldestLoaded` routes to SQL - [x] `TestHotStartup_SQLFallback_NotTriggeredForRecentDate` — recent query stays in-memory - [x] `TestHotStartup_PerfStats` — new fields present in `GetPerfStoreStats()` (backs the perf endpoint) - [x] `TestHotStartup_PerfStoreHTTP` — HTTP-level: GET /api/perf returns `hotStartupHours`, `backgroundLoadComplete`, `backgroundLoadProgress` in `packetStore` 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: openclaw-bot <bot@openclaw.local> Co-authored-by: CoreScope Bot <bot@corescope.local> |
||
|
|
85e97d2f37 |
fix(#1211): bounds-check path length to prevent slice [218:15] panic in MQTT decode (#1214)
**RED commit:** `65d9f57b` (CI run will appear at https://github.com/Kpa-clawbot/CoreScope/actions after PR opens) Fixes #1211 ## Root cause `decodePath()` returns `bytesConsumed = hash_size * hash_count` where both come straight from the wire-supplied `pathByte` (upper 2 bits → `hash_size`, lower 6 bits → `hash_count`). Max claimable: 4 × 63 = 252 bytes. A malformed packet on the wire claimed `pathByte=0xF6` (hash_size=4, hash_count=54 → 216 path bytes) inside a 15-byte buffer. The inner hop-extraction loop in `decodePath` did break early on overflow — but `bytesConsumed` was still returned at face value (216). `DecodePacket` then did `offset += 216` (offset=218) and `payloadBuf := buf[offset:]` panicked with the prod-observed signature: ``` runtime error: slice bounds out of range [218:15] ``` The handler-level `defer/recover` at `cmd/ingestor/main.go:258-263` caught it, but the message was silently dropped with no usable diagnostic. ## Fix Add a `if offset > len(buf)` guard at BOTH decoder sites (same pattern, same panic potential): - `cmd/ingestor/decoder.go` — DecodePacket after decodePath - `cmd/server/decoder.go` — DecodePacket after decodePath Return a descriptive error citing the claimed length and pathByte hex so operators can reproduce. Also: `cmd/ingestor/main.go` decode-error log now includes `topic`, `observer`, and `rawHexLen` so future malformed packets are reproducible without needing to attach a debugger. ## Tests (TDD red → green) Both packages got two new tests: - **`TestDecodePacketBoundsFromWire_Issue1211`** — feeds the exact wire shape from the prod log (`pathByte=0xF6` inside a 15-byte buf). Asserts `DecodePacket` does NOT panic and returns an error. - **`TestDecodePacketFuzzTruncated_Issue1211`** — sweeps every `(header, pathByte)` combination with tails 0..19 bytes (≈1.3M inputs). Asserts zero panics. ### Red commit proof On commit `65d9f57b` (RED), both tests fail with the panic: ``` === RUN TestDecodePacketBoundsFromWire_Issue1211 decoder_test.go:1996: DecodePacket panicked on malformed input: runtime error: slice bounds out of range [218:15] --- FAIL: TestDecodePacketBoundsFromWire_Issue1211 (0.00s) === RUN TestDecodePacketFuzzTruncated_Issue1211 decoder_test.go:2010: DecodePacket panicked during fuzz: runtime error: slice bounds out of range [3:2] --- FAIL: TestDecodePacketFuzzTruncated_Issue1211 (0.01s) ``` On commit `7a6ae52c` (GREEN), full suites pass: - `cmd/ingestor`: `ok 53.988s` - `cmd/server`: `ok 29.456s` ## Acceptance criteria - [x] Identify the slice op producing `[218:15]` — `payloadBuf := buf[offset:]` in `DecodePacket` (decoder.go), where `offset` had been advanced by an unchecked `bytesConsumed` from `decodePath()`. - [x] Bounds check added at the identified site(s) — both ingestor and server decoders. - [x] Test with crafted payload (length-field > remaining buffer) — `TestDecodePacketBoundsFromWire_Issue1211`. - [x] Log topic, observer ID, payload byte length on drop — updated `MQTT [%s] decode error` log line. - [x] Existing tests stay green — confirmed both packages. ## Out of scope Reconnect-after-disconnect (#1212) — handled by a separate subagent. This PR touches NO reconnect logic. --------- Co-authored-by: corescope-bot <bot@corescope.local> Co-authored-by: openclaw-bot <bot@openclaw.local> Co-authored-by: corescope-bot <bot@corescope> |
||
|
|
dbb013a6bf |
test(#1201): regression coverage for hop disambiguator tier-1 + end-to-end top-hops fixture (#1202)
Mutation test confirmed: reverting cmd/server/store.go:2975 (`setContext(buildHopContextPubkeys(tx, pm))` → `setContext(nil)`) in `buildDistanceIndex` produces failing assertion in `TestTopHopsRespectsContextAcrossAllCallSites`: top-hops ranking flips to `72dddd→8acccc@13.0km` (Berlin↔Berlin misresolution), CA↔CA pair absent. After reverting the mutation, the test passes again. Fixes #1201 ## Summary Pure test addition. No production code changed. Adds regression coverage for the hop disambiguator's tier-1 (neighbor affinity) path and an end-to-end fixture that catches revert-to-nil-context regressions across all 9 call sites of `pm.resolveWithContext`. ## Sub-tasks (all 4 landed) 1. **Tier-1 explicit** — `hop_disambig_tier1_test.go`: - `Tier1_StrongAffinityPicksX` (strong-X edge wins) - `Tier1_StrongAffinityPicksY` (reverse weights — proves score is read) - `Tier1_AmbiguousEdgeSkipsToTier2` (`Ambiguous=true` → skip) 2. **Tier ordering** — `Tier1_BeatsTier2WhenBothSignal` (tier 1 wins when both signal) 3. **Tier-1 fallback** — - `Tier1_EmptyGraphFallsThrough` (graph has no edges for context) - `Tier1_NilGraphFallsThrough` (graph is nil) - `Tier1_ScoresTooCloseFallsThrough` (best < `affinityConfidenceRatio` × runner-up) 4. **End-to-end fixture** — `hop_disambig_e2e_test.go`: - 9 nodes with intentional prefix collisions across SLO/LA/NYC/Berlin (prefix `72`) and SF/CA/Berlin (prefix `8a`); Berlin candidates have `obsCount=200` so they'd win tier-3 absent context. - 50 transmissions path `["72","8a"]`, sender + observer in CA. - Affinity graph seeded with strong `sender↔72aa` and `sender↔8aaa` edges. - Asserts: CA↔CA hop present, no Berlin pubkeys in `distHops`, max distance < 300 km cap. ## TDD exemption Net-new regression-sentinel tests for behavior already correct on master post-#1198. Each test passed on first run (no production bug surfaced). The mutation test on sub-task 4 is the gating proof: forcing `setContext(nil)` at `store.go:2975` makes the test fail with the exact misresolution class the issue describes (Berlin↔Berlin leaks into top-hops). ## Acceptance criteria - [x] Tier-1 affinity test added with 3 cases - [x] Tier-ordering test added - [x] Tier-1 fallback tests added (nil / empty / scores-too-close) - [x] End-to-end fixture added with multi-candidate-prefix nodes - [x] End-to-end fixture fails if any call site reverts to `nil` context (mutation-verified) - [x] Test files live in `cmd/server/` alongside `prefix_map_role_test.go` --------- Co-authored-by: openclaw-bot <bot@openclaw.local> Co-authored-by: corescope-bot <bot@corescope.local> |
||
|
|
2beeb2b324 |
fix(#1199): 6 deferred quality items from PR #1198 r2 review (#1200)
Red commit: 75563ce (CI run: pending — pushed at branch open) Follows up PR #1198 round-2 adversarial review (issue #1199). Six robustness / perf-hot-path / maintenance items, one commit per logical change. Stacked on top of `fix/issue-1197` (PR #1198) — base must move to `master` after #1198 merges. | # | Item | Commit(s) | Discipline | |---|---|---|---| | 1 | Brittle static-grep regex → go/parser AST walk in `resolve_context_callsites_test.go` | 33d80b6 (RED) → 450236d (GREEN) | red→green | | 2 | `computeAnalyticsTopology` double-pass filter → materialize `filteredTxs` once | 00005f6 | refactor | | 3 | `BenchmarkBuildAggregateHopContextPubkeys` baseline + tiny smoke test | b520048 | net-new bench/test | | 4 | `hopResolverPerTx` CONCURRENCY doc — single-goroutine invariant | 155ff07 | doc-only | | 5 | `schemaDegradationLogged` package-level `sync.Map` → PacketStore field | 75563ce (RED) → 7dbf193 (GREEN) | red→green | | 6 | `buildHopContextPubkeys` `out` slice cap hint (`make([]string, 0, 16)`) | 2040962 | refactor | Items 2 & 6 are pure refactors — no test files modified for items 2 & 6 (per AGENTS.md exemption rule). Existing tests stay green and unaltered. Item 4 is doc-only (CONCURRENCY: comment); no behavior change. Item 3 adds a bench + a smoke assertion for the aggregate helper that previously had no coverage. Local arm64 baseline: ~72ms/op, 130k allocs at 5k txs. Items 1 & 5 follow red→green: 33d80b6 demonstrates the regex blindspot via a synthetic AST-detectable input the regex misses; 75563ce demonstrates per-store log dedup leaks across instances. Both flips visible in branch history. Full `go test ./cmd/server/...` runs clean post-amend. Fixes #1199 --------- Co-authored-by: openclaw-bot <bot@openclaw.local> |
||
|
|
353c5264ad |
fix(#1197): plumb hop-context + observation-count tiebreak to disambiguator (#1198)
Red commit:
|
||
|
|
f4cf2acbc0 |
perf: cancelled writes + ingestor I/O + threshold tests (#1120 follow-up) (#1167)
Red commit:
|
||
|
|
fb744d895f |
fix(#1143): structural pubkey attribution via from_pubkey column (#1152)
Fixes #1143. ## Summary Replaces the structurally unsound `decoded_json LIKE '%pubkey%'` (and `OR LIKE '%name%'`) attribution path with an exact-match lookup on a dedicated, indexed `transmissions.from_pubkey` column. This closes both holes documented in #1143: - **Hole 1** — same-name false positives via `OR LIKE '%name%'` - **Hole 2a** — adversarial spoofing: a malicious node names itself with another node's pubkey and gets attributed to the victim - **Hole 2b** — accidental false positive when any free-text field (path elements, channel names, message bodies) contains a 64-char hex substring matching a real pubkey - **Perf** — query now uses an index instead of a full-table scan against `LIKE '%substring%'` ## TDD Two-commit history shows red-then-green: | Commit | Status | Purpose | |---|---|---| | `7f0f08e` | RED — tests assertion-fail on master behaviour | Adversarial fixtures + spec | | `59327db` | GREEN — schema + ingestor + server + migration | Implementation | The red commit's test schema includes the new column so the file compiles, but the production code still uses LIKE — the assertions fail because the malicious / same-name / free-text rows are returned. The green commit changes the query plus adds the migration/ingest path. ## Changes ### Schema - new column `transmissions.from_pubkey TEXT` - new index `idx_transmissions_from_pubkey` ### Ingestor (`cmd/ingestor/`) - `PacketData.FromPubkey` populated from decoded ADVERT `pubKey` at write time. Cheap — already parsing `decoded_json`. Non-ADVERTs stay NULL. - `stmtInsertTransmission` writes the column. - Migration `from_pubkey_v1` ALTERs legacy DBs to add the column + index. - Bonus: rewrote the recipe in the gated one-shot `advert_count_unique_v1` migration to use `from_pubkey` (already marked done on existing DBs; kept correct for fresh installs). ### Server (`cmd/server/`) - `ensureFromPubkeyColumn` mirrors the ingestor migration so the server can boot against a DB the ingestor has never touched (e2e fixture, fresh installs). - `backfillFromPubkeyAsync` runs **after** HTTP starts. Scans `WHERE from_pubkey IS NULL AND payload_type = 4` in 5000-row chunks with a 100ms yield between chunks. Cannot block boot even on prod-sized DBs (100K+ transmissions). Queries handle NULL gracefully (return empty for that pubkey, same as today's unknown-pubkey path). - All in-scope LIKE call sites switched to exact match: | Site | Before | After | |---|---|---| | `buildPacketWhere` (was db.go:582) | `decoded_json LIKE '%pubkey%'` | `from_pubkey = ?` | | `buildTransmissionWhere` (was db.go:626) | `t.decoded_json LIKE '%pubkey%'` | `t.from_pubkey = ?` | | `GetRecentTransmissionsForNode` (was db.go:910) | `LIKE '%pubkey%' OR LIKE '%name%'` | `t.from_pubkey = ?` | | `QueryMultiNodePackets` (was db.go:1785) | `decoded_json LIKE '%pubkey%' OR ...` | `t.from_pubkey IN (?, ?, ...)` | | `advert_count_unique_v1` (was ingestor/db.go:257) | `decoded_json LIKE '%' \|\| nodes.public_key \|\| '%'` | `t.from_pubkey = nodes.public_key` | `GetRecentTransmissionsForNode` signature simplifies: the `name` parameter is gone (it was only ever used for the legacy `OR LIKE '%name%'` fallback). Sole caller in `routes.go:1243` updated. ### Tests - `cmd/server/from_pubkey_attribution_test.go` — adversarial fixtures + Hole 1/2a/2b/QueryMultiNodePackets exact-match assertions, EXPLAIN QUERY PLAN index check, migration backfill correctness. - `cmd/ingestor/from_pubkey_test.go` — write-time correctness (BuildPacketData populates FromPubkey for ADVERT only; InsertTransmission persists it; non-ADVERTs stay NULL). - Existing test schemas (server v2, server v3, coverage) get the new column **plus a SQLite trigger** that auto-populates `from_pubkey` from `decoded_json` on ADVERT inserts. This means existing fixtures (which only seed `decoded_json`) keep attributing correctly without per-test edits. - `seedTestData`'s ADVERTs explicitly set `from_pubkey`. ## Performance — index is used ``` $ EXPLAIN QUERY PLAN SELECT id FROM transmissions WHERE from_pubkey = ? SEARCH transmissions USING INDEX idx_transmissions_from_pubkey (from_pubkey=?) ``` Asserted in `TestFromPubkeyIndexUsed`. ## Migration approach - **Sync at boot**: `ALTER TABLE transmissions ADD COLUMN from_pubkey TEXT` is a metadata-only operation in SQLite — microseconds regardless of table size. `CREATE INDEX IF NOT EXISTS idx_transmissions_from_pubkey` is **not** metadata-only: it scans the table once. Empirically a few hundred ms on a 100K-row table; expect a few seconds on a 10M-row table (one-time cost, blocking boot during that window). Subsequent boots no-op via `IF NOT EXISTS`. If this boot delay becomes an operational concern at prod scale we can defer the `CREATE INDEX` to a goroutine — for now a few-second one-time delay is acceptable. - **Async**: row-level backfill of legacy NULL ADVERTs (chunked 5000 / 100ms yield). On a 100K-ADVERT prod DB, this completes in seconds in the background; HTTP is fully available throughout. - **Safety**: queries handle NULL gracefully — a node whose ADVERTs haven't backfilled yet returns empty, identical to today's behaviour for unknown pubkeys. No half-state regression. ## Out of scope (intentionally) The free-text `LIKE` paths the issue explicitly leaves alone (e.g. user-typed packet search) are untouched. Only the pubkey-attribution sites get the column treatment. ## Cycle-3 review fixes | Finding | Status | Commit | |---|---|---| | **M1c** — async-contract test was tautological (test's own `go`, not production's) | Fixed | `23ace71` (red) → `a05b50c` (green) | | **m1c** — package-global atomic resets unsafe under `t.Parallel()` | Fixed (`// DO NOT t.Parallel` comment + `Reset()` helper) | rolled into `23ace71` / `241ec69` | | **m2c** — `/api/healthz` read 3 atomics non-atomically (torn snapshot) | Fixed (single RWMutex-guarded snapshot + race test) | `241ec69` | | **n3c.m1** — vestigial OR-scaffolding in `QueryMultiNodePackets` | Fixed (cleanup) | `5a53ceb` | | **n3c.m2** — verify PR body language about `ALTER` vs `CREATE INDEX` | Verified accurate (already corrected in cycle 2) | (no change) | | **n3c.m3** — `json.Unmarshal` per row in backfill → could use SQL `json_extract` | **Deferred as known followup** — pure perf optimization (current per-row Unmarshal is correct, just slower); SQL rewrite would unwind the chunked-yield architecture and is non-trivial. Acceptable for one-time backfill at boot on legacy DBs. | ### M1c implementation detail `startFromPubkeyBackfill(dbPath, chunkSize, yieldDuration)` is now the single production entry point used by `main.go`. It internally does `go backfillFromPubkeyAsync(...)`. The test calls `startFromPubkeyBackfill` (no `go` prefix) and asserts the dispatch returns within 50ms — so if anyone removes the `go` keyword inside the wrapper, the test fails. **Manually verified**: removing the `go` keyword causes `TestBackfillFromPubkey_DoesNotBlockBoot` to fail with "backfill dispatch took ~1s (>50ms): not async — would block boot." ### m2c implementation detail `fromPubkeyBackfillTotal/Processed/Done` are now plain `int64`/`bool` package globals guarded by a single `sync.RWMutex`. `fromPubkeyBackfillSnapshot()` returns all three under one RLock. `TestHealthzFromPubkeyBackfillConsistentSnapshot` races a writer (lock-step total/processed updates with periodic done flips) against 8 readers hammering `/api/healthz`, asserting `processed<=total` and `(done => processed==total)` on every response. Verified the test catches torn reads (manually injected a 3-RLock implementation; test failed within milliseconds with "processed>total" and "done=true but processed!=total" errors). --------- Co-authored-by: openclaw-bot <bot@openclaw.local> Co-authored-by: openclaw-bot <bot@openclaw.dev> |
||
|
|
74dffa2fb7 |
feat(perf): per-component disk I/O + write source metrics on Perf page (#1120) (#1123)
## Summary Implements per-component disk I/O + write source metrics on the Perf page so operators can self-diagnose write-volume anomalies (cf. the BackfillPathJSON loop debugged in #1119) without SSHing in to run iotop/fatrace. Partial fix for #1120 ## What's done (4/6 ACs) - ✅ `/api/perf/io` — server-process `/proc/self/io` delta rates (read/write bytes per sec, syscalls) - ✅ `/api/perf/sqlite` — WAL size, page count, page size, cache hit rate - ✅ `/api/perf/write-sources` — per-component counters from ingestor (tx/obs/upserts/backfill_*) - ✅ Frontend Perf page — three new sections with anomaly thresholds + per-second rate columns ## What's NOT done (deferred to follow-up) - ❌ `cancelledWriteBytesPerSec` field — issue #1120 lists this under server-process I/O ("writes the kernel discarded — interesting signal"); not exposed in this PR - ❌ Ingestor `/proc/<pid>/io` — issue #1120 says "Both ingestor and server"; only server-process I/O lands here. Adding ingestor I/O requires either a unix socket back to the server, or surfacing the ingestor pid through the stats file. Doable without changing the existing API shape. - ❌ Adaptive baselining — anomaly thresholds remain static (10×, 100 MB, 90%); steady-state baselining can come once we have enough deployed Perf-page telemetry Per AGENTS.md rule 34, this PR uses "Partial fix for #1120" rather than "Fixes #1120" so the issue stays open until the remaining ACs land. ## Backend **Server (`cmd/server/perf_io.go`)** - `GET /api/perf/io` — reads `/proc/self/io` and returns delta-rate `{readBytesPerSec, writeBytesPerSec, syscallsRead, syscallsWrite}` since last call (in-memory tracker, no allocation per sample). - `GET /api/perf/sqlite` — returns `{walSize, walSizeMB, pageCount, pageSize, cacheSize, cacheHitRate}`. `cacheHitRate` is proxied from the in-process row cache (closest available signal under the modernc sqlite driver). - `GET /api/perf/write-sources` — reads the ingestor's stats JSON file and returns a flat `{sources: {...}, sampleAt}` payload. **Ingestor (`cmd/ingestor/`)** - `DBStats` gains `WALCommits atomic.Int64` (incremented on every successful `tx.Commit()` and on every auto-commit `InsertTransmission` write) and `BackfillUpdates sync.Map` keyed by backfill name with `IncBackfill(name)` / `SnapshotBackfills()` helpers. - `BackfillPathJSONAsync` now increments `BackfillUpdates["path_json"]` per row write — the BackfillPathJSON-style infinite loop becomes immediately visible at `backfill_path_json` in the Write Sources table. - New `StartStatsFileWriter` publishes a JSON snapshot to `/tmp/corescope-ingestor-stats.json` (override via `CORESCOPE_INGESTOR_STATS`) every second using atomic tmp+rename. The tmp file is opened with `O_CREATE|O_WRONLY|O_TRUNC|O_NOFOLLOW` mode `0o600` so a pre-planted symlink in a world-writable `/tmp` cannot redirect the write to an arbitrary file. ## Frontend (`public/perf.js`) Three new sections on the Perf page, all auto-refreshed via the existing 5s interval: - **Disk I/O (server process)** — read/write rates (formatted B/KB/MB-per-sec) + syscall counts. Write rate >10 MB/s flags ⚠️. - **Write Sources** — sorted table of per-component counters with a per-second rate column derived from snapshot deltas. Backfill rows show ⚠️ only when `tx_inserted >= 100` (meaningful baseline) AND the backfill's per-second rate exceeds 10× the live tx rate. Avoids the startup-spurious-alarm where cumulative-vs-cumulative was a tautology. - **SQLite (WAL + Cache Hit)** — WAL size (⚠️ when >100 MB), page count, page size, cache hit rate (⚠️ when <90%). ## Tests - **Backend** (`cmd/server/perf_io_test.go`) — `TestPerfIOEndpoint_ReturnsValidJSON`, `TestPerfSqliteEndpoint_ReturnsValidJSON`, `TestPerfWriteSourcesEndpoint_ReturnsSources` exercise the three new endpoints. Skips the `/proc/self/io` non-zero-rate assertion when `/proc` is unavailable. - **Frontend** (`test-perf-disk-io-1120.js`) — vm-sandbox runs `perf.js` with stubbed `fetch`, asserts the three new sections render with their headings + values. E2E assertion added: test-perf-disk-io-1120.js:91 ## TDD 1. Red commit (`21abd22`) — added the three handlers as no-op stubs returning empty values; tests fail on assertion mismatches (non-zero rate, `pageSize > 0`, headings present). 2. Green commit (`d8da54c`) — fills in the real `/proc/self/io` parser, PRAGMA queries, ingestor stats writer, and Perf page rendering. --------- Co-authored-by: corescope-bot <bot@corescope.local> Co-authored-by: Kpa-clawbot <kpa-clawbot@users.noreply.github.com> |